happynodetokenizer

A simple, Twitter-aware tokenizer.

Usage no npm install needed!

<script type="module">
  import happynodetokenizer from 'https://cdn.skypack.dev/happynodetokenizer';
</script>

README

😄 HappyNodeTokenizer

A basic Twitter aware tokenizer.

A Javascript port of HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.

Install

  npm install --save happynodetokenizer

UMD, IIFE, CJS and ESM builds are available in the ./dist directory.

Usage

HappyNodeTokenizer exposes a function called tokenize() which takes up to 3 arguments: the input string to tokenize, an optional 'options' object (see below), and an optional callback function.

Example

import { tokenize } from 'happynodetokenizer'; // const tokenizer = require('happynodetokenizer'); can also be used
const text = 'A big long string of text...';
const opts = {
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'tag': false,
};
const tokens = tokenize(text, opts);
console.log(tokens)

Output Examples

Default (opts.tag = false)

Input = "RT @ #happyfuncoding: this is a typical Twitter tweet :-)"

Output =

['rt', '@', '#happyfuncoding', ':', 'this', 'is', 'a', 'typical', 'twitter', 'tweet', ':-)']

opts.tag = true

Input = "RT @ #happyfuncoding: this is a typical Twitter tweet :-)"

Output =

[
  { value: 'rt', tag: 'word' },
  { value: '@', tag: 'punct' },
  { value: '#happyfuncoding', tag: 'hashtag' },
  { value: ':', tag: 'punct' },
  { value: 'this', tag: 'word' },
  { value: 'is', tag: 'word' },
  { value: 'a', tag: 'word' },
  { value: 'typical', tag: 'word' },
  { value: 'twitter', tag: 'word' },
  { value: 'tweet', tag: 'word' },
  { value: ':-)', tag: 'emoticon' }
]

See tags below for more detail.

The Options Object

The options object and its properties are optional. The defaults are:

{
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'tag': false,
};

mode

string - valid options: stanford (default), or dlatk

stanford mode uses the original HappyFunTokenizer pattern. See Github.

dlatk mode uses the modified HappierFunTokenizing pattern. See Github.

normalize

boolean - valid options: "NFC" | "NFD" | "NFKC" | "NFKD" (default = undefined)

Normalize strings. E.g. when set, mañana becomes manana.

preserveCase

boolean - valid options: true, or false (default)

Preserves the case of the input string. Does not affect emoticons.

tag

boolean - valid options: true, or false (default)

Return an array of tagged token objects instead of just an array of tokens

  • true = return an array of token objects: [ {value: 'token', tag: 'word' }, ... ]
  • false = return an array of tokens: [ 'token', 'another', 'word', ... ]

Tags

When opts.tag === true, HappyNodeTokenizer will output an array of token objects. Each token object has two properties: value and tag. The value is the token itself, the tag is a descriptor based on one of the following depending on which opt.mode you are using:

Tag Stanford DLATK Example
phone :heavy_check_mark: :heavy_check_mark: +1 (800) 123-4567
url :x: :heavy_check_mark: http://www.youtube.com
url_scheme :x: :heavy_check_mark: http://
url_authority :x: :heavy_check_mark: [0-3]
url_path_query :x: :heavy_check_mark: /index.html?s=search
htmltag :x: :heavy_check_mark: <em class='grumpy'>
emoticon :heavy_check_mark: :heavy_check_mark: >:(
username :heavy_check_mark: :heavy_check_mark: @phughesmcr
hashtag :heavy_check_mark: :heavy_check_mark: #tokenizing
punct :heavy_check_mark: :heavy_check_mark: ,
word :heavy_check_mark: :heavy_check_mark: hello
<UNK> :heavy_check_mark: :heavy_check_mark: (anything left unmatched)

Testing

To compare the results of HappyNodeTokenizer against HappyFunTokenizer and HappierFunTokenizing, run:

npm run test

The goal of this project is to provide a Node.js port of HappyFunTokenizer and HappierFunTokenizing. Therefore, any pull requests with test failures will not be accepted.

License

(C) 2017-21 P. Hughes. All rights reserved.

Shared under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.