words-n-numbers

Tokenizing strings of text. Extracts arrays of words and, optionally, numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. Use it when [a-z] regular expressions are not enough. Part of document processing for search-index and nowsearch.xyz.

Usage: no npm install needed!

<script type="module">
  import wordsNNumbers from 'https://cdn.skypack.dev/words-n-numbers';
</script>

Words'n'numbers

Inspired by extractwords

License: MIT

Initiating

Node.js

const wnn = require('words-n-numbers')
// wnn available

Browser

<script src="wnn.js"></script>

<script>
  //wnn available
</script>

Use

The default regex should catch Unicode letters from every language.
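
To illustrate why a Unicode-aware pattern matters, here is a minimal sketch that does not use words-n-numbers itself. It compares an ASCII-only pattern with a Unicode property escape (`\p{L}`). The pattern and the sample string are assumptions chosen for illustration, and the regex the library actually ships may differ.

```javascript
// Hypothetical comparison, independent of words-n-numbers
const sampleText = 'Blåbærsyltetøy from 大阪'

// ASCII-only: every non-ASCII letter splits or drops a word
const asciiOnly = sampleText.match(/[a-z]+/gi)
// [ 'Bl', 'b', 'rsyltet', 'y', 'from' ]

// Unicode-aware: \p{L} matches a letter in any script (needs the u flag)
const unicodeAware = sampleText.match(/\p{L}+/gu)
// [ 'Blåbærsyltetøy', 'from', '大阪' ]
```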

Only words

let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords)
// returns ['A', 'dollars', 'baby']

Only words, converted to lowercase

let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { toLowercase: true })
// returns ['a', 'dollars', 'baby']

Predefined regex for words and numbers, converted to lowercase

let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbers, toLowercase: true })
// returns ['a', '1000000', 'dollars', 'baby']

Predefined regex for words and emojis, converted to lowercase

let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsEmojis, toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '👌😄', '😢' ]

Predefined regex for numbers and emojis

let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.numbersEmojis, toLowercase: true })
// returns [ '2000', '👌😄', '😢' ]

Predefined regex for words, numbers and emojis, converted to lowercase

let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbersEmojis, toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌😄', '😢' ]

Predefined regex for #tags

let stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.tags, toLowercase: true })
// returns [ '#49ticket', '#大阪' ]
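
Note that `two#tickets` is not returned: a `#` glued to the end of a word does not count as a tag. One way such behaviour could be expressed, shown here only as an assumption and not as the pattern behind `wnn.tags`, is a negative lookbehind that rejects a `#` preceded by a letter or digit.

```javascript
// Hypothetical tag pattern for illustration; wnn.tags may be defined differently
const tagSketch = /(?<![\p{L}\p{N}])#[\p{L}\p{N}]+/gu

const tagSample = 'A #49ticket to #大阪 or two#tickets'
const foundTags = tagSample.match(tagSketch)
// [ '#49ticket', '#大阪' ]
```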

Predefined regex for @usernames

let stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, @alice and @美林 ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.usernames, toLowercase: true })
// returns [ '@alice', '@美林' ]

Predefined regex for email addresses

let stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, alice.allison@alice123.com, some-name.nameson.nameson@domain.org and @美林 ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.email, toLowercase: true })
// returns [ 'bob@bob.com', 'alice.allison@alice123.com', 'some-name.nameson.nameson@domain.org' ]

Custom regex

let stringOfWords = 'This happens at 5 o\'clock !!!'
wnn.extract(stringOfWords, { regex: '[a-z\'0-9]+' })
// returns ['This', 'happens', 'at', '5', 'o\'clock']

API

Extract function

Returns an array of the tokens matched by the chosen regex: words by default, and optionally numbers, emojis, tags, usernames or email addresses.

wnn.extract(stringOfText, <options-object>)

Options object

{
  regex: '[custom or predefined regex]',  // defaults to wnn.words
  toLowercase: [true / false]             // defaults to false
}
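
The way the two options and their defaults interact can be sketched with a small stand-in function. `extractSketch` and `WORDS_SKETCH` are hypothetical names, and the default pattern is an assumption. This is not the library's source, and returning an empty array when nothing matches is a choice made for the sketch only.

```javascript
// Hypothetical stand-in for illustration only; not the words-n-numbers source
const WORDS_SKETCH = /\p{L}+/gu // assumed default: letters in any script

function extractSketch (text, options = {}) {
  // Fill in a default for every option the caller left out
  const { regex = WORDS_SKETCH, toLowercase = false } = options
  const input = toLowercase ? text.toLowerCase() : text
  // String.prototype.match returns null when nothing matches
  return input.match(regex) || []
}

const plainWords = extractSketch('A 1000000 dollars baby!')
// [ 'A', 'dollars', 'baby' ]
const lowerWords = extractSketch('A 1000000 dollars baby!', { toLowercase: true })
// [ 'a', 'dollars', 'baby' ]
const noWords = extractSketch('!!!')
// []
```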

Predefined regexes

wnn.words              // only words, any language <-- default
wnn.numbers            // only numbers
wnn.emojis             // only emojis
wnn.wordsNumbers       // words (any language) and numbers
wnn.wordsEmojis        // words (any language) and emojis
wnn.numbersEmojis      // numbers and emojis
wnn.wordsNumbersEmojis // words (any language), numbers and emojis
wnn.tags               // #tags (any language)
wnn.usernames          // @usernames (any language)
wnn.email              // email addresses. Most valid addresses,
                       //   but not to be used as a validator
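
Combined patterns such as words plus numbers can be thought of as an alternation of simpler ones. The sketch below builds one from two assumed pattern sources. The names and the patterns are hypothetical, and the library's own definitions may differ.

```javascript
// Hypothetical pattern sources for illustration; not taken from words-n-numbers
const wordsSource = '\\p{L}+' // one or more letters, any script
const numbersSource = '\\p{N}+' // one or more numeric characters

// Join the sources with | so that either kind of token matches
const wordsNumbersSketch = new RegExp(`${wordsSource}|${numbersSource}`, 'gu')

const combinedTokens = 'A 1000000 dollars baby!'.match(wordsNumbersSketch)
// [ 'A', '1000000', 'dollars', 'baby' ]
```

With this alternation, a mixed token such as `abc123` is split into `abc` and `123`. Use a single character class (`[\p{L}\p{N}]+`) if such tokens should stay together.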

Languages supported

Supports most languages supported by stopword, and others too. Some languages, such as Japanese and Simplified Chinese, need to be tokenized. Tokenizers may be added at a later stage.
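
Languages written without spaces between words cannot be split by a letter-matching pattern alone. One possible workaround, which is an assumption and not something words-n-numbers provides, is the standard `Intl.Segmenter` API available in Node.js 16+ and current browsers. The exact segments depend on the ICU data bundled with the runtime, so no expected output is shown.

```javascript
// Sketch using the built-in Intl.Segmenter; not part of words-n-numbers
const japaneseText = '大阪に行きます'

const segmenter = new Intl.Segmenter('ja', { granularity: 'word' })

const segments = Array.from(segmenter.segment(japaneseText))
  .filter(part => part.isWordLike) // drop punctuation and whitespace
  .map(part => part.segment)

console.log(segments)
```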

PRs welcome

PRs and issues are more than welcome =)