README
Words'n'numbers
Tokenizing strings of text. Extracting arrays of words and, optionally, numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. When you need more than just [a-z] regular expressions. Part of document processing for search-index and nowsearch.xyz.
Inspired by extractwords
Initiating
Node.js
const wnn = require('words-n-numbers')
// wnn available
Browser
<script src="wnn.js"></script>
<script>
//wnn available
</script>
Use
The default regex should catch Unicode characters from every language.
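To see why a plain [a-z] pattern falls short, here is a minimal sketch using standard JavaScript Unicode property escapes. It does not depend on this library, and the `\p{L}+` pattern is an assumption chosen for illustration, not necessarily the regex behind `wnn.words`.

```javascript
// Illustrative only: an ASCII-only pattern versus a Unicode-aware one.
const text = 'Smørbrød costs 50 kroner in Tromsø'

// ASCII letters only: words are cut at every non-ASCII letter.
const asciiWords = text.match(/[a-z]+/gi)

// \p{L} matches any Unicode letter (requires the `u` flag).
const unicodeWords = text.match(/\p{L}+/gu)

console.log(asciiWords)   // [ 'Sm', 'rbr', 'd', 'costs', 'kroner', 'in', 'Troms' ]
console.log(unicodeWords) // [ 'Smørbrød', 'costs', 'kroner', 'in', 'Tromsø' ]
```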
Only words
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords)
// returns ['A', 'dollars', 'baby']
Only words, converted to lowercase
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { toLowercase: true })
// returns ['a', 'dollars', 'baby']
Predefined regex for words and numbers, converted to lowercase
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbers, toLowercase: true })
// returns ['a', '1000000', 'dollars', 'baby']
Predefined regex for words and emojis, converted to lowercase
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsEmojis, toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '👌😄', '😢' ]
Predefined regex for numbers and emojis
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.numbersEmojis, toLowercase: true })
// returns [ '2000', '👌😄', '😢' ]
Predefined regex for words, numbers and emojis, converted to lowercase
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbersEmojis, toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌😄', '😢' ]
#tags
Predefined regex for #tags
let stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.tags, toLowercase: true })
// returns [ '#49ticket', '#大阪' ]
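Note that `two#tickets` is not returned: a tag only counts when the `#` starts a token. One way to get that behaviour with a plain regular expression is a negative lookbehind. The pattern below is a hypothetical stand-in written for this example, not necessarily what `wnn.tags` contains.

```javascript
// Illustrative pattern (an assumption), not the library's own definition.
// (?<![\p{L}\p{N}]) is a negative lookbehind: the # must not follow a letter or digit.
const tagPattern = /(?<![\p{L}\p{N}])#[\p{L}\p{N}]+/gu

const text = 'A #49ticket to #大阪 or two#tickets'
const tags = text.match(tagPattern)

console.log(tags) // [ '#49ticket', '#大阪' ]
```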
@usernames
Predefined regex for @usernames
let stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, @alice and @美林 ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.usernames, toLowercase: true })
// returns [ '@alice', '@美林' ]
Predefined regex for email addresses
let stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, alice.allison@alice123.com, some-name.nameson.nameson@domain.org and @美林 ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.email, toLowercase: true })
// returns [ 'bob@bob.com', 'alice.allison@alice123.com', 'some-name.nameson.nameson@domain.org' ]
Custom regex
let stringOfWords = 'This happens at 5 o\'clock !!!'
wnn.extract(stringOfWords, { regex: '[a-z\'0-9]+' })
// returns ['This', 'happens', 'at', '5', 'o\'clock']
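The result above keeps `This` even though the pattern only lists `a-z`, which suggests the custom regex is applied case-insensitively. A minimal sketch of the same match with the standard `RegExp` constructor, assuming the `g` and `i` flags (the flags the library actually uses are not stated here):

```javascript
// Sketch only: the 'gi' flags are an assumption inferred from the output above.
const text = 'This happens at 5 o\'clock !!!'
const pattern = new RegExp('[a-z\'0-9]+', 'gi')

const tokens = text.match(pattern)
console.log(tokens) // [ 'This', 'happens', 'at', '5', "o'clock" ]
```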
API
Extract function
Returns an array of words and optionally numbers.
wnn.extract(stringOfText, <options-object>)
Options object
{
regex: '[custom or predefined regex]', // defaults to wnn.words
toLowercase: [true / false] // defaults to false
}
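To make the defaults concrete, here is a rough sketch of how a function with this signature could apply them. `extractSketch` and `DEFAULT_WORDS` are hypothetical names, and the code is not the library's source: it only mirrors the documented options.

```javascript
// Hypothetical re-implementation for illustration; not the library's code.
const DEFAULT_WORDS = /\p{L}+/gu // stand-in for wnn.words (assumption)

function extractSketch (text, options = {}) {
  // Fall back to the documented defaults when an option is missing.
  const { regex = DEFAULT_WORDS, toLowercase = false } = options
  // Accept either a RegExp object or a pattern string.
  const pattern = typeof regex === 'string' ? new RegExp(regex, 'giu') : regex
  // String.prototype.match returns null when nothing matches; use [] here instead.
  const tokens = text.match(pattern) || []
  return toLowercase ? tokens.map(token => token.toLowerCase()) : tokens
}

console.log(extractSketch('A 1000000 dollars baby!'))
// [ 'A', 'dollars', 'baby' ]
console.log(extractSketch('A 1000000 dollars baby!', { toLowercase: true }))
// [ 'a', 'dollars', 'baby' ]
```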
Predefined regexes
wnn.words // only words, any language <-- default
wnn.numbers // only numbers
wnn.emojis // only emojis
wnn.wordsNumbers // words (any language) and numbers
wnn.wordsEmojis // words (any language) and emojis
wnn.numbersEmojis // numbers and emojis
wnn.wordsNumbersEmojis // words (any language), numbers and emojis
wnn.tags // #tags (any language)
wnn.usernames // @usernames (any language)
wnn.email // email addresses. Most valid addresses,
// but not to be used as a validator
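The combined names read like unions of the simpler ones. As a sketch of that idea, pattern sources can be joined with `|` to build a combined regex. The sources below are assumptions picked for the example, and `combine` is a hypothetical helper, not part of the library.

```javascript
// Stand-in pattern sources (assumptions), not the library's real definitions.
const words = '\\p{L}+' // runs of Unicode letters
const numbers = '\\p{N}+' // runs of Unicode numeric characters

// Join any number of sources with `|` into one global, Unicode-aware regex.
const combine = (...sources) => new RegExp(sources.join('|'), 'gu')

const wordsNumbers = combine(words, numbers)
const text = 'A 1000000 dollars baby!'

console.log(text.match(combine(words)))   // [ 'A', 'dollars', 'baby' ]
console.log(text.match(combine(numbers))) // [ '1000000' ]
console.log(text.match(wordsNumbers))     // [ 'A', '1000000', 'dollars', 'baby' ]
```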
Languages supported
Supports most languages supported by stopword, and others too. Some languages, like Japanese and Simplified Chinese, need to be tokenized first. May add tokenizers at a later stage.
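Until then, one possible workaround is to segment such text before indexing it. The sketch below uses `Intl.Segmenter`, a standard JavaScript API available in Node.js 16+ and current browsers. It is independent of this library, and the exact word boundaries depend on the ICU data shipped with the runtime.

```javascript
// Sketch only: word segmentation for text without spaces between words.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' })
const text = '大阪に行きます'

const segments = [...segmenter.segment(text)]
  .filter(part => part.isWordLike) // drop punctuation and whitespace
  .map(part => part.segment)

console.log(segments) // boundaries vary with the runtime's ICU version
```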
PRs welcome
PRs and issues are more than welcome =)