document-tfidf

A TFIDF analysis package that supports tokens spanning any number of words.

Usage (no npm install needed):

<script type="module">
  import documentTfidf from 'https://cdn.skypack.dev/document-tfidf';
</script>

Getting Started

Install package with:

  npm install document-tfidf

Features:

  • countTermFrequencies
  • storeTermFrequencies
  • normalizeTermFrequencies
  • identifyUniqueTerms
  • fullTFIDFAnalysis

Documentation

  • Term Frequency - Inverse Document Frequency (TFIDF) Module:
    • countTermFrequencies: function(text [, options])
      • Counts the number of times each token appears in the input text.
      • Current options include tokenLength, which dictates the number of words that comprise each token. tokenLength defaults to 1.
      • Depends on the nGrams module, which extracts all tokens of an arbitrary length.
    • storeTermFrequencies: function(tokenSet, TFStorage)
      • Adds the tokenSet to TFStorage, so the analysis improves as more documents are stored.
      • It’s recommended, though not required, to save TFStorage in a persistent data store.
      • If TFStorage is not provided, it is created as an object and returned.
    • normalizeTermFrequencies: function(tokenSet, TFStorage)
      • For each token in tokenSet, normalizeTermFrequencies will divide its count by the total number found in TFStorage and return the token set with normalized counts.
    • identifyUniqueTerms: function(normalizedTokenSet [, options])
      • From the input normalizedTokenSet, identifyUniqueTerms returns the most distinctive tokens, defined as those with the highest TFIDF.
      • Current options include uniqueThreshold. If specified, identifyUniqueTerms returns all terms with a TFIDF equal to or greater than uniqueThreshold.
    • fullTFIDFAnalysis: function(text [, options])
      • Runs all of the above TFIDF calculations in sequence.
      • options accepts the same options as the individual steps above.
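
The four steps described above can be sketched in plain JavaScript. This is only an illustration of the technique, not the package's source: countTokens, storeCounts, normalizeCounts, and pickUnique are hypothetical stand-ins for the documented functions, the word-splitting rule is an assumption, and the sketch assumes "the total number found in TFStorage" means the stored total for that same token.

```javascript
// Hypothetical stand-ins for the documented functions; not the package's code.

// Count every token made of `tokenLength` consecutive words in `text`.
// Assumption: words are lowercase runs of letters, digits, and apostrophes.
function countTokens(text, { tokenLength = 1 } = {}) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const counts = Object.create(null); // no inherited keys such as "constructor"
  for (let i = 0; i + tokenLength <= words.length; i++) {
    const token = words.slice(i, i + tokenLength).join(' ');
    counts[token] = (counts[token] || 0) + 1;
  }
  return counts;
}

// Add one document's counts to the running collection totals.
// Creates the storage object when none is passed in, then returns it.
function storeCounts(tokenCounts, storage = Object.create(null)) {
  for (const [token, count] of Object.entries(tokenCounts)) {
    storage[token] = (storage[token] || 0) + count;
  }
  return storage;
}

// Divide each count by the stored total for the same token.
// A token missing from storage falls back to its own count (score 1).
function normalizeCounts(tokenCounts, storage) {
  const normalized = Object.create(null);
  for (const [token, count] of Object.entries(tokenCounts)) {
    normalized[token] = count / (storage[token] || count);
  }
  return normalized;
}

// Keep tokens whose score meets the threshold, highest score first.
// Assumption: 0.5 is an arbitrary default chosen for this sketch.
function pickUnique(normalized, { uniqueThreshold = 0.5 } = {}) {
  return Object.entries(normalized)
    .filter(([, score]) => score >= uniqueThreshold)
    .sort((a, b) => b[1] - a[1])
    .map(([token]) => token);
}

// Usage: two documents, analysed with two-word tokens.
const storage = Object.create(null);
const docA = countTokens('the cat sat on the mat', { tokenLength: 2 });
const docB = countTokens('the cat ate the fish', { tokenLength: 2 });
storeCounts(docA, storage);
storeCounts(docB, storage);

const normalizedB = normalizeCounts(docB, storage);
const uniqueToB = pickUnique(normalizedB, { uniqueThreshold: 1 });
console.log(uniqueToB); // [ 'cat ate', 'ate the', 'the fish' ]
```

Here "the cat" appears in both documents, so its normalized score for the second document is 0.5 and it falls below the threshold of 1, while the three tokens that occur only in the second document are kept.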

View the full specs and check out more text analysis in my Text Analysis Suite.