synonym-optimizer

Finds the text which has the least number of repetitions

Usage no npm install needed!

<script type="module">
  import synonymOptimizer from 'https://cdn.skypack.dev/synonym-optimizer';
</script>

README

synonym-optimizer

Gives a score to a string depending on the variety of the synonyms used.

For instance, let's compare The coffee is good. I love that coffee with The coffee is good. I love that bewerage. The second alternative is better because a synonym is used for coffee. This module will give a better score to the second alternative.

The lowest score the better.

Fully supported languages are French German English Italian and Spanish.

What it does / How it works:

  • single words are extracted thanks to a tokenizer wink-tokenizer
  • words are lowercased
  • stopwords are removed
    • for fully supported languages, a default stopwords list is included, which you can customize
    • for all other languages, no default list is included, but you can provide a custom stop words lists
  • for fully supported languages, words are stemmed using snowball-stemmer (for all other languages: no stemming)
  • when the same word appears multiples times, it raises the score depending on the distance of the two occurrences (if the occurrences are closes it raises the score a lot)

Designed primarly to test the output of a NLG (Natural Language Generation) system.

The stemmer is not perfect. For instance in Italian, cameriere and cameriera have the same stem (camerier), while camerieri and cameriera have a different one (camer and camerier).

Installation

npm install synonym-optimizer

Usage

var synOptimizer = require('synonym-optimizer');

alts = [
  'The coffee is good. I love that coffee.',
  'The coffee is good. I love that bewerage.'
]

/*
The coffee is good. I love that coffee.: 0.5
The coffee is good. I love that bewerage.: 0
*/
alts.forEach((alt) => {
  let score = synOptimizer.scoreAlternative('en_US', alt, null, null, null, null);
  console.log(`${alt}: ${score}`);
});

The main function is scoreAlternative. It takes a string and returns its score. Arguments are:

  • lang (string, mandatory): the language.
    • fully supported languages are fr_FR, en_US, de_DE, it_IT and es_ES
    • with any other language (for instance Dutch nl_NL) stemming is disabled and stopwords are not removed
  • alternative (string, mandatory): the string to score
  • stopWordsToAdd (string[], optional): list of stopwords to add to the standard stopwords list
  • stopWordsToRemove (string[], optional): list of stopwords to remove to the standard stopwords list
  • stopWordsOverride (string[], optional): replaces the standard stopword list
  • identicals (string[][], optional): list of words that should be considered as beeing identical, for instance [ ['phone', 'cellphone', 'smartphone'] ].

You can also use the getBest function. Most arguments are exactly the same, but instead of alternative, use alternatives (string[]). The output number will not be the score, but simply the index of the best alternative.

The tokenizer is wink-tokenizer, it does works with many languages (English, French, German, Hindi, Sanskrit, Marathi etc.) but not asian languages. Therefore the module will not work properly with Japanese, Chinese etc.

Adding new languages (for developpers / maintainers)

  • check for existence of stopwords module: stopwords-*
  • check for stemmer in snowball-stemmer collection (or plug another stemmer)
  • plug everything and add tests
  • find a proper tokenizer if wink-tokenizer does not work

dependencies and licences

  • wink-tokenizer to tokenize sentences in multiple languages (MIT).
  • stopwords-en/de/fs/it/es for standard stopwords lists per language (MIT).
  • snowball-stemmer to stem words per language (MIT).