stopword-sami

Sami stopword lists (South, Lule and Northern Sami) for natural language processing, plus the code to create and refine them. Example uses include search engines and machine learning.

Usage (no npm install needed)

<script type="module">
  import stopwordSami from 'https://cdn.skypack.dev/stopword-sami';
</script>

What

WIP! A project to generate stopword lists for all the Sami languages.

When the quality of the stopword lists is good enough, they will be added to the stopword module. Northern Sami will most likely be the first to reach that quality, followed by Lule Sami and South Sami.

To crawl

Lists of IDs

npm run crawlIds

Work so far

Generating lists of IDs to crawl

We use nrk-sapmi-crawler to build lists of document IDs to crawl. These documents will be crawled later, and their text content will be the basis for ongoing stopword training. The more content, the better the lists.

Work ahead

Crawl content

Once the ID lists cover enough content and nrk-sapmi-crawler can also crawl documents, crawl the actual documents.

Start training stopword lists

Run the stopword-trainer on the crawled text. We will then ask for help to manually verify the lists and to suggest words for a red-list for each Sami language. The stopword lists are black-lists: words that you don't want. Every now and then, a word you do want sneaks into a stopword list. Adding it to a red-list makes sure it won't end up in the finished stopword list.
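
As a rough illustration of the red-list idea, here is a minimal sketch. It does not use the stopword-trainer's API or its actual algorithm; instead it assumes a simple heuristic (words that appear in the most documents become candidates) and then filters the candidates against a red-list. The function names `candidateStopwords` and `applyRedList`, and the placeholder tokens, are hypothetical and only for illustration.

```javascript
// Hypothetical helper: rank words by the number of documents they occur in.
// This is a simple stand-in heuristic, not what stopword-trainer does.
function candidateStopwords(documents, topN) {
  const docFrequency = new Map();
  for (const text of documents) {
    // Normalize, lowercase, and split on anything that is not a Unicode letter,
    // so letters such as á, č, đ, ŋ and š stay inside words.
    const words = new Set(
      text.normalize('NFC').toLowerCase().split(/[^\p{L}]+/u).filter(Boolean)
    );
    for (const word of words) {
      docFrequency.set(word, (docFrequency.get(word) || 0) + 1);
    }
  }
  return [...docFrequency.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, topN)
    .map(([word]) => word);
}

// Hypothetical helper: drop every red-listed word from the candidate list.
function applyRedList(candidates, redList) {
  const wanted = new Set(redList.map((w) => w.normalize('NFC').toLowerCase()));
  return candidates.filter((word) => !wanted.has(word));
}

// Placeholder tokens, not real Sami text.
const documents = ['aa bb cc', 'aa bb dd', 'aa ee'];
const candidates = candidateStopwords(documents, 2); // 'aa' (3 docs), 'bb' (2 docs)
const finished = applyRedList(candidates, ['bb']);
console.log(finished); // logs [ 'aa' ]
```

Keeping the red-list as a separate, hand-maintained file per language would let the candidate list be regenerated from new crawls without losing the manual corrections.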

Help needed

We need help verifying the generated lists, and help understanding the different traits of the different Sami languages when that time comes.

Also, to generate/train stopword lists, we need text sources. For Northern Sami we will get what we need, but for Lule Sami and South Sami the material is a little thin. Maybe we just have to wait for NRK to create more content. For the rest of the languages, we have no source so far. If you know of a dataset, or a source from which to generate one, please give us a hint!

Why stopword lists for Sami languages?

For example, to be able to create good search engines or do machine learning based on content written in the different Sami languages.
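
To make the search-engine case concrete, here is a minimal sketch of an inverted index that skips stopwords while indexing. The `buildIndex` function is hypothetical, and the two-word stopword set is chosen only for illustration (for instance "ja", meaning "and"); it is not taken from this project's lists.

```javascript
// Illustrative stopword set, not this project's output.
const exampleStopwords = new Set(['ja', 'lea']);

// Hypothetical helper: map each non-stopword token to the IDs of the
// documents that contain it.
function buildIndex(docs, stopwordSet) {
  const index = new Map();
  docs.forEach((text, docId) => {
    const tokens = text
      .normalize('NFC')
      .toLowerCase()
      .split(/[^\p{L}]+/u) // split on anything that is not a Unicode letter
      .filter(Boolean);
    for (const token of tokens) {
      if (stopwordSet.has(token)) continue; // stopwords never reach the index
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(docId);
    }
  });
  return index;
}

const index = buildIndex(['áhčči ja eadni', 'eadni lea'], exampleStopwords);
console.log(index.has('ja')); // logs false
```

Without the stopword filter, very common words would map to almost every document and add little value to search results, which is the problem the stopword lists are meant to address.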