chinese-tokenizer

Simple algorithm to tokenize Chinese texts into words using CC-CEDICT.

Usage (no npm install needed)

<script type="module">
  import chineseTokenizer from 'https://cdn.skypack.dev/chinese-tokenizer';
</script>

You can try it out at the demo page. The code for the demo page can be found in the gh-pages branch of this repository.

How this works

This tokenizer uses a simple greedy algorithm: at each position in the input, it looks for the longest word in the CC-CEDICT dictionary that matches the text starting there, emits that word as a token, and continues right after it.
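The greedy longest-match idea described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the tiny `dictionary` set stands in for CC-CEDICT, and `greedyTokenize` and `maxWordLength` are hypothetical names introduced only for this example.

```javascript
// Hypothetical mini-dictionary standing in for CC-CEDICT (assumption for illustration).
const dictionary = new Set(['我', '是', '中', '中国', '中国人', '人'])

// Length in characters of the longest dictionary entry; bounds the search window.
const maxWordLength = Math.max(...[...dictionary].map(word => Array.from(word).length))

function greedyTokenize(text) {
  // Split by code point so characters outside the BMP are not cut in half.
  const chars = Array.from(text)
  const tokens = []
  let i = 0

  while (i < chars.length) {
    // Fallback: a character with no longer match becomes its own token.
    let match = chars[i]

    // Try the longest candidate first and shrink until a dictionary word matches.
    for (let length = Math.min(maxWordLength, chars.length - i); length > 1; length--) {
      const candidate = chars.slice(i, i + length).join('')
      if (dictionary.has(candidate)) {
        match = candidate
        break
      }
    }

    tokens.push(match)
    i += Array.from(match).length
  }

  return tokens
}

console.log(greedyTokenize('我是中国人。'))
// [ '我', '是', '中国人', '。' ]
```

Note that the sketch picks 中国人 over the shorter entries 中 and 中国 because the longest candidate is tried first. Punctuation that is not in the dictionary falls through as a single-character token.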

Installation

Use npm to install:

npm install chinese-tokenizer --save

Usage

Make sure to provide the CC-CEDICT data yourself; the example below loads it from a local file named cedict_ts.u8.

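// Build a tokenize function from a local copy of the CC-CEDICT dictionary file
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

// Simplified input
console.log(JSON.stringify(tokenize('我是中国人。'), null, '  '))
// Traditional input
console.log(JSON.stringify(tokenize('我是中國人。'), null, '  '))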

Output (shown for the second call, with traditional input):

[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]
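As an illustration of how the token array might be consumed, the sketch below joins the pinyin of each token into a single string. It assumes only the token shape shown in the output above. `toPinyin` is a hypothetical helper, not part of the library, and the `tokens` array is a trimmed, hand-copied version of that output so the example runs without the dictionary file.

```javascript
// Trimmed copy of the tokens from the output above (only the fields used here).
const tokens = [
  { text: '我', matches: [{ pinyinPretty: 'wǒ', english: 'I/me/my' }] },
  { text: '是', matches: [{ pinyinPretty: 'shì', english: 'is/are/am/yes/to be' }] },
  { text: '中國人', matches: [{ pinyinPretty: 'Zhōng guó rén', english: 'Chinese person' }] },
  { text: '。', matches: [] }
]

// Hypothetical helper: use the first match's pinyin, or the raw text when there is no match.
function toPinyin(tokens) {
  return tokens
    .map(token => (token.matches.length > 0 ? token.matches[0].pinyinPretty : token.text))
    .join(' ')
}

console.log(toPinyin(tokens))
// wǒ shì Zhōng guó rén 。
```

Tokens with an empty matches array, such as punctuation, are passed through unchanged. A word with several dictionary entries would have more than one element in matches; this sketch simply takes the first.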

API

loadFile(path)

Reads the CC-CEDICT file from the given path and returns a tokenize function based on that dictionary.

load(content)

Parses the CC-CEDICT dictionary text passed as the string content and returns a tokenize function based on that dictionary.
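For context on what such content looks like: CC-CEDICT is a plain-text file in which each entry line has the form "Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/" and lines starting with # are comments. The sketch below parses a few lines in that format. It is an assumption-laden illustration rather than the library's own parser: `parseCedict` is a hypothetical name, and the sample content is hand-written from the entries shown in the output earlier.

```javascript
// Sample content in CC-CEDICT format (entries taken from the output shown earlier).
const content = [
  '# CC-CEDICT style comment line',
  '我 我 [wo3] /I/me/my/',
  '中國人 中国人 [Zhong1 guo2 ren2] /Chinese person/'
].join('\n')

// Hypothetical parser, not the library's implementation.
function parseCedict(content) {
  const entries = []

  for (const line of content.split(/\r?\n/)) {
    // Skip blank lines and comments.
    if (line.trim() === '' || line.startsWith('#')) continue

    // Traditional, simplified, [pinyin], then slash-delimited glosses.
    const match = line.match(/^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/)
    if (match == null) continue

    const [, traditional, simplified, pinyin, english] = match
    entries.push({ traditional, simplified, pinyin, english })
  }

  return entries
}

console.log(parseCedict(content))
```

Each parsed entry keeps both the traditional and the simplified form, which is what would let a tokenizer built on such data accept input in either script.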