Chinese

Litsea supports Chinese word segmentation covering both Simplified and Traditional Chinese.

Character Types

| Code | Name | Pattern | Examples |
|------|------|---------|----------|
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是, 和 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ，, 《, 》 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |
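
The table above can be read as a lookup from a character to a one-letter type code. The following Python sketch illustrates that idea only; it is not Litsea's implementation. The names `char_type` and `type_string` are hypothetical, the check order (function words first, then letters and digits before punctuation) is an assumption, and the punctuation ranges are a guess at what "CJK Symbols + Full-width" covers (U+3000–U+303F and U+FF00–U+FFEF). The function-word set here holds only the five example characters from the table.

```python
# Illustrative sketch only: names and check order are assumptions, not Litsea's API.

# Only the five example characters from the table's F row.
FUNCTION_WORDS = frozenset("的了在是和")

# (code, inclusive code point ranges), taken from the table above.
RANGES = [
    ("C", [(0x4E00, 0x9FFF)]),
    ("X", [(0x3400, 0x4DBF)]),
    ("R", [(0x2E80, 0x2FDF)]),
    ("B", [(0x3100, 0x312F), (0x31A0, 0x31BF)]),
    # ASCII and full-width Latin letters.
    ("A", [(0x41, 0x5A), (0x61, 0x7A), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)]),
    # ASCII and full-width digits.
    ("N", [(0x30, 0x39), (0xFF10, 0xFF19)]),
    # Assumed blocks for "CJK Symbols + Full-width"; checked after A and N
    # because full-width letters and digits live in the second block.
    ("P", [(0x3000, 0x303F), (0xFF00, 0xFFEF)]),
]


def char_type(ch: str) -> str:
    """Return the one-letter type code for a single character."""
    # Function words fall inside the CJK Unified range, so test them first.
    if ch in FUNCTION_WORDS:
        return "F"
    cp = ord(ch)
    for code, ranges in RANGES:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return code
    return "O"  # fallback


def type_string(text: str) -> str:
    """Map every character of `text` to its type code."""
    return "".join(char_type(ch) for ch in text)


print(type_string("中文分词测试。"))  # CCCCCCP
```

Checking function words before the Unicode ranges matters because characters such as 的 would otherwise be classified as "C".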

Chinese Function Words (虚词)

The “F” type captures high-frequency grammatical words that are critical for segmentation:

| Category | Characters |
|----------|------------|
| Structural particles | 的, 地, 得 |
| Aspect/modal particles | 了, 着, 过, 吗, 呢, 吧, 啊, 嘛 |
| Conjunctions | 和, 与, 或, 但, 而, 且, 及 |
| Prepositions | 在, 从, 到, 把, 被, 对, 向, 给 |
| Grammatical verbs/adverbs | 是, 有, 不, 也, 都, 就, 要, 会, 能, 可 |

These characters appear overwhelmingly in grammatical roles and signal word boundaries differently from content words.
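
As a rough illustration, the categories listed above can be collected into a single lookup set. The sketch below is a minimal example under stated assumptions: `FUNCTION_WORDS_BY_CATEGORY` and `function_word_positions` are hypothetical names, not part of Litsea, and the code only marks where function words occur in a sentence. It does not perform segmentation.

```python
# Illustrative sketch only: names are hypothetical, not Litsea's API.

# Characters copied from the category list above.
FUNCTION_WORDS_BY_CATEGORY = {
    "structural_particles": "的地得",
    "aspect_modal_particles": "了着过吗呢吧啊嘛",
    "conjunctions": "和与或但而且及",
    "prepositions": "在从到把被对向给",
    "grammatical_verbs_adverbs": "是有不也都就要会能可",
}

# Flatten all categories into one set for constant-time membership tests.
FUNCTION_WORDS = frozenset("".join(FUNCTION_WORDS_BY_CATEGORY.values()))


def function_word_positions(text: str) -> list[int]:
    """Return the indices of characters in `text` that are function words."""
    return [i for i, ch in enumerate(text) if ch in FUNCTION_WORDS]


sentence = "我在北京和上海都有朋友"
print(function_word_positions(sentence))  # [1, 4, 7, 8]
```

Every character in this set also lies in the CJK Unified range (U+4E00–U+9FFF), which is why a classifier would need to test for function words before falling back to the range-based types.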

Pre-trained Model

chinese.model

  • Training corpus: UD Chinese-GSD
  • Accuracy: 80.72%

Example

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model