Pre-trained Models

Litsea ships with several pre-trained models in the models/ directory.

Model Catalog

japanese.model

Property         Value
Language         Japanese
Training Corpus  UD Japanese-GSD
Accuracy         94.15%
Precision        95.57%
Recall           94.36%
File Size        ~2.9 KB

korean.model

Property         Value
Language         Korean
Training Corpus  UD Korean-GSD
Accuracy         85.08%
File Size        ~1.8 KB

chinese.model

Property         Value
Language         Chinese (Simplified & Traditional)
Training Corpus  UD Chinese-GSD
Accuracy         80.72%
File Size        ~1.3 KB

RWCP.model

Property         Value
Language         Japanese
Source           Extracted from the original TinySegmenter
License          BSD 3-Clause (Taku Kudo)
File Size        ~22 KB

JEITA_Genpaku_ChaSen_IPAdic.model

Property         Value
Language         Japanese
Training Corpus  JEITA Project Sugita Genpaku corpus
Tokenizer        ChaSen with IPAdic
File Size        ~17 KB

POS Tagging Models

japanese_pos.model

Property         Value
Language         Japanese
Algorithm        Averaged Perceptron
Training Corpus  UD Japanese-GSD (7,050 sentences)
Epochs           10
Accuracy         98.34%
Macro Precision  97.87%
Macro Recall     91.67%
File Size        ~11 MB

chinese_pos.model

Property         Value
Language         Chinese (Simplified & Traditional)
Algorithm        Averaged Perceptron
Training Corpus  UD Chinese-GSD (3,997 sentences)
Epochs           10
Accuracy         97.09%
Macro Precision  97.31%
Macro Recall     96.23%
File Size        ~19 MB

korean_pos.model

Property         Value
Language         Korean
Algorithm        Averaged Perceptron
Training Corpus  UD Korean-GSD (4,400 sentences)
Epochs           10
Accuracy         95.33%
Macro Precision  95.30%
Macro Recall     87.69%
File Size        ~8.4 MB

Usage

echo "これはテストです。" | litsea segment --pos -l japanese models/japanese_pos.model

Output:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
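
The word/TAG format in this output is easy to post-process with standard shell tools. The following is a minimal sketch, not part of Litsea: split_tagged is a hypothetical helper name, and it assumes that tokens are separated by spaces and that the tag follows the last slash in each token, as in the sample output.

```shell
# Hypothetical helper (not part of Litsea): convert "word/TAG" tokens
# read from stdin into one "word<TAB>TAG" pair per line.
# Assumes tokens are space-separated and the tag follows the last "/".
split_tagged() {
  tr ' ' '\n' | while IFS= read -r tok; do
    [ -n "$tok" ] || continue          # skip empty fields
    # ${tok%/*} keeps everything before the last "/", ${tok##*/} what follows it
    printf '%s\t%s\n' "${tok%/*}" "${tok##*/}"
  done
}

# Example using the sample output from this section:
printf '%s\n' 'これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT' | split_tagged
```

Splitting on the last slash rather than the first keeps a word that itself contains a slash intact.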

Choosing a Model

  • For Japanese, use japanese.model for the best accuracy, or RWCP.model for compatibility with the original TinySegmenter
  • For Chinese, use chinese.model
  • For Korean, use korean.model
  • For joint word segmentation and POS tagging, use the corresponding *_pos.model (japanese_pos.model, chinese_pos.model, korean_pos.model)
  • For domain-specific needs, consider training your own model or retraining an existing one
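
The selection rules above can be captured in a small shell helper. This is a sketch only: model_for is a hypothetical function, not a Litsea command, and it assumes the models live under models/ with the file names listed in the catalog. It covers only the three UD-trained languages; RWCP.model and JEITA_Genpaku_ChaSen_IPAdic.model would still need to be named explicitly.

```shell
# Hypothetical helper (not part of Litsea): print the model path for a
# language and task, following the file naming used in this catalog.
# Assumes the models are stored under models/.
model_for() {
  lang="$1"
  task="${2:-segment}"   # "segment" (default) or "pos"
  case "$lang" in
    japanese|chinese|korean) ;;
    *) echo "unsupported language: $lang" >&2; return 1 ;;
  esac
  case "$task" in
    segment) echo "models/${lang}.model" ;;
    pos)     echo "models/${lang}_pos.model" ;;
    *) echo "unknown task: $task" >&2; return 1 ;;
  esac
}

model_for japanese        # prints models/japanese.model
model_for korean pos      # prints models/korean_pos.model
```

The printed path can then be passed to the command shown under Usage, for example: litsea segment --pos -l japanese "$(model_for japanese pos)"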

Sample Data

The resources/ directory contains sample data:

  • bocchan.txt – Sample Japanese corpus from the novel “Botchan” by Natsume Soseki (~307 KB). Used for benchmarking.