Pre-trained Models

Litsea ships with several pre-trained models in the models/ directory.

Model Catalog

Property	Value
Language	Chinese (Simplified & Traditional)
Training Corpus	UD Chinese-GSD
Accuracy	80.72%
File Size	~1.3 KB

RWCP.model

Property	Value
Language	Japanese
Source	Extracted from the original TinySegmenter
License	BSD 3-Clause (Taku Kudo)
File Size	~22 KB

Property	Value
Language	Japanese
Training Corpus	JEITA Project Sugita Genpaku corpus
Tokenizer	ChaSen with IPAdic
File Size	~17 KB

chinese_pos.model

Property	Value
Language	Chinese (Simplified & Traditional)
Algorithm	Averaged Perceptron
Training Corpus	UD Chinese-GSD (3,997 sentences)
Epochs	10
Accuracy	97.09%
Macro Precision	97.31%
Macro Recall	96.23%
File Size	~19 MB

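To segment and tag Japanese text in one pass, pipe it to litsea segment with the --pos flag and a POS model:

echo "これはテストです。" | litsea segment --pos -l japanese models/japanese_pos.model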

Output:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
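
The tagged output is a space-separated list of surface/TAG tokens, which is easy to post-process. As a minimal sketch in POSIX shell, the helper below splits each token into a tab-separated pair. The name split_tagged is hypothetical and not part of Litsea, and the code assumes that a tag never contains a slash.

```shell
# split_tagged: hypothetical helper, not part of Litsea.
# Reads lines of space-separated "surface/TAG" tokens on stdin and
# prints one "surface<TAB>TAG" pair per line.
# The body runs in a subshell so that `set -f` (no globbing) stays local.
split_tagged() (
  set -f
  while IFS= read -r line || [ -n "$line" ]; do
    for tok in $line; do
      # Split at the last slash so a surface containing "/" stays intact.
      printf '%s\t%s\n' "${tok%/*}" "${tok##*/}"
    done
  done
)

# Sample input copied from the output shown above.
printf '%s\n' 'これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT' | split_tagged
```

In practice the sample printf would be replaced by the litsea segment --pos pipeline, with its output piped into split_tagged.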

For Japanese, use japanese.model for the best accuracy, or RWCP.model for compatibility with the original TinySegmenter
For Chinese, use chinese.model
For Korean, use korean.model
For joint word segmentation and POS tagging, use the corresponding *_pos.model (japanese_pos.model, chinese_pos.model, korean_pos.model)
For domain-specific needs, consider training your own model or retraining an existing one
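
The recommendations above can be captured in a small lookup for scripts. This is only a sketch: model_for is a hypothetical helper, not something Litsea provides, and the paths assume the command is run from a directory that contains models/.

```shell
# model_for: hypothetical helper, not part of Litsea.
# Maps a language and an optional task ("pos") to the recommended
# model file. Paths assume a models/ directory in the working directory.
model_for() {
  lang=$1
  task=${2:-segment}
  case "$lang" in
    japanese|chinese|korean) ;;
    *) echo "no pre-trained model for: $lang" >&2; return 1 ;;
  esac
  if [ "$task" = pos ]; then
    echo "models/${lang}_pos.model"
  else
    echo "models/${lang}.model"
  fi
}

model_for japanese       # prints models/japanese.model
model_for chinese pos    # prints models/chinese_pos.model
```

The result can then be passed as the model argument, for example litsea segment --pos -l japanese "$(model_for japanese pos)". RWCP.model is left out of the lookup on purpose, since it is a compatibility choice rather than the default recommendation.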

The resources/ directory also contains sample data:

bocchan.txt – Sample Japanese corpus from the novel “Botchan” by Natsume Soseki (~307 KB). Used for benchmarking.