Training Guide
This guide walks you through training custom word segmentation and POS tagging models with Litsea.
Both workflows use Universal Dependencies (UD) Treebanks as the data source.
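UD Treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. To show what the raw data holds before the corpus scripts convert it, here is a minimal sketch that prints the word form (column 2) and universal POS tag (column 4) of each token. The function name `conllu_tokens` is hypothetical and not part of Litsea; it is only an inspection aid and does not replace `scripts/corpus_udtreebank.sh`.

```shell
# conllu_tokens FILE: print "FORM<TAB>UPOS" for each word in a CoNLL-U file.
# Hypothetical helper for inspecting the data; not part of Litsea.
conllu_tokens() {
  awk -F '\t' '
    /^#/ || NF == 0 { next }      # skip comment lines and sentence breaks
    $1 !~ /^[0-9]+$/ { next }     # skip multiword ranges (1-2) and empty nodes (1.1)
    { print $2 "\t" $4 }          # column 2 = FORM, column 4 = UPOS
  ' "$1"
}
```

For example, `conllu_tokens "$conllu_file" | head` shows the first few word and tag pairs of a downloaded treebank.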
Word Segmentation (AdaBoost)
- Prepare a corpus from a UD Treebank:
    conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
- Extract features from the corpus
- Train a model using AdaBoost
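Before extracting features, it can be worth checking that the corpus preparation step produced something reasonable. The sketch below assumes that `corpus.txt` holds one sentence per line with words separated by spaces; that layout is an assumption here, so confirm it against your own file first. The function name `corpus_stats` is hypothetical and not part of Litsea.

```shell
# corpus_stats FILE: print "<sentences> <tokens>" for a corpus file.
# Assumes one sentence per line with whitespace-separated words (an assumption,
# not something this guide states). Empty lines are ignored.
corpus_stats() {
  awk 'NF > 0 { sentences++; tokens += NF }
       END    { printf "%d %d\n", sentences, tokens }' "$1"
}
```

Running `corpus_stats corpus.txt` prints the two counts on one line. A token count equal to the sentence count would suggest the words were not split.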
POS Tagging (Averaged Perceptron)
- Prepare a POS corpus from a UD Treebank:
    conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
- Extract POS features:
    litsea extract --pos -l japanese pos_corpus.txt features.txt
- Train a POS model:
    litsea train --pos --num-epochs 10 features.txt model.txt
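The POS steps above can be chained into a single function that stops at the first failing step. This is a minimal sketch: it uses only the commands shown in this guide, while `run_step`, `train_pos_model`, and the `DRY_RUN` variable are hypothetical names introduced for illustration. It assumes the `.conllu` file has already been downloaded.

```shell
# run_step CMD...: print the command when DRY_RUN=1, otherwise execute it.
# Hypothetical helper; not part of Litsea.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# train_pos_model CONLLU LANG EPOCHS MODEL
# Runs corpus preparation, feature extraction, and training in order,
# returning non-zero as soon as one step fails.
train_pos_model() {
  conllu="$1"; lang="$2"; epochs="$3"; model="$4"
  run_step bash scripts/corpus_udtreebank.sh -p "$conllu" pos_corpus.txt || return 1
  run_step litsea extract --pos -l "$lang" pos_corpus.txt features.txt || return 1
  run_step litsea train --pos --num-epochs "$epochs" features.txt "$model" || return 1
}
```

Calling `DRY_RUN=1 train_pos_model "$conllu_file" japanese 10 model.txt` prints the three commands without running them, which is a cheap way to review the arguments before starting a long training run.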
Additional Topics
- Evaluating Models – assess model quality
- Retraining Models – fine-tune existing models