Training Guide

This guide walks you through training custom word segmentation and POS tagging models with Litsea.

Both workflows use Universal Dependencies (UD) Treebanks as the data source.

Word Segmentation (AdaBoost)

Prepare a corpus from a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
Extract features from the corpus (corpus.txt)
Train a model on the extracted features using AdaBoost
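
For orientation, the corpus preparation in the first step can be pictured as pulling the word forms out of the treebank. The sketch below is not the logic of scripts/corpus_udtreebank.sh. It assumes the target is one sentence per line with words separated by single spaces, and conllu_to_corpus is a hypothetical helper name. It relies only on the CoNLL-U layout: tab-separated columns with the word form in column 2, comment lines starting with #, and a blank line after each sentence.

```shell
# conllu_to_corpus: hypothetical helper, not part of Litsea.
# Reads CoNLL-U on stdin and writes one space-separated sentence per line.
conllu_to_corpus() {
  awk -F'\t' '
    # Comment lines (sentence metadata) carry no tokens.
    /^#/ { next }
    # A blank line ends the current sentence.
    NF == 0 { if (line != "") print line; line = ""; next }
    # Keep plain integer IDs only; this skips multiword ranges (1-2)
    # and empty nodes (1.1). Column 2 is the word form.
    $1 ~ /^[0-9]+$/ { line = (line == "" ? $2 : line " " $2) }
    # Flush the last sentence if the file lacks a trailing blank line.
    END { if (line != "") print line }
  '
}
```

To compare against the real script's output, something like conllu_to_corpus < "$conllu_file" | head -n 3 next to head -n 3 corpus.txt shows whether the assumed format matches.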

POS Tagging (Averaged Perceptron)

Prepare a POS corpus from a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Extract POS features: litsea extract --pos -l japanese pos_corpus.txt features.txt
Train a POS model: litsea train --pos --num-epochs 10 features.txt model.txt
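
The three POS steps above can be chained into one function, sketched here. The commands and flags are copied from the steps above. The names run, train_pos, and DRY_RUN are hypothetical additions for illustration, and the treebank path used during a dry run is only a placeholder.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# train_pos LANG_CODE LANG_NAME WORKDIR
# Example arguments, as in the steps above: ja japanese /tmp
train_pos() {
  lang_code=$1
  lang_name=$2
  workdir=$3

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Placeholder path: nothing is downloaded during a dry run.
    conllu_file="$workdir/treebank.conllu"
  else
    conllu_file=$(bash scripts/download_udtreebank.sh -l "$lang_code" -o "$workdir") || return 1
  fi

  # Each step runs only if the previous one succeeded.
  run bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt &&
  run litsea extract --pos -l "$lang_name" pos_corpus.txt features.txt &&
  run litsea train --pos --num-epochs 10 features.txt model.txt
}
```

Setting DRY_RUN=1 before calling train_pos ja japanese /tmp prints the commands without running them, which is a cheap way to check the arguments before starting a long training run.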

Additional Topics

Evaluating Models – assess model quality
Retraining Models – fine-tune existing models