Training Guide

This guide walks you through training custom word segmentation and POS tagging models with Litsea.

Both workflows use Universal Dependencies (UD) Treebanks as the data source.
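To make the data source concrete, here is a minimal Python sketch of reading a CoNLL-U file, the format UD Treebanks are distributed in. Each token line has ten tab-separated columns, with the surface form in column 2 and the universal POS tag in column 4. The sample sentence is hand-written, and the two output renderings (space-separated words, and word/TAG pairs) are assumptions for illustration; the exact line format that scripts/corpus_udtreebank.sh writes is not shown in this guide.

```python
def read_conllu(text):
    """Yield each sentence as a list of (FORM, UPOS) pairs."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry metadata such as "# text = ...".
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence


# Hand-written sample, not taken from a real treebank.
SAMPLE = "\n".join([
    "# text = 猫が寝る",
    "1\t猫\t猫\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\tが\tが\tADP\t_\t_\t1\tcase\t_\t_",
    "3\t寝る\t寝る\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])

sentences = list(read_conllu(SAMPLE))

# Assumed segmentation rendering: one sentence per line, words separated by spaces.
segmented = [" ".join(form for form, _ in s) for s in sentences]

# Assumed POS rendering: word/TAG pairs separated by spaces.
tagged = [" ".join(f"{form}/{upos}" for form, upos in s) for s in sentences]
```

The word boundaries and tags in the treebank are the gold labels that both training workflows learn from.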

Word Segmentation (AdaBoost)

  1. Prepare a corpus from a UD Treebank:
       conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
  2. Extract features from the corpus
  3. Train a model using AdaBoost
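As a rough illustration of what the AdaBoost training step does, the sketch below implements textbook discrete AdaBoost over binary features. Each candidate boundary position is a set of active features with a label of +1 (boundary) or -1 (no boundary), and each weak learner is a stump that votes +1 when one feature is present. This is not Litsea's implementation; the feature names, the toy data, and the function names are hypothetical.

```python
import math


def stump(feature, feats):
    """Weak learner: vote +1 if the feature is active, otherwise -1."""
    return 1 if feature in feats else -1


def train_adaboost(samples, rounds, eps=1e-10):
    """samples: list of (set_of_features, label) with label in {+1, -1}.

    Returns a dict mapping each selected feature to its accumulated vote weight.
    """
    n = len(samples)
    weights = [1.0 / n] * n
    candidates = sorted({f for feats, _ in samples for f in feats})
    model = {}
    for _ in range(rounds):
        # Pick the stump whose weighted error is farthest from chance (0.5).
        best_f, best_err, best_gap = None, 0.5, 0.0
        for f in candidates:
            err = sum(w for w, (feats, y) in zip(weights, samples)
                      if stump(f, feats) != y)
            gap = abs(0.5 - err)
            if gap > best_gap:
                best_f, best_err, best_gap = f, err, gap
        if best_f is None:
            break  # no stump is better than chance
        # Clamp the error so the logarithm stays finite.
        err = min(max(best_err, eps), 1.0 - eps)
        # A stump that is wrong more often than right gets a negative weight.
        alpha = 0.5 * math.log((1.0 - err) / err)
        model[best_f] = model.get(best_f, 0.0) + alpha
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump(best_f, feats))
                   for w, (feats, y) in zip(weights, samples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, feats):
    """Return +1 (boundary) or -1 (no boundary) from the weighted stump votes."""
    score = sum(alpha * stump(f, feats) for f, alpha in model.items())
    return 1 if score > 0 else -1


# Toy data with hypothetical features; the last position has no active feature.
samples = [
    ({"next=が"}, 1),
    ({"prev=が"}, 1),
    ({"bigram=東京"}, -1),
    (set(), -1),
]
model = train_adaboost(samples, rounds=3)
```

No single stump separates this toy data, but the weighted vote after three rounds classifies all four positions correctly, which is the point of boosting: later rounds focus on the samples that earlier stumps got wrong.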

POS Tagging (Averaged Perceptron)

  1. Prepare a POS corpus from a UD Treebank:
       conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
  2. Extract POS features:
       litsea extract --pos -l japanese pos_corpus.txt features.txt
  3. Train a POS model:
       litsea train --pos --num-epochs 10 features.txt model.txt
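For intuition about the training step, here is a minimal sketch of a textbook averaged perceptron tagger. On each mistake it rewards the gold tag and penalizes the predicted tag for every active feature, and the final model is the average of the weights over all training steps, which tends to be more stable than the last weights alone. The `num_epochs` argument plays the role of the number of passes over the data, as `--num-epochs` presumably does above. This is not Litsea's implementation; the feature templates and function names are hypothetical.

```python
def predict_tag(weights, feats, tags):
    """Return the highest-scoring tag; ties are broken by tag name."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    return max(tags, key=lambda t: (scores[t], t))


def train_averaged_perceptron(examples, tags, num_epochs):
    """examples: list of (set_of_features, gold_tag) pairs.

    Returns the weights averaged over every training step.
    """
    weights = {}  # current weights, keyed by (feature, tag)
    totals = {}   # running sum of the weights after each step
    steps = 0
    for _ in range(num_epochs):
        for feats, gold in examples:
            guess = predict_tag(weights, feats, tags)
            if guess != gold:
                # Perceptron update: reward the gold tag, penalize the guess.
                for f in feats:
                    weights[(f, gold)] = weights.get((f, gold), 0.0) + 1.0
                    weights[(f, guess)] = weights.get((f, guess), 0.0) - 1.0
            # Naive averaging: add the current weights to the running sum.
            for key, value in weights.items():
                totals[key] = totals.get(key, 0.0) + value
            steps += 1
    return {key: total / steps for key, total in totals.items()}


TAGS = ["ADP", "NOUN", "VERB"]

# Toy examples with hypothetical feature templates.
examples = [
    ({"word=猫", "last_char=kanji"}, "NOUN"),
    ({"word=が", "last_char=hiragana"}, "ADP"),
    ({"word=寝る", "last_char=hiragana"}, "VERB"),
]
averaged = train_averaged_perceptron(examples, TAGS, num_epochs=10)
```

Summing every weight after every step keeps the sketch short; practical implementations usually track the step at which each weight last changed so that averaging costs no more than the updates themselves.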

Additional Topics