train

Train a word segmentation model using AdaBoost.

Usage

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Arguments

| Argument | Description |
| --- | --- |
| `FEATURES_FILE` | Path to the input features file (output from `extract`) |
| `MODEL_FILE` | Path to the output model file |

Options

| Option | Default | Description |
| --- | --- | --- |
| `-t, --threshold <THRESHOLD>` | 0.01 | Weak classifier accuracy threshold for early stopping. Lower values allow more iterations. |
| `-i, --num-iterations <NUM_ITERATIONS>` | 100 | Maximum number of boosting iterations |
| `-m, --load-model-uri <LOAD_MODEL_URI>` | None | URI of an existing model to resume training from (file path or HTTP/HTTPS URL) |
| `--pos` | off | Enable POS (Part-of-Speech) training mode using Averaged Perceptron |
| `-e, --num-epochs <NUM_EPOCHS>` | 10 | Number of training epochs (POS mode only) |

Output

Training metrics are printed to stderr:

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679
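
The percentages in this report follow directly from the four confusion-matrix counts. The sketch below recomputes them in Python as a sanity check. The function name `binary_metrics` is hypothetical and not part of litsea, and the reading of the positive class as "word boundary" is an assumption.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct decisions / all decisions
        "precision": tp / (tp + fp),     # correct positives / predicted positives
        "recall": tp / (tp + fn),        # correct positives / actual positives
    }


# Counts copied from the sample output above.
metrics = binary_metrics(tp=330454, fp=15304, fn=19761, tn=233679)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The denominators printed in the report (599198, 345758, 350215) are the same sums used here: all samples, predicted positives, and actual positives.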

Ctrl+C Handling

Training supports graceful interruption:

  • First Ctrl+C: Stops training and saves the model at its current state
  • Second Ctrl+C: Exits immediately without saving

This allows you to stop long-running training sessions without losing progress.
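
The two-stage behaviour described above can be illustrated with a small Python sketch. This is not litsea's implementation (litsea is a separate tool and its internals are not shown on this page). The names `InterruptState`, `install_sigint_handler`, and `train` are hypothetical, and the exit code 130 is an assumed shell convention.

```python
import signal
import sys


class InterruptState:
    """Counts SIGINT (Ctrl+C) deliveries so the training loop can react."""

    def __init__(self) -> None:
        self.count = 0

    def handle(self, signum, frame) -> None:
        self.count += 1
        if self.count >= 2:
            # Second Ctrl+C: leave immediately, skipping the save step.
            sys.exit(130)  # 128 + SIGINT number; an assumed convention

    @property
    def stop_requested(self) -> bool:
        return self.count >= 1


def install_sigint_handler(state: InterruptState) -> None:
    """Route Ctrl+C to the state object (must be called from the main thread)."""
    signal.signal(signal.SIGINT, state.handle)


def train(state, max_iterations, step, save):
    """Run `step` until finished or interrupted, then save the progress so far."""
    completed = 0
    for _ in range(max_iterations):
        if state.stop_requested:
            break  # first Ctrl+C: stop cleanly and fall through to save
        step()
        completed += 1
    save(completed)
    return completed
```

In a real program, `install_sigint_handler(state)` would be called once before `train(...)`. The handler only sets a flag on the first signal, so the loop finishes its current step before saving.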

Examples

Basic training:

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training for potentially higher accuracy (lower threshold, more iterations):

litsea train -t 0.001 -i 5000 ./features.txt ./model.model

Retraining from an existing model:

litsea train -t 0.005 -i 1000 -m ./models/japanese.model \
    ./new_features.txt ./models/japanese_v2.model

Hyperparameter Tuning

| Parameter | Effect of Decreasing | Effect of Increasing |
| --- | --- | --- |
| `threshold` | More iterations, potentially higher accuracy, longer training time | Fewer iterations, faster training, may underfit |
| `num_iterations` | Fewer boosting rounds, smaller model, may underfit | More rounds, larger model, potentially higher accuracy |
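
To see why the two parameters interact this way, here is a minimal, generic AdaBoost loop in Python over binary features. It is a sketch under stated assumptions, not litsea's code: it assumes each weak classifier predicts +1 when a single feature is present, and that early stopping triggers when the best weak classifier's weighted error is within `threshold` of chance (0.5). The function names and the sample format are hypothetical.

```python
import math


def train_adaboost(samples, threshold=0.01, num_iterations=100):
    """samples: list of (set_of_active_features, label), label in {+1, -1}."""
    features = sorted({f for active, _ in samples for f in active})
    weights = [1.0 / len(samples)] * len(samples)
    model = []  # one (feature, alpha) pair per completed boosting round

    for _ in range(num_iterations):
        # Pick the stump "predict +1 iff feature is active" farthest from chance.
        best_feature, best_error = None, 0.5
        for f in features:
            error = sum(
                w for w, (active, label) in zip(weights, samples)
                if (1 if f in active else -1) != label
            )
            if abs(0.5 - error) > abs(0.5 - best_error):
                best_feature, best_error = f, error

        # Assumed early-stopping rule: best stump is barely better than chance.
        if best_feature is None or abs(0.5 - best_error) < threshold:
            break

        # Clamp the error so a perfect stump does not divide by zero.
        error = min(max(best_error, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((best_feature, alpha))

        # Increase the weight of misclassified samples, then renormalize.
        weights = [
            w * math.exp(-alpha * label * (1 if best_feature in active else -1))
            for w, (active, label) in zip(weights, samples)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]

    return model


def predict(model, active):
    """Sign of the alpha-weighted vote of all selected stumps."""
    score = sum(alpha * (1 if f in active else -1) for f, alpha in model)
    return 1 if score >= 0 else -1
```

Under these assumptions, a smaller `threshold` makes the `break` harder to reach, so more rounds run, while `num_iterations` caps the number of `(feature, alpha)` pairs and therefore the model size.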

POS Model Training

When the --pos flag is specified, train uses the Averaged Perceptron algorithm instead of AdaBoost. This trains a multiclass classifier for joint word segmentation and POS tagging.
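
For readers unfamiliar with the algorithm, the following Python sketch shows a generic multiclass Averaged Perceptron. It is illustrative only and does not reflect litsea's internals. The feature strings, the label names, and both function names are hypothetical, and the naive per-step accumulation is chosen for clarity rather than speed.

```python
from collections import defaultdict


def predict(labels, weights, feats):
    """Return the label whose summed feature weights score highest."""
    return max(labels, key=lambda lab: sum(weights.get((f, lab), 0.0) for f in feats))


def train_averaged_perceptron(samples, num_epochs=10):
    """samples: list of (list_of_feature_strings, label)."""
    labels = sorted({label for _, label in samples})
    weights = defaultdict(float)  # (feature, label) -> current weight
    totals = defaultdict(float)   # (feature, label) -> weight summed over steps
    steps = 0

    for _ in range(num_epochs):
        for feats, gold in samples:
            guess = predict(labels, weights, feats)
            if guess != gold:
                # Standard perceptron update: reward gold, penalize the guess.
                for f in feats:
                    weights[(f, gold)] += 1.0
                    weights[(f, guess)] -= 1.0
            steps += 1
            # Accumulate the current weights so they can be averaged later.
            for key, value in weights.items():
                totals[key] += value

    # The averaged weights are what the final classifier uses.
    averaged = {key: total / steps for key, total in totals.items()}
    return labels, averaged


samples = [
    (["w=cat"], "NOUN"),
    (["w=runs"], "VERB"),
    (["w=dog"], "NOUN"),
    (["w=eats"], "VERB"),
]
labels, averaged = train_averaged_perceptron(samples, num_epochs=10)
print(predict(labels, averaged, ["w=runs"]))
```

Averaging the weights over every step, rather than keeping only the final values, is what distinguishes this from a plain perceptron and tends to make the result less sensitive to the order of the last few updates.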

Usage

litsea train --pos [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

POS Training Options

| Option | Default | Description |
| --- | --- | --- |
| `--pos` | off | Enable POS training mode |
| `-e, --num-epochs <NUM_EPOCHS>` | 10 | Number of training epochs |

Examples

# Train a POS model from POS features
litsea train --pos -e 10 ./pos_features.txt ./models/japanese_pos.model

Output

POS training metrics are printed to stderr (macro-averaged precision and recall):

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%
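
Macro averaging computes precision and recall per class and then takes the unweighted mean, so rare classes count as much as frequent ones. The Python sketch below shows the calculation on made-up labels. The function name `macro_metrics` is hypothetical, and scoring a class with an empty denominator as 0.0 is an assumption that may differ from what litsea does.

```python
from collections import Counter


def macro_metrics(gold, predicted):
    """Accuracy plus macro-averaged precision and recall over all labels."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed

    def ratio(num, den):
        # Assumption: a class with no predictions (or no instances) scores 0.0.
        return num / den if den else 0.0

    precisions = [ratio(tp[lab], tp[lab] + fp[lab]) for lab in labels]
    recalls = [ratio(tp[lab], tp[lab] + fn[lab]) for lab in labels]
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
    }


# Made-up example: the single ADJ token is mislabeled as N.
gold = ["N", "N", "N", "V", "ADJ"]
predicted = ["N", "N", "N", "V", "N"]
result = macro_metrics(gold, predicted)
for name, value in result.items():
    print(f"{name}: {value:.2%}")
```

In this toy example one error on a rare class pulls macro recall well below accuracy, which is one possible reason a report can show high accuracy alongside noticeably lower macro recall.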

Ctrl+C Handling

As with AdaBoost training, POS training supports graceful interruption: the first Ctrl+C stops training and saves the model at its current state.

POS Hyperparameters

| Parameter | Effect of Decreasing | Effect of Increasing |
| --- | --- | --- |
| `num_epochs` | Faster training, may underfit | Potentially better accuracy, longer training, may overfit |