train

Train a word segmentation model using AdaBoost.

Usage

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Arguments

Argument	Description
`FEATURES_FILE`	Path to the input features file (output from `extract`)
`MODEL_FILE`	Path to the output model file

Options

Option	Default	Description
`-t`, `--threshold <THRESHOLD>`	`0.01`	Weak classifier accuracy threshold for early stopping. Lower values allow more iterations
`-i`, `--num-iterations <NUM_ITERATIONS>`	`100`	Maximum number of boosting iterations
`-m`, `--load-model-uri <LOAD_MODEL_URI>`	None	URI of an existing model to resume training from (file path or HTTP/HTTPS URL)
`--pos`	off	Enable POS (Part-of-Speech) training mode using Averaged Perceptron
`-e`, `--num-epochs <NUM_EPOCHS>`	`10`	Number of training epochs (POS mode only)

Output

Training metrics are printed to stderr:

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

Ctrl+C Handling

Training supports graceful interruption:

First Ctrl+C: Stops training and saves the model at its current state
Second Ctrl+C: Exits immediately without saving

This allows you to stop long-running training sessions without losing progress.

Examples

Basic training:

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training with higher precision (lower threshold, more iterations):

litsea train -t 0.001 -i 5000 ./features.txt ./model.model

Retraining from an existing model:

litsea train -t 0.005 -i 1000 -m ./models/japanese.model \
    ./new_features.txt ./models/japanese_v2.model

Hyperparameter Tuning

Parameter	Effect of Decreasing	Effect of Increasing
`threshold`	More iterations, potentially higher accuracy, longer training time	Fewer iterations, faster training, may underfit
`num_iterations`	Fewer boosting rounds, smaller model, may underfit	More rounds, larger model, potentially higher accuracy

POS Model Training

When the --pos flag is specified, train uses the Averaged Perceptron algorithm instead of AdaBoost. This trains a multiclass classifier for joint word segmentation and POS tagging.

Usage

litsea train --pos [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

POS Training Options

Option	Default	Description
`--pos`	off	Enable POS training mode
`-e`, `--num-epochs <NUM_EPOCHS>`	`10`	Number of training epochs

Examples

# Train a POS model from POS features
litsea train --pos -e 10 ./pos_features.txt ./models/japanese_pos.model

Output

POS training metrics are printed to stderr (macro-averaged precision and recall):

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

Ctrl+C Handling

Same as AdaBoost training, POS training supports graceful interruption. The first Ctrl+C stops training and saves the model at its current state.

POS Hyperparameters

Parameter	Effect of Decreasing	Effect of Increasing
`num_epochs`	Faster training, may underfit	Better accuracy, longer training, may overfit

Keyboard shortcuts

Litsea Documentation