Training Models

Once features are extracted, train a model using AdaBoost.

Command

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Basic Example

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training Process

flowchart TD
    A["Initialize features<br/>(read feature names)"] --> B["Initialize instances<br/>(read labels + features)"]
    B --> C["AdaBoost training loop"]
    C --> D{"Converged or<br/>max iterations?"}
    D -->|No| C
    D -->|Yes| E["Save model"]
    E --> F["Output metrics"]

  1. Initialize features – Reads the features file to build the feature index
  2. Initialize instances – Reads the features file a second time to load the labeled instances and set their initial weights
  3. Training loop – Iteratively selects the best feature, updates model weights, and reweights instances
  4. Save model – Writes non-zero feature weights to the model file
  5. Output metrics – Prints accuracy, precision, recall, and confusion matrix
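The steps above can be sketched as a toy AdaBoost loop in Python. This is an illustration of the general technique, not litsea's implementation. The function names, the in-memory input format (a label plus a set of active feature names), the weak rule "predict a boundary when the feature is present", and the stopping rule (stop when the best feature's weighted error is within `threshold` of 0.5) are all assumptions made for the example.

```python
import math

def train_adaboost(instances, threshold=0.01, max_iterations=100):
    """instances: list of (label, set_of_features), label is +1 or -1.
    Hypothetical sketch; not litsea's actual code or file format."""
    features = sorted({f for _, feats in instances for f in feats})
    weights = [1.0 / len(instances)] * len(instances)
    model = {f: 0.0 for f in features}

    for _ in range(max_iterations):
        # Select the feature whose weighted error is farthest from chance (0.5).
        best_feature, best_error = None, 0.5
        for f in features:
            error = sum(w for w, (y, feats) in zip(weights, instances)
                        if (1 if f in feats else -1) != y)
            if abs(0.5 - error) > abs(0.5 - best_error):
                best_feature, best_error = f, error
        # Assumed convergence test: no feature beats chance by at least `threshold`.
        if best_feature is None or abs(0.5 - best_error) < threshold:
            break
        # Clamp the error so the logarithm stays finite on separable data.
        error = min(max(best_error, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        model[best_feature] += alpha
        # Reweight: misclassified instances gain weight, then renormalize.
        for i, (y, feats) in enumerate(instances):
            h = 1 if best_feature in feats else -1
            weights[i] *= math.exp(-alpha * y * h)
        total = sum(weights)
        weights = [w / total for w in weights]

    # Keep only non-zero feature weights, as in the "Save model" step.
    return {f: w for f, w in model.items() if w != 0.0}

def predict_boundary(model, feats):
    """Weighted vote of the selected features: +1 (boundary) or -1."""
    score = sum(alpha * (1 if f in feats else -1) for f, alpha in model.items())
    return 1 if score >= 0 else -1
```

In this sketch a smaller `threshold` makes the convergence test harder to satisfy, which is consistent with the guidance that lower values allow more iterations.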

Hyperparameters

| Parameter | Flag | Default | Guidance |
| --- | --- | --- | --- |
| Threshold | -t | 0.01 | Start with 0.005. Lower values allow more iterations but increase training time |
| Iterations | -i | 100 | Start with 1000. Increase if accuracy is still improving when training stops |
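When comparing several settings, it can help to generate the training commands programmatically. The Python sketch below only builds command lines from the `-t` and `-i` flags documented above; the helper name and the model file naming scheme are hypothetical, and nothing is executed.

```python
import itertools
import shlex

def build_train_commands(features_file, model_dir, thresholds, iterations):
    """Return one `litsea train` argument list per (threshold, iterations) pair."""
    commands = []
    for t, i in itertools.product(thresholds, iterations):
        # Hypothetical naming scheme so each run writes its own model file.
        model_file = f"{model_dir}/t{t}_i{i}.model"
        commands.append(["litsea", "train", "-t", str(t), "-i", str(i),
                         features_file, model_file])
    return commands

commands = build_train_commands("./features.txt", "./models",
                                thresholds=[0.01, 0.005],
                                iterations=[100, 1000])
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run` to start the run, after which the printed metrics of the resulting models can be compared.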

Interpreting Output

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

  • Accuracy – Percentage of correct predictions (both boundaries and non-boundaries)
  • Precision – Of predicted boundaries, what fraction is correct
  • Recall – Of actual boundaries, what fraction was found
  • True Positives – Correctly predicted boundaries
  • False Positives – Predicted boundary where there is none
  • False Negatives – Missed actual boundaries
  • True Negatives – Correctly predicted non-boundaries
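The three percentages follow directly from the four confusion matrix counts. The Python sketch below recomputes them from the sample output above; the function name is hypothetical and the code is not part of litsea.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all predictions
        "precision": tp / (tp + fp),     # correct boundaries / predicted boundaries
        "recall": tp / (tp + fn),        # correct boundaries / actual boundaries
    }

# Counts taken from the sample output above.
metrics = segmentation_metrics(tp=330454, fp=15304, fn=19761, tn=233679)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The denominators match the sample output: 345758 predicted boundaries (true positives plus false positives) and 350215 actual boundaries (true positives plus false negatives).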

Graceful Interruption

Press Ctrl+C once during training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.
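The two-stage behavior can be illustrated with a small Python sketch. It shows the general pattern only and is not litsea's implementation; the class and function names are hypothetical, and the exit status 130 is the common shell convention for termination by Ctrl+C, assumed here.

```python
import signal
import sys

class InterruptState:
    """Counts Ctrl+C presses: the first requests a stop, the second exits."""

    def __init__(self):
        self.count = 0

    def handle(self, signum, frame):
        self.count += 1
        if self.count >= 2:
            # Second press: leave immediately, skipping the save step.
            sys.exit(130)

    @property
    def stop_requested(self):
        return self.count >= 1

def install(state):
    """Route SIGINT (Ctrl+C) to the state object. Call from the main thread."""
    signal.signal(signal.SIGINT, state.handle)

def run_training(max_iterations, state, step, save):
    """Run `step` until done or interrupted, then save the current model."""
    for i in range(max_iterations):
        if state.stop_requested:
            break  # first press: stop at the next iteration boundary
        step(i)
    save()
```

A caller would create an `InterruptState`, pass it to `install`, and then call `run_training` with its own `step` and `save` functions.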

POS Model Training

For training POS tagging models, use the --pos flag. POS models use the Averaged Perceptron algorithm (multiclass classifier) instead of AdaBoost (binary classifier).

POS Training Command

litsea train --pos --num-epochs 10 <FEATURES_FILE> <MODEL_FILE>

POS Training Example

litsea train --pos --num-epochs 10 ./features.txt ./models/japanese_pos.model

Averaged Perceptron vs AdaBoost

| Aspect | AdaBoost (Segmentation) | Averaged Perceptron (POS) |
| --- | --- | --- |
| Classification | Binary (boundary / non-boundary) | Multiclass (18 segment labels) |
| Labels | 1, -1 | B-NOUN, B-VERB, …, O |
| Hyperparameters | Threshold, Iterations | Number of epochs |
| Model size | ~1-22 KB | ~11 MB |
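To make the contrast concrete, here is a minimal Python sketch of a multiclass averaged perceptron. It illustrates the general algorithm and is not litsea's implementation; the function names, the in-memory input format, and the naive per-step averaging (simple, but slow on real data) are assumptions made for the example.

```python
from collections import defaultdict

def predict_label(weights, feats):
    """Return the label with the highest summed feature weight.
    Ties go to the alphabetically first label."""
    return max(sorted(weights),
               key=lambda y: sum(weights[y].get(f, 0.0) for f in feats))

def train_averaged_perceptron(instances, num_epochs=10):
    """instances: list of (label, set_of_features).
    Hypothetical sketch; returns averaged weights as {label: {feature: weight}}."""
    labels = sorted({y for y, _ in instances})
    weights = {y: defaultdict(float) for y in labels}
    totals = {y: defaultdict(float) for y in labels}
    steps = 0
    for _ in range(num_epochs):  # one epoch = one pass over the training data
        for gold, feats in instances:
            guess = predict_label(weights, feats)
            if guess != gold:
                # Reward the correct label and penalize the wrong guess.
                for f in feats:
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
            # Accumulate the current weights so they can be averaged at the end.
            for y in labels:
                for f, w in weights[y].items():
                    totals[y][f] += w
            steps += 1
    return {y: {f: t / steps for f, t in totals[y].items()} for y in labels}
```

Because the model keeps one weight per label and feature pair rather than one per feature, a multiclass model is naturally much larger than a binary one, which is in line with the model sizes in the table.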

POS Hyperparameters

| Parameter | Flag | Default | Guidance |
| --- | --- | --- | --- |
| Epochs | --num-epochs | 10 | Number of passes over the training data. Start with 10 and adjust based on metrics |

POS Training Output

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

  • Accuracy – Percentage of correct predictions across all classes
  • Macro Precision – Average precision across all POS classes
  • Macro Recall – Average recall across all POS classes
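Macro averaging computes precision and recall per class and then takes the unweighted mean, so every class counts equally regardless of how often it occurs. The Python sketch below shows the calculation on a toy example. The function name is hypothetical, and scoring a class with no predictions or no gold instances as 0.0 is an assumption; litsea may handle that case differently.

```python
def macro_metrics(gold, predicted):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # Assumption: an undefined ratio (zero denominator) counts as 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    return (accuracy,
            sum(precisions) / len(labels),
            sum(recalls) / len(labels))

gold = ["B-NOUN", "B-NOUN", "B-VERB", "O"]
predicted = ["B-NOUN", "B-VERB", "B-VERB", "O"]
accuracy, macro_precision, macro_recall = macro_metrics(gold, predicted)
```

Since rare classes weigh as much as frequent ones, a macro recall noticeably below accuracy, as in the sample output above, may indicate that some less frequent POS classes are often missed.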

POS Graceful Interruption

Press Ctrl+C once during POS training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.