Training Models
Once features are extracted, train a model using AdaBoost.
Command
litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>
Basic Example
litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model
Training Process
flowchart TD
A["Initialize features<br/>(read feature names)"] --> B["Initialize instances<br/>(read labels + features)"]
B --> C["AdaBoost training loop"]
C --> D{"Converged or<br/>max iterations?"}
D -->|No| C
D -->|Yes| E["Save model"]
E --> F["Output metrics"]
- Initialize features – Reads the features file to build the feature index
- Initialize instances – Reads the features file a second time to load the labeled instances and set their initial weights
- Training loop – Iteratively selects the best feature, updates model weights, and reweights instances
- Save model – Writes non-zero feature weights to the model file
- Output metrics – Prints accuracy, precision, recall, and the confusion matrix
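The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not litsea's implementation. It assumes that each weak learner is a stump on a single binary feature (predict a boundary if the feature is present), and that the threshold stops training once the best stump's weighted error is within `threshold` of 0.5. The names `train_adaboost`, `predict`, and the toy feature names are hypothetical.

```python
import math

def train_adaboost(instances, threshold=0.01, max_iterations=100):
    """instances: list of (label, feature_set) pairs, label in {+1, -1}."""
    features = sorted({f for _, feats in instances for f in feats})
    n = len(instances)
    weights = [1.0 / n] * n              # uniform initial instance weights
    model = {f: 0.0 for f in features}   # one weight per feature

    for _ in range(max_iterations):
        # Assumed weak learner: "predict +1 iff the feature is present".
        best_feature, best_error = None, 0.5
        for f in features:
            error = sum(w for w, (y, feats) in zip(weights, instances)
                        if (1 if f in feats else -1) != y)
            if abs(0.5 - error) > abs(0.5 - best_error):
                best_feature, best_error = f, error

        # Assumed stopping rule: no stump is meaningfully better than chance.
        if best_feature is None or abs(0.5 - best_error) < threshold:
            break

        # Clamp the error so a perfect stump does not cause log(0).
        error = min(max(best_error, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        model[best_feature] += alpha

        # Reweight instances: misclassified ones gain weight, then normalize.
        for i, (y, feats) in enumerate(instances):
            h = 1 if best_feature in feats else -1
            weights[i] *= math.exp(-alpha * y * h)
        total = sum(weights)
        weights = [w / total for w in weights]

    # Keep only non-zero feature weights, as in the "Save model" step.
    return {f: w for f, w in model.items() if w != 0.0}

def predict(model, feats):
    score = sum(w * (1 if f in feats else -1) for f, w in model.items())
    return 1 if score >= 0 else -1

# Toy data with made-up feature names: "a" marks every boundary.
toy = [(1, {"a"}), (1, {"a", "b"}), (-1, {"b"}), (-1, {"c"})]
toy_model = train_adaboost(toy, threshold=0.005, max_iterations=10)
```

On this toy data only the feature `a` receives a non-zero weight, so the saved model would contain a single entry.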
Hyperparameters
| Parameter | Flag | Default | Guidance |
|---|---|---|---|
| Threshold | -t | 0.01 | Start with 0.005. Lower values allow more iterations but increase training time |
| Iterations | -i | 100 | Start with 1000. Increase if accuracy is still improving when training stops |
Interpreting Output
Result Metrics:
Accuracy: 94.15% ( 564133 / 599198 )
Precision: 95.57% ( 330454 / 345758 )
Recall: 94.36% ( 330454 / 350215 )
Confusion Matrix:
True Positives: 330454
False Positives: 15304
False Negatives: 19761
True Negatives: 233679
- Accuracy – Percentage of correct predictions (both boundaries and non-boundaries)
- Precision – Fraction of predicted boundaries that are correct
- Recall – Fraction of actual boundaries that were found
- True Positives – Correctly predicted boundaries
- False Positives – Predicted boundary where there is none
- False Negatives – Missed actual boundaries
- True Negatives – Correctly predicted non-boundaries
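The three headline metrics follow directly from the confusion matrix. The short check below recomputes them from the counts in the sample output above; the function name `binary_metrics` is illustrative and not part of litsea.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all predictions
        "precision": tp / (tp + fp),     # correct boundaries / predicted boundaries
        "recall": tp / (tp + fn),        # correct boundaries / actual boundaries
    }

# Counts copied from the sample output above.
m = binary_metrics(tp=330454, fp=15304, fn=19761, tn=233679)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

The denominators match the ones printed in parentheses: 599198 total instances, 345758 predicted boundaries, and 350215 actual boundaries.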
Graceful Interruption
Press Ctrl+C once during training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.
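For readers curious how a two-stage interrupt like this is typically structured, here is a small Python sketch. It is an illustration of the general pattern only; litsea's own handler is not shown in this document, and `InterruptState`, `install`, and `train_with_interrupt` are hypothetical names.

```python
import signal
import sys

class InterruptState:
    """Counts Ctrl+C presses: the first requests a stop, the second exits."""

    def __init__(self):
        self.count = 0

    def handle(self, signum, frame):
        self.count += 1
        if self.count >= 2:
            sys.exit(1)  # second press: leave immediately, nothing is saved

    @property
    def stop_requested(self):
        return self.count >= 1

def install(state):
    # Register the handler for SIGINT (Ctrl+C); call this from the main thread.
    signal.signal(signal.SIGINT, state.handle)

def train_with_interrupt(max_iterations, step, save, state):
    for i in range(max_iterations):
        if state.stop_requested:
            break        # first press: finish the current step, then stop
        step(i)
    save()               # the model is written in its current state
```

The key design point is that the first press only sets a flag, so the loop can finish its current iteration and still reach the save step.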
POS Model Training
To train part-of-speech (POS) tagging models, pass the --pos flag. POS models use the Averaged Perceptron algorithm (a multiclass classifier) instead of AdaBoost (a binary classifier).
POS Training Command
litsea train --pos --num-epochs 10 <FEATURES_FILE> <MODEL_FILE>
POS Training Example
litsea train --pos --num-epochs 10 ./features.txt ./models/japanese_pos.model
Averaged Perceptron vs AdaBoost
| Aspect | AdaBoost (Segmentation) | Averaged Perceptron (POS) |
|---|---|---|
| Classification | Binary (boundary / non-boundary) | Multiclass (18 segment labels) |
| Labels | 1, -1 | B-NOUN, B-VERB, …, O |
| Hyperparameters | Threshold, Iterations | Number of epochs |
| Model size | ~1-22 KB | ~11 MB |
POS Hyperparameters
| Parameter | Flag | Default | Guidance |
|---|---|---|---|
| Epochs | --num-epochs | 10 | Number of passes over the training data. Start with 10 and adjust based on metrics |
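To make the role of the epoch count concrete, the following Python sketch shows a generic multiclass averaged perceptron. It is a teaching sketch under stated assumptions, not litsea's code: it averages by summing a weight snapshot after every instance (real implementations usually use a cheaper lazy update), and the function names and toy feature strings are made up.

```python
def predict(weights, labels, feats):
    scores = {y: 0.0 for y in labels}
    for f in feats:
        for y, v in weights.get(f, {}).items():
            scores[y] += v
    # Ties resolve to the first label in sorted order.
    return max(labels, key=lambda y: scores[y])

def train_averaged_perceptron(instances, num_epochs=10):
    """instances: list of (gold_label, feature_list) pairs."""
    labels = sorted({y for y, _ in instances})
    weights = {}   # feature -> {label: weight}
    totals = {}    # running sum of the weights after every instance
    steps = 0
    for _ in range(num_epochs):          # one epoch = one pass over the data
        for gold, feats in instances:
            guess = predict(weights, labels, feats)
            if guess != gold:
                # Reward the gold label and penalize the wrong guess.
                for f in feats:
                    row = weights.setdefault(f, {})
                    row[gold] = row.get(gold, 0.0) + 1.0
                    row[guess] = row.get(guess, 0.0) - 1.0
            # Naive averaging: accumulate the current weights.
            steps += 1
            for f, row in weights.items():
                trow = totals.setdefault(f, {})
                for y, v in row.items():
                    trow[y] = trow.get(y, 0.0) + v
    averaged = {f: {y: v / steps for y, v in row.items()}
                for f, row in totals.items()}
    return averaged, labels

# Toy data with hypothetical feature names.
toy_pos = [
    ("B-NOUN", ["w=cat", "suf=t"]),
    ("B-VERB", ["w=run", "suf=n"]),
    ("O", ["w=.", "punct"]),
]
pos_model, pos_labels = train_averaged_perceptron(toy_pos, num_epochs=10)
```

On this toy data the weights stop changing after the first epoch. On real data, additional epochs keep refining the weights until the metrics level off, which is why the guidance above suggests adjusting the count based on the reported metrics.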
POS Training Output
Result Metrics:
Accuracy: 98.34%
Macro Precision: 97.87%
Macro Recall: 91.67%
- Accuracy – Percentage of correct predictions across all classes
- Macro Precision – Average precision across all POS classes
- Macro Recall – Average recall across all POS classes
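Macro averaging computes precision and recall per class and then takes the unweighted mean, so rare classes count as much as frequent ones. The sketch below illustrates the calculation on a tiny made-up example. Scoring a class with no predictions (or no gold instances) as 0.0 is an assumption here, and litsea may handle that case differently.

```python
def macro_metrics(gold, predicted):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)

# Made-up example: one B-NOUN is mistaken for B-VERB.
gold = ["B-NOUN", "B-NOUN", "B-VERB", "O"]
pred = ["B-NOUN", "B-VERB", "B-VERB", "O"]
acc, macro_p, macro_r = macro_metrics(gold, pred)
```

Here accuracy is 3/4, while macro precision and macro recall are both 2.5/3: a single error lowers precision for B-VERB and recall for B-NOUN.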
POS Graceful Interruption
Press Ctrl+C once during POS training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.