Training Models

Once features are extracted, train a model using AdaBoost.

Command

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Basic Example

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training Process

flowchart TD
    A["Initialize features<br/>(read feature names)"] --> B["Initialize instances<br/>(read labels + features)"]
    B --> C["AdaBoost training loop"]
    C --> D{"Converged or<br/>max iterations?"}
    D -->|No| C
    D -->|Yes| E["Save model"]
    E --> F["Output metrics"]

Initialize features – Reads the features file to build the feature index
Initialize instances – Reads the features file a second time to load labeled instances and set their initial weights
Training loop – Iteratively selects the best feature, updates model weights, and reweights instances
Save model – Writes non-zero feature weights to the model file
Output metrics – Prints accuracy, precision, recall, and confusion matrix
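The training loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Litsea's actual implementation: each weak learner is assumed to be a stump over a single binary feature, and the threshold is assumed to be compared against how far the best stump's weighted error is from 0.5. The names `stump`, `train_adaboost`, and `predict` are hypothetical.

```python
import math


def stump(feature, present):
    """Weak learner: predict +1 if the feature is present, else -1."""
    return 1 if feature in present else -1


def train_adaboost(instances, num_features, threshold=0.01, max_iterations=100):
    """instances: list of (label, set_of_feature_indices); label is +1 or -1."""
    n = len(instances)
    weights = [1.0 / n] * n        # instance weights, uniform at the start
    model = [0.0] * num_features   # accumulated weight per feature
    for _ in range(max_iterations):
        # Weighted error of every stump. A stump whose feature is absent
        # predicts -1, so it starts out wrong on all positive instances.
        base = sum(w for w, (y, _) in zip(weights, instances) if y == 1)
        errors = [base] * num_features
        for w, (y, present) in zip(weights, instances):
            for f in present:
                errors[f] -= y * w
        # Select the feature whose error is furthest from chance (0.5).
        best = max(range(num_features), key=lambda f: abs(0.5 - errors[f]))
        err = min(max(errors[best], 1e-9), 1.0 - 1e-9)
        if abs(0.5 - err) < threshold:
            break  # assumed convergence test; Litsea's criterion may differ
        alpha = 0.5 * math.log((1.0 - err) / err)
        model[best] += alpha
        # Reweight: misclassified instances gain weight, then normalize.
        weights = [
            w * math.exp(-alpha * y * stump(best, present))
            for w, (y, present) in zip(weights, instances)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, present):
    """Sign of the weighted vote over all stumps."""
    score = sum(a * stump(f, present) for f, a in enumerate(model))
    return 1 if score >= 0 else -1
```

In this sketch, features that are never selected keep a weight of exactly 0.0, which is why a saved model only needs to store the non-zero entries.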

Hyperparameters

Parameter	Flag	Default	Guidance
Threshold	`-t`	0.01	Start with 0.005. Lower values let training run for more iterations before stopping, at the cost of longer training time
Iterations	`-i`	100	Start with 1000. Increase if accuracy is still improving when training stops

Interpreting Output

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

Accuracy – Percentage of correct predictions (both boundaries and non-boundaries)
Precision – Of predicted boundaries, what fraction is correct
Recall – Of actual boundaries, what fraction was found
True Positives – Correctly predicted boundaries
False Positives – Predicted boundary where there is none
False Negatives – Missed actual boundaries
True Negatives – Correctly predicted non-boundaries
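The three percentages follow directly from the four confusion-matrix counts. The short Python sketch below recomputes them from the sample output above as a sanity check; the function name `binary_metrics` is hypothetical and not part of Litsea.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all predictions
        "precision": tp / (tp + fp),     # correct boundaries / predicted boundaries
        "recall": tp / (tp + fn),        # correct boundaries / actual boundaries
    }


# Counts taken from the sample output above.
metrics = binary_metrics(tp=330454, fp=15304, fn=19761, tn=233679)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 94.15%
# precision: 95.57%
# recall: 94.36%
```

Note that the denominators match the sample output: 330454 + 15304 = 345758 predicted boundaries, and 330454 + 19761 = 350215 actual boundaries.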

Graceful Interruption

Press Ctrl+C once during training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.
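The two-stage behavior can be illustrated with a small signal-handling sketch in Python. It is only an assumption about how such a mechanism might be structured, not a description of Litsea's internals; `InterruptState`, `install`, and `train` are hypothetical names.

```python
import signal
import sys


class InterruptState:
    """Counts Ctrl+C presses so the training loop can react to them."""

    def __init__(self):
        self.count = 0

    @property
    def stop_requested(self):
        return self.count >= 1

    def handle(self, signum=None, frame=None):
        self.count += 1
        if self.count >= 2:
            # Second Ctrl+C: leave immediately, skipping the save step.
            sys.exit(130)


def install(state):
    """Route SIGINT (Ctrl+C) to the state object."""
    signal.signal(signal.SIGINT, state.handle)


def train(state, max_iterations, step, save):
    """Run training steps until done or interrupted, then save once."""
    for i in range(max_iterations):
        if state.stop_requested:
            break  # first Ctrl+C: stop early but still save below
        step(i)
    save()
```

The loop only checks the flag between iterations, so a single interrupt lets the current iteration finish before the model is written.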

POS Model Training

For training POS tagging models, use the --pos flag. POS models use the Averaged Perceptron algorithm (multiclass classifier) instead of AdaBoost (binary classifier).

POS Training Command

litsea train --pos --num-epochs 10 <FEATURES_FILE> <MODEL_FILE>

POS Training Example

litsea train --pos --num-epochs 10 ./features.txt ./models/japanese_pos.model

Averaged Perceptron vs AdaBoost

Aspect	AdaBoost (Segmentation)	Averaged Perceptron (POS)
Classification	Binary (boundary / non-boundary)	Multiclass (18 segment labels)
Labels	`1`, `-1`	`B-NOUN`, `B-VERB`, …, `O`
Hyperparameters	Threshold, Iterations	Number of epochs
Model size	~1-22 KB	~11 MB

POS Hyperparameters

Parameter	Flag	Default	Guidance
Epochs	`--num-epochs`	10	Number of passes over the training data. Start with 10 and adjust based on metrics
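To make the role of the epoch count concrete, here is a minimal averaged perceptron sketch in Python. It is a generic textbook formulation, not Litsea's code: the feature strings, the `(feature, label)` weight layout, and the function names are all assumptions made for illustration.

```python
def score(weights, features, label):
    """Sum of the weights of all active features for one label."""
    return sum(weights.get((f, label), 0.0) for f in features)


def predict(weights, features, labels):
    """Label with the highest score; ties go to the earliest label."""
    return max(labels, key=lambda lab: score(weights, features, lab))


def train_averaged_perceptron(instances, labels, num_epochs=10):
    """instances: list of (gold_label, list_of_feature_strings)."""
    weights = {}   # (feature, label) -> current weight
    totals = {}    # (feature, label) -> sum of the weight over past steps
    stamps = {}    # (feature, label) -> step at which it last changed
    step = 0       # number of instances processed so far

    def update(key, delta):
        # Credit the old value for the steps it was in effect, then change it.
        old = weights.get(key, 0.0)
        totals[key] = totals.get(key, 0.0) + (step - stamps.get(key, 0)) * old
        stamps[key] = step
        weights[key] = old + delta

    for _ in range(num_epochs):            # one epoch = one pass over the data
        for gold, features in instances:
            guess = predict(weights, features, labels)
            if guess != gold:
                for f in features:
                    update((f, gold), 1.0)    # reward the correct label
                    update((f, guess), -1.0)  # penalize the wrong guess
            step += 1

    # Average every weight over all processed instances.
    averaged = {}
    for key, w in weights.items():
        total = totals.get(key, 0.0) + (step - stamps.get(key, 0)) * w
        averaged[key] = total / step if step else 0.0
    return averaged
```

Averaging damps the influence of late, noisy updates, which is why adding epochs tends to give diminishing returns rather than instability. Whether more epochs help on a given corpus is best judged from the reported metrics.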

POS Training Output

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

Accuracy – Percentage of correct predictions across all classes
Macro Precision – Average precision across all POS classes
Macro Recall – Average recall across all POS classes
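Macro averaging gives every class equal weight regardless of how often it occurs, which is one way macro recall can sit well below accuracy when rare classes are often missed. The Python sketch below shows the calculation on a toy example. It assumes the class set is the union of gold and predicted labels and that an undefined ratio counts as 0; Litsea may handle these edge cases differently, and `macro_metrics` is a hypothetical name.

```python
def macro_metrics(gold, predicted):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # Assumption: a class with no predictions (or no gold items) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy data: one B-NOUN is mistagged as B-VERB.
gold = ["B-NOUN", "B-NOUN", "B-VERB", "O"]
pred = ["B-NOUN", "B-VERB", "B-VERB", "O"]
accuracy, macro_p, macro_r = macro_metrics(gold, pred)
print(f"{accuracy:.2%} {macro_p:.2%} {macro_r:.2%}")
# 75.00% 83.33% 83.33%
```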

POS Graceful Interruption

Press Ctrl+C once during POS training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.

