Prediction Pipeline
This chapter provides a step-by-step walkthrough of how Segmenter::segment() processes input text.
Example: Segmenting “これはテストです。”
Step 1: Initialize Arrays with Padding
chars: ["B3", "B2", "B1"]
types: ["O", "O", "O" ]
tags: ["U", "U", "U", "U"]
The tags array gets one extra “U” because tags[3] represents the first real character’s tag (set to “Unknown” since there is no prior boundary decision).
Step 2: Scan Input Characters
For each character in the input, determine its type using language-specific patterns and append to the arrays:
chars: ["B3","B2","B1", "こ","れ","は","テ","ス","ト","で","す","。"]
types: ["O", "O", "O", "I", "I", "I", "K", "K", "K", "I", "I", "P"]
Step 3: Append End Sentinels
chars: [..., "。", "E1", "E2", "E3"]
types: [..., "P", "O", "O", "O" ]
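Steps 1 to 3 can be sketched in Rust roughly as follows. This is a minimal illustration, not the library's actual code: `build_arrays` and `char_type` are hypothetical names, and the classifier covers only the character classes that appear in this example, whereas the real segmenter uses language-specific patterns.

```rust
/// Hypothetical, simplified character classifier (an assumption for this
/// sketch; the real segmenter relies on language-specific patterns).
fn char_type(c: char) -> &'static str {
    match c {
        '\u{3041}'..='\u{309F}' => "I", // hiragana
        '\u{30A0}'..='\u{30FF}' => "K", // katakana
        '。' | '、' => "P",             // two common punctuation marks
        _ => "O",                       // everything else
    }
}

/// Builds the padded `chars`, `types`, and `tags` arrays from Steps 1-3.
fn build_arrays(text: &str) -> (Vec<String>, Vec<String>, Vec<String>) {
    // Step 1: start sentinels and the four initial "U" tags.
    let mut chars: Vec<String> = vec!["B3".into(), "B2".into(), "B1".into()];
    let mut types: Vec<String> = vec!["O".into(); 3];
    let tags: Vec<String> = vec!["U".into(); 4];

    // Step 2: one entry per input character.
    for c in text.chars() {
        chars.push(c.to_string());
        types.push(char_type(c).to_string());
    }

    // Step 3: end sentinels.
    for sentinel in ["E1", "E2", "E3"] {
        chars.push(sentinel.to_string());
        types.push("O".to_string());
    }

    (chars, types, tags)
}
```

For the example sentence, this yields 15 entries in chars and types (three start sentinels, nine real characters, three end sentinels) and four entries in tags.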
Step 4: Iterate and Predict
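For each position i from 4 up to (but not including) len(chars) - 3, so that only real characters are visited. The current word starts as "こ" (chars[3]):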
i=4 (れ): Extract features → predict → label=-1 (O) → word="これ"
i=5 (は): Extract features → predict → label=+1 (B) → push "これ", word="は"
i=6 (テ): Extract features → predict → label=+1 (B) → push "は", word="テ"
i=7 (ス): Extract features → predict → label=-1 (O) → word="テス"
i=8 (ト): Extract features → predict → label=-1 (O) → word="テスト"
i=9 (で): Extract features → predict → label=+1 (B) → push "テスト", word="で"
i=10(す): Extract features → predict → label=-1 (O) → word="です"
i=11(。): Extract features → predict → label=+1 (B) → push "です", word="。"
Step 5: Push Final Word
Push the remaining word “。” to the result.
Result
["これ", "は", "テスト", "です", "。"]
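The loop in Steps 4 and 5 can be sketched as below. This is an illustration under stated assumptions rather than the actual body of Segmenter::segment(): `decode` is a hypothetical name, the `is_boundary` closure stands in for feature extraction plus prediction, and the early return for inputs without real characters is a guess at sensible behavior.

```rust
/// Greedy left-to-right decoding over the padded arrays (Steps 4 and 5).
/// `is_boundary` is a hypothetical stand-in for "extract features, then
/// predict"; it returns true when position `i` should start a new word.
fn decode<F>(
    chars: &[String],
    types: &[String],
    mut tags: Vec<String>,
    mut is_boundary: F,
) -> Vec<String>
where
    F: FnMut(usize, &[String], &[String], &[String]) -> bool,
{
    let mut result = Vec::new();
    if chars.len() < 7 {
        return result; // only sentinels, no real characters (assumed behavior)
    }

    // The first real character always opens the first word.
    let mut word = chars[3].clone();

    for i in 4..chars.len() - 3 {
        if is_boundary(i, &tags, chars, types) {
            tags.push("B".to_string());
            // Emit the finished word and start a fresh one.
            result.push(std::mem::take(&mut word));
        } else {
            tags.push("O".to_string());
        }
        word.push_str(&chars[i]);
    }

    // Step 5: flush the word that is still being built.
    result.push(word);
    result
}
```

Passing a closure that reports boundaries at positions 5, 6, 9, and 11 reproduces the trace above and returns the five words listed in the result.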
How Prediction Works at Each Position
At each position i, the segmenter:
- Extracts features: calls get_attributes(i, tags, chars, types) to build a HashSet<String> of 38–42 features.
- Computes score: the AdaBoost learner sums the model weights for all matching features plus the bias:
  score = bias + sum(model[feature] for feature in attributes)
- Makes decision: if score >= 0, the character starts a new word (boundary); otherwise, it continues the current word.
- Updates tags: pushes “B” or “O” to the tags array, which affects feature extraction for subsequent positions.
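The scoring and decision rule can be illustrated with a short sketch. The function names, the HashMap representation of the model, and the choice to ignore features absent from the model are assumptions made for this example; the library's learner may store and look up weights differently.

```rust
use std::collections::{HashMap, HashSet};

/// Sums the bias and the weights of all active features.
/// Features missing from the model contribute nothing (an assumption here).
fn score(model: &HashMap<String, f64>, bias: f64, attributes: &HashSet<String>) -> f64 {
    bias + attributes
        .iter()
        .filter_map(|feature| model.get(feature))
        .sum::<f64>()
}

/// Decision rule from the text: a non-negative score starts a new word.
fn is_boundary(model: &HashMap<String, f64>, bias: f64, attributes: &HashSet<String>) -> bool {
    score(model, bias, attributes) >= 0.0
}
```

With a bias of 0.5 and two active features weighted -1.0 and 0.25, the score is -0.25, so the character would continue the current word.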
Training vs. Prediction
| Aspect | Training (process_corpus) | Prediction (segment) |
|---|---|---|
| Tags source | Pre-computed from the annotated corpus | Dynamically generated by the model |
| First tag | “U” (overrides “B” at position 3) | “U” (no prior decision) |
| Labels | Known from corpus (+1 or -1) | Predicted by AdaBoost |
| Features | Written to file via callback | Passed directly to predict() |
During training, tags are derived from the ground-truth corpus segmentation, so the model learns from correct boundary decisions. During prediction, tags are generated on the fly, so each decision depends on the predictions made before it, and an early mistake can influence later ones. This is a left-to-right greedy approach.
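To make the training side concrete, the sketch below derives tags and labels from an already segmented sentence. It is only an assumption about how such a derivation could look: `tags_and_labels` is a hypothetical helper, not part of process_corpus, and only the "U" override at position 3 and the +1/-1 labels are taken from the table above.

```rust
/// Derives training tags and labels from a ground-truth segmentation.
/// Returns (tags, labels); labels[k] belongs to position i = k + 4.
fn tags_and_labels(words: &[&str]) -> (Vec<String>, Vec<i8>) {
    // Three "U" tags for the start sentinels.
    let mut tags: Vec<String> = vec!["U".to_string(); 3];

    for word in words {
        for (k, _) in word.chars().enumerate() {
            // The first character of each word is a boundary.
            let tag = if k == 0 { "B" } else { "O" };
            tags.push(tag.to_string());
        }
    }

    // The first real character has no prior decision: "U" overrides "B".
    if tags.len() > 3 {
        tags[3] = "U".to_string();
    }

    // One label per position from i = 4 onward: +1 for "B", -1 for "O".
    let labels: Vec<i8> = tags
        .iter()
        .skip(4)
        .map(|t| if t.as_str() == "B" { 1 } else { -1 })
        .collect();

    (tags, labels)
}
```

For the words "これ", "は", "テスト", "です", "。", the labels come out as -1, +1, +1, -1, -1, +1, -1, +1, which matches the labels in the Step 4 trace for i=4 through i=11.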
Performance Characteristics
The segmentation algorithm is linear in the length of the input:
- Each character position is visited once: O(n)
- Feature extraction at each position: O(1) (fixed number of features)
- Prediction at each position: O(f), where f is the number of active features (about 38–42)
- Total: O(n * f), which is effectively O(n) because f is bounded by a small constant