Prediction Pipeline

This chapter provides a step-by-step walkthrough of how Segmenter::segment() processes input text.

Example: Segmenting “これはテストです。”

Step 1: Initialize Arrays with Padding

chars: ["B3", "B2", "B1"]
types: ["O",  "O",  "O" ]
tags:  ["U",  "U",  "U", "U"]

The tags array gets one extra “U” because tags[3] represents the first real character’s tag (set to “Unknown” since there is no prior boundary decision).

Step 2: Scan Input Characters

For each character in the input, determine its type using language-specific patterns, then append the character and its type to the arrays:

chars: ["B3","B2","B1", "こ","れ","は","テ","ス","ト","で","す","。"]
types: ["O", "O", "O",  "I", "I", "I", "K", "K", "K", "I", "I", "P"]

Step 3: Append End Sentinels

chars: [..., "。", "E1", "E2", "E3"]
types: [..., "P",  "O",  "O",  "O" ]
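Steps 1 to 3 can be sketched as a single helper. This is a minimal illustration, not the library's code: build_arrays and char_type are hypothetical names, and char_type is a deliberately simplified classifier based on Unicode ranges, standing in for the language-specific patterns mentioned above.

```rust
/// Hypothetical, simplified classifier (an assumption for illustration only;
/// the real segmenter uses language-specific patterns).
fn char_type(c: char) -> &'static str {
    match c {
        '\u{3041}'..='\u{309F}' => "I", // Hiragana block
        '\u{30A1}'..='\u{30FF}' => "K", // Katakana block
        '。' | '、' => "P",             // two common punctuation marks
        _ => "O",                       // everything else
    }
}

/// Builds the padded `chars` and `types` arrays and the initial `tags` array.
fn build_arrays(input: &str) -> (Vec<String>, Vec<String>, Vec<String>) {
    // Step 1: three start sentinels. `tags` gets a fourth "U" for the
    // first real character, which has no prior boundary decision.
    let mut chars: Vec<String> = ["B3", "B2", "B1"].iter().map(|s| s.to_string()).collect();
    let mut types: Vec<String> = vec!["O".to_string(); 3];
    let tags: Vec<String> = vec!["U".to_string(); 4];

    // Step 2: append each input character together with its type.
    for c in input.chars() {
        chars.push(c.to_string());
        types.push(char_type(c).to_string());
    }

    // Step 3: three end sentinels, all typed "O".
    for s in ["E1", "E2", "E3"].iter() {
        chars.push(s.to_string());
        types.push("O".to_string());
    }

    (chars, types, tags)
}
```

For the nine-character example sentence, both chars and types end up with 15 entries (3 start sentinels, 9 characters, 3 end sentinels), while tags still holds only the four initial "U" entries.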

Step 4: Iterate and Predict

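Start with word = "こ" (the first real character, at index 3). Then, for each position i from 4 up to but not including len(chars) - 3 (here, i = 4 through 11):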

i=4 (れ): Extract features → predict → label=-1 (O) → word="これ"
i=5 (は): Extract features → predict → label=+1 (B) → push "これ", word="は"
i=6 (テ): Extract features → predict → label=+1 (B) → push "は", word="テ"
i=7 (ス): Extract features → predict → label=-1 (O) → word="テス"
i=8 (ト): Extract features → predict → label=-1 (O) → word="テスト"
i=9 (で): Extract features → predict → label=+1 (B) → push "テスト", word="で"
i=10(す): Extract features → predict → label=-1 (O) → word="です"
i=11(。): Extract features → predict → label=+1 (B) → push "です", word="。"

Step 5: Push Final Word

Push the remaining word “。” to the result.

Result

["これ", "は", "テスト", "です", "。"]
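The loop in Steps 4 and 5 can be sketched as follows. This is an assumption-laden illustration rather than the actual Segmenter::segment() body: segment_with is a hypothetical name, and the predict closure stands in for feature extraction plus AdaBoost scoring, returning +1 for a boundary and -1 otherwise.

```rust
/// Greedy left-to-right segmentation over a padded `chars` array.
/// `predict(i, tags)` is a hypothetical stand-in for the real
/// feature-extraction and scoring step.
fn segment_with<F>(chars: &[String], mut predict: F) -> Vec<String>
where
    F: FnMut(usize, &[String]) -> i8,
{
    let mut result = Vec::new();
    // Six sentinels and no real characters: nothing to segment.
    if chars.len() < 7 {
        return result;
    }

    let mut tags: Vec<String> = vec!["U".to_string(); 4];
    let mut word = chars[3].clone(); // first real character

    // Step 4: one decision per remaining real character.
    for i in 4..chars.len() - 3 {
        if predict(i, &tags) >= 0 {
            // Boundary: emit the finished word and start a new one.
            result.push(std::mem::take(&mut word));
            tags.push("B".to_string());
        } else {
            tags.push("O".to_string());
        }
        word.push_str(&chars[i]);
    }

    // Step 5: push whatever is left.
    result.push(word);
    result
}
```

Feeding the labels from the trace above (-1, +1, +1, -1, -1, +1, -1, +1) through a closure such as |i, _tags| labels[i - 4] reproduces the five words shown in the result.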

How Prediction Works at Each Position

At each position i, the segmenter:

  1. Extracts features – Calls get_attributes(i, tags, chars, types) to build a HashSet<String> of 38–42 features

  2. Computes score – The AdaBoost learner sums the model weights for all matching features plus the bias:

    score = bias + sum(model[feature] for feature in attributes)
    
  3. Makes decision – If score >= 0, the character starts a new word (boundary); otherwise, it continues the current word

  4. Updates tags – Pushes “B” or “O” to the tags array, which affects feature extraction for subsequent positions
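The scoring and decision steps above can be sketched like this. The function names score and decide, the f64 weight type, and the HashMap-based model are assumptions made for illustration; the library's own learner may store its weights differently.

```rust
use std::collections::{HashMap, HashSet};

/// Sums the bias and the weights of every active feature the model knows.
/// Features that are absent from the model contribute nothing.
fn score(model: &HashMap<String, f64>, bias: f64, attributes: &HashSet<String>) -> f64 {
    bias + attributes.iter().filter_map(|f| model.get(f)).sum::<f64>()
}

/// Maps a score to a label: +1 (boundary) if score >= 0, otherwise -1.
fn decide(score: f64) -> i8 {
    if score >= 0.0 {
        1
    } else {
        -1
    }
}
```

With made-up weights of 0.75 and -0.25 for two active features, a third active feature that the model does not contain, and a bias of -0.25, the score is 0.25, so decide returns +1 and the character starts a new word.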

Training vs. Prediction

| Aspect      | Training (process_corpus)               | Prediction (segment)               |
|-------------|-----------------------------------------|------------------------------------|
| Tags source | Pre-computed from the annotated corpus  | Dynamically generated by the model |
| First tag   | “U” (overrides “B” at position 3)       | “U” (no prior decision)            |
| Labels      | Known from corpus (+1 or -1)            | Predicted by AdaBoost              |
| Features    | Written to file via callback            | Passed directly to predict()       |

During training, tags are derived from the ground-truth corpus segmentation, so the model learns from correct boundary decisions. During prediction, tags are generated on-the-fly, meaning each decision depends on all previous predictions – this is a left-to-right greedy approach.
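To make the training side concrete, here is a minimal sketch of how tags and labels could be derived from a ground-truth segmentation. tags_and_labels is a hypothetical helper, not the process_corpus implementation, and it assumes that labels are produced only from position 4 onward, mirroring the prediction loop.

```rust
/// Derives the tag array and the per-position labels from gold-standard words.
/// Assumption: `labels[j]` belongs to position `j + 4`, since position 3
/// (the first real character) has no boundary decision to learn.
fn tags_and_labels(words: &[&str]) -> (Vec<String>, Vec<i8>) {
    let mut tags: Vec<String> = vec!["U".to_string(); 3]; // start padding
    let mut labels: Vec<i8> = Vec::new();

    for word in words {
        for (k, _) in word.chars().enumerate() {
            let starts_word = k == 0;
            let tag = if starts_word { "B" } else { "O" };
            tags.push(tag.to_string());
            labels.push(if starts_word { 1 } else { -1 });
        }
    }

    // The first real character would be tagged "B"; override it with "U"
    // and drop its label, because nothing is predicted at position 3.
    if tags.len() > 3 {
        tags[3] = "U".to_string();
        labels.remove(0);
    }

    (tags, labels)
}
```

For the gold segmentation ["これ", "は", "テスト", "です", "。"], the labels come out as -1, +1, +1, -1, -1, +1, -1, +1, which is exactly the sequence the model predicts in the walkthrough when every decision is correct.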

Performance Characteristics

The segmentation algorithm is linear in the length of the input:

  • Each character position is visited once: O(n)
  • Feature extraction at each position: O(1) (fixed number of features)
  • Prediction at each position: O(f) where f is the number of active features (~38-42)
  • Total: O(n * f); because f is bounded by a small constant, this is effectively O(n)