Feature Extraction
Litsea uses character n-gram features to capture the local context around each potential word boundary. This chapter catalogs all feature types.
Feature Categories
For each character position i in the input, the segmenter extracts features from a sliding window of characters, their type codes, and previous boundary decisions.
Base Features (38 features)
| Category | IDs | Description | Count |
|---|---|---|---|
| UW (Unary Word) | UW1–UW6 | Individual characters at positions i-3 to i+2 | 6 |
| BW (Bigram Word) | BW1–BW3 | Adjacent character pairs | 3 |
| UC (Unary Char-type) | UC1–UC6 | Character type codes at positions i-3 to i+2 | 6 |
| BC (Bigram Char-type) | BC1–BC3 | Adjacent type code pairs | 3 |
| TC (Trigram Char-type) | TC1–TC4 | Type code triples | 4 |
| UP (Unary Previous-tag) | UP1–UP3 | Previous 3 boundary decisions | 3 |
| BP (Bigram Previous-tag) | BP1–BP2 | Boundary decision pairs | 2 |
| UQ (Unary tag+type) | UQ1–UQ3 | Combined boundary decision + type code | 3 |
| BQ (Bigram tag+type) | BQ1–BQ4 | Combined decision + type code bigrams | 4 |
| TQ (Trigram tag+type) | TQ1–TQ4 | Combined decision + type code trigrams | 4 |
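The table can be read as a recipe for one position i. The following is a minimal Python sketch, not Litsea's actual implementation. The table fixes the windows for UW, UC, and UP; the exact offsets used here for BW, BC, TC, BP, UQ, BQ, and TQ follow the TinySegmenter convention and are an assumption. The function name `base_features` and its arguments are hypothetical.

```python
def base_features(chars, types, tags, i):
    """Return the 38 base features for position i as 'PREFIX:VALUE' strings.

    chars and types are padded sequences; tags must already hold the
    decisions for positions i-3 .. i-1. Offsets for the n-gram features
    are assumed (TinySegmenter convention), not confirmed against Litsea.
    """
    w = chars[i - 3:i + 3]  # w[0..5] = characters at i-3 .. i+2
    c = types[i - 3:i + 3]  # c[0..5] = type codes at i-3 .. i+2
    p = tags[i - 3:i]       # p[0..2] = previous three boundary decisions

    feats = []
    feats += [f"UW{k + 1}:{w[k]}" for k in range(6)]
    feats += [f"BW{k + 1}:{w[k + 1]}{w[k + 2]}" for k in range(3)]
    feats += [f"UC{k + 1}:{c[k]}" for k in range(6)]
    feats += [f"BC{k + 1}:{c[k + 1]}{c[k + 2]}" for k in range(3)]
    feats += [f"TC{k + 1}:{c[k]}{c[k + 1]}{c[k + 2]}" for k in range(4)]
    feats += [f"UP{k + 1}:{p[k]}" for k in range(3)]
    feats += [f"BP{k + 1}:{p[k]}{p[k + 1]}" for k in range(2)]
    feats += [f"UQ{k + 1}:{p[k]}{c[k]}" for k in range(3)]
    # BQ/TQ pair one of the two most recent tags with a type-code n-gram.
    feats += [
        f"BQ1:{p[1]}{c[1]}{c[2]}",
        f"BQ2:{p[1]}{c[2]}{c[3]}",
        f"BQ3:{p[2]}{c[1]}{c[2]}",
        f"BQ4:{p[2]}{c[2]}{c[3]}",
        f"TQ1:{p[1]}{c[0]}{c[1]}{c[2]}",
        f"TQ2:{p[1]}{c[1]}{c[2]}{c[3]}",
        f"TQ3:{p[2]}{c[0]}{c[1]}{c[2]}",
        f"TQ4:{p[2]}{c[1]}{c[2]}{c[3]}",
    ]
    return feats


# Padded toy input "はテスト"; position 4 is the candidate boundary before "テ".
chars = ["B3", "B2", "B1", "は", "テ", "ス", "ト", "E1", "E2", "E3"]
types = ["O", "O", "O", "I", "K", "K", "K", "O", "O", "O"]
tags = ["U", "U", "U", "U"]
feats = base_features(chars, types, tags, 4)
print(len(feats))  # 38
```

Counting the list entries reproduces the per-category counts in the table: 6 + 3 + 6 + 3 + 4 + 3 + 2 + 3 + 4 + 4 = 38.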
Language-Specific Features (4 features, Japanese and Chinese only)
| Category | IDs | Description | Count |
|---|---|---|---|
| WC (Word+Char-type) | WC1–WC4 | Character + type code mixed features | 4 |
- WC1: character at i-1 + type code at i
- WC2: type code at i-1 + character at i
- WC3: character at i-1 + type code at i-1
- WC4: character at i + type code at i
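The four WC definitions translate directly into string concatenations. The sketch below is illustrative Python rather than Litsea's code; the function name `wc_features` and the padded `chars` and `types` lists are assumptions made for the example.

```python
def wc_features(chars, types, i):
    """Return the four mixed character/type features for position i."""
    return [
        f"WC1:{chars[i - 1]}{types[i]}",      # character at i-1 + type code at i
        f"WC2:{types[i - 1]}{chars[i]}",      # type code at i-1 + character at i
        f"WC3:{chars[i - 1]}{types[i - 1]}",  # character at i-1 + type code at i-1
        f"WC4:{chars[i]}{types[i]}",          # character at i + type code at i
    ]


# Hiragana "は" (type I) followed by Katakana "テ" (type K), with padding.
chars = ["B3", "B2", "B1", "は", "テ", "E1", "E2", "E3"]
types = ["O", "O", "O", "I", "K", "O", "O", "O"]
print(wc_features(chars, types, 4))
```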
Why no WC for Korean? Korean Hangul syllables are classified into only two types (SN and SF), so WC features would add noise rather than useful signal.
Total Feature Count
| Language | Base | WC | Total |
|---|---|---|---|
| Japanese | 38 | 4 | 42 |
| Chinese | 38 | 4 | 42 |
| Korean | 38 | 0 | 38 |
Feature Format
Each feature is represented as a string in the format PREFIX:VALUE:
```text
UW4:は   ← The character at position i is "は"
UC4:I    ← The type code at position i is "I" (Hiragana)
BW2:はテ ← The bigram at position i-1..i is "はテ"
BC2:IK   ← The type bigram is Hiragana + Katakana
UP3:B    ← The previous boundary decision was "B" (boundary)
WC1:はK  ← Character "は" combined with type "K"
```
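When inspecting features, it can help to split a string back into its prefix and value. This is a small Python sketch with a hypothetical helper name, `split_feature`. It splits on the first colon only, on the assumption that a value may itself be a colon character taken from the input text.

```python
def split_feature(feature):
    """Split 'PREFIX:VALUE' into (prefix, value) at the first colon."""
    prefix, sep, value = feature.partition(":")
    if not sep:
        raise ValueError(f"not a PREFIX:VALUE feature: {feature!r}")
    return prefix, value


prefix, value = split_feature("BW2:はテ")
category, index = prefix[:2], int(prefix[2:])  # e.g. "BW" and 2
print(category, index, value)
```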
Sliding Window Layout
The segmenter pads the input with sentinel characters:
```text
Index:  0   1   2   3   4   5   ...  n+2  n+3  n+4  n+5
Chars:  B3  B2  B1  c1  c2  c3  ...  cn   E1   E2   E3
Types:  O   O   O   t1  t2  t3  ...  tn   O    O    O
Tags:   U   U   U   U   ?   ?   ...  ?
```
- B3, B2, B1 – Begin sentinels (padding)
- E1, E2, E3 – End sentinels (padding)
- O – “Other” type for padding positions
- U – “Unknown” tag for initial positions
- B – “Boundary” tag (word start)
- O – “Other” tag (continuation)
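Features are extracted for every position i with 4 ≤ i < len-3, that is, from the second real character (c2) through the last one (cn). At each of these positions the full window from i-3 to i+2 is available.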
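The padding scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Litsea's code: `pad_input`, `positions`, and `toy_type` are hypothetical names, and `toy_type` is a stand-in classifier that knows only Hiragana (I), Katakana (K), ASCII letters (A, an assumed meaning), and O for everything else. The upper bound len-3 is treated as exclusive, so the last position visited is the final real character, as the layout suggests.

```python
def toy_type(ch):
    """Toy stand-in for a real character-type classifier (assumption)."""
    if "\u3041" <= ch <= "\u309f":
        return "I"  # Hiragana
    if "\u30a1" <= ch <= "\u30ff":
        return "K"  # Katakana
    if ch.isascii() and ch.isalpha():
        return "A"  # ASCII letter (assumed meaning of "A")
    return "O"      # everything else


def pad_input(text, char_type=toy_type):
    """Build the padded chars/types/tags sequences for one input string."""
    chars = ["B3", "B2", "B1"] + list(text) + ["E1", "E2", "E3"]
    types = ["O"] * 3 + [char_type(ch) for ch in text] + ["O"] * 3
    tags = ["U"] * 4  # three sentinels plus the first real character
    return chars, types, tags


def positions(chars):
    """Positions at which features are extracted (upper bound exclusive)."""
    return range(4, len(chars) - 3)


chars, types, tags = pad_input("はテスト")
for i in positions(chars):
    print(i, chars[i - 3:i + 3])  # the six-character window around i
```

A decoder would append one tag (B or O) to `tags` after each position, so that the three previous decisions are available when the next position is processed.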
Training Data Format
The extract command writes features to a file in this format:
```text
1 UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1 UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...
```
Each line contains:
- A label (1 for boundary, -1 for non-boundary)
- Tab-separated feature strings
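A line in this format can be read and written with plain string operations. The Python sketch below uses hypothetical helper names (`parse_line`, `format_line`) and assumes that the label is separated from the first feature by a tab, like the features themselves.

```python
def parse_line(line):
    """Parse one training line into (label, [feature, ...])."""
    label, *features = line.rstrip("\n").split("\t")
    return int(label), features


def format_line(label, features):
    """Serialize a label (1 or -1) and its features as one line."""
    return "\t".join([str(label), *features])


line = format_line(-1, ["UW1:B1", "UW2:L", "UW3:i"])
label, features = parse_line(line)
print(label, features)
```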