Feature Extraction

Litsea uses character n-gram features to capture the local context around each potential word boundary. This chapter catalogs all feature types.

Feature Categories

For each character position i in the input, the segmenter extracts features from a sliding window of characters, their type codes, and previous boundary decisions.

Base Features (38 features)

Category                   IDs        Description                                      Window
UW (Unary Word)            UW1–UW6    Individual characters at positions i-3 to i+2    6
BW (Bigram Word)           BW1–BW3    Adjacent character pairs                         3
UC (Unary Char-type)       UC1–UC6    Character type codes at positions i-3 to i+2     6
BC (Bigram Char-type)      BC1–BC3    Adjacent type code pairs                         3
TC (Trigram Char-type)     TC1–TC4    Type code triples                                4
UP (Unary Previous-tag)    UP1–UP3    Previous 3 boundary decisions                    3
BP (Bigram Previous-tag)   BP1–BP2    Boundary decision pairs                          2
UQ (Unary tag+type)        UQ1–UQ3    Combined boundary decision + type code           3
BQ (Bigram tag+type)       BQ1–BQ4    Combined decision + type code bigrams            4
TQ (Trigram tag+type)      TQ1–TQ4    Combined decision + type code trigrams           4
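
As a rough illustration of how the 38 base features could be assembled for one position, the Python sketch below builds them from three parallel lists. It is not Litsea's implementation. The function name base_features is hypothetical, and the exact offsets used for the BW, BC, TC, BP, UQ, BQ, and TQ categories follow the original TinySegmenter convention, which is an assumption that the table does not confirm.

```python
def base_features(i, chars, types, tags):
    """Return the 38 base feature strings for position i (illustrative sketch).

    chars, types, and tags are parallel lists; indices i-3 .. i+2 must exist.
    Offsets for BW/BC/TC/BP/UQ/BQ/TQ are assumed (TinySegmenter convention).
    """
    w = chars[i - 3:i + 3]   # w[0]..w[5] = characters at i-3 .. i+2
    c = types[i - 3:i + 3]   # c[0]..c[5] = type codes at i-3 .. i+2
    p = tags[i - 3:i]        # p[0]..p[2] = decisions at i-3 .. i-1

    feats = set()
    # Unary character and type features: UW1..UW6, UC1..UC6
    for k in range(6):
        feats.add(f"UW{k + 1}:{w[k]}")
        feats.add(f"UC{k + 1}:{c[k]}")
    # Bigrams over (i-2, i-1), (i-1, i), (i, i+1): BW1..BW3, BC1..BC3
    for k in range(3):
        feats.add(f"BW{k + 1}:{w[k + 1]}{w[k + 2]}")
        feats.add(f"BC{k + 1}:{c[k + 1]}{c[k + 2]}")
    # Type trigrams starting at i-3, i-2, i-1, i: TC1..TC4
    for k in range(4):
        feats.add(f"TC{k + 1}:{c[k]}{c[k + 1]}{c[k + 2]}")
    # Previous decisions (UP) and decision + type code at the same offset (UQ)
    for k in range(3):
        feats.add(f"UP{k + 1}:{p[k]}")
        feats.add(f"UQ{k + 1}:{p[k]}{c[k]}")
    # Pairs of previous decisions
    feats.add(f"BP1:{p[0]}{p[1]}")
    feats.add(f"BP2:{p[1]}{p[2]}")
    # One decision combined with a type bigram (BQ) or type trigram (TQ)
    feats.add(f"BQ1:{p[1]}{c[1]}{c[2]}")
    feats.add(f"BQ2:{p[1]}{c[2]}{c[3]}")
    feats.add(f"BQ3:{p[2]}{c[1]}{c[2]}")
    feats.add(f"BQ4:{p[2]}{c[2]}{c[3]}")
    feats.add(f"TQ1:{p[1]}{c[0]}{c[1]}{c[2]}")
    feats.add(f"TQ2:{p[1]}{c[1]}{c[2]}{c[3]}")
    feats.add(f"TQ3:{p[2]}{c[0]}{c[1]}{c[2]}")
    feats.add(f"TQ4:{p[2]}{c[1]}{c[2]}{c[3]}")
    return feats
```

Because every prefix is distinct, the returned set always holds exactly 38 strings, which matches the count in the heading above.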

Language-Specific Features (4 features, Japanese and Chinese only)

Category              IDs        Description                            Count
WC (Word+Char-type)   WC1–WC4    Character + type code mixed features   4
  • WC1: character at i-1 + type code at i
  • WC2: type code at i-1 + character at i
  • WC3: character at i-1 + type code at i-1
  • WC4: character at i + type code at i
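
The four definitions above translate directly into string concatenation. The following Python sketch shows one way to write them; wc_features is a hypothetical name, and the sketch is not taken from Litsea's source.

```python
def wc_features(i, chars, types):
    """Return the four WC feature strings for position i (illustrative sketch)."""
    return {
        f"WC1:{chars[i - 1]}{types[i]}",      # character at i-1 + type code at i
        f"WC2:{types[i - 1]}{chars[i]}",      # type code at i-1 + character at i
        f"WC3:{chars[i - 1]}{types[i - 1]}",  # character at i-1 + type code at i-1
        f"WC4:{chars[i]}{types[i]}",          # character at i + type code at i
    }
```

For a Hiragana "は" at i-1 followed by a Katakana "テ" at i, WC1 pairs the left character with the right type code, so the feature value is "はK".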

Why no WC for Korean? Korean Hangul syllables are classified into only two types (SN and SF), so WC features would add noise rather than useful signal.

Total Feature Count

Language    Base   WC   Total
Japanese    38     4    42
Chinese     38     4    42
Korean      38     0    38

Feature Format

Each feature is a string of the form PREFIX:VALUE. The lines below are independent examples, not the feature set of a single position:

UW4:は        ← The character at position i is "は"
UC4:I         ← The type code at position i is "I" (Hiragana)
BW2:はテ      ← The bigram at position i-1..i is "はテ"
BC2:IK        ← The type bigram is Hiragana + Katakana
UP3:B         ← The previous boundary decision was "B" (boundary)
WC1:はK       ← Character "は" combined with type "K"
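
When reading features back, the prefix and value can be separated at the first colon. The Python sketch below does this with a hypothetical helper, split_feature. Splitting only once matters under the assumption that the input text may itself contain a colon character, in which case the value would be ":".

```python
def split_feature(feature):
    """Split a 'PREFIX:VALUE' string at the first colon only."""
    prefix, sep, value = feature.partition(":")
    if not sep:
        raise ValueError(f"not a PREFIX:VALUE feature: {feature!r}")
    return prefix, value


def category(feature):
    """Return the two-letter category (UW, BC, TQ, ...) of a feature string."""
    prefix, _ = split_feature(feature)
    return prefix[:2]
```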

Sliding Window Layout

The segmenter pads the input with sentinel characters:

Index:   0    1    2    3    4    5    ...  n+2  n+3  n+4  n+5
Chars:   B3   B2   B1   c1   c2   c3  ...  cn   E1   E2   E3
Types:   O    O    O    t1   t2   t3  ...  tn   O    O    O
Tags:    U    U    U    U    ?    ?   ...  ?
  • B3, B2, B1 – Begin sentinels (padding)
  • E1, E2, E3 – End sentinels (padding)
  • O – “Other” type for padding positions
  • U – “Unknown” tag for initial positions
  • B – “Boundary” tag (word start)
  • O – “Other” tag (continuation)

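Features are extracted for each position i from 4 up to, but not including, len-3, which corresponds to the ? tags in the layout above. At every such position the full window i-3 to i+2 is available because of the padding.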
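
The padding and the range of positions can be sketched in a few lines of Python. The helper names pad_sentence and decision_positions are hypothetical, the character type codes are taken as given input rather than computed, and the upper bound len-3 is treated as exclusive to match the ? tags in the layout, which is an assumption about the intended reading.

```python
def pad_sentence(chars, types):
    """Add the begin/end sentinels shown in the layout (illustrative sketch)."""
    if len(chars) != len(types):
        raise ValueError("chars and types must have the same length")
    padded_chars = ["B3", "B2", "B1"] + list(chars) + ["E1", "E2", "E3"]
    padded_types = ["O", "O", "O"] + list(types) + ["O", "O", "O"]
    # Tags for indices 0..3 start as "U"; later tags are appended as
    # boundary decisions are made.
    tags = ["U", "U", "U", "U"]
    return padded_chars, padded_types, tags


def decision_positions(padded_chars):
    """Positions that get a boundary decision: 4 up to (not including) len-3."""
    return range(4, len(padded_chars) - 3)
```

For a four-character input the padded length is 10, so the positions are 4, 5, and 6, and each of them can read indices i-3 through i+2 without leaving the padded list.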

Training Data Format

The extract command writes features to a file in this format:

1	UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1	UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...

Each line contains:

  1. A label (1 for boundary, -1 for non-boundary)
  2. Tab-separated feature strings
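
A line in this format can be read back with a small parser. The Python sketch below uses a hypothetical function, parse_training_line, and splits on any whitespace so that it accepts both tabs and spaces between features. It assumes that no feature value contains a whitespace character.

```python
def parse_training_line(line):
    """Parse one line of extracted features into (label, features)."""
    fields = line.split()  # splits on tabs and spaces alike
    if not fields:
        raise ValueError("empty line")
    label = int(fields[0])
    if label not in (1, -1):
        raise ValueError(f"unexpected label: {label}")
    return label, fields[1:]
```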