Language Support Overview

Litsea supports word segmentation for three languages through a unified framework based on the Language enum.

Supported Languages

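| Language | Enum Variant | CLI Values | Feature Count | Pre-trained Model Accuracy |
|----------|--------------|------------|---------------|----------------------------|
| Japanese | Language::Japanese | japanese, ja | 42 | 94.15% |
| Chinese | Language::Chinese | chinese, zh | 42 | 80.72% |
| Korean | Language::Korean | korean, ko | 38 | 85.08% |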

The Language Enum

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

  • Default is Japanese
  • Implements FromStr – parses from full name or ISO 639-1 code (case-insensitive)
  • Implements Display – outputs the lowercase full name
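
The behavior listed above can be illustrated with a minimal, self-contained sketch. This is not Litsea's source: the `String` error type and the error message are assumptions made for illustration, and the real crate may normalize input or report errors differently.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

impl FromStr for Language {
    // Assumption: a plain String error. The crate's real error type is not shown here.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercasing first makes both full names and ISO 639-1 codes case-insensitive.
        match s.to_lowercase().as_str() {
            "japanese" | "ja" => Ok(Language::Japanese),
            "chinese" | "zh" => Ok(Language::Chinese),
            "korean" | "ko" => Ok(Language::Korean),
            other => Err(format!("unsupported language: {other}")),
        }
    }
}

impl fmt::Display for Language {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display always emits the lowercase full name, never the two-letter code.
        let name = match self {
            Language::Japanese => "japanese",
            Language::Chinese => "chinese",
            Language::Korean => "korean",
        };
        f.write_str(name)
    }
}

fn main() {
    assert_eq!(Language::default(), Language::Japanese);
    assert_eq!("ZH".parse::<Language>(), Ok(Language::Chinese));
    assert_eq!(Language::Korean.to_string(), "korean");
    assert!("french".parse::<Language>().is_err());
}
```

Because `Display` emits a name that `FromStr` accepts, a value formatted with `to_string()` can be parsed back into the same variant.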

Parsing Examples

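use litsea::language::Language;

fn main() {
    // Full names and ISO 639-1 codes are both accepted.
    let ja: Language = "japanese".parse().unwrap();
    let zh: Language = "zh".parse().unwrap();
    let ko: Language = "Korean".parse().unwrap(); // case-insensitive
    let err = "french".parse::<Language>(); // Err(...): unsupported language

    assert_eq!(ja, Language::Japanese);
    assert_eq!(zh, Language::Chinese);
    assert_eq!(ko, Language::Korean);
    assert!(err.is_err());
}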

How Languages Differ

Each language defines its own character type patterns that classify characters into type codes. These type codes are used as features for the AdaBoost classifier.

| Aspect | Japanese | Chinese | Korean |
|--------|----------|---------|--------|
| Character types | 8 (M, H, I, K, P, A, N, O) | 9 (F, C, X, R, P, B, A, N, O) | 10 (E, SN, SF, J, G, H, P, A, N, O) |
| WC features | Yes (4 extra) | Yes (4 extra) | No |
| Total features | 42 | 42 | 38 |
| Matching method | Regex only | Regex only | Regex + Closure |
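
To make the idea of character type codes concrete, here is a rough sketch of a classifier for Japanese text. It rests on several assumptions: the meaning given to each code (H for kanji, I for hiragana, K for katakana, A for ASCII letters, N for digits, O for everything else) follows the TinySegmenter convention and is not confirmed by this page; the M and P codes are omitted because their definitions are not given here; and plain range matching stands in for the regex patterns Litsea uses, so the example needs no external crates. The function names are hypothetical.

```rust
/// Maps one character to an illustrative type code.
/// Assumption: code meanings follow the TinySegmenter convention.
fn char_type(c: char) -> char {
    match c {
        '\u{3041}'..='\u{3093}' => 'I',        // hiragana
        '\u{30A1}'..='\u{30F4}' | 'ー' => 'K', // katakana and the prolonged sound mark
        '\u{4E00}'..='\u{9FA0}' => 'H',        // CJK ideographs (kanji)
        'a'..='z' | 'A'..='Z' => 'A',          // ASCII letters
        '0'..='9' => 'N',                      // ASCII digits
        _ => 'O',                              // anything else
    }
}

/// Converts a string into its sequence of type codes, one code per character.
fn type_sequence(text: &str) -> String {
    text.chars().map(char_type).collect()
}

fn main() {
    // Kanji, katakana, hiragana, digit, kanji.
    assert_eq!(type_sequence("東京タワーへ2回"), "HHKKKINH");
    println!("{}", type_sequence("東京タワーへ2回"));
}
```

A classifier would then build features from windows of these codes around each candidate boundary, which is why languages with different scripts need different type sets.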

Why Korean Has Fewer Features

Korean Hangul syllables are classified into only two types: SN (without 받침/final consonant) and SF (with 받침). This binary distinction means WC features (word + character-type combinations) would produce redundant information with little discriminative power. Excluding them reduces noise and keeps the model compact.
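
The SN/SF split can be computed arithmetically, which may be why Korean matching uses a closure in addition to regexes. The sketch below relies on the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3, with 28 final-consonant slots per initial/vowel pair, slot 0 meaning no final consonant). The function name and return type are hypothetical, and Litsea's actual closure may be written differently.

```rust
const HANGUL_BASE: u32 = 0xAC00; // '가', first precomposed syllable
const HANGUL_LAST: u32 = 0xD7A3; // '힣', last precomposed syllable
const FINAL_SLOTS: u32 = 28;     // 27 final consonants plus the "no final" slot

/// Returns "SN" for a syllable without a final consonant (받침),
/// "SF" for one with a final consonant, and None for non-syllable characters.
fn hangul_syllable_type(c: char) -> Option<&'static str> {
    let cp = c as u32;
    if !(HANGUL_BASE..=HANGUL_LAST).contains(&cp) {
        return None;
    }
    // Slot 0 within each block of 28 is the syllable with no final consonant.
    if (cp - HANGUL_BASE) % FINAL_SLOTS == 0 {
        Some("SN")
    } else {
        Some("SF")
    }
}

fn main() {
    // "한국어": 한 and 국 end in a consonant, 어 does not.
    let types: Vec<_> = "한국어".chars().map(hangul_syllable_type).collect();
    assert_eq!(types, vec![Some("SF"), Some("SF"), Some("SN")]);
    assert_eq!(hangul_syllable_type('a'), None);
}
```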