Character Type Classification

Each language in Litsea defines a set of character type patterns that classify individual characters into linguistically meaningful categories. These type codes are used as features for the AdaBoost classifier.

How It Works

Language::char_type(c: char) -> &'static str classifies a character with a direct match expression on Unicode character ranges — no regex, no allocation. Match arms are tried top to bottom, so the first matching arm determines the type code. If no arm matches, the character is classified as "O" (Other).

Each language has its own classification function (japanese_char_type, chinese_char_type, korean_char_type); the classes shared by all languages — "P" (punctuation), "A" (Latin), "N" (digits) — live in a common punct_latin_digit() helper that is checked after the language-specific classes. Logic beyond plain ranges is expressed with match guards (e.g., Korean Hangul syllable structure).
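The layering described above can be sketched as follows. This is a minimal illustration, not Litsea's actual source: the `Option` return type of `punct_latin_digit`, the name `sketch_char_type`, and the full-width Latin and digit ranges are assumptions made for the example. Inside the helper, the Latin and digit arms come before the punctuation arm because the full-width punctuation range U+FF01–U+FF65 also contains full-width letters and digits, and the first matching arm wins.

```rust
/// Shared classes, playing the role the text gives to `punct_latin_digit()`.
/// Returning `Option` is an assumption of this sketch.
fn punct_latin_digit(c: char) -> Option<&'static str> {
    match c {
        // ASCII letters plus full-width a-z (U+FF41..U+FF5A) and A-Z (U+FF21..U+FF3A).
        'a'..='z' | 'A'..='Z' | '\u{FF41}'..='\u{FF5A}' | '\u{FF21}'..='\u{FF3A}' => Some("A"),
        // ASCII digits plus full-width digits (U+FF10..U+FF19).
        '0'..='9' | '\u{FF10}'..='\u{FF19}' => Some("N"),
        // CJK Symbols and Punctuation, then the remaining full-width forms.
        '\u{3000}'..='\u{303F}' | '\u{FF01}'..='\u{FF65}' => Some("P"),
        _ => None,
    }
}

/// Hypothetical language function: one language-specific arm first,
/// then the shared helper, then the "O" fallback.
fn sketch_char_type(c: char) -> &'static str {
    match c {
        'ぁ'..='ん' => "I",
        _ => punct_latin_digit(c).unwrap_or("O"),
    }
}
```

Because every arm is a literal range, classification needs no allocation and no regex engine, which matches the description above.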

Japanese Character Types

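| Code | Name | Pattern / Range | Examples |
|------|------|-----------------|----------|
| M | Kanji Numbers | [一二三四五六七八九十百千万億兆] | 一, 千, 億 |
| H | Kanji / CJK Ideographs | [一-龠々〆ヵヶ] | 漢, 字, 学 |
| I | Hiragana | [ぁ-ん] | あ, い, う |
| K | Katakana | [ァ-ヴーｱ-ﾝﾞﾟ] | ア, カ, ー |
| P | Punctuation | CJK Symbols (U+3000–U+303F), Full-width (U+FF01–U+FF65) | 。, 、, 「 |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z, B |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |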

Note: “M” (Kanji numbers) is checked before “H” (general Kanji), so characters like 一 and 百 are classified as numbers rather than generic ideographs.
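The effect of arm order can be seen in a reduced sketch of the Japanese classes. The function name `japanese_char_type_sketch` is hypothetical, the shared P/A/N classes and any half-width katakana forms are left out for brevity, and the ranges are copied from the table rather than from Litsea's source.

```rust
/// Reduced sketch of the Japanese-specific arms (M, H, I, K only).
/// Everything else falls through to "O" here; the shared classes are omitted.
fn japanese_char_type_sketch(c: char) -> &'static str {
    match c {
        // Kanji numerals come first, so they never reach the general Kanji arm.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十'
        | '百' | '千' | '万' | '億' | '兆' => "M",
        // General Kanji range plus the iteration mark and related symbols.
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        'ぁ'..='ん' => "I",
        // Full-width katakana and the prolonged sound mark.
        'ァ'..='ヴ' | 'ー' => "K",
        _ => "O",
    }
}
```

If the two Kanji arms were swapped, 一 and 百 would match the range `'一'..='龠'` first and come back as "H".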

Chinese Character Types

| Code | Name | Pattern / Range | Examples |
|------|------|-----------------|----------|
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ，, 《 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |

Chinese function words include:

  • Structural particles: 的, 地, 得
  • Aspect/modal particles: 了, 着, 过, 吗, 呢, 吧, 啊, 嘛
  • Conjunctions: 和, 与, 或, 但, 而, 且, 及
  • Prepositions: 在, 从, 到, 把, 被, 对, 向, 给
  • Common grammatical verbs/adverbs: 是, 有, 不, 也, 都, 就, 要, 会, 能, 可
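A sketch of how the function-word class might sit in front of the range-based classes is shown below. The names `is_function_word` and `chinese_char_type_sketch` are hypothetical, the character set is only the one listed above (the real list may be longer), and checking "F" before "C" is an assumption inferred from the table order: every listed function word also falls inside U+4E00–U+9FFF, so it would otherwise be classified as "C".

```rust
/// Function words taken from the list above; possibly not exhaustive.
fn is_function_word(c: char) -> bool {
    matches!(
        c,
        '的' | '地' | '得'
            | '了' | '着' | '过' | '吗' | '呢' | '吧' | '啊' | '嘛'
            | '和' | '与' | '或' | '但' | '而' | '且' | '及'
            | '在' | '从' | '到' | '把' | '被' | '对' | '向' | '给'
            | '是' | '有' | '不' | '也' | '都' | '就' | '要' | '会' | '能' | '可'
    )
}

/// Reduced sketch of the Chinese-specific arms; shared P/A/N classes omitted.
fn chinese_char_type_sketch(c: char) -> &'static str {
    match c {
        // Guarded arm first: function words would otherwise match the "C" range.
        _ if is_function_word(c) => "F",
        '\u{4E00}'..='\u{9FFF}' => "C",
        '\u{3400}'..='\u{4DBF}' => "X",
        '\u{2E80}'..='\u{2FDF}' => "R",
        '\u{3100}'..='\u{312F}' | '\u{31A0}'..='\u{31BF}' => "B",
        _ => "O",
    }
}
```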

Korean Character Types

| Code | Name | Pattern / Range | Examples |
|------|------|-----------------|----------|
| E | Particles/Endings | High-frequency grammatical particles | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no batchim) | Hangul Syllable without final consonant | 가, 나, 하 |
| SF | Hangul (with batchim) | Hangul Syllable with final consonant | 한, 글, 각 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, ， |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |

Korean Hangul Syllable Detection

Korean uses a match guard for the SN and SF types. This leverages Unicode’s systematic Hangul encoding:

  • The Hangul Syllables block occupies U+AC00–U+D7AF; the 11,172 assigned syllables end at U+D7A3
  • Each syllable is encoded as: (initial * 21 + medial) * 28 + final + 0xAC00
  • If (codepoint - 0xAC00) % 28 == 0, the syllable has no final consonant (SN)
  • Otherwise, it has a final consonant (SF, “받침”)

This distinction is important because the presence of a final consonant (받침) affects Korean word boundary patterns and particle attachment.
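The arithmetic above can be sketched with a guarded match arm. The function names and constants here are hypothetical, and the sketch restricts itself to the assigned syllables U+AC00–U+D7A3 rather than the whole block. It also ignores the "E" particle class, which would presumably be checked before these arms.

```rust
const HANGUL_BASE: u32 = 0xAC00;
const HANGUL_LAST: u32 = 0xD7A3; // last assigned precomposed syllable
const MEDIAL_COUNT: u32 = 21;
const FINAL_COUNT: u32 = 28; // 27 final consonants plus "no final"

/// Returns "SN" or "SF" for a precomposed Hangul syllable, `None` otherwise.
fn hangul_syllable_type(c: char) -> Option<&'static str> {
    match c {
        // Guard: a final index of 0 means the syllable has no batchim.
        '\u{AC00}'..='\u{D7A3}' if (c as u32 - HANGUL_BASE) % FINAL_COUNT == 0 => Some("SN"),
        '\u{AC00}'..='\u{D7A3}' => Some("SF"),
        _ => None,
    }
}

/// Inverts the encoding formula into (initial, medial, final) indices.
fn decompose(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(HANGUL_BASE..=HANGUL_LAST).contains(&cp) {
        return None;
    }
    let offset = cp - HANGUL_BASE;
    Some((
        offset / (MEDIAL_COUNT * FINAL_COUNT),
        (offset / FINAL_COUNT) % MEDIAL_COUNT,
        offset % FINAL_COUNT,
    ))
}
```

For example, 한 (U+D55C) has offset 10588, which decomposes to initial 18 (ㅎ), medial 0 (ㅏ), and final 4 (ㄴ); the non-zero final index makes it "SF".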

Cross-Language Comparison

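| Feature | Japanese | Chinese | Korean |
|---------|----------|---------|--------|
| Total types | 8 | 9 | 10 |
| Unique types | M, H, I, K | F, C, X, R, B | E, SN, SF, J, G |
| Shared types | P, A, N, O | P, A, N, O | P, A, N, O (H shared with Japanese) |
| Matching method | Range match | Range match | Range match + guard |
| WC features used | Yes | Yes | No |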