Character Type Classification
Each language in Litsea defines a set of character type patterns that classify individual characters into linguistically meaningful categories. These type codes are used as features for the AdaBoost classifier.
How It Works
Language::char_type(c: char) -> &'static str classifies a character with a direct match expression on Unicode character ranges — no regex, no allocation. Match arms are tried top to bottom, so the first matching arm determines the type code. If no arm matches, the character is classified as "O" (Other).
Each language has its own classification function (japanese_char_type, chinese_char_type, korean_char_type); the classes shared by all languages — "P" (punctuation), "A" (Latin), "N" (digits) — live in a common punct_latin_digit() helper that is checked after the language-specific classes. Logic beyond plain ranges is expressed with match guards (e.g., Korean Hangul syllable structure).
Japanese Character Types
| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| M | Kanji Numbers | [一二三四五六七八九十百千万億兆] | 一, 千, 億 |
| H | Kanji / CJK Ideographs | [一-龠々〆ヵヶ] | 漢, 字, 学 |
| I | Hiragana | [ぁ-ん] | あ, い, う |
| K | Katakana | [ァ-ヴーア-ン゙゚] | ア, カ, ー |
| P | Punctuation | CJK Symbols (U+3000-303F), Full-width (U+FF01-FF65) | 。, 、, 「 |
| A | ASCII/Latin | [a-zA-Za-zA-Z] | A, z, B |
| N | Digits | [0-90-9] | 0, 5 |
| O | Other | Fallback | @, # |
Note: “M” (Kanji numbers) is checked before “H” (general Kanji), so characters like 一 and 百 are classified as numbers rather than generic ideographs.
Chinese Character Types
| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ,, 《 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Za-zA-Z] | A, z |
| N | Digits | [0-90-9] | 0, 5 |
| O | Other | Fallback | @, # |
Chinese function words include:
- Structural particles: 的, 地, 得
- Aspect/modal particles: 了, 着, 过, 吗, 呢, 吧, 啊, 嘛
- Conjunctions: 和, 与, 或, 但, 而, 且, 及
- Prepositions: 在, 从, 到, 把, 被, 对, 向, 给
- Common grammatical verbs/adverbs: 是, 有, 不, 也, 都, 就, 要, 会, 能, 可
Korean Character Types
| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| E | Particles/Endings | High-frequency grammatical particles | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no batchim) | Hangul Syllable without final consonant | 가, 나, 하 |
| SF | Hangul (with batchim) | Hangul Syllable with final consonant | 한, 글, 각 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, , |
| A | ASCII/Latin | [a-zA-Za-zA-Z] | A, z |
| N | Digits | [0-90-9] | 0, 5 |
| O | Other | Fallback | @, # |
Korean Hangul Syllable Detection
Korean uses a match guard for the SN and SF types. This leverages Unicode’s systematic Hangul encoding:
- Hangul Syllables occupy U+AC00–U+D7AF
- Each syllable is encoded as:
(initial * 21 + medial) * 28 + final + 0xAC00 - If
(codepoint - 0xAC00) % 28 == 0, the syllable has no final consonant (SN) - Otherwise, it has a final consonant (SF, “받침”)
This distinction is important because the presence of a final consonant (받침) affects Korean word boundary patterns and particle attachment.
Cross-Language Comparison
| Feature | Japanese | Chinese | Korean |
|---|---|---|---|
| Total types | 8 | 9 | 10 |
| Unique types | M, H, I, K | F, C, X, R, B | E, SN, SF, J, G |
| Shared types | P, A, N, O | P, A, N, O | P, A, N, O (H shared with JP) |
| Matching method | Range match | Range match | Range match + guard |
| WC features used | Yes | Yes | No |