Korean

Litsea supports Korean word segmentation with specialized Hangul character type detection.

Character Types

| Code | Name | Pattern | Examples |
|------|------|---------|----------|
| E | Particles/Endings | [은는을를의에] | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no 받침) | Codepoint arithmetic | 가, 나, 하, 모 |
| SF | Hangul (with 받침) | Codepoint arithmetic | 한, 글, 각, 붙 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, ， |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |
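The table can be read as an ordered list of checks. The following Python sketch is an illustration only, not Litsea's implementation: the function name `char_type`, the order of the checks, and the exact ranges used for punctuation (the CJK Symbols and Punctuation block plus the Halfwidth and Fullwidth Forms block) are assumptions. The particle check comes first because the six particles are themselves Hangul syllables and would otherwise be labeled SN or SF.

```python
import re

# Assumed check order: particles first, then Hangul syllables, then the rest.
PARTICLES = set("은는을를의에")
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables

LATIN = re.compile(r"[a-zA-Zａ-ｚＡ-Ｚ]")
DIGIT = re.compile(r"[0-9０-９]")


def char_type(ch: str) -> str:
    """Return a type code from the table for a single character (hypothetical helper)."""
    cp = ord(ch)
    if ch in PARTICLES:
        return "E"
    if HANGUL_BASE <= cp <= HANGUL_LAST:
        # A remainder of 0 means the syllable has no final consonant (받침).
        return "SN" if (cp - HANGUL_BASE) % 28 == 0 else "SF"
    if 0x1100 <= cp <= 0x11FF:
        return "J"
    if 0x3130 <= cp <= 0x318F:
        return "G"
    if 0x4E00 <= cp <= 0x9FFF:
        return "H"
    # Letters and digits are tested before punctuation because the full-width
    # block also contains full-width letters and digits.
    if LATIN.fullmatch(ch):
        return "A"
    if DIGIT.fullmatch(ch):
        return "N"
    # Assumed punctuation ranges; the source only says "CJK Symbols + Full-width".
    if 0x3000 <= cp <= 0x303F or 0xFF00 <= cp <= 0xFFEF:
        return "P"
    return "O"


print([char_type(c) for c in "한국어"])  # ['SF', 'SF', 'SN']
```

With this ordering, 은 is reported as E even though it is also a syllable with 받침.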

Korean Particles (조사)

The “E” type captures six high-frequency grammatical particles:

| Character | Role | Name |
|-----------|------|------|
| 은/는 | Topic marker | 보조사 |
| 을/를 | Object marker | 목적격 조사 |
| 의 | Possessive | 관형격 조사 |
| 에 | Locative | 부사격 조사 |

These particles frequently appear at word boundaries and are given a distinct type code to improve segmentation accuracy.

Hangul Syllable Structure (받침 Detection)

The SN and SF types are matched with a closure rather than a regex. The closure relies on the arithmetic layout of precomposed Hangul syllables in Unicode:

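  • Hangul Syllables block: U+AC00–U+D7AF, of which U+AC00–U+D7A3 hold the 11,172 precomposed syllables
  • Each syllable = (initial * 21 + medial) * 28 + final + 0xAC00, where initial is 0–18, medial is 0–20, and final is 0–27 (0 means no 받침)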
  • SN (no 받침): (codepoint - 0xAC00) % 28 == 0
  • SF (with 받침): (codepoint - 0xAC00) % 28 != 0
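The arithmetic above can be checked by inverting the formula. The sketch below is illustrative Python, not code from Litsea; the names `decompose`, `compose`, and `has_batchim` are hypothetical. It splits a syllable into its initial, medial, and final indices and rebuilds the syllable from them.

```python
HANGUL_BASE = 0xAC00
INITIAL_COUNT, MEDIAL_COUNT, FINAL_COUNT = 19, 21, 28
SYLLABLE_COUNT = INITIAL_COUNT * MEDIAL_COUNT * FINAL_COUNT  # 11,172


def decompose(ch: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) indices of a precomposed Hangul syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < SYLLABLE_COUNT:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, MEDIAL_COUNT * FINAL_COUNT)
    medial, final = divmod(rest, FINAL_COUNT)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse of decompose: apply the encoding formula directly."""
    return chr((initial * MEDIAL_COUNT + medial) * FINAL_COUNT + final + HANGUL_BASE)


def has_batchim(ch: str) -> bool:
    """True for SF syllables (final index != 0), False for SN syllables."""
    return decompose(ch)[2] != 0


print(decompose("한"))    # (18, 0, 4)
print(has_batchim("가"))  # False
print(has_batchim("글"))  # True
```

Because the final index is the remainder modulo 28, the SN/SF test needs only one subtraction and one modulo operation per character.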

The 받침 (final consonant) distinction is linguistically significant because it affects how particles attach to words and where boundaries occur.
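As an illustration of why the distinction matters, the topic and object markers each have two forms, and the choice depends on whether the preceding syllable ends in a 받침: 은 and 을 follow a syllable with a final consonant, 는 and 를 follow one without. The Python sketch below shows that rule only; the function `attach` and the role names are hypothetical and are not part of Litsea.

```python
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3

# (form after a syllable with 받침, form after a syllable without 받침)
PARTICLE_PAIRS = {
    "topic": ("은", "는"),
    "object": ("을", "를"),
}


def has_batchim(ch: str) -> bool:
    """True if the precomposed Hangul syllable ends in a final consonant."""
    cp = ord(ch)
    if not HANGUL_BASE <= cp <= HANGUL_LAST:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    return (cp - HANGUL_BASE) % 28 != 0


def attach(word: str, role: str) -> str:
    """Append the particle form that matches the last syllable of word."""
    with_final, without_final = PARTICLE_PAIRS[role]
    return word + (with_final if has_batchim(word[-1]) else without_final)


print(attach("한국", "topic"))   # 한국은
print(attach("나무", "object"))  # 나무를
```

The same SN/SF signal that selects the particle form here is what the segmenter sees on the syllable just before a particle.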

No WC Features

Korean does not use WC (word + character-type) features. Since most Hangul syllables fall into only two types (SN and SF), WC features would produce low-entropy, noisy combinations that hurt model accuracy.

Pre-trained Model

korean.model

  • Training corpus: UD Korean-GSD
  • Accuracy: 85.08%

Example

echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model