Preparing a Corpus
A good training corpus is essential for model accuracy. This guide explains how to prepare one from Universal Dependencies (UD) Treebanks or, for larger-scale training, from a Wikipedia dump.
Data Source: UD Treebanks
Litsea uses UD Treebanks as the data source for both word segmentation and POS tagging. UD Treebanks provide high-quality, manually annotated data in CoNLL-U format for many languages.
Available Treebanks
| Language | Treebank | Repository |
|---|---|---|
| Japanese | UD Japanese-GSD | UD_Japanese-GSD |
| Chinese | UD Chinese-GSD | UD_Chinese-GSD |
| Korean | UD Korean-GSD | UD_Korean-GSD |
Step 1: Download a UD Treebank
Use scripts/download_udtreebank.sh to download a UD Treebank. It prints the path to the training CoNLL-U file to stdout:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
Supported languages: ja (Japanese, default), ko (Korean), zh (Chinese). Use -o to specify the output directory (default: current directory).
Corpus for Word Segmentation
For word segmentation (AdaBoost), the corpus must be a plain text file with:
- One sentence per line
- Words separated by spaces
太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。
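Before training, it can be useful to check how many sentence lines and tokens a segmented corpus contains. The following is a minimal sketch, assuming the one-sentence-per-line, space-separated format described above; `corpus_stats` is a hypothetical helper name and is not shipped with Litsea.

```shell
#!/usr/bin/env bash
# corpus_stats: hypothetical helper (not part of Litsea).
# Reads a segmented corpus from the given files (or stdin) and prints
# "<sentence lines> <tokens>". Empty lines are ignored.
corpus_stats() {
  awk 'NF > 0 { sentences++; tokens += NF }
       END { print sentences + 0, tokens + 0 }' "$@"
}

# Example with the two sentences shown above.
corpus_stats <<'EOF'
太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。
EOF
```

For a real corpus, `corpus_stats corpus.txt` gives a quick way to compare the line count against the size you are aiming for.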
Convert CoNLL-U to Word Segmentation Corpus
Use scripts/corpus_udtreebank.sh to convert a CoNLL-U file to corpus format:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
This converts the CoNLL-U data into space-separated words (one sentence per line).
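To make the conversion concrete, here is an illustrative sketch of how CoNLL-U rows could be turned into space-separated sentences with awk. It is not the code of scripts/corpus_udtreebank.sh, whose implementation may differ; the function name `conllu_to_words` and the choice to skip ID ranges and decimal IDs are assumptions made for this example.

```shell
#!/usr/bin/env bash
# conllu_to_words: illustrative sketch only, NOT the implementation of
# scripts/corpus_udtreebank.sh.
# Reads CoNLL-U from the given files (or stdin) and prints one
# space-separated sentence per line, built from the FORM column.
conllu_to_words() {
  awk -F'\t' '
    /^#/        { next }                 # comment lines (# sent_id, # text)
    /^[ \t]*$/  {                        # blank line closes a sentence
      if (n > 0) print line
      line = ""; n = 0
      next
    }
    $1 ~ /[-.]/ { next }                 # assumption: skip ranges (1-2) and decimal IDs (1.1)
    { line = (n++ ? line " " : "") $2 }  # append FORM (column 2)
    END { if (n > 0) print line }        # flush a last sentence with no trailing blank line
  ' "$@"
}

# Example: a two-token sentence.
printf '# text = 太郎は\n1\t太郎\t太郎\tPROPN\n2\tは\tは\tADP\n\n' | conllu_to_words
```

The sketch keeps the individual syntactic words and drops the range rows; a converter could equally keep the surface form of the range instead, so treat this only as one possible design.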
Corpus for POS Tagging
For POS tagging (Averaged Perceptron), each word must be annotated with its POS tag.
POS Corpus Format
Each line represents one sentence, with words annotated as word/POS pairs separated by spaces:
これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Litsea/PROPN は/ADP 単語/NOUN 分割/NOUN ソフトウェア/NOUN です/AUX 。/PUNCT
The POS tags follow the Universal POS (UPOS) tagset with 17 categories: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
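A quick way to catch malformed lines is to check every token against the UPOS list above. The following is a minimal sketch, assuming the tag is whatever follows the last "/" in a token (so a word may itself contain a slash); `check_pos_corpus` is a hypothetical helper and not part of Litsea.

```shell
#!/usr/bin/env bash
# check_pos_corpus: hypothetical validator (not part of Litsea).
# Exits non-zero if any token is not of the form word/TAG with TAG in the
# 17-tag UPOS set. Assumption: the tag follows the LAST "/" in a token.
check_pos_corpus() {
  awk '
    BEGIN {
      n = split("ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X", tags, " ")
      for (i = 1; i <= n; i++) upos[tags[i]] = 1
    }
    {
      for (i = 1; i <= NF; i++) {
        k = split($i, parts, "/")
        tag = parts[k]
        # Require a "/", a known tag, and a non-empty word before the tag.
        if (k < 2 || !(tag in upos) || length($i) <= length(tag) + 1) {
          printf "line %d: bad token: %s\n", NR, $i > "/dev/stderr"
          bad = 1
        }
      }
    }
    END { exit bad + 0 }
  ' "$@"
}
```

For example, `check_pos_corpus pos_corpus.txt && echo "POS corpus looks well-formed"` prints the message only when every token passes; offending tokens are reported on stderr with their line number.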
Convert CoNLL-U to POS Corpus
Use scripts/corpus_udtreebank.sh with the -p flag to produce a POS corpus:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Multi-word tokens (ID ranges such as 1-2) and empty nodes (decimal IDs such as 1.1) in the CoNLL-U data are handled automatically during conversion.
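If both corpora are generated from the same CoNLL-U file, one possible sanity check is to strip the tags from the POS corpus and compare the result with the word segmentation corpus. This is a sketch under the assumption that both outputs are meant to contain the same tokens in the same order; `strip_pos_tags` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# strip_pos_tags: hypothetical helper (not part of Litsea).
# Removes a trailing "/TAG" (uppercase ASCII letters) from every token,
# turning "word/POS" lines back into space-separated words.
strip_pos_tags() {
  sed -E 's#/[A-Z]+( |$)#\1#g' "$@"
}

# Example with a line from the POS corpus format above.
printf 'これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT\n' | strip_pos_tags
```

With bash, `diff <(strip_pos_tags pos_corpus.txt) corpus.txt` then shows any lines where the two files disagree; no output means they match.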
Automated Corpus Preparation
Litsea includes helper scripts in the scripts/ directory that automate the UD Treebank download and conversion:
- scripts/download_udtreebank.sh – Downloads a UD Treebank and prints the path to the training CoNLL-U file
- scripts/corpus_udtreebank.sh – Converts a CoNLL-U file to Litsea corpus format
# Download UD Treebank and get CoNLL-U file path
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
# Generate word segmentation corpus
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
# Generate POS corpus
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Supported languages for download_udtreebank.sh: ja (Japanese, default), ko (Korean), zh (Chinese).
Corpus from Wikipedia Dump
For larger-scale training, you can build a corpus from a full Wikipedia dump using scripts/corpus_wikidump.sh. The script extracts plain text with wicket, keeps only lines that look like complete sentences, and tokenizes them with lindera.
Usage
# Japanese (default)
bash scripts/corpus_wikidump.sh jawiki-latest-pages-articles.xml.bz2 corpus_ja.txt
# Korean
bash scripts/corpus_wikidump.sh -l ko kowiki-latest-pages-articles.xml.bz2 corpus_ko.txt
# Chinese
bash scripts/corpus_wikidump.sh -l zh zhwiki-latest-pages-articles.xml.bz2 corpus_zh.txt
Options
| Option | Description | Default |
|---|---|---|
| -l lang | Language code: ja, ko, zh | ja |
| -n max_lines | Maximum sentence lines to process (0 = unlimited) | 100000 |
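The "0 = unlimited" convention for -n can be pictured with a small shell function. This is only a sketch of the semantics, not a description of how corpus_wikidump.sh is written; `limit_lines` is a hypothetical name.

```shell
#!/usr/bin/env bash
# limit_lines: hypothetical illustration of the "-n max_lines" semantics.
# Passes through at most <max> lines from stdin; 0 means no limit.
limit_lines() {
  local max="${1:-100000}"
  if [ "$max" -eq 0 ]; then
    cat                 # unlimited: pass everything through
  else
    head -n "$max"      # keep only the first <max> lines
  fi
}

# Example: keep the first two of three lines.
printf 'one\ntwo\nthree\n' | limit_lines 2
```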
Sentence Filtering
The script applies two filters to keep only well-formed sentences:
- Sentence-ending punctuation – Lines must end with 。, ., !, or ?. This excludes section headers (e.g., “参考文献”), list items, and metadata.
- Minimum length – Lines must be at least 20 characters. This excludes short fragments and isolated labels.
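The two filters above can be approximated in a few lines of awk. This is a hypothetical sketch, not the logic of scripts/corpus_wikidump.sh itself; the function name `filter_sentences` and its optional minimum-length argument are assumptions for illustration. Note that whether `length` counts characters or bytes depends on the awk implementation and locale, so a UTF-8 locale and a multibyte-aware awk are assumed for CJK text.

```shell
#!/usr/bin/env bash
# filter_sentences: hypothetical sketch of the two filters described above;
# the real logic in scripts/corpus_wikidump.sh may differ.
# Usage: filter_sentences [min_length] < input
filter_sentences() {
  local min_len="${1:-20}"
  awk -v min="$min_len" '
    # Keep lines that are long enough AND end with one of: 。 . ! ?
    length($0) >= min && /(。|\.|!|\?)$/
  '
}

# Example: only the second and last lines survive.
printf '%s\n' \
  'References' \
  'This is a complete sentence.' \
  'Too short.' \
  'A long line without final punctuation here' \
  'Is this long enough to keep?' | filter_sentences
```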
Tokenizer Dictionaries
| Language | Dictionary | Token Filter |
|---|---|---|
| Japanese (ja) | embedded://unidic | japanese_compound_word (numeral compound) |
| Korean (ko) | embedded://ko-dic | None |
| Chinese (zh) | embedded://cc-cedict | None |
Corpus Size Guidelines
The recommended corpus size depends on your use case:
| Size (sentence lines) | Use Case |
|---|---|
| ~10,000 | Minimum for prototyping and smoke tests |
| 50,000 – 100,000 | Practical range for model training |
| 100,000 – 500,000 | High-quality, robust models |
| Unlimited (-n 0) | Use the full dump for maximum accuracy |
The default of 100,000 lines (-n 100000) in corpus_wikidump.sh sits at the boundary between the practical and high-quality ranges.
Corpus Quality Tips
- Diversity – Include text from various domains (news, literature, web, etc.)
- Size – See Corpus Size Guidelines above for recommended sizes
- Consistency – Ensure consistent tokenization throughout the corpus
- Deduplication – Remove duplicate sentences to avoid bias
- Cleaning – Remove HTML tags, special formatting, and non-text content
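As an example of the deduplication tip, exact duplicate lines can be dropped with a one-line awk filter. This is a minimal sketch that only catches byte-identical lines; `dedup_lines` is a hypothetical helper name and not part of Litsea.

```shell
#!/usr/bin/env bash
# dedup_lines: hypothetical helper (not part of Litsea).
# Prints each distinct line once, keeping the first occurrence and the
# original order. Memory use grows with the number of distinct lines.
dedup_lines() {
  awk '!seen[$0]++' "$@"
}

# Example: the repeated first line is printed only once.
printf 'a b\nc d\na b\n' | dedup_lines
```

A typical call would be `dedup_lines corpus.txt > corpus_dedup.txt`. Unlike `sort -u`, this keeps the sentences in their original order.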