Preparing a Corpus

A good training corpus is essential for model accuracy. This guide explains how to prepare one using Universal Dependencies (UD) Treebanks.

Data Source: UD Treebanks

Litsea uses UD Treebanks as the data source for both word segmentation and POS tagging. UD Treebanks provide high-quality, manually annotated data in CoNLL-U format for many languages.

Available Treebanks

Language	Treebank	Repository
Japanese	UD Japanese-GSD	`UD_Japanese-GSD`
Chinese	UD Chinese-GSD	`UD_Chinese-GSD`
Korean	UD Korean-GSD	`UD_Korean-GSD`

Step 1: Download a UD Treebank

Use scripts/download_udtreebank.sh to download a UD Treebank. It prints the path to the training CoNLL-U file to stdout:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

Supported languages: ja (Japanese, default), ko (Korean), zh (Chinese). Use -o to specify the output directory (default: current directory).

Corpus for Word Segmentation

For word segmentation (AdaBoost), the corpus must be a plain text file with:

One sentence per line
Words separated by spaces

太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。
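
To illustrate why this format is enough to train a segmenter: the spaces carry all the boundary information, so the raw text and the word-end positions can both be recovered from one line. The sketch below is only an illustration; the function name `boundaries` is hypothetical, and how Litsea derives its training labels internally is not described in this guide.

```python
def boundaries(line: str) -> tuple[str, list[int]]:
    """Return the unsegmented text and the character offsets where words end."""
    words = line.split()
    text = "".join(words)
    offsets = []
    position = 0
    for word in words:
        position += len(word)  # offsets are counted in characters, not bytes
        offsets.append(position)
    return text, offsets


text, offsets = boundaries("太郎 は 走っ た 。")
print(text)     # 太郎は走った。
print(offsets)  # [2, 3, 5, 6, 7]
```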

Convert CoNLL-U to Word Segmentation Corpus

Use scripts/corpus_udtreebank.sh to convert a CoNLL-U file to corpus format:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

This converts the CoNLL-U data into space-separated words (one sentence per line).
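
For readers who want to see what such a conversion involves, here is a minimal Python sketch. It is not the implementation of corpus_udtreebank.sh: the function name `conllu_to_corpus` is hypothetical, and keeping only rows with a plain integer ID is an assumed policy. It relies only on the CoNLL-U layout itself (tab-separated columns, FORM in the second column, `#` comment lines, blank lines between sentences).

```python
def conllu_to_corpus(conllu_text: str) -> list[str]:
    """Convert CoNLL-U text to one space-separated line per sentence."""
    sentences: list[str] = []
    words: list[str] = []
    for row in conllu_text.splitlines():
        if not row.strip():
            # A blank line ends the current sentence.
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        if row.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = row.split("\t")
        # Assumption: keep only rows whose ID is a plain integer.
        if cols[0].isdigit():
            words.append(cols[1])  # second column is FORM
    if words:
        sentences.append(" ".join(words))  # last sentence without trailing blank line
    return sentences


SAMPLE = "\n".join([
    "# text = 太郎は走った。",
    "1\t太郎\t太郎\tPROPN\t_\t_\t3\tnsubj\t_\t_",
    "2\tは\tは\tADP\t_\t_\t1\tcase\t_\t_",
    "3\t走っ\t走る\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tた\tた\tAUX\t_\t_\t3\taux\t_\t_",
    "5\t。\t。\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])
print(conllu_to_corpus(SAMPLE))  # ['太郎 は 走っ た 。']
```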

Corpus for POS Tagging

For POS tagging (Averaged Perceptron), each word must be annotated with its POS tag.

POS Corpus Format

Each line represents one sentence, with words annotated as word/POS pairs separated by spaces:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Litsea/PROPN は/ADP 単語/NOUN 分割/NOUN ソフトウェア/NOUN です/AUX 。/PUNCT

The POS tags follow the Universal POS (UPOS) tagset with 17 categories: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
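
A quick way to sanity-check a POS corpus before training is to parse each line and verify every tag against the UPOS set listed above. The following is a sketch, not part of Litsea: `parse_pos_line` is a hypothetical helper, and splitting on the last `/` (so that a word containing a slash stays intact) is an assumption about how the pairs should be read.

```python
UPOS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X".split()
)


def parse_pos_line(line: str) -> list[tuple[str, str]]:
    """Split one corpus line into (word, tag) pairs, rejecting malformed items."""
    pairs = []
    for item in line.split():
        # Assumption: the tag follows the LAST slash, so "1/2/NUM" -> ("1/2", "NUM").
        word, sep, tag = item.rpartition("/")
        if not sep or not word or tag not in UPOS:
            raise ValueError(f"malformed word/POS pair: {item!r}")
        pairs.append((word, tag))
    return pairs


print(parse_pos_line("これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT"))
```

Running the check over every line of pos_corpus.txt catches missing tags and typos in tag names early.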

Convert CoNLL-U to POS Corpus

Use scripts/corpus_udtreebank.sh with the -p flag to produce a POS corpus:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Multi-word tokens and empty nodes are automatically handled during conversion.
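
In CoNLL-U, multi-word tokens are the rows whose ID is a range (such as `1-2`), and empty nodes are the rows whose ID contains a decimal point (such as `3.1`). The sketch below shows one plausible way to handle them when emitting word/POS pairs: skip both kinds of row and keep the ordinary word rows. This policy and the names `row_kind` and `pos_pairs` are assumptions for illustration; the script's actual behavior may differ.

```python
def row_kind(token_id: str) -> str:
    """Classify a CoNLL-U row by the shape of its ID column."""
    if "-" in token_id:
        return "multiword"  # range row, e.g. "1-2"
    if "." in token_id:
        return "empty"      # empty node, e.g. "3.1"
    return "word"


def pos_pairs(rows: list[str]) -> str:
    """Build one POS corpus line from the token rows of a single sentence."""
    pairs = []
    for row in rows:
        cols = row.split("\t")
        if row_kind(cols[0]) != "word":
            continue  # assumed policy: drop range rows and empty nodes
        pairs.append(f"{cols[1]}/{cols[3]}")  # FORM is column 2, UPOS is column 4
    return " ".join(pairs)


rows = [
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tmar\tmar\tNOUN\t_\t_\t0\troot\t_\t_",
    "3.1\t_\t_\t_\t_\t_\t_\t_\t_\t_",
]
print(pos_pairs(rows))  # de/ADP el/DET mar/NOUN
```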

Automated Corpus Preparation

Litsea includes helper scripts in the scripts/ directory that automate the UD Treebank download and conversion:

scripts/download_udtreebank.sh – Downloads a UD Treebank and prints the path to the training CoNLL-U file
scripts/corpus_udtreebank.sh – Converts a CoNLL-U file to Litsea corpus format

# Download UD Treebank and get CoNLL-U file path
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

# Generate word segmentation corpus
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

# Generate POS corpus
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

For larger-scale training, you can build a corpus from a full Wikipedia dump with scripts/corpus_wikidump.sh. The script extracts plain text with wicket, keeps only lines that look like complete sentences, and tokenizes them with lindera.

Usage

# Japanese (default)
bash scripts/corpus_wikidump.sh jawiki-latest-pages-articles.xml.bz2 corpus_ja.txt

# Korean
bash scripts/corpus_wikidump.sh -l ko kowiki-latest-pages-articles.xml.bz2 corpus_ko.txt

# Chinese
bash scripts/corpus_wikidump.sh -l zh zhwiki-latest-pages-articles.xml.bz2 corpus_zh.txt

Options

Option	Description	Default
`-l lang`	Language code: `ja`, `ko`, `zh`	`ja`
`-n max_lines`	Maximum sentence lines to process (0 = unlimited)	`100000`

Sentence Filtering

The script applies two filters to keep only well-formed sentences:

Sentence-ending punctuation – Lines must end with 。, ., !, or ?. This excludes section headers (e.g., “参考文献”), list items, and metadata.
Minimum length – Lines must be at least 20 characters. This excludes short fragments and isolated labels.
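
The two filters above can be expressed as a single predicate. This is a minimal sketch rather than the script's code: the names `is_sentence`, `SENTENCE_ENDINGS`, and `MIN_CHARS` are hypothetical, and it assumes that length is measured in characters (not bytes) after surrounding whitespace is stripped.

```python
SENTENCE_ENDINGS = ("。", ".", "!", "?")
MIN_CHARS = 20


def is_sentence(line: str) -> bool:
    """Return True if the line passes both the punctuation and the length filter."""
    text = line.strip()
    # str.endswith accepts a tuple and matches any of its members.
    return len(text) >= MIN_CHARS and text.endswith(SENTENCE_ENDINGS)


print(is_sentence("参考文献"))                                   # False
print(is_sentence("太郎は走った。"))                              # False
print(is_sentence("Litseaはコンパクトな単語分割ソフトウェアです。"))  # True
```

The section header fails the punctuation test, the short sentence fails the length test (7 characters), and the last line passes both (26 characters).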

Tokenizer Dictionaries

Language	Dictionary	Token Filter
Japanese (`ja`)	`embedded://unidic`	`japanese_compound_word` (numeral compound)
Korean (`ko`)	`embedded://ko-dic`	None
Chinese (`zh`)	`embedded://cc-cedict`	None

Corpus Size Guidelines

The recommended corpus size depends on your use case:

Size (sentence lines)	Use Case
~10,000	Minimum for prototyping and smoke tests
50,000 – 100,000	Practical range for model training
100,000 – 500,000	High-quality, robust models
Unlimited (`-n 0`)	Full dump, for maximum accuracy

The default limit (-n 100000) in corpus_wikidump.sh sits at the top of the practical range and the bottom of the high-quality range.

Corpus Quality Tips

Diversity – Include text from various domains (news, literature, web, etc.)
Size – See Corpus Size Guidelines above for recommended sizes
Consistency – Ensure consistent tokenization throughout the corpus
Deduplication – Remove duplicate sentences to avoid bias
Cleaning – Remove HTML tags, special formatting, and non-text content
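
The deduplication and cleaning tips can be applied in one pass over an already segmented corpus. The sketch below is an illustration under stated assumptions: `clean_and_dedupe` is a hypothetical helper, and the regular expression is a deliberately crude way to strip tag-like markup, not a full HTML parser.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude match for HTML-like tags


def clean_and_dedupe(lines: list[str]) -> list[str]:
    """Strip tags, normalize whitespace, and drop empty or repeated sentences."""
    seen = set()
    result = []
    for line in lines:
        # Remove tags, then collapse runs of whitespace to single spaces.
        text = " ".join(TAG_RE.sub("", line).split())
        if text and text not in seen:
            seen.add(text)
            result.append(text)  # first occurrence wins; order is preserved
    return result


corpus = [
    "太郎 は 走っ た 。",
    "<b>太郎</b> は  走っ た 。",
    "",
    "これ は テスト です 。",
]
print(clean_and_dedupe(corpus))  # ['太郎 は 走っ た 。', 'これ は テスト です 。']
```

Cleaning before comparing means that lines differing only in markup or spacing are treated as duplicates.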
