Preparing a Corpus

A good training corpus is essential for model accuracy. This guide explains how to prepare one using Universal Dependencies (UD) Treebanks.

Data Source: UD Treebanks

Litsea uses UD Treebanks as the data source for both word segmentation and POS tagging. UD Treebanks provide high-quality, manually annotated data in CoNLL-U format for many languages.

Available Treebanks

| Language | Treebank        | Repository      |
|----------|-----------------|-----------------|
| Japanese | UD Japanese-GSD | UD_Japanese-GSD |
| Chinese  | UD Chinese-GSD  | UD_Chinese-GSD  |
| Korean   | UD Korean-GSD   | UD_Korean-GSD   |

Step 1: Download a UD Treebank

Use scripts/download_udtreebank.sh to download a UD Treebank. It prints the path to the training CoNLL-U file to stdout:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

Supported languages: ja (Japanese, default), ko (Korean), zh (Chinese). Use -o to specify the output directory (default: current directory).

Corpus for Word Segmentation

For word segmentation (AdaBoost), the corpus must be a plain text file with:

  • One sentence per line
  • Words separated by spaces

太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。
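
To make explicit what this format encodes, the following minimal sketch recovers the unsegmented sentence and the character offsets at which each word starts from a single corpus line. The function name `line_to_boundaries` is hypothetical and is not part of Litsea; it only illustrates how space-separated words imply word boundaries.

```python
def line_to_boundaries(line: str) -> tuple[str, list[int]]:
    """Return the unsegmented text and the offset where each word starts.

    Hypothetical helper for illustration; not part of Litsea.
    """
    words = line.split()          # words are separated by spaces
    text = "".join(words)         # the sentence as it appears without spaces
    starts = []
    offset = 0
    for word in words:
        starts.append(offset)     # a word boundary sits before this offset
        offset += len(word)
    return text, starts


text, starts = line_to_boundaries("太郎 は 走っ た 。")
print(text)    # 太郎は走った。
print(starts)  # [0, 2, 3, 5, 6]
```

A segmenter trained on such a corpus has to predict, for each character position, whether a boundary like these falls there.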

Convert CoNLL-U to Word Segmentation Corpus

Use scripts/corpus_udtreebank.sh to convert a CoNLL-U file to corpus format:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

This converts the CoNLL-U data into space-separated words (one sentence per line).
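
Conceptually, the conversion reads the FORM column of each CoNLL-U token line and joins the words of a sentence with spaces. The sketch below shows that idea in Python. It is not the implementation of scripts/corpus_udtreebank.sh: the function name `conllu_to_sentences` is hypothetical, the sample lines are hand-written rather than taken from a treebank, and keeping only lines with a plain integer ID is an assumption.

```python
def conllu_to_sentences(lines):
    """Yield one space-separated sentence per CoNLL-U sentence block."""
    words = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                  # a blank line ends the current sentence
            if words:
                yield " ".join(words)
                words = []
            continue
        if line.startswith("#"):      # comment or metadata line
            continue
        cols = line.split("\t")       # CoNLL-U columns are tab-separated
        if cols[0].isdigit():         # assumption: keep plain integer IDs only
            words.append(cols[1])     # second column is FORM (the surface word)
    if words:                         # input may lack a trailing blank line
        yield " ".join(words)


# Hand-written sample for illustration; unused columns are left as "_".
sample = [
    "# text = 太郎は走った。",
    "1\t太郎\t_\tPROPN\t_\t_\t_\t_\t_\t_",
    "2\tは\t_\tADP\t_\t_\t_\t_\t_\t_",
    "3\t走っ\t_\tVERB\t_\t_\t_\t_\t_\t_",
    "4\tた\t_\tAUX\t_\t_\t_\t_\t_\t_",
    "5\t。\t_\tPUNCT\t_\t_\t_\t_\t_\t_",
    "",
]
print(list(conllu_to_sentences(sample)))  # ['太郎 は 走っ た 。']
```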

Corpus for POS Tagging

For POS tagging (Averaged Perceptron), each word must be annotated with its POS tag.

POS Corpus Format

Each line represents one sentence, with words annotated as word/POS pairs separated by spaces:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Litsea/PROPN は/ADP 単語/NOUN 分割/NOUN ソフトウェア/NOUN です/AUX 。/PUNCT

The POS tags follow the Universal POS (UPOS) tagset with 17 categories: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
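
A quick way to sanity-check a POS corpus is to parse each line into (word, tag) pairs and verify the tags against the 17 UPOS categories. The sketch below does that. `parse_pos_line` is a hypothetical helper, not a Litsea API, and splitting on the last slash (so that a word may itself contain "/") is an assumption about how such pairs are best read.

```python
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def parse_pos_line(line: str) -> list[tuple[str, str]]:
    """Parse one 'word/POS word/POS ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words containing "/" stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or tag not in UPOS_TAGS:
            raise ValueError(f"malformed token: {token!r}")
        pairs.append((word, tag))
    return pairs


pairs = parse_pos_line("これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT")
print(pairs[0])    # ('これ', 'PRON')
print(len(pairs))  # 5
```

Running this over every line of pos_corpus.txt before training would surface missing tags or tags outside the UPOS set early.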

Convert CoNLL-U to POS Corpus

Use scripts/corpus_udtreebank.sh with the -p flag to produce a POS corpus:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Multi-word tokens (ID ranges such as 1-2) and empty nodes (decimal IDs such as 1.1) are handled automatically during conversion.
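
In CoNLL-U, a multi-word token line carries an ID range (for example 1-2) and an empty node carries a decimal ID (for example 1.1). The sketch below shows one plausible way to handle them when emitting word/POS pairs: skip both kinds of line and keep the ordinary syntactic words. This policy is an assumption for illustration; the source does not say what scripts/corpus_udtreebank.sh does internally. The function name and the hand-written sample are likewise hypothetical.

```python
def conllu_to_pos_sentences(lines):
    """Yield 'word/UPOS ...' lines, skipping multi-word tokens and empty nodes."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends the sentence
            if pairs:
                yield " ".join(pairs)
                pairs = []
            continue
        if line.startswith("#"):          # comment or metadata line
            continue
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # "1-2" = multi-word token range, "1.1" = empty node.
            # Assumption: drop both and keep only the syntactic words.
            continue
        pairs.append(f"{cols[1]}/{cols[3]}")   # FORM / UPOS
    if pairs:
        yield " ".join(pairs)


# Hand-written sample for illustration; unused columns are left as "_".
sample = [
    "# text = vámonos",
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\tVERB\t_\t_\t_\t_\t_\t_",
    "2\tnos\tnosotros\tPRON\t_\t_\t_\t_\t_\t_",
    "2.1\t_\t_\t_\t_\t_\t_\t_\t_\t_",
    "",
]
print(list(conllu_to_pos_sentences(sample)))  # ['vamos/VERB nos/PRON']
```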

Automated Corpus Preparation

Litsea includes helper scripts in the scripts/ directory that automate the UD Treebank download and conversion:

  • scripts/download_udtreebank.sh – Downloads a UD Treebank and prints the path to the training CoNLL-U file
  • scripts/corpus_udtreebank.sh – Converts a CoNLL-U file to Litsea corpus format

# Download UD Treebank and get CoNLL-U file path
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

# Generate word segmentation corpus
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

# Generate POS corpus
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Supported languages for download_udtreebank.sh: ja (Japanese, default), ko (Korean), zh (Chinese).

Corpus from Wikipedia Dump

For larger-scale training, you can build a corpus from a full Wikipedia dump using scripts/corpus_wikidump.sh. The script extracts plain text with wicket, keeps only lines that look like complete sentences, and tokenizes them with lindera.

Usage

# Japanese (default)
bash scripts/corpus_wikidump.sh jawiki-latest-pages-articles.xml.bz2 corpus_ja.txt

# Korean
bash scripts/corpus_wikidump.sh -l ko kowiki-latest-pages-articles.xml.bz2 corpus_ko.txt

# Chinese
bash scripts/corpus_wikidump.sh -l zh zhwiki-latest-pages-articles.xml.bz2 corpus_zh.txt

Options

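| Option       | Description                                         | Default |
|--------------|-----------------------------------------------------|---------|
| -l lang      | Language code: ja, ko, zh                           | ja      |
| -n max_lines | Maximum sentence lines to process (0 = unlimited)   | 100000  |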

Sentence Filtering

The script applies two filters to keep only well-formed sentences:

  1. Sentence-ending punctuation – Lines must end with 。, ., !, or ?. This excludes section headers (e.g., “参考文献”), list items, and metadata.
  2. Minimum length – Lines must be at least 20 characters. This excludes short fragments and isolated labels.
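
The two filters, together with the -n max_lines limit, can be sketched as follows. This is an illustration, not the script's code. The exact set of accepted ending characters, counting length in characters rather than bytes, and applying max_lines to the lines kept (rather than the lines read) are all assumptions, and the names `is_sentence` and `filter_sentences` are hypothetical.

```python
from itertools import islice

SENTENCE_ENDINGS = ("。", ".", "!", "?")  # assumed set of ending characters
MIN_LENGTH = 20                           # minimum length, counted in characters


def is_sentence(line: str) -> bool:
    """Apply both filters: sentence-ending punctuation and minimum length."""
    line = line.strip()
    return len(line) >= MIN_LENGTH and line.endswith(SENTENCE_ENDINGS)


def filter_sentences(lines, max_lines=100000):
    """Keep well-formed sentences, stopping after max_lines (0 = unlimited)."""
    kept = (line.strip() for line in lines if is_sentence(line))
    limit = None if max_lines == 0 else max_lines
    return list(islice(kept, limit))


lines = [
    "参考文献",                                        # header: no punctuation
    "短い文。",                                        # too short
    "Litseaはコンパクトな単語分割ソフトウェアです。",  # kept
]
print(filter_sentences(lines))
# ['Litseaはコンパクトな単語分割ソフトウェアです。']
```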

Tokenizer Dictionaries

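| Language      | Dictionary          | Token Filter                                    |
|---------------|---------------------|-------------------------------------------------|
| Japanese (ja) | embedded://unidic   | japanese_compound_word (numeral compound)       |
| Korean (ko)   | embedded://ko-dic   | None                                            |
| Chinese (zh)  | embedded://cc-cedict | None                                           |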

Corpus Size Guidelines

The recommended corpus size depends on your use case:

| Size (sentence lines) | Use Case                                |
|-----------------------|-----------------------------------------|
| ~10,000               | Minimum for prototyping and smoke tests |
| 50,000 – 100,000      | Practical range for model training      |
| 100,000 – 500,000     | High-quality, robust models             |
| Unlimited             | Use full dump for maximum accuracy      |

The default of 100000 for max_lines in corpus_wikidump.sh sits at the upper bound of the practical range, where the high-quality range begins.
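
To check where an existing corpus falls in these guidelines, a small helper can map a sentence-line count to a tier. The sketch below is one possible reading of the table: the tier labels, the treatment of counts between 10,000 and 50,000 as prototyping-sized, and assigning exactly 100,000 to the practical range are assumptions, and `size_tier` is a hypothetical name.

```python
def size_tier(sentence_lines: int) -> str:
    """Map a sentence-line count to a rough tier from the size guidelines."""
    if sentence_lines < 10_000:
        return "below prototyping minimum"
    if sentence_lines < 50_000:
        return "prototyping"            # assumption: covers the 10k-50k gap
    if sentence_lines <= 100_000:
        return "practical"              # assumption: 100,000 counts as practical
    if sentence_lines <= 500_000:
        return "high-quality"
    return "full-dump scale"


def count_sentence_lines(lines) -> int:
    """Count non-empty lines, since each line holds one sentence."""
    return sum(1 for line in lines if line.strip())


print(size_tier(100_000))  # practical
print(size_tier(250_000))  # high-quality
```

For a corpus file, `count_sentence_lines(open("corpus.txt", encoding="utf-8"))` would give the count to pass to `size_tier`.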

Corpus Quality Tips

  • Diversity – Include text from various domains (news, literature, web, etc.)
  • Size – See Corpus Size Guidelines above for recommended sizes
  • Consistency – Ensure consistent tokenization throughout the corpus
  • Deduplication – Remove duplicate sentences to avoid bias
  • Cleaning – Remove HTML tags, special formatting, and non-text content
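
The deduplication and cleaning tips can be combined into a single pass over the corpus lines. The sketch below is a rough starting point under stated assumptions: the regex-based tag removal is deliberately crude (it is not an HTML parser), whitespace is collapsed to single spaces so word separation stays consistent, and the names `clean_line` and `clean_and_dedupe` are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude match for HTML-like tags


def clean_line(line: str) -> str:
    """Strip HTML-like tags and collapse runs of whitespace to one space."""
    no_tags = TAG_RE.sub("", line)
    return " ".join(no_tags.split())


def clean_and_dedupe(lines):
    """Clean each line, drop empties, and keep the first copy of duplicates."""
    seen = set()
    result = []
    for line in lines:
        cleaned = clean_line(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)   # original order is preserved
    return result


lines = [
    "太郎 は <b>走っ</b> た 。",
    "太郎 は 走っ た 。",           # duplicate once tags are removed
    "",
    "これ は  テスト です 。",      # double space is collapsed
]
print(clean_and_dedupe(lines))
# ['太郎 は 走っ た 。', 'これ は テスト です 。']
```

Cleaning before deduplicating means that lines differing only in markup or spacing are treated as the same sentence.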