Introduction
Litsea is an extremely compact word segmentation library implemented in Rust, inspired by TinySegmenter and TinySegmenterMaker.
Unlike traditional morphological analyzers such as MeCab and Lindera, Litsea does not rely on large-scale dictionaries. Instead, it performs word segmentation using a compact pre-trained model based on the AdaBoost binary classification algorithm. Litsea also supports joint word segmentation and POS (Part-of-Speech) tagging using the Averaged Perceptron multiclass classifier with the Universal POS (UPOS) tagset.
Key Features
- Fast and safe Rust implementation – built with Rust’s safety guarantees and performance
- Compact pre-trained models – model files are only a few kilobytes in size
- No dictionary dependency – segmentation is driven entirely by a statistical model
- POS tagging – joint segmentation and Part-of-Speech tagging with UPOS tags via Averaged Perceptron
- Multilingual support – Japanese, Chinese (Simplified/Traditional), and Korean
- Model training capabilities – train custom models using AdaBoost or Averaged Perceptron with your own corpora
- Remote model loading – load models from HTTP/HTTPS URLs or local files
- Simple and extensible API – easy to integrate into Rust projects as a library
How It Works
Litsea treats word segmentation as a binary classification problem: for each character position in a sentence, the model predicts whether it is a word boundary (+1) or not a boundary (-1). The classifier uses character n-gram features and character type information specific to each language.
Input: "LitseaはRust製です"
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
O O O O B O B O B ← boundary predictions
Output: ["Litsea", "は", "Rust製", "です"]
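The mapping from per-character boundary predictions to words can be sketched as follows. This is a minimal illustration, not Litsea's implementation: `split_at_boundaries` is a hypothetical helper, and the boundary flags are supplied by hand instead of coming from a trained model.

```rust
/// Splits `text` into words given one flag per character.
/// `starts_word[i] == true` means character `i` begins a new word.
/// Hypothetical helper for illustration; not part of Litsea's API.
fn split_at_boundaries(text: &str, starts_word: &[bool]) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, ch) in text.chars().enumerate() {
        // A boundary closes the word collected so far (missing flags count as "no boundary").
        if starts_word.get(i).copied().unwrap_or(false) && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.push(ch);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

For example, the flags `[true, false, true, true, false, false]` applied to "これはテスト" yield the three words これ, は, and テスト.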
POS Tagging
In addition to word segmentation, Litsea supports POS (Part-of-Speech) tagging. Using the Averaged Perceptron multiclass classifier, it predicts word boundaries and POS tags in a single pass.
For each character position, the model predicts one of 18 SegmentLabel classes:
- B-NOUN, B-VERB, …, B-X (boundary labels for 17 POS tags)
- O (non-boundary = continuation of the current word)
The POS tags follow the Universal Dependencies UPOS tagset (17 POS tags).
Input: "今日はいい天気ですね。"
Output: 今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT
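To make the 18-class label set concrete, here is a small sketch of parsing the string form of a label ("B-NOUN", "O"). The `Label` enum and the `UPOS` constant are simplified stand-ins invented for this example; they are not the library's actual `SegmentLabel` or `Upos` definitions.

```rust
use std::str::FromStr;

/// The 17 UPOS tags from Universal Dependencies.
const UPOS: [&str; 17] = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
];

/// Simplified stand-in for a per-character label (assumption, not Litsea's type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Label {
    /// The character starts a word with the given UPOS tag, e.g. "NOUN".
    Begin(String),
    /// The character continues the current word.
    Other,
}

impl FromStr for Label {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s == "O" {
            return Ok(Label::Other);
        }
        match s.strip_prefix("B-") {
            Some(tag) if UPOS.contains(&tag) => Ok(Label::Begin(tag.to_string())),
            _ => Err(format!("unknown segment label: {s}")),
        }
    }
}
```

With 17 `Begin` variants plus `Other`, the classifier chooses among 18 classes at every character position.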
Name Origin
There is a small plant called Litsea cubeba (Aomoji) in the same Lauraceae family as Lindera (Kuromoji). This is the origin of the name Litsea.
Current Version
Litsea v0.5.0 – Rust Edition 2024, minimum Rust version 1.87.
Getting Started
Welcome to Litsea! This section will help you get up and running quickly.
Litsea is a compact word segmentation library in Rust that supports both word segmentation (AdaBoost) and joint segmentation with POS tagging (Averaged Perceptron).
Next Steps
- Installation – install Litsea from source or crates.io
- Quick Start – segment your first sentence in minutes
Installation
Prerequisites
- Rust 1.87 or later (stable channel) from rust-lang.org
- Cargo (Rust’s package manager, included with Rust)
Installing the CLI Tool
From crates.io
cargo install litsea-cli
From Source
git clone https://github.com/mosuka/litsea.git
cd litsea
cargo build --release
The binary will be available at ./target/release/litsea.
Verify the installation:
./target/release/litsea --help
Using as a Library
Add Litsea to your project’s Cargo.toml:
[dependencies]
litsea = "0.5.0"
Note: Loading models from local files (load_model_from_path) is synchronous, so no async runtime is needed. An async runtime such as tokio is only required if you load models over HTTP/HTTPS with the async load_model method (enabled by the remote_model feature, which is on by default).
Supported Platforms
Litsea is tested on the following platforms:
| OS | Architecture |
|---|---|
| Linux | x86_64, aarch64 |
| macOS | x86_64 (Intel), aarch64 (Apple Silicon) |
| Windows | x86_64, aarch64 |
Quick Start
CLI Quick Start
Segmenting Text
Litsea ships with pre-trained models in the models/ directory. Pipe text into the segment command:
Japanese:
echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" \
| litsea segment -l japanese ./models/japanese.model
Output:
Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。
Chinese:
echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model
Korean:
echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model
POS Tagging
Litsea can perform joint word segmentation and POS tagging using a POS model. Add the --pos flag to the segment command:
echo "今日はいい天気ですね。" \
| litsea segment --pos -l japanese ./models/japanese_pos.model
Output:
今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT
Each token is annotated with a Universal POS (UPOS) tag.
Library Quick Start
Here is a minimal Rust program that loads a model and segments text:
use std::path::Path;

use litsea::adaboost::AdaBoost;
use litsea::language::Language;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    // Load the pre-trained model
    let mut learner = AdaBoost::new(0.01, 100);
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    // Create a segmenter
    let segmenter = Segmenter::new(Language::Japanese, Some(learner));

    // Segment text
    let tokens = segmenter.segment("これはテストです。");
    println!("{}", tokens.join(" "));
    // Output: これ は テスト です 。

    Ok(())
}
POS Tagging with the Library
Here is a minimal Rust program that loads a POS model and segments text with POS tags:
use std::path::Path;

use litsea::language::Language;
use litsea::perceptron::AveragedPerceptron;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    // Load the pre-trained POS model
    let mut pos_learner = AveragedPerceptron::new();
    pos_learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

    // Create a segmenter with POS support
    let segmenter = Segmenter::with_pos_learner(Language::Japanese, pos_learner);

    // Segment text with POS tags
    let tokens = segmenter.segment_with_pos("今日はいい天気ですね。");
    for (word, pos) in &tokens {
        print!("{}/{} ", word, pos);
    }
    // Output: 今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

    Ok(())
}
What’s Next
- CLI Reference – learn all CLI commands and options
- Training Guide – train your own models
- Architecture – understand how Litsea works internally
Architecture Overview
Litsea is designed as a compact, dictionary-free word segmentation system. It treats word segmentation as a binary classification problem and uses AdaBoost to learn word boundary patterns from character-level features.
High-Level Data Flow
Litsea has two main workflows: training and segmentation.
Training Pipeline
flowchart LR
A["Corpus (text)"] --> B["Extractor"]
B --> C["Features File (.txt)"]
C --> D["Trainer (AdaBoost)"]
D --> E["Model File (.model)"]
- Corpus preparation – Prepare text with words separated by spaces
- Feature extraction – The Extractor reads the corpus, classifies characters by type, and outputs labeled feature vectors
- Model training – The Trainer feeds features into AdaBoost, which iteratively selects the most informative features and produces a compact model
Segmentation Pipeline
flowchart LR
F["Raw text"] --> G["Segmenter (AdaBoost)"]
H["Model file"] --> G
G --> I["Segmented words"]
- Model loading – Load a pre-trained model (from file or URL)
- Character classification – For each character in the input, determine its type code based on language-specific patterns
- Feature extraction – Build a feature set for each character position using a sliding window
- Prediction – AdaBoost predicts whether each position is a word boundary
Design Principles
- No dictionary dependency – Unlike MeCab or Lindera, Litsea relies solely on a statistical model learned from character patterns
- Compact models – Model files are typically 1-22 KB, containing only the feature weights that matter
- Language-agnostic framework – The core algorithm is the same for all languages; only the character type patterns differ
- Simple extensibility – Adding a new language requires only defining character type patterns and training a model
Workspace Structure
Litsea is organized as a Cargo workspace with two crates and supporting directories.
Directory Layout
litsea/
├── Cargo.toml                  # Workspace manifest
├── Cargo.lock                  # Dependency lock file
├── Makefile                    # Build convenience targets
├── rustfmt.toml                # Rust formatting configuration
├── LICENSE                     # MIT
├── README.md                   # Project overview
├── litsea/                     # Core library crate
│   ├── Cargo.toml
│   ├── src/
│   │   ├── lib.rs              # Module declarations and version
│   │   ├── adaboost.rs         # AdaBoost algorithm
│   │   ├── segmenter.rs        # Word segmentation
│   │   ├── extractor.rs        # Feature extraction from corpus
│   │   ├── trainer.rs          # Training orchestration
│   │   ├── language.rs         # Language definitions and char patterns
│   │   └── util.rs             # URI scheme utilities
│   └── benches/
│       └── bench.rs            # Criterion benchmarks
├── litsea-cli/                 # CLI binary crate
│   ├── Cargo.toml
│   └── src/
│       └── main.rs             # CLI entry point
├── models/                     # Pre-trained models
│   ├── japanese.model
│   ├── chinese.model
│   ├── korean.model
│   ├── RWCP.model
│   └── JEITA_Genpaku_ChaSen_IPAdic.model
├── resources/                  # Sample data and test fixtures
│   └── bocchan.txt             # Sample corpus
├── scripts/                    # Corpus preparation utilities
│   ├── download_udtreebank.sh  # Download UD Treebanks (prints CoNLL-U file path)
│   ├── corpus_udtreebank.sh    # Convert CoNLL-U to Litsea corpus format
│   └── wikitexts.sh            # Download and prepare Wikipedia text data
├── docs/                       # mdbook documentation (this book)
└── .github/
    └── workflows/              # CI/CD pipelines
        ├── regression.yml      # Test on push/PR
        ├── release.yml         # Release builds and publishing
        └── periodic.yml        # Weekly stability tests
Crate Details
litsea (Core Library)
The core library provides all segmentation, training, and model I/O functionality.
| Dependency | Version | Purpose |
|---|---|---|
| thiserror | 2.0 | Error type derivation |
| reqwest | 0.13 | HTTP/HTTPS model loading (rustls) |
| tokio | 1.49 | Async runtime for remote model loading |
| criterion | 0.8 | Benchmarking (dev dependency) |
| tempfile | 3.25 | Temporary files for tests (dev dependency) |
litsea-cli (CLI Binary)
The CLI provides a command-line interface to Litsea’s functionality.
| Dependency | Version | Purpose |
|---|---|---|
| clap | 4.5 | Command-line argument parsing |
| ctrlc | 3.5 | Graceful Ctrl+C handling during training |
| tokio | 1.49 | Async runtime |
| litsea | 0.5 | Core library (workspace member) |
Workspace Configuration
The workspace uses Cargo resolver version 3 (Rust Edition 2024):
[workspace]
resolver = "3"
members = ["litsea", "litsea-cli"]
[workspace.package]
version = "0.5.0"
edition = "2024"
rust-version = "1.87"
Shared dependencies are defined at the workspace level in [workspace.dependencies] and referenced by each crate with { workspace = true }.
Module Design
The litsea library crate is organized into focused modules, each with a clear responsibility.
Module Dependency Graph
graph TD
language["language.rs<br/>Character classification"]
segmenter["segmenter.rs<br/>Segmentation + POS tagging"]
adaboost["adaboost.rs<br/>AdaBoost (boundaries)"]
perceptron["perceptron.rs<br/>Averaged Perceptron (POS)"]
upos["upos.rs<br/>UPOS tags and labels"]
extractor["extractor.rs<br/>Feature extraction"]
trainer["trainer.rs<br/>Training orchestration"]
model_io["model_io.rs (private)<br/>Model URI loading"]
error["error.rs<br/>LitseaError / Result"]
metrics["metrics.rs<br/>Evaluation metrics"]
language --> segmenter
upos --> segmenter
adaboost --> segmenter
perceptron --> segmenter
segmenter --> extractor
adaboost --> trainer
perceptron --> trainer
model_io --> adaboost
model_io --> perceptron
error --> adaboost
error --> perceptron
metrics --> trainer
Module Details
language.rs – Language Definitions
Defines the Language enum and character type classification.
- Language – Enum with variants Japanese, Chinese, Korean
  - Implements FromStr (parses "japanese", "ja", "chinese", "zh", "korean", "ko")
  - Implements Display (outputs lowercase name)
  - char_type(c: char) -> &'static str – Classifies a character with a direct match on character ranges (allocation-free; no regex). Language-specific functions (japanese_char_type, etc.) share a punct_latin_digit() helper for the common "P" / "A" / "N" classes.
segmenter.rs – Word Segmentation and POS Tagging
The main user-facing module.
- Segmenter – Holds a Language, an AdaBoost learner, and an optional AveragedPerceptron POS learner (fields are private; use language(), learner(), learner_mut(), pos_learner(), pos_learner_mut())
  - new(language, learner) – Create a segmenter with an optional pre-trained model
  - with_pos_learner(language, pos_learner) – Create a segmenter for joint segmentation + POS tagging
  - segment(sentence) – Segment text into words, returns Vec<String>
  - segment_with_pos(sentence) – Segment and tag, returns Vec<(String, Upos)>
  - char_type(ch) – Classify a single character into its type code
  - add_corpus(corpus) / add_corpus_with_pos(corpus) – Add training data
  - add_corpus_with_writer(corpus, callback) / add_corpus_with_pos_writer(corpus, callback) – Process a corpus with a custom callback
adaboost.rs – AdaBoost Algorithm
The binary classifier used for word boundary decisions.
- AdaBoost
  - new(threshold, num_iterations) – Create with training parameters
  - initialize_features(path) / initialize_instances(path) – Load training data
  - train(running) – Run the AdaBoost training loop
  - predict(&attributes) – Predict boundary (+1) or non-boundary (-1)
  - load_model(uri) (async) / load_model_from_path(path) / load_model_from_reader(reader) – Load model weights
  - save_model(path) – Save model weights to a file
  - metrics() – Calculate accuracy, precision, and recall (BinaryMetrics)
  - bias() – Get the model’s bias term
perceptron.rs – Averaged Perceptron
The multiclass classifier used for joint segmentation + POS tagging.
- AveragedPerceptron
  - add_instance(features, label) – Add a training instance
  - train(num_epochs, running) – Train with weight averaging
  - predict(&features) – Predict the best class label
  - load_model(uri) (async) / load_model_from_path(path) / load_model_from_reader(reader) – Load model weights
  - save_model(path) – Save model weights
  - metrics() – Macro-averaged evaluation (MulticlassMetrics)
- Weights are stored in a feature → per-class vector layout for fast inference.
upos.rs – Universal POS Tags
- Upos – The 17 Universal Dependencies POS tags (NOUN, VERB, …)
- SegmentLabel – Combined segmentation + POS label per character position (B(Upos) or O), with Display / FromStr for the "B-NOUN" / "O" string form
extractor.rs – Feature Extraction
Extracts features from a corpus for model training.
- Extractor – Wraps a Segmenter to process corpus files
  - new(language) – Create an extractor for a specific language
  - extract(corpus_path, features_path) – Read a corpus, write a features file
  - extract_with_pos(corpus_path, features_path) – Same for POS-tagged corpora
trainer.rs – Training Orchestration
High-level training workflows.
- Trainer – Segmentation model training (AdaBoost)
  - new(threshold, num_iterations, features_path) – Initialize from a features file
  - load_model(uri) – Optionally load an existing model for incremental training (async)
  - train(running, model_path) – Train and save, returns BinaryMetrics
- PosTrainer – POS model training (Averaged Perceptron)
  - new(num_epochs, features_path) / load_model(uri) / train(running, model_path) returning MulticlassMetrics
error.rs – Error Handling
- LitseaError – Error enum (Io, InvalidData, InvalidInput, Unsupported, and Download with the remote_model feature)
- Result<T> – Alias used by every fallible API
metrics.rs – Evaluation Metrics
- BinaryMetrics – Accuracy, precision, recall, confusion matrix (AdaBoost)
- MulticlassMetrics – Accuracy and macro-averaged precision/recall (Averaged Perceptron)
model_io.rs – Model Loading I/O (private)
Internal module that resolves a model URI (plain path, file://, or http(s):// with the remote_model feature) and returns the raw model bytes. Not part of the public API.
Public Exports
The library’s lib.rs exposes the public modules and re-exports the main types:
pub mod adaboost;
pub mod error;
pub mod extractor;
pub mod language;
pub mod metrics;
mod model_io;
pub mod perceptron;
pub mod segmenter;
pub mod trainer;
pub mod upos;

pub use adaboost::AdaBoost;
pub use error::{LitseaError, Result};
pub use extractor::Extractor;
pub use language::Language;
pub use metrics::{BinaryMetrics, MulticlassMetrics};
pub use perceptron::AveragedPerceptron;
pub use segmenter::Segmenter;
pub use trainer::{PosTrainer, Trainer};
pub use upos::{SegmentLabel, Upos};

pub fn version() -> &'static str { ... }
AdaBoost Binary Classification
Litsea uses the AdaBoost (Adaptive Boosting) algorithm for binary classification to determine word boundaries. This chapter explains the algorithm as implemented in Litsea.
Overview
AdaBoost combines many weak learners (simple classifiers) into a strong ensemble classifier. In Litsea:
- Positive label (+1) = word boundary
- Negative label (-1) = non-boundary (continuation of the current word)
- Weak learners = individual features (each feature is a binary “stump” – present or absent)
Training Algorithm
The training loop in AdaBoost::train() works as follows:
Initialization
- Load features and instances from the training file
- Initialize instance weights uniformly (later adjusted based on initial score)
- All model weights start at zero
Iterative Boosting
For each iteration t (up to num_iterations):
Step 1: Calculate weighted errors
For each feature h, compute its weighted error over all instances:
error[h] -= D[i] * y[i] (for each instance i that has feature h)
where D[i] is the instance weight and y[i] is the true label.
Step 2: Select the best weak learner
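Find the feature whose weighted error rate is farthest from 0.5. A feature that is wrong most of the time is as useful as one that is right most of the time, because it receives a negative weight in Step 4: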
error_rate(h) = (error[h] + positive_weight_sum) / instance_weight_sum
h_best = argmax_h |0.5 - error_rate(h)|
The baseline competitor is the “all-negative” classifier (always predicts -1), whose error rate equals the fraction of positive instances. Any real feature must beat this baseline.
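Steps 1 and 2 can be sketched in a few lines of Rust. This is an illustrative reading of the formulas above, not the code in adaboost.rs: the `Instance` struct, the use of integer feature indices, and both function names are assumptions made for the example.

```rust
/// One training instance: the indices of the features it contains and its label (+1 or -1).
/// Hypothetical layout for illustration.
struct Instance {
    features: Vec<usize>,
    label: i8,
}

/// Step 1: weighted error rate of every feature, following
/// error_rate(h) = (error[h] + positive_weight_sum) / instance_weight_sum.
fn error_rates(instances: &[Instance], d: &[f64], num_features: usize) -> Vec<f64> {
    let mut error = vec![0.0; num_features];
    let mut positive_weight_sum = 0.0;
    let mut weight_sum = 0.0;
    for (inst, &w) in instances.iter().zip(d) {
        weight_sum += w;
        if inst.label > 0 {
            positive_weight_sum += w;
        }
        for &h in &inst.features {
            error[h] -= w * f64::from(inst.label);
        }
    }
    error
        .iter()
        .map(|e| (e + positive_weight_sum) / weight_sum)
        .collect()
}

/// Step 2: index and error rate of the feature that maximizes |0.5 - error_rate|.
fn select_best(rates: &[f64]) -> Option<(usize, f64)> {
    rates
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| (0.5 - a.1).abs().total_cmp(&(0.5 - b.1).abs()))
}
```

Because `error[h]` only touches instances that contain feature `h`, one pass over the sparse training data is enough to score every feature.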
Step 3: Check convergence
If |0.5 - best_error_rate| < threshold, stop early – no feature can significantly improve the model.
Step 4: Compute the weak learner weight
alpha = 0.5 * ln((1 - error_rate) / error_rate)
model[h_best] += alpha
A lower error rate produces a higher alpha, giving more influence to better features.
Step 5: Update instance weights
For each instance i:
    prediction = +1 if h_best in features(i), else -1
    if y[i] * prediction < 0:    (misclassified)
        D[i] *= exp(alpha)       (increase weight)
    else:                        (correctly classified)
        D[i] /= exp(alpha)       (decrease weight)
Normalize: D[i] /= sum(D)
This ensures subsequent iterations focus on the instances that are still difficult to classify.
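Steps 4 and 5 translate almost directly into code. The sketch below follows the formulas above; the function names and the flat slices used for weights, labels, and feature membership are assumptions for illustration, not Litsea's internal data layout.

```rust
/// Step 4: weight of the selected weak learner.
fn stump_weight(error_rate: f64) -> f64 {
    0.5 * ((1.0 - error_rate) / error_rate).ln()
}

/// Step 5: multiplicative update followed by normalization.
/// `has_feature[i]` says whether instance `i` contains the selected feature,
/// `labels[i]` is +1 or -1, and `d[i]` is the instance weight.
fn reweight(d: &mut [f64], has_feature: &[bool], labels: &[i8], alpha: f64) {
    let factor = alpha.exp();
    for ((w, &present), &y) in d.iter_mut().zip(has_feature).zip(labels) {
        let prediction: i8 = if present { 1 } else { -1 };
        if y * prediction < 0 {
            *w *= factor; // misclassified: increase weight
        } else {
            *w /= factor; // correctly classified: decrease weight
        }
    }
    let total: f64 = d.iter().sum();
    for w in d.iter_mut() {
        *w /= total;
    }
}
```

An error rate of exactly 0.5 gives `alpha = 0`, which leaves all instance weights unchanged; this is why the convergence check in Step 3 stops once no feature deviates enough from 0.5.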
Prediction
Given an input set of features (attributes), the prediction is:
score = bias + sum(model[feature] for each feature in attributes)
prediction = +1 if score >= 0, else -1
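The scoring rule can be sketched as below. The free function and the plain `HashMap` model are assumptions for illustration (the library exposes prediction through `AdaBoost::predict`); features that are absent from the model are assumed to contribute nothing to the score.

```rust
use std::collections::{HashMap, HashSet};

/// Returns +1 (boundary) or -1 (non-boundary) for one character position.
/// Illustrative sketch; features missing from `model` are treated as weight 0.
fn predict(model: &HashMap<String, f64>, bias: f64, attributes: &HashSet<String>) -> i8 {
    let score: f64 = bias
        + attributes
            .iter()
            .filter_map(|feature| model.get(feature))
            .sum::<f64>();
    if score >= 0.0 { 1 } else { -1 }
}
```

Since only the features active at the current position are looked up, the cost of one prediction depends on the size of the attribute set, not on the size of the model.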
Bias Term
The bias is computed as:
bias = -sum(all model weights) / 2.0
This centers the decision boundary. The empty-string feature ("") serves as the bias bucket during training.
Model File Format
The trained model is saved as a simple text file:
feature1\tweight1
feature2\tweight2
...
bias_value
- Each line contains a feature name and its weight (tab-separated)
- Zero-weight features are omitted
- The last line contains the bias term (a single number)
See Model File Format for details.
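A reader for this text format might look like the following sketch. `parse_model` is a hypothetical helper, not the library's loader (use `load_model_from_path` or `load_model_from_reader` for that); it assumes that any line without a tab is the bias line.

```rust
use std::collections::HashMap;

/// Parses "feature<TAB>weight" lines followed by a single bias line.
/// Hypothetical helper for illustration only.
fn parse_model(text: &str) -> Result<(HashMap<String, f64>, f64), String> {
    let mut weights = HashMap::new();
    let mut bias = None;
    for line in text.lines().filter(|l| !l.trim().is_empty()) {
        match line.split_once('\t') {
            Some((feature, weight)) => {
                let w = weight
                    .trim()
                    .parse::<f64>()
                    .map_err(|e| format!("bad weight in {line:?}: {e}"))?;
                weights.insert(feature.to_string(), w);
            }
            // Assumption: a line without a tab holds the bias term.
            None => {
                let b = line
                    .trim()
                    .parse::<f64>()
                    .map_err(|e| format!("bad bias {line:?}: {e}"))?;
                bias = Some(b);
            }
        }
    }
    let bias = bias.ok_or_else(|| "missing bias line".to_string())?;
    Ok((weights, bias))
}
```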
Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| threshold | 0.01 | Early stopping threshold. Lower values allow more iterations, potentially improving accuracy |
| num_iterations | 100 | Maximum number of boosting rounds. Higher values may improve accuracy at the cost of training time and model size |
Averaged Perceptron
Litsea uses the Averaged Perceptron algorithm for multiclass classification to perform joint word segmentation and POS tagging. This chapter explains the algorithm as implemented in Litsea.
Overview
While AdaBoost performs binary classification (boundary vs. non-boundary), the Averaged Perceptron performs multiclass classification – predicting one of 18 segment labels for each character position:
- 17 boundary labels: B-ADJ, B-ADP, B-ADV, B-AUX, B-CCONJ, B-DET, B-INTJ, B-NOUN, B-NUM, B-PART, B-PRON, B-PROPN, B-PUNCT, B-SCONJ, B-SYM, B-VERB, B-X
- 1 non-boundary label: O (continuation of the current word)
These labels correspond to the 17 Universal POS (UPOS) tags from the Universal Dependencies project, prefixed with B- to indicate a word boundary. This enables simultaneous word boundary detection and POS estimation in a single classification step.
Algorithm
Weight Representation
The perceptron maintains a weight vector per class. Weights are stored as a sparse map:
weights: HashMap<Feature, HashMap<Class, f64>>
For example:
weights["UW4:猫"]["B-NOUN"] = 2.5
weights["UC4:H"]["B-NOUN"] = 1.8
weights["UW4:猫"]["O"] = -0.3
...
For a given feature set, the score for each class is the sum of its feature weights:
score(class) = sum(weights[feature][class] for each feature in input)
prediction = argmax(score(class) for all classes)
Update Rule
When the perceptron makes a misclassification:
For each training instance (features, truth):
    guess = predict(features)
    if guess != truth:
        For each feature f in features:
            weights[f][truth] += 1.0   # increase weight for correct class
            weights[f][guess] -= 1.0   # decrease weight for predicted class
This increases the weights for the correct class and decreases them for the incorrectly predicted class, making the correct prediction more likely for similar inputs in the future.
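The scoring and update rules can be sketched together as follows. The nested `HashMap` mirrors the sparse map shown earlier; the `Weights` alias, the free functions, and the tie-breaking rule (the first class in `classes` wins) are assumptions for illustration, not the layout or behavior of perceptron.rs.

```rust
use std::collections::HashMap;

/// feature -> class -> weight (illustrative layout).
type Weights = HashMap<String, HashMap<String, f64>>;

/// Returns the class with the highest summed weight; ties go to the earliest class.
fn predict(weights: &Weights, classes: &[&str], features: &[&str]) -> String {
    let mut best = (f64::NEG_INFINITY, "");
    for &class in classes {
        let score: f64 = features
            .iter()
            .filter_map(|f| weights.get(*f)?.get(class))
            .sum();
        if score > best.0 {
            best = (score, class);
        }
    }
    best.1.to_string()
}

/// Applies the perceptron update when the guess differs from the truth.
fn update(weights: &mut Weights, features: &[&str], truth: &str, guess: &str) {
    if truth == guess {
        return;
    }
    for f in features {
        let per_class = weights.entry(f.to_string()).or_default();
        *per_class.entry(truth.to_string()).or_insert(0.0) += 1.0;
        *per_class.entry(guess.to_string()).or_insert(0.0) -= 1.0;
    }
}
```

After a single update on a misclassified instance, the same instance scores higher for the true class than for the wrongly predicted one, so repeating the prediction returns the true class.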
Averaging
A key improvement over the basic perceptron is weight averaging. Rather than using the final weights (which can be unstable and tend to overfit to the tail of the training data), the model averages all weight vectors seen during training. This improves generalization to unseen data.
The implementation uses a cumulative sum approach for efficiency:
cumulative[feature][class] += weights[feature][class] * elapsed_steps
At the end of training:
averaged[feature][class] = cumulative[feature][class] / total_steps
This avoids storing all intermediate weight vectors while producing the same result, and it makes the final model less dependent on the order of the training data.
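The cumulative-sum trick is easiest to see for a single weight cell. The sketch below is an assumption about how such bookkeeping can be done (the struct and its field names are invented for this example): each cell remembers the step at which it was last touched, so untouched weights cost nothing per training step.

```rust
/// One weight cell with lazily accumulated totals (illustrative, not Litsea's struct).
#[derive(Default)]
struct AveragedWeight {
    value: f64,      // current weight
    cumulative: f64, // sum of the weight over all steps accounted for so far
    last_step: u64,  // step up to which `cumulative` is current
}

impl AveragedWeight {
    /// Changes the weight by `delta` at training step `step`.
    fn update(&mut self, delta: f64, step: u64) {
        // First account for the steps during which the old value was in effect.
        self.cumulative += self.value * (step - self.last_step) as f64;
        self.last_step = step;
        self.value += delta;
    }

    /// Average value over `total_steps` steps (must be greater than zero).
    fn average(&mut self, total_steps: u64) -> f64 {
        self.update(0.0, total_steps);
        self.cumulative / total_steps as f64
    }
}
```

For a weight that is raised by 1 at step 2 and lowered by 1 at step 5, the value 1 is in effect for three steps, so the average over 10 steps is 3 / 10 = 0.3.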
Training with Epochs
Training iterates over the data multiple times (epochs). Each epoch processes all training instances in order:
For each epoch (1 to num_epochs):
    For each instance in training data:
        features = extract_features(instance)
        predicted = argmax(score(class) for all classes)
        if predicted != correct_label:
            update weights
        accumulate weights for averaging
Training supports graceful interruption via AtomicBool – a Ctrl+C signal stops training and saves the model at its current state.
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use litsea::perceptron::AveragedPerceptron;

let mut perceptron = AveragedPerceptron::new();
// ... add instances ...
let running = Arc::new(AtomicBool::new(true));
perceptron.train(10, running); // 10 epochs
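How a shared flag can interrupt an epoch loop is sketched below. This is a generic pattern, not Litsea's training loop: `run_epochs` and its closure parameter are hypothetical, and the sketch checks the flag only between epochs, whereas the library may check it more often.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Runs up to `num_epochs` epochs, stopping early once `running` becomes false.
/// Returns the number of epochs that were completed.
fn run_epochs(
    num_epochs: usize,
    running: &Arc<AtomicBool>,
    mut one_epoch: impl FnMut(usize),
) -> usize {
    let mut completed = 0;
    for epoch in 0..num_epochs {
        // A Ctrl+C handler elsewhere would store `false` into the flag.
        if !running.load(Ordering::SeqCst) {
            break;
        }
        one_epoch(epoch);
        completed += 1;
    }
    completed
}
```

In a CLI, a handler registered with a crate such as ctrlc would hold a clone of the `Arc` and flip the flag, after which the caller can save whatever weights have been learned so far.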
Model File Format
The Averaged Perceptron model is saved as a text file with the following structure:
18
O
B-ADJ
B-ADP
...
B-X
feature1\tclass1\tweight1
feature2\tclass2\tweight2
...
- Line 1: Number of classes (18)
- Lines 2 to N+1: Class names, one per line
- Remaining lines: Feature weights, tab-separated as feature\tclass\tweight
- Zero-weight entries are omitted
Comparison with AdaBoost
| Aspect | AdaBoost | Averaged Perceptron |
|---|---|---|
| Classification | Binary (+1/-1) | Multiclass (18 classes) |
| Output | Word boundaries only | Word boundaries + POS tags |
| Weak learner | Decision stumps per feature | None (linear classifier) |
| Weight management | One weight per feature | Class x feature weight matrix |
| Generalization | Ensemble | Weight averaging |
| Training | Iterative boosting with sample reweighting | Online learning with weight averaging |
| Model size | A few KB | ~11 MB (with POS features) |
| Hyperparameters | threshold, num_iterations | num_epochs |
Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| num_epochs | 10 | Number of training passes over the data. More epochs can improve accuracy but may overfit |
Feature Extraction
Litsea uses character n-gram features to capture the local context around each potential word boundary. This chapter catalogs all feature types.
Feature Categories
For each character position i in the input, the segmenter extracts features from a sliding window of characters, their type codes, and previous boundary decisions.
Base Features (38 features)
| Category | IDs | Description | Count |
|---|---|---|---|
| UW (Unary Word) | UW1–UW6 | Individual characters at positions i-3 to i+2 | 6 |
| BW (Bigram Word) | BW1–BW3 | Adjacent character pairs | 3 |
| UC (Unary Char-type) | UC1–UC6 | Character type codes at positions i-3 to i+2 | 6 |
| BC (Bigram Char-type) | BC1–BC3 | Adjacent type code pairs | 3 |
| TC (Trigram Char-type) | TC1–TC4 | Type code triples | 4 |
| UP (Unary Previous-tag) | UP1–UP3 | Previous 3 boundary decisions | 3 |
| BP (Bigram Previous-tag) | BP1–BP2 | Boundary decision pairs | 2 |
| UQ (Unary tag+type) | UQ1–UQ3 | Combined boundary decision + type code | 3 |
| BQ (Bigram tag+type) | BQ1–BQ4 | Combined decision + type code bigrams | 4 |
| TQ (Trigram tag+type) | TQ1–TQ4 | Combined decision + type code trigrams | 4 |
Language-Specific Features (4 features, Japanese and Chinese only)
| Category | IDs | Description | Count |
|---|---|---|---|
| WC (Word+Char-type) | WC1–WC4 | Character + type code mixed features | 4 |
- WC1: character at i-1 + type code at i
- WC2: type code at i-1 + character at i
- WC3: character at i-1 + type code at i-1
- WC4: character at i + type code at i
Why no WC for Korean? Korean Hangul syllables are classified into only two types (SN and SF), so WC features would add noise rather than useful signal.
Total Feature Count
| Language | Base | WC | Total |
|---|---|---|---|
| Japanese | 38 | 4 | 42 |
| Chinese | 38 | 4 | 42 |
| Korean | 38 | 0 | 38 |
Feature Format
Each feature is represented as a string in the format PREFIX:VALUE:
UW4:は ← The character at position i is "は"
UC4:I ← The type code at position i is "I" (Hiragana)
BW2:はテ ← The bigram at position i-1..i is "はテ"
BC2:IK ← The type bigram is Hiragana + Katakana
UP3:B ← The previous boundary decision was "B" (boundary)
WC1:はK ← Character "は" combined with type "K"
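The sketch below builds a subset of these feature strings for one position, using only the offsets stated in this chapter (UW1–UW6 and UC1–UC6 for i-3 to i+2, BW2 and BC2 for the pair i-1..i, and WC1). `window_features` is a hypothetical helper; the real extractor also emits the remaining BW, BC, TC, UP, BP, UQ, BQ, TQ, and WC features.

```rust
/// Builds some of the PREFIX:VALUE features for position `i`.
/// Requires 3 <= i and i + 2 < chars.len(); `chars` and `types` are the padded arrays.
fn window_features(chars: &[&str], types: &[&str], i: usize) -> Vec<String> {
    let mut out = Vec::new();
    // UW1..UW6 and UC1..UC6 cover offsets i-3 ..= i+2, so UW4/UC4 sit at position i.
    for k in 0..6 {
        out.push(format!("UW{}:{}", k + 1, chars[i + k - 3]));
        out.push(format!("UC{}:{}", k + 1, types[i + k - 3]));
    }
    // BW2 / BC2: the pair at positions i-1 and i.
    out.push(format!("BW2:{}{}", chars[i - 1], chars[i]));
    out.push(format!("BC2:{}{}", types[i - 1], types[i]));
    // WC1: character at i-1 combined with the type code at i.
    out.push(format!("WC1:{}{}", chars[i - 1], types[i]));
    out
}
```

Applied to the padded form of "これはテストです。" at the position of テ, this produces BW2:はテ, BC2:IK, and WC1:はK, matching the examples above.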
Sliding Window Layout
The segmenter pads the input with sentinel characters:
Index: 0 1 2 3 4 5 ... n+2 n+3 n+4 n+5
Chars: B3 B2 B1 c1 c2 c3 ... cn E1 E2 E3
Types: O O O t1 t2 t3 ... tn O O O
Tags: U U U U ? ? ... ?
- B3, B2, B1 – Begin sentinels (padding)
- E1, E2, E3 – End sentinels (padding)
- O – “Other” type for padding positions
- U – “Unknown” tag for initial positions
- B – “Boundary” tag (word start)
- O – “Other” tag (continuation)
Features are extracted for positions 4 up to, but not including, len-3, so the full window of i-3 to i+2 is always available.
Training Data Format
The extract command writes features to a file in this format:
1 UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1 UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...
Each line contains:
- A label (1 for boundary, -1 for non-boundary)
- Tab-separated feature strings
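Reading one line of such a features file can be sketched as below. `parse_instance` is a hypothetical helper written for this example; it assumes the label and the features are separated by tab characters, as described above.

```rust
/// Parses "label<TAB>feature<TAB>feature..." into (label, features).
/// Returns None for an empty line or an unexpected label.
fn parse_instance(line: &str) -> Option<(i8, Vec<&str>)> {
    let mut fields = line.split('\t');
    let label = match fields.next()? {
        "1" | "+1" => 1,
        "-1" => -1,
        _ => return None,
    };
    Some((label, fields.collect()))
}
```

Splitting on tabs only (not on arbitrary whitespace) keeps features intact even when a feature value itself contains a space character.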
Character Type Classification
Each language in Litsea defines a set of character type patterns that classify individual characters into linguistically meaningful categories. These type codes are used as features for the AdaBoost classifier.
How It Works
Language::char_type(c: char) -> &'static str classifies a character with a direct match expression on Unicode character ranges — no regex, no allocation. Match arms are tried top to bottom, so the first matching arm determines the type code. If no arm matches, the character is classified as "O" (Other).
Each language has its own classification function (japanese_char_type, chinese_char_type, korean_char_type); the classes shared by all languages — "P" (punctuation), "A" (Latin), "N" (digits) — live in a common punct_latin_digit() helper that is checked after the language-specific classes. Logic beyond plain ranges is expressed with match guards (e.g., Korean Hangul syllable structure).
Japanese Character Types
| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| M | Kanji Numbers | [一二三四五六七八九十百千万億兆] | 一, 千, 億 |
| H | Kanji / CJK Ideographs | [一-龠々〆ヵヶ] | 漢, 字, 学 |
| I | Hiragana | [ぁ-ん] | あ, い, う |
| K | Katakana | [ァ-ヴーｱ-ﾝﾞﾟ] | ア, カ, ー |
| P | Punctuation | CJK Symbols (U+3000-303F), Full-width (U+FF01-FF65) | 。, 、, 「 |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z, B |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |
Note: “M” (Kanji numbers) is checked before “H” (general Kanji), so characters like 一 and 百 are classified as numbers rather than generic ideographs.
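The table can be expressed as a single match, which also shows why arm order matters. This is a sketch built from the ranges listed above, not the source of japanese_char_type: half-width katakana are left out, and checking letters and digits before punctuation is an assumption made here because the full-width punctuation range (U+FF01-FF65) also contains full-width letters and digits.

```rust
/// Sketch of a Japanese character classifier based on the table above.
fn japanese_char_type_sketch(c: char) -> &'static str {
    match c {
        // "M" must come before "H": kanji numerals are also CJK ideographs.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十'
        | '百' | '千' | '万' | '億' | '兆' => "M",
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        'ぁ'..='ん' => "I",
        'ァ'..='ヴ' | 'ー' => "K",
        // ASCII letters plus full-width letters (U+FF21-FF3A, U+FF41-FF5A).
        'a'..='z' | 'A'..='Z' | '\u{FF41}'..='\u{FF5A}' | '\u{FF21}'..='\u{FF3A}' => "A",
        // ASCII digits plus full-width digits (U+FF10-FF19).
        '0'..='9' | '\u{FF10}'..='\u{FF19}' => "N",
        // CJK symbols and the remaining full-width forms.
        '\u{3000}'..='\u{303F}' | '\u{FF01}'..='\u{FF65}' => "P",
        _ => "O",
    }
}
```

Swapping the first two arms would classify 千 as "H", which is exactly the case the note above guards against.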
Chinese Character Types
| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ，, 《 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |
Chinese function words include:
- Structural particles: 的, 地, 得
- Aspect/modal particles: 了, 着, 过, 吗, 呢, 吧, 啊, 嘛
- Conjunctions: 和, 与, 或, 但, 而, 且, 及
- Prepositions: 在, 从, 到, 把, 被, 对, 向, 给
- Common grammatical verbs/adverbs: 是, 有, 不, 也, 都, 就, 要, 会, 能, 可
Korean Character Types
| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| E | Particles/Endings | High-frequency grammatical particles | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no batchim) | Hangul Syllable without final consonant | 가, 나, 하 |
| SF | Hangul (with batchim) | Hangul Syllable with final consonant | 한, 글, 각 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, ， |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |
Korean Hangul Syllable Detection
Korean uses a match guard for the SN and SF types. This leverages Unicode’s systematic Hangul encoding:
- Hangul Syllables occupy U+AC00–U+D7AF
- Each syllable is encoded as: (initial * 21 + medial) * 28 + final + 0xAC00
- If (codepoint - 0xAC00) % 28 == 0, the syllable has no final consonant (SN)
- Otherwise, it has a final consonant (SF, “받침”)
This distinction is important because the presence of a final consonant (받침) affects Korean word boundary patterns and particle attachment.
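The batchim test is a one-line arithmetic check. The function below is a sketch with a hypothetical name; it limits itself to the 11,172 precomposed syllables (U+AC00 to U+D7A3, that is 19 initials × 21 medials × 28 finals), which are the assigned code points inside the block mentioned above.

```rust
/// Returns "SN" / "SF" for a precomposed Hangul syllable, or None for anything else.
fn hangul_syllable_type(c: char) -> Option<&'static str> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    // The final-consonant index is the remainder modulo 28; index 0 means "no final".
    if (cp - 0xAC00) % 28 == 0 {
        Some("SN")
    } else {
        Some("SF")
    }
}
```

For example, 한 is U+D55C, and (0xD55C - 0xAC00) % 28 = 4, so it carries a final consonant and is classified as SF.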
Cross-Language Comparison
| Feature | Japanese | Chinese | Korean |
|---|---|---|---|
| Total types | 8 | 9 | 10 |
| Unique types | M, H, I, K | F, C, X, R, B | E, SN, SF, J, G |
| Shared types | P, A, N, O | P, A, N, O | P, A, N, O (H shared with JP) |
| Matching method | Range match | Range match | Range match + guard |
| WC features used | Yes | Yes | No |
Prediction Pipeline
This chapter provides a step-by-step walkthrough of how Segmenter::segment() processes input text.
Example: Segmenting “これはテストです。”
Step 1: Initialize Arrays with Padding
chars: ["B3", "B2", "B1"]
types: ["O", "O", "O" ]
tags: ["U", "U", "U", "U"]
The tags array gets one extra “U” because tags[3] represents the first real character’s tag (set to “Unknown” since there is no prior boundary decision).
Step 2: Scan Input Characters
For each character in the input, determine its type using language-specific patterns and append to the arrays:
chars: ["B3","B2","B1", "こ","れ","は","テ","ス","ト","で","す","。"]
types: ["O", "O", "O", "I", "I", "I", "K", "K", "K", "I", "I", "P"]
Step 3: Append End Sentinels
chars: [..., "。", "E1", "E2", "E3"]
types: [..., "P", "O", "O", "O" ]
Step 4: Iterate and Predict
For each position i from 4 up to, but not including, len(chars) - 3:
i=4 (れ): Extract features → predict → label=-1 (O) → word="これ"
i=5 (は): Extract features → predict → label=+1 (B) → push "これ", word="は"
i=6 (テ): Extract features → predict → label=+1 (B) → push "は", word="テ"
i=7 (ス): Extract features → predict → label=-1 (O) → word="テス"
i=8 (ト): Extract features → predict → label=-1 (O) → word="テスト"
i=9 (で): Extract features → predict → label=+1 (B) → push "テスト", word="で"
i=10(す): Extract features → predict → label=-1 (O) → word="です"
i=11(。): Extract features → predict → label=+1 (B) → push "です", word="。"
Step 5: Push Final Word
Push the remaining word “。” to the result.
Result
["これ", "は", "テスト", "です", "。"]
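The five steps above can be sketched as a single loop. This is a simplified illustration rather than Litsea's implementation: the name `segment_sketch` is hypothetical, the `types` array is omitted for brevity, and the `is_boundary` closure stands in for feature extraction plus AdaBoost prediction.

```rust
/// Greedy left-to-right segmentation loop (sketch).
/// `is_boundary(i, chars, tags)` is a hypothetical stand-in for
/// feature extraction plus prediction at position `i`.
fn segment_sketch<F>(sentence: &str, mut is_boundary: F) -> Vec<String>
where
    F: FnMut(usize, &[String], &[String]) -> bool,
{
    if sentence.is_empty() {
        return Vec::new();
    }
    // Steps 1-3: start padding, input characters, end sentinels.
    let mut chars: Vec<String> = ["B3", "B2", "B1"].iter().map(|s| s.to_string()).collect();
    chars.extend(sentence.chars().map(|c| c.to_string()));
    chars.extend(["E1", "E2", "E3"].iter().map(|s| s.to_string()));
    // tags[3] belongs to the first real character and stays "U".
    let mut tags: Vec<String> = vec!["U".to_string(); 4];

    // Step 4: decide for every real character after the first one.
    let mut words = Vec::new();
    let mut word = chars[3].clone();
    for i in 4..chars.len() - 3 {
        if is_boundary(i, &chars, &tags) {
            // Boundary: close the current word and start a new one.
            words.push(std::mem::take(&mut word));
            tags.push("B".to_string());
        } else {
            tags.push("O".to_string());
        }
        word.push_str(&chars[i]);
    }
    // Step 5: flush the last word.
    words.push(word);
    words
}
```

Passing a closure that reports a boundary at positions 5, 6, 9, and 11 reproduces the walkthrough result for “これはテストです。”.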
How Prediction Works at Each Position
At each position i, the segmenter:
- Extracts features – Calls get_attributes(i, tags, chars, types) to build a HashSet<String> of 38–42 features
- Computes score – The AdaBoost learner sums the model weights for all matching features plus the bias: score = bias + sum(model[feature] for feature in attributes)
- Makes decision – If score >= 0, the character starts a new word (boundary); otherwise, it continues the current word
- Updates tags – Pushes “B” or “O” to the tags array, which affects feature extraction for subsequent positions
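The scoring and decision steps can be sketched as follows. This is an illustration under stated assumptions: the model is represented as a plain `HashMap` from feature string to weight, the function names are hypothetical, and the bias follows the definition given for `AdaBoost::bias` in the API reference (the negated sum of all weights, divided by two).

```rust
use std::collections::{HashMap, HashSet};

/// Bias term: -sum(all model weights) / 2.0.
fn bias_sketch(model: &HashMap<String, f64>) -> f64 {
    -model.values().sum::<f64>() / 2.0
}

/// Returns +1 (boundary) when the score is non-negative, otherwise -1.
fn predict_sketch(model: &HashMap<String, f64>, attributes: &HashSet<String>) -> i8 {
    // Only features that appear in the model contribute to the score.
    let active: f64 = attributes.iter().filter_map(|f| model.get(f)).sum();
    let score = bias_sketch(model) + active;
    if score >= 0.0 {
        1
    } else {
        -1
    }
}
```

Features missing from the model simply contribute nothing, which is why unseen character n-grams do not cause errors at prediction time in this sketch.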
Training vs. Prediction
| Aspect | Training (process_corpus) | Prediction (segment) |
|---|---|---|
| Tags source | Pre-computed from the annotated corpus | Dynamically generated by the model |
| First tag | “U” (overrides “B” at position 3) | “U” (no prior decision) |
| Labels | Known from corpus (+1 or -1) | Predicted by AdaBoost |
| Features | Written to file via callback | Passed directly to predict() |
During training, tags are derived from the ground-truth corpus segmentation, so the model learns from correct boundary decisions. During prediction, tags are generated on-the-fly, meaning each decision depends on all previous predictions – this is a left-to-right greedy approach.
Performance Characteristics
The segmentation algorithm is linear in the length of the input:
- Each character position is visited once: O(n)
- Feature extraction at each position: O(1) (fixed number of features)
- Prediction at each position: O(f) where f is the number of active features (~38-42)
- Total: O(n * f) which is effectively O(n)
Language Support Overview
Litsea supports word segmentation for three languages through a unified framework based on the Language enum.
Supported Languages
| Language | Enum Variant | CLI Values | Feature Count | Pre-trained Model Accuracy |
|---|---|---|---|---|
| Japanese | Language::Japanese | japanese, ja | 42 | 94.15% |
| Chinese | Language::Chinese | chinese, zh | 42 | 80.72% |
| Korean | Language::Korean | korean, ko | 38 | 85.08% |
The Language Enum
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}
- Default is Japanese
- Implements FromStr – parses from full name or ISO 639-1 code (case-insensitive)
- Implements Display – outputs the lowercase full name
Parsing Examples
use litsea::language::Language;

let ja: Language = "japanese".parse().unwrap();
let zh: Language = "zh".parse().unwrap();
let ko: Language = "Korean".parse().unwrap(); // case-insensitive
let err = "french".parse::<Language>(); // Err(...)
How Languages Differ
Each language defines its own character type patterns that classify characters into type codes. These type codes are used as features for the AdaBoost classifier.
| Aspect | Japanese | Chinese | Korean |
|---|---|---|---|
| Character types | 8 (M, H, I, K, P, A, N, O) | 9 (F, C, X, R, P, B, A, N, O) | 10 (E, SN, SF, J, G, H, P, A, N, O) |
| WC features | Yes (4 extra) | Yes (4 extra) | No |
| Total features | 42 | 42 | 38 |
| Matching method | Range match | Range match | Range match + guard |
Why Korean Has Fewer Features
Korean Hangul syllables are classified into only two types: SN (without 받침/final consonant) and SF (with 받침). This binary distinction means WC features (word + character-type combinations) would produce redundant information with little discriminative power. Excluding them reduces noise and keeps the model compact.
Japanese
Japanese is the default language in Litsea.
Character Types
| Code | Name | Pattern | Examples |
|---|---|---|---|
| M | Kanji Numbers | [一二三四五六七八九十百千万億兆] | 一, 三, 千, 億 |
| H | Kanji / CJK | [一-龠々〆ヵヶ] | 漢, 字, 学, 々 |
| I | Hiragana | [ぁ-ん] | あ, い, う, を |
| K | Katakana | [ァ-ヴーｱ-ﾝﾞﾟ] | ア, カ, ー, ﾊ |
| P | Punctuation | CJK Symbols + Full-width | 。, 、, 「, 」 |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z, Ｂ |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |
Pattern Priority
Patterns are evaluated in order. Notably:
- M before H: Characters like 一 and 百 are classified as “Kanji Numbers” (M), not generic “Kanji” (H)
- This distinction helps the model learn number-specific boundary patterns
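The effect of ordering can be seen in a small sketch written as match arms. It covers only the M, H, and I classes from the table above, treats everything else as “O”, and uses a hypothetical function name; it is not Litsea's actual classifier.

```rust
/// Partial Japanese classifier (sketch): kanji numerals are listed
/// before the general kanji range, so the first arm wins for them.
fn japanese_char_type_sketch(c: char) -> &'static str {
    match c {
        // M: kanji numerals, checked first.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十'
        | '百' | '千' | '万' | '億' | '兆' => "M",
        // H: general kanji range plus a few special characters.
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        // I: hiragana.
        'ぁ'..='ん' => "I",
        // Remaining classes (K, P, A, N) are omitted in this sketch.
        _ => "O",
    }
}
```

If the two kanji arms were swapped, 一 and 百 would fall into the general range and be reported as “H”, and the numeral class would never be produced.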
Pre-trained Models
japanese.model
- Training corpus: UD Japanese-GSD
- Accuracy: 94.15%
- Precision: 95.57%
- Recall: 94.36%
RWCP.model
- Source: Extracted from the original TinySegmenter
- License: BSD 3-Clause (Taku Kudo)
- Size: ~22 KB
JEITA_Genpaku_ChaSen_IPAdic.model
- Training corpus: JEITA Project Sugita Genpaku corpus
- Tokenizer: ChaSen with IPAdic dictionary
- Size: ~17 KB
Example
echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" \
| litsea segment -l japanese ./models/japanese.model
Output:
Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。
Chinese
Litsea supports Chinese word segmentation covering both Simplified and Traditional Chinese.
Character Types
| Code | Name | Pattern | Examples |
|---|---|---|---|
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是, 和 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ，, 《, 》 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |
Chinese Function Words (虚词)
The “F” type captures high-frequency grammatical words that are critical for segmentation:
| Category | Characters |
|---|---|
| Structural particles | 的, 地, 得 |
| Aspect/modal particles | 了, 着, 过, 吗, 呢, 吧, 啊, 嘛 |
| Conjunctions | 和, 与, 或, 但, 而, 且, 及 |
| Prepositions | 在, 从, 到, 把, 被, 对, 向, 给 |
| Grammatical verbs/adverbs | 是, 有, 不, 也, 都, 就, 要, 会, 能, 可 |
These characters appear overwhelmingly in grammatical roles and signal word boundaries differently from content words.
Pre-trained Model
chinese.model
- Training corpus: UD Chinese-GSD
- Accuracy: 80.72%
Example
echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model
Korean
Litsea supports Korean word segmentation with specialized Hangul character type detection.
Character Types
| Code | Name | Pattern | Examples |
|---|---|---|---|
| E | Particles/Endings | [은는을를의에] | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no 받침) | Codepoint arithmetic | 가, 나, 하, 모 |
| SF | Hangul (with 받침) | Codepoint arithmetic | 한, 글, 각, 붙 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, ， |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |
Korean Particles (조사)
The “E” type captures six high-frequency grammatical particles:
| Character | Role | Name |
|---|---|---|
| 은/는 | Topic marker | 보조사 |
| 을/를 | Object marker | 목적격 조사 |
| 의 | Possessive | 관형격 조사 |
| 에 | Locative | 부사격 조사 |
These particles frequently appear at word boundaries and are given a distinct type code to improve segmentation accuracy.
Hangul Syllable Structure (받침 Detection)
Korean uses a match guard on the codepoint, rather than a plain character range, for the SN and SF types. This exploits the systematic Unicode Hangul encoding:
- Hangul Syllables: U+AC00–U+D7AF (11,172 syllables)
- Each syllable = (initial * 21 + medial) * 28 + final + 0xAC00
- SN (no 받침): (codepoint - 0xAC00) % 28 == 0
- SF (with 받침): (codepoint - 0xAC00) % 28 != 0
The 받침 (final consonant) distinction is linguistically significant because it affects how particles attach to words and where boundaries occur.
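The encoding formula can be inverted to recover the three component indices of a syllable. The following sketch uses a hypothetical function name and restricts the input to the 11,172 assigned syllables (U+AC00 to U+D7A3).

```rust
/// Decompose a precomposed Hangul syllable into
/// (initial, medial, final) indices. A final index of 0 means no 받침.
/// Hypothetical helper for illustration only.
fn decompose_hangul(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let offset = cp - 0xAC00;
    let fin = offset % 28; // 28 final slots (including "none")
    let medial = (offset / 28) % 21; // 21 medial vowels
    let initial = offset / (28 * 21); // 19 initial consonants
    Some((initial, medial, fin))
}
```

For 한 (U+D55C) the offset is 10588, which decomposes to initial 18 (ㅎ), medial 0 (ㅏ), and final 4 (ㄴ), so the syllable is classified as SF.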
No WC Features
Korean does not use WC (word + character-type) features. Since most Hangul syllables fall into only two types (SN and SF), WC features would produce low-entropy, noisy combinations that hurt model accuracy.
Pre-trained Model
korean.model
- Training corpus: UD Korean-GSD
- Accuracy: 85.08%
Example
echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model
Adding a New Language
Litsea’s multilingual framework is designed to be easily extensible. This guide explains how to add support for a new language.
Steps Overview
- Add a variant to the Language enum
- Implement Display and FromStr match arms
- Create a character classification function
- Register the classification function
- Decide on WC feature inclusion
- Prepare a training corpus and train a model
- Add tests
Step 1: Add a Variant to Language
In litsea/src/language.rs, add a new variant to the Language enum:
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
    Thai, // ← new language
}
Step 2: Implement Display and FromStr
Add match arms for the new language:
// In Display impl
Language::Thai => write!(f, "thai"),

// In FromStr impl
"thai" | "th" => Ok(Language::Thai),
Step 3: Create a Character Classification Function
Define a function that classifies a char into a type code for the new language. Classification is a direct match on character ranges (no regex), so each class is an arm; the first matching arm wins:
fn thai_char_type(c: char) -> &'static str {
    match c {
        // Thai consonants and sequential vowels (U+0E01-U+0E3A)
        '\u{0E01}'..='\u{0E3A}' => "T",
        // Thai vowels and tone marks (U+0E40-U+0E4E)
        '\u{0E40}'..='\u{0E4E}' => "V",
        // Thai digits (U+0E50-U+0E59)
        '\u{0E50}'..='\u{0E59}' => "N",
        // Shared classes: "P" (punctuation), "A" (Latin), "N" (digits)
        _ => punct_latin_digit(c).unwrap_or("O"),
    }
}
Design Tips for Character Types
- Identify linguistically distinct categories that correlate with word boundary patterns
- Order matters – match arms are tried top to bottom, so put more specific classes before general ones
- Consider high-frequency function words as a separate type (as Chinese does with “F”)
- Use match guards for logic beyond plain ranges (as Korean does to split syllables with/without 받침)
- Reuse the shared punct_latin_digit() helper for the common “P”/“A”/“N” classes
Step 4: Register the Classification Function
Add a match arm in Language::char_type():
pub fn char_type(&self, c: char) -> &'static str {
    match self {
        Language::Japanese => japanese_char_type(c),
        Language::Chinese => chinese_char_type(c),
        Language::Korean => korean_char_type(c),
        Language::Thai => thai_char_type(c), // ← new
    }
}
Step 5: Decide on WC Feature Inclusion
In segmenter.rs, the internal attribute builder (write_attributes()) has a match on the language to decide whether to include WC features:
match self.language {
    Language::Japanese | Language::Chinese => {
        // Include WC features
        attr!("WC1:{}{}", w3, c4);
        attr!("WC2:{}{}", c3, w4);
        attr!("WC3:{}{}", w3, c3);
        attr!("WC4:{}{}", w4, c4);
    }
    _ => {}
}
If your language’s character types have enough variety to make WC features informative, add it to the match arm. If your type system is low-entropy (like Korean’s SN/SF), it is better to exclude WC features.
Step 6: Prepare Corpus and Train a Model
- Prepare a corpus with words separated by spaces:
  word1 word2 word3 word4
- Extract features:
  litsea extract -l thai ./corpus.txt ./features.txt
- Train a model:
  litsea train -t 0.005 -i 1000 ./features.txt ./models/thai.model
Step 7: Add Tests
Add tests in both language.rs and segmenter.rs:
// In language.rs tests
#[test]
fn test_thai_char_types() {
    let lang = Language::Thai;
    assert_eq!(lang.char_type('ก'), "T"); // Thai consonant
    assert_eq!(lang.char_type('A'), "A"); // ASCII
    assert_eq!(lang.char_type('@'), "O"); // Other
}

// In segmenter.rs tests
#[test]
fn test_char_type_thai() {
    let segmenter = Segmenter::new(Language::Thai, None);
    assert_eq!(segmenter.char_type("ก"), "T");
}
Run all tests to verify:
cargo test --workspace
Library API Overview
The litsea crate provides a Rust API for word segmentation, model training, and feature extraction.
Installation
[dependencies]
litsea = "0.5.0"
Loading models from local files is synchronous and needs no async runtime. An async runtime such as tokio is only required when loading models over HTTP/HTTPS with the async load_model method.
Module Map
graph LR
A["litsea::segmenter"] --- B["Segmenter"]
C["litsea::adaboost"] --- D["AdaBoost"]
E["litsea::language"] --- F["Language"]
G["litsea::extractor"] --- H["Extractor"]
I["litsea::trainer"] --- J["Trainer, PosTrainer"]
K["litsea::error"] --- L["LitseaError, Result"]
M["litsea::perceptron"] --- N["AveragedPerceptron"]
O["litsea::upos"] --- P["Upos, SegmentLabel"]
Q["litsea::metrics"] --- R["BinaryMetrics, MulticlassMetrics"]
| Module | Primary Types | Purpose |
|---|---|---|
| litsea::segmenter | Segmenter | Word segmentation, joint segmentation with POS tagging |
| litsea::adaboost | AdaBoost | Binary classification, model I/O |
| litsea::perceptron | AveragedPerceptron | Multiclass classification (POS tagging), model I/O |
| litsea::upos | Upos, SegmentLabel | UPOS POS tags, segment labels |
| litsea::language | Language | Language definitions, character classification |
| litsea::extractor | Extractor | Feature extraction from corpus |
| litsea::trainer | Trainer, PosTrainer | Training orchestration |
| litsea::error | LitseaError, Result | Error type and result alias |
| litsea::metrics | BinaryMetrics, MulticlassMetrics | Evaluation metrics |
All primary types are also re-exported at the crate root, so use litsea::Segmenter; works as a shorthand for use litsea::segmenter::Segmenter;.
Quick Example
use std::path::Path;

use litsea::adaboost::AdaBoost;
use litsea::language::Language;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    let mut learner = AdaBoost::new(0.01, 100);
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;
    let segmenter = Segmenter::new(Language::Japanese, Some(learner));
    let tokens = segmenter.segment("これはテストです。");
    assert_eq!(tokens, vec!["これ", "は", "テスト", "です", "。"]);
    Ok(())
}
Quick Example (POS Tagging)
use std::path::Path;

use litsea::language::Language;
use litsea::perceptron::AveragedPerceptron;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    let mut pos_learner = AveragedPerceptron::new();
    pos_learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;
    let segmenter = Segmenter::with_pos_learner(Language::Japanese, pos_learner);
    let tokens = segmenter.segment_with_pos("これはテストです。");
    for (word, pos) in &tokens {
        print!("{}/{} ", word, pos);
    }
    println!();
    Ok(())
}
API Documentation
Full API documentation is available on docs.rs/litsea.
Segmenter
The Segmenter struct is the primary interface for word segmentation.
Definition
pub struct Segmenter {
    // private: language: Language,
    // private: learner: AdaBoost,
    // private: pos_learner: Option<AveragedPerceptron>,
}
The fields are private; use the accessor methods language(), learner(), learner_mut(), pos_learner(), and pos_learner_mut() to reach them.
Constructor
Segmenter::new
pub fn new(language: Language, learner: Option<AdaBoost>) -> Self
Creates a new segmenter.
- language – The language for character type classification
- learner – An optional pre-trained AdaBoost model. If None, a default (untrained) instance is created.
use litsea::language::Language;
use litsea::segmenter::Segmenter;

// With a pre-trained model
let segmenter = Segmenter::new(Language::Japanese, Some(learner));

// Without a model (for training or feature extraction)
let segmenter = Segmenter::new(Language::Japanese, None);
Methods
segment
pub fn segment(&self, sentence: &str) -> Vec<String>
Segments a sentence into words. Returns an empty vector for empty input.
let tokens = segmenter.segment("これはテストです。");
// ["これ", "は", "テスト", "です", "。"]
char_type
pub fn char_type(&self, ch: &str) -> &str
Classifies a single character into its type code using language-specific rules. The first character of the &str is classified; an empty string returns "O".
let segmenter = Segmenter::new(Language::Japanese, None);
assert_eq!(segmenter.char_type("あ"), "I"); // Hiragana
assert_eq!(segmenter.char_type("漢"), "H"); // Kanji
assert_eq!(segmenter.char_type("A"), "A"); // ASCII
add_corpus
pub fn add_corpus(&mut self, corpus: &str)
Processes a space-separated corpus and adds instances to the internal AdaBoost learner.
let mut segmenter = Segmenter::new(Language::Japanese, None);
segmenter.add_corpus("テスト です");
add_corpus_with_writer
pub fn add_corpus_with_writer<F>(&self, corpus: &str, writer: F)
where
    F: FnMut(HashSet<String>, i8),
Processes a corpus and calls the callback for each character position with its feature set and label.
segmenter.add_corpus_with_writer("テスト です", |attrs, label| {
    println!("Features: {:?}, Label: {}", attrs, label);
});
Accessors
pub fn language(&self) -> Language
pub fn learner(&self) -> &AdaBoost
pub fn learner_mut(&mut self) -> &mut AdaBoost
pub fn pos_learner(&self) -> Option<&AveragedPerceptron>
pub fn pos_learner_mut(&mut self) -> Option<&mut AveragedPerceptron>
Provide access to the segmenter’s language and its internal learners.
Feature extraction for a character position (38 features for Korean, 42 for Japanese/Chinese) is an internal detail; the former get_attributes method is now private.
Extractor
The Extractor struct extracts features from a corpus file for model training.
Definition
pub struct Extractor {
    segmenter: Segmenter,
}
Constructor
Extractor::new
pub fn new(language: Language) -> Self
Creates a new extractor for the specified language. Internally creates a Segmenter without a pre-trained model.
use litsea::extractor::Extractor;
use litsea::language::Language;

let mut extractor = Extractor::new(Language::Japanese);
Methods
extract
pub fn extract(
    &mut self,
    corpus_path: &Path,
    features_path: &Path,
) -> litsea::Result<()>
Reads a corpus file (space-separated words, one sentence per line) and writes the extracted features to the output file.
use std::path::Path;

extractor.extract(
    Path::new("./corpus.txt"),
    Path::new("./features.txt"),
)?;
Pipeline
flowchart LR
A["corpus.txt<br/>(space-separated words)"] --> B["Extractor::extract()"]
B --> C["features.txt<br/>(label + features per position)"]
The extractor:
- Reads each line from the corpus file
- Calls Segmenter::add_corpus_with_writer() to process each line
- Writes the label and feature set for each character position to the output file
Trainer
The Trainer struct orchestrates the full model training pipeline.
Definition
pub struct Trainer {
    learner: AdaBoost,
}
Constructor
Trainer::new
pub fn new(
    threshold: f64,
    num_iterations: usize,
    features_path: &Path,
) -> litsea::Result<Self>
Creates a trainer and initializes it from a features file. This calls AdaBoost::initialize_features() and AdaBoost::initialize_instances().
use std::path::Path;
use litsea::trainer::Trainer;

let mut trainer = Trainer::new(
    0.005,                       // threshold
    1000,                        // max iterations
    Path::new("./features.txt"), // features file
)?;
Methods
load_model
pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>
Loads an existing model for retraining. Supports file paths, file://, and (with the remote_model feature) http:// and https:// URIs.
When called after Trainer::new, the loaded weights are merged into the freshly initialized training data by feature name, so incremental training starts from the existing model without corrupting the feature index.
trainer.load_model("./models/japanese.model").await?;
train
pub fn train(
    &mut self,
    running: Arc<AtomicBool>,
    model_path: &Path,
) -> litsea::Result<BinaryMetrics>
Trains the model and saves it to the specified path. Returns evaluation metrics.
The running flag enables graceful interruption – set it to false to stop training early.
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::path::Path;

let running = Arc::new(AtomicBool::new(true));
let metrics = trainer.train(running, Path::new("./model.model"))?;
println!("Accuracy: {:.2}%", metrics.accuracy);
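The cooperative-cancellation pattern behind the running flag can be illustrated without Litsea itself. In this sketch, `run_iterations` is a hypothetical stand-in for a training loop that checks the shared flag before each iteration; how often the real trainer polls the flag is not specified here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a training loop: runs up to
/// `max_iterations` and stops as soon as the flag reads false.
/// Returns the number of iterations that actually completed.
fn run_iterations(running: &Arc<AtomicBool>, max_iterations: usize) -> usize {
    let mut completed = 0;
    for _ in 0..max_iterations {
        if !running.load(Ordering::SeqCst) {
            break; // graceful stop requested
        }
        // ... one unit of training work would go here ...
        completed += 1;
    }
    completed
}
```

In practice a clone of the `Arc` is typically handed to a signal handler or another thread, which calls `running.store(false, Ordering::SeqCst)` to request the stop while the loop owns the original.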
Full Training Example
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::path::Path;

use litsea::trainer::Trainer;

#[tokio::main]
async fn main() -> litsea::Result<()> {
    let mut trainer = Trainer::new(
        0.005,
        1000,
        Path::new("./features.txt"),
    )?;

    // Optionally resume from an existing model
    // trainer.load_model("./models/japanese.model").await?;

    let running = Arc::new(AtomicBool::new(true));
    let metrics = trainer.train(running, Path::new("./model.model"))?;

    println!("Accuracy: {:.2}%", metrics.accuracy);
    println!("Precision: {:.2}%", metrics.precision);
    println!("Recall: {:.2}%", metrics.recall);
    Ok(())
}
AdaBoost
The AdaBoost struct implements binary classification for word boundary detection.
Definition
pub struct AdaBoost {
    pub threshold: f64,
    pub num_iterations: usize,
    // internal fields: model weights, features, instances, etc.
}
Constructor
AdaBoost::new
pub fn new(threshold: f64, num_iterations: usize) -> Self
Creates a new AdaBoost instance with the specified hyperparameters.
use litsea::adaboost::AdaBoost;

let mut learner = AdaBoost::new(0.01, 100);
Model Loading
load_model_from_path
pub fn load_model_from_path(&mut self, path: &Path) -> litsea::Result<()>
Loads model weights from a local file, synchronously. This is the preferred method for local files – no async runtime is needed.
use std::path::Path;

learner.load_model_from_path(Path::new("./models/japanese.model"))?;
load_model_from_reader
pub fn load_model_from_reader<R: BufRead>(&mut self, reader: R) -> litsea::Result<()>
Loads model weights from any BufRead source, such as an in-memory buffer or an already-open file.
load_model
pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>
Loads model weights from a URI. Supports:
- Local file path: ./models/japanese.model
- File URI: file:///path/to/model
- HTTP: http://example.com/model (requires the remote_model feature)
- HTTPS: https://example.com/model (requires the remote_model feature)
learner.load_model("https://example.com/model").await?;
save_model
pub fn save_model(&self, filename: &Path) -> litsea::Result<()>
Saves model weights to a file. Returns an error if the model is empty.
Training Methods
initialize_features
pub fn initialize_features(&mut self, filename: &Path) -> litsea::Result<()>
Reads a features file and builds the feature index. Must be called before initialize_instances.
initialize_instances
pub fn initialize_instances(&mut self, filename: &Path) -> litsea::Result<()>
Reads the same features file and initializes labeled instances with their weights.
train
pub fn train(&mut self, running: Arc<AtomicBool>)
Runs the AdaBoost training loop. Set running to false to stop early.
add_instance
pub fn add_instance(&mut self, attributes: HashSet<String>, label: i8)
Adds a single training instance with its feature set and label.
Prediction
predict
pub fn predict(&self, attributes: &HashSet<String>) -> i8
Predicts the label for a given feature set. Returns +1 (boundary) or -1 (non-boundary).
use std::collections::HashSet;

let mut attrs = HashSet::new();
attrs.insert("UW4:は".to_string());
attrs.insert("UC4:I".to_string());
// ... more features
let label = learner.predict(&attrs);
// label == 1 (boundary) or -1 (non-boundary)
bias
pub fn bias(&self) -> f64
Returns the bias term: -sum(all model weights) / 2.0.
Evaluation
metrics
pub fn metrics(&self) -> BinaryMetrics
Calculates evaluation metrics on the training data.
BinaryMetrics
Defined in litsea::metrics (also re-exported as litsea::BinaryMetrics):
pub struct BinaryMetrics {
    pub accuracy: f64,  // Accuracy in percentage
    pub precision: f64, // Precision in percentage
    pub recall: f64,    // Recall in percentage
    pub num_instances: usize,
    pub true_positives: usize,
    pub false_positives: usize,
    pub false_negatives: usize,
    pub true_negatives: usize,
}
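The three percentage fields follow from the four counts by the standard definitions. The sketch below uses a hypothetical function name and assumes that a zero denominator yields 0.0; how Litsea handles that edge case is not stated in this reference.

```rust
/// Compute (accuracy, precision, recall) as percentages from
/// confusion-matrix counts. Zero denominators yield 0.0 (an assumption).
fn binary_percentages(tp: usize, fp: usize, fn_: usize, tn: usize) -> (f64, f64, f64) {
    let pct = |num: usize, den: usize| {
        if den == 0 {
            0.0
        } else {
            100.0 * num as f64 / den as f64
        }
    };
    let accuracy = pct(tp + tn, tp + fp + fn_ + tn); // correct / all
    let precision = pct(tp, tp + fp); // correct boundaries / predicted boundaries
    let recall = pct(tp, tp + fn_); // correct boundaries / true boundaries
    (accuracy, precision, recall)
}
```

With 8 true positives, 2 false positives, 2 false negatives, and 8 true negatives, all three values come out at 80%.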
Averaged Perceptron
The AveragedPerceptron struct implements multiclass classification for joint word segmentation and POS tagging.
Definition
pub struct AveragedPerceptron {
    // internal fields: weights, accumulated, timestamps, step, classes, instances
}
Constructor
AveragedPerceptron::new
pub fn new() -> Self
Creates a new empty Averaged Perceptron instance.
use litsea::perceptron::AveragedPerceptron;

let mut learner = AveragedPerceptron::new();
Adding Instances
add_instance
pub fn add_instance(&mut self, features: HashSet<String>, label: String)
Adds a training instance with a feature set and a label. Unknown classes are automatically registered.
use std::collections::HashSet;
use litsea::perceptron::AveragedPerceptron;

let mut learner = AveragedPerceptron::new();
let mut feats = HashSet::new();
feats.insert("UW4:猫".to_string());
feats.insert("UC4:H".to_string());
learner.add_instance(feats, "B-NOUN".to_string());
Training
train
pub fn train(&mut self, num_epochs: usize, running: Arc<AtomicBool>)
Runs the Averaged Perceptron training loop for the given number of epochs. Set running to false to stop early. Weights are automatically averaged at the end of training.
use std::sync::Arc;
use std::sync::atomic::AtomicBool;

let running = Arc::new(AtomicBool::new(true));
learner.train(10, running);
Prediction
predict
pub fn predict(&self, features: &HashSet<String>) -> String
Predicts the class label for a given feature set. Computes a score for each class and returns the class name with the highest score. Returns an empty string if no classes are registered.
use std::collections::HashSet;

let mut attrs = HashSet::new();
attrs.insert("UW4:は".to_string());
attrs.insert("UC4:I".to_string());
// ... more features
let label = learner.predict(&attrs);
// label == "B-ADP", "O", etc.
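The per-class scoring described for predict amounts to an argmax over class scores. The sketch below assumes a nested map layout (feature, then class, then weight) and a hypothetical function name; the actual internal layout of AveragedPerceptron is not documented here, and ties are resolved in favor of the earlier class as an arbitrary choice.

```rust
use std::collections::{HashMap, HashSet};

/// Score every class as the sum of the weights of the active features
/// and return the best-scoring class name ("" if there are no classes).
fn predict_class(
    weights: &HashMap<String, HashMap<String, f64>>, // feature -> class -> weight
    classes: &[String],
    features: &HashSet<String>,
) -> String {
    let mut best: Option<(&String, f64)> = None;
    for class in classes {
        let score: f64 = features
            .iter()
            .filter_map(|f| weights.get(f).and_then(|per_class| per_class.get(class)))
            .sum();
        // Keep the first class seen on ties.
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((class, score));
        }
    }
    best.map(|(c, _)| c.clone()).unwrap_or_default()
}
```

Because unknown features have no entry in the weight map, they are skipped rather than treated as errors.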
Model I/O
save_model
pub fn save_model(&self, path: &Path) -> litsea::Result<()>
Saves model weights to a file. Returns an error if the model is empty.
load_model_from_path
pub fn load_model_from_path(&mut self, path: &Path) -> litsea::Result<()>
Loads model weights from a local file, synchronously. This is the preferred method for local files – no async runtime is needed.
use std::path::Path;

learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;
load_model_from_reader
pub fn load_model_from_reader<R: BufRead>(&mut self, reader: R) -> litsea::Result<()>
Loads model weights from any BufRead source, such as an in-memory buffer or an already-open file.
load_model
pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>
Loads model weights from a URI. Supports the following URI schemes:
- Local file path: ./models/japanese_pos.model
- File URI: file:///path/to/model
- HTTP: http://example.com/model (requires the remote_model feature)
- HTTPS: https://example.com/model (requires the remote_model feature)
learner.load_model("https://example.com/models/japanese_pos.model").await?;
Evaluation
metrics
pub fn metrics(&self) -> MulticlassMetrics
Calculates evaluation metrics on the training data.
MulticlassMetrics
Defined in litsea::metrics (also re-exported as litsea::MulticlassMetrics):
pub struct MulticlassMetrics {
    pub accuracy: f64,                               // Overall accuracy in percentage
    pub macro_precision: f64,                        // Macro-averaged precision in percentage
    pub macro_recall: f64,                           // Macro-averaged recall in percentage
    pub num_instances: usize,                        // Number of instances
    pub correct_per_class: HashMap<String, usize>,   // Correct count per class
    pub predicted_per_class: HashMap<String, usize>, // Predicted count per class
    pub gold_per_class: HashMap<String, usize>,      // Gold label count per class
}
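Macro-averaged precision and recall can be derived from the three per-class maps. The sketch below is one plausible reading, not a statement of Litsea's exact behavior: it averages over the classes that appear in the gold labels and treats a zero denominator as 0.0, and the function name is hypothetical.

```rust
use std::collections::HashMap;

/// Compute (macro_precision, macro_recall) as percentages.
/// Averages over the classes present in `gold` (an assumption).
fn macro_precision_recall(
    correct: &HashMap<String, usize>,
    predicted: &HashMap<String, usize>,
    gold: &HashMap<String, usize>,
) -> (f64, f64) {
    let ratio = |num: usize, den: usize| {
        if den == 0 {
            0.0
        } else {
            num as f64 / den as f64
        }
    };
    if gold.is_empty() {
        return (0.0, 0.0);
    }
    let mut precision_sum = 0.0;
    let mut recall_sum = 0.0;
    for (class, &gold_count) in gold {
        let c = correct.get(class).copied().unwrap_or(0);
        let p = predicted.get(class).copied().unwrap_or(0);
        precision_sum += ratio(c, p); // per-class precision
        recall_sum += ratio(c, gold_count); // per-class recall
    }
    let n = gold.len() as f64;
    (100.0 * precision_sum / n, 100.0 * recall_sum / n)
}
```

Macro averaging weights every class equally, so rare tags such as INTJ or SYM influence these figures as much as frequent ones like NOUN.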
UPOS
The upos module defines the Universal POS (UPOS) tagset and segment label types used for POS tagging.
Upos
Definition
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Upos {
    ADJ,   // Adjective
    ADP,   // Adposition
    ADV,   // Adverb
    AUX,   // Auxiliary
    CCONJ, // Coordinating conjunction
    DET,   // Determiner
    INTJ,  // Interjection
    NOUN,  // Noun
    NUM,   // Numeral
    PART,  // Particle
    PRON,  // Pronoun
    PROPN, // Proper noun
    PUNCT, // Punctuation
    SCONJ, // Subordinating conjunction
    SYM,   // Symbol
    VERB,  // Verb
    X,     // Other
}
Litsea supports all 17 UPOS tags from the Universal Dependencies project:
| Tag | Description | Example (Japanese) |
|---|---|---|
| ADJ | Adjective | いい, 大きい |
| ADP | Adposition | は, が, を, に |
| ADV | Adverb | とても, まだ |
| AUX | Auxiliary | です, ます, た |
| CCONJ | Coordinating conjunction | と, や |
| DET | Determiner | この, その |
| INTJ | Interjection | ああ, はい |
| NOUN | Noun | 天気, 本 |
| NUM | Numeral | 一, 二, 100 |
| PART | Particle | ね, よ |
| PRON | Pronoun | これ, それ |
| PROPN | Proper noun | 東京, 太郎 |
| PUNCT | Punctuation | 。, 、 |
| SCONJ | Subordinating conjunction | ので, から |
| SYM | Symbol | %, $ |
| VERB | Verb | 読む, 書く |
| X | Other | (unclassified tokens) |
Constant
Upos::ALL
pub const ALL: [Upos; 17]
An associated constant holding an array of all 17 UPOS tags.
Trait Implementations
- Display: Converts to a string such as "NOUN", "VERB", etc.
- FromStr: Parses a string into Upos. Returns an error for invalid strings.
use litsea::upos::Upos;

let pos: Upos = "NOUN".parse().unwrap();
assert_eq!(pos.to_string(), "NOUN");
SegmentLabel
Definition
The SegmentLabel type combines word boundary detection with POS tagging. Each character position is assigned one of 18 labels:
- B(Upos) (17 labels): Word boundary with the given UPOS tag (e.g., B-NOUN, B-VERB)
- O (1 label): Non-boundary (continuation of the current word)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum SegmentLabel {
    B(Upos), // Start of a word (boundary). Carries POS information.
    O,       // Continuation of a word (non-boundary).
}
#![allow(unused)]
fn main() {
use litsea::upos::SegmentLabel;
// Segment labels for "今日は" (kyou wa)
// 今 → B-NOUN (start of "今日", tagged as NOUN)
// 日 → O (continuation of "今日")
// は → B-ADP (start of "は", tagged as ADP)
}
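As a rough illustration of how such labels map back to tokens, the sketch below rebuilds word/POS strings from a per-character label sequence. It is self-contained and uses plain strings for labels rather than Litsea's SegmentLabel type; `decode_labels` and the "X" fallback for a sequence that starts with O are assumptions made for this example, not part of the Litsea API.

```rust
/// Rebuilds `word/POS` tokens from characters and their per-character labels.
/// `decode_labels` is a hypothetical helper for illustration only.
fn decode_labels(chars: &[char], labels: &[&str]) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    let mut pos = "X"; // assumed fallback if the sequence starts with "O"

    for (c, label) in chars.iter().zip(labels.iter()) {
        if let Some(tag) = label.strip_prefix("B-") {
            // A boundary label closes the previous word and starts a new one.
            if !word.is_empty() {
                tokens.push(format!("{}/{}", word, pos));
                word.clear();
            }
            pos = tag;
        }
        // Both "B-*" and "O" labels add the character to the current word.
        word.push(*c);
    }
    if !word.is_empty() {
        tokens.push(format!("{}/{}", word, pos));
    }
    tokens
}
```

For 今日は with the labels B-NOUN, O, B-ADP, this yields the two tokens 今日/NOUN and は/ADP.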
Methods
all_labels
pub fn all_labels() -> Vec<SegmentLabel>
Returns a vector of all 18 segment labels (the 17 B-* labels plus O).
is_boundary
pub fn is_boundary(&self) -> bool
Returns whether this is a boundary label (B-*).
pos
pub fn pos(&self) -> Option<Upos>
Returns the UPOS tag. Returns None for the non-boundary label (O).
Trait Implementations
- Display: Converts to a string such as "B-NOUN", "O", etc.
- FromStr: Parses a string into SegmentLabel.
#![allow(unused)]
fn main() {
use litsea::upos::{SegmentLabel, Upos};
let label: SegmentLabel = "B-NOUN".parse().unwrap();
assert!(label.is_boundary());
assert_eq!(label.pos(), Some(Upos::NOUN));
let label_o: SegmentLabel = "O".parse().unwrap();
assert!(!label_o.is_boundary());
assert_eq!(label_o.pos(), None);
}
Language
The Language enum defines language-specific behavior, including character type classification.
Language Enum
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
#[default]
Japanese,
Chinese,
Korean,
}
}
Traits
- Default – Returns Language::Japanese
- Display – Returns lowercase name ("japanese", "chinese", "korean")
- FromStr – Parses from full name or ISO 639-1 code (case-insensitive)
Parsing
#![allow(unused)]
fn main() {
use litsea::language::Language;
// Full names
let ja: Language = "japanese".parse().unwrap();
let zh: Language = "chinese".parse().unwrap();
let ko: Language = "korean".parse().unwrap();
// ISO 639-1 codes
let ja: Language = "ja".parse().unwrap();
let zh: Language = "zh".parse().unwrap();
let ko: Language = "ko".parse().unwrap();
// Case-insensitive
let ko: Language = "KOREAN".parse().unwrap();
// Invalid
assert!("french".parse::<Language>().is_err());
}
char_type
pub fn char_type(&self, c: char) -> &'static str
Classifies a character into its language-specific type code. Returns "O" (Other) if the character does not belong to any class.
Classification is a direct match on character ranges – allocation-free, O(1), and with no regex involved.
#![allow(unused)]
fn main() {
use litsea::language::Language;
let lang = Language::Japanese;
assert_eq!(lang.char_type('あ'), "I");
assert_eq!(lang.char_type('漢'), "H");
assert_eq!(lang.char_type('@'), "O");
}
Internally, char_type dispatches to a private per-language function (japanese_char_type, chinese_char_type, korean_char_type). The classes common to all languages – "P" (punctuation), "A" (Latin), and "N" (digits) – are handled by a shared helper that is checked after the language-specific classes.
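The dispatch order described above can be sketched as follows. This is a simplified, self-contained approximation and not Litsea's actual implementation: the function names, the Unicode ranges, and the small punctuation set are assumptions, and classes such as katakana are omitted.

```rust
/// Shared classes, checked only after the language-specific ones.
/// The exact character sets here are assumptions for illustration.
fn common_char_type(c: char) -> &'static str {
    match c {
        'a'..='z' | 'A'..='Z' => "A",               // Latin letters
        '0'..='9' => "N",                           // digits
        '。' | '、' | '.' | ',' | '!' | '?' => "P", // a few punctuation marks
        _ => "O",                                   // anything else
    }
}

/// Simplified Japanese classifier: language-specific ranges first,
/// then the shared helper as a fallback.
fn japanese_char_type_sketch(c: char) -> &'static str {
    match c {
        '\u{3041}'..='\u{3096}' => "I", // Hiragana
        '\u{4E00}'..='\u{9FFF}' => "H", // CJK Unified Ideographs (kanji)
        _ => common_char_type(c),
    }
}
```

Because every arm is a range comparison on a single `char`, a classifier written this way needs no allocation and no regular expressions.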
CLI Reference Overview
The litsea CLI provides commands for feature extraction, model training, and word segmentation.
Usage
litsea <COMMAND> [OPTIONS] [ARGS]
Commands
| Command | Description |
|---|---|
| extract | Extract features from a corpus for training |
| train | Train a word segmentation model |
| segment | Segment text into words using a trained model |
Global Options
| Option | Description |
|---|---|
| -h, --help | Show help information |
| -V, --version | Show version number |
Typical Workflow
AdaBoost Workflow (Word Segmentation Only)
flowchart LR
A["1. scripts/download_udtreebank.sh"] --> B["2. scripts/corpus_udtreebank.sh"]
B --> C["3. litsea extract"]
C --> D["4. litsea train"]
D --> E["5. litsea segment"]
- Download a UD Treebank:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
- Convert to corpus format:
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
- Extract features:
litsea extract -l japanese corpus.txt features.txt
- Train a model:
litsea train -t 0.005 -i 1000 features.txt model.model
- Segment text:
echo "text" | litsea segment -l japanese model.model
POS Workflow (Word Segmentation with POS Tagging)
flowchart LR
A["1. scripts/download_udtreebank.sh"] --> B["2. scripts/corpus_udtreebank.sh -p"]
B --> C["3. litsea extract --pos"]
C --> D["4. litsea train --pos"]
D --> E["5. litsea segment --pos"]
- Download a UD Treebank:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
- Convert to POS corpus format:
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
- Extract POS features:
litsea extract --pos -l japanese pos_corpus.txt features_pos.txt
- Train a POS model:
litsea train --pos --num-epochs 10 features_pos.txt model_pos.model
- Segment with POS tags:
echo "text" | litsea segment --pos -l japanese model_pos.model
extract
Extract features from a corpus file for model training.
Usage
litsea extract [OPTIONS] <CORPUS_FILE> <FEATURES_FILE>
Arguments
| Argument | Description |
|---|---|
| CORPUS_FILE | Path to the input corpus file (words separated by spaces, one sentence per line) |
| FEATURES_FILE | Path to the output features file |
Options
| Option | Default | Description |
|---|---|---|
| -l, --language <LANGUAGE> | japanese | Language for character type classification. Accepts: japanese / ja, chinese / zh, korean / ko |
| --pos | off | Enable POS (Part-of-Speech) feature extraction mode. Requires a POS corpus as input |
Corpus Format
The input corpus must have words separated by spaces, one sentence per line:
Litsea は TinySegmenter を 参考 に 開発 さ れ た 。
Rust で 実装 さ れ た コンパクト な 単語 分割 ソフトウェア です 。
Output Format
The features file contains one line per character position:
1 UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1 UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...
- 1 = word boundary
- -1 = non-boundary
- Features are tab-separated
Examples
# Japanese
litsea extract -l japanese ./corpus.txt ./features.txt
# Chinese
litsea extract -l zh ./corpus_zh.txt ./features_zh.txt
# Korean
litsea extract -l ko ./corpus_ko.txt ./features_ko.txt
Output to stderr on success:
Feature extraction completed successfully.
POS Feature Extraction
When the --pos flag is specified, extract expects a POS corpus instead of a plain word-separated corpus. Each line contains words annotated with UPOS tags in the format word/POS:
POS Corpus Format
これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
今日/NOUN は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT
POS Feature Output Format
In POS mode, the label column uses segment labels (B-NOUN, B-VERB, …, B-X, O) instead of binary 1/-1:
B-NOUN UW1:B2 UW2:B1 UW3:こ UW4:れ UW5:は UC1:O UC2:O UC3:I UC4:I ...
O UW1:B1 UW2:こ UW3:れ UW4:は UW5:テ UC1:O UC2:I UC3:I UC4:I ...
POS Extraction Example
litsea extract --pos -l japanese ./pos_corpus.txt ./pos_features.txt
train
Train a word segmentation model using AdaBoost.
Usage
litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>
Arguments
| Argument | Description |
|---|---|
| FEATURES_FILE | Path to the input features file (output from extract) |
| MODEL_FILE | Path to the output model file |
Options
| Option | Default | Description |
|---|---|---|
| -t, --threshold <THRESHOLD> | 0.01 | Weak classifier accuracy threshold for early stopping. Lower values allow more iterations |
| -i, --num-iterations <NUM_ITERATIONS> | 100 | Maximum number of boosting iterations |
| -m, --load-model-uri <LOAD_MODEL_URI> | None | URI of an existing model to resume training from (file path or HTTP/HTTPS URL) |
| --pos | off | Enable POS (Part-of-Speech) training mode using Averaged Perceptron |
| -e, --num-epochs <NUM_EPOCHS> | 10 | Number of training epochs (POS mode only) |
Output
Training metrics are printed to stderr:
Result Metrics:
Accuracy: 94.15% ( 564133 / 599198 )
Precision: 95.57% ( 330454 / 345758 )
Recall: 94.36% ( 330454 / 350215 )
Confusion Matrix:
True Positives: 330454
False Positives: 15304
False Negatives: 19761
True Negatives: 233679
Ctrl+C Handling
Training supports graceful interruption:
- First Ctrl+C: Stops training and saves the model at its current state
- Second Ctrl+C: Exits immediately without saving
This allows you to stop long-running training sessions without losing progress.
Examples
Basic training:
litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model
Training with higher precision (lower threshold, more iterations):
litsea train -t 0.001 -i 5000 ./features.txt ./model.model
Retraining from an existing model:
litsea train -t 0.005 -i 1000 -m ./models/japanese.model \
./new_features.txt ./models/japanese_v2.model
Hyperparameter Tuning
| Parameter | Effect of Decreasing | Effect of Increasing |
|---|---|---|
| threshold | More iterations, potentially higher accuracy, longer training time | Fewer iterations, faster training, may underfit |
| num_iterations | Fewer boosting rounds, smaller model, may underfit | More rounds, larger model, potentially higher accuracy |
POS Model Training
When the --pos flag is specified, train uses the Averaged Perceptron algorithm instead of AdaBoost. This trains a multiclass classifier for joint word segmentation and POS tagging.
Usage
litsea train --pos [OPTIONS] <FEATURES_FILE> <MODEL_FILE>
POS Training Options
| Option | Default | Description |
|---|---|---|
| --pos | off | Enable POS training mode |
| -e, --num-epochs <NUM_EPOCHS> | 10 | Number of training epochs |
Examples
# Train a POS model from POS features
litsea train --pos -e 10 ./pos_features.txt ./models/japanese_pos.model
Output
POS training metrics are printed to stderr (macro-averaged precision and recall):
Result Metrics:
Accuracy: 98.34%
Macro Precision: 97.87%
Macro Recall: 91.67%
Ctrl+C Handling
As with AdaBoost training, POS training supports graceful interruption. The first Ctrl+C stops training and saves the model at its current state; a second Ctrl+C exits immediately without saving.
POS Hyperparameters
| Parameter | Effect of Decreasing | Effect of Increasing |
|---|---|---|
| num_epochs | Faster training, may underfit | Better accuracy, longer training, may overfit |
segment
Segment text into words using a trained model.
Usage
echo "text" | litsea segment [OPTIONS] <MODEL_URI>
Arguments
| Argument | Description |
|---|---|
| MODEL_URI | Path or URL to the trained model file. Supports: local file paths, file://, http://, https:// |
Options
| Option | Default | Description |
|---|---|---|
| -l, --language <LANGUAGE> | japanese | Language for character type classification. Accepts: japanese / ja, chinese / zh, korean / ko |
| --pos | off | Enable POS-tagged segmentation output. Requires a POS model trained with train --pos |
Input / Output
- Input: Reads from stdin, one sentence per line. Empty lines are skipped.
- Output: Writes to stdout, space-separated tokens, one line per input line.
Examples
Japanese:
echo "LitseaはTinySegmenterを参考に開発された。" \
| litsea segment -l japanese ./models/japanese.model
Litsea は TinySegmenter を 参考 に 開発 さ れ た 。
Chinese:
echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model
Korean:
echo "한국어 단어 분할 테스트입니다." \
| litsea segment -l korean ./models/korean.model
Processing a file:
cat input.txt | litsea segment -l japanese ./models/japanese.model > output.txt
Loading a model from a URL:
echo "テスト文です。" \
| litsea segment -l japanese https://example.com/models/japanese.model
POS-Tagged Segmentation (--pos)
When the --pos flag is specified, segmentation and POS tagging are performed simultaneously using an Averaged Perceptron model.
Usage
echo "text" | litsea segment --pos [OPTIONS] <MODEL_URI>
Output Format
Each token is output in word/POS format. POS tags conform to the UPOS tag set.
echo "今日はいい天気ですね。" \
| litsea segment --pos -l japanese ./models/japanese_pos.model
今日/NOUN は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT
Processing a File
cat input.txt | litsea segment --pos -l japanese ./models/japanese_pos.model > output.txt
Notes
- The --language flag must match the language the model was trained for
- Model loading is asynchronous and supports HTTP/HTTPS with TLS (rustls)
- The model URI is not restricted to local file paths – file://, http://, and https:// URIs are also accepted
- When using --pos, the model must be a POS model trained with train --pos
Training Guide
This guide walks you through training custom word segmentation and POS tagging models with Litsea.
Both workflows use Universal Dependencies (UD) Treebanks as the data source.
Word Segmentation (AdaBoost)
- Prepare a corpus from a UD Treebank:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
- Extract features from the corpus
- Train a model using AdaBoost
POS Tagging (Averaged Perceptron)
- Prepare a POS corpus from a UD Treebank:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
- Extract POS features:
litsea extract --pos -l japanese pos_corpus.txt features.txt
- Train a POS model:
litsea train --pos --num-epochs 10 features.txt model.txt
Additional Topics
- Evaluating Models – assess model quality
- Retraining Models – fine-tune existing models
Preparing a Corpus
A good training corpus is essential for model accuracy. This guide explains how to prepare one using Universal Dependencies (UD) Treebanks.
Data Source: UD Treebanks
Litsea uses UD Treebanks as the data source for both word segmentation and POS tagging. UD Treebanks provide high-quality, manually annotated data in CoNLL-U format for many languages.
Available Treebanks
| Language | Treebank | Repository |
|---|---|---|
| Japanese | UD Japanese-GSD | UD_Japanese-GSD |
| Chinese | UD Chinese-GSD | UD_Chinese-GSD |
| Korean | UD Korean-GSD | UD_Korean-GSD |
Step 1: Download a UD Treebank
Use scripts/download_udtreebank.sh to download a UD Treebank. It prints the path to the training CoNLL-U file to stdout:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
Supported languages: ja (Japanese, default), ko (Korean), zh (Chinese). Use -o to specify the output directory (default: current directory).
Corpus for Word Segmentation
For word segmentation (AdaBoost), the corpus must be a plain text file with:
- One sentence per line
- Words separated by spaces
太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。
Convert CoNLL-U to Word Segmentation Corpus
Use scripts/corpus_udtreebank.sh to convert a CoNLL-U file to corpus format:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
This converts the CoNLL-U data into space-separated words (one sentence per line).
Corpus for POS Tagging
For POS tagging (Averaged Perceptron), each word must be annotated with its POS tag.
POS Corpus Format
Each line represents one sentence, with words annotated as word/POS pairs separated by spaces:
これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Litsea/PROPN は/ADP 単語/NOUN 分割/NOUN ソフトウェア/NOUN です/AUX 。/PUNCT
The POS tags follow the Universal POS (UPOS) tagset with 17 categories: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
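To make the word/POS format concrete, here is a minimal, self-contained sketch of a parser for one corpus line. The function name `parse_pos_line` is hypothetical, and splitting at the last `/` is an assumption chosen so that a word containing `/` would still parse; Litsea's own reader may behave differently.

```rust
/// Splits a POS corpus line into (word, tag) pairs.
/// Returns None if any token lacks a `/` separator or has an empty side.
fn parse_pos_line(line: &str) -> Option<Vec<(String, String)>> {
    line.split_whitespace()
        .map(|token| {
            // Split at the last '/', so a word that itself contains '/' still parses.
            let (word, tag) = token.rsplit_once('/')?;
            if word.is_empty() || tag.is_empty() {
                return None;
            }
            Some((word.to_string(), tag.to_string()))
        })
        .collect()
}
```

Collecting into `Option<Vec<_>>` makes the whole line fail as soon as one token is malformed, which is a convenient way to reject bad corpus lines before feature extraction.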
Convert CoNLL-U to POS Corpus
Use scripts/corpus_udtreebank.sh with the -p flag to produce a POS corpus:
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Multi-word tokens and empty nodes are automatically handled during conversion.
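The conversion can be pictured with the following self-contained sketch. It is not the code of scripts/corpus_udtreebank.sh; it only relies on the standard CoNLL-U layout (tab-separated columns ID, FORM, LEMMA, UPOS, ...), and the choice to skip multi-word token ranges and empty nodes while keeping the individual words is an assumption about what "handled" means here. The function name `conllu_to_corpus` is hypothetical.

```rust
/// Converts CoNLL-U text into corpus lines (one sentence per line).
/// With `with_pos`, tokens are emitted as `word/UPOS`.
fn conllu_to_corpus(conllu: &str, with_pos: bool) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut tokens: Vec<String> = Vec::new();

    for line in conllu.lines() {
        let line = line.trim_end();
        if line.is_empty() {
            // A blank line ends the current sentence.
            if !tokens.is_empty() {
                sentences.push(tokens.join(" "));
                tokens.clear();
            }
            continue;
        }
        if line.starts_with('#') {
            continue; // comment / metadata line
        }
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() < 4 {
            continue; // not a token line
        }
        // IDs like "3-4" are multi-word token ranges; IDs like "1.1" are empty nodes.
        if cols[0].contains('-') || cols[0].contains('.') {
            continue;
        }
        if with_pos {
            tokens.push(format!("{}/{}", cols[1], cols[3]));
        } else {
            tokens.push(cols[1].to_string());
        }
    }
    // Flush the last sentence if the input does not end with a blank line.
    if !tokens.is_empty() {
        sentences.push(tokens.join(" "));
    }
    sentences
}
```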
Automated Corpus Preparation
Litsea includes helper scripts in the scripts/ directory that automate the UD Treebank download and conversion:
- scripts/download_udtreebank.sh – Downloads a UD Treebank and prints the path to the training CoNLL-U file
- scripts/corpus_udtreebank.sh – Converts a CoNLL-U file to Litsea corpus format
# Download UD Treebank and get CoNLL-U file path
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
# Generate word segmentation corpus
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
# Generate POS corpus
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Supported languages for download_udtreebank.sh: ja (Japanese, default), ko (Korean), zh (Chinese).
Corpus from Wikipedia Dump
For larger-scale training, you can build a corpus from a full Wikipedia dump using scripts/corpus_wikidump.sh. This extracts plain text with wicket, filters for actual sentences, and tokenizes with lindera.
Usage
# Japanese (default)
bash scripts/corpus_wikidump.sh jawiki-latest-pages-articles.xml.bz2 corpus_ja.txt
# Korean
bash scripts/corpus_wikidump.sh -l ko kowiki-latest-pages-articles.xml.bz2 corpus_ko.txt
# Chinese
bash scripts/corpus_wikidump.sh -l zh zhwiki-latest-pages-articles.xml.bz2 corpus_zh.txt
Options
| Option | Description | Default |
|---|---|---|
| -l lang | Language code: ja, ko, zh | ja |
| -n max_lines | Maximum sentence lines to process (0 = unlimited) | 100000 |
Sentence Filtering
The script applies two filters to keep only well-formed sentences:
- Sentence-ending punctuation – Lines must end with 。, ., !, or ?. This excludes section headers (e.g., “参考文献”), list items, and metadata.
- Minimum length – Lines must be at least 20 characters. This excludes short fragments and isolated labels.
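The two filters above amount to a simple predicate. The sketch below is a Rust illustration of that logic, not the shell script's actual code; `is_sentence` is a hypothetical name, and counting length in Unicode characters rather than bytes is an assumption.

```rust
/// Keeps a line only if it looks like a complete sentence.
/// Hypothetical helper mirroring the two filters described above.
fn is_sentence(line: &str) -> bool {
    const MIN_CHARS: usize = 20;
    let line = line.trim();
    // Filter 1: the line must end with sentence-final punctuation.
    let ends_ok = matches!(line.chars().last(), Some('。' | '.' | '!' | '?'));
    // Filter 2: minimum length, counted in Unicode characters (an assumption).
    ends_ok && line.chars().count() >= MIN_CHARS
}
```

A header such as 参考文献 fails both checks, while a short sentence like 短い文。 passes the punctuation check but is rejected by the length check.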
Tokenizer Dictionaries
| Language | Dictionary | Token Filter |
|---|---|---|
| Japanese (ja) | embedded://unidic | japanese_compound_word (numeral compound) |
| Korean (ko) | embedded://ko-dic | None |
| Chinese (zh) | embedded://cc-cedict | None |
Corpus Size Guidelines
The recommended corpus size depends on your use case:
| Size (sentence lines) | Use Case |
|---|---|
| ~10,000 | Minimum for prototyping and smoke tests |
| 50,000 – 100,000 | Practical range for model training |
| 100,000 – 500,000 | High-quality, robust models |
| Unlimited | Use full dump for maximum accuracy |
The default max_lines=100000 in corpus_wikidump.sh targets the practical-to-high-quality range.
Corpus Quality Tips
- Diversity – Include text from various domains (news, literature, web, etc.)
- Size – See Corpus Size Guidelines above for recommended sizes
- Consistency – Ensure consistent tokenization throughout the corpus
- Deduplication – Remove duplicate sentences to avoid bias
- Cleaning – Remove HTML tags, special formatting, and non-text content
Extracting Features
After preparing a corpus, the next step is to extract features for model training.
Command
litsea extract -l <LANGUAGE> <CORPUS_FILE> <FEATURES_FILE>
Example
litsea extract -l japanese ./corpus.txt ./features.txt
Output:
Feature extraction completed successfully.
What Happens Internally
flowchart TD
A["Read corpus line by line"] --> B["Split line into words"]
B --> C["Build chars, types, and tags arrays"]
C --> D["For each character position"]
D --> E["Extract 38-42 features"]
E --> F["Write label + features to file"]
- The Extractor reads each line from the corpus
- For each sentence, it creates a Segmenter context with character arrays, type arrays, and tag arrays
- For each character position (except the first), it extracts features and writes them with the correct label
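The labeling part of these steps can be sketched as below. This is a minimal illustration that follows the prose description and omits feature extraction entirely; `boundary_labels` is a hypothetical name, and it assumes that the label at a position states whether a new word starts at that character.

```rust
/// Returns one (character, label) pair per character position after the first.
/// Label 1 marks a position where a new word starts; -1 marks a continuation.
fn boundary_labels(line: &str) -> Vec<(char, i8)> {
    let mut out = Vec::new();
    for word in line.split_whitespace() {
        for (i, c) in word.chars().enumerate() {
            // The first character of each word is a boundary position.
            out.push((c, if i == 0 { 1 } else { -1 }));
        }
    }
    // The first position of the sentence gets no training instance.
    if !out.is_empty() {
        out.remove(0);
    }
    out
}
```

For the corpus line 今日 は 晴れ, the positions 日, は, 晴, れ receive the labels -1, 1, 1, -1.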
Feature File Format
Each line represents one character position:
1 UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:は ...
-1 UP1:U UP2:U UP3:B BP1:UB BP2:BU UW1:B1 UW2:は UW3:テ ...
- First column: label (1 = boundary, -1 = non-boundary)
- Remaining columns: features (tab-separated)
POS Feature Extraction
For POS tagging models, use the --pos flag to extract features with POS labels instead of binary boundary labels.
Command
litsea extract --pos -l <LANGUAGE> <CORPUS_FILE> <FEATURES_FILE>
Example
litsea extract --pos -l japanese ./corpus.txt ./features.txt
POS Labels
When extracting POS features, each character position is labeled with one of 18 segment labels instead of the binary 1/-1 labels:
- B-NOUN, B-VERB, B-ADJ, B-ADP, B-ADV, B-AUX, B-CCONJ, B-DET, B-INTJ, B-NUM, B-PART, B-PRON, B-PROPN, B-PUNCT, B-SCONJ, B-SYM, B-X – Word boundary with the corresponding POS tag
- O – Non-boundary (inside a word)
The feature template (character n-grams, type n-grams, etc.) is the same as for standard segmentation – only the label scheme differs.
POS Feature File Format
B-NOUN UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:は ...
O UP1:U UP2:U UP3:B BP1:UB BP2:BU UW1:B1 UW2:は UW3:テ ...
B-VERB UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:い ...
- First column: segment label (e.g., B-NOUN, O)
- Remaining columns: features (tab-separated)
File Size Expectations
The features file will be significantly larger than the corpus because each character position generates 38-42 feature strings. For a 1 MB corpus, expect a features file of roughly 50-100 MB.
Training Models
Once features are extracted, train a model using AdaBoost.
Command
litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>
Basic Example
litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model
Training Process
flowchart TD
A["Initialize features<br/>(read feature names)"] --> B["Initialize instances<br/>(read labels + features)"]
B --> C["AdaBoost training loop"]
C --> D{"Converged or<br/>max iterations?"}
D -->|No| C
D -->|Yes| E["Save model"]
E --> F["Output metrics"]
- Initialize features – Reads the features file to build the feature index
- Initialize instances – Reads again to load labeled instances and initial weights
- Training loop – Iteratively selects the best feature, updates model weights, and reweights instances
- Save model – Writes non-zero feature weights to the model file
- Output metrics – Prints accuracy, precision, recall, and confusion matrix
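The training loop can be illustrated with a textbook discrete AdaBoost over binary features. This is a sketch under stated assumptions, not Litsea's trainer: each weak learner predicts +1 when one feature is active and -1 otherwise, and the stopping rule (stop when the best weighted error is within `threshold` of 0.5) is one plausible reading of the threshold option. All names (`Instance`, `stump_predict`, `adaboost_sketch`) are hypothetical.

```rust
use std::collections::BTreeMap;

/// One training instance: label (+1.0 / -1.0) and the indices of its active features.
struct Instance {
    label: f64,
    features: Vec<usize>,
}

/// Weak learner for feature `f`: predict +1 if `f` is active, otherwise -1.
fn stump_predict(inst: &Instance, f: usize) -> f64 {
    if inst.features.contains(&f) { 1.0 } else { -1.0 }
}

/// Runs up to `max_rounds` of textbook AdaBoost; returns feature index -> weight.
fn adaboost_sketch(
    data: &[Instance],
    num_features: usize,
    threshold: f64,
    max_rounds: usize,
) -> BTreeMap<usize, f64> {
    let n = data.len();
    let mut weights = vec![1.0 / n as f64; n];
    let mut model: BTreeMap<usize, f64> = BTreeMap::new();

    for _ in 0..max_rounds {
        // Pick the feature whose weighted error is farthest from chance (0.5).
        let mut best = (0usize, 0.5f64);
        for f in 0..num_features {
            let err: f64 = data
                .iter()
                .zip(&weights)
                .filter(|(inst, _)| stump_predict(inst, f) != inst.label)
                .map(|(_, w)| *w)
                .sum();
            if (0.5 - err).abs() > (0.5 - best.1).abs() {
                best = (f, err);
            }
        }
        let (f, err) = best;
        if (0.5 - err).abs() < threshold {
            break; // no weak learner is sufficiently better than chance
        }
        // Clamp to avoid ln(0) when a feature separates the data perfectly.
        let err = err.clamp(1e-10, 1.0 - 1e-10);
        let alpha = 0.5 * ((1.0 - err) / err).ln();
        *model.entry(f).or_insert(0.0) += alpha;

        // Increase the weight of misclassified instances, then renormalize.
        for (inst, w) in data.iter().zip(weights.iter_mut()) {
            *w *= (-alpha * inst.label * stump_predict(inst, f)).exp();
        }
        let total: f64 = weights.iter().sum();
        for w in weights.iter_mut() {
            *w /= total;
        }
    }
    model
}
```

Features that are never selected never enter the map, which is why a model trained this way can be saved as a short list of non-zero weights.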
Hyperparameters
| Parameter | Flag | Default | Guidance |
|---|---|---|---|
| Threshold | -t | 0.01 | Start with 0.005. Lower values allow more iterations but increase training time |
| Iterations | -i | 100 | Start with 1000. Increase if accuracy is still improving when training stops |
Interpreting Output
Result Metrics:
Accuracy: 94.15% ( 564133 / 599198 )
Precision: 95.57% ( 330454 / 345758 )
Recall: 94.36% ( 330454 / 350215 )
Confusion Matrix:
True Positives: 330454
False Positives: 15304
False Negatives: 19761
True Negatives: 233679
- Accuracy – Percentage of correct predictions (both boundaries and non-boundaries)
- Precision – Of predicted boundaries, what fraction is correct
- Recall – Of actual boundaries, what fraction was found
- True Positives – Correctly predicted boundaries
- False Positives – Predicted boundary where there is none
- False Negatives – Missed actual boundaries
- True Negatives – Correctly predicted non-boundaries
Graceful Interruption
Press Ctrl+C once during training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.
POS Model Training
For training POS tagging models, use the --pos flag. POS models use the Averaged Perceptron algorithm (multiclass classifier) instead of AdaBoost (binary classifier).
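For readers unfamiliar with the algorithm, the following is a minimal multiclass averaged perceptron. It is a generic textbook sketch, not Litsea's implementation: the function names, the (feature, label) weight keys, and the naive "sum the weights after every example" averaging are all assumptions chosen for brevity.

```rust
use std::collections::HashMap;

type Weights = HashMap<(String, String), f64>;

/// Returns the label with the highest score; ties go to the earliest label.
/// Panics if `labels` is empty.
fn predict_label<'a>(weights: &Weights, features: &[&str], labels: &[&'a str]) -> &'a str {
    let mut best = (labels[0], f64::NEG_INFINITY);
    for label in labels {
        let score: f64 = features
            .iter()
            .map(|f| weights.get(&(f.to_string(), label.to_string())).copied().unwrap_or(0.0))
            .sum();
        if score > best.1 {
            best = (*label, score);
        }
    }
    best.0
}

/// Trains on (active features, gold label) pairs and returns the averaged weights.
fn train_averaged_perceptron(
    data: &[(Vec<&str>, &str)],
    labels: &[&str],
    epochs: usize,
) -> Weights {
    let mut weights: Weights = HashMap::new();
    let mut totals: Weights = HashMap::new();
    let mut steps = 0.0;

    for _ in 0..epochs {
        for (features, gold) in data {
            let predicted = predict_label(&weights, features, labels);
            if predicted != *gold {
                // Reward the gold label and penalize the wrong prediction.
                for f in features {
                    *weights.entry((f.to_string(), gold.to_string())).or_insert(0.0) += 1.0;
                    *weights.entry((f.to_string(), predicted.to_string())).or_insert(0.0) -= 1.0;
                }
            }
            // Naive averaging: accumulate the current weights after every example.
            for (key, w) in &weights {
                *totals.entry(key.clone()).or_insert(0.0) += *w;
            }
            steps += 1.0;
        }
    }
    totals.into_iter().map(|(k, t)| (k, t / steps)).collect()
}
```

Averaging over all intermediate weight vectors damps the effect of late updates, which is the usual motivation for preferring it over the plain perceptron. More epochs mean more passes of this loop over the data.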
POS Training Command
litsea train --pos --num-epochs 10 <FEATURES_FILE> <MODEL_FILE>
POS Training Example
litsea train --pos --num-epochs 10 ./features.txt ./models/japanese_pos.model
Averaged Perceptron vs AdaBoost
| Aspect | AdaBoost (Segmentation) | Averaged Perceptron (POS) |
|---|---|---|
| Classification | Binary (boundary / non-boundary) | Multiclass (18 segment labels) |
| Labels | 1, -1 | B-NOUN, B-VERB, …, O |
| Hyperparameters | Threshold, Iterations | Number of epochs |
| Model size | ~1.3-22 KB | ~8.4-19 MB |
POS Hyperparameters
| Parameter | Flag | Default | Guidance |
|---|---|---|---|
| Epochs | --num-epochs | 10 | Number of passes over the training data. Start with 10 and adjust based on metrics |
POS Training Output
Result Metrics:
Accuracy: 98.34%
Macro Precision: 97.87%
Macro Recall: 91.67%
- Accuracy – Percentage of correct predictions across all classes
- Macro Precision – Average precision across all POS classes
- Macro Recall – Average recall across all POS classes
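Macro averaging gives every class the same weight regardless of how often it occurs. The sketch below computes both values from parallel gold and predicted label sequences. It is an illustration only: `macro_precision_recall` is a hypothetical name, and treating an undefined ratio (a class that is never predicted, or never occurs) as 0 is an assumption that may differ from how Litsea computes its metrics.

```rust
use std::collections::BTreeMap;

/// Returns (macro_precision, macro_recall) for parallel gold/predicted label slices.
fn macro_precision_recall(gold: &[&str], pred: &[&str]) -> (f64, f64) {
    // Per class: (true positives, times predicted, times in gold).
    let mut counts: BTreeMap<&str, (u32, u32, u32)> = BTreeMap::new();
    for (g, p) in gold.iter().zip(pred.iter()) {
        if g == p {
            counts.entry(*g).or_default().0 += 1;
        }
        counts.entry(*p).or_default().1 += 1;
        counts.entry(*g).or_default().2 += 1;
    }
    let n = counts.len() as f64;
    if n == 0.0 {
        return (0.0, 0.0);
    }
    // An undefined ratio (zero denominator) is counted as 0.0 (an assumption).
    let ratio = |num: u32, den: u32| if den == 0 { 0.0 } else { num as f64 / den as f64 };
    let precision = counts.values().map(|&(tp, predicted, _)| ratio(tp, predicted)).sum::<f64>() / n;
    let recall = counts.values().map(|&(tp, _, actual)| ratio(tp, actual)).sum::<f64>() / n;
    (precision, recall)
}
```

A rare class that the model misses entirely pulls macro recall down sharply even when overall accuracy stays high, which is one way a gap like 98.34% accuracy versus 91.67% macro recall can arise.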
POS Graceful Interruption
Press Ctrl+C once during POS training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.
Evaluating Models
Understanding model quality is essential for producing good segmentation results.
Metrics
The train command outputs three key metrics after training:
Accuracy
Accuracy = (TP + TN) / Total Instances
The percentage of all character positions that were correctly classified (both boundaries and non-boundaries). This is the broadest measure of model quality.
Precision
Precision = TP / (TP + FP)
Of the boundaries the model predicted, the fraction that is correct. High precision means few false boundaries (little over-segmentation).
Recall
Recall = TP / (TP + FN)
Of the actual boundaries, the fraction the model found. High recall means few missed boundaries (little under-segmentation).
Confusion Matrix
| | Predicted Boundary (+1) | Predicted Non-boundary (-1) |
|---|---|---|
| Actual Boundary | True Positive (TP) | False Negative (FN) |
| Actual Non-boundary | False Positive (FP) | True Negative (TN) |
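The three formulas can be checked directly against the confusion matrix counts. The following self-contained sketch uses a hypothetical `Confusion` struct and `percent` helper; neither is part of the Litsea API.

```rust
/// Counts from a binary confusion matrix.
struct Confusion {
    tp: u64,
    fp: u64,
    fn_: u64, // `fn` is a Rust keyword, hence the trailing underscore
    tn: u64,
}

impl Confusion {
    /// (TP + TN) / total instances
    fn accuracy(&self) -> f64 {
        (self.tp + self.tn) as f64 / (self.tp + self.fp + self.fn_ + self.tn) as f64
    }
    /// TP / (TP + FP)
    fn precision(&self) -> f64 {
        self.tp as f64 / (self.tp + self.fp) as f64
    }
    /// TP / (TP + FN)
    fn recall(&self) -> f64 {
        self.tp as f64 / (self.tp + self.fn_) as f64
    }
}

/// Formats a ratio as a percentage with two decimals.
fn percent(ratio: f64) -> String {
    format!("{:.2}%", ratio * 100.0)
}
```

With the counts from the sample training output (TP 330454, FP 15304, FN 19761, TN 233679), these functions reproduce the reported 94.15% accuracy, 95.57% precision, and 94.36% recall.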
Pre-trained Model Benchmarks
| Model | Accuracy | Precision | Recall | Training Corpus |
|---|---|---|---|---|
| japanese.model | 94.15% | 95.57% | 94.36% | UD Japanese-GSD |
| korean.model | 85.08% | – | – | UD Korean-GSD |
| chinese.model | 80.72% | – | – | UD Chinese-GSD |
Improving Model Quality
If accuracy is unsatisfactory, consider:
- More training data – A larger and more diverse corpus
- Lower threshold – Try -t 0.001 to allow more boosting iterations
- More iterations – Try -i 5000 or higher
- Better corpus quality – Ensure consistent tokenization and clean text
- Retraining – Start from an existing model and train with additional data (see Retraining Models)
Retraining Models
You can improve an existing model by resuming training with new data.
Command
litsea train -t 0.005 -i 1000 -m <EXISTING_MODEL> <NEW_FEATURES_FILE> <OUTPUT_MODEL>
Example
# Extract features from new corpus
litsea extract -l japanese ./new_corpus.txt ./new_features.txt
# Retrain from existing model
litsea train -t 0.005 -i 1000 \
-m ./models/japanese.model \
./new_features.txt \
./models/japanese_v2.model
How It Works
flowchart LR
A["Existing model<br/>(weights)"] --> C["Trainer"]
B["New features"] --> C
C --> D["Retrained model<br/>(updated weights)"]
- The trainer initializes features and instances from the new features file
- It loads the existing model weights via -m
- Training continues with the loaded weights as a starting point
- The new model inherits all learned patterns and refines them with new data
Use Cases
- Domain adaptation – Fine-tune a general model on domain-specific text (e.g., medical, legal)
- Incremental improvement – Add more training data without retraining from scratch
- Error correction – Train on examples where the current model makes mistakes
Notes
- The output model can be the same path as the input model (overwrites)
- The -m flag accepts file paths, file://, http://, and https:// URIs
- Retraining starts from the existing weights, so fewer iterations may be needed
Model File Format
Litsea models are stored as simple plain-text files.
Format Specification
<feature_name>\t<weight>
<feature_name>\t<weight>
...
<bias>
- Each line (except the last) contains a feature name and its weight, separated by a tab character
- Zero-weight features are omitted to keep the file compact
- The last line contains the bias term as a single number
Example
BC1:IK 0.3456
BC2:KI -0.1234
UW4:は 0.5678
UC4:I 0.2345
...
-0.0891
Bias Reconstruction
When loading a model, the bias is reconstructed using:
bias_bucket_weight = -bias_value * 2 - sum(all_feature_weights)
During prediction:
bias = -sum(all_model_weights) / 2.0
score = bias + sum(model[feature] for feature in input_attributes)
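The two formulas above fit together as shown in this self-contained sketch. It parses the text format from this page and scores a set of active features. The function names, the use of the empty string as the key for the bias bucket, and the reading of a positive score as "boundary" are assumptions for illustration and may not match Litsea's internals.

```rust
use std::collections::BTreeMap;

/// Parses model text: `feature<TAB>weight` lines followed by a final bias line.
/// The bias bucket is stored under the empty key "" (an assumption).
fn load_model(text: &str) -> Option<BTreeMap<String, f64>> {
    let mut lines: Vec<&str> = text.lines().filter(|l| !l.trim().is_empty()).collect();
    let bias_value: f64 = lines.pop()?.trim().parse().ok()?;
    let mut model = BTreeMap::new();
    for line in lines {
        let (name, weight) = line.split_once('\t')?;
        model.insert(name.to_string(), weight.trim().parse::<f64>().ok()?);
    }
    // bias_bucket_weight = -bias_value * 2 - sum(all_feature_weights)
    let feature_sum: f64 = model.values().sum();
    model.insert(String::new(), -bias_value * 2.0 - feature_sum);
    Some(model)
}

/// Scores one position from its active features.
fn score(model: &BTreeMap<String, f64>, features: &[&str]) -> f64 {
    // bias = -sum(all_model_weights) / 2.0, which recovers the stored bias value.
    let bias = -model.values().sum::<f64>() / 2.0;
    // Features missing from the model contribute nothing.
    bias + features.iter().filter_map(|f| model.get(*f)).sum::<f64>()
}
```

Substituting the first formula into the second shows that the sum of all stored weights is -2 × bias_value, so the bias used at prediction time equals the number on the last line of the file.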
File Size
Model files are very compact:
| Model | Size | Source |
|---|---|---|
| japanese.model | ~2.9 KB | UD Japanese-GSD |
| korean.model | ~1.8 KB | UD Korean-GSD |
| chinese.model | ~1.3 KB | UD Chinese-GSD |
| RWCP.model | ~22 KB | Original TinySegmenter |
| JEITA_Genpaku_ChaSen_IPAdic.model | ~17 KB | JEITA corpus |
The compact size is a key advantage of Litsea – models can be embedded directly in applications or served over HTTP with minimal overhead.
Compatibility
- Model files are encoding-agnostic (feature names are stored as-is)
- The format is deterministic (features are sorted via BTreeMap)
- Models are forward-compatible – new features in the input that are not in the model are simply ignored during prediction
Remote Model Loading
Litsea supports loading models from HTTP/HTTPS URLs in addition to local files.
Supported URI Schemes
| Scheme | Example | Description |
|---|---|---|
| (none) | ./model.model | Local file path (default) |
| file:// | file:///path/to/model | Explicit file URI |
| http:// | http://example.com/model | HTTP URL |
| https:// | https://example.com/model | HTTPS URL |
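Scheme handling of this kind can be expressed as a small classification step. The sketch below is a simplified illustration rather than Litsea's loader: `ModelSource` and `classify_model_uri` are hypothetical names, and a real implementation would also need to handle details such as percent-encoding in file:// URIs.

```rust
/// Where a model should be loaded from.
#[derive(Debug, PartialEq)]
enum ModelSource<'a> {
    /// http:// or https:// URL, fetched over the network.
    Remote(&'a str),
    /// File system path, taken from a bare path or a file:// URI.
    LocalFile(&'a str),
}

/// Classifies a model URI by its scheme prefix.
fn classify_model_uri(uri: &str) -> ModelSource<'_> {
    if uri.starts_with("http://") || uri.starts_with("https://") {
        ModelSource::Remote(uri)
    } else if let Some(path) = uri.strip_prefix("file://") {
        ModelSource::LocalFile(path)
    } else {
        // No recognized scheme: treat the string as a local path (the default).
        ModelSource::LocalFile(uri)
    }
}
```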
CLI Usage
echo "テスト" | litsea segment -l japanese https://example.com/japanese.model
Library Usage
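use std::path::Path;
use litsea::adaboost::AdaBoost; // module path assumed
// The error type is assumed to convert into Box<dyn Error>.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut learner = AdaBoost::new(0.01, 100);
    // Local file (synchronous)
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;
    // HTTP/HTTPS URL (asynchronous)
    learner.load_model("https://example.com/models/japanese.model").await?;
    Ok(())
}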
Implementation Details
- HTTP client: reqwest with rustls (no OpenSSL dependency)
- Custom User-Agent: Litsea/<version>
- The load_model method is async because HTTP loading requires an async runtime
- For the CLI, tokio provides the async runtime
WASM Considerations
On wasm32 targets:
- Local file paths are not supported – file system access is unavailable
- file:// scheme is not supported
- HTTP/HTTPS loading works via the browser’s fetch API (through reqwest’s WASM support)
Error messages guide users to use URLs instead of file paths when running in WASM.
Benchmarking
Litsea includes a Criterion benchmark suite for measuring performance.
Running Benchmarks
cargo bench --bench bench
Or via the Makefile:
make bench
Benchmark Suite
The benchmarks are defined in litsea/benches/bench.rs:
| Benchmark | Description |
|---|---|
| segment_short/adaboost/{ja,zh,ko} | Segment a short sentence (AdaBoost) |
| segment_short/averaged_perceptron/{ja,zh,ko} | Segment + POS tag a short sentence |
| segment_long_japanese/{adaboost,averaged_perceptron} | Process the full Botchan novel (~300 KB) |
| get_type_hiragana | Character type classification |
| add_corpus | Corpus ingestion for training |
| predict_adaboost | Single AdaBoost prediction |
Models are loaded synchronously with load_model_from_path — no async runtime is involved in the benchmarks.
HTML Reports
Criterion generates detailed HTML reports with statistics and comparison graphs at:
target/criterion/report/index.html
Open this file in a browser after running benchmarks to view:
- Iteration times with confidence intervals
- Throughput measurements
- Comparison with previous runs (automatic regression detection)
Interpreting Results
Key performance factors:
- Segmentation is linear in input length (O(n))
- Character classification is a direct match on character ranges (a few nanoseconds; no setup cost)
- Prediction at each position depends on the number of features (38-42, constant)
- Model loading time is proportional to the model file size
Pre-trained Models
Litsea ships with several pre-trained models in the models/ directory.
Model Catalog
japanese.model
| Property | Value |
|---|---|
| Language | Japanese |
| Training Corpus | UD Japanese-GSD |
| Accuracy | 94.15% |
| Precision | 95.57% |
| Recall | 94.36% |
| File Size | ~2.9 KB |
korean.model
| Property | Value |
|---|---|
| Language | Korean |
| Training Corpus | UD Korean-GSD |
| Accuracy | 85.08% |
| File Size | ~1.8 KB |
chinese.model
| Property | Value |
|---|---|
| Language | Chinese (Simplified & Traditional) |
| Training Corpus | UD Chinese-GSD |
| Accuracy | 80.72% |
| File Size | ~1.3 KB |
RWCP.model
| Property | Value |
|---|---|
| Language | Japanese |
| Source | Extracted from the original TinySegmenter |
| License | BSD 3-Clause (Taku Kudo) |
| File Size | ~22 KB |
JEITA_Genpaku_ChaSen_IPAdic.model
| Property | Value |
|---|---|
| Language | Japanese |
| Training Corpus | JEITA Project Sugita Genpaku corpus |
| Tokenizer | ChaSen with IPAdic |
| File Size | ~17 KB |
POS Tagging Models
japanese_pos.model
| Property | Value |
|---|---|
| Language | Japanese |
| Algorithm | Averaged Perceptron |
| Training Corpus | UD Japanese-GSD (7,050 sentences) |
| Epochs | 10 |
| Accuracy | 98.34% |
| Macro Precision | 97.87% |
| Macro Recall | 91.67% |
| File Size | ~11 MB |
chinese_pos.model
| Property | Value |
|---|---|
| Language | Chinese (Simplified & Traditional) |
| Algorithm | Averaged Perceptron |
| Training Corpus | UD Chinese-GSD (3,997 sentences) |
| Epochs | 10 |
| Accuracy | 97.09% |
| Macro Precision | 97.31% |
| Macro Recall | 96.23% |
| File Size | ~19 MB |
korean_pos.model
| Property | Value |
|---|---|
| Language | Korean |
| Algorithm | Averaged Perceptron |
| Training Corpus | UD Korean-GSD (4,400 sentences) |
| Epochs | 10 |
| Accuracy | 95.33% |
| Macro Precision | 95.30% |
| Macro Recall | 87.69% |
| File Size | ~8.4 MB |
Usage
echo "これはテストです。" | litsea segment --pos -l japanese models/japanese_pos.model
Output:
これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Choosing a Model
- For Japanese, use japanese.model for the best accuracy, or RWCP.model for compatibility with the original TinySegmenter
- For Chinese, use chinese.model
- For Korean, use korean.model
- For POS tagging, use the corresponding *_pos.model (japanese_pos.model, chinese_pos.model, korean_pos.model) for joint word segmentation and POS tagging
- For domain-specific needs, consider training your own model or retraining an existing one
Sample Data
The resources/ directory contains sample data:
- bocchan.txt – Sample Japanese corpus from the novel “Botchan” by Natsume Soseki (~307 KB). Used for benchmarking.
License
Litsea is distributed under a dual license.
MIT License
The main Litsea codebase is licensed under the MIT License:
MIT License
Copyright (c) 2025 Minoru OSUKA
Copyright (c) 2022 ICHINOSE Shogo
BSD 3-Clause License
Code originally developed by Taku Kudo (TinySegmenter) is licensed under the BSD 3-Clause License:
Copyright (c) 2008, Taku Kudo
All rights reserved.
Full License Text
The complete license text is available in the LICENSE file in the repository.