Introduction

Litsea is an extremely compact word segmentation library implemented in Rust, inspired by TinySegmenter and TinySegmenterMaker.

Unlike traditional morphological analyzers such as MeCab and Lindera, Litsea does not rely on large-scale dictionaries. Instead, it performs word segmentation using a compact pre-trained model based on the AdaBoost binary classification algorithm. Litsea also supports joint word segmentation and POS (Part-of-Speech) tagging using the Averaged Perceptron multiclass classifier with the Universal POS (UPOS) tagset.

Key Features

Fast and safe Rust implementation – built with Rust’s safety guarantees and performance
Compact pre-trained models – segmentation model files are only a few kilobytes in size (POS models are larger, around 11 MB)
No dictionary dependency – segmentation is driven entirely by a statistical model
POS tagging – joint segmentation and Part-of-Speech tagging with UPOS tags via Averaged Perceptron
Multilingual support – Japanese, Chinese (Simplified/Traditional), and Korean
Model training capabilities – train custom models using AdaBoost or Averaged Perceptron with your own corpora
Remote model loading – load models from HTTP/HTTPS URLs or local files
Simple and extensible API – easy to integrate into Rust projects as a library

How It Works

Litsea treats word segmentation as a binary classification problem: for each character position in a sentence, the model predicts whether it is a word boundary (+1) or not a boundary (-1). The classifier uses character n-gram features and character type information specific to each language.

Input:  "LitseaはRust製です"
         ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
         O O O O B O B O B   ← boundary predictions
Output: ["Litsea", "は", "Rust製", "です"]
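The step from boundary predictions to tokens can be sketched in plain Rust. This is a minimal illustration that does not use Litsea's API: the function name `split_at_boundaries` and the hand-written flags are assumptions made for the example, and in practice the flags would come from the model.

```rust
/// Split `text` into words given one flag per character.
/// `is_boundary[i]` is true when character `i` starts a new word.
/// The first character always starts a word, whatever its flag says.
fn split_at_boundaries(text: &str, is_boundary: &[bool]) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, ch) in text.chars().enumerate() {
        let starts_word = is_boundary.get(i).copied().unwrap_or(false);
        if i > 0 && starts_word {
            // Close the word collected so far and start a new one.
            words.push(std::mem::take(&mut current));
        }
        current.push(ch);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

fn main() {
    let text = "LitseaはRust製です";
    // Hand-written word starts for illustration, not real model output.
    let starts = [0usize, 6, 7, 12];
    let flags: Vec<bool> = (0..text.chars().count())
        .map(|i| starts.contains(&i))
        .collect();
    // prints ["Litsea", "は", "Rust製", "です"]
    println!("{:?}", split_at_boundaries(text, &flags));
}
```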

POS Tagging

In addition to plain word segmentation, Litsea supports POS (Part-of-Speech) tagging. Using the Averaged Perceptron multiclass classifier, it predicts word boundaries and POS tags in a single pass.

For each character position, the model predicts one of 18 SegmentLabel classes:

B-NOUN, B-VERB, …, B-X (boundary labels for 17 POS tags)
O (non-boundary = continuation of the current word)

The POS tags follow the Universal Dependencies UPOS tagset (17 POS tags).

Input:  "今日はいい天気ですね。"
Output: 今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT
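As a rough illustration of how per-character labels turn into tagged words, the sketch below groups characters by label strings in the "B-NOUN" / "O" form. It is standalone Rust, not Litsea's `SegmentLabel` type; the function name `group_by_labels`, the use of plain strings, and the fallback to `X` for a leading "O" are assumptions made for the example.

```rust
/// Group characters into (word, tag) pairs from per-character labels.
/// A label is either "O" (continue the current word) or "B-<TAG>" (start a new word).
fn group_by_labels(text: &str, labels: &[&str]) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = Vec::new();
    for (ch, label) in text.chars().zip(labels.iter()) {
        match label.strip_prefix("B-") {
            Some(tag) => out.push((ch.to_string(), tag.to_string())),
            None => match out.last_mut() {
                Some((word, _)) => word.push(ch),
                // Assumption: an "O" with no open word starts one tagged X.
                None => out.push((ch.to_string(), "X".to_string())),
            },
        }
    }
    out
}

fn main() {
    // Labels written by hand to reproduce the example output above.
    let labels = [
        "B-X", "O", "B-ADP", "B-ADJ", "O", "B-NOUN", "O", "B-AUX", "O", "B-PART", "B-PUNCT",
    ];
    for (word, tag) in group_by_labels("今日はいい天気ですね。", &labels) {
        print!("{}/{} ", word, tag);
    }
    println!();
}
```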

Name Origin

There is a small plant called Litsea cubeba (Aomoji) in the same Lauraceae family as Lindera (Kuromoji). This is the origin of the name Litsea.

Current Version

Litsea v0.5.0 – Rust Edition 2024, minimum Rust version 1.87.

Getting Started

Welcome to Litsea! This section will help you get up and running quickly.

Litsea is a compact word segmentation library in Rust that supports both word segmentation (AdaBoost) and joint segmentation with POS tagging (Averaged Perceptron).

Next Steps

Installation – install Litsea from source or crates.io
Quick Start – segment your first sentence in minutes

Installation

Prerequisites

Rust 1.87 or later (stable channel) from rust-lang.org
Cargo (Rust’s package manager, included with Rust)

Installing the CLI Tool

From crates.io

cargo install litsea-cli

From Source

git clone https://github.com/mosuka/litsea.git
cd litsea
cargo build --release

The binary will be available at ./target/release/litsea.

Verify the installation:

./target/release/litsea --help

Using as a Library

Add Litsea to your project’s Cargo.toml:

[dependencies]
litsea = "0.5.0"

Note: Loading models from local files (load_model_from_path) is synchronous, so no async runtime is needed. An async runtime such as tokio is only required if you load models over HTTP/HTTPS with the async load_model method (enabled by the remote_model feature, which is on by default).

Supported Platforms

Litsea is tested on the following platforms:

OS	Architecture
Linux	x86_64, aarch64
macOS	x86_64 (Intel), aarch64 (Apple Silicon)
Windows	x86_64, aarch64

Quick Start

CLI Quick Start

Segmenting Text

Litsea ships with pre-trained models in the models/ directory. Pipe text into the segment command:

Japanese:

echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" \
  | litsea segment -l japanese ./models/japanese.model

Output:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。

Chinese:

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model

Korean:

echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model

POS Tagging

Litsea can perform joint word segmentation and POS tagging using a POS model. Add the --pos flag to the segment command:

echo "今日はいい天気ですね。" \
  | litsea segment --pos -l japanese ./models/japanese_pos.model

Output:

今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

Each token is annotated with a Universal POS (UPOS) tag.

Library Quick Start

Here is a minimal Rust program that loads a model and segments text:

use std::path::Path;

use litsea::adaboost::AdaBoost;
use litsea::language::Language;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    // Load the pre-trained model
    let mut learner = AdaBoost::new(0.01, 100);
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    // Create a segmenter
    let segmenter = Segmenter::new(Language::Japanese, Some(learner));

    // Segment text
    let tokens = segmenter.segment("これはテストです。");
    println!("{}", tokens.join(" "));
    // Output: これ は テスト です 。

    Ok(())
}

POS Tagging with the Library

Here is a minimal Rust program that loads a POS model and segments text with POS tags:

use std::path::Path;

use litsea::language::Language;
use litsea::perceptron::AveragedPerceptron;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    // Load the pre-trained POS model
    let mut pos_learner = AveragedPerceptron::new();
    pos_learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

    // Create a segmenter with POS support
    let segmenter = Segmenter::with_pos_learner(Language::Japanese, pos_learner);

    // Segment text with POS tags
    let tokens = segmenter.segment_with_pos("今日はいい天気ですね。");
    for (word, pos) in &tokens {
        print!("{}/{} ", word, pos);
    }
    // Output: 今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

    Ok(())
}

What’s Next

CLI Reference – learn all CLI commands and options
Training Guide – train your own models
Architecture – understand how Litsea works internally

Architecture Overview

Litsea is designed as a compact, dictionary-free word segmentation system. It treats word segmentation as a binary classification problem and uses AdaBoost to learn word boundary patterns from character-level features.

High-Level Data Flow

Litsea has two main workflows: training and segmentation.

Training Pipeline

flowchart LR
    A["Corpus (text)"] --> B["Extractor"]
    B --> C["Features File (.txt)"]
    C --> D["Trainer (AdaBoost)"]
    D --> E["Model File (.model)"]

Corpus preparation – Prepare text with words separated by spaces
Feature extraction – The Extractor reads the corpus, classifies characters by type, and outputs labeled feature vectors
Model training – The Trainer feeds features into AdaBoost, which iteratively selects the most informative features and produces a compact model

Segmentation Pipeline

flowchart LR
    F["Raw text"] --> G["Segmenter (AdaBoost)"]
    H["Model file"] --> G
    G --> I["Segmented words"]

Model loading – Load a pre-trained model (from file or URL)
Character classification – For each character in the input, determine its type code based on language-specific patterns
Feature extraction – Build a feature set for each character position using a sliding window
Prediction – AdaBoost predicts whether each position is a word boundary

Design Principles

No dictionary dependency – Unlike MeCab or Lindera, Litsea relies solely on a statistical model learned from character patterns
Compact models – Model files are typically 1-22 KB, containing only the feature weights that matter
Language-agnostic framework – The core algorithm is the same for all languages; only the character type patterns differ
Simple extensibility – Adding a new language requires only defining character type patterns and training a model

Workspace Structure

Litsea is organized as a Cargo workspace with two crates and supporting directories.

Directory Layout

litsea/
├── Cargo.toml              # Workspace manifest
├── Cargo.lock              # Dependency lock file
├── Makefile                # Build convenience targets
├── rustfmt.toml            # Rust formatting configuration
├── LICENSE                 # MIT
├── README.md               # Project overview
├── litsea/                 # Core library crate
│   ├── Cargo.toml
│   ├── src/
│   │   ├── lib.rs          # Module declarations and version
│   │   ├── adaboost.rs     # AdaBoost algorithm
│   │   ├── segmenter.rs    # Word segmentation
│   │   ├── extractor.rs    # Feature extraction from corpus
│   │   ├── trainer.rs      # Training orchestration
│   │   ├── language.rs     # Language definitions and char patterns
│   │   └── util.rs         # URI scheme utilities
│   └── benches/
│       └── bench.rs        # Criterion benchmarks
├── litsea-cli/             # CLI binary crate
│   ├── Cargo.toml
│   └── src/
│       └── main.rs         # CLI entry point
├── models/                 # Pre-trained models
│   ├── japanese.model
│   ├── chinese.model
│   ├── korean.model
│   ├── RWCP.model
│   └── JEITA_Genpaku_ChaSen_IPAdic.model
├── resources/              # Sample data and test fixtures
│   └── bocchan.txt         # Sample corpus
├── scripts/                # Corpus preparation utilities
│   ├── download_udtreebank.sh      # Download UD Treebanks (prints CoNLL-U file path)
│   ├── corpus_udtreebank.sh           # Convert CoNLL-U to Litsea corpus format
│   └── wikitexts.sh        # Download and prepare Wikipedia text data
├── docs/                   # mdbook documentation (this book)
└── .github/
    └── workflows/          # CI/CD pipelines
        ├── regression.yml  # Test on push/PR
        ├── release.yml     # Release builds and publishing
        └── periodic.yml    # Weekly stability tests

Crate Details

`litsea` (Core Library)

The core library provides all segmentation, training, and model I/O functionality.

Dependency	Version	Purpose
`thiserror`	2.0	Error type derivation
`reqwest`	0.13	HTTP/HTTPS model loading (rustls)
`tokio`	1.49	Async runtime for remote model loading
`criterion`	0.8	Benchmarking (dev dependency)
`tempfile`	3.25	Temporary files for tests (dev dependency)

`litsea-cli` (CLI Binary)

The CLI provides a command-line interface to Litsea’s functionality.

Dependency	Version	Purpose
`clap`	4.5	Command-line argument parsing
`ctrlc`	3.5	Graceful Ctrl+C handling during training
`tokio`	1.49	Async runtime
`litsea`	0.5	Core library (workspace member)

Workspace Configuration

The workspace uses Cargo resolver version 3 (Rust Edition 2024):

[workspace]
resolver = "3"
members = ["litsea", "litsea-cli"]

[workspace.package]
version = "0.5.0"
edition = "2024"
rust-version = "1.87"

Shared dependencies are defined at the workspace level in [workspace.dependencies] and referenced by each crate with { workspace = true }.

Module Design

The litsea library crate is organized into focused modules, each with a clear responsibility.

Module Dependency Graph

graph TD
    language["language.rs<br/>Character classification"]
    segmenter["segmenter.rs<br/>Segmentation + POS tagging"]
    adaboost["adaboost.rs<br/>AdaBoost (boundaries)"]
    perceptron["perceptron.rs<br/>Averaged Perceptron (POS)"]
    upos["upos.rs<br/>UPOS tags and labels"]
    extractor["extractor.rs<br/>Feature extraction"]
    trainer["trainer.rs<br/>Training orchestration"]
    model_io["model_io.rs (private)<br/>Model URI loading"]
    error["error.rs<br/>LitseaError / Result"]
    metrics["metrics.rs<br/>Evaluation metrics"]

    language --> segmenter
    upos --> segmenter
    adaboost --> segmenter
    perceptron --> segmenter
    segmenter --> extractor
    adaboost --> trainer
    perceptron --> trainer
    model_io --> adaboost
    model_io --> perceptron
    error --> adaboost
    error --> perceptron
    metrics --> trainer

Module Details

`language.rs` – Language Definitions

Defines the Language enum and character type classification.

Language – Enum with variants Japanese, Chinese, Korean
- Implements FromStr (parses "japanese", "ja", "chinese", "zh", "korean", "ko")
- Implements Display (outputs lowercase name)
- char_type(c: char) -> &'static str – Classifies a character with a direct match on character ranges (allocation-free; no regex). Language-specific functions (japanese_char_type, etc.) share a punct_latin_digit() helper for the common "P"/"A"/"N" classes.

`segmenter.rs` – Word Segmentation and POS Tagging

The main user-facing module.

Segmenter – Holds a Language, an AdaBoost learner, and an optional AveragedPerceptron POS learner (fields are private; use language(), learner(), learner_mut(), pos_learner(), pos_learner_mut())
- new(language, learner) – Create a segmenter with an optional pre-trained model
- with_pos_learner(language, pos_learner) – Create a segmenter for joint segmentation + POS tagging
- segment(sentence) – Segment text into words, returns Vec<String>
- segment_with_pos(sentence) – Segment and tag, returns Vec<(String, Upos)>
- char_type(ch) – Classify a single character into its type code
- add_corpus(corpus) / add_corpus_with_pos(corpus) – Add training data
- add_corpus_with_writer(corpus, callback) / add_corpus_with_pos_writer(corpus, callback) – Process a corpus with a custom callback

`adaboost.rs` – AdaBoost Algorithm

The binary classifier used for word boundary decisions.

AdaBoost
- new(threshold, num_iterations) – Create with training parameters
- initialize_features(path) / initialize_instances(path) – Load training data
- train(running) – Run the AdaBoost training loop
- predict(&attributes) – Predict boundary (+1) or non-boundary (-1)
- load_model(uri) (async) / load_model_from_path(path) / load_model_from_reader(reader) – Load model weights
- save_model(path) – Save model weights to a file
- metrics() – Calculate accuracy, precision, and recall (BinaryMetrics)
- bias() – Get the model’s bias term

`perceptron.rs` – Averaged Perceptron

The multiclass classifier used for joint segmentation + POS tagging.

AveragedPerceptron
- add_instance(features, label) – Add a training instance
- train(num_epochs, running) – Train with weight averaging
- predict(&features) – Predict the best class label
- load_model(uri) (async) / load_model_from_path(path) / load_model_from_reader(reader) – Load model weights
- save_model(path) – Save model weights
- metrics() – Macro-averaged evaluation (MulticlassMetrics)
Weights are stored in a feature → per-class vector layout for fast inference.

`upos.rs` – Universal POS Tags

Upos – The 17 Universal Dependencies POS tags (NOUN, VERB, …)
SegmentLabel – Combined segmentation + POS label per character position (B(Upos) or O), with Display/FromStr for the "B-NOUN" / "O" string form

`extractor.rs` – Feature Extraction

Extracts features from a corpus for model training.

Extractor – Wraps a Segmenter to process corpus files
- new(language) – Create an extractor for a specific language
- extract(corpus_path, features_path) – Read a corpus, write a features file
- extract_with_pos(corpus_path, features_path) – Same for POS-tagged corpora

`trainer.rs` – Training Orchestration

High-level training workflows.

Trainer – Segmentation model training (AdaBoost)
- new(threshold, num_iterations, features_path) – Initialize from a features file
- load_model(uri) – Optionally load an existing model for incremental training (async)
- train(running, model_path) – Train and save, returns BinaryMetrics
PosTrainer – POS model training (Averaged Perceptron)
- new(num_epochs, features_path) / load_model(uri) / train(running, model_path) returning MulticlassMetrics

`error.rs` – Error Handling

LitseaError – Error enum (Io, InvalidData, InvalidInput, Unsupported, and Download with the remote_model feature)
Result<T> – Alias used by every fallible API

`metrics.rs` – Evaluation Metrics

BinaryMetrics – Accuracy, precision, recall, confusion matrix (AdaBoost)
MulticlassMetrics – Accuracy and macro-averaged precision/recall (Averaged Perceptron)

`model_io.rs` – Model Loading I/O (private)

Internal module that resolves a model URI (plain path, file://, or http(s):// with the remote_model feature) and returns the raw model bytes. Not part of the public API.

Public Exports

The library’s lib.rs exposes the public modules and re-exports the main types:

pub mod adaboost;
pub mod error;
pub mod extractor;
pub mod language;
pub mod metrics;
mod model_io;
pub mod perceptron;
pub mod segmenter;
pub mod trainer;
pub mod upos;

pub use adaboost::AdaBoost;
pub use error::{LitseaError, Result};
pub use extractor::Extractor;
pub use language::Language;
pub use metrics::{BinaryMetrics, MulticlassMetrics};
pub use perceptron::AveragedPerceptron;
pub use segmenter::Segmenter;
pub use trainer::{PosTrainer, Trainer};
pub use upos::{SegmentLabel, Upos};

pub fn version() -> &'static str { ... }

AdaBoost Binary Classification

Litsea uses the AdaBoost (Adaptive Boosting) algorithm for binary classification to determine word boundaries. This chapter explains the algorithm as implemented in Litsea.

Overview

AdaBoost combines many weak learners (simple classifiers) into a strong ensemble classifier. In Litsea:

Positive label (+1) = word boundary
Negative label (-1) = non-boundary (continuation of the current word)
Weak learners = individual features (each feature is a binary “stump” – present or absent)

Training Algorithm

The training loop in AdaBoost::train() works as follows:

Initialization

Load features and instances from the training file
Initialize instance weights uniformly (later adjusted based on initial score)
All model weights start at zero

Iterative Boosting

For each iteration t (up to num_iterations):

Step 1: Calculate weighted errors

For each feature h, compute its weighted error over all instances:

error[h] -= D[i] * y[i]   (for each instance i that has feature h)

where D[i] is the instance weight and y[i] is the true label.

Step 2: Select the best weak learner

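Find the feature whose weighted error rate is farthest from 0.5. A feature that is wrong most of the time is as informative as one that is right most of the time; it simply receives a negative weight in Step 4: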

error_rate(h) = (error[h] + positive_weight_sum) / instance_weight_sum
h_best = argmax_h |0.5 - error_rate(h)|

The baseline competitor is the “all-negative” classifier (always predicts -1), whose error rate equals the fraction of positive instances. Any real feature must beat this baseline.
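Steps 1 and 2 can be sketched on a toy data set as follows. This is standalone Rust written for illustration, not Litsea's `AdaBoost::train()`; the `Instance` struct and the `inst` and `select_best` helpers are hypothetical names, and the baseline "all-negative" competitor is left out for brevity.

```rust
use std::collections::HashMap;

/// One training instance: active features, label (+1 or -1), and weight D[i].
struct Instance {
    features: Vec<&'static str>,
    label: f64,
    weight: f64,
}

/// Hypothetical helper that builds an instance with weight 0.25.
fn inst(features: &[&'static str], label: f64) -> Instance {
    Instance { features: features.to_vec(), label, weight: 0.25 }
}

/// Return the feature whose weighted error rate is farthest from 0.5, with that rate.
fn select_best(instances: &[Instance]) -> Option<(&'static str, f64)> {
    let weight_sum: f64 = instances.iter().map(|x| x.weight).sum();
    let positive_sum: f64 = instances
        .iter()
        .filter(|x| x.label > 0.0)
        .map(|x| x.weight)
        .sum();

    // Step 1: error[h] -= D[i] * y[i] for every instance i that has feature h.
    let mut error: HashMap<&'static str, f64> = HashMap::new();
    for x in instances {
        for &h in &x.features {
            *error.entry(h).or_insert(0.0) -= x.weight * x.label;
        }
    }

    // Step 2: convert to error rates and pick the one farthest from 0.5.
    error
        .into_iter()
        .map(|(h, e)| (h, (e + positive_sum) / weight_sum))
        .max_by(|a, b| (0.5 - a.1).abs().total_cmp(&(0.5 - b.1).abs()))
}

fn main() {
    let data = vec![
        inst(&["a", "b"], 1.0),
        inst(&["a"], 1.0),
        inst(&["b"], -1.0),
        inst(&["c"], -1.0),
    ];
    // Feature "a" appears in exactly the positive instances, so its error rate is 0.
    println!("{:?}", select_best(&data));
}
```

In this toy set the error rates are 0.0 for "a", 0.5 for "b", and 0.75 for "c", so "a" is selected.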

Step 3: Check convergence

If |0.5 - best_error_rate| < threshold, stop early – no feature can significantly improve the model.

Step 4: Compute the weak learner weight

alpha = 0.5 * ln((1 - error_rate) / error_rate)
model[h_best] += alpha

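A lower error rate produces a higher alpha, giving more influence to better features. An error rate above 0.5 produces a negative alpha, so the feature counts as evidence against a boundary.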

Step 5: Update instance weights

For each instance i:
    prediction = +1 if h_best in features(i), else -1

    if y[i] * prediction < 0:  (misclassified)
        D[i] *= exp(alpha)     (increase weight)
    else:                       (correctly classified)
        D[i] /= exp(alpha)     (decrease weight)

Normalize: D[i] /= sum(D)

This ensures subsequent iterations focus on the instances that are still difficult to classify.
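Steps 4 and 5 can be checked numerically with a small sketch. The code below is standalone Rust, not Litsea's implementation; the function names `alpha` and `update_weights` and the slice-based data layout are assumptions made for the example.

```rust
/// alpha = 0.5 * ln((1 - e) / e); positive when e < 0.5, negative when e > 0.5.
fn alpha(error_rate: f64) -> f64 {
    0.5 * ((1.0 - error_rate) / error_rate).ln()
}

/// Reweight instances after choosing a weak learner, then normalize to sum 1.
/// `has_feature[i]` says whether instance i contains the chosen feature.
fn update_weights(weights: &mut [f64], labels: &[f64], has_feature: &[bool], alpha: f64) {
    for ((d, &y), &has) in weights.iter_mut().zip(labels).zip(has_feature) {
        let prediction = if has { 1.0 } else { -1.0 };
        if y * prediction < 0.0 {
            *d *= alpha.exp(); // misclassified: increase weight
        } else {
            *d /= alpha.exp(); // correctly classified: decrease weight
        }
    }
    let total: f64 = weights.iter().sum();
    for d in weights.iter_mut() {
        *d /= total;
    }
}

fn main() {
    // Four instances with uniform weights; only instance 1 is misclassified,
    // so the chosen feature has a weighted error rate of 0.25.
    let labels = [1.0, 1.0, -1.0, -1.0];
    let has_feature = [true, false, false, false];
    let mut weights = vec![0.25; 4];

    let a = alpha(0.25);
    update_weights(&mut weights, &labels, &has_feature, a);
    println!("alpha = {:.4}, weights = {:?}", a, weights);
}
```

After the update the single misclassified instance carries half of the total weight and the three correct ones share the other half, which is why the next round concentrates on it.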

Prediction

Given an input set of features (attributes), the prediction is:

score = bias + sum(model[feature] for each feature in attributes)
prediction = +1 if score >= 0, else -1

Bias Term

The bias is computed as:

bias = -sum(all model weights) / 2.0

This centers the decision boundary. The empty-string feature ("") serves as the bias bucket during training.
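The scoring rule and the bias formula fit in a few lines. The sketch below is standalone Rust with made-up weights; the `HashMap<String, f64>` model layout and the function names are assumptions for illustration and may differ from Litsea's internals.

```rust
use std::collections::HashMap;

/// bias = -sum(all model weights) / 2.0
fn bias(model: &HashMap<String, f64>) -> f64 {
    -model.values().sum::<f64>() / 2.0
}

/// Returns +1 (boundary) when the score is non-negative, otherwise -1.
/// Features missing from the model contribute nothing to the score.
fn predict(model: &HashMap<String, f64>, attributes: &[&str]) -> i8 {
    let score = bias(model)
        + attributes.iter().filter_map(|f| model.get(*f)).sum::<f64>();
    if score >= 0.0 { 1 } else { -1 }
}

fn main() {
    // Toy weights for illustration only.
    let mut model = HashMap::new();
    for (feature, weight) in [("UW4:は", 1.2), ("UC4:I", -0.4), ("BC2:IK", 0.6)] {
        model.insert(feature.to_string(), weight);
    }
    // The weights sum to 1.4, so the bias is about -0.7.
    println!("{}", predict(&model, &["UW4:は", "UC4:I"])); // score about 0.1
    println!("{}", predict(&model, &["UC4:I"])); // score about -1.1
}
```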

Model File Format

The trained model is saved as a simple text file:

feature1\tweight1
feature2\tweight2
...
bias_value

Each line contains a feature name and its weight (tab-separated)
Zero-weight features are omitted
The last line contains the bias term (a single number)

See Model File Format for details.
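For readers who want to inspect a model file by hand, the layout above is easy to parse. The following is a minimal sketch, assuming exactly the layout described here (tab-separated pairs followed by a bias line); `parse_model` is a hypothetical helper, and in practice `load_model_from_path` does this work.

```rust
use std::collections::HashMap;

/// Parse `feature<TAB>weight` lines followed by a final line holding only the bias.
/// Returns None if any line does not match that layout.
fn parse_model(text: &str) -> Option<(HashMap<String, f64>, f64)> {
    let mut lines: Vec<&str> = text.lines().filter(|l| !l.trim().is_empty()).collect();
    // The last non-empty line is the bias term.
    let bias = lines.pop()?.trim().parse::<f64>().ok()?;
    let mut weights = HashMap::new();
    for line in lines {
        let (feature, weight) = line.split_once('\t')?;
        weights.insert(feature.to_string(), weight.trim().parse::<f64>().ok()?);
    }
    Some((weights, bias))
}

fn main() {
    // Toy content; the bias equals -(1.5 - 0.25) / 2.
    let text = "UW4:は\t1.5\nUC4:I\t-0.25\n-0.625\n";
    match parse_model(text) {
        Some((weights, bias)) => println!("{} features, bias {}", weights.len(), bias),
        None => println!("malformed model text"),
    }
}
```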

Hyperparameters

Parameter	Default	Description
`threshold`	0.01	Early stopping threshold. Lower values allow more iterations, potentially improving accuracy
`num_iterations`	100	Maximum number of boosting rounds. Higher values may improve accuracy at the cost of training time and model size

Averaged Perceptron

Litsea uses the Averaged Perceptron algorithm for multiclass classification to perform joint word segmentation and POS tagging. This chapter explains the algorithm as implemented in Litsea.

Overview

While AdaBoost performs binary classification (boundary vs. non-boundary), the Averaged Perceptron performs multiclass classification – predicting one of 18 segment labels for each character position:

17 boundary labels: B-ADJ, B-ADP, B-ADV, B-AUX, B-CCONJ, B-DET, B-INTJ, B-NOUN, B-NUM, B-PART, B-PRON, B-PROPN, B-PUNCT, B-SCONJ, B-SYM, B-VERB, B-X
1 non-boundary label: O (continuation of the current word)

These labels correspond to the 17 Universal POS (UPOS) tags from the Universal Dependencies project, prefixed with B- to indicate a word boundary. This enables simultaneous word boundary detection and POS estimation in a single classification step.

Algorithm

Weight Representation

The perceptron maintains a weight vector per class. Weights are stored as a sparse map:

weights: HashMap<Feature, HashMap<Class, f64>>

For example:

weights["UW4:猫"]["B-NOUN"] = 2.5
weights["UC4:H"]["B-NOUN"]  = 1.8
weights["UW4:猫"]["O"]      = -0.3
...

For a given feature set, the score for each class is the sum of its feature weights:

score(class) = sum(weights[feature][class] for each feature in input)
prediction = argmax(score(class) for all classes)
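The scoring rule can be sketched directly from the nested-map layout shown above. This is standalone Rust for illustration; the `Weights` alias, the string-typed classes, and the tie-breaking rule (first class in the list wins) are assumptions, not a description of Litsea's `AveragedPerceptron::predict`.

```rust
use std::collections::HashMap;

type Weights = HashMap<String, HashMap<String, f64>>;

/// Sum weights[feature][class] over the input features and return the best class.
/// Ties go to the class listed first in `classes`.
fn predict<'a>(weights: &Weights, classes: &[&'a str], features: &[&str]) -> Option<&'a str> {
    let mut best: Option<(&'a str, f64)> = None;
    for &class in classes {
        let score: f64 = features
            .iter()
            .filter_map(|f| weights.get(*f)?.get(class))
            .sum();
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((class, score));
        }
    }
    best.map(|(class, _)| class)
}

fn main() {
    // The example weights from the text above.
    let mut weights: Weights = HashMap::new();
    for (feature, class, w) in [
        ("UW4:猫", "B-NOUN", 2.5),
        ("UC4:H", "B-NOUN", 1.8),
        ("UW4:猫", "O", -0.3),
    ] {
        weights
            .entry(feature.to_string())
            .or_default()
            .insert(class.to_string(), w);
    }
    // B-NOUN scores 2.5 + 1.8 and O scores -0.3, so B-NOUN wins.
    println!("{:?}", predict(&weights, &["O", "B-NOUN"], &["UW4:猫", "UC4:H"]));
}
```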

Update Rule

When the perceptron misclassifies a training instance, it adjusts the weights:

For each training instance (features, truth):
    guess = predict(features)

    if guess != truth:
        For each feature f in features:
            weights[f][truth] += 1.0   # increase weight for correct class
            weights[f][guess] -= 1.0   # decrease weight for predicted class

This increases the weights for the correct class and decreases them for the incorrectly predicted class, making the correct prediction more likely for similar inputs in the future.
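The update rule translates almost line for line into Rust. The sketch below is standalone and uses the same assumed `Weights` layout as the text (feature, then class); the function name `update` is hypothetical and averaging is deliberately left out.

```rust
use std::collections::HashMap;

type Weights = HashMap<String, HashMap<String, f64>>;

/// Apply one perceptron update: +1 toward `truth`, -1 away from `guess`.
/// Does nothing when the guess was already correct.
fn update(weights: &mut Weights, features: &[&str], truth: &str, guess: &str) {
    if truth == guess {
        return;
    }
    for f in features {
        let per_class = weights.entry(f.to_string()).or_default();
        *per_class.entry(truth.to_string()).or_insert(0.0) += 1.0;
        *per_class.entry(guess.to_string()).or_insert(0.0) -= 1.0;
    }
}

fn main() {
    let mut weights: Weights = HashMap::new();
    // The model guessed "O" but the correct label was "B-NOUN".
    update(&mut weights, &["UW4:猫", "UC4:H"], "B-NOUN", "O");
    println!("{:?}", weights["UW4:猫"]["B-NOUN"]); // 1.0
    println!("{:?}", weights["UW4:猫"]["O"]); // -1.0
}
```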

Averaging

A key improvement over the basic perceptron is weight averaging. Rather than using the final weights (which can be unstable and tend to overfit to the tail of the training data), the model averages all weight vectors seen during training. This improves generalization to unseen data.

The implementation uses a cumulative sum approach for efficiency:

cumulative[feature][class] += weights[feature][class] * elapsed_steps

At the end of training:
    averaged[feature][class] = cumulative[feature][class] / total_steps

This avoids storing every intermediate weight vector while producing the same result, and it makes the final model less sensitive to the order of the training data.
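One common way to implement the cumulative sum is to timestamp each weight and credit it lazily for the steps during which it was unchanged. The sketch below shows that idea for a single weight cell; the `AveragedWeight` struct and its `stamp` field are assumptions made for illustration, and Litsea's actual bookkeeping may differ.

```rust
/// One weight cell with lazy averaging.
/// `stamp` is the step at which `value` last changed.
#[derive(Default)]
struct AveragedWeight {
    value: f64,
    cumulative: f64,
    stamp: u64,
}

impl AveragedWeight {
    /// Credit the current value for the steps it was in effect, then change it.
    /// The new value takes effect from step `now` onward.
    fn add(&mut self, delta: f64, now: u64) {
        self.cumulative += self.value * (now - self.stamp) as f64;
        self.stamp = now;
        self.value += delta;
    }

    /// Average of the value over steps 0..total_steps.
    fn average(&self, total_steps: u64) -> f64 {
        let cumulative = self.cumulative + self.value * (total_steps - self.stamp) as f64;
        cumulative / total_steps as f64
    }
}

fn main() {
    let mut w = AveragedWeight::default();
    w.add(1.0, 2); // value is 0 for steps 0-1, then 1
    w.add(1.0, 5); // value is 1 for steps 2-4, then 2
    // Per-step values are [0, 0, 1, 1, 1, 2, 2, 2, 2, 2], which sum to 13.
    println!("{}", w.average(10));
}
```

Only two updates touch the cell, yet the result matches the average over all ten steps, which is the point of the cumulative approach.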

Training with Epochs

Training iterates over the data multiple times (epochs). Each epoch processes all training instances in order:

For each epoch (1 to num_epochs):
    For each instance in training data:
        features = extract_features(instance)
        predicted = argmax(score(class) for all classes)
        if predicted != correct_label:
            update weights
        accumulate weights for averaging

Training supports graceful interruption via AtomicBool – a Ctrl+C signal stops training and saves the model at its current state.

use std::sync::Arc;
use std::sync::atomic::AtomicBool;

use litsea::perceptron::AveragedPerceptron;

fn main() {
    let mut perceptron = AveragedPerceptron::new();
    // ... add instances ...

    // Shared flag used for graceful interruption (for example from a Ctrl+C handler).
    let running = Arc::new(AtomicBool::new(true));
    perceptron.train(10, running); // 10 epochs
}

Model File Format

The Averaged Perceptron model is saved as a text file with the following structure:

18
O
B-ADJ
B-ADP
...
B-X
feature1\tclass1\tweight1
feature2\tclass2\tweight2
...

Line 1: Number of classes (18)
Lines 2 to N+1: Class names, one per line (N is the class count from line 1)
Remaining lines: Feature weights, tab-separated as feature\tclass\tweight
Zero-weight entries are omitted
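The layout above can be read back with a short parser. This is a minimal sketch that assumes exactly the structure listed here; `parse_pos_model` is a hypothetical helper for illustration, and the library's own `load_model_from_path` should be used in real code.

```rust
use std::collections::HashMap;

type Weights = HashMap<String, HashMap<String, f64>>;

/// Parse: class count, that many class names, then `feature<TAB>class<TAB>weight` lines.
/// Returns None if the text does not match that layout.
fn parse_pos_model(text: &str) -> Option<(Vec<String>, Weights)> {
    let mut lines = text.lines();
    let n = lines.next()?.trim().parse::<usize>().ok()?;
    let classes: Vec<String> = lines
        .by_ref()
        .take(n)
        .map(|l| l.trim().to_string())
        .collect();
    if classes.len() != n {
        return None; // fewer class names than announced
    }
    let mut weights: Weights = HashMap::new();
    for line in lines.filter(|l| !l.is_empty()) {
        let mut parts = line.split('\t');
        let (feature, class, weight) = (parts.next()?, parts.next()?, parts.next()?);
        weights
            .entry(feature.to_string())
            .or_default()
            .insert(class.to_string(), weight.trim().parse::<f64>().ok()?);
    }
    Some((classes, weights))
}

fn main() {
    // A two-class toy file; real models list 18 classes.
    let text = "2\nO\nB-NOUN\nUW4:猫\tB-NOUN\t2.5\nUW4:猫\tO\t-0.3\n";
    if let Some((classes, weights)) = parse_pos_model(text) {
        println!("{:?} {}", classes, weights["UW4:猫"]["B-NOUN"]);
    }
}
```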

Comparison with AdaBoost

Aspect	AdaBoost	Averaged Perceptron
Classification	Binary (+1/-1)	Multiclass (18 classes)
Output	Word boundaries only	Word boundaries + POS tags
Weak learner	Decision stumps per feature	None (linear classifier)
Weight management	One weight per feature	Class x feature weight matrix
Generalization	Ensemble	Weight averaging
Training	Iterative boosting with sample reweighting	Online learning with weight averaging
Model size	A few KB	~11 MB (with POS features)
Hyperparameters	`threshold`, `num_iterations`	`num_epochs`

Hyperparameters

Parameter	Default	Description
`num_epochs`	10	Number of training passes over the data. More epochs can improve accuracy but may overfit

Feature Extraction

Litsea uses character n-gram features to capture the local context around each potential word boundary. This chapter catalogs all feature types.

Feature Categories

For each character position i in the input, the segmenter extracts features from a sliding window of characters, their type codes, and previous boundary decisions.

Base Features (38 features)

Category	IDs	Description	Count
UW (Unary Word)	UW1–UW6	Individual characters at positions i-3 to i+2	6
BW (Bigram Word)	BW1–BW3	Adjacent character pairs	3
UC (Unary Char-type)	UC1–UC6	Character type codes at positions i-3 to i+2	6
BC (Bigram Char-type)	BC1–BC3	Adjacent type code pairs	3
TC (Trigram Char-type)	TC1–TC4	Type code triples	4
UP (Unary Previous-tag)	UP1–UP3	Previous 3 boundary decisions	3
BP (Bigram Previous-tag)	BP1–BP2	Boundary decision pairs	2
UQ (Unary tag+type)	UQ1–UQ3	Combined boundary decision + type code	3
BQ (Bigram tag+type)	BQ1–BQ4	Combined decision + type code bigrams	4
TQ (Trigram tag+type)	TQ1–TQ4	Combined decision + type code trigrams	4

Language-Specific Features (4 features, Japanese and Chinese only)

Category	IDs	Description	Count
WC (Word+Char-type)	WC1–WC4	Character + type code mixed features	4

WC1: character at i-1 + type code at i
WC2: type code at i-1 + character at i
WC3: character at i-1 + type code at i-1
WC4: character at i + type code at i

Why no WC for Korean? Korean Hangul syllables are classified into only two types (SN and SF), so WC features would add noise rather than useful signal.

Total Feature Count

Language	Base	WC	Total
Japanese	38	4	42
Chinese	38	4	42
Korean	38	0	38

Feature Format

Each feature is represented as a string in the format PREFIX:VALUE. The lines below illustrate the format and are not all taken from the same position:

UW4:は        ← The character at position i is "は"
UC4:I         ← The type code at position i is "I" (Hiragana)
BW2:はテ      ← The bigram at position i-1..i is "はテ"
BC2:IK        ← The type bigram is Hiragana + Katakana
UP3:B         ← The previous boundary decision was "B" (boundary)
WC1:はK       ← Character "は" combined with type "K"

Sliding Window Layout

The segmenter pads the input with sentinel characters:

Index:   0    1    2    3    4    5    ...  n+2  n+3  n+4  n+5
Chars:   B3   B2   B1   c1   c2   c3  ...  cn   E1   E2   E3
Types:   O    O    O    t1   t2   t3  ...  tn   O    O    O
Tags:    U    U    U    U    ?    ?   ...  ?

B3, B2, B1 – Begin sentinels (padding)
E1, E2, E3 – End sentinels (padding)
O – “Other” type for padding positions
U – “Unknown” tag for initial positions
B – “Boundary” tag (word start)
O – “Other” tag (continuation)

Features are extracted for positions 4 through len-4 (that is, 4 <= i < len-3), where the full window of i-3 to i+2 is available.
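The padding and window indexing can be tried out with a small standalone sketch. It only builds the UW and BW features; the helper names `pad` and `word_features` are hypothetical, and the pairing used for BW1 and BW3 follows the TinySegmenter convention, which is an assumption here (the text above only confirms BW2 as positions i-1 and i).

```rust
/// Pad a sentence with the begin/end sentinels described above.
fn pad(sentence: &str) -> Vec<String> {
    let mut chars: Vec<String> = vec!["B3".into(), "B2".into(), "B1".into()];
    chars.extend(sentence.chars().map(|c| c.to_string()));
    chars.extend(["E1", "E2", "E3"].map(String::from));
    chars
}

/// Build UW1..UW6 and BW1..BW3 for position `i` (requires 3 <= i < len - 2).
fn word_features(chars: &[String], i: usize) -> Vec<String> {
    let w = &chars[i - 3..=i + 2]; // w[0] is position i-3, w[3] is position i
    let mut features: Vec<String> = (0..6)
        .map(|k| format!("UW{}:{}", k + 1, w[k]))
        .collect();
    // Assumed pairing: BW1 = (i-2, i-1), BW2 = (i-1, i), BW3 = (i, i+1).
    for k in 0..3 {
        features.push(format!("BW{}:{}{}", k + 1, w[k + 1], w[k + 2]));
    }
    features
}

fn main() {
    let chars = pad("これは");
    // Position 4 is "れ"; its window runs from B2 to E1.
    println!("{}", word_features(&chars, 4).join(" "));
}
```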

Training Data Format

The extract command writes features to a file in this format:

1	UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1	UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...

Each line contains:

A label (1 for boundary, -1 for non-boundary)
Tab-separated feature strings

Character Type Classification

Each language in Litsea defines a set of character type patterns that classify individual characters into linguistically meaningful categories. These type codes are used as features for the AdaBoost classifier.

How It Works

Language::char_type(c: char) -> &'static str classifies a character with a direct match expression on Unicode character ranges, with no regex and no allocation. Match arms are tried top to bottom, so the first matching arm determines the type code. If no arm matches, the character is classified as "O" (Other).

Each language has its own classification function (japanese_char_type, chinese_char_type, korean_char_type). The classes shared by all languages, "P" (punctuation), "A" (Latin), and "N" (digits), live in a common punct_latin_digit() helper that is checked after the language-specific classes. Logic beyond plain ranges is expressed with match guards (e.g., Korean Hangul syllable structure).

Japanese Character Types

Code	Name	Pattern / Range	Examples
M	Kanji Numbers	`[一二三四五六七八九十百千万億兆]`	一, 千, 億
H	Kanji / CJK Ideographs	`[一-龠々〆ヵヶ]`	漢, 字, 学
I	Hiragana	`[ぁ-ん]`	あ, い, う
K	Katakana	`[ァ-ヴーｱ-ﾝﾞﾟ]`	ア, カ, ー
P	Punctuation	CJK Symbols (U+3000-303F), Full-width (U+FF01-FF65)	。, 、, 「
A	ASCII/Latin	`[a-zA-Zａ-ｚＡ-Ｚ]`	A, z, Ｂ
N	Digits	`[0-9０-９]`	0, ５
O	Other	Fallback	@, #

Note: “M” (Kanji numbers) is checked before “H” (general Kanji), so characters like 一 and 百 are classified as numbers rather than generic ideographs.
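The Japanese table can be written as a single match expression. The sketch below is an independent reconstruction from the table, not Litsea's source: it places "A" and "N" before "P" so that full-width letters and digits (which fall inside U+FF01-FF65) keep the codes shown in the Examples column, and that ordering is an assumption.

```rust
/// Classify a character using the Japanese table above. The first matching arm wins.
fn japanese_char_type(c: char) -> &'static str {
    match c {
        // Kanji numbers come first so they are not swallowed by the general Kanji range.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十'
        | '百' | '千' | '万' | '億' | '兆' => "M",
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        'ぁ'..='ん' => "I",
        'ァ'..='ヴ' | 'ー' | 'ｱ'..='ﾝ' | 'ﾞ' | 'ﾟ' => "K",
        // Assumed order: letters and digits before the wide punctuation ranges.
        'a'..='z' | 'A'..='Z' | 'ａ'..='ｚ' | 'Ａ'..='Ｚ' => "A",
        '0'..='9' | '０'..='９' => "N",
        '\u{3000}'..='\u{303F}' | '\u{FF01}'..='\u{FF65}' => "P",
        _ => "O",
    }
}

fn main() {
    let types: Vec<&str> = "これはテストです。".chars().map(japanese_char_type).collect();
    // Matches the types array in the prediction walkthrough: I I I K K K I I P
    println!("{}", types.join(" "));
}
```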

Chinese Character Types

Code	Name	Pattern / Range	Examples
F	Function Words	High-frequency grammatical words	的, 了, 在, 是
C	CJK Unified	U+4E00–U+9FFF	中, 国, 人
X	CJK Extension A	U+3400–U+4DBF	Rare characters
R	CJK Radicals	U+2E80–U+2FDF	Kangxi radicals
P	Punctuation	CJK Symbols + Full-width	。, ，, 《
B	Bopomofo	U+3100–U+312F, U+31A0–U+31BF	Zhuyin symbols
A	ASCII/Latin	`[a-zA-Zａ-ｚＡ-Ｚ]`	A, z
N	Digits	`[0-9０-９]`	0, ５
O	Other	Fallback	@, #

Chinese function words include:

Structural particles: 的, 地, 得
Aspect/modal particles: 了, 着, 过, 吗, 呢, 吧, 啊, 嘛
Conjunctions: 和, 与, 或, 但, 而, 且, 及
Prepositions: 在, 从, 到, 把, 被, 对, 向, 给
Common grammatical verbs/adverbs: 是, 有, 不, 也, 都, 就, 要, 会, 能, 可

Korean Character Types

Code	Name	Pattern / Range	Examples
E	Particles/Endings	High-frequency grammatical particles	은, 는, 을, 를, 의, 에
SN	Hangul (no batchim)	Hangul Syllable without final consonant	가, 나, 하
SF	Hangul (with batchim)	Hangul Syllable with final consonant	한, 글, 각
J	Hangul Jamo	U+1100–U+11FF	Individual consonants/vowels
G	Compatibility Jamo	U+3130–U+318F	ㄱ, ㅏ, ㅎ
H	Hanja	U+4E00–U+9FFF	CJK Ideographs
P	Punctuation	CJK Symbols + Full-width	。, ，
A	ASCII/Latin	`[a-zA-Zａ-ｚＡ-Ｚ]`	A, z
N	Digits	`[0-9０-９]`	0, ５
O	Other	Fallback	@, #

Korean Hangul Syllable Detection

Korean uses a match guard for the SN and SF types. This leverages Unicode’s systematic Hangul encoding:

The Hangul Syllables block occupies U+AC00–U+D7AF (assigned syllables end at U+D7A3)
Each syllable is encoded as: (initial * 21 + medial) * 28 + final + 0xAC00
If (codepoint - 0xAC00) % 28 == 0, the syllable has no final consonant (SN)
Otherwise, it has a final consonant (SF, “받침”)

This distinction is important because the presence of a final consonant (받침) affects Korean word boundary patterns and particle attachment.
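The modulo test is short enough to verify directly. The following standalone sketch applies it to a few syllables; the function name `hangul_syllable_type` and the `Option` return type are choices made for this example rather than Litsea's API.

```rust
/// Returns Some("SN") or Some("SF") for a precomposed Hangul syllable, else None.
fn hangul_syllable_type(c: char) -> Option<&'static str> {
    let cp = c as u32;
    // Precomposed syllables are assigned from U+AC00 to U+D7A3.
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    // The code point is (initial * 21 + medial) * 28 + final + 0xAC00,
    // so the remainder modulo 28 is the index of the final consonant.
    if (cp - 0xAC00) % 28 == 0 {
        Some("SN")
    } else {
        Some("SF")
    }
}

fn main() {
    for c in ['가', '하', '한', '글', 'a'] {
        println!("{} -> {:?}", c, hangul_syllable_type(c));
    }
}
```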

Cross-Language Comparison

Feature	Japanese	Chinese	Korean
Total types	8	9	10
Unique types	M, H, I, K	F, C, X, R, B	E, SN, SF, J, G
Shared types	P, A, N, O	P, A, N, O	P, A, N, O (H shared with JP)
Matching method	Range match	Range match	Range match + guard
WC features used	Yes	Yes	No

Prediction Pipeline

This chapter provides a step-by-step walkthrough of how Segmenter::segment() processes input text.

Example: Segmenting “これはテストです。”

Step 1: Initialize Arrays with Padding

chars: ["B3", "B2", "B1"]
types: ["O",  "O",  "O" ]
tags:  ["U",  "U",  "U", "U"]

The tags array gets one extra “U” because tags[3] represents the first real character’s tag (set to “Unknown” since there is no prior boundary decision).

Step 2: Scan Input Characters

For each character in the input, determine its type using language-specific patterns and append to the arrays:

chars: ["B3","B2","B1", "こ","れ","は","テ","ス","ト","で","す","。"]
types: ["O", "O", "O",  "I", "I", "I", "K", "K", "K", "I", "I", "P"]

Step 3: Append End Sentinels

chars: [..., "。", "E1", "E2", "E3"]
types: [..., "P",  "O",  "O",  "O" ]

Step 4: Iterate and Predict

For each position i from 4 up to (but not including) len(chars) - 3:

i=4 (れ): Extract features → predict → label=-1 (O) → word="これ"
i=5 (は): Extract features → predict → label=+1 (B) → push "これ", word="は"
i=6 (テ): Extract features → predict → label=+1 (B) → push "は", word="テ"
i=7 (ス): Extract features → predict → label=-1 (O) → word="テス"
i=8 (ト): Extract features → predict → label=-1 (O) → word="テスト"
i=9 (で): Extract features → predict → label=+1 (B) → push "テスト", word="で"
i=10(す): Extract features → predict → label=-1 (O) → word="です"
i=11(。): Extract features → predict → label=+1 (B) → push "です", word="。"

Step 5: Push Final Word

Push the remaining word “。” to the result.

Result

["これ", "は", "テスト", "です", "。"]
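The five steps can be sketched as one greedy loop. This is a simplified illustration under stated assumptions, not the library's implementation: char_type is a toy classifier covering only a few Japanese ranges, and the boundary decision is passed in as a closure instead of coming from a trained AdaBoost model.

```rust
/// Toy character classifier (assumption: just enough ranges for the example).
fn char_type(c: char) -> &'static str {
    match c {
        'ぁ'..='ん' => "I",
        'ァ'..='ヴ' | 'ー' => "K",
        '。' | '、' => "P",
        _ => "O",
    }
}

/// Greedy left-to-right segmentation. `is_boundary` receives the position
/// and the padded chars/types/tags arrays and decides whether a word starts there.
fn segment<F>(sentence: &str, mut is_boundary: F) -> Vec<String>
where
    F: FnMut(usize, &[String], &[&str], &[&str]) -> bool,
{
    if sentence.is_empty() {
        return Vec::new();
    }
    // Step 1: start sentinels; tags gets one extra "U" for the first real character.
    let mut chars: Vec<String> = vec!["B3".into(), "B2".into(), "B1".into()];
    let mut types: Vec<&'static str> = vec!["O", "O", "O"];
    let mut tags: Vec<&'static str> = vec!["U", "U", "U", "U"];
    // Step 2: scan the input.
    for c in sentence.chars() {
        chars.push(c.to_string());
        types.push(char_type(c));
    }
    // Step 3: end sentinels.
    for s in ["E1", "E2", "E3"] {
        chars.push(s.to_string());
        types.push("O");
    }
    // Step 4: decide at every position after the first real character.
    let mut result = Vec::new();
    let mut word = chars[3].clone();
    for i in 4..chars.len() - 3 {
        if is_boundary(i, &chars, &types, &tags) {
            result.push(std::mem::take(&mut word));
            tags.push("B");
        } else {
            tags.push("O");
        }
        word.push_str(&chars[i]);
    }
    // Step 5: push the final word.
    result.push(word);
    result
}

// Usage with a toy rule ("boundary wherever the character type changes"):
// segment("これはテストです。", |i, _chars, types, _tags| types[i] != types[i - 1])
```

With the toy rule, これは stays in one piece because all three characters are hiragana. The trained model is what separates これ from は in the walkthrough above.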

How Prediction Works at Each Position

At each position i, the segmenter:

Extracts features – Calls get_attributes(i, tags, chars, types) to build a HashSet<String> of features (38 for Korean, 42 for Japanese and Chinese)
Computes score – The AdaBoost learner sums the model weights for all matching features plus the bias:
```
score = bias + sum(model[feature] for feature in attributes)
```
Makes decision – If score >= 0, the character starts a new word (boundary); otherwise, it continues the current word
Updates tags – Pushes “B” or “O” to the tags array, which affects feature extraction for subsequent positions
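The scoring and decision steps can be sketched as follows. This is an assumption-laden illustration: the model is represented as a plain HashMap from feature name to weight, and the bias is taken to be minus half the sum of all weights, as described for AdaBoost::bias() in the API reference. The free functions bias and predict are stand-ins, not the crate's methods.

```rust
use std::collections::{HashMap, HashSet};

/// Bias term: minus half the sum of all model weights.
fn bias(model: &HashMap<String, f64>) -> f64 {
    -model.values().sum::<f64>() / 2.0
}

/// Returns +1 (boundary) when the score is non-negative, otherwise -1.
fn predict(model: &HashMap<String, f64>, attributes: &HashSet<String>) -> i8 {
    // Features missing from the model contribute nothing to the score.
    let score = bias(model)
        + attributes
            .iter()
            .filter_map(|a| model.get(a))
            .sum::<f64>();
    if score >= 0.0 {
        1
    } else {
        -1
    }
}
```

One way to read this bias: if each weak classifier votes +1 when its feature is present and -1 when it is absent, the weighted vote equals 2 * sum(present weights) - sum(all weights). Halving that gives the score formula above.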

Training vs. Prediction

Aspect	Training (`process_corpus`)	Prediction (`segment`)
Tags source	Pre-computed from the annotated corpus	Dynamically generated by the model
First tag	“U” (overrides “B” at position 3)	“U” (no prior decision)
Labels	Known from corpus (+1 or -1)	Predicted by AdaBoost
Features	Written to file via callback	Passed directly to `predict()`

During training, tags are derived from the ground-truth corpus segmentation, so the model learns from correct boundary decisions. During prediction, tags are generated on-the-fly, meaning each decision depends on all previous predictions – this is a left-to-right greedy approach.
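As an illustration of how ground-truth tags and labels could be derived from one space-separated corpus line, here is a small sketch. The function labels_from_corpus is hypothetical, and the sketch assumes that the first character receives the tag "U" and produces no training label, mirroring the prediction loop that starts at the second character.

```rust
/// Derives per-character tags and labels from one space-separated corpus line.
/// Returns (chars, tags, labels); labels has one entry per character after the first.
fn labels_from_corpus(line: &str) -> (Vec<char>, Vec<&'static str>, Vec<i8>) {
    let mut chars = Vec::new();
    let mut tags: Vec<&'static str> = Vec::new();
    for word in line.split_whitespace() {
        for (k, c) in word.chars().enumerate() {
            chars.push(c);
            // The first character of each word starts a new word.
            tags.push(if k == 0 { "B" } else { "O" });
        }
    }
    // The very first tag is "U": there is no prior boundary decision.
    if let Some(first) = tags.first_mut() {
        *first = "U";
    }
    // +1 for a boundary, -1 otherwise, skipping the first character.
    let labels = tags
        .iter()
        .skip(1)
        .map(|t| if *t == "B" { 1 } else { -1 })
        .collect();
    (chars, tags, labels)
}
```

For the line "テスト です", only で starts a new word after the first character, so it is the only position labeled +1.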

Performance Characteristics

The segmentation algorithm is linear in the length of the input:

Each character position is visited once: O(n)
Feature extraction at each position: O(1) (fixed number of features)
Prediction at each position: O(f) where f is the number of active features (~38-42)
Total: O(n * f) which is effectively O(n)

Language Support Overview

Litsea supports word segmentation for three languages through a unified framework based on the Language enum.

Supported Languages

Language	Enum Variant	CLI Values	Feature Count	Pre-trained Model Accuracy
Japanese	`Language::Japanese`	`japanese`, `ja`	42	94.15%
Chinese	`Language::Chinese`	`chinese`, `zh`	42	80.72%
Korean	`Language::Korean`	`korean`, `ko`	38	85.08%

The Language Enum

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

Default is Japanese
Implements FromStr – parses from full name or ISO 639-1 code (case-insensitive)
Implements Display – outputs the lowercase full name
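The three properties listed above can be reproduced with a small stand-alone sketch. The enum is redeclared locally so the example compiles without the crate, and the String error type is an assumption; the library most likely returns its own error type.

```rust
use std::fmt;
use std::str::FromStr;

/// Local stand-in for the enum, so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

impl fmt::Display for Language {
    // Writes the lowercase full name.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            Language::Japanese => "japanese",
            Language::Chinese => "chinese",
            Language::Korean => "korean",
        };
        f.write_str(name)
    }
}

impl FromStr for Language {
    // Assumption: a plain String error, used here only for illustration.
    type Err = String;

    // Accepts the full name or the ISO 639-1 code, ignoring case.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_lowercase().as_str() {
            "japanese" | "ja" => Ok(Language::Japanese),
            "chinese" | "zh" => Ok(Language::Chinese),
            "korean" | "ko" => Ok(Language::Korean),
            other => Err(format!("unsupported language: {}", other)),
        }
    }
}
```

Lowercasing the input before matching is what makes "Korean", "KOREAN", and "ko" all resolve to the same variant.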

Parsing Examples

use litsea::language::Language;

let ja: Language = "japanese".parse().unwrap();
let zh: Language = "zh".parse().unwrap();
let ko: Language = "Korean".parse().unwrap();   // case-insensitive
let err = "french".parse::<Language>();          // Err(...)

How Languages Differ

Each language defines its own character type patterns that classify characters into type codes. These type codes are used as features for the AdaBoost classifier.

Aspect	Japanese	Chinese	Korean
Character types	8 (M, H, I, K, P, A, N, O)	9 (F, C, X, R, P, B, A, N, O)	10 (E, SN, SF, J, G, H, P, A, N, O)
WC features	Yes (4 extra)	Yes (4 extra)	No
Total features	42	42	38
Matching method	Range match	Range match	Range match + guard

Why Korean Has Fewer Features

Korean Hangul syllables are classified into only two types: SN (without 받침/final consonant) and SF (with 받침). This binary distinction means WC features (word + character-type combinations) would produce redundant information with little discriminative power. Excluding them reduces noise and keeps the model compact.

Japanese

Japanese is the default language in Litsea.

Character Types

Code	Name	Pattern	Examples
M	Kanji Numbers	`[一二三四五六七八九十百千万億兆]`	一, 三, 千, 億
H	Kanji / CJK	`[一-龠々〆ヵヶ]`	漢, 字, 学, 々
I	Hiragana	`[ぁ-ん]`	あ, い, う, を
K	Katakana	`[ァ-ヴーｱ-ﾝﾞﾟ]`	ア, カ, ー, ﾊ
P	Punctuation	CJK Symbols + Full-width	。, 、, 「, 」
A	ASCII/Latin	`[a-zA-Zａ-ｚＡ-Ｚ]`	A, z, Ｂ
N	Digits	`[0-9０-９]`	0, 5, ５
O	Other	Fallback	@, #, $

Pattern Priority

Patterns are evaluated in order. Notably:

M before H: Characters like 一 and 百 are classified as “Kanji Numbers” (M), not generic “Kanji” (H)
This distinction helps the model learn number-specific boundary patterns
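The effect of arm order can be seen in a small sketch built from the patterns in the table above. The function name classify_ja is hypothetical, and the punctuation class ("P") is left out because its exact ranges are not listed here, so punctuation falls through to "O" in this sketch.

```rust
/// Sketch of order-sensitive classification for Japanese.
/// Arms are tried top to bottom, so the specific "M" arm must precede "H".
fn classify_ja(c: char) -> &'static str {
    match c {
        // Kanji numerals: listed first so they are not swallowed by the "H" range.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十'
        | '百' | '千' | '万' | '億' | '兆' => "M",
        // General kanji range plus a few iteration and counter marks.
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        'ぁ'..='ん' => "I",
        // Full-width katakana, the long-vowel mark, and half-width katakana.
        'ァ'..='ヴ' | 'ー' | 'ｱ'..='ﾝ' | 'ﾞ' | 'ﾟ' => "K",
        'a'..='z' | 'A'..='Z' | 'ａ'..='ｚ' | 'Ａ'..='Ｚ' => "A",
        '0'..='9' | '０'..='９' => "N",
        // Punctuation ("P") is omitted in this sketch.
        _ => "O",
    }
}
```

If the "M" and "H" arms were swapped, 一 and 百 would match the general range first and the numeral class would never be produced.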

Pre-trained Models

japanese.model

Training corpus: UD Japanese-GSD
Accuracy: 94.15%
Precision: 95.57%
Recall: 94.36%

RWCP.model

Source: Extracted from the original TinySegmenter
License: BSD 3-Clause (Taku Kudo)
Size: ~22 KB

JEITA_Genpaku_ChaSen_IPAdic.model

Training corpus: JEITA Project Sugita Genpaku corpus
Tokenizer: ChaSen with IPAdic dictionary
Size: ~17 KB

Example

echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" \
  | litsea segment -l japanese ./models/japanese.model

Output:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。

Chinese

Litsea supports Chinese word segmentation covering both Simplified and Traditional Chinese.

Character Types

Code	Name	Pattern	Examples
F	Function Words	High-frequency grammatical words	的, 了, 在, 是, 和
C	CJK Unified	U+4E00–U+9FFF	中, 国, 人
X	CJK Extension A	U+3400–U+4DBF	Rare characters
R	CJK Radicals	U+2E80–U+2FDF	Kangxi radicals
P	Punctuation	CJK Symbols + Full-width	。, ，, 《, 》
B	Bopomofo	U+3100–U+312F, U+31A0–U+31BF	Zhuyin symbols
A	ASCII/Latin	`[a-zA-Zａ-ｚＡ-Ｚ]`	A, z
N	Digits	`[0-9０-９]`	0, 5, ５
O	Other	Fallback	@, #, $

Chinese Function Words (虚词)

The “F” type captures high-frequency grammatical words that are critical for segmentation:

Category	Characters
Structural particles	的, 地, 得
Aspect/modal particles	了, 着, 过, 吗, 呢, 吧, 啊, 嘛
Conjunctions	和, 与, 或, 但, 而, 且, 及
Prepositions	在, 从, 到, 把, 被, 对, 向, 给
Grammatical verbs/adverbs	是, 有, 不, 也, 都, 就, 要, 会, 能, 可

These characters appear overwhelmingly in grammatical roles and signal word boundaries differently from content words.

Pre-trained Model

chinese.model

Training corpus: UD Chinese-GSD
Accuracy: 80.72%

Example

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model

Korean

Litsea supports Korean word segmentation with specialized Hangul character type detection.

Character Types

Code	Name	Pattern	Examples
E	Particles/Endings	`[은는을를의에]`	은, 는, 을, 를, 의, 에
SN	Hangul (no 받침)	Codepoint arithmetic	가, 나, 하, 모
SF	Hangul (with 받침)	Codepoint arithmetic	한, 글, 각, 붙
J	Hangul Jamo	U+1100–U+11FF	Individual consonants/vowels
G	Compatibility Jamo	U+3130–U+318F	ㄱ, ㅏ, ㅎ
H	Hanja	U+4E00–U+9FFF	CJK Ideographs
P	Punctuation	CJK Symbols + Full-width	。, ，
A	ASCII/Latin	`[a-zA-Zａ-ｚＡ-Ｚ]`	A, z
N	Digits	`[0-9０-９]`	0, 5, ５
O	Other	Fallback	@, #, $

Korean Particles (조사)

The “E” type captures six high-frequency grammatical particles:

Character	Role	Name
은/는	Topic marker	보조사
을/를	Object marker	목적격 조사
의	Possessive	관형격 조사
에	Locative	부사격 조사

These particles frequently appear at word boundaries and are given a distinct type code to improve segmentation accuracy.

Hangul Syllable Structure (받침 Detection)

Korean uses a match guard on the Hangul Syllables range for the SN and SF types, rather than a plain range match. This exploits the systematic Unicode Hangul encoding:

Hangul Syllables block: U+AC00–U+D7AF (11,172 assigned syllables, U+AC00–U+D7A3)
Each syllable = (initial * 21 + medial) * 28 + final + 0xAC00
SN (no 받침): (codepoint - 0xAC00) % 28 == 0
SF (with 받침): (codepoint - 0xAC00) % 28 != 0

The 받침 (final consonant) distinction is linguistically significant because it affects how particles attach to words and where boundaries occur.
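The encoding formula can be inverted to recover the three indices of a syllable. The following sketch uses hypothetical helper names (decompose, compose) and works on the assigned range U+AC00–U+D7A3.

```rust
const BASE: u32 = 0xAC00;
const MEDIALS: u32 = 21;
const FINALS: u32 = 28;

/// Splits a precomposed Hangul syllable into (initial, medial, final) indices.
/// A final index of 0 means the syllable has no 받침.
fn decompose(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let offset = cp - BASE;
    Some((
        offset / (MEDIALS * FINALS),
        (offset / FINALS) % MEDIALS,
        offset % FINALS,
    ))
}

/// Rebuilds the syllable: (initial * 21 + medial) * 28 + final + 0xAC00.
fn compose(initial: u32, medial: u32, fin: u32) -> Option<char> {
    // 19 initials * 21 medials * 28 finals = 11,172 syllables.
    if initial >= 19 || medial >= MEDIALS || fin >= FINALS {
        return None;
    }
    char::from_u32((initial * MEDIALS + medial) * FINALS + fin + BASE)
}
```

For example, 한 decomposes to (18, 0, 4): a non-zero final index, so it is classified as SF.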

No WC Features

Korean does not use WC (word + character-type) features. Since most Hangul syllables fall into only two types (SN and SF), WC features would produce low-entropy, noisy combinations that hurt model accuracy.

Pre-trained Model

korean.model

Training corpus: UD Korean-GSD
Accuracy: 85.08%

Example

echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model

Adding a New Language

Litsea’s multilingual framework is designed to be easily extensible. This guide explains how to add support for a new language.

Steps Overview

Add a variant to the Language enum
Implement Display and FromStr match arms
Create a character classification function
Register the classification function
Decide on WC feature inclusion
Prepare a training corpus and train a model
Add tests

Step 1: Add a Variant to `Language`

In litsea/src/language.rs, add a new variant to the Language enum:

pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
    Thai,       // ← new language
}

Step 2: Implement Display and FromStr

Add match arms for the new language:

// In Display impl
Language::Thai => write!(f, "thai"),

// In FromStr impl
"thai" | "th" => Ok(Language::Thai),

Step 3: Create a Character Classification Function

Define a function that classifies a char into a type code for the new language. Classification is a direct match on character ranges (no regex), so each class is an arm; the first matching arm wins:

fn thai_char_type(c: char) -> &'static str {
    match c {
        // Thai consonants and sequential vowels (U+0E01-U+0E3A)
        '\u{0E01}'..='\u{0E3A}' => "T",
        // Thai vowels and tone marks (U+0E40-U+0E4E)
        '\u{0E40}'..='\u{0E4E}' => "V",
        // Thai digits (U+0E50-U+0E59)
        '\u{0E50}'..='\u{0E59}' => "N",
        // Shared classes: "P" (punctuation), "A" (Latin), "N" (digits)
        _ => punct_latin_digit(c).unwrap_or("O"),
    }
}

Design Tips for Character Types

Identify linguistically distinct categories that correlate with word boundary patterns
Order matters – match arms are tried top to bottom, so put more specific classes before general ones
Consider high-frequency function words as a separate type (as Chinese does with “F”)
Use match guards for logic beyond plain ranges (as Korean does to split syllables with/without 받침)
Reuse the shared punct_latin_digit() helper for the common “P”/“A”/“N” classes

Step 4: Register the Classification Function

Add a match arm in Language::char_type():

pub fn char_type(&self, c: char) -> &'static str {
    match self {
        Language::Japanese => japanese_char_type(c),
        Language::Chinese => chinese_char_type(c),
        Language::Korean => korean_char_type(c),
        Language::Thai => thai_char_type(c),    // ← new
    }
}

Step 5: Decide on WC Feature Inclusion

In segmenter.rs, the internal attribute builder (write_attributes()) has a match on the language to decide whether to include WC features:

match self.language {
    Language::Japanese | Language::Chinese => {
        // Include WC features
        attr!("WC1:{}{}", w3, c4);
        attr!("WC2:{}{}", c3, w4);
        attr!("WC3:{}{}", w3, c3);
        attr!("WC4:{}{}", w4, c4);
    }
    _ => {}
}

If your language’s character types have enough variety to make WC features informative, add it to the match arm. If your type system is low-entropy (like Korean’s SN/SF), it is better to exclude WC features.

Step 6: Prepare Corpus and Train a Model

Prepare a corpus with words separated by spaces:
```
word1 word2 word3 word4
```

Extract features:

litsea extract -l thai ./corpus.txt ./features.txt

Train a model:

litsea train -t 0.005 -i 1000 ./features.txt ./models/thai.model

Step 7: Add Tests

Add tests in both language.rs and segmenter.rs:

// In language.rs tests
#[test]
fn test_thai_char_types() {
    let lang = Language::Thai;
    assert_eq!(lang.char_type('ก'), "T");   // Thai consonant
    assert_eq!(lang.char_type('A'), "A");   // ASCII
    assert_eq!(lang.char_type('@'), "O");   // Other
}

// In segmenter.rs tests
#[test]
fn test_char_type_thai() {
    let segmenter = Segmenter::new(Language::Thai, None);
    assert_eq!(segmenter.char_type("ก"), "T");
}

Run all tests to verify:

cargo test --workspace

Library API Overview

The litsea crate provides a Rust API for word segmentation, model training, and feature extraction.

Installation

[dependencies]
litsea = "0.5.0"

Loading models from local files is synchronous and needs no async runtime. An async runtime such as tokio is required only when you call the async load_model method, for example to load models over HTTP/HTTPS.

Module Map

graph LR
    A["litsea::segmenter"] --- B["Segmenter"]
    C["litsea::adaboost"] --- D["AdaBoost"]
    E["litsea::language"] --- F["Language"]
    G["litsea::extractor"] --- H["Extractor"]
    I["litsea::trainer"] --- J["Trainer, PosTrainer"]
    K["litsea::error"] --- L["LitseaError, Result"]
    M["litsea::perceptron"] --- N["AveragedPerceptron"]
    O["litsea::upos"] --- P["Upos, SegmentLabel"]
    Q["litsea::metrics"] --- R["BinaryMetrics, MulticlassMetrics"]

Module	Primary Types	Purpose
`litsea::segmenter`	`Segmenter`	Word segmentation, joint segmentation with POS tagging
`litsea::adaboost`	`AdaBoost`	Binary classification, model I/O
`litsea::perceptron`	`AveragedPerceptron`	Multiclass classification (POS tagging), model I/O
`litsea::upos`	`Upos`, `SegmentLabel`	UPOS POS tags, segment labels
`litsea::language`	`Language`	Language definitions, character classification
`litsea::extractor`	`Extractor`	Feature extraction from corpus
`litsea::trainer`	`Trainer`, `PosTrainer`	Training orchestration
`litsea::error`	`LitseaError`, `Result`	Error type and result alias
`litsea::metrics`	`BinaryMetrics`, `MulticlassMetrics`	Evaluation metrics

All primary types are also re-exported at the crate root, so `use litsea::Segmenter;` works as a shorthand for `use litsea::segmenter::Segmenter;`.

Quick Example

use std::path::Path;

use litsea::adaboost::AdaBoost;
use litsea::language::Language;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    let mut learner = AdaBoost::new(0.01, 100);
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    let segmenter = Segmenter::new(Language::Japanese, Some(learner));
    let tokens = segmenter.segment("これはテストです。");

    assert_eq!(tokens, vec!["これ", "は", "テスト", "です", "。"]);
    Ok(())
}

Quick Example (POS Tagging)

use std::path::Path;

use litsea::language::Language;
use litsea::perceptron::AveragedPerceptron;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    let mut pos_learner = AveragedPerceptron::new();
    pos_learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

    let segmenter = Segmenter::with_pos_learner(Language::Japanese, pos_learner);
    let tokens = segmenter.segment_with_pos("これはテストです。");

    for (word, pos) in &tokens {
        print!("{}/{} ", word, pos);
    }
    println!();

    Ok(())
}

API Documentation

Full API documentation is available on docs.rs/litsea.

Segmenter

The Segmenter struct is the primary interface for word segmentation.

Definition

pub struct Segmenter {
    // private: language: Language,
    // private: learner: AdaBoost,
    // private: pos_learner: Option<AveragedPerceptron>,
}

The fields are private; use the accessor methods language(), learner(), learner_mut(), pos_learner(), and pos_learner_mut() to reach them.

Constructor

`Segmenter::new`

pub fn new(language: Language, learner: Option<AdaBoost>) -> Self

Creates a new segmenter.

language – The language for character type classification
learner – An optional pre-trained AdaBoost model. If None, a default (untrained) instance is created.

use litsea::language::Language;
use litsea::segmenter::Segmenter;

// With a pre-trained model
let segmenter = Segmenter::new(Language::Japanese, Some(learner));

// Without a model (for training or feature extraction)
let segmenter = Segmenter::new(Language::Japanese, None);

Methods

`segment`

pub fn segment(&self, sentence: &str) -> Vec<String>

Segments a sentence into words. Returns an empty vector for empty input.

let tokens = segmenter.segment("これはテストです。");
// ["これ", "は", "テスト", "です", "。"]

`char_type`

pub fn char_type(&self, ch: &str) -> &str

Classifies a single character into its type code using language-specific rules. The first character of the &str is classified; an empty string returns "O".

let segmenter = Segmenter::new(Language::Japanese, None);
assert_eq!(segmenter.char_type("あ"), "I");  // Hiragana
assert_eq!(segmenter.char_type("漢"), "H");  // Kanji
assert_eq!(segmenter.char_type("A"), "A");   // ASCII

`add_corpus`

pub fn add_corpus(&mut self, corpus: &str)

Processes a space-separated corpus and adds instances to the internal AdaBoost learner.

let mut segmenter = Segmenter::new(Language::Japanese, None);
segmenter.add_corpus("テスト です");

`add_corpus_with_writer`

pub fn add_corpus_with_writer<F>(&self, corpus: &str, writer: F)
where
    F: FnMut(HashSet<String>, i8),

Processes a corpus and calls the callback for each character position with its feature set and label.

segmenter.add_corpus_with_writer("テスト です", |attrs, label| {
    println!("Features: {:?}, Label: {}", attrs, label);
});

Accessors

pub fn language(&self) -> Language
pub fn learner(&self) -> &AdaBoost
pub fn learner_mut(&mut self) -> &mut AdaBoost
pub fn pos_learner(&self) -> Option<&AveragedPerceptron>
pub fn pos_learner_mut(&mut self) -> Option<&mut AveragedPerceptron>

Provide access to the segmenter’s language and its internal learners.

Feature extraction for a character position (38 features for Korean, 42 for Japanese/Chinese) is an internal detail; the former get_attributes method is now private.

Extractor

The Extractor struct extracts features from a corpus file for model training.

Definition

pub struct Extractor {
    segmenter: Segmenter,
}

Constructor

`Extractor::new`

pub fn new(language: Language) -> Self

Creates a new extractor for the specified language. Internally creates a Segmenter without a pre-trained model.

use litsea::extractor::Extractor;
use litsea::language::Language;

let mut extractor = Extractor::new(Language::Japanese);

Methods

`extract`

pub fn extract(
    &mut self,
    corpus_path: &Path,
    features_path: &Path,
) -> litsea::Result<()>

Reads a corpus file (space-separated words, one sentence per line) and writes the extracted features to the output file.

use std::path::Path;

extractor.extract(
    Path::new("./corpus.txt"),
    Path::new("./features.txt"),
)?;

Pipeline

flowchart LR
    A["corpus.txt<br/>(space-separated words)"] --> B["Extractor::extract()"]
    B --> C["features.txt<br/>(label + features per position)"]

The extractor:

Reads each line from the corpus file
Calls Segmenter::add_corpus_with_writer() to process each line
Writes the label and feature set for each character position to the output file

Trainer

The Trainer struct orchestrates the full model training pipeline.

Definition

pub struct Trainer {
    learner: AdaBoost,
}

Constructor

`Trainer::new`

pub fn new(
    threshold: f64,
    num_iterations: usize,
    features_path: &Path,
) -> litsea::Result<Self>

Creates a trainer and initializes it from a features file. This calls AdaBoost::initialize_features() and AdaBoost::initialize_instances().

use std::path::Path;
use litsea::trainer::Trainer;

let mut trainer = Trainer::new(
    0.005,                           // threshold
    1000,                            // max iterations
    Path::new("./features.txt"),     // features file
)?;

Methods

`load_model`

pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>

Loads an existing model for retraining. Supports file paths, file://, and (with the remote_model feature) http:// and https:// URIs.

When called after Trainer::new, the loaded weights are merged into the freshly initialized training data by feature name, so incremental training starts from the existing model without corrupting the feature index.

trainer.load_model("./models/japanese.model").await?;

`train`

pub fn train(
    &mut self,
    running: Arc<AtomicBool>,
    model_path: &Path,
) -> litsea::Result<BinaryMetrics>

Trains the model and saves it to the specified path. Returns evaluation metrics.

The running flag enables graceful interruption – set it to false to stop training early.

use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::path::Path;

let running = Arc::new(AtomicBool::new(true));
let metrics = trainer.train(running, Path::new("./model.model"))?;

println!("Accuracy: {:.2}%", metrics.accuracy);

Full Training Example

use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::path::Path;

use litsea::trainer::Trainer;

#[tokio::main]
async fn main() -> litsea::Result<()> {
    let mut trainer = Trainer::new(
        0.005,
        1000,
        Path::new("./features.txt"),
    )?;

    // Optionally resume from an existing model
    // trainer.load_model("./models/japanese.model").await?;

    let running = Arc::new(AtomicBool::new(true));
    let metrics = trainer.train(running, Path::new("./model.model"))?;

    println!("Accuracy:  {:.2}%", metrics.accuracy);
    println!("Precision: {:.2}%", metrics.precision);
    println!("Recall:    {:.2}%", metrics.recall);

    Ok(())
}

AdaBoost

The AdaBoost struct implements binary classification for word boundary detection.

Definition

pub struct AdaBoost {
    pub threshold: f64,
    pub num_iterations: usize,
    // internal fields: model weights, features, instances, etc.
}

Constructor

`AdaBoost::new`

pub fn new(threshold: f64, num_iterations: usize) -> Self

Creates a new AdaBoost instance with the specified hyperparameters.

use litsea::adaboost::AdaBoost;

let mut learner = AdaBoost::new(0.01, 100);

Model Loading

`load_model_from_path`

pub fn load_model_from_path(&mut self, path: &Path) -> litsea::Result<()>

Loads model weights from a local file, synchronously. This is the preferred method for local files – no async runtime is needed.

use std::path::Path;

learner.load_model_from_path(Path::new("./models/japanese.model"))?;

`load_model_from_reader`

pub fn load_model_from_reader<R: BufRead>(&mut self, reader: R) -> litsea::Result<()>

Loads model weights from any BufRead source, such as an in-memory buffer or an already-open file.

`load_model`

pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>

Loads model weights from a URI. Supports:

Local file path: ./models/japanese.model
File URI: file:///path/to/model
HTTP: http://example.com/model (requires the remote_model feature)
HTTPS: https://example.com/model (requires the remote_model feature)

learner.load_model("https://example.com/model").await?;

`save_model`

pub fn save_model(&self, filename: &Path) -> litsea::Result<()>

Saves model weights to a file. Returns an error if the model is empty.

Training Methods

`initialize_features`

pub fn initialize_features(&mut self, filename: &Path) -> litsea::Result<()>

Reads a features file and builds the feature index. Must be called before initialize_instances.

`initialize_instances`

pub fn initialize_instances(&mut self, filename: &Path) -> litsea::Result<()>

Reads the same features file and initializes labeled instances with their weights.

`train`

pub fn train(&mut self, running: Arc<AtomicBool>)

Runs the AdaBoost training loop. Set running to false to stop early.
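The early-stop behaviour follows a common cooperative-cancellation pattern: the loop checks a shared AtomicBool before each iteration. The sketch below illustrates only that pattern. The function run_iterations and its step callback are hypothetical, and the real training loop does much more work per iteration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Runs up to `num_iterations` steps, stopping as soon as `running` is false.
/// Returns the number of completed iterations.
fn run_iterations<F: FnMut(usize)>(
    num_iterations: usize,
    running: Arc<AtomicBool>,
    mut step: F,
) -> usize {
    let mut completed = 0;
    for t in 0..num_iterations {
        // Checked once per iteration, so a stop request takes effect
        // at the next iteration boundary rather than mid-step.
        if !running.load(Ordering::SeqCst) {
            break;
        }
        step(t);
        completed += 1;
    }
    completed
}
```

Another thread or a signal handler can hold a clone of the Arc and call store(false, Ordering::SeqCst) to request a stop. Work completed before the flag flips is kept.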

`add_instance`

pub fn add_instance(&mut self, attributes: HashSet<String>, label: i8)

Adds a single training instance with its feature set and label.

Prediction

`predict`

pub fn predict(&self, attributes: &HashSet<String>) -> i8

Predicts the label for a given feature set. Returns +1 (boundary) or -1 (non-boundary).

use std::collections::HashSet;

let mut attrs = HashSet::new();
attrs.insert("UW4:は".to_string());
attrs.insert("UC4:I".to_string());
// ... more features

let label = learner.predict(&attrs);
// label == 1 (boundary) or -1 (non-boundary)

`bias`

pub fn bias(&self) -> f64

Returns the bias term: -sum(all model weights) / 2.0.

Evaluation

`metrics`

pub fn metrics(&self) -> BinaryMetrics

Calculates evaluation metrics on the training data.

BinaryMetrics

Defined in litsea::metrics (also re-exported as litsea::BinaryMetrics):

pub struct BinaryMetrics {
    pub accuracy: f64,          // Accuracy in percentage
    pub precision: f64,         // Precision in percentage
    pub recall: f64,            // Recall in percentage
    pub num_instances: usize,
    pub true_positives: usize,
    pub false_positives: usize,
    pub false_negatives: usize,
    pub true_negatives: usize,
}
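The percentage fields follow from the four confusion counts by the standard definitions. The sketch below shows that arithmetic with a hypothetical summarize function. Returning 0.0 for an empty denominator is an assumption of this sketch, not documented behaviour of the crate.

```rust
/// Ratio as a percentage; 0.0 when the denominator is zero (assumption).
fn percent(numerator: usize, denominator: usize) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        numerator as f64 / denominator as f64 * 100.0
    }
}

/// Returns (accuracy, precision, recall) in percent from the confusion counts.
fn summarize(
    true_positives: usize,
    false_positives: usize,
    false_negatives: usize,
    true_negatives: usize,
) -> (f64, f64, f64) {
    let total = true_positives + false_positives + false_negatives + true_negatives;
    (
        // Accuracy: share of all positions classified correctly.
        percent(true_positives + true_negatives, total),
        // Precision: share of predicted boundaries that are real boundaries.
        percent(true_positives, true_positives + false_positives),
        // Recall: share of real boundaries that were predicted.
        percent(true_positives, true_positives + false_negatives),
    )
}
```

For example, with 8 true positives, 2 false positives, 1 false negative, and 9 true negatives, accuracy is 85%, precision is 80%, and recall is about 88.9%.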

Averaged Perceptron

The AveragedPerceptron struct implements multiclass classification for joint word segmentation and POS tagging.

Definition

pub struct AveragedPerceptron {
    // internal fields: weights, accumulated, timestamps, step, classes, instances
}

Constructor

`AveragedPerceptron::new`

pub fn new() -> Self

Creates a new empty Averaged Perceptron instance.

use litsea::perceptron::AveragedPerceptron;

let mut learner = AveragedPerceptron::new();

Adding Instances

`add_instance`

pub fn add_instance(&mut self, features: HashSet<String>, label: String)

Adds a training instance with a feature set and a label. Unknown classes are automatically registered.

use std::collections::HashSet;
use litsea::perceptron::AveragedPerceptron;

let mut learner = AveragedPerceptron::new();
let mut feats = HashSet::new();
feats.insert("UW4:猫".to_string());
feats.insert("UC4:H".to_string());
learner.add_instance(feats, "B-NOUN".to_string());

Training

`train`

pub fn train(&mut self, num_epochs: usize, running: Arc<AtomicBool>)

Runs the Averaged Perceptron training loop for the given number of epochs. Set running to false to stop early. Weights are automatically averaged at the end of training.

use std::sync::Arc;
use std::sync::atomic::AtomicBool;

let running = Arc::new(AtomicBool::new(true));
learner.train(10, running);

Prediction

`predict`

pub fn predict(&self, features: &HashSet<String>) -> String

Predicts the class label for a given feature set. Computes a score for each class and returns the class name with the highest score. Returns an empty string if no classes are registered.

use std::collections::HashSet;

let mut attrs = HashSet::new();
attrs.insert("UW4:は".to_string());
attrs.insert("UC4:I".to_string());
// ... more features

let label = learner.predict(&attrs);
// label == "B-ADP", "O", etc.
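The per-class scoring described for predict can be sketched as an argmax over summed weights. The weight layout (feature name, then class name, then weight), the explicit classes slice, and the tie-breaking rule are all assumptions of this sketch; the crate's internal representation may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Scores every class and returns the best one, or "" when no classes exist.
/// Assumed layout: weights[feature][class] = weight.
fn predict(
    weights: &HashMap<String, HashMap<String, f64>>,
    classes: &[String],
    features: &HashSet<String>,
) -> String {
    let mut best: Option<(&String, f64)> = None;
    for class in classes {
        // Sum the weights of the active features for this class;
        // unseen features or classes contribute nothing.
        let score: f64 = features
            .iter()
            .filter_map(|f| weights.get(f).and_then(|per_class| per_class.get(class)))
            .sum();
        // Strict comparison: on a tie, the earlier class wins (assumption).
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((class, score));
        }
    }
    best.map(|(c, _)| c.clone()).unwrap_or_default()
}
```

Unlike the binary AdaBoost case there is no sign test here: the label is simply whichever of the classes scores highest.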

Model I/O

`save_model`

pub fn save_model(&self, path: &Path) -> litsea::Result<()>

Saves model weights to a file. Returns an error if the model is empty.

`load_model_from_path`

pub fn load_model_from_path(&mut self, path: &Path) -> litsea::Result<()>

Loads model weights from a local file, synchronously. This is the preferred method for local files – no async runtime is needed.

use std::path::Path;

learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

`load_model_from_reader`

pub fn load_model_from_reader<R: BufRead>(&mut self, reader: R) -> litsea::Result<()>

Loads model weights from any BufRead source, such as an in-memory buffer or an already-open file.

`load_model`

pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>

Loads model weights from a URI. Supports the following URI schemes:

Local file path: ./models/japanese_pos.model
File URI: file:///path/to/model
HTTP: http://example.com/model (requires the remote_model feature)
HTTPS: https://example.com/model (requires the remote_model feature)

learner.load_model("https://example.com/models/japanese_pos.model").await?;

Evaluation

`metrics`

pub fn metrics(&self) -> MulticlassMetrics

Calculates evaluation metrics on the training data.

MulticlassMetrics

Defined in litsea::metrics (also re-exported as litsea::MulticlassMetrics):

pub struct MulticlassMetrics {
    pub accuracy: f64,                                // Overall accuracy in percentage
    pub macro_precision: f64,                         // Macro-averaged precision in percentage
    pub macro_recall: f64,                            // Macro-averaged recall in percentage
    pub num_instances: usize,                         // Number of instances
    pub correct_per_class: HashMap<String, usize>,    // Correct count per class
    pub predicted_per_class: HashMap<String, usize>,  // Predicted count per class
    pub gold_per_class: HashMap<String, usize>,       // Gold label count per class
}

UPOS

The upos module defines the Universal POS (UPOS) tagset and segment label types used for POS tagging.

Upos

Definition

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Upos {
    ADJ,    // Adjective
    ADP,    // Adposition
    ADV,    // Adverb
    AUX,    // Auxiliary
    CCONJ,  // Coordinating conjunction
    DET,    // Determiner
    INTJ,   // Interjection
    NOUN,   // Noun
    NUM,    // Numeral
    PART,   // Particle
    PRON,   // Pronoun
    PROPN,  // Proper noun
    PUNCT,  // Punctuation
    SCONJ,  // Subordinating conjunction
    SYM,    // Symbol
    VERB,   // Verb
    X,      // Other
}

Litsea supports all 17 UPOS tags from the Universal Dependencies project:

Tag	Description	Example (Japanese)
`ADJ`	Adjective	いい, 大きい
`ADP`	Adposition	は, が, を, に
`ADV`	Adverb	とても, まだ
`AUX`	Auxiliary	です, ます, た
`CCONJ`	Coordinating conjunction	と, や
`DET`	Determiner	この, その
`INTJ`	Interjection	ああ, はい
`NOUN`	Noun	天気, 本
`NUM`	Numeral	一, 二, 100
`PART`	Particle	ね, よ
`PRON`	Pronoun	これ, それ
`PROPN`	Proper noun	東京, 太郎
`PUNCT`	Punctuation	。, 、
`SCONJ`	Subordinating conjunction	ので, から
`SYM`	Symbol	%, $
`VERB`	Verb	読む, 書く
`X`	Other	(unclassified tokens)

Constant

`Upos::ALL`

pub const ALL: [Upos; 17]

An array containing all 17 UPOS tags.

Trait Implementations

Display: Converts to a string such as "NOUN", "VERB", etc.
FromStr: Parses a string into Upos. Returns an error for invalid strings.

#![allow(unused)]
fn main() {
use litsea::upos::Upos;

let pos: Upos = "NOUN".parse().unwrap();
assert_eq!(pos.to_string(), "NOUN");
}

SegmentLabel

Definition

The SegmentLabel type combines word boundary detection with POS tagging. Each character position is assigned one of 18 labels:

B(Upos) (17 labels): Word boundary with the given UPOS tag (e.g., B-NOUN, B-VERB)
O (1 label): Non-boundary (continuation of the current word)

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum SegmentLabel {
    B(Upos),  // Start of a word (boundary). Carries POS information.
    O,        // Continuation of a word (non-boundary).
}

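#![allow(unused)]
fn main() {
use litsea::upos::{SegmentLabel, Upos};

// Segment labels for "今日は" (kyou wa), one per character
let labels = [
    SegmentLabel::B(Upos::NOUN), // 今 → B-NOUN (start of "今日", tagged as NOUN)
    SegmentLabel::O,             // 日 → O      (continuation of "今日")
    SegmentLabel::B(Upos::ADP),  // は → B-ADP  (start of "は", tagged as ADP)
];
assert_eq!(labels[0].to_string(), "B-NOUN");
}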

Methods

`all_labels`

pub fn all_labels() -> Vec<SegmentLabel>

Returns a vector containing all 18 segment labels (17 B-* labels plus O).

`is_boundary`

pub fn is_boundary(&self) -> bool

Returns true if this is a boundary label (B-*), and false for the non-boundary label (O).

`pos`

pub fn pos(&self) -> Option<Upos>

Returns the UPOS tag. Returns None for the non-boundary label (O).

Trait Implementations

Display: Converts to a string such as "B-NOUN", "O", etc.
FromStr: Parses a string into SegmentLabel.

#![allow(unused)]
fn main() {
use litsea::upos::{SegmentLabel, Upos};

let label: SegmentLabel = "B-NOUN".parse().unwrap();
assert!(label.is_boundary());
assert_eq!(label.pos(), Some(Upos::NOUN));

let label_o: SegmentLabel = "O".parse().unwrap();
assert!(!label_o.is_boundary());
assert_eq!(label_o.pos(), None);
}
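
To show how a sequence of segment labels turns back into tagged words, here is a self-contained sketch. It uses reduced stand-in enums rather than the real `litsea::upos` types, and the `decode` function is hypothetical: it is not part of the documented API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Upos { NOUN, ADP } // reduced stand-in for the 17-tag enum

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SegmentLabel { B(Upos), O }

/// Groups characters into (word, tag) pairs: every `B` label opens a new
/// word, every `O` label extends the word that is currently open.
fn decode(chars: &[char], labels: &[SegmentLabel]) -> Vec<(String, Upos)> {
    let mut tokens: Vec<(String, Upos)> = Vec::new();
    for (&c, &label) in chars.iter().zip(labels) {
        match label {
            SegmentLabel::B(pos) => tokens.push((c.to_string(), pos)),
            SegmentLabel::O => {
                // Assumption: an `O` with no open word is silently dropped.
                if let Some((word, _)) = tokens.last_mut() {
                    word.push(c);
                }
            }
        }
    }
    tokens
}

fn main() {
    let chars: Vec<char> = "今日は".chars().collect();
    let labels = [
        SegmentLabel::B(Upos::NOUN), // 今
        SegmentLabel::O,             // 日
        SegmentLabel::B(Upos::ADP),  // は
    ];
    for (word, pos) in decode(&chars, &labels) {
        print!("{word}/{pos:?} ");
    }
    println!();
}
```

A single label per character carries both decisions, so segmentation and tagging cannot disagree about where a word starts.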

Language

The Language enum defines language-specific behavior, including character type classification.

Language Enum

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

Traits

Default – Returns Language::Japanese
Display – Returns lowercase name ("japanese", "chinese", "korean")
FromStr – Parses from full name or ISO 639-1 code (case-insensitive)

Parsing

#![allow(unused)]
fn main() {
use litsea::language::Language;

// Full names
let ja: Language = "japanese".parse().unwrap();
let zh: Language = "chinese".parse().unwrap();
let ko: Language = "korean".parse().unwrap();

// ISO 639-1 codes
let ja: Language = "ja".parse().unwrap();
let zh: Language = "zh".parse().unwrap();
let ko: Language = "ko".parse().unwrap();

// Case-insensitive
let ko: Language = "KOREAN".parse().unwrap();

// Invalid
assert!("french".parse::<Language>().is_err());
}
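
The parsing rules above (full name or ISO 639-1 code, case-insensitive) can be sketched with a stand-alone `FromStr` implementation. This is an illustration on a local copy of the enum, not the crate's source; the `String` error type and the error message are assumptions.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

impl FromStr for Language {
    type Err = String; // assumption: the real error type may differ

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercasing first makes both names and codes case-insensitive.
        match s.to_lowercase().as_str() {
            "japanese" | "ja" => Ok(Language::Japanese),
            "chinese" | "zh" => Ok(Language::Chinese),
            "korean" | "ko" => Ok(Language::Korean),
            other => Err(format!("unsupported language: {other}")),
        }
    }
}

fn main() {
    let lang: Language = "ZH".parse().unwrap();
    println!("{lang:?}");
    println!("{:?}", "french".parse::<Language>());
}
```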

`char_type`

pub fn char_type(&self, c: char) -> &'static str

Classifies a character into its language-specific type code. Returns "O" (Other) if the character does not belong to any class.

Classification is a direct match on character ranges – allocation-free, O(1), and with no regex involved.

#![allow(unused)]
fn main() {
use litsea::language::Language;

let lang = Language::Japanese;
assert_eq!(lang.char_type('あ'), "I");
assert_eq!(lang.char_type('漢'), "H");
assert_eq!(lang.char_type('@'), "O");
}

Internally, char_type dispatches to a private per-language function (japanese_char_type, chinese_char_type, korean_char_type). The classes common to all languages – "P" (punctuation), "A" (Latin), and "N" (digits) – are handled by a shared helper that is checked after the language-specific classes.
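
The dispatch order described above, with language-specific classes first and the shared helper as a fallback, might look like the sketch below. The function bodies, the Unicode ranges, the "K" code for katakana, and the punctuation set are all assumptions chosen for illustration; only the "I", "H", "P", "A", "N", and "O" codes come from this page.

```rust
/// Shared classes, consulted only after the language-specific ones.
fn common_char_type(c: char) -> &'static str {
    match c {
        '0'..='9' | '０'..='９' => "N",              // digits
        'a'..='z' | 'A'..='Z' => "A",               // Latin letters
        '。' | '、' | '.' | ',' | '!' | '?' => "P", // assumption: illustrative subset
        _ => "O",                                   // anything else
    }
}

/// Sketch of a Japanese classifier; the ranges are not Litsea's actual tables.
fn japanese_char_type(c: char) -> &'static str {
    match c {
        '\u{3041}'..='\u{3096}' => "I", // hiragana
        '\u{30A1}'..='\u{30FA}' => "K", // katakana (code "K" is assumed)
        '\u{4E00}'..='\u{9FFF}' => "H", // CJK unified ideographs
        _ => common_char_type(c),
    }
}

fn main() {
    for c in ['あ', '漢', 'R', '7', '@'] {
        println!("{c} -> {}", japanese_char_type(c));
    }
}
```

Matching on `char` ranges compiles to plain comparisons, which is why this style needs no allocation, no regex engine, and no setup step.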

CLI Reference Overview

The litsea CLI provides commands for feature extraction, model training, and word segmentation (optionally with POS tagging).

Usage

litsea <COMMAND> [OPTIONS] [ARGS]

Commands

Command	Description
`extract`	Extract features from a corpus for training
`train`	Train a word segmentation model
`segment`	Segment text into words using a trained model

Global Options

Option	Description
`-h`, `--help`	Show help information
`-V`, `--version`	Show version number

Typical Workflow

AdaBoost Workflow (Word Segmentation Only)

flowchart LR
    A["1. scripts/download_udtreebank.sh"] --> B["2. scripts/corpus_udtreebank.sh"]
    B --> C["3. litsea extract"]
    C --> D["4. litsea train"]
    D --> E["5. litsea segment"]

Download a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
Convert to corpus format: bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
Extract features: litsea extract -l japanese corpus.txt features.txt
Train a model: litsea train -t 0.005 -i 1000 features.txt model.model
Segment text: echo "text" | litsea segment -l japanese model.model

POS Workflow (Word Segmentation with POS Tagging)

flowchart LR
    A["1. scripts/download_udtreebank.sh"] --> B["2. scripts/corpus_udtreebank.sh -p"]
    B --> C["3. litsea extract --pos"]
    C --> D["4. litsea train --pos"]
    D --> E["5. litsea segment --pos"]

Download a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
Convert to POS corpus format: bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Extract POS features: litsea extract --pos -l japanese pos_corpus.txt features_pos.txt
Train a POS model: litsea train --pos --num-epochs 10 features_pos.txt model_pos.model
Segment with POS tags: echo "text" | litsea segment --pos -l japanese model_pos.model

extract

Extract features from a corpus file for model training.

Usage

litsea extract [OPTIONS] <CORPUS_FILE> <FEATURES_FILE>

Arguments

Argument	Description
`CORPUS_FILE`	Path to the input corpus file (words separated by spaces, one sentence per line)
`FEATURES_FILE`	Path to the output features file

Options

Option	Default	Description
`-l`, `--language <LANGUAGE>`	`japanese`	Language for character type classification. Accepts: `japanese` / `ja`, `chinese` / `zh`, `korean` / `ko`
`--pos`	off	Enable POS (Part-of-Speech) feature extraction mode. Requires a POS corpus as input

Corpus Format

The input corpus must have words separated by spaces, one sentence per line:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 。
Rust で 実装 さ れ た コンパクト な 単語 分割 ソフトウェア です 。

Output Format

The features file contains one line per character position:

1	UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1	UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...

1 = word boundary
-1 = non-boundary
Features are tab-separated

Examples

# Japanese
litsea extract -l japanese ./corpus.txt ./features.txt

# Chinese
litsea extract -l zh ./corpus_zh.txt ./features_zh.txt

# Korean
litsea extract -l ko ./corpus_ko.txt ./features_ko.txt

Output to stderr on success:

Feature extraction completed successfully.

POS Feature Extraction

When the --pos flag is specified, extract expects a POS corpus instead of a plain word-separated corpus. Each line contains words annotated with UPOS tags in the format word/POS:

POS Corpus Format

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
今日/NOUN は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

POS Feature Output Format

In POS mode, the label column uses segment labels (B-NOUN, B-VERB, …, B-X, O) instead of binary 1/-1:

B-NOUN	UW1:B2 UW2:B1 UW3:こ UW4:れ UW5:は UC1:O UC2:O UC3:I UC4:I ...
O	UW1:B1 UW2:こ UW3:れ UW4:は UW5:テ UC1:O UC2:I UC3:I UC4:I ...

POS Extraction Example

litsea extract --pos -l japanese ./pos_corpus.txt ./pos_features.txt

train

Train a word segmentation model using AdaBoost, or a joint segmentation and POS tagging model using Averaged Perceptron when --pos is specified.

Usage

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Arguments

Argument	Description
`FEATURES_FILE`	Path to the input features file (output from `extract`)
`MODEL_FILE`	Path to the output model file

Options

Option	Default	Description
`-t`, `--threshold <THRESHOLD>`	`0.01`	Weak classifier accuracy threshold for early stopping. Lower values allow more iterations
`-i`, `--num-iterations <NUM_ITERATIONS>`	`100`	Maximum number of boosting iterations
`-m`, `--load-model-uri <LOAD_MODEL_URI>`	None	URI of an existing model to resume training from (file path or HTTP/HTTPS URL)
`--pos`	off	Enable POS (Part-of-Speech) training mode using Averaged Perceptron
`-e`, `--num-epochs <NUM_EPOCHS>`	`10`	Number of training epochs (POS mode only)

Output

Training metrics are printed to stderr:

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

Ctrl+C Handling

Training supports graceful interruption:

First Ctrl+C: Stops training and saves the model at its current state
Second Ctrl+C: Exits immediately without saving

This allows you to stop long-running training sessions without losing progress.

Examples

Basic training:

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training for potentially higher accuracy (lower threshold, more iterations):

litsea train -t 0.001 -i 5000 ./features.txt ./model.model

Retraining from an existing model:

litsea train -t 0.005 -i 1000 -m ./models/japanese.model \
    ./new_features.txt ./models/japanese_v2.model

Hyperparameter Tuning

Parameter	Effect of Decreasing	Effect of Increasing
`threshold`	More iterations, potentially higher accuracy, longer training time	Fewer iterations, faster training, may underfit
`num_iterations`	Fewer boosting rounds, smaller model, may underfit	More rounds, larger model, potentially higher accuracy

POS Model Training

When the --pos flag is specified, train uses the Averaged Perceptron algorithm instead of AdaBoost. This trains a multiclass classifier for joint word segmentation and POS tagging.

Usage

litsea train --pos [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

POS Training Options

Option	Default	Description
`--pos`	off	Enable POS training mode
`-e`, `--num-epochs <NUM_EPOCHS>`	`10`	Number of training epochs

Examples

# Train a POS model from POS features
litsea train --pos -e 10 ./pos_features.txt ./models/japanese_pos.model

Output

POS training metrics are printed to stderr (macro-averaged precision and recall):

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

Ctrl+C Handling

As with AdaBoost training, POS training supports graceful interruption: the first Ctrl+C stops training and saves the model at its current state, and a second Ctrl+C exits immediately without saving.

POS Hyperparameters

Parameter	Effect of Decreasing	Effect of Increasing
`num_epochs`	Faster training, may underfit	Better accuracy, longer training, may overfit

segment

Segment text into words using a trained model.

Usage

echo "text" | litsea segment [OPTIONS] <MODEL_URI>

Arguments

Argument	Description
`MODEL_URI`	Path or URL to the trained model file. Supports: local file paths, `file://`, `http://`, `https://`

Options

Option	Default	Description
`-l`, `--language <LANGUAGE>`	`japanese`	Language for character type classification. Accepts: `japanese` / `ja`, `chinese` / `zh`, `korean` / `ko`
`--pos`	off	Enable POS-tagged segmentation output. Requires a POS model trained with `train --pos`

Input / Output

Input: Reads from stdin, one sentence per line. Empty lines are skipped.
Output: Writes to stdout, space-separated tokens, one line per input line.

Examples

Japanese:

echo "LitseaはTinySegmenterを参考に開発された。" \
  | litsea segment -l japanese ./models/japanese.model

Litsea は TinySegmenter を 参考 に 開発 さ れ た 。

Chinese:

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model

Korean:

echo "한국어 단어 분할 테스트입니다." \
  | litsea segment -l korean ./models/korean.model

Processing a file:

cat input.txt | litsea segment -l japanese ./models/japanese.model > output.txt

Loading a model from a URL:

echo "テスト文です。" \
  | litsea segment -l japanese https://example.com/models/japanese.model

POS-Tagged Segmentation (`--pos`)

When the --pos flag is specified, segmentation and POS tagging are performed simultaneously using an Averaged Perceptron model.

Usage

echo "text" | litsea segment --pos [OPTIONS] <MODEL_URI>

Output Format

Each token is output in word/POS format. POS tags conform to the UPOS tag set.

echo "今日はいい天気ですね。" \
  | litsea segment --pos -l japanese ./models/japanese_pos.model

今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

Processing a File

cat input.txt | litsea segment --pos -l japanese ./models/japanese_pos.model > output.txt

Notes

The --language flag must match the language the model was trained for
Model loading is asynchronous and supports HTTP/HTTPS with TLS (rustls)
The model URI is not restricted to local file paths – file://, http://, and https:// URIs are also accepted
When using --pos, the model must be a POS model trained with train --pos

Training Guide

This guide walks you through training custom word segmentation and POS tagging models with Litsea.

Both workflows use Universal Dependencies (UD) Treebanks as the data source.

Word Segmentation (AdaBoost)

Prepare a corpus from a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
Extract features from the corpus
Train a model using AdaBoost

POS Tagging (Averaged Perceptron)

Prepare a POS corpus from a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Extract POS features: litsea extract --pos -l japanese pos_corpus.txt features.txt
Train a POS model: litsea train --pos --num-epochs 10 features.txt model_pos.model

Additional Topics

Evaluating Models – assess model quality
Retraining Models – fine-tune existing models

Preparing a Corpus

A good training corpus is essential for model accuracy. This guide explains how to prepare one using Universal Dependencies (UD) Treebanks.

Data Source: UD Treebanks

Litsea uses UD Treebanks as the data source for both word segmentation and POS tagging. UD Treebanks provide high-quality, manually annotated data in CoNLL-U format for many languages.

Available Treebanks

Language	Treebank	Repository
Japanese	UD Japanese-GSD	`UD_Japanese-GSD`
Chinese	UD Chinese-GSD	`UD_Chinese-GSD`
Korean	UD Korean-GSD	`UD_Korean-GSD`

Step 1: Download a UD Treebank

Use scripts/download_udtreebank.sh to download a UD Treebank. It prints the path to the training CoNLL-U file to stdout:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

Supported languages: ja (Japanese, default), ko (Korean), zh (Chinese). Use -o to specify the output directory (default: current directory).

Corpus for Word Segmentation

For word segmentation (AdaBoost), the corpus must be a plain text file with:

One sentence per line
Words separated by spaces

太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。

Convert CoNLL-U to Word Segmentation Corpus

Use scripts/corpus_udtreebank.sh to convert a CoNLL-U file to corpus format:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

This converts the CoNLL-U data into space-separated words (one sentence per line).

Corpus for POS Tagging

For POS tagging (Averaged Perceptron), each word must be annotated with its POS tag.

POS Corpus Format

Each line represents one sentence, with words annotated as word/POS pairs separated by spaces:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Litsea/PROPN は/ADP 単語/NOUN 分割/NOUN ソフトウェア/NOUN です/AUX 。/PUNCT

The POS tags follow the Universal POS (UPOS) tagset with 17 categories: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
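
As an illustration of the format, the sketch below splits a POS corpus line into word/tag pairs and expands them into one segment label per character. The function names are hypothetical, the tag is not validated against the UPOS list, and a label is produced for every character including the first; whether the real extractor writes a line for the first character is not shown here.

```rust
/// Splits one POS-corpus line into (word, tag) pairs.
/// `rsplit_once` is used so that a word containing '/' keeps its own slashes.
fn parse_pos_line(line: &str) -> Result<Vec<(&str, &str)>, String> {
    line.split_whitespace()
        .map(|token| {
            token
                .rsplit_once('/')
                .filter(|(word, tag)| !word.is_empty() && !tag.is_empty())
                .ok_or_else(|| format!("malformed token: {token}"))
        })
        .collect()
}

/// Expands (word, tag) pairs into one label string per character:
/// `B-<TAG>` for the first character of a word, `O` for the rest.
fn segment_labels(tokens: &[(&str, &str)]) -> Vec<String> {
    let mut labels = Vec::new();
    for (word, tag) in tokens {
        for (i, _) in word.chars().enumerate() {
            labels.push(if i == 0 { format!("B-{tag}") } else { "O".to_string() });
        }
    }
    labels
}

fn main() -> Result<(), String> {
    let tokens = parse_pos_line("これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT")?;
    println!("{:?}", segment_labels(&tokens));
    Ok(())
}
```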

Convert CoNLL-U to POS Corpus

Use scripts/corpus_udtreebank.sh with the -p flag to produce a POS corpus:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Multi-word tokens and empty nodes are automatically handled during conversion.

Automated Corpus Preparation

Litsea includes helper scripts in the scripts/ directory that automate the UD Treebank download and conversion:

scripts/download_udtreebank.sh – Downloads a UD Treebank and prints the path to the training CoNLL-U file
scripts/corpus_udtreebank.sh – Converts a CoNLL-U file to Litsea corpus format

# Download UD Treebank and get CoNLL-U file path
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

# Generate word segmentation corpus
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

# Generate POS corpus
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Supported languages for download_udtreebank.sh: ja (Japanese, default), ko (Korean), zh (Chinese).

Corpus from Wikipedia Dump

For larger-scale training, you can build a corpus from a full Wikipedia dump using scripts/corpus_wikidump.sh. This extracts plain text with wicket, filters for actual sentences, and tokenizes with lindera.

Usage

# Japanese (default)
bash scripts/corpus_wikidump.sh jawiki-latest-pages-articles.xml.bz2 corpus_ja.txt

# Korean
bash scripts/corpus_wikidump.sh -l ko kowiki-latest-pages-articles.xml.bz2 corpus_ko.txt

# Chinese
bash scripts/corpus_wikidump.sh -l zh zhwiki-latest-pages-articles.xml.bz2 corpus_zh.txt

Options

Option	Description	Default
`-l lang`	Language code: `ja`, `ko`, `zh`	`ja`
`-n max_lines`	Maximum sentence lines to process (0 = unlimited)	`100000`

Sentence Filtering

The script applies two filters to keep only well-formed sentences:

Sentence-ending punctuation – Lines must end with 。, ., !, or ?. This excludes section headers (e.g., “参考文献”), list items, and metadata.
Minimum length – Lines must be at least 20 characters. This excludes short fragments and isolated labels.
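
The two filters can be expressed as a single predicate. The script itself is a shell script, so the Rust below is only a sketch of the rule; `keep_line` is a hypothetical name, and measuring length in Unicode characters rather than bytes is an assumption.

```rust
const MIN_CHARS: usize = 20;
const SENTENCE_ENDINGS: [char; 4] = ['。', '.', '!', '?'];

/// Returns true if the line passes both filters described above.
fn keep_line(line: &str) -> bool {
    let line = line.trim();
    // Filter 1: the last character must be sentence-ending punctuation.
    let ends_like_sentence = line
        .chars()
        .last()
        .map_or(false, |c| SENTENCE_ENDINGS.contains(&c));
    // Filter 2: count characters, not bytes (assumption about the unit).
    let long_enough = line.chars().count() >= MIN_CHARS;
    ends_like_sentence && long_enough
}

fn main() {
    let lines = [
        "参考文献",
        "LitseaはTinySegmenterを参考に開発された。",
        "短い文です。",
    ];
    for line in lines {
        println!("{} {line}", if keep_line(line) { "keep" } else { "drop" });
    }
}
```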

Tokenizer Dictionaries

Language	Dictionary	Token Filter
Japanese (`ja`)	`embedded://unidic`	`japanese_compound_word` (numeral compound)
Korean (`ko`)	`embedded://ko-dic`	None
Chinese (`zh`)	`embedded://cc-cedict`	None

Corpus Size Guidelines

The recommended corpus size depends on your use case:

Size (sentence lines)	Use Case
~10,000	Minimum for prototyping and smoke tests
50,000 – 100,000	Practical range for model training
100,000 – 500,000	High-quality, robust models
Unlimited	Use full dump for maximum accuracy

The default max_lines=100000 in corpus_wikidump.sh sits at the top of the practical range, which is also where the high-quality range begins.

Corpus Quality Tips

Diversity – Include text from various domains (news, literature, web, etc.)
Size – See Corpus Size Guidelines above for recommended sizes
Consistency – Ensure consistent tokenization throughout the corpus
Deduplication – Remove duplicate sentences to avoid bias
Cleaning – Remove HTML tags, special formatting, and non-text content

Extracting Features

After preparing a corpus, the next step is to extract features for model training.

Command

litsea extract -l <LANGUAGE> <CORPUS_FILE> <FEATURES_FILE>

Example

litsea extract -l japanese ./corpus.txt ./features.txt

Output:

Feature extraction completed successfully.

What Happens Internally

flowchart TD
    A["Read corpus line by line"] --> B["Split line into words"]
    B --> C["Build chars, types, and tags arrays"]
    C --> D["For each character position"]
    D --> E["Extract 38-42 features"]
    E --> F["Write label + features to file"]

The Extractor reads each line from the corpus
For each sentence, it creates a Segmenter context with character arrays, type arrays, and tag arrays
For each character position (except the first), it extracts features and writes them with the correct label
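
The labeling step can be sketched as follows. This is not the Extractor's code: `boundary_labels` is a hypothetical helper, and attaching the label to the character that starts a word is an assumption based on the segment label examples in the API reference.

```rust
/// Returns one (character, label) pair per position after the first:
/// 1 if a new word starts at that character, -1 otherwise.
fn boundary_labels(line: &str) -> Vec<(char, i8)> {
    let mut labels = Vec::new();
    for word in line.split_whitespace() {
        for (i, c) in word.chars().enumerate() {
            labels.push((c, if i == 0 { 1 } else { -1 }));
        }
    }
    // The first character has no preceding position to classify, so drop it.
    if !labels.is_empty() {
        labels.remove(0);
    }
    labels
}

fn main() {
    for (c, label) in boundary_labels("太郎 は 走っ た 。") {
        println!("{label:>2}\t{c}");
    }
}
```

A sentence of n characters therefore yields n - 1 training instances, which is part of why the features file grows much faster than the corpus.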

Feature File Format

Each line represents one character position:

1	UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:は ...
-1	UP1:U UP2:U UP3:B BP1:UB BP2:BU UW1:B1 UW2:は UW3:テ ...

First column: label (1 = boundary, -1 = non-boundary)
Remaining columns: features (tab-separated)

POS Feature Extraction

For POS tagging models, use the --pos flag to extract features with POS labels instead of binary boundary labels.

Command

litsea extract --pos -l <LANGUAGE> <CORPUS_FILE> <FEATURES_FILE>

Example

litsea extract --pos -l japanese ./pos_corpus.txt ./features.txt

POS Labels

When extracting POS features, each character position is labeled with one of 18 segment labels instead of the binary 1/-1 labels:

B-NOUN, B-VERB, B-ADJ, B-ADP, B-ADV, B-AUX, B-CCONJ, B-DET, B-INTJ, B-NUM, B-PART, B-PRON, B-PROPN, B-PUNCT, B-SCONJ, B-SYM, B-X – Word boundary with the corresponding POS tag
O – Non-boundary (inside a word)

The feature template (character n-grams, type n-grams, etc.) is the same as for standard segmentation – only the label scheme differs.

POS Feature File Format

B-NOUN	UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:は ...
O	UP1:U UP2:U UP3:B BP1:UB BP2:BU UW1:B1 UW2:は UW3:テ ...
B-VERB	UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:い ...

First column: segment label (e.g., B-NOUN, O)
Remaining columns: features (tab-separated)

File Size Expectations

The features file will be significantly larger than the corpus because each character position generates 38-42 feature strings. For a 1 MB corpus, expect a features file of roughly 50-100 MB.

Training Models

Once features are extracted, train a model using AdaBoost.

Command

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Basic Example

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training Process

flowchart TD
    A["Initialize features<br/>(read feature names)"] --> B["Initialize instances<br/>(read labels + features)"]
    B --> C["AdaBoost training loop"]
    C --> D{"Converged or<br/>max iterations?"}
    D -->|No| C
    D -->|Yes| E["Save model"]
    E --> F["Output metrics"]

Initialize features – Reads the features file to build the feature index
Initialize instances – Reads again to load labeled instances and initial weights
Training loop – Iteratively selects the best feature, updates model weights, and reweights instances
Save model – Writes non-zero feature weights to the model file
Output metrics – Prints accuracy, precision, recall, and confusion matrix

Hyperparameters

Parameter	Flag	Default	Guidance
Threshold	`-t`	0.01	Start with 0.005. Lower values allow more iterations but increase training time
Iterations	`-i`	100	Start with 1000. Increase if accuracy is still improving when training stops

Interpreting Output

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

Accuracy – Percentage of correct predictions (both boundaries and non-boundaries)
Precision – Of predicted boundaries, what fraction is correct
Recall – Of actual boundaries, what fraction was found
True Positives – Correctly predicted boundaries
False Positives – Predicted boundary where there is none
False Negatives – Missed actual boundaries
True Negatives – Correctly predicted non-boundaries

Graceful Interruption

Press Ctrl+C once during training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.

POS Model Training

For training POS tagging models, use the --pos flag. POS models use the Averaged Perceptron algorithm (multiclass classifier) instead of AdaBoost (binary classifier).

POS Training Command

litsea train --pos --num-epochs 10 <FEATURES_FILE> <MODEL_FILE>

POS Training Example

litsea train --pos --num-epochs 10 ./features.txt ./models/japanese_pos.model

Averaged Perceptron vs AdaBoost

Aspect	AdaBoost (Segmentation)	Averaged Perceptron (POS)
Classification	Binary (boundary / non-boundary)	Multiclass (18 segment labels)
Labels	`1`, `-1`	`B-NOUN`, `B-VERB`, …, `O`
Hyperparameters	Threshold, Iterations	Number of epochs
Model size	~1-22 KB	~8-19 MB

POS Hyperparameters

Parameter	Flag	Default	Guidance
Epochs	`--num-epochs`	10	Number of passes over the training data. Start with 10 and adjust based on metrics

POS Training Output

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

Accuracy – Percentage of correct predictions across all classes
Macro Precision – Average precision across all POS classes
Macro Recall – Average recall across all POS classes

POS Graceful Interruption

Press Ctrl+C once during POS training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.

Evaluating Models

Understanding model quality is essential for producing good segmentation results.

Metrics

The train command outputs three key metrics after training:

Accuracy

Accuracy = (TP + TN) / Total Instances

The percentage of all character positions that were correctly classified (both boundaries and non-boundaries). This is the broadest measure of model quality.

Precision

Precision = TP / (TP + FP)

Of the boundaries the model predicted, what fraction was correct. High precision means few false boundaries, that is, little over-segmentation.

Recall

Recall = TP / (TP + FN)

Of the actual boundaries, what fraction did the model find. High recall means few missed boundaries, that is, little under-segmentation.

Confusion Matrix

	Predicted Boundary (+1)	Predicted Non-boundary (-1)
Actual Boundary	True Positive (TP)	False Negative (FN)
Actual Non-boundary	False Positive (FP)	True Negative (TN)
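
The three formulas can be checked against the confusion matrix counts printed for the Japanese model. The `Confusion` struct below is a hypothetical helper written for this check, not a Litsea type.

```rust
/// Counts from a binary confusion matrix (boundary = positive class).
struct Confusion {
    tp: u64,
    fp: u64,
    fn_: u64, // `fn` is a keyword, hence the trailing underscore
    tn: u64,
}

impl Confusion {
    /// (TP + TN) / total, as a percentage.
    fn accuracy(&self) -> f64 {
        let total = self.tp + self.fp + self.fn_ + self.tn;
        100.0 * (self.tp + self.tn) as f64 / total as f64
    }

    /// TP / (TP + FP), as a percentage.
    fn precision(&self) -> f64 {
        100.0 * self.tp as f64 / (self.tp + self.fp) as f64
    }

    /// TP / (TP + FN), as a percentage.
    fn recall(&self) -> f64 {
        100.0 * self.tp as f64 / (self.tp + self.fn_) as f64
    }
}

fn main() {
    // Counts taken from the Japanese training output shown in this guide.
    let m = Confusion { tp: 330_454, fp: 15_304, fn_: 19_761, tn: 233_679 };
    println!("Accuracy:  {:.2}%", m.accuracy());
    println!("Precision: {:.2}%", m.precision());
    println!("Recall:    {:.2}%", m.recall());
}
```

Rounded to two decimals, these counts reproduce the reported 94.15% accuracy, 95.57% precision, and 94.36% recall.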

Pre-trained Model Benchmarks

Model	Accuracy	Precision	Recall	Training Corpus
japanese.model	94.15%	95.57%	94.36%	UD Japanese-GSD
korean.model	85.08%	–	–	UD Korean-GSD
chinese.model	80.72%	–	–	UD Chinese-GSD

Improving Model Quality

If accuracy is unsatisfactory, consider:

More training data – A larger and more diverse corpus
Lower threshold – Try -t 0.001 to allow more boosting iterations
More iterations – Try -i 5000 or higher
Better corpus quality – Ensure consistent tokenization and clean text
Retraining – Start from an existing model and train with additional data (see Retraining Models)

Retraining Models

You can improve an existing model by resuming training with new data.

Command

litsea train -t 0.005 -i 1000 -m <EXISTING_MODEL> <NEW_FEATURES_FILE> <OUTPUT_MODEL>

Example

# Extract features from new corpus
litsea extract -l japanese ./new_corpus.txt ./new_features.txt

# Retrain from existing model
litsea train -t 0.005 -i 1000 \
    -m ./models/japanese.model \
    ./new_features.txt \
    ./models/japanese_v2.model

How It Works

flowchart LR
    A["Existing model<br/>(weights)"] --> C["Trainer"]
    B["New features"] --> C
    C --> D["Retrained model<br/>(updated weights)"]

The trainer initializes features and instances from the new features file
It loads the existing model weights via -m
Training continues with the loaded weights as a starting point
The new model inherits all learned patterns and refines them with new data

Use Cases

Domain adaptation – Fine-tune a general model on domain-specific text (e.g., medical, legal)
Incremental improvement – Add more training data without retraining from scratch
Error correction – Train on examples where the current model makes mistakes

Notes

The output model can be the same path as the input model (overwrites)
The -m flag accepts file paths, file://, http://, and https:// URIs
Retraining starts from the existing weights, so fewer iterations may be needed

Model File Format

Litsea models are stored as simple plain-text files.

Format Specification

<feature_name>\t<weight>
<feature_name>\t<weight>
...
<bias>

Each line (except the last) contains a feature name and its weight, separated by a tab character
Zero-weight features are omitted to keep the file compact
The last line contains the bias term as a single number

Example

BC1:IK	0.3456
BC2:KI	-0.1234
UW4:は	0.5678
UC4:I	0.2345
...
-0.0891

Bias Reconstruction

When loading a model, the bias is reconstructed using:

bias_bucket_weight = -bias_value * 2 - sum(all_feature_weights)

During prediction:

bias = -sum(all_model_weights) / 2.0
score = bias + sum(model[feature] for feature in input_attributes)
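
The two formulas are inverses of each other, which a short round trip makes visible. The sketch below is an interpretation, not the crate's code: it reads all_model_weights as the feature weights plus the bias bucket, all function names are hypothetical, and treating a positive score as a boundary is an assumption.

```rust
use std::collections::BTreeMap;

/// Parses `<feature>\t<weight>` lines followed by a final bias line.
fn parse_model(text: &str) -> Option<(BTreeMap<String, f64>, f64)> {
    let mut weights = BTreeMap::new();
    let mut bias: Option<f64> = None;
    for line in text.lines().filter(|l| !l.trim().is_empty()) {
        match line.split_once('\t') {
            Some((name, w)) => {
                weights.insert(name.to_string(), w.trim().parse::<f64>().ok()?);
            }
            None => bias = Some(line.trim().parse::<f64>().ok()?),
        }
    }
    Some((weights, bias?))
}

/// Reconstructs the bias bucket weight from the stored bias value.
fn bias_bucket_weight(weights: &BTreeMap<String, f64>, bias_value: f64) -> f64 {
    -bias_value * 2.0 - weights.values().sum::<f64>()
}

/// Scores a position: the bias plus the weights of the active features.
/// Features that are absent from the model contribute nothing.
fn score(weights: &BTreeMap<String, f64>, bucket: f64, active: &[&str]) -> f64 {
    let bias = -(weights.values().sum::<f64>() + bucket) / 2.0;
    bias + active.iter().filter_map(|f| weights.get(*f)).sum::<f64>()
}

fn main() {
    let text = "BC1:IK\t0.3456\nBC2:KI\t-0.1234\nUW4:は\t0.5678\nUC4:I\t0.2345\n-0.0891\n";
    let (weights, bias_value) = parse_model(text).expect("well-formed model text");
    let bucket = bias_bucket_weight(&weights, bias_value);
    let s = score(&weights, bucket, &["UW4:は", "UC4:I", "XX:unseen"]);
    // Assumption: a positive score is read as "boundary".
    println!("score = {s:.4}, boundary = {}", s > 0.0);
}
```

Substituting the first formula into the second gives bias = bias_value, so the stored last line is recovered exactly (up to floating-point rounding).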

File Size

Model files are very compact:

Model	Size	Training Data / Source
japanese.model	~2.9 KB	UD Japanese-GSD
korean.model	~1.8 KB	UD Korean-GSD
chinese.model	~1.3 KB	UD Chinese-GSD
RWCP.model	~22 KB	Original TinySegmenter
JEITA_Genpaku_ChaSen_IPAdic.model	~17 KB	JEITA corpus

The compact size is a key advantage of Litsea – models can be embedded directly in applications or served over HTTP with minimal overhead.

Compatibility

Model files are encoding-agnostic (feature names are stored as-is)
The format is deterministic (features are sorted via BTreeMap)
Models are forward-compatible – new features in the input that are not in the model are simply ignored during prediction

Remote Model Loading

Litsea supports loading models from HTTP/HTTPS URLs in addition to local files.

Supported URI Schemes

Scheme	Example	Description
(none)	`./model.model`	Local file path (default)
`file://`	`file:///path/to/model`	Explicit file URI
`http://`	`http://example.com/model`	HTTP URL
`https://`	`https://example.com/model`	HTTPS URL
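
The table amounts to a small dispatch on the URI prefix. The sketch below illustrates that dispatch with hypothetical names (`ModelSource`, `classify`); it is not Litsea's loader, and rejecting unknown schemes with an error is an assumption about behavior the table does not specify.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ModelSource<'a> {
    /// A path on the local file system (bare path or `file://` URI).
    LocalPath(&'a str),
    /// An `http://` or `https://` URL to fetch.
    Remote(&'a str),
}

fn classify(uri: &str) -> Result<ModelSource<'_>, String> {
    if let Some(path) = uri.strip_prefix("file://") {
        Ok(ModelSource::LocalPath(path))
    } else if uri.starts_with("http://") || uri.starts_with("https://") {
        Ok(ModelSource::Remote(uri))
    } else if uri.contains("://") {
        // Assumption: any other scheme is rejected.
        Err(format!("unsupported URI scheme: {uri}"))
    } else {
        Ok(ModelSource::LocalPath(uri))
    }
}

fn main() {
    for uri in [
        "./model.model",
        "file:///path/to/model",
        "https://example.com/model",
        "ftp://example.com/model",
    ] {
        println!("{uri} -> {:?}", classify(uri));
    }
}
```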

CLI Usage

echo "テスト" | litsea segment -l japanese https://example.com/japanese.model

Library Usage

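use std::path::Path;

use litsea::adaboost::AdaBoost; // module path assumed; adjust to where AdaBoost is exported

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut learner = AdaBoost::new(0.01, 100);

    // Local file: synchronous, no async runtime needed
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    // HTTP/HTTPS URL: async, so it must be awaited inside a runtime (tokio here)
    learner.load_model("https://example.com/models/japanese.model").await?;

    Ok(())
}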

Implementation Details

HTTP client: reqwest with rustls (no OpenSSL dependency)
Custom User-Agent: Litsea/<version>
The load_model method is async because HTTP loading requires an async runtime
For the CLI, tokio provides the async runtime

WASM Considerations

On wasm32 targets:

Local file paths are not supported – file system access is unavailable
file:// scheme is not supported
HTTP/HTTPS loading works via the browser’s fetch API (through reqwest’s WASM support)

Error messages guide users to use URLs instead of file paths when running in WASM.

Benchmarking

Litsea includes a Criterion benchmark suite for measuring performance.

Running Benchmarks

cargo bench --bench bench

Or via the Makefile:

make bench

Benchmark Suite

The benchmarks are defined in litsea/benches/bench.rs:

Benchmark	Description
`segment_short/adaboost/{ja,zh,ko}`	Segment a short sentence (AdaBoost)
`segment_short/averaged_perceptron/{ja,zh,ko}`	Segment + POS tag a short sentence
`segment_long_japanese/{adaboost,averaged_perceptron}`	Process the full text of the novel Botchan (bocchan.txt, ~307 KB)
`get_type_hiragana`	Character type classification
`add_corpus`	Corpus ingestion for training
`predict_adaboost`	Single AdaBoost prediction

Models are loaded synchronously with load_model_from_path, so no async runtime is involved in the benchmarks.

HTML Reports

Criterion generates detailed HTML reports with statistics and comparison graphs at:

target/criterion/report/index.html

Open this file in a browser after running benchmarks to view:

Iteration times with confidence intervals
Throughput measurements
Comparison with previous runs (automatic regression detection)

Interpreting Results

Key performance factors:

Segmentation is linear in input length (O(n))
Character classification is a direct match on character ranges (a few nanoseconds; no setup cost)
Prediction at each position depends on the number of features (38-42, constant)
Model loading time is proportional to the model file size

Pre-trained Models

Litsea ships with several pre-trained models in the models/ directory.

Model Catalog

japanese.model

Property	Value
Language	Japanese
Training Corpus	UD Japanese-GSD
Accuracy	94.15%
Precision	95.57%
Recall	94.36%
File Size	~2.9 KB

korean.model

Property	Value
Language	Korean
Training Corpus	UD Korean-GSD
Accuracy	85.08%
File Size	~1.8 KB

chinese.model

Property	Value
Language	Chinese (Simplified & Traditional)
Training Corpus	UD Chinese-GSD
Accuracy	80.72%
File Size	~1.3 KB

RWCP.model

Property	Value
Language	Japanese
Source	Extracted from the original TinySegmenter
License	BSD 3-Clause (Taku Kudo)
File Size	~22 KB

JEITA_Genpaku_ChaSen_IPAdic.model

Property	Value
Language	Japanese
Training Corpus	JEITA Project Sugita Genpaku corpus
Tokenizer	ChaSen with IPAdic
File Size	~17 KB

POS Tagging Models

japanese_pos.model

Property	Value
Language	Japanese
Algorithm	Averaged Perceptron
Training Corpus	UD Japanese-GSD (7,050 sentences)
Epochs	10
Accuracy	98.34%
Macro Precision	97.87%
Macro Recall	91.67%
File Size	~11 MB

chinese_pos.model

Property	Value
Language	Chinese (Simplified & Traditional)
Algorithm	Averaged Perceptron
Training Corpus	UD Chinese-GSD (3,997 sentences)
Epochs	10
Accuracy	97.09%
Macro Precision	97.31%
Macro Recall	96.23%
File Size	~19 MB

korean_pos.model

Property	Value
Language	Korean
Algorithm	Averaged Perceptron
Training Corpus	UD Korean-GSD (4,400 sentences)
Epochs	10
Accuracy	95.33%
Macro Precision	95.30%
Macro Recall	87.69%
File Size	~8.4 MB
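For readers unfamiliar with the macro-averaged metrics in the tables above, the following sketch computes accuracy, macro precision, and macro recall from a confusion matrix. It uses the textbook definitions. Whether Litsea's evaluation uses exactly these conventions (for example, how classes with no predictions are handled) is an assumption, and the function names are hypothetical.

```rust
/// Overall accuracy from a square confusion matrix, where `m[gold][pred]`
/// counts items of class `gold` that were predicted as class `pred`.
fn accuracy(m: &[Vec<u64>]) -> f64 {
    let correct: u64 = (0..m.len()).map(|k| m[k][k]).sum();
    let total: u64 = m.iter().flatten().sum();
    if total == 0 { 0.0 } else { correct as f64 / total as f64 }
}

/// Macro-averaged (precision, recall): compute each metric per class,
/// then take the unweighted mean over classes.
/// Assumption: classes with a zero denominator are skipped.
fn macro_precision_recall(m: &[Vec<u64>]) -> (f64, f64) {
    let n = m.len();
    let (mut p_sum, mut p_cnt) = (0.0_f64, 0u32);
    let (mut r_sum, mut r_cnt) = (0.0_f64, 0u32);
    for k in 0..n {
        let tp = m[k][k] as f64;
        // Column sum: everything predicted as class k.
        let predicted: u64 = (0..n).map(|g| m[g][k]).sum();
        // Row sum: everything whose gold label is class k.
        let actual: u64 = m[k].iter().sum();
        if predicted > 0 {
            p_sum += tp / predicted as f64;
            p_cnt += 1;
        }
        if actual > 0 {
            r_sum += tp / actual as f64;
            r_cnt += 1;
        }
    }
    (p_sum / p_cnt.max(1) as f64, r_sum / r_cnt.max(1) as f64)
}
```

Because the macro average weights every class equally, a rare tag that is often missed lowers macro recall far more than it lowers accuracy. This is one plausible reading of the gap between accuracy and macro recall in the Japanese and Korean tables.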

Usage

echo "これはテストです。" | litsea segment --pos -l japanese models/japanese_pos.model

Output:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
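If you post-process this output in Rust, a small parser is enough. The sketch below assumes the format shown above: pairs separated by whitespace, with the tag following the last `/` in each pair. The function name `parse_tagged_line` is hypothetical and not part of the Litsea API.

```rust
/// Split one line of `token/TAG` pairs into (token, tag) tuples.
/// `rsplit_once` splits at the last '/', so a token that itself contains
/// a slash keeps it. Pairs without any '/' are skipped.
fn parse_tagged_line(line: &str) -> Vec<(String, String)> {
    line.split_whitespace()
        .filter_map(|pair| pair.rsplit_once('/'))
        .map(|(token, tag)| (token.to_string(), tag.to_string()))
        .collect()
}
```

Passing the output line above to `parse_tagged_line` gives five pairs, starting with `("これ", "PRON")`.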

Choosing a Model

For Japanese, use japanese.model for the best accuracy, or RWCP.model for compatibility with the original TinySegmenter
For Chinese, use chinese.model
For Korean, use korean.model
For POS tagging, use the corresponding *_pos.model (japanese_pos.model, chinese_pos.model, korean_pos.model) for joint word segmentation and POS tagging
For domain-specific needs, consider training your own model or retraining an existing one
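The recommendations above can be encoded as a small lookup. This is a sketch only: the `Language` enum and the `default_model` function are hypothetical helpers and not part of Litsea. Only the file names come from the catalog.

```rust
/// Languages covered by the pre-trained models listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Japanese,
    Chinese,
    Korean,
}

/// Return the recommended model file name for a language.
/// `pos = true` selects the joint segmentation and POS tagging model.
fn default_model(lang: Language, pos: bool) -> &'static str {
    match (lang, pos) {
        (Language::Japanese, false) => "japanese.model",
        (Language::Japanese, true) => "japanese_pos.model",
        (Language::Chinese, false) => "chinese.model",
        (Language::Chinese, true) => "chinese_pos.model",
        (Language::Korean, false) => "korean.model",
        (Language::Korean, true) => "korean_pos.model",
    }
}
```

Prefix the returned name with the directory that holds your models, for example `models/` as in the usage example. For TinySegmenter compatibility, substitute RWCP.model by hand.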

Sample Data

The resources/ directory also contains sample data:

bocchan.txt – Sample Japanese corpus from the novel “Botchan” by Natsume Soseki (~307 KB). Used for benchmarking.

License

Litsea is distributed under two licenses, each covering a different part of the code.

MIT License

The main Litsea codebase is licensed under the MIT License:

MIT License

Copyright (c) 2025 Minoru OSUKA
Copyright (c) 2022 ICHINOSE Shogo

BSD 3-Clause License

Code originally developed by Taku Kudo (TinySegmenter) is licensed under the BSD 3-Clause License:

Copyright (c) 2008, Taku Kudo
All rights reserved.

Full License Text

The complete license text is available in the LICENSE file in the repository.
