Introduction

Litsea is an extremely compact word segmentation library implemented in Rust, inspired by TinySegmenter and TinySegmenterMaker.

Unlike traditional morphological analyzers such as MeCab and Lindera, Litsea does not rely on large-scale dictionaries. Instead, it performs word segmentation using a compact pre-trained model based on the AdaBoost binary classification algorithm. Litsea also supports joint word segmentation and POS (Part-of-Speech) tagging using the Averaged Perceptron multiclass classifier with the Universal POS (UPOS) tagset.

Key Features

  • Fast and safe Rust implementation – built with Rust’s safety guarantees and performance
  • Compact pre-trained models – segmentation model files are typically 1–22 KB
  • No dictionary dependency – segmentation is driven entirely by a statistical model
  • POS tagging – joint segmentation and Part-of-Speech tagging with UPOS tags via Averaged Perceptron
  • Multilingual support – Japanese, Chinese (Simplified/Traditional), and Korean
  • Model training capabilities – train custom models using AdaBoost or Averaged Perceptron with your own corpora
  • Remote model loading – load models from HTTP/HTTPS URLs or local files
  • Simple and extensible API – easy to integrate into Rust projects as a library

How It Works

Litsea treats word segmentation as a binary classification problem: for each character position in a sentence, the model predicts whether it is a word boundary (+1) or not a boundary (-1). The classifier uses character n-gram features and character type information specific to each language.

Input:  "LitseaはRust製です"
         ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
         O O O O B O B O B   ← boundary predictions
Output: ["Litsea", "は", "Rust製", "です"]
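The step from per-character boundary predictions to output words can be sketched as below. This is a standalone illustration, not Litsea's internal code: the function name `split_at_boundaries` and the boolean encoding of predictions (`true` for +1, `false` for -1) are assumptions made for the example.

```rust
/// Splits `text` into words given one boundary flag per character.
/// `boundaries[i] == true` means character `i` starts a new word.
/// Missing flags are treated as "not a boundary".
pub fn split_at_boundaries(text: &str, boundaries: &[bool]) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, ch) in text.chars().enumerate() {
        let is_boundary = boundaries.get(i).copied().unwrap_or(false);
        // A boundary closes the word collected so far, if there is one.
        if is_boundary && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.push(ch);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

With boundaries at the characters "は", "R", and "で", the input "LitseaはRust製です" yields the four words shown in the diagram.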

POS Tagging

In addition to plain word segmentation, Litsea supports POS (Part-of-Speech) tagging. Using the Averaged Perceptron multiclass classifier, it segments and tags text in a single pass.

For each character position, the model predicts one of 18 SegmentLabel classes:

  • B-NOUN, B-VERB, …, B-X (boundary labels for 17 POS tags)
  • O (non-boundary = continuation of the current word)

The POS tags follow the Universal Dependencies UPOS tagset (17 POS tags).

Input:  "今日はいい天気ですね。"
Output: 今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT
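As a rough sketch of how one label per character turns into tagged words, the following standalone function decodes a label sequence into (word, tag) pairs. The name `decode_labels` is hypothetical, labels are passed as plain strings rather than Litsea's `SegmentLabel` type, and the sketch assumes the first character always carries a `B-` label.

```rust
/// Decodes one label per character ("B-NOUN", "O", ...) into (word, tag) pairs.
/// Returns `None` if the label count differs from the character count or if
/// the sequence starts with a continuation label.
pub fn decode_labels(text: &str, labels: &[&str]) -> Option<Vec<(String, String)>> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() != labels.len() {
        return None;
    }
    let mut tokens: Vec<(String, String)> = Vec::new();
    for (ch, label) in chars.iter().zip(labels) {
        if let Some(tag) = label.strip_prefix("B-") {
            // A boundary label starts a new word that carries the POS tag.
            tokens.push((ch.to_string(), tag.to_string()));
        } else {
            // Any other label ("O") extends the current word.
            tokens.last_mut()?.0.push(*ch);
        }
    }
    Some(tokens)
}
```

Joining each pair as `word/TAG` reproduces the output format shown above.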

Name Origin

There is a small plant called Litsea cubeba (Aomoji) in the same Lauraceae family as Lindera (Kuromoji). This is the origin of the name Litsea.

Current Version

Litsea v0.5.0 – Rust Edition 2024, minimum Rust version 1.87.

Getting Started

Welcome to Litsea! This section will help you get up and running quickly.

Litsea is a compact word segmentation library in Rust that supports both word segmentation (AdaBoost) and joint segmentation with POS tagging (Averaged Perceptron).

Installation

Prerequisites

  • Rust 1.87 or later (stable channel) from rust-lang.org
  • Cargo (Rust’s package manager, included with Rust)

Installing the CLI Tool

From crates.io

cargo install litsea-cli

From Source

git clone https://github.com/mosuka/litsea.git
cd litsea
cargo build --release

The binary will be available at ./target/release/litsea.

Verify the installation:

./target/release/litsea --help

Using as a Library

Add Litsea to your project’s Cargo.toml:

[dependencies]
litsea = "0.5.0"

Note: Loading models from local files (load_model_from_path) is synchronous, so no async runtime is needed. An async runtime such as tokio is only required if you load models over HTTP/HTTPS with the async load_model method (enabled by the remote_model feature, which is on by default).

Supported Platforms

Litsea is tested on the following platforms:

| OS | Architecture |
|---|---|
| Linux | x86_64, aarch64 |
| macOS | x86_64 (Intel), aarch64 (Apple Silicon) |
| Windows | x86_64, aarch64 |

Quick Start

CLI Quick Start

Segmenting Text

Litsea ships with pre-trained models in the models/ directory. Pipe text into the segment command:

Japanese:

echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" \
  | litsea segment -l japanese ./models/japanese.model

Output:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。

Chinese:

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model

Korean:

echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model

POS Tagging

Litsea can perform joint word segmentation and POS tagging using a POS model. Add the --pos flag to the segment command:

echo "今日はいい天気ですね。" \
  | litsea segment --pos -l japanese ./models/japanese_pos.model

Output:

今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

Each token is annotated with a Universal POS (UPOS) tag.

Library Quick Start

Here is a minimal Rust program that loads a model and segments text:

use std::path::Path;

use litsea::adaboost::AdaBoost;
use litsea::language::Language;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    // Load the pre-trained model
    let mut learner = AdaBoost::new(0.01, 100);
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    // Create a segmenter
    let segmenter = Segmenter::new(Language::Japanese, Some(learner));

    // Segment text
    let tokens = segmenter.segment("これはテストです。");
    println!("{}", tokens.join(" "));
    // Output: これ は テスト です 。

    Ok(())
}

POS Tagging with the Library

Here is a minimal Rust program that loads a POS model and segments text with POS tags:

use std::path::Path;

use litsea::language::Language;
use litsea::perceptron::AveragedPerceptron;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    // Load the pre-trained POS model
    let mut pos_learner = AveragedPerceptron::new();
    pos_learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

    // Create a segmenter with POS support
    let segmenter = Segmenter::with_pos_learner(Language::Japanese, pos_learner);

    // Segment text with POS tags
    let tokens = segmenter.segment_with_pos("今日はいい天気ですね。");
    for (word, pos) in &tokens {
        print!("{}/{} ", word, pos);
    }
    // Output: 今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

    Ok(())
}

Architecture Overview

Litsea is designed as a compact, dictionary-free word segmentation system. It treats word segmentation as a binary classification problem and uses AdaBoost to learn word boundary patterns from character-level features.

High-Level Data Flow

Litsea has two main workflows: training and segmentation.

Training Pipeline

flowchart LR
    A["Corpus (text)"] --> B["Extractor"]
    B --> C["Features File (.txt)"]
    C --> D["Trainer (AdaBoost)"]
    D --> E["Model File (.model)"]

  1. Corpus preparation – Prepare text with words separated by spaces
  2. Feature extraction – The Extractor reads the corpus, classifies characters by type, and outputs labeled feature vectors
  3. Model training – The Trainer feeds features into AdaBoost, which iteratively selects the most informative features and produces a compact model

Segmentation Pipeline

flowchart LR
    F["Raw text"] --> G["Segmenter (AdaBoost)"]
    H["Model file"] --> G
    G --> I["Segmented words"]

  1. Model loading – Load a pre-trained model (from file or URL)
  2. Character classification – For each character in the input, determine its type code based on language-specific patterns
  3. Feature extraction – Build a feature set for each character position using a sliding window
  4. Prediction – AdaBoost predicts whether each position is a word boundary

Design Principles

  • No dictionary dependency – Unlike MeCab or Lindera, Litsea relies solely on a statistical model learned from character patterns
  • Compact models – Model files are typically 1-22 KB, containing only the feature weights that matter
  • Language-agnostic framework – The core algorithm is the same for all languages; only the character type patterns differ
  • Simple extensibility – Adding a new language requires only defining character type patterns and training a model

Workspace Structure

Litsea is organized as a Cargo workspace with two crates and supporting directories.

Directory Layout

litsea/
├── Cargo.toml              # Workspace manifest
├── Cargo.lock              # Dependency lock file
├── Makefile                # Build convenience targets
├── rustfmt.toml            # Rust formatting configuration
├── LICENSE                 # MIT
├── README.md               # Project overview
├── litsea/                 # Core library crate
│   ├── Cargo.toml
│   ├── src/
│   │   ├── lib.rs          # Module declarations and version
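│   │   ├── adaboost.rs     # AdaBoost algorithm
│   │   ├── perceptron.rs   # Averaged Perceptron algorithm
│   │   ├── segmenter.rs    # Word segmentation and POS tagging
│   │   ├── upos.rs         # UPOS tags and segment labels
│   │   ├── extractor.rs    # Feature extraction from corpus
│   │   ├── trainer.rs      # Training orchestration
│   │   ├── language.rs     # Language definitions and char patterns
│   │   ├── metrics.rs      # Evaluation metrics
│   │   ├── error.rs        # LitseaError / Result
│   │   └── model_io.rs     # Model URI loading (private)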
│   └── benches/
│       └── bench.rs        # Criterion benchmarks
├── litsea-cli/             # CLI binary crate
│   ├── Cargo.toml
│   └── src/
│       └── main.rs         # CLI entry point
├── models/                 # Pre-trained models
│   ├── japanese.model
│   ├── chinese.model
│   ├── korean.model
│   ├── RWCP.model
│   └── JEITA_Genpaku_ChaSen_IPAdic.model
├── resources/              # Sample data and test fixtures
│   └── bocchan.txt         # Sample corpus
├── scripts/                # Corpus preparation utilities
│   ├── download_udtreebank.sh      # Download UD Treebanks (prints CoNLL-U file path)
│   ├── corpus_udtreebank.sh           # Convert CoNLL-U to Litsea corpus format
│   └── wikitexts.sh        # Download and prepare Wikipedia text data
├── docs/                   # mdbook documentation (this book)
└── .github/
    └── workflows/          # CI/CD pipelines
        ├── regression.yml  # Test on push/PR
        ├── release.yml     # Release builds and publishing
        └── periodic.yml    # Weekly stability tests

Crate Details

litsea (Core Library)

The core library provides all segmentation, training, and model I/O functionality.

| Dependency | Version | Purpose |
|---|---|---|
| thiserror | 2.0 | Error type derivation |
| reqwest | 0.13 | HTTP/HTTPS model loading (rustls) |
| tokio | 1.49 | Async runtime for remote model loading |
| criterion | 0.8 | Benchmarking (dev dependency) |
| tempfile | 3.25 | Temporary files for tests (dev dependency) |

litsea-cli (CLI Binary)

The CLI provides a command-line interface to Litsea’s functionality.

| Dependency | Version | Purpose |
|---|---|---|
| clap | 4.5 | Command-line argument parsing |
| ctrlc | 3.5 | Graceful Ctrl+C handling during training |
| tokio | 1.49 | Async runtime |
| litsea | 0.5 | Core library (workspace member) |

Workspace Configuration

The workspace uses Cargo resolver version 3 (Rust Edition 2024):

[workspace]
resolver = "3"
members = ["litsea", "litsea-cli"]

[workspace.package]
version = "0.5.0"
edition = "2024"
rust-version = "1.87"

Shared dependencies are defined at the workspace level in [workspace.dependencies] and referenced by each crate with { workspace = true }.

Module Design

The litsea library crate is organized into focused modules, each with a clear responsibility.

Module Dependency Graph

graph TD
    language["language.rs<br/>Character classification"]
    segmenter["segmenter.rs<br/>Segmentation + POS tagging"]
    adaboost["adaboost.rs<br/>AdaBoost (boundaries)"]
    perceptron["perceptron.rs<br/>Averaged Perceptron (POS)"]
    upos["upos.rs<br/>UPOS tags and labels"]
    extractor["extractor.rs<br/>Feature extraction"]
    trainer["trainer.rs<br/>Training orchestration"]
    model_io["model_io.rs (private)<br/>Model URI loading"]
    error["error.rs<br/>LitseaError / Result"]
    metrics["metrics.rs<br/>Evaluation metrics"]

    language --> segmenter
    upos --> segmenter
    adaboost --> segmenter
    perceptron --> segmenter
    segmenter --> extractor
    adaboost --> trainer
    perceptron --> trainer
    model_io --> adaboost
    model_io --> perceptron
    error --> adaboost
    error --> perceptron
    metrics --> trainer

Module Details

language.rs – Language Definitions

Defines the Language enum and character type classification.

  • Language – Enum with variants Japanese, Chinese, Korean
    • Implements FromStr (parses "japanese", "ja", "chinese", "zh", "korean", "ko")
    • Implements Display (outputs lowercase name)
    • char_type(c: char) -> &'static str – Classifies a character with a direct match on character ranges (allocation-free; no regex). Language-specific functions (japanese_char_type, etc.) share a punct_latin_digit() helper for the common "P"/"A"/"N" classes.

segmenter.rs – Word Segmentation and POS Tagging

The main user-facing module.

  • Segmenter – Holds a Language, an AdaBoost learner, and an optional AveragedPerceptron POS learner (fields are private; use language(), learner(), learner_mut(), pos_learner(), pos_learner_mut())
    • new(language, learner) – Create a segmenter with an optional pre-trained model
    • with_pos_learner(language, pos_learner) – Create a segmenter for joint segmentation + POS tagging
    • segment(sentence) – Segment text into words, returns Vec<String>
    • segment_with_pos(sentence) – Segment and tag, returns Vec<(String, Upos)>
    • char_type(ch) – Classify a single character into its type code
    • add_corpus(corpus) / add_corpus_with_pos(corpus) – Add training data
    • add_corpus_with_writer(corpus, callback) / add_corpus_with_pos_writer(corpus, callback) – Process a corpus with a custom callback

adaboost.rs – AdaBoost Algorithm

The binary classifier used for word boundary decisions.

  • AdaBoost
    • new(threshold, num_iterations) – Create with training parameters
    • initialize_features(path) / initialize_instances(path) – Load training data
    • train(running) – Run the AdaBoost training loop
    • predict(&attributes) – Predict boundary (+1) or non-boundary (-1)
    • load_model(uri) (async) / load_model_from_path(path) / load_model_from_reader(reader) – Load model weights
    • save_model(path) – Save model weights to a file
    • metrics() – Calculate accuracy, precision, and recall (BinaryMetrics)
    • bias() – Get the model’s bias term

perceptron.rs – Averaged Perceptron

The multiclass classifier used for joint segmentation + POS tagging.

  • AveragedPerceptron
    • add_instance(features, label) – Add a training instance
    • train(num_epochs, running) – Train with weight averaging
    • predict(&features) – Predict the best class label
    • load_model(uri) (async) / load_model_from_path(path) / load_model_from_reader(reader) – Load model weights
    • save_model(path) – Save model weights
    • metrics() – Macro-averaged evaluation (MulticlassMetrics)
  • Weights are stored in a feature → per-class vector layout for fast inference.

upos.rs – Universal POS Tags

  • Upos – The 17 Universal Dependencies POS tags (NOUN, VERB, …)
  • SegmentLabel – Combined segmentation + POS label per character position (B(Upos) or O), with Display/FromStr for the "B-NOUN" / "O" string form

extractor.rs – Feature Extraction

Extracts features from a corpus for model training.

  • Extractor – Wraps a Segmenter to process corpus files
    • new(language) – Create an extractor for a specific language
    • extract(corpus_path, features_path) – Read a corpus, write a features file
    • extract_with_pos(corpus_path, features_path) – Same for POS-tagged corpora

trainer.rs – Training Orchestration

High-level training workflows.

  • Trainer – Segmentation model training (AdaBoost)
    • new(threshold, num_iterations, features_path) – Initialize from a features file
    • load_model(uri) – Optionally load an existing model for incremental training (async)
    • train(running, model_path) – Train and save, returns BinaryMetrics
  • PosTrainer – POS model training (Averaged Perceptron)
    • new(num_epochs, features_path) / load_model(uri) / train(running, model_path) returning MulticlassMetrics

error.rs – Error Handling

  • LitseaError – Error enum (Io, InvalidData, InvalidInput, Unsupported, and Download with the remote_model feature)
  • Result<T> – Alias used by every fallible API

metrics.rs – Evaluation Metrics

  • BinaryMetrics – Accuracy, precision, recall, confusion matrix (AdaBoost)
  • MulticlassMetrics – Accuracy and macro-averaged precision/recall (Averaged Perceptron)

model_io.rs – Model Loading I/O (private)

Internal module that resolves a model URI (plain path, file://, or http(s):// with the remote_model feature) and returns the raw model bytes. Not part of the public API.

Public Exports

The library’s lib.rs exposes the public modules and re-exports the main types:

pub mod adaboost;
pub mod error;
pub mod extractor;
pub mod language;
pub mod metrics;
mod model_io;
pub mod perceptron;
pub mod segmenter;
pub mod trainer;
pub mod upos;

pub use adaboost::AdaBoost;
pub use error::{LitseaError, Result};
pub use extractor::Extractor;
pub use language::Language;
pub use metrics::{BinaryMetrics, MulticlassMetrics};
pub use perceptron::AveragedPerceptron;
pub use segmenter::Segmenter;
pub use trainer::{PosTrainer, Trainer};
pub use upos::{SegmentLabel, Upos};

pub fn version() -> &'static str { ... }

AdaBoost Binary Classification

Litsea uses the AdaBoost (Adaptive Boosting) algorithm for binary classification to determine word boundaries. This chapter explains the algorithm as implemented in Litsea.

Overview

AdaBoost combines many weak learners (simple classifiers) into a strong ensemble classifier. In Litsea:

  • Positive label (+1) = word boundary
  • Negative label (-1) = non-boundary (continuation of the current word)
  • Weak learners = individual features (each feature is a binary “stump” – present or absent)

Training Algorithm

The training loop in AdaBoost::train() works as follows:

Initialization

  1. Load features and instances from the training file
  2. Initialize instance weights uniformly (later adjusted based on initial score)
  3. All model weights start at zero

Iterative Boosting

For each iteration t (up to num_iterations):

Step 1: Calculate weighted errors

For each feature h, compute its weighted error over all instances:

error[h] -= D[i] * y[i]   (for each instance i that has feature h)

where D[i] is the instance weight and y[i] is the true label.

Step 2: Select the best weak learner

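Find the feature whose weighted error rate is farthest from 0.5. A feature that is wrong more often than right is still informative, because it receives a negative weight in Step 4: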

error_rate(h) = (error[h] + positive_weight_sum) / instance_weight_sum
h_best = argmax_h |0.5 - error_rate(h)|

The baseline competitor is the “all-negative” classifier (always predicts -1), whose error rate equals the fraction of positive instances. Any real feature must beat this baseline.
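Steps 1 and 2 can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code in adaboost.rs: features are identified by index, index 0 is reserved here for the "all-negative" baseline (no instance contains it), and the names `Instance`, `error_rates`, and `select_best` are hypothetical.

```rust
/// One training instance: indices of the features it contains and its label (+1 or -1).
pub struct Instance {
    pub features: Vec<usize>,
    pub label: i8,
}

/// Computes the weighted error rate of every feature stump.
/// `d[i]` is the weight of instance `i`.
pub fn error_rates(instances: &[Instance], d: &[f64], num_features: usize) -> Vec<f64> {
    let mut error = vec![0.0_f64; num_features];
    let mut positive_weight_sum = 0.0_f64;
    let mut instance_weight_sum = 0.0_f64;
    for (inst, &w) in instances.iter().zip(d) {
        instance_weight_sum += w;
        if inst.label > 0 {
            positive_weight_sum += w;
        }
        // error[h] -= D[i] * y[i] for every feature h the instance contains.
        for &h in &inst.features {
            error[h] -= w * f64::from(inst.label);
        }
    }
    error
        .iter()
        .map(|e| (e + positive_weight_sum) / instance_weight_sum)
        .collect()
}

/// Picks the feature whose error rate is farthest from 0.5.
pub fn select_best(rates: &[f64]) -> Option<(usize, f64)> {
    rates.iter().copied().enumerate().max_by(|a, b| {
        let da = (0.5 - a.1).abs();
        let db = (0.5 - b.1).abs();
        da.partial_cmp(&db).unwrap_or(std::cmp::Ordering::Equal)
    })
}
```

Adding `positive_weight_sum` turns the running sum into a true error rate: the stump is wrong on negative instances that contain the feature and on positive instances that lack it. For the baseline at index 0 the running sum stays zero, so its rate is simply the weighted fraction of positive instances.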

Step 3: Check convergence

If |0.5 - best_error_rate| < threshold, stop early – no feature can significantly improve the model.

Step 4: Compute the weak learner weight

alpha = 0.5 * ln((1 - error_rate) / error_rate)
model[h_best] += alpha

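A lower error rate produces a higher alpha, giving more influence to better features. An error rate above 0.5 produces a negative alpha, so the presence of that feature lowers the score.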

Step 5: Update instance weights

For each instance i:
    prediction = +1 if h_best in features(i), else -1

    if y[i] * prediction < 0:  (misclassified)
        D[i] *= exp(alpha)     (increase weight)
    else:                       (correctly classified)
        D[i] /= exp(alpha)     (decrease weight)

Normalize: D[i] /= sum(D)

This ensures subsequent iterations focus on the instances that are still difficult to classify.
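Steps 4 and 5 can be sketched together as below. The function names are hypothetical and the code is a minimal illustration of the formulas above, assuming an error rate strictly between 0 and 1.

```rust
/// Weight (alpha) of a weak learner with the given weighted error rate.
/// Assumes 0 < error_rate < 1.
pub fn learner_weight(error_rate: f64) -> f64 {
    0.5 * ((1.0 - error_rate) / error_rate).ln()
}

/// Reweights instances after a stump has been selected, then normalizes
/// the weights so they sum to 1.
/// `has_feature[i]` tells whether instance `i` contains the selected feature.
pub fn update_instance_weights(d: &mut [f64], labels: &[i8], has_feature: &[bool], alpha: f64) {
    let factor = alpha.exp();
    for ((w, &y), &present) in d.iter_mut().zip(labels).zip(has_feature) {
        let prediction: i8 = if present { 1 } else { -1 };
        if y * prediction < 0 {
            *w *= factor; // misclassified: increase weight
        } else {
            *w /= factor; // correctly classified: decrease weight
        }
    }
    let sum: f64 = d.iter().sum();
    for w in d.iter_mut() {
        *w /= sum;
    }
}
```

For example, with four uniformly weighted instances of which one is misclassified (error rate 0.25), the misclassified instance ends up with weight 0.5 after normalization and the other three with 1/6 each.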

Prediction

Given an input set of features (attributes), the prediction is:

score = bias + sum(model[feature] for each feature in attributes)
prediction = +1 if score >= 0, else -1

Bias Term

The bias is computed as:

bias = -sum(all model weights) / 2.0

This centers the decision boundary. The empty-string feature ("") serves as the bias bucket during training.
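One way to read the bias formula: if every stump votes +alpha when its feature is present and -alpha when it is absent, the total vote is twice the sum of the present weights minus the sum of all weights, and halving that gives the score above. The sketch below shows the resulting prediction rule with a plain `HashMap`; the function names and the in-memory representation are assumptions for illustration, not Litsea's actual data structures.

```rust
use std::collections::{HashMap, HashSet};

/// Bias term: minus half the sum of all model weights.
pub fn bias(model: &HashMap<String, f64>) -> f64 {
    -model.values().sum::<f64>() / 2.0
}

/// Returns +1 (boundary) if the score is non-negative, otherwise -1.
/// Features that are not in the model contribute nothing to the score.
pub fn predict(model: &HashMap<String, f64>, attributes: &HashSet<String>) -> i8 {
    let present: f64 = attributes.iter().filter_map(|f| model.get(f)).sum();
    let score = bias(model) + present;
    if score >= 0.0 {
        1
    } else {
        -1
    }
}
```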

Model File Format

The trained model is saved as a simple text file:

feature1\tweight1
feature2\tweight2
...
bias_value

  • Each line contains a feature name and its weight (tab-separated)
  • Zero-weight features are omitted
  • The last line contains the bias term (a single number)

See Model File Format for details.
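A reader for this text layout might look like the sketch below. It is not Litsea's loader (that is `load_model_from_reader`); the function name `parse_model` and the rule "a line without a tab is the bias" are assumptions derived from the format description above.

```rust
use std::collections::HashMap;

/// Parses the AdaBoost model text into (feature weights, bias).
/// Lines with a tab are `feature<TAB>weight`; a line without a tab is the bias.
pub fn parse_model(text: &str) -> Result<(HashMap<String, f64>, f64), String> {
    let mut weights = HashMap::new();
    let mut bias: Option<f64> = None;
    for line in text.lines().filter(|l| !l.is_empty()) {
        match line.split_once('\t') {
            Some((feature, weight)) => {
                let w: f64 = weight
                    .trim()
                    .parse()
                    .map_err(|e| format!("bad weight in {:?}: {}", line, e))?;
                weights.insert(feature.to_string(), w);
            }
            None => {
                let b: f64 = line
                    .trim()
                    .parse()
                    .map_err(|e| format!("bad bias in {:?}: {}", line, e))?;
                bias = Some(b);
            }
        }
    }
    let bias = bias.ok_or_else(|| "missing bias line".to_string())?;
    Ok((weights, bias))
}
```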

Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| threshold | 0.01 | Early stopping threshold. Lower values allow more iterations, potentially improving accuracy |
| num_iterations | 100 | Maximum number of boosting rounds. Higher values may improve accuracy at the cost of training time and model size |

Averaged Perceptron

Litsea uses the Averaged Perceptron algorithm for multiclass classification to perform joint word segmentation and POS tagging. This chapter explains the algorithm as implemented in Litsea.

Overview

While AdaBoost performs binary classification (boundary vs. non-boundary), the Averaged Perceptron performs multiclass classification – predicting one of 18 segment labels for each character position:

  • 17 boundary labels: B-ADJ, B-ADP, B-ADV, B-AUX, B-CCONJ, B-DET, B-INTJ, B-NOUN, B-NUM, B-PART, B-PRON, B-PROPN, B-PUNCT, B-SCONJ, B-SYM, B-VERB, B-X
  • 1 non-boundary label: O (continuation of the current word)

These labels correspond to the 17 Universal POS (UPOS) tags from the Universal Dependencies project, prefixed with B- to indicate a word boundary. This enables simultaneous word boundary detection and POS estimation in a single classification step.

Algorithm

Weight Representation

The perceptron maintains a weight vector per class. Weights are stored as a sparse map:

weights: HashMap<Feature, HashMap<Class, f64>>

For example:

weights["UW4:猫"]["B-NOUN"] = 2.5
weights["UC4:H"]["B-NOUN"]  = 1.8
weights["UW4:猫"]["O"]      = -0.3
...

For a given feature set, the score for each class is the sum of its feature weights:

score(class) = sum(weights[feature][class] for each feature in input)
prediction = argmax(score(class) for all classes)

Update Rule

When the perceptron misclassifies a training instance, it updates the weights as follows:

For each training instance (features, truth):
    guess = predict(features)

    if guess != truth:
        For each feature f in features:
            weights[f][truth] += 1.0   # increase weight for correct class
            weights[f][guess] -= 1.0   # decrease weight for predicted class

This increases the weights for the correct class and decreases them for the incorrectly predicted class, making the correct prediction more likely for similar inputs in the future.
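The scoring and update rules can be sketched in a few lines. This is an unaveraged, illustrative version using the nested-map layout shown above; the type alias `Weights` and the free functions `predict` and `update` are assumptions for the example and do not mirror the `AveragedPerceptron` API.

```rust
use std::collections::HashMap;

/// Sparse weights: feature -> (class -> weight).
pub type Weights = HashMap<String, HashMap<String, f64>>;

/// Returns the class with the highest summed feature weight.
/// Ties are resolved in favor of the class listed first.
pub fn predict(weights: &Weights, classes: &[&str], features: &[&str]) -> String {
    let mut best = (f64::NEG_INFINITY, "");
    for &class in classes {
        let score: f64 = features
            .iter()
            .filter_map(|f| weights.get(*f))
            .filter_map(|per_class| per_class.get(class))
            .sum();
        if score > best.0 {
            best = (score, class);
        }
    }
    best.1.to_string()
}

/// Applies the perceptron update; does nothing when the guess was correct.
pub fn update(weights: &mut Weights, truth: &str, guess: &str, features: &[&str]) {
    if truth == guess {
        return;
    }
    for f in features {
        let per_class = weights.entry(f.to_string()).or_default();
        *per_class.entry(truth.to_string()).or_insert(0.0) += 1.0;
        *per_class.entry(guess.to_string()).or_insert(0.0) -= 1.0;
    }
}
```

After a single update on a misclassified instance, the same features score higher for the true class and lower for the wrong guess, so the prediction flips.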

Averaging

A key improvement over the basic perceptron is weight averaging. Rather than using the final weights (which can be unstable and tend to overfit to the tail of the training data), the model averages all weight vectors seen during training. This improves generalization to unseen data.

The implementation uses a cumulative sum approach for efficiency:

cumulative[feature][class] += weights[feature][class] * elapsed_steps

At the end of training:
    averaged[feature][class] = cumulative[feature][class] / total_steps

This avoids storing all intermediate weight vectors while producing the same result. The averaging reduces dependence on the order of training data and improves generalization performance.
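The cumulative-sum trick can be illustrated for a single weight cell. The sketch below is an assumption about how such lazy averaging is commonly implemented (a timestamp per cell records when it last changed); the struct `AveragedWeight` and its methods are hypothetical names, and `naive_average` exists only to check the result against the straightforward definition.

```rust
/// One (feature, class) weight with lazily accumulated history.
#[derive(Debug, Default, Clone, Copy)]
pub struct AveragedWeight {
    value: f64, // current weight
    total: f64, // sum of the weight over all steps up to `stamp`
    stamp: u64, // step at which `value` last changed
}

impl AveragedWeight {
    /// Changes the weight by `delta` at training step `step`.
    pub fn update(&mut self, step: u64, delta: f64) {
        // Credit the old value for every step it stayed unchanged.
        self.total += (step - self.stamp) as f64 * self.value;
        self.stamp = step;
        self.value += delta;
    }

    /// Average of the weight over `total_steps` training steps.
    pub fn average(&self, total_steps: u64) -> f64 {
        let total = self.total + (total_steps - self.stamp) as f64 * self.value;
        total / total_steps as f64
    }
}

/// Reference implementation: stores nothing lazily, sums the weight after every step.
pub fn naive_average(deltas: &[f64]) -> f64 {
    let mut value = 0.0_f64;
    let mut sum = 0.0_f64;
    for &d in deltas {
        value += d;
        sum += value;
    }
    sum / deltas.len() as f64
}
```

Only cells touched by an update pay any cost, which matters because most (feature, class) pairs are unchanged on any given step.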

Training with Epochs

Training iterates over the data multiple times (epochs). Each epoch processes all training instances in order:

For each epoch (1 to num_epochs):
    For each instance in training data:
        features = extract_features(instance)
        predicted = argmax(score(class) for all classes)
        if predicted != correct_label:
            update weights
        accumulate weights for averaging

Training supports graceful interruption via AtomicBool – a Ctrl+C signal stops training and saves the model at its current state.

use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use litsea::perceptron::AveragedPerceptron;

let mut perceptron = AveragedPerceptron::new();
// ... add instances ...
let running = Arc::new(AtomicBool::new(true));
perceptron.train(10, running);  // 10 epochs

Model File Format

The Averaged Perceptron model is saved as a text file with the following structure:

18
O
B-ADJ
B-ADP
...
B-X
feature1\tclass1\tweight1
feature2\tclass2\tweight2
...

  • Line 1: Number of classes (18)
  • Lines 2 to N+1: Class names, one per line
  • Remaining lines: Feature weights, tab-separated as feature\tclass\tweight
  • Zero-weight entries are omitted
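To make the layout concrete, here is a small writer that produces text in this shape. It is a sketch based only on the description above: the function name `serialize_model` and the flat `(feature, class, weight)` input are assumptions, and Litsea's own `save_model` may format numbers or order entries differently.

```rust
/// Serializes class names and (feature, class, weight) triples into the
/// text layout described above, skipping zero weights.
pub fn serialize_model(classes: &[&str], weights: &[(&str, &str, f64)]) -> String {
    // Line 1: number of classes, followed by one class name per line.
    let mut out = format!("{}\n", classes.len());
    for class in classes {
        out.push_str(class);
        out.push('\n');
    }
    // Remaining lines: feature<TAB>class<TAB>weight.
    for (feature, class, weight) in weights {
        if *weight != 0.0 {
            out.push_str(&format!("{}\t{}\t{}\n", feature, class, weight));
        }
    }
    out
}
```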

Comparison with AdaBoost

| Aspect | AdaBoost | Averaged Perceptron |
|---|---|---|
| Classification | Binary (+1/-1) | Multiclass (18 classes) |
| Output | Word boundaries only | Word boundaries + POS tags |
| Weak learner | Decision stumps per feature | None (linear classifier) |
| Weight management | One weight per feature | Class x feature weight matrix |
| Generalization | Ensemble | Weight averaging |
| Training | Iterative boosting with sample reweighting | Online learning with weight averaging |
| Model size | A few KB | ~11 MB (with POS features) |
| Hyperparameters | threshold, num_iterations | num_epochs |

Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| num_epochs | 10 | Number of training passes over the data. More epochs can improve accuracy but may overfit |

Feature Extraction

Litsea uses character n-gram features to capture the local context around each potential word boundary. This chapter catalogs all feature types.

Feature Categories

For each character position i in the input, the segmenter extracts features from a sliding window of characters, their type codes, and previous boundary decisions.

Base Features (38 features)

| Category | IDs | Description | Count |
|---|---|---|---|
| UW (Unary Word) | UW1–UW6 | Individual characters at positions i-3 to i+2 | 6 |
| BW (Bigram Word) | BW1–BW3 | Adjacent character pairs | 3 |
| UC (Unary Char-type) | UC1–UC6 | Character type codes at positions i-3 to i+2 | 6 |
| BC (Bigram Char-type) | BC1–BC3 | Adjacent type code pairs | 3 |
| TC (Trigram Char-type) | TC1–TC4 | Type code triples | 4 |
| UP (Unary Previous-tag) | UP1–UP3 | Previous 3 boundary decisions | 3 |
| BP (Bigram Previous-tag) | BP1–BP2 | Boundary decision pairs | 2 |
| UQ (Unary tag+type) | UQ1–UQ3 | Combined boundary decision + type code | 3 |
| BQ (Bigram tag+type) | BQ1–BQ4 | Combined decision + type code bigrams | 4 |
| TQ (Trigram tag+type) | TQ1–TQ4 | Combined decision + type code trigrams | 4 |

Language-Specific Features (4 features, Japanese and Chinese only)

| Category | IDs | Description | Count |
|---|---|---|---|
| WC (Word+Char-type) | WC1–WC4 | Character + type code mixed features | 4 |

  • WC1: character at i-1 + type code at i
  • WC2: type code at i-1 + character at i
  • WC3: character at i-1 + type code at i-1
  • WC4: character at i + type code at i

Why no WC for Korean? Korean Hangul syllables are classified into only two types (SN and SF), so WC features would add noise rather than useful signal.

Total Feature Count

| Language | Base | WC | Total |
|---|---|---|---|
| Japanese | 38 | 4 | 42 |
| Chinese | 38 | 4 | 42 |
| Korean | 38 | 0 | 38 |

Feature Format

Each feature is represented as a string in the format PREFIX:VALUE:

UW4:は        ← The character at position i is "は"
UC4:I         ← The type code at position i is "I" (Hiragana)
BW2:はテ      ← The bigram at position i-1..i is "はテ"
BC2:IK        ← The type bigram is Hiragana + Katakana
UP3:B         ← The previous boundary decision was "B" (boundary)
WC1:はK       ← Character "は" combined with type "K"

Sliding Window Layout

The segmenter pads the input with sentinel characters:

Index:   0    1    2    3    4    5    ...  n+2  n+3  n+4  n+5
Chars:   B3   B2   B1   c1   c2   c3  ...  cn   E1   E2   E3
Types:   O    O    O    t1   t2   t3  ...  tn   O    O    O
Tags:    U    U    U    U    ?    ?   ...  ?

  • B3, B2, B1 – Begin sentinels (padding)
  • E1, E2, E3 – End sentinels (padding)
  • O – “Other” type for padding positions
  • U – “Unknown” tag for initial positions
  • B – “Boundary” tag (word start)
  • O – “Other” tag (continuation)

Features are extracted for positions 4 up to (but not including) len-3, that is, for every real character after the first. At each of these positions the full window from i-3 to i+2 is available.
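The padding and the unigram part of the window can be sketched as below. This is a simplified illustration, not the segmenter's `get_attributes`: the names `pad` and `unigram_features` are hypothetical, the character classifier is passed in as a closure, and only the UW and UC features are built.

```rust
/// Pads the input with begin/end sentinels and classifies each character.
/// Returns (chars, types), both with 3 sentinels on each side.
pub fn pad(text: &str, char_type: impl Fn(char) -> &'static str) -> (Vec<String>, Vec<String>) {
    let mut chars: Vec<String> = vec!["B3".into(), "B2".into(), "B1".into()];
    let mut types: Vec<String> = vec!["O".into(); 3];
    for c in text.chars() {
        chars.push(c.to_string());
        types.push(char_type(c).to_string());
    }
    for sentinel in ["E1", "E2", "E3"] {
        chars.push(sentinel.into());
        types.push("O".into());
    }
    (chars, types)
}

/// Builds UW1..UW6 and UC1..UC6 for position `i` (requires 3 <= i <= len - 3).
pub fn unigram_features(chars: &[String], types: &[String], i: usize) -> Vec<String> {
    let mut out = Vec::new();
    // Feature number n+1 corresponds to window offset i-3+n, so UW4/UC4 sit at i.
    for (n, pos) in (i - 3..=i + 2).enumerate() {
        out.push(format!("UW{}:{}", n + 1, chars[pos]));
        out.push(format!("UC{}:{}", n + 1, types[pos]));
    }
    out
}
```

For the input "これはテスト" and i = 4 (the character "れ"), this produces UW1:B2 through UW6:テ, with UW4:れ in the middle.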

Training Data Format

The extract command writes features to a file in this format:

1	UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1	UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...

Each line contains:

  1. A label (1 for boundary, -1 for non-boundary)
  2. Tab-separated feature strings

Character Type Classification

Each language in Litsea defines a set of character type patterns that classify individual characters into linguistically meaningful categories. These type codes are used as features for the AdaBoost classifier.

How It Works

Language::char_type(c: char) -> &'static str classifies a character with a direct match expression on Unicode character ranges — no regex, no allocation. Match arms are tried top to bottom, so the first matching arm determines the type code. If no arm matches, the character is classified as "O" (Other).

Each language has its own classification function (japanese_char_type, chinese_char_type, korean_char_type); the classes shared by all languages — "P" (punctuation), "A" (Latin), "N" (digits) — live in a common punct_latin_digit() helper that is checked after the language-specific classes. Logic beyond plain ranges is expressed with match guards (e.g., Korean Hangul syllable structure).

Japanese Character Types

| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| M | Kanji Numbers | [一二三四五六七八九十百千万億兆] | 一, 千, 億 |
| H | Kanji / CJK Ideographs | [一-龠々〆ヵヶ] | 漢, 字, 学 |
| I | Hiragana | [ぁ-ん] | あ, い, う |
| K | Katakana | [ァ-ヴーｱ-ﾝﾞﾟ] | ア, カ, ー |
| P | Punctuation | CJK Symbols (U+3000-303F), Full-width (U+FF01-FF65) | 。, 、, 「 |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z, B |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |

Note: “M” (Kanji numbers) is checked before “H” (general Kanji), so characters like 一 and 百 are classified as numbers rather than generic ideographs.
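A range-based classifier for the Japanese table could be sketched as follows. This is an approximation written from the table, not the library's `japanese_char_type`: in particular, the order in which the shared classes are tested (letters and digits before punctuation, since the full-width range U+FF01-FF65 also contains full-width letters and digits) is an assumption of this sketch.

```rust
/// Classifies a character using the Japanese type codes from the table.
/// Match arms are tried top to bottom; the first match wins.
pub fn japanese_char_type(c: char) -> &'static str {
    match c {
        // "M" comes before "H": kanji numerals also fall inside the ideograph range.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十' | '百' | '千'
        | '万' | '億' | '兆' => "M",
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        'ぁ'..='ん' => "I",
        // Full-width katakana, the long-vowel mark, and half-width katakana
        // (U+FF71..U+FF9D plus the two half-width voicing marks).
        'ァ'..='ヴ' | 'ー' | '\u{FF71}'..='\u{FF9D}' | '\u{FF9E}' | '\u{FF9F}' => "K",
        // ASCII and full-width Latin letters.
        'a'..='z' | 'A'..='Z' | '\u{FF41}'..='\u{FF5A}' | '\u{FF21}'..='\u{FF3A}' => "A",
        // ASCII and full-width digits.
        '0'..='9' | '\u{FF10}'..='\u{FF19}' => "N",
        // CJK symbols and the remaining full-width forms.
        '\u{3000}'..='\u{303F}' | '\u{FF01}'..='\u{FF65}' => "P",
        _ => "O",
    }
}
```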

Chinese Character Types

| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ，, 《 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |

Chinese function words include:

  • Structural particles: 的, 地, 得
  • Aspect/modal particles: 了, 着, 过, 吗, 呢, 吧, 啊, 嘛
  • Conjunctions: 和, 与, 或, 但, 而, 且, 及
  • Prepositions: 在, 从, 到, 把, 被, 对, 向, 给
  • Common grammatical verbs/adverbs: 是, 有, 不, 也, 都, 就, 要, 会, 能, 可

Korean Character Types

| Code | Name | Pattern / Range | Examples |
|---|---|---|---|
| E | Particles/Endings | High-frequency grammatical particles | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no batchim) | Hangul Syllable without final consonant | 가, 나, 하 |
| SF | Hangul (with batchim) | Hangul Syllable with final consonant | 한, 글, 각 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, ， |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5 |
| O | Other | Fallback | @, # |

Korean Hangul Syllable Detection

Korean uses a match guard for the SN and SF types. This leverages Unicode’s systematic Hangul encoding:

  • Hangul Syllables occupy U+AC00–U+D7AF
  • Each syllable is encoded as: (initial * 21 + medial) * 28 + final + 0xAC00
  • If (codepoint - 0xAC00) % 28 == 0, the syllable has no final consonant (SN)
  • Otherwise, it has a final consonant (SF, “받침”)

This distinction is important because the presence of a final consonant (받침) affects Korean word boundary patterns and particle attachment.
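The batchim test reduces to one modulo operation, as the sketch below shows. The function names are hypothetical and the particle class "E", which the table lists first, is left out. The range used is U+AC00–U+D7A3, the assigned syllables within the block mentioned above.

```rust
/// Returns "SN" for a Hangul syllable without a final consonant,
/// "SF" for one with a final consonant, and `None` for anything else.
pub fn hangul_syllable_type(c: char) -> Option<&'static str> {
    match c as u32 {
        // The guard picks out syllables whose final-consonant index is 0.
        cp @ 0xAC00..=0xD7A3 if (cp - 0xAC00) % 28 == 0 => Some("SN"),
        0xAC00..=0xD7A3 => Some("SF"),
        _ => None,
    }
}

/// Splits a syllable into (initial, medial, final) indices, inverting
/// `(initial * 21 + medial) * 28 + final + 0xAC00`.
pub fn decompose(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let index = cp - 0xAC00;
    Some((index / (21 * 28), (index / 28) % 21, index % 28))
}
```

For example, "한" decomposes into initial 18, medial 0, final 4, so its final index is non-zero and it is classified as SF.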

Cross-Language Comparison

| Feature | Japanese | Chinese | Korean |
|---|---|---|---|
| Total types | 8 | 9 | 10 |
| Unique types | M, H, I, K | F, C, X, R, B | E, SN, SF, J, G |
| Shared types | P, A, N, O | P, A, N, O | P, A, N, O (H shared with JP) |
| Matching method | Range match | Range match | Range match + guard |
| WC features used | Yes | Yes | No |

Prediction Pipeline

This chapter provides a step-by-step walkthrough of how Segmenter::segment() processes input text.

Example: Segmenting “これはテストです。”

Step 1: Initialize Arrays with Padding

chars: ["B3", "B2", "B1"]
types: ["O",  "O",  "O" ]
tags:  ["U",  "U",  "U", "U"]

The tags array gets one extra “U” because tags[3] represents the first real character’s tag (set to “Unknown” since there is no prior boundary decision).

Step 2: Scan Input Characters

For each character in the input, determine its type using language-specific patterns and append to the arrays:

chars: ["B3","B2","B1", "こ","れ","は","テ","ス","ト","で","す","。"]
types: ["O", "O", "O",  "I", "I", "I", "K", "K", "K", "I", "I", "P"]

Step 3: Append End Sentinels

chars: [..., "。", "E1", "E2", "E3"]
types: [..., "P",  "O",  "O",  "O" ]

Step 4: Iterate and Predict

For each position i from 4 up to (but not including) len(chars) - 3:

i=4 (れ): Extract features → predict → label=-1 (O) → word="これ"
i=5 (は): Extract features → predict → label=+1 (B) → push "これ", word="は"
i=6 (テ): Extract features → predict → label=+1 (B) → push "は", word="テ"
i=7 (ス): Extract features → predict → label=-1 (O) → word="テス"
i=8 (ト): Extract features → predict → label=-1 (O) → word="テスト"
i=9 (で): Extract features → predict → label=+1 (B) → push "テスト", word="で"
i=10(す): Extract features → predict → label=-1 (O) → word="です"
i=11(。): Extract features → predict → label=+1 (B) → push "です", word="。"

Step 5: Push Final Word

Push the remaining word “。” to the result.

Result

["これ", "は", "テスト", "です", "。"]
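
The five steps can be condensed into a short sketch of the greedy loop. It is an illustration under stated assumptions rather than the library's implementation: `segment_with` is a hypothetical name, the `is_boundary` closure stands in for feature extraction plus model prediction at a padded position, and the `types` and `tags` arrays are omitted for brevity.

```rust
/// Greedy left-to-right segmentation loop. `is_boundary(i)` stands in for
/// feature extraction plus model prediction at padded position `i`.
fn segment_with<F>(sentence: &str, mut is_boundary: F) -> Vec<String>
where
    F: FnMut(usize) -> bool,
{
    let input: Vec<char> = sentence.chars().collect();
    if input.is_empty() {
        return Vec::new();
    }

    // Steps 1-3: three start sentinels, the input, three end sentinels.
    let mut chars: Vec<String> = vec!["B3".into(), "B2".into(), "B1".into()];
    chars.extend(input.iter().map(|c| c.to_string()));
    chars.extend(["E1", "E2", "E3"].iter().map(|s| s.to_string()));

    // Step 4: the first real character (index 3) always opens the first word.
    let mut words = Vec::new();
    let mut word = chars[3].clone();
    for i in 4..chars.len() - 3 {
        if is_boundary(i) {
            // Boundary: close the current word and start a new one.
            words.push(std::mem::take(&mut word));
        }
        word.push_str(&chars[i]);
    }

    // Step 5: push the remaining word.
    words.push(word);
    words
}
```

Calling `segment_with("これはテストです。", |i| [5, 6, 9, 11].contains(&i))` replays the boundary decisions from the walkthrough and yields the same five words.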

How Prediction Works at Each Position

At each position i, the segmenter:

  1. Extracts features – Calls get_attributes(i, tags, chars, types) to build a HashSet<String> of features (38 for Korean, 42 for Japanese and Chinese)

  2. Computes score – The AdaBoost learner sums the model weights for all matching features plus the bias:

    score = bias + sum(model[feature] for feature in attributes)
    
  3. Makes decision – If score >= 0, the character starts a new word (boundary); otherwise, it continues the current word

  4. Updates tags – Pushes “B” or “O” to the tags array, which affects feature extraction for subsequent positions
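
Steps 2 and 3 can be sketched with a plain weight map. This is a simplified stand-in for the AdaBoost learner, not its actual data layout: the `HashMap<String, f64>` model and the function names are assumptions, and the bias follows the definition given in the AdaBoost API reference of this book (-sum(all model weights) / 2.0).

```rust
use std::collections::{HashMap, HashSet};

/// Bias term: -sum(all model weights) / 2.
fn bias(model: &HashMap<String, f64>) -> f64 {
    -model.values().sum::<f64>() / 2.0
}

/// Sums the weights of the active features and adds the bias.
/// Features missing from the model contribute nothing.
fn score(model: &HashMap<String, f64>, attributes: &HashSet<String>) -> f64 {
    let matched: f64 = attributes.iter().filter_map(|f| model.get(f)).sum();
    bias(model) + matched
}

/// Maps the score to a label: +1 (boundary) if score >= 0, otherwise -1.
fn predict(model: &HashMap<String, f64>, attributes: &HashSet<String>) -> i8 {
    if score(model, attributes) >= 0.0 { 1 } else { -1 }
}
```

Because only the features present at a position are looked up, the cost of one prediction depends on the number of active features, not on the size of the model.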

Training vs. Prediction

| Aspect | Training (process_corpus) | Prediction (segment) |
| --- | --- | --- |
| Tags source | Pre-computed from the annotated corpus | Dynamically generated by the model |
| First tag | “U” (overrides “B” at position 3) | “U” (no prior decision) |
| Labels | Known from corpus (+1 or -1) | Predicted by AdaBoost |
| Features | Written to file via callback | Passed directly to predict() |

During training, tags are derived from the ground-truth corpus segmentation, so the model learns from correct boundary decisions. During prediction, tags are generated on-the-fly, meaning each decision depends on all previous predictions – this is a left-to-right greedy approach.
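
To make the training side concrete, here is a minimal sketch of how ground-truth labels could be derived from one space-separated corpus line. The function name `boundary_labels` is hypothetical, and the sketch assumes, as in the prediction walkthrough, that the first character of a sentence gets no instance of its own.

```rust
/// Derives ground-truth labels from one space-separated corpus line.
/// Returns each character after the first, paired with +1 (starts a new
/// word) or -1 (continues the current word).
fn boundary_labels(line: &str) -> Vec<(char, i8)> {
    let mut labels = Vec::new();
    let mut first = true;
    for word in line.split_whitespace() {
        for (k, c) in word.chars().enumerate() {
            if first {
                // The first character has no preceding decision ("U" tag).
                first = false;
                continue;
            }
            labels.push((c, if k == 0 { 1 } else { -1 }));
        }
    }
    labels
}
```

For the corpus line "テスト です", this produces -1 for ス and ト, +1 for で, and -1 for す.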

Performance Characteristics

The segmentation algorithm is linear in the length of the input:

  • Each character position is visited once: O(n)
  • Feature extraction at each position: O(1) (fixed number of features)
  • Prediction at each position: O(f) where f is the number of active features (38 or 42, depending on the language)
  • Total: O(n * f) which is effectively O(n)

Language Support Overview

Litsea supports word segmentation for three languages through a unified framework based on the Language enum.

Supported Languages

| Language | Enum Variant | CLI Values | Feature Count | Pre-trained Model Accuracy |
| --- | --- | --- | --- | --- |
| Japanese | Language::Japanese | japanese, ja | 42 | 94.15% |
| Chinese | Language::Chinese | chinese, zh | 42 | 80.72% |
| Korean | Language::Korean | korean, ko | 38 | 85.08% |

The Language Enum

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

  • Default is Japanese
  • Implements FromStr – parses from full name or ISO 639-1 code (case-insensitive)
  • Implements Display – outputs the lowercase full name

Parsing Examples

use litsea::language::Language;

let ja: Language = "japanese".parse().unwrap();
let zh: Language = "zh".parse().unwrap();
let ko: Language = "Korean".parse().unwrap();   // case-insensitive
let err = "french".parse::<Language>();          // Err(...)

How Languages Differ

Each language defines its own character type patterns that classify characters into type codes. These type codes are used as features for the AdaBoost classifier.

| Aspect | Japanese | Chinese | Korean |
| --- | --- | --- | --- |
| Character types | 8 (M, H, I, K, P, A, N, O) | 9 (F, C, X, R, P, B, A, N, O) | 10 (E, SN, SF, J, G, H, P, A, N, O) |
| WC features | Yes (4 extra) | Yes (4 extra) | No |
| Total features | 42 | 42 | 38 |
| Matching method | Range match | Range match | Range match + guard |

Why Korean Has Fewer Features

Korean Hangul syllables are classified into only two types: SN (without 받침/final consonant) and SF (with 받침). This binary distinction means WC features (word + character-type combinations) would produce redundant information with little discriminative power. Excluding them reduces noise and keeps the model compact.

Japanese

Japanese is the default language in Litsea.

Character Types

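| Code | Name | Pattern | Examples |
| --- | --- | --- | --- |
| M | Kanji Numbers | [一二三四五六七八九十百千万億兆] | 一, 三, 千, 億 |
| H | Kanji / CJK | [一-龠々〆ヵヶ] | 漢, 字, 学, 々 |
| I | Hiragana | [ぁ-ん] | あ, い, う, を |
| K | Katakana | [ァ-ヴーｱ-ﾝﾞﾟ] | ア, カ, ー, ﾊ |
| P | Punctuation | CJK Symbols + Full-width | 。, 、, 「, 」 |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z, Ｂ |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |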

Pattern Priority

Patterns are evaluated in order. Notably:

  • M before H: Characters like 一 and 百 are classified as “Kanji Numbers” (M), not generic “Kanji” (H)
  • This distinction helps the model learn number-specific boundary patterns
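
The effect of arm order can be seen in a reduced sketch that covers only the M, H, and I classes. The function name `kanji_type_sketch` is hypothetical and the arms are written from the pattern table above; it is not the library's classification function.

```rust
/// Hypothetical, simplified classifier covering only M, H, and I.
fn kanji_type_sketch(c: char) -> &'static str {
    match c {
        // M must come first: every one of these characters also falls
        // inside the H range below, and the first matching arm wins.
        '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九' | '十'
        | '百' | '千' | '万' | '億' | '兆' => "M",
        '一'..='龠' | '々' | '〆' | 'ヵ' | 'ヶ' => "H",
        'ぁ'..='ん' => "I",
        _ => "O",
    }
}
```

If the two kanji arms were swapped, 一 and 百 would be absorbed by the H range and the M class would never be produced.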

Pre-trained Models

japanese.model

  • Training corpus: UD Japanese-GSD
  • Accuracy: 94.15%
  • Precision: 95.57%
  • Recall: 94.36%

RWCP.model

  • Source: Extracted from the original TinySegmenter
  • License: BSD 3-Clause (Taku Kudo)
  • Size: ~22 KB

JEITA_Genpaku_ChaSen_IPAdic.model

  • Training corpus: JEITA Project Sugita Genpaku corpus
  • Tokenizer: ChaSen with IPAdic dictionary
  • Size: ~17 KB

Example

echo "LitseaはTinySegmenterを参考に開発された、Rustで実装された極めてコンパクトな単語分割ソフトウェアです。" \
  | litsea segment -l japanese ./models/japanese.model

Output:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 、 Rust で 実装 さ れ た 極めて コンパクト な 単語 分割 ソフトウェア です 。

Chinese

Litsea supports Chinese word segmentation covering both Simplified and Traditional Chinese.

Character Types

| Code | Name | Pattern | Examples |
| --- | --- | --- | --- |
| F | Function Words | High-frequency grammatical words | 的, 了, 在, 是, 和 |
| C | CJK Unified | U+4E00–U+9FFF | 中, 国, 人 |
| X | CJK Extension A | U+3400–U+4DBF | Rare characters |
| R | CJK Radicals | U+2E80–U+2FDF | Kangxi radicals |
| P | Punctuation | CJK Symbols + Full-width | 。, ，, 《, 》 |
| B | Bopomofo | U+3100–U+312F, U+31A0–U+31BF | Zhuyin symbols |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |

Chinese Function Words (虚词)

The “F” type captures high-frequency grammatical words that are critical for segmentation:

| Category | Characters |
| --- | --- |
| Structural particles | 的, 地, 得 |
| Aspect/modal particles | 了, 着, 过, 吗, 呢, 吧, 啊, 嘛 |
| Conjunctions | 和, 与, 或, 但, 而, 且, 及 |
| Prepositions | 在, 从, 到, 把, 被, 对, 向, 给 |
| Grammatical verbs/adverbs | 是, 有, 不, 也, 都, 就, 要, 会, 能, 可 |

These characters appear overwhelmingly in grammatical roles and signal word boundaries differently from content words.

Pre-trained Model

chinese.model

  • Training corpus: UD Chinese-GSD
  • Accuracy: 80.72%

Example

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model

Korean

Litsea supports Korean word segmentation with specialized Hangul character type detection.

Character Types

| Code | Name | Pattern | Examples |
| --- | --- | --- | --- |
| E | Particles/Endings | [은는을를의에] | 은, 는, 을, 를, 의, 에 |
| SN | Hangul (no 받침) | Codepoint arithmetic | 가, 나, 하, 모 |
| SF | Hangul (with 받침) | Codepoint arithmetic | 한, 글, 각, 붙 |
| J | Hangul Jamo | U+1100–U+11FF | Individual consonants/vowels |
| G | Compatibility Jamo | U+3130–U+318F | ㄱ, ㅏ, ㅎ |
| H | Hanja | U+4E00–U+9FFF | CJK Ideographs |
| P | Punctuation | CJK Symbols + Full-width | 。, ， |
| A | ASCII/Latin | [a-zA-Zａ-ｚＡ-Ｚ] | A, z |
| N | Digits | [0-9０-９] | 0, 5, ５ |
| O | Other | Fallback | @, #, $ |

Korean Particles (조사)

The “E” type captures six high-frequency grammatical particles:

| Character | Role | Name |
| --- | --- | --- |
| 은/는 | Topic marker | 보조사 |
| 을/를 | Object marker | 목적격 조사 |
| 의 | Possessive | 관형격 조사 |
| 에 | Locative | 부사격 조사 |

These particles frequently appear at word boundaries and are given a distinct type code to improve segmentation accuracy.

Hangul Syllable Structure (받침 Detection)

Korean uses a match guard on the Hangul Syllables range, rather than a plain range match, for the SN and SF types. This exploits the systematic Unicode Hangul encoding:

  • Hangul Syllables block: U+AC00–U+D7AF (11,172 assigned syllables, U+AC00–U+D7A3)
  • Each syllable = (initial * 21 + medial) * 28 + final + 0xAC00
  • SN (no 받침): (codepoint - 0xAC00) % 28 == 0
  • SF (with 받침): (codepoint - 0xAC00) % 28 != 0

The 받침 (final consonant) distinction is linguistically significant because it affects how particles attach to words and where boundaries occur.
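
The encoding formula can also be inverted to recover all three components of a syllable. The sketch below is illustrative only; `decompose_hangul` is a hypothetical helper, and Litsea itself only needs the final-consonant index.

```rust
/// Splits a precomposed Hangul syllable into (initial, medial, final)
/// indices by inverting (initial * 21 + medial) * 28 + final + 0xAC00.
/// Returns None for characters outside the assigned syllable range.
fn decompose_hangul(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let index = cp - 0xAC00;
    let final_idx = index % 28; // 0 means no 받침
    let medial = (index / 28) % 21;
    let initial = index / (28 * 21);
    Some((initial, medial, final_idx))
}
```

For 한 (U+D55C) the offset is 10588, which decomposes into initial 18 (ㅎ), medial 0 (ㅏ), and final 4 (ㄴ), so the syllable is SF.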

No WC Features

Korean does not use WC (word + character-type) features. Since most Hangul syllables fall into only two types (SN and SF), WC features would produce low-entropy, noisy combinations that hurt model accuracy.

Pre-trained Model

korean.model

  • Training corpus: UD Korean-GSD
  • Accuracy: 85.08%

Example

echo "한국어 단어 분할 테스트입니다." | litsea segment -l korean ./models/korean.model

Adding a New Language

Litsea’s multilingual framework is designed to be easily extensible. This guide explains how to add support for a new language.

Steps Overview

  1. Add a variant to the Language enum
  2. Implement Display and FromStr match arms
  3. Create a character classification function
  4. Register the classification function
  5. Decide on WC feature inclusion
  6. Prepare a training corpus and train a model
  7. Add tests

Step 1: Add a Variant to Language

In litsea/src/language.rs, add a new variant to the Language enum:

pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
    Thai,       // ← new language
}

Step 2: Implement Display and FromStr

Add match arms for the new language:

// In Display impl
Language::Thai => write!(f, "thai"),

// In FromStr impl
"thai" | "th" => Ok(Language::Thai),
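
For context, the two arms sit inside trait implementations shaped roughly like the following. This is a self-contained sketch using a stand-in enum named `Lang`; the `String` error type and the lowercase normalization are assumptions for illustration, not the actual definitions in language.rs.

```rust
use std::fmt;
use std::str::FromStr;

/// Stand-in enum for illustration; not the real `litsea::language::Language`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    Japanese,
    Thai,
}

impl fmt::Display for Lang {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Lang::Japanese => write!(f, "japanese"),
            Lang::Thai => write!(f, "thai"),
        }
    }
}

impl FromStr for Lang {
    type Err = String; // assumption: the real error type differs

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercase first so that "Thai", "THAI", and "th" all parse.
        match s.to_lowercase().as_str() {
            "japanese" | "ja" => Ok(Lang::Japanese),
            "thai" | "th" => Ok(Lang::Thai),
            other => Err(format!("unsupported language: {other}")),
        }
    }
}
```

Keeping the Display output equal to one of the strings accepted by FromStr means a value survives a round trip through its text form.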

Step 3: Create a Character Classification Function

Define a function that classifies a char into a type code for the new language. Classification is a direct match on character ranges (no regex), so each class is an arm; the first matching arm wins:

fn thai_char_type(c: char) -> &'static str {
    match c {
        // Thai consonants and sequential vowels (U+0E01-U+0E3A)
        '\u{0E01}'..='\u{0E3A}' => "T",
        // Thai vowels and tone marks (U+0E40-U+0E4E)
        '\u{0E40}'..='\u{0E4E}' => "V",
        // Thai digits (U+0E50-U+0E59)
        '\u{0E50}'..='\u{0E59}' => "N",
        // Shared classes: "P" (punctuation), "A" (Latin), "N" (digits)
        _ => punct_latin_digit(c).unwrap_or("O"),
    }
}

Design Tips for Character Types

  • Identify linguistically distinct categories that correlate with word boundary patterns
  • Order matters – match arms are tried top to bottom, so put more specific classes before general ones
  • Consider high-frequency function words as a separate type (as Chinese does with “F”)
  • Use match guards for logic beyond plain ranges (as Korean does to split syllables with/without 받침)
  • Reuse the shared punct_latin_digit() helper for the common “P”/“A”/“N” classes

Step 4: Register the Classification Function

Add a match arm in Language::char_type():

pub fn char_type(&self, c: char) -> &'static str {
    match self {
        Language::Japanese => japanese_char_type(c),
        Language::Chinese => chinese_char_type(c),
        Language::Korean => korean_char_type(c),
        Language::Thai => thai_char_type(c),    // ← new
    }
}

Step 5: Decide on WC Feature Inclusion

In segmenter.rs, the internal attribute builder (write_attributes()) has a match on the language to decide whether to include WC features:

match self.language {
    Language::Japanese | Language::Chinese => {
        // Include WC features
        attr!("WC1:{}{}", w3, c4);
        attr!("WC2:{}{}", c3, w4);
        attr!("WC3:{}{}", w3, c3);
        attr!("WC4:{}{}", w4, c4);
    }
    _ => {}
}

If your language’s character types have enough variety to make WC features informative, add it to the match arm. If your type system is low-entropy (like Korean’s SN/SF), it is better to exclude WC features.

Step 6: Prepare Corpus and Train a Model

  1. Prepare a corpus with words separated by spaces:

    word1 word2 word3 word4
    
  2. Extract features:

    litsea extract -l thai ./corpus.txt ./features.txt
    
  3. Train a model:

    litsea train -t 0.005 -i 1000 ./features.txt ./models/thai.model
    

Step 7: Add Tests

Add tests in both language.rs and segmenter.rs:

// In language.rs tests
#[test]
fn test_thai_char_types() {
    let lang = Language::Thai;
    assert_eq!(lang.char_type('ก'), "T");   // Thai consonant
    assert_eq!(lang.char_type('A'), "A");   // ASCII
    assert_eq!(lang.char_type('@'), "O");   // Other
}

// In segmenter.rs tests
#[test]
fn test_char_type_thai() {
    let segmenter = Segmenter::new(Language::Thai, None);
    assert_eq!(segmenter.char_type("ก"), "T");
}

Run all tests to verify:

cargo test --workspace

Library API Overview

The litsea crate provides a Rust API for word segmentation, model training, and feature extraction.

Installation

[dependencies]
litsea = "0.5.0"

Loading models from local files is synchronous and needs no async runtime. An async runtime such as tokio is only required when loading models over HTTP/HTTPS with the async load_model method.

Module Map

graph LR
    A["litsea::segmenter"] --- B["Segmenter"]
    C["litsea::adaboost"] --- D["AdaBoost"]
    E["litsea::language"] --- F["Language"]
    G["litsea::extractor"] --- H["Extractor"]
    I["litsea::trainer"] --- J["Trainer, PosTrainer"]
    K["litsea::error"] --- L["LitseaError, Result"]
    M["litsea::perceptron"] --- N["AveragedPerceptron"]
    O["litsea::upos"] --- P["Upos, SegmentLabel"]
    Q["litsea::metrics"] --- R["BinaryMetrics, MulticlassMetrics"]

| Module | Primary Types | Purpose |
| --- | --- | --- |
| litsea::segmenter | Segmenter | Word segmentation, joint segmentation with POS tagging |
| litsea::adaboost | AdaBoost | Binary classification, model I/O |
| litsea::perceptron | AveragedPerceptron | Multiclass classification (POS tagging), model I/O |
| litsea::upos | Upos, SegmentLabel | UPOS POS tags, segment labels |
| litsea::language | Language | Language definitions, character classification |
| litsea::extractor | Extractor | Feature extraction from corpus |
| litsea::trainer | Trainer, PosTrainer | Training orchestration |
| litsea::error | LitseaError, Result | Error type and result alias |
| litsea::metrics | BinaryMetrics, MulticlassMetrics | Evaluation metrics |

All primary types are also re-exported at the crate root, so use litsea::Segmenter; works as a shorthand for use litsea::segmenter::Segmenter;.

Quick Example

use std::path::Path;

use litsea::adaboost::AdaBoost;
use litsea::language::Language;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    let mut learner = AdaBoost::new(0.01, 100);
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    let segmenter = Segmenter::new(Language::Japanese, Some(learner));
    let tokens = segmenter.segment("これはテストです。");

    assert_eq!(tokens, vec!["これ", "は", "テスト", "です", "。"]);
    Ok(())
}

Quick Example (POS Tagging)

use std::path::Path;

use litsea::language::Language;
use litsea::perceptron::AveragedPerceptron;
use litsea::segmenter::Segmenter;

fn main() -> litsea::Result<()> {
    let mut pos_learner = AveragedPerceptron::new();
    pos_learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

    let segmenter = Segmenter::with_pos_learner(Language::Japanese, pos_learner);
    let tokens = segmenter.segment_with_pos("これはテストです。");

    for (word, pos) in &tokens {
        print!("{}/{} ", word, pos);
    }
    println!();

    Ok(())
}

API Documentation

Full API documentation is available on docs.rs/litsea.

Segmenter

The Segmenter struct is the primary interface for word segmentation.

Definition

pub struct Segmenter {
    // private: language: Language,
    // private: learner: AdaBoost,
    // private: pos_learner: Option<AveragedPerceptron>,
}

The fields are private; use the accessor methods language(), learner(), learner_mut(), pos_learner(), and pos_learner_mut() to reach them.

Constructor

Segmenter::new

pub fn new(language: Language, learner: Option<AdaBoost>) -> Self

Creates a new segmenter.

  • language – The language for character type classification
  • learner – An optional pre-trained AdaBoost model. If None, a default (untrained) instance is created.

use litsea::language::Language;
use litsea::segmenter::Segmenter;

// With a pre-trained model
let segmenter = Segmenter::new(Language::Japanese, Some(learner));

// Without a model (for training or feature extraction)
let segmenter = Segmenter::new(Language::Japanese, None);

Methods

segment

pub fn segment(&self, sentence: &str) -> Vec<String>

Segments a sentence into words. Returns an empty vector for empty input.

let tokens = segmenter.segment("これはテストです。");
// ["これ", "は", "テスト", "です", "。"]

char_type

pub fn char_type(&self, ch: &str) -> &str

Classifies a single character into its type code using language-specific rules. The first character of the &str is classified; an empty string returns "O".

let segmenter = Segmenter::new(Language::Japanese, None);
assert_eq!(segmenter.char_type("あ"), "I");  // Hiragana
assert_eq!(segmenter.char_type("漢"), "H");  // Kanji
assert_eq!(segmenter.char_type("A"), "A");   // ASCII

add_corpus

pub fn add_corpus(&mut self, corpus: &str)

Processes a space-separated corpus and adds instances to the internal AdaBoost learner.

let mut segmenter = Segmenter::new(Language::Japanese, None);
segmenter.add_corpus("テスト です");

add_corpus_with_writer

pub fn add_corpus_with_writer<F>(&self, corpus: &str, writer: F)
where
    F: FnMut(HashSet<String>, i8),

Processes a corpus and calls the callback for each character position with its feature set and label.

segmenter.add_corpus_with_writer("テスト です", |attrs, label| {
    println!("Features: {:?}, Label: {}", attrs, label);
});

Accessors

pub fn language(&self) -> Language
pub fn learner(&self) -> &AdaBoost
pub fn learner_mut(&mut self) -> &mut AdaBoost
pub fn pos_learner(&self) -> Option<&AveragedPerceptron>
pub fn pos_learner_mut(&mut self) -> Option<&mut AveragedPerceptron>

Provide access to the segmenter’s language and its internal learners.

Feature extraction for a character position (38 features for Korean, 42 for Japanese/Chinese) is an internal detail; the former get_attributes method is now private.

Extractor

The Extractor struct extracts features from a corpus file for model training.

Definition

pub struct Extractor {
    segmenter: Segmenter,
}

Constructor

Extractor::new

pub fn new(language: Language) -> Self

Creates a new extractor for the specified language. Internally creates a Segmenter without a pre-trained model.

use litsea::extractor::Extractor;
use litsea::language::Language;

let mut extractor = Extractor::new(Language::Japanese);

Methods

extract

pub fn extract(
    &mut self,
    corpus_path: &Path,
    features_path: &Path,
) -> litsea::Result<()>

Reads a corpus file (space-separated words, one sentence per line) and writes the extracted features to the output file.

use std::path::Path;

extractor.extract(
    Path::new("./corpus.txt"),
    Path::new("./features.txt"),
)?;

Pipeline

flowchart LR
    A["corpus.txt<br/>(space-separated words)"] --> B["Extractor::extract()"]
    B --> C["features.txt<br/>(label + features per position)"]

The extractor:

  1. Reads each line from the corpus file
  2. Calls Segmenter::add_corpus_with_writer() to process each line
  3. Writes the label and feature set for each character position to the output file

Trainer

The Trainer struct orchestrates the full model training pipeline.

Definition

pub struct Trainer {
    learner: AdaBoost,
}

Constructor

Trainer::new

pub fn new(
    threshold: f64,
    num_iterations: usize,
    features_path: &Path,
) -> litsea::Result<Self>

Creates a trainer and initializes it from a features file. This calls AdaBoost::initialize_features() and AdaBoost::initialize_instances().

use std::path::Path;
use litsea::trainer::Trainer;

let mut trainer = Trainer::new(
    0.005,                           // threshold
    1000,                            // max iterations
    Path::new("./features.txt"),     // features file
)?;

Methods

load_model

pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>

Loads an existing model for retraining. Supports file paths, file://, and (with the remote_model feature) http:// and https:// URIs.

When called after Trainer::new, the loaded weights are merged into the freshly initialized training data by feature name, so incremental training starts from the existing model without corrupting the feature index.

trainer.load_model("./models/japanese.model").await?;

train

pub fn train(
    &mut self,
    running: Arc<AtomicBool>,
    model_path: &Path,
) -> litsea::Result<BinaryMetrics>

Trains the model and saves it to the specified path. Returns evaluation metrics.

The running flag enables graceful interruption – set it to false to stop training early.

use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::path::Path;

let running = Arc::new(AtomicBool::new(true));
let metrics = trainer.train(running, Path::new("./model.model"))?;

println!("Accuracy: {:.2}%", metrics.accuracy);
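
The interruption pattern behind the running flag can be sketched with the standard library alone. The function `run_interruptible` and its `step` closure are hypothetical stand-ins for the training loop and one training round; how Litsea checks the flag internally is not shown in this chapter.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Runs up to `max_iterations` steps, checking `running` before each one.
/// `step` stands in for one training round. Returns the number completed.
fn run_interruptible<F>(running: &Arc<AtomicBool>, max_iterations: usize, mut step: F) -> usize
where
    F: FnMut(usize),
{
    let mut completed = 0;
    for t in 0..max_iterations {
        if !running.load(Ordering::SeqCst) {
            break; // graceful stop: keep whatever has been learned so far
        }
        step(t);
        completed += 1;
    }
    completed
}
```

In an application, a clone of the Arc would typically be handed to a signal handler or another thread, which stores false to request the stop.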

Full Training Example

use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::path::Path;

use litsea::trainer::Trainer;

#[tokio::main]
async fn main() -> litsea::Result<()> {
    let mut trainer = Trainer::new(
        0.005,
        1000,
        Path::new("./features.txt"),
    )?;

    // Optionally resume from an existing model
    // trainer.load_model("./models/japanese.model").await?;

    let running = Arc::new(AtomicBool::new(true));
    let metrics = trainer.train(running, Path::new("./model.model"))?;

    println!("Accuracy:  {:.2}%", metrics.accuracy);
    println!("Precision: {:.2}%", metrics.precision);
    println!("Recall:    {:.2}%", metrics.recall);

    Ok(())
}

AdaBoost

The AdaBoost struct implements binary classification for word boundary detection.

Definition

pub struct AdaBoost {
    pub threshold: f64,
    pub num_iterations: usize,
    // internal fields: model weights, features, instances, etc.
}

Constructor

AdaBoost::new

pub fn new(threshold: f64, num_iterations: usize) -> Self

Creates a new AdaBoost instance with the specified hyperparameters.

use litsea::adaboost::AdaBoost;

let mut learner = AdaBoost::new(0.01, 100);

Model Loading

load_model_from_path

pub fn load_model_from_path(&mut self, path: &Path) -> litsea::Result<()>

Loads model weights from a local file, synchronously. This is the preferred method for local files – no async runtime is needed.

use std::path::Path;

learner.load_model_from_path(Path::new("./models/japanese.model"))?;

load_model_from_reader

pub fn load_model_from_reader<R: BufRead>(&mut self, reader: R) -> litsea::Result<()>

Loads model weights from any BufRead source, such as an in-memory buffer or an already-open file.

load_model

pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>

Loads model weights from a URI. Supports:

  • Local file path: ./models/japanese.model
  • File URI: file:///path/to/model
  • HTTP: http://example.com/model (requires the remote_model feature)
  • HTTPS: https://example.com/model (requires the remote_model feature)

learner.load_model("https://example.com/model").await?;

save_model

pub fn save_model(&self, filename: &Path) -> litsea::Result<()>

Saves model weights to a file. Returns an error if the model is empty.

Training Methods

initialize_features

pub fn initialize_features(&mut self, filename: &Path) -> litsea::Result<()>

Reads a features file and builds the feature index. Must be called before initialize_instances.

initialize_instances

pub fn initialize_instances(&mut self, filename: &Path) -> litsea::Result<()>

Reads the same features file and initializes labeled instances with their weights.

train

pub fn train(&mut self, running: Arc<AtomicBool>)

Runs the AdaBoost training loop. Set running to false to stop early.

add_instance

pub fn add_instance(&mut self, attributes: HashSet<String>, label: i8)

Adds a single training instance with its feature set and label.

Prediction

predict

pub fn predict(&self, attributes: &HashSet<String>) -> i8

Predicts the label for a given feature set. Returns +1 (boundary) or -1 (non-boundary).

use std::collections::HashSet;

let mut attrs = HashSet::new();
attrs.insert("UW4:は".to_string());
attrs.insert("UC4:I".to_string());
// ... more features

let label = learner.predict(&attrs);
// label == 1 (boundary) or -1 (non-boundary)

bias

pub fn bias(&self) -> f64

Returns the bias term: -sum(all model weights) / 2.0.

Evaluation

metrics

pub fn metrics(&self) -> BinaryMetrics

Calculates evaluation metrics on the training data.

BinaryMetrics

Defined in litsea::metrics (also re-exported as litsea::BinaryMetrics):

pub struct BinaryMetrics {
    pub accuracy: f64,          // Accuracy in percentage
    pub precision: f64,         // Precision in percentage
    pub recall: f64,            // Recall in percentage
    pub num_instances: usize,
    pub true_positives: usize,
    pub false_positives: usize,
    pub false_negatives: usize,
    pub true_negatives: usize,
}
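
The three percentage fields follow from the four counts by the standard definitions. The sketch below shows that relationship with a hypothetical free function, `binary_metrics`; returning 0.0 for an empty denominator is an assumption of this example, not documented library behavior.

```rust
/// Returns (accuracy, precision, recall) in percent from confusion counts.
/// Empty denominators yield 0.0 here; that choice is an assumption.
fn binary_metrics(tp: usize, fp: usize, fn_: usize, tn: usize) -> (f64, f64, f64) {
    let pct = |num: usize, den: usize| {
        if den == 0 {
            0.0
        } else {
            num as f64 * 100.0 / den as f64
        }
    };
    let total = tp + fp + fn_ + tn;
    (
        pct(tp + tn, total), // accuracy: correct decisions over all decisions
        pct(tp, tp + fp),    // precision: true boundaries over predicted boundaries
        pct(tp, tp + fn_),   // recall: true boundaries over actual boundaries
    )
}
```

With 3 true positives, 1 false positive, 3 false negatives, and 1 true negative, accuracy is 50%, precision 75%, and recall 50%.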

Averaged Perceptron

The AveragedPerceptron struct implements multiclass classification for joint word segmentation and POS tagging.

Definition

pub struct AveragedPerceptron {
    // internal fields: weights, accumulated, timestamps, step, classes, instances
}

Constructor

AveragedPerceptron::new

pub fn new() -> Self

Creates a new empty Averaged Perceptron instance.

use litsea::perceptron::AveragedPerceptron;

let mut learner = AveragedPerceptron::new();

Adding Instances

add_instance

pub fn add_instance(&mut self, features: HashSet<String>, label: String)

Adds a training instance with a feature set and a label. Unknown classes are automatically registered.

use std::collections::HashSet;
use litsea::perceptron::AveragedPerceptron;

let mut learner = AveragedPerceptron::new();
let mut feats = HashSet::new();
feats.insert("UW4:猫".to_string());
feats.insert("UC4:H".to_string());
learner.add_instance(feats, "B-NOUN".to_string());

Training

train

pub fn train(&mut self, num_epochs: usize, running: Arc<AtomicBool>)

Runs the Averaged Perceptron training loop for the given number of epochs. Set running to false to stop early. Weights are automatically averaged at the end of training.

use std::sync::Arc;
use std::sync::atomic::AtomicBool;

let running = Arc::new(AtomicBool::new(true));
learner.train(10, running);

Prediction

predict

pub fn predict(&self, features: &HashSet<String>) -> String

Predicts the class label for a given feature set. Computes a score for each class and returns the class name with the highest score. Returns an empty string if no classes are registered.

use std::collections::HashSet;

let mut attrs = HashSet::new();
attrs.insert("UW4:は".to_string());
attrs.insert("UC4:I".to_string());
// ... more features

let label = learner.predict(&attrs);
// label == "B-ADP", "O", etc.
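
The scoring rule described above (one score per class, highest score wins) can be sketched as follows. The nested `weights[feature][class]` map, the function name `predict_class`, and the tie-breaking rule are assumptions made for this illustration; the library's internal layout is not documented here.

```rust
use std::collections::{HashMap, HashSet};

/// Returns the class with the highest summed weight over the active
/// features, or an empty string when no classes are registered.
/// `weights[feature][class]` is a hypothetical layout for this sketch.
fn predict_class(
    classes: &[String],
    weights: &HashMap<String, HashMap<String, f64>>,
    features: &HashSet<String>,
) -> String {
    let mut best: Option<(&String, f64)> = None;
    for class in classes {
        let score: f64 = features
            .iter()
            .filter_map(|f| weights.get(f))
            .filter_map(|per_class| per_class.get(class))
            .sum();
        // Strict comparison: on ties the earlier class in `classes` wins.
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((class, score));
        }
    }
    best.map(|(c, _)| c.clone()).unwrap_or_default()
}
```

Iterating over an ordered class list rather than a hash map keeps the result deterministic when two classes receive the same score.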

Model I/O

save_model

pub fn save_model(&self, path: &Path) -> litsea::Result<()>

Saves model weights to a file. Returns an error if the model is empty.

load_model_from_path

pub fn load_model_from_path(&mut self, path: &Path) -> litsea::Result<()>

Loads model weights from a local file, synchronously. This is the preferred method for local files – no async runtime is needed.

use std::path::Path;

learner.load_model_from_path(Path::new("./models/japanese_pos.model"))?;

load_model_from_reader

pub fn load_model_from_reader<R: BufRead>(&mut self, reader: R) -> litsea::Result<()>

Loads model weights from any BufRead source, such as an in-memory buffer or an already-open file.

load_model

pub async fn load_model(&mut self, uri: &str) -> litsea::Result<()>

Loads model weights from a URI. Supports the following URI schemes:

  • Local file path: ./models/japanese_pos.model
  • File URI: file:///path/to/model
  • HTTP: http://example.com/model (requires the remote_model feature)
  • HTTPS: https://example.com/model (requires the remote_model feature)

learner.load_model("https://example.com/models/japanese_pos.model").await?;

Evaluation

metrics

pub fn metrics(&self) -> MulticlassMetrics

Calculates evaluation metrics on the training data.

MulticlassMetrics

Defined in litsea::metrics (also re-exported as litsea::MulticlassMetrics):

pub struct MulticlassMetrics {
    pub accuracy: f64,                            // Overall accuracy in percentage
    pub macro_precision: f64,                     // Macro-averaged precision in percentage
    pub macro_recall: f64,                        // Macro-averaged recall in percentage
    pub num_instances: usize,                     // Number of instances
    pub correct_per_class: HashMap<String, usize>,   // Correct count per class
    pub predicted_per_class: HashMap<String, usize>,  // Predicted count per class
    pub gold_per_class: HashMap<String, usize>,       // Gold label count per class
}

UPOS

The upos module defines the Universal POS (UPOS) tagset and segment label types used for POS tagging.

Upos

Definition

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Upos {
    ADJ,    // Adjective
    ADP,    // Adposition
    ADV,    // Adverb
    AUX,    // Auxiliary
    CCONJ,  // Coordinating conjunction
    DET,    // Determiner
    INTJ,   // Interjection
    NOUN,   // Noun
    NUM,    // Numeral
    PART,   // Particle
    PRON,   // Pronoun
    PROPN,  // Proper noun
    PUNCT,  // Punctuation
    SCONJ,  // Subordinating conjunction
    SYM,    // Symbol
    VERB,   // Verb
    X,      // Other
}

Litsea supports all 17 UPOS tags from the Universal Dependencies project:

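| Tag | Description | Example (Japanese) |
| --- | --- | --- |
| ADJ | Adjective | いい, 大きい |
| ADP | Adposition | は, が, を, に |
| ADV | Adverb | とても, まだ |
| AUX | Auxiliary | です, ます, た |
| CCONJ | Coordinating conjunction | と, や |
| DET | Determiner | この, その |
| INTJ | Interjection | ああ, はい |
| NOUN | Noun | 天気, 本 |
| NUM | Numeral | 一, 二, 100 |
| PART | Particle | ね, よ |
| PRON | Pronoun | これ, それ |
| PROPN | Proper noun | 東京, 太郎 |
| PUNCT | Punctuation | 。, 、 |
| SCONJ | Subordinating conjunction | ので, から |
| SYM | Symbol | %, $ |
| VERB | Verb | 読む, 書く |
| X | Other | (unclassified tokens) |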

Constant

Upos::ALL

pub const ALL: [Upos; 17]

An associated constant holding all 17 UPOS tags, useful for iterating over the tagset.

Trait Implementations

  • Display: Converts to a string such as "NOUN", "VERB", etc.
  • FromStr: Parses a string into Upos. Returns an error for invalid strings.

use litsea::upos::Upos;

fn main() {
    let pos: Upos = "NOUN".parse().unwrap();
    assert_eq!(pos.to_string(), "NOUN");
}

SegmentLabel

Definition

The SegmentLabel type combines word boundary detection with POS tagging. Each character position is assigned one of 18 labels:

  • B(Upos) (17 labels): Word boundary with the given UPOS tag (e.g., B-NOUN, B-VERB)
  • O (1 label): Non-boundary (continuation of the current word)

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum SegmentLabel {
    B(Upos),  // Start of a word (boundary). Carries POS information.
    O,        // Continuation of a word (non-boundary).
}

use litsea::upos::SegmentLabel;

// Segment labels for "今日は" (kyou wa)
// 今 → B-NOUN  (start of "今日", tagged as NOUN)
// 日 → O       (continuation of "今日")
// は → B-ADP   (start of "は", tagged as ADP)

Methods

all_labels

pub fn all_labels() -> Vec<SegmentLabel>

Returns a vector of all 18 segment labels.

is_boundary

pub fn is_boundary(&self) -> bool

Returns whether this is a boundary label (B-*).

pos

pub fn pos(&self) -> Option<Upos>

Returns the UPOS tag. Returns None for the non-boundary label (O).

Trait Implementations

  • Display: Converts to a string such as "B-NOUN", "O", etc.
  • FromStr: Parses a string into SegmentLabel.

use litsea::upos::{SegmentLabel, Upos};

fn main() {
    let label: SegmentLabel = "B-NOUN".parse().unwrap();
    assert!(label.is_boundary());
    assert_eq!(label.pos(), Some(Upos::NOUN));

    let label_o: SegmentLabel = "O".parse().unwrap();
    assert!(!label_o.is_boundary());
    assert_eq!(label_o.pos(), None);
}
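
To make the label scheme concrete, the following sketch turns a per-character label sequence back into word/POS tokens. It uses a local stand-in enum (`Label`) rather than Litsea's `SegmentLabel`, and the function name `decode` and the fallback tag for a leading `O` are assumptions made for illustration.

```rust
/// Stand-in for a segment label, reduced to a tag name for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Label {
    B(&'static str), // word start, carrying a UPOS tag name
    O,               // continuation of the current word
}

/// Groups characters into (word, tag) pairs. A `B` label opens a new word;
/// an `O` label extends the current one. A leading `O` (no open word yet)
/// starts a word tagged "X", which is a choice made only for this sketch.
fn decode(chars: &[char], labels: &[Label]) -> Vec<(String, String)> {
    let mut tokens: Vec<(String, String)> = Vec::new();
    for (c, label) in chars.iter().zip(labels) {
        match label {
            Label::B(tag) => tokens.push((c.to_string(), tag.to_string())),
            Label::O => match tokens.last_mut() {
                Some((word, _)) => word.push(*c),
                None => tokens.push((c.to_string(), "X".to_string())),
            },
        }
    }
    tokens
}
```

Applied to the characters 今, 日, は with labels B-NOUN, O, B-ADP, this yields the pairs (今日, NOUN) and (は, ADP).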

Language

The Language enum defines language-specific behavior, including character type classification.

Language Enum

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum Language {
    #[default]
    Japanese,
    Chinese,
    Korean,
}

Traits

  • Default – Returns Language::Japanese
  • Display – Returns lowercase name ("japanese", "chinese", "korean")
  • FromStr – Parses from full name or ISO 639-1 code (case-insensitive)

Parsing

use litsea::language::Language;

fn main() {
    // Full names
    let ja: Language = "japanese".parse().unwrap();
    let zh: Language = "chinese".parse().unwrap();
    let ko: Language = "korean".parse().unwrap();

    // ISO 639-1 codes
    let ja: Language = "ja".parse().unwrap();
    let zh: Language = "zh".parse().unwrap();
    let ko: Language = "ko".parse().unwrap();

    // Case-insensitive
    let ko: Language = "KOREAN".parse().unwrap();

    // Invalid
    assert!("french".parse::<Language>().is_err());
}

char_type

pub fn char_type(&self, c: char) -> &'static str

Classifies a character into its language-specific type code. Returns "O" (Other) if the character does not belong to any class.

Classification is a direct match on character ranges – allocation-free, O(1), and with no regex involved.

use litsea::language::Language;

fn main() {
    let lang = Language::Japanese;
    assert_eq!(lang.char_type('あ'), "I"); // hiragana
    assert_eq!(lang.char_type('漢'), "H"); // kanji
    assert_eq!(lang.char_type('@'), "O"); // other
}

Internally, char_type dispatches to a private per-language function (japanese_char_type, chinese_char_type, korean_char_type). The classes common to all languages – "P" (punctuation), "A" (Latin), and "N" (digits) – are handled by a shared helper that is checked after the language-specific classes.
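
The dispatch order described above can be sketched with plain `match` expressions on character ranges. This is not Litsea's source: the function names, the exact Unicode ranges, the `"K"` code for katakana, and the small punctuation set are all assumptions chosen for illustration.

```rust
/// Shared classes, consulted only after the language-specific ones.
fn common_char_type(c: char) -> &'static str {
    match c {
        'A'..='Z' | 'a'..='z' => "A",   // Latin letters
        '0'..='9' => "N",               // ASCII digits
        '。' | '、' | '.' | ',' => "P", // small illustrative punctuation set
        _ => "O",                       // everything else
    }
}

/// Hypothetical Japanese classifier: language-specific ranges come first,
/// and anything they do not cover falls through to the shared helper.
fn japanese_char_type_sketch(c: char) -> &'static str {
    match c {
        '\u{3041}'..='\u{3096}' => "I", // hiragana
        '\u{30A1}'..='\u{30FA}' => "K", // katakana (the code "K" is assumed)
        '\u{4E00}'..='\u{9FFF}' => "H", // CJK unified ideographs
        _ => common_char_type(c),
    }
}
```

A range `match` compiles to a handful of comparisons, which is why this style needs no allocation and no regex engine.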

CLI Reference Overview

The litsea CLI provides commands for word segmentation, model training, and text processing.

Usage

litsea <COMMAND> [OPTIONS] [ARGS]

Commands

| Command | Description |
|---------|-------------|
| extract | Extract features from a corpus for training |
| train | Train a word segmentation model |
| segment | Segment text into words using a trained model |

Global Options

| Option | Description |
|--------|-------------|
| -h, --help | Show help information |
| -V, --version | Show version number |

Typical Workflow

AdaBoost Workflow (Word Segmentation Only)

flowchart LR
    A["1. scripts/download_udtreebank.sh"] --> B["2. scripts/corpus_udtreebank.sh"]
    B --> C["3. litsea extract"]
    C --> D["4. litsea train"]
    D --> E["5. litsea segment"]

  1. Download a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
  2. Convert to corpus format: bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
  3. Extract features: litsea extract -l japanese corpus.txt features.txt
  4. Train a model: litsea train -t 0.005 -i 1000 features.txt model.model
  5. Segment text: echo "text" | litsea segment -l japanese model.model

POS Workflow (Word Segmentation with POS Tagging)

flowchart LR
    A["1. scripts/download_udtreebank.sh"] --> B["2. scripts/corpus_udtreebank.sh -p"]
    B --> C["3. litsea extract --pos"]
    C --> D["4. litsea train --pos"]
    D --> E["5. litsea segment --pos"]

  1. Download a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
  2. Convert to POS corpus format: bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
  3. Extract POS features: litsea extract --pos -l japanese pos_corpus.txt features_pos.txt
  4. Train a POS model: litsea train --pos --num-epochs 10 features_pos.txt model_pos.model
  5. Segment with POS tags: echo "text" | litsea segment --pos -l japanese model_pos.model

extract

Extract features from a corpus file for model training.

Usage

litsea extract [OPTIONS] <CORPUS_FILE> <FEATURES_FILE>

Arguments

| Argument | Description |
|----------|-------------|
| CORPUS_FILE | Path to the input corpus file (words separated by spaces, one sentence per line) |
| FEATURES_FILE | Path to the output features file |

Options

| Option | Default | Description |
|--------|---------|-------------|
| -l, --language <LANGUAGE> | japanese | Language for character type classification. Accepts: japanese / ja, chinese / zh, korean / ko |
| --pos | off | Enable POS (Part-of-Speech) feature extraction mode. Requires a POS corpus as input |

Corpus Format

The input corpus must have words separated by spaces, one sentence per line:

Litsea は TinySegmenter を 参考 に 開発 さ れ た 。
Rust で 実装 さ れ た コンパクト な 単語 分割 ソフトウェア です 。

Output Format

The features file contains one line per character position:

1	UW1:B2 UW2:B1 UW3:L UW4:i UW5:t UC1:O UC2:O UC3:A UC4:A ...
-1	UW1:B1 UW2:L UW3:i UW4:t UW5:s UC1:O UC2:A UC3:A UC4:A ...
  • 1 = word boundary
  • -1 = non-boundary
  • Features are tab-separated
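
When post-processing a features file, each line can be split into its label and feature list. The helper below is a minimal sketch with a hypothetical name; it assumes the label is followed by a tab and accepts either tabs or spaces between the features.

```rust
/// Splits one line of a features file into (label, features).
/// The label is separated from the features by a tab; the features are
/// split on any whitespace so that both tabs and spaces are accepted.
fn parse_feature_line(line: &str) -> Option<(i8, Vec<&str>)> {
    let (label, rest) = line.split_once('\t')?;
    let label: i8 = label.trim().parse().ok()?;
    Some((label, rest.split_whitespace().collect()))
}
```

Lines without a tab or with a non-numeric label return `None`, so a caller can skip or report them.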

Examples

# Japanese
litsea extract -l japanese ./corpus.txt ./features.txt

# Chinese
litsea extract -l zh ./corpus_zh.txt ./features_zh.txt

# Korean
litsea extract -l ko ./corpus_ko.txt ./features_ko.txt

Output to stderr on success:

Feature extraction completed successfully.

POS Feature Extraction

When the --pos flag is specified, extract expects a POS corpus instead of a plain word-separated corpus. Each line contains words annotated with UPOS tags in the format word/POS:

POS Corpus Format

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
今日/NOUN は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

POS Feature Output Format

In POS mode, the label column uses segment labels (B-NOUN, B-VERB, …, B-X, O) instead of binary 1/-1:

B-NOUN	UW1:B2 UW2:B1 UW3:こ UW4:れ UW5:は UC1:O UC2:O UC3:I UC4:I ...
O	UW1:B1 UW2:こ UW3:れ UW4:は UW5:テ UC1:O UC2:I UC3:I UC4:I ...

POS Extraction Example

litsea extract --pos -l japanese ./pos_corpus.txt ./pos_features.txt

train

Train a word segmentation model using AdaBoost or, with --pos, a joint segmentation and POS tagging model using the Averaged Perceptron.

Usage

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Arguments

| Argument | Description |
|----------|-------------|
| FEATURES_FILE | Path to the input features file (output from extract) |
| MODEL_FILE | Path to the output model file |

Options

| Option | Default | Description |
|--------|---------|-------------|
| -t, --threshold <THRESHOLD> | 0.01 | Weak classifier accuracy threshold for early stopping. Lower values allow more iterations |
| -i, --num-iterations <NUM_ITERATIONS> | 100 | Maximum number of boosting iterations |
| -m, --load-model-uri <LOAD_MODEL_URI> | None | URI of an existing model to resume training from (file path or HTTP/HTTPS URL) |
| --pos | off | Enable POS (Part-of-Speech) training mode using Averaged Perceptron |
| -e, --num-epochs <NUM_EPOCHS> | 10 | Number of training epochs (POS mode only) |

Output

Training metrics are printed to stderr:

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

Ctrl+C Handling

Training supports graceful interruption:

  • First Ctrl+C: Stops training and saves the model at its current state
  • Second Ctrl+C: Exits immediately without saving

This allows you to stop long-running training sessions without losing progress.

Examples

Basic training:

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training with higher precision (lower threshold, more iterations):

litsea train -t 0.001 -i 5000 ./features.txt ./model.model

Retraining from an existing model:

litsea train -t 0.005 -i 1000 -m ./models/japanese.model \
    ./new_features.txt ./models/japanese_v2.model

Hyperparameter Tuning

| Parameter | Effect of Decreasing | Effect of Increasing |
|-----------|----------------------|----------------------|
| threshold | More iterations, potentially higher accuracy, longer training time | Fewer iterations, faster training, may underfit |
| num_iterations | Fewer boosting rounds, smaller model, may underfit | More rounds, larger model, potentially higher accuracy |

POS Model Training

When the --pos flag is specified, train uses the Averaged Perceptron algorithm instead of AdaBoost. This trains a multiclass classifier for joint word segmentation and POS tagging.

Usage

litsea train --pos [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

POS Training Options

| Option | Default | Description |
|--------|---------|-------------|
| --pos | off | Enable POS training mode |
| -e, --num-epochs <NUM_EPOCHS> | 10 | Number of training epochs |

Examples

# Train a POS model from POS features
litsea train --pos -e 10 ./pos_features.txt ./models/japanese_pos.model

Output

POS training metrics are printed to stderr (macro-averaged precision and recall):

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

Ctrl+C Handling

As with AdaBoost training, POS training supports graceful interruption: the first Ctrl+C stops training and saves the model at its current state, and a second Ctrl+C exits immediately without saving.

POS Hyperparameters

| Parameter | Effect of Decreasing | Effect of Increasing |
|-----------|----------------------|----------------------|
| num_epochs | Faster training, may underfit | Better accuracy, longer training, may overfit |

segment

Segment text into words using a trained model.

Usage

echo "text" | litsea segment [OPTIONS] <MODEL_URI>

Arguments

| Argument | Description |
|----------|-------------|
| MODEL_URI | Path or URL to the trained model file. Supports: local file paths, file://, http://, https:// |

Options

| Option | Default | Description |
|--------|---------|-------------|
| -l, --language <LANGUAGE> | japanese | Language for character type classification. Accepts: japanese / ja, chinese / zh, korean / ko |
| --pos | off | Enable POS-tagged segmentation output. Requires a POS model trained with train --pos |

Input / Output

  • Input: Reads from stdin, one sentence per line. Empty lines are skipped.
  • Output: Writes to stdout, space-separated tokens, one line per input line.

Examples

Japanese:

echo "LitseaはTinySegmenterを参考に開発された。" \
  | litsea segment -l japanese ./models/japanese.model
Litsea は TinySegmenter を 参考 に 開発 さ れ た 。

Chinese:

echo "中文分词测试。" | litsea segment -l chinese ./models/chinese.model

Korean:

echo "한국어 단어 분할 테스트입니다." \
  | litsea segment -l korean ./models/korean.model

Processing a file:

cat input.txt | litsea segment -l japanese ./models/japanese.model > output.txt

Loading a model from a URL:

echo "テスト文です。" \
  | litsea segment -l japanese https://example.com/models/japanese.model

POS-Tagged Segmentation (--pos)

When the --pos flag is specified, segmentation and POS tagging are performed simultaneously using an Averaged Perceptron model.

Usage

echo "text" | litsea segment --pos [OPTIONS] <MODEL_URI>

Output Format

Each token is output in word/POS format. POS tags conform to the UPOS tag set.

echo "今日はいい天気ですね。" \
  | litsea segment --pos -l japanese ./models/japanese_pos.model
今日/X は/ADP いい/ADJ 天気/NOUN です/AUX ね/PART 。/PUNCT

Processing a File

cat input.txt | litsea segment --pos -l japanese ./models/japanese_pos.model > output.txt

Notes

  • The --language flag must match the language the model was trained for
  • Model loading is asynchronous and supports HTTP/HTTPS with TLS (rustls)
  • The model URI is not restricted to local file paths: file://, http://, and https:// URLs are also accepted
  • When using --pos, the model must be a POS model trained with train --pos

Training Guide

This guide walks you through training custom word segmentation and POS tagging models with Litsea.

Both workflows use Universal Dependencies (UD) Treebanks as the data source.

Word Segmentation (AdaBoost)

  1. Prepare a corpus from a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
  2. Extract features from the corpus
  3. Train a model using AdaBoost

POS Tagging (Averaged Perceptron)

  1. Prepare a POS corpus from a UD Treebank: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
  2. Extract POS features: litsea extract --pos -l japanese pos_corpus.txt features.txt
  3. Train a POS model: litsea train --pos --num-epochs 10 features.txt model.txt

Preparing a Corpus

A good training corpus is essential for model accuracy. This guide explains how to prepare one using Universal Dependencies (UD) Treebanks.

Data Source: UD Treebanks

Litsea uses UD Treebanks as the data source for both word segmentation and POS tagging. UD Treebanks provide high-quality, manually annotated data in CoNLL-U format for many languages.

Available Treebanks

| Language | Treebank | Repository |
|----------|----------|------------|
| Japanese | UD Japanese-GSD | UD_Japanese-GSD |
| Chinese | UD Chinese-GSD | UD_Chinese-GSD |
| Korean | UD Korean-GSD | UD_Korean-GSD |

Step 1: Download a UD Treebank

Use scripts/download_udtreebank.sh to download a UD Treebank. It prints the path to the training CoNLL-U file to stdout:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

Supported languages: ja (Japanese, default), ko (Korean), zh (Chinese). Use -o to specify the output directory (default: current directory).

Corpus for Word Segmentation

For word segmentation (AdaBoost), the corpus must be a plain text file with:

  • One sentence per line
  • Words separated by spaces

太郎 は 走っ た 。
Litsea は コンパクト な 単語 分割 ソフトウェア です 。

Convert CoNLL-U to Word Segmentation Corpus

Use scripts/corpus_udtreebank.sh to convert a CoNLL-U file to corpus format:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

This converts the CoNLL-U data into space-separated words (one sentence per line).

Corpus for POS Tagging

For POS tagging (Averaged Perceptron), each word must be annotated with its POS tag.

POS Corpus Format

Each line represents one sentence, with words annotated as word/POS pairs separated by spaces:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT
Litsea/PROPN は/ADP 単語/NOUN 分割/NOUN ソフトウェア/NOUN です/AUX 。/PUNCT

The POS tags follow the Universal POS (UPOS) tagset with 17 categories: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
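
Before running feature extraction, it can be useful to check that every token in a POS corpus line has the word/POS shape and a known tag. The validator below is a sketch with a hypothetical name; splitting on the last `/` (so that a word may itself contain a slash) is an assumption about the format, not documented behavior.

```rust
/// Parses one POS corpus line into (word, tag) pairs, rejecting tokens
/// that lack a tag or use a tag outside the 17 UPOS categories.
fn parse_pos_line(line: &str) -> Result<Vec<(String, String)>, String> {
    const UPOS: [&str; 17] = [
        "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
    ];
    line.split_whitespace()
        .map(|token| {
            // Split on the last '/', so the word part may contain slashes.
            let (word, tag) = token
                .rsplit_once('/')
                .ok_or_else(|| format!("missing tag: {token}"))?;
            if word.is_empty() || !UPOS.contains(&tag) {
                return Err(format!("invalid token: {token}"));
            }
            Ok((word.to_string(), tag.to_string()))
        })
        .collect()
}
```

Collecting into a `Result` stops at the first malformed token, which makes it easy to report the offending line.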

Convert CoNLL-U to POS Corpus

Use scripts/corpus_udtreebank.sh with the -p flag to produce a POS corpus:

conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Multi-word tokens and empty nodes are automatically handled during conversion.

Automated Corpus Preparation

Litsea includes helper scripts in the scripts/ directory that automate the UD Treebank download and conversion:

  • scripts/download_udtreebank.sh – Downloads a UD Treebank and prints the path to the training CoNLL-U file
  • scripts/corpus_udtreebank.sh – Converts a CoNLL-U file to Litsea corpus format

# Download UD Treebank and get CoNLL-U file path
conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)

# Generate word segmentation corpus
bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

# Generate POS corpus
bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt

Supported languages for download_udtreebank.sh: ja (Japanese, default), ko (Korean), zh (Chinese).

Corpus from Wikipedia Dump

For larger-scale training, you can build a corpus from a full Wikipedia dump using scripts/corpus_wikidump.sh. This extracts plain text with wicket, filters for actual sentences, and tokenizes with lindera.

Usage

# Japanese (default)
bash scripts/corpus_wikidump.sh jawiki-latest-pages-articles.xml.bz2 corpus_ja.txt

# Korean
bash scripts/corpus_wikidump.sh -l ko kowiki-latest-pages-articles.xml.bz2 corpus_ko.txt

# Chinese
bash scripts/corpus_wikidump.sh -l zh zhwiki-latest-pages-articles.xml.bz2 corpus_zh.txt

Options

| Option | Description | Default |
|--------|-------------|---------|
| -l lang | Language code: ja, ko, zh | ja |
| -n max_lines | Maximum sentence lines to process (0 = unlimited) | 100000 |

Sentence Filtering

The script applies two filters to keep only well-formed sentences:

  1. Sentence-ending punctuation – Lines must end with 。, ., !, or ?. This excludes section headers (e.g., “参考文献”), list items, and metadata.
  2. Minimum length – Lines must be at least 20 characters. This excludes short fragments and isolated labels.
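
The two filters can be expressed as a single predicate. The sketch below is a Rust restatement for illustration only (the script itself is a shell script). It assumes the terminator set is 。 . ! ? and that "characters" means Unicode scalar values rather than bytes; the function name is hypothetical.

```rust
/// Returns true when a line passes both filters: it ends with
/// sentence-ending punctuation and is at least 20 characters long.
fn keep_sentence(line: &str) -> bool {
    let line = line.trim_end();
    let ends_ok = matches!(line.chars().last(), Some('。' | '.' | '!' | '?'));
    // Count characters, not bytes: one kana or kanji is 3 bytes in UTF-8.
    ends_ok && line.chars().count() >= 20
}
```

Counting bytes instead of characters would let much shorter CJK fragments through, since most kana and kanji occupy three bytes each.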

Tokenizer Dictionaries

| Language | Dictionary | Token Filter |
|----------|------------|--------------|
| Japanese (ja) | embedded://unidic | japanese_compound_word (numeral compound) |
| Korean (ko) | embedded://ko-dic | None |
| Chinese (zh) | embedded://cc-cedict | None |

Corpus Size Guidelines

The recommended corpus size depends on your use case:

| Size (sentence lines) | Use Case |
|-----------------------|----------|
| ~10,000 | Minimum for prototyping and smoke tests |
| 50,000 – 100,000 | Practical range for model training |
| 100,000 – 500,000 | High-quality, robust models |
| Unlimited | Use full dump for maximum accuracy |

The default max_lines=100000 in corpus_wikidump.sh targets the practical-to-high-quality range.

Corpus Quality Tips

  • Diversity – Include text from various domains (news, literature, web, etc.)
  • Size – See Corpus Size Guidelines above for recommended sizes
  • Consistency – Ensure consistent tokenization throughout the corpus
  • Deduplication – Remove duplicate sentences to avoid bias
  • Cleaning – Remove HTML tags, special formatting, and non-text content

Extracting Features

After preparing a corpus, the next step is to extract features for model training.

Command

litsea extract -l <LANGUAGE> <CORPUS_FILE> <FEATURES_FILE>

Example

litsea extract -l japanese ./corpus.txt ./features.txt

Output:

Feature extraction completed successfully.

What Happens Internally

flowchart TD
    A["Read corpus line by line"] --> B["Split line into words"]
    B --> C["Build chars, types, and tags arrays"]
    C --> D["For each character position"]
    D --> E["Extract 38-42 features"]
    E --> F["Write label + features to file"]

  1. The Extractor reads each line from the corpus
  2. For each sentence, it creates a Segmenter context with character arrays, type arrays, and tag arrays
  3. For each character position (except the first), it extracts features and writes them with the correct label
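
The labelling step can be sketched independently of the feature templates: given a space-separated sentence, every character except the first receives +1 if it starts a word and -1 otherwise. The function name is hypothetical and the sketch omits the feature extraction itself.

```rust
/// Derives (character, label) pairs from a space-separated sentence.
/// +1 marks a character that starts a word, -1 a character inside a word.
fn boundary_labels(sentence: &str) -> Vec<(char, i8)> {
    let mut out = Vec::new();
    for word in sentence.split_whitespace() {
        for (i, c) in word.chars().enumerate() {
            out.push((c, if i == 0 { 1 } else { -1 }));
        }
    }
    // The first character of a sentence always starts a word, so it is
    // dropped: it carries no information for the classifier.
    if !out.is_empty() {
        out.remove(0);
    }
    out
}
```

For the corpus line "今日 は", this produces (日, -1) and (は, +1).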

Feature File Format

Each line represents one character position:

1	UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:は ...
-1	UP1:U UP2:U UP3:B BP1:UB BP2:BU UW1:B1 UW2:は UW3:テ ...
  • First column: label (1 = boundary, -1 = non-boundary)
  • Remaining columns: features (tab-separated)

POS Feature Extraction

For POS tagging models, use the --pos flag to extract features with POS labels instead of binary boundary labels.

Command

litsea extract --pos -l <LANGUAGE> <CORPUS_FILE> <FEATURES_FILE>

Example

litsea extract --pos -l japanese ./pos_corpus.txt ./features.txt

POS Labels

When extracting POS features, each character position is labeled with one of 18 segment labels instead of the binary 1/-1 labels:

  • B-NOUN, B-VERB, B-ADJ, B-ADP, B-ADV, B-AUX, B-CCONJ, B-DET, B-INTJ, B-NUM, B-PART, B-PRON, B-PROPN, B-PUNCT, B-SCONJ, B-SYM, B-X – Word boundary with the corresponding POS tag
  • O – Non-boundary (inside a word)

The feature template (character n-grams, type n-grams, etc.) is the same as for standard segmentation – only the label scheme differs.

POS Feature File Format

B-NOUN	UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:は ...
O	UP1:U UP2:U UP3:B BP1:UB BP2:BU UW1:B1 UW2:は UW3:テ ...
B-VERB	UP1:U UP2:U UP3:U BP1:UU BP2:UU UW1:B2 UW2:B1 UW3:い ...
  • First column: segment label (e.g., B-NOUN, O)
  • Remaining columns: features (tab-separated)

File Size Expectations

The features file will be significantly larger than the corpus because each character position generates 38-42 feature strings. For a 1 MB corpus, expect a features file of roughly 50-100 MB.

Training Models

Once features are extracted, train a model using AdaBoost.

Command

litsea train [OPTIONS] <FEATURES_FILE> <MODEL_FILE>

Basic Example

litsea train -t 0.005 -i 1000 ./features.txt ./models/japanese.model

Training Process

flowchart TD
    A["Initialize features<br/>(read feature names)"] --> B["Initialize instances<br/>(read labels + features)"]
    B --> C["AdaBoost training loop"]
    C --> D{"Converged or<br/>max iterations?"}
    D -->|No| C
    D -->|Yes| E["Save model"]
    E --> F["Output metrics"]

  1. Initialize features – Reads the features file to build the feature index
  2. Initialize instances – Reads again to load labeled instances and initial weights
  3. Training loop – Iteratively selects the best feature, updates model weights, and reweights instances
  4. Save model – Writes non-zero feature weights to the model file
  5. Output metrics – Prints accuracy, precision, recall, and confusion matrix
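
To illustrate what one pass of the training loop in step 3 involves, here is a textbook AdaBoost round over binary features. It is a generic sketch, not Litsea's trainer: the `Instance` type, the presence/absence stump, and the clamping constant are all assumptions.

```rust
/// One training instance: indices of its active binary features and a ±1 label.
struct Instance {
    features: Vec<usize>,
    label: f64,
}

/// Stump prediction: +1 if the feature is present, -1 otherwise.
fn stump(instance: &Instance, feature: usize) -> f64 {
    if instance.features.contains(&feature) { 1.0 } else { -1.0 }
}

/// Runs one boosting round and returns (chosen feature, alpha).
/// `weights` must sum to 1 and is re-normalised in place.
fn boost_round(data: &[Instance], weights: &mut [f64], num_features: usize) -> (usize, f64) {
    // 1. Pick the stump with the lowest weighted error.
    let mut best = (0, f64::INFINITY);
    for f in 0..num_features {
        let err: f64 = data
            .iter()
            .zip(weights.iter())
            .filter(|(x, _)| stump(x, f) != x.label)
            .map(|(_, w)| *w)
            .sum();
        if err < best.1 {
            best = (f, err);
        }
    }
    // 2. Weight of the chosen stump; the error is clamped to avoid ln(0).
    let err = best.1.clamp(1e-10, 1.0 - 1e-10);
    let alpha = 0.5 * ((1.0 - err) / err).ln();
    // 3. Up-weight mistakes, down-weight correct predictions, normalise.
    for (x, w) in data.iter().zip(weights.iter_mut()) {
        *w *= (-alpha * x.label * stump(x, best.0)).exp();
    }
    let total: f64 = weights.iter().sum();
    for w in weights.iter_mut() {
        *w /= total;
    }
    (best.0, alpha)
}
```

After a round, the misclassified instances hold a larger share of the total weight, so the next round favors features that correct them.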

Hyperparameters

| Parameter | Flag | Default | Guidance |
|-----------|------|---------|----------|
| Threshold | -t | 0.01 | Start with 0.005. Lower values allow more iterations but increase training time |
| Iterations | -i | 100 | Start with 1000. Increase if accuracy is still improving when training stops |

Interpreting Output

Result Metrics:
  Accuracy: 94.15% ( 564133 / 599198 )
  Precision: 95.57% ( 330454 / 345758 )
  Recall: 94.36% ( 330454 / 350215 )
  Confusion Matrix:
    True Positives: 330454
    False Positives: 15304
    False Negatives: 19761
    True Negatives: 233679

  • Accuracy – Percentage of correct predictions (both boundaries and non-boundaries)
  • Precision – Of predicted boundaries, what fraction is correct
  • Recall – Of actual boundaries, what fraction was found
  • True Positives – Correctly predicted boundaries
  • False Positives – Predicted boundary where there is none
  • False Negatives – Missed actual boundaries
  • True Negatives – Correctly predicted non-boundaries

Graceful Interruption

Press Ctrl+C once during training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.

POS Model Training

For training POS tagging models, use the --pos flag. POS models use the Averaged Perceptron algorithm (multiclass classifier) instead of AdaBoost (binary classifier).

POS Training Command

litsea train --pos --num-epochs 10 <FEATURES_FILE> <MODEL_FILE>

POS Training Example

litsea train --pos --num-epochs 10 ./features.txt ./models/japanese_pos.model

Averaged Perceptron vs AdaBoost

| Aspect | AdaBoost (Segmentation) | Averaged Perceptron (POS) |
|--------|-------------------------|---------------------------|
| Classification | Binary (boundary / non-boundary) | Multiclass (18 segment labels) |
| Labels | 1, -1 | B-NOUN, B-VERB, …, O |
| Hyperparameters | Threshold, Iterations | Number of epochs |
| Model size | ~1-22 KB | ~8.4-19 MB |
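
For readers unfamiliar with the multiclass side of this comparison, the sketch below shows a toy averaged perceptron. It is a generic illustration, not Litsea's implementation: the struct, its methods, the tie-breaking rule, and the naive averaging scheme are all assumptions.

```rust
use std::collections::HashMap;

/// Toy multiclass averaged perceptron (illustrative only).
struct AveragedPerceptron {
    classes: Vec<String>,
    weights: HashMap<String, Vec<f64>>, // feature -> one weight per class
    totals: HashMap<String, Vec<f64>>,  // sum of `weights` over all steps
}

impl AveragedPerceptron {
    fn new(classes: &[&str]) -> Self {
        Self {
            classes: classes.iter().map(|c| c.to_string()).collect(),
            weights: HashMap::new(),
            totals: HashMap::new(),
        }
    }

    /// Index of the best-scoring class; ties go to the earliest class.
    fn argmax(&self, table: &HashMap<String, Vec<f64>>, features: &[&str]) -> usize {
        let mut scores = vec![0.0_f64; self.classes.len()];
        for f in features {
            if let Some(w) = table.get(*f) {
                for (s, v) in scores.iter_mut().zip(w) {
                    *s += *v;
                }
            }
        }
        let mut best = 0;
        for i in 1..scores.len() {
            if scores[i] > scores[best] {
                best = i;
            }
        }
        best
    }

    /// One online step: predict, and on a mistake move weight from the
    /// predicted class to the gold class for every active feature.
    fn update(&mut self, features: &[&str], gold: &str) {
        let n = self.classes.len();
        let gold_idx = self.classes.iter().position(|c| c == gold).expect("unknown class");
        let pred_idx = self.argmax(&self.weights, features);
        if pred_idx != gold_idx {
            for f in features {
                let w = self.weights.entry(f.to_string()).or_insert_with(|| vec![0.0; n]);
                w[gold_idx] += 1.0;
                w[pred_idx] -= 1.0;
            }
        }
        // Naive averaging: add the current weights to the running totals.
        for (f, w) in &self.weights {
            let t = self.totals.entry(f.clone()).or_insert_with(|| vec![0.0; n]);
            for (ti, wi) in t.iter_mut().zip(w) {
                *ti += *wi;
            }
        }
    }

    /// Predicts with the summed weights; dividing by the step count would
    /// not change the argmax, so the division is skipped.
    fn predict(&self, features: &[&str]) -> &str {
        &self.classes[self.argmax(&self.totals, features)]
    }
}
```

Storing one weight per feature and class is also why a multiclass model with 18 labels is far larger than a binary AdaBoost model that keeps a single weight per selected feature.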

POS Hyperparameters

| Parameter | Flag | Default | Guidance |
|-----------|------|---------|----------|
| Epochs | --num-epochs | 10 | Number of passes over the training data. Start with 10 and adjust based on metrics |

POS Training Output

Result Metrics:
  Accuracy: 98.34%
  Macro Precision: 97.87%
  Macro Recall: 91.67%

  • Accuracy – Percentage of correct predictions across all classes
  • Macro Precision – Average precision across all POS classes
  • Macro Recall – Average recall across all POS classes

POS Graceful Interruption

Press Ctrl+C once during POS training to stop and save the model at its current state. Press Ctrl+C twice to exit immediately without saving.

Evaluating Models

Understanding model quality is essential for producing good segmentation results.

Metrics

The train command outputs three key metrics after training:

Accuracy

Accuracy = (TP + TN) / Total Instances

The percentage of all character positions that were correctly classified (both boundaries and non-boundaries). This is the broadest measure of model quality.

Precision

Precision = TP / (TP + FP)

Of the boundaries the model predicted, the fraction that were correct. High precision means few false boundaries, that is, little over-segmentation.

Recall

Recall = TP / (TP + FN)

Of the actual boundaries, the fraction that the model found. High recall means few missed boundaries, that is, little under-segmentation.

Confusion Matrix

| | Predicted Boundary (+1) | Predicted Non-boundary (-1) |
|---|---|---|
| Actual Boundary | True Positive (TP) | False Negative (FN) |
| Actual Non-boundary | False Positive (FP) | True Negative (TN) |
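
The three formulas can be checked against the confusion matrix printed by the train command. The `Confusion` struct below is a hypothetical helper written for this sketch; it returns fractions, so multiply by 100 to compare with the percentages in the output.

```rust
/// Counts from a binary confusion matrix (boundary = positive class).
struct Confusion {
    tp: u64,
    fp: u64,
    fn_: u64, // `fn` is a Rust keyword, hence the trailing underscore
    tn: u64,
}

impl Confusion {
    /// (TP + TN) / total instances
    fn accuracy(&self) -> f64 {
        (self.tp + self.tn) as f64 / (self.tp + self.fp + self.fn_ + self.tn) as f64
    }

    /// TP / (TP + FP); NaN if nothing was predicted as a boundary
    fn precision(&self) -> f64 {
        self.tp as f64 / (self.tp + self.fp) as f64
    }

    /// TP / (TP + FN); NaN if the data contains no boundaries
    fn recall(&self) -> f64 {
        self.tp as f64 / (self.tp + self.fn_) as f64
    }
}
```

With TP = 330454, FP = 15304, FN = 19761, and TN = 233679, these methods give roughly 0.9415, 0.9557, and 0.9436, matching the sample training output.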

Pre-trained Model Benchmarks

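| Model | Accuracy | Precision | Recall | Training Corpus |
|-------|----------|-----------|--------|-----------------|
| japanese.model | 94.15% | 95.57% | 94.36% | UD Japanese-GSD |
| korean.model | 85.08% | | | UD Korean-GSD |
| chinese.model | 80.72% | | | UD Chinese-GSD |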

Improving Model Quality

If accuracy is unsatisfactory, consider:

  1. More training data – A larger and more diverse corpus
  2. Lower threshold – Try -t 0.001 to allow more boosting iterations
  3. More iterations – Try -i 5000 or higher
  4. Better corpus quality – Ensure consistent tokenization and clean text
  5. Retraining – Start from an existing model and train with additional data (see Retraining Models)

Retraining Models

You can improve an existing model by resuming training with new data.

Command

litsea train -t 0.005 -i 1000 -m <EXISTING_MODEL> <NEW_FEATURES_FILE> <OUTPUT_MODEL>

Example

# Extract features from new corpus
litsea extract -l japanese ./new_corpus.txt ./new_features.txt

# Retrain from existing model
litsea train -t 0.005 -i 1000 \
    -m ./models/japanese.model \
    ./new_features.txt \
    ./models/japanese_v2.model

How It Works

flowchart LR
    A["Existing model<br/>(weights)"] --> C["Trainer"]
    B["New features"] --> C
    C --> D["Retrained model<br/>(updated weights)"]

  1. The trainer initializes features and instances from the new features file
  2. It loads the existing model weights via -m
  3. Training continues with the loaded weights as a starting point
  4. The new model inherits all learned patterns and refines them with new data

Use Cases

  • Domain adaptation – Fine-tune a general model on domain-specific text (e.g., medical, legal)
  • Incremental improvement – Add more training data without retraining from scratch
  • Error correction – Train on examples where the current model makes mistakes

Notes

  • The output model can be the same path as the input model (overwrites)
  • The -m flag accepts file paths, file://, http://, and https:// URIs
  • Retraining starts from the existing weights, so fewer iterations may be needed

Model File Format

Litsea models are stored as simple plain-text files.

Format Specification

<feature_name>\t<weight>
<feature_name>\t<weight>
...
<bias>

  • Each line (except the last) contains a feature name and its weight, separated by a tab character
  • Zero-weight features are omitted to keep the file compact
  • The last line contains the bias term as a single number

Example

BC1:IK	0.3456
BC2:KI	-0.1234
UW4:は	0.5678
UC4:I	0.2345
...
-0.0891

Bias Reconstruction

When loading a model, the bias is reconstructed using:

bias_bucket_weight = -bias_value * 2 - sum(all_feature_weights)

During prediction:

bias = -sum(all_model_weights) / 2.0
score = bias + sum(model[feature] for feature in input_attributes)
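
The two formulas are consistent with each other: substituting the bias bucket weight into the prediction formula recovers the bias value stored in the file. The sketch below parses the text format and scores a set of active features; `parse_model` and `score` are hypothetical names, and the code is a restatement of the pseudocode rather than Litsea's loader.

```rust
use std::collections::BTreeMap;

/// Parses the model text format into (feature weights, bias value).
fn parse_model(text: &str) -> Option<(BTreeMap<String, f64>, f64)> {
    let mut lines: Vec<&str> = text.lines().filter(|l| !l.trim().is_empty()).collect();
    // The last non-empty line holds the bias term.
    let bias_value: f64 = lines.pop()?.trim().parse().ok()?;
    let mut weights = BTreeMap::new();
    for line in lines {
        let (name, weight) = line.split_once('\t')?;
        let weight: f64 = weight.trim().parse().ok()?;
        weights.insert(name.to_string(), weight);
    }
    Some((weights, bias_value))
}

/// Scores a set of active features using the two formulas above.
fn score(weights: &BTreeMap<String, f64>, bias_value: f64, active: &[&str]) -> f64 {
    let feature_sum: f64 = weights.values().sum();
    // Weight stored in the bias bucket when the model is loaded.
    let bias_bucket_weight = -bias_value * 2.0 - feature_sum;
    // Bias recovered at prediction time from the sum of all model weights.
    let bias = -(feature_sum + bias_bucket_weight) / 2.0;
    // Features that are not in the model contribute nothing.
    bias + active.iter().filter_map(|f| weights.get(*f)).sum::<f64>()
}
```

In the usual convention for such a classifier, a positive score would be read as a predicted boundary and a negative one as a non-boundary.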

File Size

Model files are very compact:

| Model | Size | Source |
|-------|------|--------|
| japanese.model | ~2.9 KB | Wikipedia-trained |
| korean.model | ~1.8 KB | Wikipedia-trained |
| chinese.model | ~1.3 KB | Wikipedia-trained |
| RWCP.model | ~22 KB | Original TinySegmenter |
| JEITA_Genpaku_ChaSen_IPAdic.model | ~17 KB | JEITA corpus |

The compact size is a key advantage of Litsea – models can be embedded directly in applications or served over HTTP with minimal overhead.

Compatibility

  • Model files are encoding-agnostic (feature names are stored as-is)
  • The format is deterministic (features are sorted via BTreeMap)
  • Models are forward-compatible – new features in the input that are not in the model are simply ignored during prediction

Remote Model Loading

Litsea supports loading models from HTTP/HTTPS URLs in addition to local files.

Supported URI Schemes

| Scheme | Example | Description |
|--------|---------|-------------|
| (none) | ./model.model | Local file path (default) |
| file:// | file:///path/to/model | Explicit file URI |
| http:// | http://example.com/model | HTTP URL |
| https:// | https://example.com/model | HTTPS URL |
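
A caller that wants to decide up front whether a model URI needs network access can classify it by prefix. The enum and function below are hypothetical and are not part of Litsea's API; they only mirror the four rows of the table.

```rust
/// Where a model URI points, following the scheme table above.
#[derive(Debug, PartialEq)]
enum ModelSource<'a> {
    LocalPath(&'a str), // no scheme: treated as a file system path
    FileUri(&'a str),   // file:// with the scheme stripped
    Remote(&'a str),    // http:// or https://, kept as a full URL
}

/// Classifies a model URI by its scheme prefix.
fn classify_model_uri(uri: &str) -> ModelSource<'_> {
    if uri.starts_with("http://") || uri.starts_with("https://") {
        ModelSource::Remote(uri)
    } else if let Some(path) = uri.strip_prefix("file://") {
        ModelSource::FileUri(path)
    } else {
        ModelSource::LocalPath(uri)
    }
}
```

Only the `Remote` case needs an async runtime, so a synchronous tool could use a check like this to reject remote URIs early.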

CLI Usage

echo "テスト" | litsea segment -l japanese https://example.com/japanese.model

Library Usage

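use std::path::Path;

use litsea::adaboost::AdaBoost; // module path assumed; adjust to the crate layout

// load_model is async and both calls use `?`, so they are placed in an
// async function that returns a Result.
async fn load_models() -> Result<(), Box<dyn std::error::Error>> {
    let mut learner = AdaBoost::new(0.01, 100);

    // Local file (synchronous)
    learner.load_model_from_path(Path::new("./models/japanese.model"))?;

    // HTTP/HTTPS URL (asynchronous)
    learner.load_model("https://example.com/models/japanese.model").await?;

    Ok(())
}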

Implementation Details

  • HTTP client: reqwest with rustls (no OpenSSL dependency)
  • Custom User-Agent: Litsea/<version>
  • The load_model method is async because HTTP loading requires an async runtime
  • For the CLI, tokio provides the async runtime

WASM Considerations

On wasm32 targets:

  • Local file paths are not supported – file system access is unavailable
  • file:// scheme is not supported
  • HTTP/HTTPS loading works via the browser’s fetch API (through reqwest’s WASM support)

Error messages guide users to use URLs instead of file paths when running in WASM.

Benchmarking

Litsea includes a Criterion benchmark suite for measuring performance.

Running Benchmarks

cargo bench --bench bench

Or via the Makefile:

make bench

Benchmark Suite

The benchmarks are defined in litsea/benches/bench.rs:

| Benchmark | Description |
|-----------|-------------|
| segment_short/adaboost/{ja,zh,ko} | Segment a short sentence (AdaBoost) |
| segment_short/averaged_perceptron/{ja,zh,ko} | Segment + POS tag a short sentence |
| segment_long_japanese/{adaboost,averaged_perceptron} | Process the full Bocchan novel (~300 KB) |
| get_type_hiragana | Character type classification |
| add_corpus | Corpus ingestion for training |
| predict_adaboost | Single AdaBoost prediction |

Models are loaded synchronously with load_model_from_path, so no async runtime is involved in the benchmarks.

HTML Reports

Criterion generates detailed HTML reports with statistics and comparison graphs at:

target/criterion/report/index.html

Open this file in a browser after running benchmarks to view:

  • Iteration times with confidence intervals
  • Throughput measurements
  • Comparison with previous runs (automatic regression detection)

Interpreting Results

Key performance factors:

  • Segmentation is linear in input length (O(n))
  • Character classification is a direct match on character ranges (a few nanoseconds; no setup cost)
  • Prediction at each position depends on the number of features (38-42, constant)
  • Model loading time is proportional to the model file size

Pre-trained Models

Litsea ships with several pre-trained models in the models/ directory.

Model Catalog

japanese.model

| Property | Value |
|----------|-------|
| Language | Japanese |
| Training Corpus | UD Japanese-GSD |
| Accuracy | 94.15% |
| Precision | 95.57% |
| Recall | 94.36% |
| File Size | ~2.9 KB |

korean.model

| Property | Value |
|----------|-------|
| Language | Korean |
| Training Corpus | UD Korean-GSD |
| Accuracy | 85.08% |
| File Size | ~1.8 KB |

chinese.model

| Property | Value |
|----------|-------|
| Language | Chinese (Simplified & Traditional) |
| Training Corpus | UD Chinese-GSD |
| Accuracy | 80.72% |
| File Size | ~1.3 KB |

RWCP.model

| Property | Value |
|----------|-------|
| Language | Japanese |
| Source | Extracted from the original TinySegmenter |
| License | BSD 3-Clause (Taku Kudo) |
| File Size | ~22 KB |

JEITA_Genpaku_ChaSen_IPAdic.model

| Property | Value |
|----------|-------|
| Language | Japanese |
| Training Corpus | JEITA Project Sugita Genpaku corpus |
| Tokenizer | ChaSen with IPAdic |
| File Size | ~17 KB |

POS Tagging Models

japanese_pos.model

| Property | Value |
|----------|-------|
| Language | Japanese |
| Algorithm | Averaged Perceptron |
| Training Corpus | UD Japanese-GSD (7,050 sentences) |
| Epochs | 10 |
| Accuracy | 98.34% |
| Macro Precision | 97.87% |
| Macro Recall | 91.67% |
| File Size | ~11 MB |

chinese_pos.model

| Property | Value |
|----------|-------|
| Language | Chinese (Simplified & Traditional) |
| Algorithm | Averaged Perceptron |
| Training Corpus | UD Chinese-GSD (3,997 sentences) |
| Epochs | 10 |
| Accuracy | 97.09% |
| Macro Precision | 97.31% |
| Macro Recall | 96.23% |
| File Size | ~19 MB |

korean_pos.model

| Property | Value |
|----------|-------|
| Language | Korean |
| Algorithm | Averaged Perceptron |
| Training Corpus | UD Korean-GSD (4,400 sentences) |
| Epochs | 10 |
| Accuracy | 95.33% |
| Macro Precision | 95.30% |
| Macro Recall | 87.69% |
| File Size | ~8.4 MB |

Usage

echo "これはテストです。" | litsea segment --pos -l japanese models/japanese_pos.model

Output:

これ/PRON は/ADP テスト/NOUN です/AUX 。/PUNCT

Choosing a Model

  • For Japanese, use japanese.model for the best accuracy, or RWCP.model for compatibility with the original TinySegmenter
  • For Chinese, use chinese.model
  • For Korean, use korean.model
  • For POS tagging, use the corresponding *_pos.model (japanese_pos.model, chinese_pos.model, korean_pos.model) for joint word segmentation and POS tagging
  • For domain-specific needs, consider training your own model or retraining an existing one

Sample Data

The resources/ directory contains sample data:

  • bocchan.txt – Sample Japanese corpus from the novel “Botchan” by Natsume Soseki (~307 KB). Used for benchmarking.

License

Litsea is distributed under a dual license.

MIT License

The main Litsea codebase is licensed under the MIT License:

MIT License

Copyright (c) 2025 Minoru OSUKA
Copyright (c) 2022 ICHINOSE Shogo

BSD 3-Clause License

Code originally developed by Taku Kudo (TinySegmenter) is licensed under the BSD 3-Clause License:

Copyright (c) 2008, Taku Kudo
All rights reserved.

Full License Text

The complete license text is available in the LICENSE file in the repository.