Text Analysis

Text analysis is the process of converting raw text into searchable tokens. When a document is indexed, the analyzer breaks text fields into individual terms; when a query is executed, the same analyzer processes the query text so that query terms take the same form as the indexed terms and can match them.

The Analysis Pipeline

graph LR
    Input["Raw Text\n'The quick brown FOX jumps!'"]
    CF["UnicodeNormalizationCharFilter"]
    T["Tokenizer\nSplit into words"]
    F1["LowercaseFilter"]
    F2["StopFilter"]
    F3["StemFilter"]
    Output["Terms\n'quick', 'brown', 'fox', 'jump'"]

    Input --> CF --> T --> F1 --> F2 --> F3 --> Output

The analysis pipeline consists of:

Char Filters — normalize raw text at the character level before tokenization
Tokenizer — splits text into raw tokens (words, characters, n-grams)
Token Filters — transform, remove, or expand tokens (lowercase, stop words, stemming, synonyms)
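The three stages can be illustrated with a small standard-library sketch. It does not use the laurus types: the function names are hypothetical, the stop word list is a three-word sample, and the trailing-"s" rule is a toy stand-in for a real stemmer.

```rust
use std::collections::HashSet;

/// Char filter stage: character-level cleanup before tokenization.
/// Here it only replaces non-alphanumeric characters with spaces.
fn char_filter(input: &str) -> String {
    input
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect()
}

/// Tokenizer stage: split the cleaned text on whitespace.
fn tokenize(input: &str) -> Vec<String> {
    input.split_whitespace().map(str::to_string).collect()
}

/// Token filter stages: lowercase, drop stop words, strip a trailing "s".
/// The suffix rule is a toy assumption, not the Porter algorithm.
fn token_filters(tokens: Vec<String>, stop_words: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !stop_words.contains(t.as_str()))
        .map(|t| {
            if t.len() > 3 && t.ends_with('s') {
                t[..t.len() - 1].to_string()
            } else {
                t
            }
        })
        .collect()
}

/// Run the stages in pipeline order.
fn analyze(input: &str) -> Vec<String> {
    let stop_words: HashSet<&str> = ["the", "a", "is"].iter().copied().collect();
    token_filters(tokenize(&char_filter(input)), &stop_words)
}

fn main() {
    // Prints ["quick", "brown", "fox", "jump"]
    println!("{:?}", analyze("The quick brown FOX jumps!"));
}
```

The order matters: lowercasing runs before stop word removal so that "The" matches the lowercase entry "the".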

The Analyzer Trait

All analyzers implement the Analyzer trait:

pub trait Analyzer: Send + Sync + Debug {
    fn analyze(&self, text: &str) -> Result<TokenStream>;
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}

TokenStream is a Box<dyn Iterator<Item = Token> + Send> — a lazy iterator over tokens.

A Token contains:

Field	Type	Description
`text`	`String`	The token text
`position`	`usize`	Position in the original text
`start_offset`	`usize`	Start byte offset in original text
`end_offset`	`usize`	End byte offset in original text
`position_increment`	`usize`	Distance from previous token
`position_length`	`usize`	Span of the token (>1 for synonyms)
`boost`	`f32`	Token-level scoring weight
`stopped`	`bool`	Whether marked as a stop word
`metadata`	`Option<TokenMetadata>`	Additional token metadata
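To make the offset fields concrete, here is a minimal sketch with a reduced, hypothetical `Token` struct (a subset of the fields above, not the laurus type). It assumes `position` is the zero-based ordinal of the token in the stream and that offsets are byte offsets, which matters for multi-byte text.

```rust
/// Hypothetical, reduced token type with a subset of the fields listed above.
#[derive(Debug, PartialEq)]
struct Token {
    text: String,
    position: usize,
    start_offset: usize,
    end_offset: usize,
    position_increment: usize,
}

/// Whitespace tokenizer that records byte offsets into the original text.
fn tokenize(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    // Append a sentinel space so the last token is flushed too.
    for (i, c) in text.char_indices().chain(std::iter::once((text.len(), ' '))) {
        match (c.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(s)) => {
                tokens.push(Token {
                    text: text[s..i].to_string(),
                    position: tokens.len(),
                    start_offset: s,
                    end_offset: i,
                    position_increment: 1,
                });
                start = None;
            }
            _ => {}
        }
    }
    tokens
}

fn main() {
    // "東京" occupies bytes 0..6 (two 3-byte characters); "tower" occupies 7..12.
    for token in tokenize("東京 tower") {
        println!("{:?}", token);
    }
}
```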

Built-in Analyzers

StandardAnalyzer

The default analyzer. Suitable for most Western languages.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

use laurus::analysis::analyzer::standard::StandardAnalyzer;

let analyzer = StandardAnalyzer::default();
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering)

JapaneseAnalyzer

Uses morphological analysis for Japanese text segmentation.

Pipeline: UnicodeNormalizationCharFilter (NFKC) → JapaneseIterationMarkCharFilter → LinderaTokenizer → LowercaseFilter → StopFilter (Japanese stop words)

JapaneseAnalyzer::new takes the same arguments as LinderaTokenizer::new: the segmentation mode, a path to a Lindera dictionary directory, and an optional user dictionary path. laurus does not enable Lindera’s embed-* features by default, so a real filesystem path (typically an IPADIC build) is required at runtime.

use laurus::analysis::analyzer::language::japanese::JapaneseAnalyzer;

// Pass the path where you have unpacked the Lindera dictionary.
let analyzer = JapaneseAnalyzer::new(
    "normal",
    "/var/lib/lindera/ipadic",
    None,
)?;
// "東京都に住んでいる" → ["東京", "都", "住ん", "いる"]

When the analyzer is referenced from a Schema, supply the parameters through the structured AnalyzerSpec form (see PerFieldAnalyzer below).

KeywordAnalyzer

Treats the entire input as a single token. No tokenization or normalization.

use laurus::analysis::analyzer::keyword::KeywordAnalyzer;

let analyzer = KeywordAnalyzer::new();
// "Hello World" → ["Hello World"]

Use this for fields that should match exactly (categories, tags, status codes).

SimpleAnalyzer

Tokenizes text without any filtering. The original case and all tokens are preserved. Useful when you need complete control over the analysis pipeline or want to test a tokenizer in isolation.

Pipeline: User-specified Tokenizer only (no char filters, no token filters)

use laurus::analysis::analyzer::simple::SimpleAnalyzer;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use std::sync::Arc;

let tokenizer = Arc::new(RegexTokenizer::new()?);
let analyzer = SimpleAnalyzer::new(tokenizer);
// "Hello World" → ["Hello", "World"]
// (no lowercasing, no stop word removal)

Use this for testing tokenizers, or when you want to apply token filters manually in a separate step.

EnglishAnalyzer

An English-specific analyzer. Tokenizes, lowercases, and removes common English stop words.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

use laurus::analysis::analyzer::language::english::EnglishAnalyzer;

let analyzer = EnglishAnalyzer::new()?;
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering, remaining tokens are lowercased)

PipelineAnalyzer

Build a custom pipeline by combining any char filters, a tokenizer, and any sequence of token filters:

use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;
use laurus::analysis::token_filter::stop::StopFilter;
use laurus::analysis::token_filter::stem::StemFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC)))
    .add_filter(Arc::new(LowercaseFilter::new()))
    .add_filter(Arc::new(StopFilter::new()))
    .add_filter(Arc::new(StemFilter::new()));  // Porter stemmer

PerFieldAnalyzer

PerFieldAnalyzer lets you assign different analyzers to different fields within the same engine:

graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KW["KeywordAnalyzer"]
    PFA -->|"body"| STD["StandardAnalyzer"]
    PFA -->|"description_ja"| JP["JapaneseAnalyzer"]
    PFA -->|other fields| DEF["Default\n(StandardAnalyzer)"]

use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;

// Default analyzer for fields not explicitly configured
let per_field = PerFieldAnalyzer::new(
    Arc::new(StandardAnalyzer::default())
);

// Use KeywordAnalyzer for exact-match fields
per_field.add_analyzer("category", Arc::new(KeywordAnalyzer::new()));
per_field.add_analyzer("status", Arc::new(KeywordAnalyzer::new()));

let engine = Engine::builder(storage, schema)
    .analyzer(Arc::new(per_field))
    .build()
    .await?;

Note: The _id field is always analyzed with KeywordAnalyzer regardless of configuration.
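The dispatch idea behind per-field analysis can be sketched with the standard library alone. Everything here is hypothetical (`PerField`, `AnalyzeFn`, `standard`, `keyword`) and only models the lookup-with-fallback behaviour, not how laurus implements it.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for an analyzer: any function from text to terms.
type AnalyzeFn = Arc<dyn Fn(&str) -> Vec<String> + Send + Sync>;

struct PerField {
    default: AnalyzeFn,
    by_field: HashMap<String, AnalyzeFn>,
}

impl PerField {
    fn new(default: AnalyzeFn) -> Self {
        Self { default, by_field: HashMap::new() }
    }

    fn add(&mut self, field: &str, analyzer: AnalyzeFn) {
        self.by_field.insert(field.to_string(), analyzer);
    }

    /// Use the field's analyzer if one is registered, else the default.
    fn analyze(&self, field: &str, text: &str) -> Vec<String> {
        let analyzer = self.by_field.get(field).unwrap_or(&self.default);
        analyzer(text)
    }
}

/// Toy "standard" behaviour: whitespace split plus lowercasing.
fn standard() -> AnalyzeFn {
    Arc::new(|text: &str| {
        text.split_whitespace()
            .map(|t| t.to_lowercase())
            .collect::<Vec<String>>()
    })
}

/// Toy "keyword" behaviour: the whole input as one term.
fn keyword() -> AnalyzeFn {
    Arc::new(|text: &str| vec![text.to_string()])
}

fn main() {
    let mut per_field = PerField::new(standard());
    per_field.add("category", keyword());
    println!("{:?}", per_field.analyze("category", "Hello World")); // one term
    println!("{:?}", per_field.analyze("body", "Hello World")); // falls back to default
}
```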

Configuring per-field analyzers from a Schema

Most callers configure analyzers declaratively on the schema rather than wiring them up by hand. The analyzer setting on a text field accepts two shapes:

// 1. A bare name for a parameter-less built-in or a user-registered analyzer.
{ "analyzer": "standard" }
{ "analyzer": "english" }
{ "analyzer": "my_custom_pipeline" }

// 2. A structured object for a parameterised built-in preset. Today only
//    the Japanese preset uses this form (it requires a Lindera dictionary
//    path).
{
  "analyzer": {
    "language": "japanese",
    "mode": "normal",
    "dict": "/var/lib/lindera/ipadic"
  }
}

The bare string "japanese" is rejected because the preset cannot be constructed without a dictionary. Schemas that previously stored "analyzer": "japanese" must migrate to the structured form above.

For full pipelines that do not fit a preset, register the pipeline under schema.analyzers as an AnalyzerDefinition and reference it by name.
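As a rough illustration of the two shapes and the rejection rule, the sketch below models them as a Rust enum. The enum, its variants, and the `check` function are assumptions made for this example; they are not the actual laurus `AnalyzerSpec` definition or its validation code.

```rust
/// Hypothetical in-memory form of the two shapes described above.
#[derive(Debug, PartialEq)]
enum AnalyzerSpec {
    /// Bare name: a parameter-less built-in or a user-registered analyzer.
    Name(String),
    /// Structured Japanese preset with its required dictionary path.
    Japanese { mode: String, dict: String },
}

/// Reject the bare name "japanese", which carries no dictionary path.
fn check(spec: &AnalyzerSpec) -> Result<(), String> {
    match spec {
        AnalyzerSpec::Name(name) if name == "japanese" => Err(
            "\"japanese\" needs the structured form with a dictionary path".to_string(),
        ),
        _ => Ok(()),
    }
}

fn main() {
    let bare = AnalyzerSpec::Name("japanese".to_string());
    let structured = AnalyzerSpec::Japanese {
        mode: "normal".to_string(),
        dict: "/var/lib/lindera/ipadic".to_string(),
    };
    println!("{:?} -> {:?}", bare, check(&bare));
    println!("{:?} -> {:?}", structured, check(&structured));
}
```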

Char Filters

Char filters operate on the raw input text before it reaches the tokenizer. They perform character-level normalization such as Unicode normalization, character mapping, and pattern-based replacement. This ensures that the tokenizer receives clean, normalized text.

All char filters implement the CharFilter trait:

pub trait CharFilter: Send + Sync {
    fn filter(&self, input: &str) -> (String, Vec<Transformation>);
    fn name(&self) -> &'static str;
}

The Transformation records describe how character positions shifted, allowing the engine to map token positions back to the original text.
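Why such a record is needed can be shown with a simplified sketch. Instead of laurus's `Transformation` type, whose layout is not described here, this example keeps a plain per-byte origin table; the function names are hypothetical.

```rust
/// Remove every '-' and return the filtered text plus, for each byte of the
/// output, the byte offset it came from in the input.
fn remove_hyphens(input: &str) -> (String, Vec<usize>) {
    let mut output = String::new();
    let mut origin = Vec::new();
    for (i, c) in input.char_indices() {
        if c != '-' {
            output.push(c);
            // One entry per output byte, so multi-byte characters stay aligned.
            origin.extend(i..i + c.len_utf8());
        }
    }
    (output, origin)
}

/// Map a non-empty byte range in the filtered text back to the original text.
fn original_span(origin: &[usize], start: usize, end: usize) -> (usize, usize) {
    (origin[start], origin[end - 1] + 1)
}

fn main() {
    let input = "123-456";
    let (filtered, origin) = remove_hyphens(input);
    // "456" sits at bytes 3..6 of the filtered text but at 4..7 of the input.
    let (start, end) = original_span(&origin, 3, 6);
    println!("{} -> {:?} = {:?}", filtered, (start, end), &input[start..end]);
}
```

Without the mapping, a highlighter that used the filtered offsets 3..6 on the original string would select "-45" instead of "456".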

Char Filter	Description
`UnicodeNormalizationCharFilter`	Unicode normalization (NFC, NFD, NFKC, NFKD)
`MappingCharFilter`	Replaces character sequences based on a mapping dictionary
`PatternReplaceCharFilter`	Replaces characters matching a regex pattern
`JapaneseIterationMarkCharFilter`	Expands Japanese iteration marks (踊り字) to their base characters

UnicodeNormalizationCharFilter

Applies Unicode normalization to the input text. NFKC is recommended for search use cases because it normalizes both compatibility characters and composed forms.

use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};

let filter = UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC);
// "Ｓｏｎｙ" (fullwidth) → "Sony" (halfwidth)
// "㌂" → "アンペア"

Form	Description
NFC	Canonical decomposition followed by canonical composition
NFD	Canonical decomposition
NFKC	Compatibility decomposition followed by canonical composition
NFKD	Compatibility decomposition
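Full normalization needs Unicode data tables, which the Rust standard library does not ship. The sketch below therefore covers only one compatibility mapping that NFKC performs, folding fullwidth ASCII variants to their ASCII counterparts; `fold_fullwidth` is a hypothetical helper, not a replacement for the filter.

```rust
/// Fold fullwidth ASCII variants (U+FF01..=U+FF5E) and the ideographic space
/// (U+3000) to ASCII. This is a small subset of what NFKC does; every other
/// character passes through unchanged.
fn fold_fullwidth(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // Fullwidth forms sit exactly 0xFEE0 above their ASCII originals.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            '\u{3000}' => ' ',
            _ => c,
        })
        .collect()
}

fn main() {
    // Fullwidth "Ｓｏｎｙ" written with escapes to keep the source unambiguous.
    println!("{}", fold_fullwidth("\u{FF33}\u{FF4F}\u{FF4E}\u{FF59}")); // Sony
}
```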

MappingCharFilter

Replaces character sequences using a dictionary. Matches are found using the Aho-Corasick algorithm (leftmost-longest match).

use std::collections::HashMap;
use laurus::analysis::char_filter::mapping::MappingCharFilter;

let mut mapping = HashMap::new();
mapping.insert("ph".to_string(), "f".to_string());
mapping.insert("qu".to_string(), "k".to_string());

let filter = MappingCharFilter::new(mapping)?;
// "phone queue" → "fone keue"
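The leftmost-longest rule itself can be sketched without an automaton. The version below is a naive scan, not Aho-Corasick, so it is slower on large dictionaries, but it should pick the same matches; `apply_mapping` is a hypothetical helper.

```rust
/// Replace mapped sequences, scanning left to right and preferring the
/// longest key at each position. A naive scan, not Aho-Corasick.
fn apply_mapping(input: &str, mapping: &[(&str, &str)]) -> String {
    let mut output = String::new();
    let mut rest = input;
    while let Some(c) = rest.chars().next() {
        // Longest key that matches at the current position, if any.
        let best = mapping
            .iter()
            .filter(|(from, _)| !from.is_empty() && rest.starts_with(*from))
            .max_by_key(|(from, _)| from.len());
        match best {
            Some((from, to)) => {
                output.push_str(to);
                rest = &rest[from.len()..];
            }
            None => {
                // No key matches here: copy one character and move on.
                output.push(c);
                rest = &rest[c.len_utf8()..];
            }
        }
    }
    output
}

fn main() {
    let mapping = [("ph", "f"), ("qu", "k")];
    println!("{}", apply_mapping("phone queue", &mapping)); // fone keue
}
```

"Longest" matters when keys overlap: with the keys "a" and "ab", the input "abc" is rewritten through "ab", not "a".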

PatternReplaceCharFilter

Replaces all occurrences of a regex pattern with a fixed string.

use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;

// Remove hyphens
let filter = PatternReplaceCharFilter::new(r"-", "")?;
// "123-456-789" → "123456789"

// Normalize numbers
let filter = PatternReplaceCharFilter::new(r"\d+", "NUM")?;
// "Year 2024" → "Year NUM"

JapaneseIterationMarkCharFilter

Expands Japanese iteration marks (踊り字) to their base characters. Supports kanji (々), hiragana (ゝ, ゞ), and katakana (ヽ, ヾ) iteration marks.

use laurus::analysis::char_filter::japanese_iteration_mark::JapaneseIterationMarkCharFilter;

let filter = JapaneseIterationMarkCharFilter::new(
    true,  // normalize kanji iteration marks
    true,  // normalize kana iteration marks
);
// "佐々木" → "佐佐木"
// "いすゞ" → "いすず"
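A reduced sketch of the expansion rule follows. It is an assumption-laden simplification: it handles only 々, ゝ, and ゞ, repeats a single preceding character, and derives the voiced form from the fact that each listed hiragana is immediately followed by its voiced counterpart in Unicode. Katakana marks and multi-character repeats are left out.

```rust
/// Expand 々, ゝ and ゞ by looking at the previous output character.
fn expand_iteration_marks(input: &str) -> String {
    // Hiragana whose voiced form is the next code point (for example す → ず).
    const VOICEABLE: &str = "かきくけこさしすせそたちつてとはひふへほ";
    let mut output = String::new();
    for c in input.chars() {
        let prev = output.chars().last();
        let expanded = match (c, prev) {
            // Plain repetition of the previous character.
            ('々', Some(p)) | ('ゝ', Some(p)) => p,
            // Voiced repetition: previous character with a dakuten.
            ('ゞ', Some(p)) if VOICEABLE.contains(p) => {
                char::from_u32(p as u32 + 1).unwrap_or(p)
            }
            // Anything else, including a mark with nothing before it.
            _ => c,
        };
        output.push(expanded);
    }
    output
}

fn main() {
    println!("{}", expand_iteration_marks("佐々木")); // 佐佐木
    println!("{}", expand_iteration_marks("いすゞ")); // いすず
}
```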

Using Char Filters in a Pipeline

Add char filters to a PipelineAnalyzer with add_char_filter(). Multiple char filters are applied in the order they are added, all before the tokenizer runs.

use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(
        UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC),
    ))
    .add_char_filter(Arc::new(
        PatternReplaceCharFilter::new(r"-", "")?,
    ))
    .add_filter(Arc::new(LowercaseFilter::new()));
// "Ｔｏｋｙｏ-2024" → NFKC → "Tokyo-2024" → remove hyphens → "Tokyo2024" → tokenize → lowercase → ["tokyo2024"]

Tokenizers

Tokenizer	Description
`RegexTokenizer`	Unicode word boundaries; splits on whitespace and punctuation
`UnicodeWordTokenizer`	Splits on Unicode word boundaries
`WhitespaceTokenizer`	Splits on whitespace only
`WholeTokenizer`	Returns the entire input as a single token
`LinderaTokenizer`	Japanese morphological analysis (Lindera/MeCab)
`NgramTokenizer`	Generates n-gram tokens of configurable size
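For readers unfamiliar with n-grams, here is a minimal character n-gram sketch. It assumes a single fixed size `n` and works on Unicode scalar values; how `NgramTokenizer` is configured (for example, whether it takes a minimum and maximum size) is not specified here, and the `ngrams` function is hypothetical.

```rust
/// Character n-grams of exactly `n` characters, in order of appearance.
fn ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // `windows` panics on 0, and a too-short input yields no n-grams.
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    println!("{:?}", ngrams("tokyo", 2)); // ["to", "ok", "ky", "yo"]
    println!("{:?}", ngrams("東京都", 2)); // ["東京", "京都"]
}
```

Because it needs no dictionary, this kind of tokenization is a common fallback for text without whitespace between words.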

Token Filters

Filter	Description
`LowercaseFilter`	Converts tokens to lowercase
`StopFilter`	Removes common words (“the”, “is”, “a”)
`StemFilter`	Reduces words to their root form (“running” → “run”)
`SynonymGraphFilter`	Expands tokens with synonyms from a dictionary
`BoostFilter`	Adjusts token boost values
`LimitFilter`	Limits the number of tokens
`StripFilter`	Strips leading/trailing whitespace from tokens
`FlattenGraphFilter`	Flattens token graphs (for synonym expansion)
`RemoveEmptyFilter`	Removes empty tokens

Synonym Expansion

The SynonymGraphFilter expands terms using a synonym dictionary:

use laurus::analysis::synonym::dictionary::SynonymDictionary;
use laurus::analysis::token_filter::synonym_graph::SynonymGraphFilter;

let mut dict = SynonymDictionary::new(None)?;
dict.add_synonym_group(vec!["ml".into(), "machine learning".into()]);
dict.add_synonym_group(vec!["ai".into(), "artificial intelligence".into()]);

// keep_original=true means original token is preserved alongside synonyms
let filter = SynonymGraphFilter::new(dict, true)
    .with_boost(0.8);  // synonyms get 80% weight

The boost parameter controls how much weight synonyms receive relative to original tokens. A value of 0.8 means synonym matches contribute 80% as much to the score as exact matches.
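The interplay of `keep_original`, boost, and token positions can be sketched as follows. This is a simplified model with hypothetical names (`Token`, `expand`), and it handles single-word synonyms only; multi-word synonyms such as "machine learning" are what require a token graph and are out of scope here. It assumes a synonym is stacked on the original's position by giving it a position increment of 0.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced token: text, distance from the previous token, weight.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    position_increment: usize,
    boost: f32,
}

/// Emit each term, followed by its synonyms at the same position.
fn expand(
    terms: &[&str],
    synonyms: &HashMap<&str, Vec<&str>>,
    keep_original: bool,
    boost: f32,
) -> Vec<Token> {
    let mut out = Vec::new();
    for term in terms {
        let alternatives: &[&str] = synonyms.get(term).map(Vec::as_slice).unwrap_or(&[]);
        // The first token emitted for a term advances the position by one;
        // every further token for the same term stays on that position.
        let mut increment = 1;
        if keep_original || alternatives.is_empty() {
            out.push(Token { text: term.to_string(), position_increment: increment, boost: 1.0 });
            increment = 0;
        }
        for alt in alternatives {
            out.push(Token { text: alt.to_string(), position_increment: increment, boost });
            increment = 0;
        }
    }
    out
}

fn main() {
    let mut synonyms = HashMap::new();
    synonyms.insert("fast", vec!["quick"]);
    for token in expand(&["fast", "car"], &synonyms, true, 0.8) {
        println!("{:?}", token);
    }
}
```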
