Text Analysis

Text analysis is the process of converting raw text into searchable tokens. When a document is indexed, the analyzer breaks text fields into individual terms; when a query is executed, the same analyzer processes the query text to ensure consistency.

The Analysis Pipeline

graph LR
    Input["Raw Text\n'The quick brown FOX jumps!'"]
    CF["UnicodeNormalizationCharFilter"]
    T["Tokenizer\nSplit into words"]
    F1["LowercaseFilter"]
    F2["StopFilter"]
    F3["StemFilter"]
    Output["Terms\n'quick', 'brown', 'fox', 'jump'"]

    Input --> CF --> T --> F1 --> F2 --> F3 --> Output

The analysis pipeline consists of:

  1. Char Filters — normalize raw text at the character level before tokenization
  2. Tokenizer — splits text into raw tokens (words, characters, n-grams)
  3. Token Filters — transform, remove, or expand tokens (lowercase, stop words, stemming, synonyms)
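
The three stages can be sketched as plain functions, independent of laurus's actual types. This is a conceptual illustration only: the char filter, tokenizer, and filters here are stand-ins, and the StemFilter stage is omitted (so "jumps" is not reduced to "jump" as in the diagram above).

```rust
// Stand-in for a char filter: character-level cleanup before tokenization.
fn char_filter(input: &str) -> String {
    input.replace('!', "")
}

// Stand-in for a tokenizer: split into raw tokens.
fn tokenize(input: &str) -> Vec<String> {
    input.split_whitespace().map(|s| s.to_string()).collect()
}

// Stand-in for a token filter chain: lowercase, then drop stop words.
fn token_filters(tokens: Vec<String>) -> Vec<String> {
    let stop_words = ["the", "a", "is"];
    tokens
        .into_iter()
        .map(|t| t.to_lowercase())                     // LowercaseFilter
        .filter(|t| !stop_words.contains(&t.as_str())) // StopFilter
        .collect()
}

fn analyze(text: &str) -> Vec<String> {
    token_filters(tokenize(&char_filter(text)))
}

fn main() {
    // "The quick brown FOX jumps!" → ["quick", "brown", "fox", "jumps"]
    println!("{:?}", analyze("The quick brown FOX jumps!"));
}
```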

The Analyzer Trait

All analyzers implement the Analyzer trait:

#![allow(unused)]
fn main() {
pub trait Analyzer: Send + Sync + Debug {
    fn analyze(&self, text: &str) -> Result<TokenStream>;
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}
}

TokenStream is a Box<dyn Iterator<Item = Token> + Send> — a lazy iterator over tokens.

A Token contains:

| Field | Type | Description |
|---|---|---|
| text | String | The token text |
| position | usize | Position in the original text |
| start_offset | usize | Start byte offset in original text |
| end_offset | usize | End byte offset in original text |
| position_increment | usize | Distance from previous token |
| position_length | usize | Span of the token (>1 for synonyms) |
| boost | f32 | Token-level scoring weight |
| stopped | bool | Whether marked as a stop word |
| metadata | Option<TokenMetadata> | Additional token metadata |
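
A simplified stand-in, using only a subset of the fields above, shows how a lazy TokenStream and position_increment fit together. The construction code is illustrative, not the library's API; the increment convention shown (a removed stop word leaves a gap of 2) is one common approach, assumed here for illustration.

```rust
// Simplified Token with a subset of the fields from the table above.
#[derive(Debug, Clone)]
struct Token {
    text: String,
    position: usize,
    start_offset: usize,
    end_offset: usize,
    position_increment: usize,
}

// A lazy iterator over tokens, matching the shape described in the text.
type TokenStream = Box<dyn Iterator<Item = Token> + Send>;

// Stream for "the quick fox" with "the" removed as a stop word: "quick"
// carries position_increment = 2 to record the gap it follows.
fn stream() -> TokenStream {
    let tokens = vec![
        Token { text: "quick".into(), position: 1, start_offset: 4,  end_offset: 9,  position_increment: 2 },
        Token { text: "fox".into(),   position: 2, start_offset: 10, end_offset: 13, position_increment: 1 },
    ];
    Box::new(tokens.into_iter())
}

fn main() {
    for t in stream() {
        println!(
            "{:?} at position {} (offsets {}..{}, increment {})",
            t.text, t.position, t.start_offset, t.end_offset, t.position_increment
        );
    }
}
```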

Built-in Analyzers

StandardAnalyzer

The default analyzer. Suitable for most Western languages.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::standard::StandardAnalyzer;

let analyzer = StandardAnalyzer::default();
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering)
}

JapaneseAnalyzer

Uses morphological analysis for Japanese text segmentation.

Pipeline: UnicodeNormalizationCharFilter (NFKC) → JapaneseIterationMarkCharFilter → LinderaTokenizer → LowercaseFilter → StopFilter (Japanese stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::japanese::JapaneseAnalyzer;

let analyzer = JapaneseAnalyzer::new()?;
// "東京都に住んでいる" → ["東京", "都", "に", "住ん", "で", "いる"]
}

KeywordAnalyzer

Treats the entire input as a single token. No tokenization or normalization.

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;

let analyzer = KeywordAnalyzer::new();
// "Hello World" → ["Hello World"]
}

Use this for fields that should match exactly (categories, tags, status codes).

SimpleAnalyzer

Tokenizes text without any filtering. The original case and all tokens are preserved. Useful when you need complete control over the analysis pipeline or want to test a tokenizer in isolation.

Pipeline: User-specified Tokenizer only (no char filters, no token filters)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::simple::SimpleAnalyzer;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use std::sync::Arc;

let tokenizer = Arc::new(RegexTokenizer::new()?);
let analyzer = SimpleAnalyzer::new(tokenizer);
// "Hello World" → ["Hello", "World"]
// (no lowercasing, no stop word removal)
}

Use this for testing tokenizers, or when you want to apply token filters manually in a separate step.

EnglishAnalyzer

An English-specific analyzer. Tokenizes, lowercases, and removes common English stop words.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::language::english::EnglishAnalyzer;

let analyzer = EnglishAnalyzer::new()?;
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering, remaining tokens are lowercased)
}

PipelineAnalyzer

Build a custom pipeline by combining any char filters, a tokenizer, and any sequence of token filters:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;
use laurus::analysis::token_filter::stop::StopFilter;
use laurus::analysis::token_filter::stem::StemFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC)))
    .add_filter(Arc::new(LowercaseFilter::new()))
    .add_filter(Arc::new(StopFilter::new()))
    .add_filter(Arc::new(StemFilter::new()));  // Porter stemmer
}

PerFieldAnalyzer

PerFieldAnalyzer lets you assign different analyzers to different fields within the same engine:

graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KW["KeywordAnalyzer"]
    PFA -->|"body"| STD["StandardAnalyzer"]
    PFA -->|"description_ja"| JP["JapaneseAnalyzer"]
    PFA -->|other fields| DEF["Default\n(StandardAnalyzer)"]

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;

// Default analyzer for fields not explicitly configured
let mut per_field = PerFieldAnalyzer::new(
    Arc::new(StandardAnalyzer::default())
);

// Use KeywordAnalyzer for exact-match fields
per_field.add_analyzer("category", Arc::new(KeywordAnalyzer::new()));
per_field.add_analyzer("status", Arc::new(KeywordAnalyzer::new()));

let engine = Engine::builder(storage, schema)
    .analyzer(Arc::new(per_field))
    .build()
    .await?;
}

Note: The _id field is always analyzed with KeywordAnalyzer regardless of configuration.

Char Filters

Char filters operate on the raw input text before it reaches the tokenizer. They perform character-level normalization such as Unicode normalization, character mapping, and pattern-based replacement. This ensures that the tokenizer receives clean, normalized text.

All char filters implement the CharFilter trait:

#![allow(unused)]
fn main() {
pub trait CharFilter: Send + Sync {
    fn filter(&self, input: &str) -> (String, Vec<Transformation>);
    fn name(&self) -> &'static str;
}
}

The Transformation records describe how character positions shifted, allowing the engine to map token positions back to the original text.
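
The remapping idea can be sketched with an invented record type; laurus's actual Transformation fields are not shown in this chapter, so the struct and helper below are hypothetical, for illustration only: each record says "this byte range of the original became this byte range of the filtered text".

```rust
use std::ops::Range;

// Hypothetical transformation record (not laurus's real definition).
struct Transformation {
    original: Range<usize>, // byte range in the raw input
    filtered: Range<usize>, // byte range in the filtered output
}

// Map a byte offset in the filtered text back to the original text by
// re-applying the length differences of all transformations before it.
fn map_back(offset: usize, transformations: &[Transformation]) -> usize {
    let mut shift: isize = 0;
    for t in transformations {
        if t.filtered.end <= offset {
            shift += t.original.len() as isize - t.filtered.len() as isize;
        }
    }
    (offset as isize + shift) as usize
}

fn main() {
    // "ph" (bytes 0..2) was mapped to "f" (bytes 0..1), so filtered
    // offset 5 corresponds to original offset 6.
    let ts = [Transformation { original: 0..2, filtered: 0..1 }];
    println!("{}", map_back(5, &ts)); // 6
}
```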

| Char Filter | Description |
|---|---|
| UnicodeNormalizationCharFilter | Unicode normalization (NFC, NFD, NFKC, NFKD) |
| MappingCharFilter | Replaces character sequences based on a mapping dictionary |
| PatternReplaceCharFilter | Replaces characters matching a regex pattern |
| JapaneseIterationMarkCharFilter | Expands Japanese iteration marks (踊り字) to their base characters |

UnicodeNormalizationCharFilter

Applies Unicode normalization to the input text. NFKC is recommended for search use cases because it normalizes both compatibility characters and composed forms.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};

let filter = UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC);
// "Ｓｏｎｙ" (fullwidth) → "Sony" (halfwidth)
// "㌂" → "アンペア"
}

| Form | Description |
|---|---|
| NFC | Canonical decomposition followed by canonical composition |
| NFD | Canonical decomposition |
| NFKC | Compatibility decomposition followed by canonical composition |
| NFKD | Compatibility decomposition |
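
Why this matters for search: the same visible character can arrive as different code-point sequences. The std-only snippet below just demonstrates that the two encodings of "é" differ; performing the actual normalization requires a normalization library, as the char filter does internally.

```rust
fn main() {
    // "é" as a single precomposed code point (U+00E9)...
    let composed = "\u{00E9}";
    // ...and as 'e' followed by a combining acute accent (U+0301).
    let decomposed = "e\u{0301}";

    // They render identically but are different byte sequences, so an
    // index built without normalization would treat them as distinct terms.
    assert_ne!(composed, decomposed);

    // NFC maps the decomposed form to the composed one; NFD goes the
    // other way.
    println!("{} bytes vs {} bytes", composed.len(), decomposed.len());
}
```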

MappingCharFilter

Replaces character sequences using a dictionary. Matches are found using the Aho-Corasick algorithm (leftmost-longest match).

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use laurus::analysis::char_filter::mapping::MappingCharFilter;

let mut mapping = HashMap::new();
mapping.insert("ph".to_string(), "f".to_string());
mapping.insert("qu".to_string(), "k".to_string());

let filter = MappingCharFilter::new(mapping)?;
// "phone queue" → "fone keue"
}
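
The leftmost-longest behavior can be reproduced with a naive std-only scan, shown below in place of the real Aho-Corasick automaton (which finds matches in a single pass rather than retrying every key at every position):

```rust
use std::collections::HashMap;

// Naive leftmost-longest replacement: at each position, apply the longest
// matching key, otherwise copy one character and move on.
fn map_chars(input: &str, mapping: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut i = 0;
    while i < input.len() {
        let hit = mapping
            .iter()
            .filter(|(k, _)| input[i..].starts_with(*k))
            .max_by_key(|(k, _)| k.len());
        match hit {
            Some((k, v)) => {
                out.push_str(v);
                i += k.len();
            }
            None => {
                // Advance one char (not one byte) to stay on a boundary.
                let ch = input[i..].chars().next().unwrap();
                out.push(ch);
                i += ch.len_utf8();
            }
        }
    }
    out
}

fn main() {
    let mapping = HashMap::from([("ph", "f"), ("qu", "k")]);
    println!("{}", map_chars("phone queue", &mapping)); // "fone keue"
}
```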

PatternReplaceCharFilter

Replaces all occurrences of a regex pattern with a fixed string.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;

// Remove hyphens
let filter = PatternReplaceCharFilter::new(r"-", "")?;
// "123-456-789" → "123456789"

// Normalize numbers
let filter = PatternReplaceCharFilter::new(r"\d+", "NUM")?;
// "Year 2024" → "Year NUM"
}

JapaneseIterationMarkCharFilter

Expands Japanese iteration marks (踊り字) to their base characters. Supports kanji (々), hiragana (ゝ, ゞ), and katakana (ヽ, ヾ) iteration marks.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::japanese_iteration_mark::JapaneseIterationMarkCharFilter;

let filter = JapaneseIterationMarkCharFilter::new(
    true,  // normalize kanji iteration marks
    true,  // normalize kana iteration marks
);
// "佐々木" → "佐佐木"
// "いすゞ" → "いすず"
}

Using Char Filters in a Pipeline

Add char filters to a PipelineAnalyzer with add_char_filter(). Multiple char filters are applied in the order they are added, all before the tokenizer runs.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(
        UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC),
    ))
    .add_char_filter(Arc::new(
        PatternReplaceCharFilter::new(r"-", "")?,
    ))
    .add_filter(Arc::new(LowercaseFilter::new()));
// "Tokyo-2024" → NFKC → "Tokyo-2024" → remove hyphens → "Tokyo2024" → tokenize → lowercase → ["tokyo2024"]
}

Tokenizers

| Tokenizer | Description |
|---|---|
| RegexTokenizer | Unicode word boundaries; splits on whitespace and punctuation |
| UnicodeWordTokenizer | Splits on Unicode word boundaries |
| WhitespaceTokenizer | Splits on whitespace only |
| WholeTokenizer | Returns the entire input as a single token |
| LinderaTokenizer | Japanese morphological analysis (Lindera/MeCab) |
| NgramTokenizer | Generates n-gram tokens of configurable size |
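
To illustrate what an n-gram tokenizer emits, here is a standalone character-level sketch; NgramTokenizer's actual constructor and options (such as min/max gram sizes) are not shown in this chapter.

```rust
// Emit all character n-grams of a fixed size, as a character-level
// n-gram tokenizer would for a single field value.
fn ngrams(input: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = input.chars().collect();
    if chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    // "quick" → ["qu", "ui", "ic", "ck"]
    println!("{:?}", ngrams("quick", 2));
}
```

Character n-grams are useful for substring matching and for languages without word boundaries, at the cost of a larger index.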

Token Filters

| Filter | Description |
|---|---|
| LowercaseFilter | Converts tokens to lowercase |
| StopFilter | Removes common words (“the”, “is”, “a”) |
| StemFilter | Reduces words to their root form (“running” → “run”) |
| SynonymGraphFilter | Expands tokens with synonyms from a dictionary |
| BoostFilter | Adjusts token boost values |
| LimitFilter | Limits the number of tokens |
| StripFilter | Strips leading/trailing whitespace from tokens |
| FlattenGraphFilter | Flattens token graphs (for synonym expansion) |
| RemoveEmptyFilter | Removes empty tokens |

Synonym Expansion

The SynonymGraphFilter expands terms using a synonym dictionary:

#![allow(unused)]
fn main() {
use laurus::analysis::synonym::dictionary::SynonymDictionary;
use laurus::analysis::token_filter::synonym_graph::SynonymGraphFilter;

let mut dict = SynonymDictionary::new(None)?;
dict.add_synonym_group(vec!["ml".into(), "machine learning".into()]);
dict.add_synonym_group(vec!["ai".into(), "artificial intelligence".into()]);

// keep_original=true means original token is preserved alongside synonyms
let filter = SynonymGraphFilter::new(dict, true)
    .with_boost(0.8);  // synonyms get 80% weight
}

The boost parameter controls how much weight synonyms receive relative to original tokens. A value of 0.8 means synonym matches contribute 80% as much to the score as exact matches.
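
As back-of-envelope arithmetic, assuming only the proportionality stated above (laurus's full scoring formula is not described in this chapter):

```rust
// A token's contribution is its base score scaled by the token boost.
fn contribution(base_score: f32, boost: f32) -> f32 {
    base_score * boost
}

fn main() {
    let exact = contribution(2.0, 1.0);   // original token, boost 1.0
    let synonym = contribution(2.0, 0.8); // synonym token, boost 0.8
    println!("exact = {exact}, synonym = {synonym}"); // 2.0 vs 1.6
}
```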