Laurus

A fast, featureful hybrid search library for Rust.

Laurus is a pure-Rust library that combines lexical search (keyword matching via inverted index) and vector search (semantic similarity via embeddings) into a single, unified engine. It is designed to be embedded directly into your Rust application — no external server required.

Key Features

| Feature | Description |
|---|---|
| Lexical Search | Full-text search powered by an inverted index with BM25 scoring |
| Vector Search | Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes |
| Hybrid Search | Combine lexical and vector results with fusion algorithms (RRF, WeightedSum) |
| Text Analysis | Pluggable analyzer pipeline — tokenizers, filters, stemmers, synonyms |
| Embeddings | Built-in support for Candle (local BERT/CLIP), OpenAI API, or custom embedders |
| Storage | Pluggable backends — in-memory, file-based, or memory-mapped |
| Query DSL | Human-readable query syntax for lexical, vector, and hybrid search |
| Pure Rust | No C/C++ dependencies in the core — safe, portable, easy to build |

How It Works

graph LR
    subgraph Your Application
        D["Document"]
        Q["Query"]
    end

    subgraph Laurus Engine
        SCH["Schema"]
        AN["Analyzer"]
        EM["Embedder"]
        LI["Lexical Index\n(Inverted Index)"]
        VI["Vector Index\n(HNSW / Flat / IVF)"]
        FU["Fusion\n(RRF / WeightedSum)"]
    end

    D --> SCH
    SCH --> AN --> LI
    SCH --> EM --> VI
    Q --> LI --> FU
    Q --> VI --> FU
    FU --> R["Ranked Results"]

  1. Define a Schema — declare your fields and their types (text, integer, vector, etc.)
  2. Build an Engine — attach an analyzer for text and an embedder for vectors
  3. Index Documents — the engine routes each field to the correct index automatically
  4. Search — run lexical, vector, or hybrid queries and get ranked results
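
To make the fusion step concrete, here is a self-contained sketch of Reciprocal Rank Fusion in plain Rust (illustrative only, not the Laurus API; k = 60 is the constant conventionally used in the RRF literature):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: a document's fused score is the sum of
// 1 / (k + rank) over every result list it appears in (ranks are 1-based).
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // highest fused score first
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let lexical = vec!["doc-1", "doc-2", "doc-3"]; // keyword ranking
    let vector = vec!["doc-2", "doc-4"];           // semantic ranking
    for (id, score) in rrf_fuse(&[lexical, vector], 60.0) {
        println!("{id}: {score:.4}");
    }
}
```

A document ranked well in both lists (here `doc-2`) outscores one that appears in only a single list, which is why RRF is a robust default for hybrid search.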

Document Map

| Section | What You Will Learn |
|---|---|
| Getting Started | Install Laurus and run your first search in minutes |
| Architecture | Understand the Engine, its components, and data flow |
| Core Concepts | Schema, text analysis, embeddings, and storage |
| Indexing | How inverted indexes and vector indexes work internally |
| Search | Query types, vector search, and hybrid fusion |
| Advanced Features | Query DSL, ID management, WAL, and compaction |
| API Reference | Key types and methods at a glance |

Quick Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema, SearchRequestBuilder, Result};
use laurus::lexical::{TextOption, TermQuery};
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Storage
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    // 2. Schema
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_default_field("body")
        .build();

    // 3. Engine
    let engine = Engine::builder(storage, schema).build().await?;

    // 4. Index a document
    let doc = Document::builder()
        .add_text("title", "Hello Laurus")
        .add_text("body", "A fast search library for Rust")
        .build();
    engine.add_document("doc-1", doc).await?;
    engine.commit().await?;

    // 5. Search
    let request = SearchRequestBuilder::new()
        .lexical_search_request(
            laurus::LexicalSearchRequest::new(
                Box::new(TermQuery::new("body", "rust"))
            )
        )
        .limit(10)
        .build();
    let results = engine.search(request).await?;

    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }
    Ok(())
}

License

Laurus is dual-licensed under MIT and Apache 2.0.

Getting Started

Welcome to Laurus! This section will help you install the library and run your first search.

What You Will Build

By the end of this guide, you will have a working search engine that can:

  • Index text documents
  • Perform keyword (lexical) search
  • Perform semantic (vector) search
  • Combine both with hybrid search

Prerequisites

  • Rust 1.85 or later (edition 2024)
  • Cargo (included with Rust)
  • Tokio runtime (Laurus uses async APIs)

Steps

  1. Installation — Add Laurus to your project and choose feature flags
  2. Quick Start — Build a complete search engine in 5 steps

Workflow Overview

Building a search application with Laurus follows a consistent pattern:

graph LR
    A["1. Create\nStorage"] --> B["2. Define\nSchema"]
    B --> C["3. Build\nEngine"]
    C --> D["4. Index\nDocuments"]
    D --> E["5. Search"]

| Step | What Happens |
|---|---|
| Create Storage | Choose where data lives — in memory, on disk, or memory-mapped |
| Define Schema | Declare fields and their types (text, integer, vector, etc.) |
| Build Engine | Attach an analyzer (for text) and an embedder (for vectors) |
| Index Documents | Add documents; the engine routes fields to the correct index |
| Search | Run lexical, vector, or hybrid queries and get ranked results |

Installation

Add Laurus to Your Project

Add laurus and tokio (async runtime) to your Cargo.toml:

[dependencies]
laurus = "0.1.0"
tokio = { version = "1", features = ["full"] }

Feature Flags

Laurus ships with a minimal default feature set. Enable additional features as needed:

| Feature | Description | Use Case |
|---|---|---|
| (default) | Core library (lexical search, storage, analyzers — no embedding) | Keyword search only |
| embeddings-candle | Local BERT embeddings via Hugging Face Candle | Vector search without external API |
| embeddings-openai | OpenAI API embeddings (text-embedding-3-small, etc.) | Cloud-based vector search |
| embeddings-multimodal | CLIP embeddings for text + image via Candle | Multimodal (text-to-image) search |
| embeddings-all | All embedding features above | Full embedding support |

Examples

Lexical search only (no embeddings needed):

[dependencies]
laurus = "0.1.0"

Vector search with local model (no API key required):

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-candle"] }

Vector search with OpenAI:

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-openai"] }

Everything:

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-all"] }

Verify Installation

Create a minimal program to verify that Laurus compiles:

use laurus::Result;

#[tokio::main]
async fn main() -> Result<()> {
    println!("Laurus version: {}", laurus::VERSION);
    Ok(())
}

cargo run

If you see the version printed, you are ready to proceed to the Quick Start.

Quick Start

This tutorial walks you through building a complete search engine in 5 steps. By the end, you will be able to index documents and search them by keyword.

Step 1 — Create Storage

Storage determines where Laurus persists index data. For development and testing, use MemoryStorage:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::storage::memory::MemoryStorage;
use laurus::Storage;

let storage: Arc<dyn Storage> = Arc::new(
    MemoryStorage::new(Default::default())
);
}

Tip: For production, consider FileStorage (with optional use_mmap for memory-mapped I/O). See Storage for details.

Step 2 — Define a Schema

A Schema declares the fields in your documents and how each field should be indexed:

#![allow(unused)]
fn main() {
use laurus::Schema;
use laurus::lexical::TextOption;

let schema = Schema::builder()
    .add_text_field("title", TextOption::default())
    .add_text_field("body", TextOption::default())
    .add_default_field("body")  // used when no field is specified in a query
    .build();
}

Each field has a type. Common types include:

| Method | Field Type | Example Values |
|---|---|---|
| add_text_field | Text (full-text searchable) | "Hello world" |
| add_integer_field | 64-bit integer | 42 |
| add_float_field | 64-bit float | 3.14 |
| add_boolean_field | Boolean | true / false |
| add_datetime_field | UTC datetime | 2024-01-15T10:30:00Z |
| add_hnsw_field | Vector (HNSW index) | [0.1, 0.2, ...] |
| add_flat_field | Vector (Flat index) | [0.1, 0.2, ...] |

See Schema & Fields for the full list.

Step 3 — Build an Engine

The Engine ties storage, schema, and runtime components together:

#![allow(unused)]
fn main() {
use laurus::Engine;

let engine = Engine::builder(storage, schema)
    .build()
    .await?;
}

If your schema contains only text fields, the default StandardAnalyzer is applied automatically. To customize analysis or add vector embeddings, see Architecture.

Step 4 — Index Documents

Create documents with the DocumentBuilder and add them to the engine:

#![allow(unused)]
fn main() {
use laurus::Document;

// Each document needs a unique external ID (string)
let doc = Document::builder()
    .add_text("title", "Introduction to Rust")
    .add_text("body", "Rust is a systems programming language focused on safety and performance.")
    .build();
engine.add_document("doc-1", doc).await?;

let doc = Document::builder()
    .add_text("title", "Python for Data Science")
    .add_text("body", "Python is widely used in machine learning and data analysis.")
    .build();
engine.add_document("doc-2", doc).await?;

let doc = Document::builder()
    .add_text("title", "Web Development with JavaScript")
    .add_text("body", "JavaScript powers interactive web applications and server-side code with Node.js.")
    .build();
engine.add_document("doc-3", doc).await?;

// Commit to make documents searchable
engine.commit().await?;
}

Important: Documents are not searchable until commit() is called.

Step 5 — Search

Use SearchRequestBuilder with a query to search the index:

#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest};
use laurus::lexical::TermQuery;

// Search for "rust" in the "body" field
let request = SearchRequestBuilder::new()
    .lexical_search_request(
        LexicalSearchRequest::new(
            Box::new(TermQuery::new("body", "rust"))
        )
    )
    .limit(10)
    .build();

let results = engine.search(request).await?;

for result in &results {
    println!("ID: {}, Score: {:.4}", result.id, result.score);
    if let Some(doc) = &result.document {
        if let Some(title) = doc.get("title") {
            println!("  Title: {:?}", title);
        }
    }
}
}

Complete Example

Here is the full program that you can copy, paste, and run:

use std::sync::Arc;
use laurus::{
    Document, Engine, LexicalSearchRequest,
    Result, Schema, SearchRequestBuilder,
};
use laurus::lexical::{TextOption, TermQuery};
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Storage
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    // 2. Schema
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_default_field("body")
        .build();

    // 3. Engine
    let engine = Engine::builder(storage, schema).build().await?;

    // 4. Index documents
    for (id, title, body) in [
        ("doc-1", "Introduction to Rust", "Rust is a systems programming language focused on safety."),
        ("doc-2", "Python for Data Science", "Python is widely used in machine learning."),
        ("doc-3", "Web Development", "JavaScript powers interactive web applications."),
    ] {
        let doc = Document::builder()
            .add_text("title", title)
            .add_text("body", body)
            .build();
        engine.add_document(id, doc).await?;
    }
    engine.commit().await?;

    // 5. Search
    let request = SearchRequestBuilder::new()
        .lexical_search_request(
            LexicalSearchRequest::new(
                Box::new(TermQuery::new("body", "rust"))
            )
        )
        .limit(10)
        .build();

    let results = engine.search(request).await?;
    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }

    Ok(())
}

Next Steps

Continue to Core Concepts to learn about the building blocks in depth: schemas, text analysis, embeddings, and storage.

Core Concepts

This section covers the foundational building blocks of Laurus. Understanding these concepts will help you design effective schemas and configure your search engine.

Topics

Schema & Fields

How to define the structure of your documents. Covers:

  • Schema and SchemaBuilder
  • Lexical field types (Text, Integer, Float, Boolean, DateTime, Geo, Bytes)
  • Vector field types (Flat, HNSW, IVF)
  • Document and DocumentBuilder
  • DataValue — the unified value type

Text Analysis

How text is processed before indexing. Covers:

  • The Analyzer trait and the analysis pipeline
  • Built-in analyzers (Standard, Japanese, Keyword, Pipeline)
  • PerFieldAnalyzer — different analyzers for different fields
  • Tokenizers and token filters

Embeddings

How text and images are converted to vectors. Covers:

  • The Embedder trait
  • Built-in embedders (Candle BERT, OpenAI, CLIP, Precomputed)
  • PerFieldEmbedder — different embedders for different fields

Storage

Where index data is stored. Covers:

  • The Storage trait
  • Storage backends (Memory, File, Mmap)
  • PrefixedStorage for component isolation
  • Choosing the right backend for your use case

Schema & Fields

The Schema defines the structure of your documents — what fields exist and how each field is indexed. It is the single source of truth for the Engine.

For the TOML file format used by the CLI, see Schema Format Reference.

Schema

A Schema is a collection of named fields. Each field is either a lexical field (for keyword search) or a vector field (for similarity search).

#![allow(unused)]
fn main() {
use laurus::Schema;
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::vector::HnswOption;

let schema = Schema::builder()
    .add_text_field("title", TextOption::default())
    .add_text_field("body", TextOption::default())
    .add_integer_field("year", IntegerOption::default())
    .add_hnsw_field("embedding", HnswOption::default())
    .add_default_field("body")
    .build();
}

Default Fields

add_default_field() specifies which field(s) are searched when a query does not explicitly name a field. This is used by the Query DSL parser.

Field Types

graph TB
    FO["FieldOption"]

    FO --> T["Text"]
    FO --> I["Integer"]
    FO --> FL["Float"]
    FO --> B["Boolean"]
    FO --> DT["DateTime"]
    FO --> G["Geo"]
    FO --> BY["Bytes"]

    FO --> FLAT["Flat"]
    FO --> HNSW["HNSW"]
    FO --> IVF["IVF"]

Lexical Fields

Lexical fields are indexed using an inverted index and support keyword-based queries.
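
To make "inverted index" concrete, here is a toy sketch in plain Rust (illustrative only, not Laurus internals): each term maps to the set of document IDs that contain it, so a term query is a single map lookup rather than a scan of every document.

```rust
use std::collections::{HashMap, HashSet};

// Toy inverted index: each lowercased term maps to the set of document IDs
// that contain it.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, HashSet<String>>,
}

impl InvertedIndex {
    // Split on whitespace and record doc_id in each term's postings set.
    fn index(&mut self, doc_id: &str, text: &str) {
        for term in text.split_whitespace() {
            self.postings
                .entry(term.to_lowercase())
                .or_default()
                .insert(doc_id.to_string());
        }
    }

    // A term query is one hash lookup; results sorted for a stable order.
    fn term_query(&self, term: &str) -> Vec<String> {
        let mut ids: Vec<String> = self
            .postings
            .get(&term.to_lowercase())
            .map(|set| set.iter().cloned().collect())
            .unwrap_or_default();
        ids.sort();
        ids
    }
}

fn main() {
    let mut idx = InvertedIndex::default();
    idx.index("doc-1", "Rust is fast");
    idx.index("doc-2", "Python is popular");
    println!("{:?}", idx.term_query("rust")); // ["doc-1"]
}
```

A real index additionally stores term frequencies and positions per posting, which is what makes BM25 scoring and phrase queries possible.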

| Type | Rust Type | SchemaBuilder Method | Description |
|---|---|---|---|
| Text | TextOption | add_text_field() | Full-text searchable; tokenized by the analyzer |
| Integer | IntegerOption | add_integer_field() | 64-bit signed integer; supports range queries |
| Float | FloatOption | add_float_field() | 64-bit floating point; supports range queries |
| Boolean | BooleanOption | add_boolean_field() | true / false |
| DateTime | DateTimeOption | add_datetime_field() | UTC timestamp; supports range queries |
| Geo | GeoOption | add_geo_field() | Latitude/longitude pair; supports radius and bounding box queries |
| Bytes | BytesOption | add_bytes_field() | Raw binary data |

Text Field Options

TextOption controls how text is indexed:

#![allow(unused)]
fn main() {
use laurus::lexical::TextOption;

// Default: indexed + stored
let opt = TextOption::default();

// Customize: indexed + stored + term vectors
let opt = TextOption::default()
    .set_indexed(true)
    .set_stored(true)
    .set_term_vectors(true);
}

| Option | Default | Description |
|---|---|---|
| indexed | true | Whether the field is searchable |
| stored | true | Whether the original value is stored for retrieval |
| term_vectors | false | Whether term positions are stored (needed for phrase queries) |

Vector Fields

Vector fields are indexed using vector indexes for approximate nearest neighbor (ANN) search.

| Type | Rust Type | SchemaBuilder Method | Description |
|---|---|---|---|
| Flat | FlatOption | add_flat_field() | Brute-force linear scan; exact results |
| HNSW | HnswOption | add_hnsw_field() | Hierarchical Navigable Small World graph; fast approximate search |
| IVF | IvfOption | add_ivf_field() | Inverted File Index; cluster-based approximate search |
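
As a mental model for the Flat case, a brute-force scan can be sketched in a few lines of plain Rust (illustrative, not the Laurus implementation); HNSW and IVF exist precisely to avoid this O(n) per-query cost:

```rust
// Flat index as a brute-force scan: measure the distance from the query to
// every stored vector and keep the k nearest. Exact results, O(n) per query.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

fn flat_search<'a>(query: &[f32], docs: &[(&'a str, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .map(|(id, vec)| (*id, euclidean(query, vec)))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap()); // smallest distance first
    scored.truncate(k);
    scored
}

fn main() {
    let docs = vec![
        ("a", vec![1.0, 0.0]),
        ("b", vec![0.0, 1.0]),
        ("c", vec![0.7, 0.7]),
    ];
    for (id, dist) in flat_search(&[0.9, 0.1], &docs, 2) {
        println!("{id}: {dist:.3}");
    }
}
```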

HNSW Field Options (most common)

#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = HnswOption {
    dimension: 384,                          // vector dimensions
    distance: DistanceMetric::Cosine,        // distance metric
    m: 16,                                   // max connections per layer
    ef_construction: 200,                    // construction search width
    base_weight: 1.0,                        // default scoring weight
    quantizer: None,                         // optional quantization
};
}

See Vector Indexing for detailed parameter guidance.

Document

A Document is a collection of named field values. Use DocumentBuilder to construct documents:

#![allow(unused)]
fn main() {
use laurus::Document;

let doc = Document::builder()
    .add_text("title", "Introduction to Rust")
    .add_text("body", "Rust is a systems programming language.")
    .add_integer("year", 2024)
    .add_float("rating", 4.8)
    .add_boolean("published", true)
    .build();
}

Indexing Documents

The Engine provides two methods for adding documents, each with different semantics:

| Method | Behavior | Use Case |
|---|---|---|
| put_document(id, doc) | Upsert — if a document with the same ID exists, it is replaced | Standard document indexing |
| add_document(id, doc) | Append — adds the document as a new chunk; multiple chunks can share the same ID | Chunked/split documents (e.g., long articles split into paragraphs) |

#![allow(unused)]
fn main() {
// Upsert: replaces any existing document with id "doc1"
engine.put_document("doc1", doc).await?;

// Append: adds another chunk under the same id "doc1"
engine.add_document("doc1", chunk2).await?;

// Always commit after indexing
engine.commit().await?;
}

Retrieving Documents

Use get_documents to retrieve all documents (including chunks) by external ID:

#![allow(unused)]
fn main() {
let docs = engine.get_documents("doc1").await?;
for doc in &docs {
    if let Some(title) = doc.get("title") {
        println!("Title: {:?}", title);
    }
}
}

Deleting Documents

Delete all documents and chunks sharing an external ID:

#![allow(unused)]
fn main() {
engine.delete_documents("doc1").await?;
engine.commit().await?;
}

Document Lifecycle

graph LR
    A["Build Document"] --> B["put/add_document()"]
    B --> C["WAL"]
    C --> D["commit()"]
    D --> E["Searchable"]
    E --> F["get_documents()"]
    E --> G["delete_documents()"]

Important: Documents are not searchable until commit() is called.

DocumentBuilder Methods

| Method | Value Type | Description |
|---|---|---|
| add_text(name, value) | String | Add a text field |
| add_integer(name, value) | i64 | Add an integer field |
| add_float(name, value) | f64 | Add a float field |
| add_boolean(name, value) | bool | Add a boolean field |
| add_datetime(name, value) | DateTime<Utc> | Add a datetime field |
| add_vector(name, value) | Vec<f32> | Add a pre-computed vector field |
| add_geo(name, lat, lon) | (f64, f64) | Add a geographic point |
| add_bytes(name, data) | Vec<u8> | Add binary data |
| add_field(name, value) | DataValue | Add any value type |
add_field(name, value)DataValueAdd any value type

DataValue

DataValue is the unified value enum that represents any field value in Laurus:

#![allow(unused)]
fn main() {
pub enum DataValue {
    Null,
    Bool(bool),
    Int64(i64),
    Float64(f64),
    Text(String),
    Bytes(Vec<u8>, Option<String>),  // (data, optional MIME type)
    Vector(Vec<f32>),
    DateTime(DateTime<Utc>),
    Geo(f64, f64),          // (latitude, longitude)
}
}

DataValue implements From<T> for common types, so you can use .into() conversions:

#![allow(unused)]
fn main() {
use laurus::DataValue;

let v: DataValue = "hello".into();       // Text
let v: DataValue = 42i64.into();         // Int64
let v: DataValue = 3.14f64.into();       // Float64
let v: DataValue = true.into();          // Bool
let v: DataValue = vec![0.1f32, 0.2].into(); // Vector
}

Reserved Fields

The _id field is reserved by Laurus for internal use. It stores the external document ID and is always indexed with KeywordAnalyzer (exact match). You do not need to add it to your schema — it is managed automatically.

Schema Design Tips

  1. Separate lexical and vector fields — a field is either lexical or vector, never both. For hybrid search, create separate fields (e.g., body for text, body_vec for vector).

  2. Use KeywordAnalyzer for exact-match fields — category, status, and tag fields should use KeywordAnalyzer via PerFieldAnalyzer to avoid tokenization.

  3. Choose the right vector index — use HNSW for most cases, Flat for small datasets, IVF for very large datasets. See Vector Indexing.

  4. Set default fields — if you use the Query DSL, set default fields so users can write hello instead of body:hello.

  5. Use the schema generator — run laurus create schema to interactively build a schema TOML file instead of writing it by hand. See CLI Commands.

Text Analysis

Text analysis is the process of converting raw text into searchable tokens. When a document is indexed, the analyzer breaks text fields into individual terms; when a query is executed, the same analyzer processes the query text to ensure consistency.

The Analysis Pipeline

graph LR
    Input["Raw Text\n'The quick brown FOX jumps!'"]
    CF["UnicodeNormalizationCharFilter"]
    T["Tokenizer\nSplit into words"]
    F1["LowercaseFilter"]
    F2["StopFilter"]
    F3["StemFilter"]
    Output["Terms\n'quick', 'brown', 'fox', 'jump'"]

    Input --> CF --> T --> F1 --> F2 --> F3 --> Output

The analysis pipeline consists of:

  1. Char Filters — normalize raw text at the character level before tokenization
  2. Tokenizer — splits text into raw tokens (words, characters, n-grams)
  3. Token Filters — transform, remove, or expand tokens (lowercase, stop words, stemming, synonyms)
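
The three stages compose naturally. Here is a toy end-to-end pipeline in plain Rust (illustrative only, not the Laurus API; stemming is omitted because a real stemmer is nontrivial):

```rust
// Toy analysis pipeline, std-only:
// char filter (strip hyphens) → tokenizer (split on non-alphanumerics)
// → token filters (lowercase, then drop stop words).
fn analyze(text: &str, stop_words: &[&str]) -> Vec<String> {
    let filtered: String = text.chars().filter(|c| *c != '-').collect(); // char filter
    filtered
        .split(|c: char| !c.is_alphanumeric()) // tokenizer
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase) // lowercase filter
        .filter(|t| !stop_words.contains(&t.as_str())) // stop filter
        .collect()
}

fn main() {
    let terms = analyze("The quick brown FOX jumps!", &["the", "a", "is"]);
    println!("{terms:?}"); // ["quick", "brown", "fox", "jumps"]
}
```

A real analyzer additionally tracks each token's position and byte offsets so that phrase queries and highlighting can map terms back to the source text.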

The Analyzer Trait

All analyzers implement the Analyzer trait:

#![allow(unused)]
fn main() {
pub trait Analyzer: Send + Sync + Debug {
    fn analyze(&self, text: &str) -> Result<TokenStream>;
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}
}

TokenStream is a Box<dyn Iterator<Item = Token> + Send> — a lazy iterator over tokens.

A Token contains:

| Field | Type | Description |
|---|---|---|
| text | String | The token text |
| position | usize | Position in the original text |
| start_offset | usize | Start byte offset in original text |
| end_offset | usize | End byte offset in original text |
| position_increment | usize | Distance from previous token |
| position_length | usize | Span of the token (>1 for synonyms) |
| boost | f32 | Token-level scoring weight |
| stopped | bool | Whether marked as a stop word |
| metadata | Option<TokenMetadata> | Additional token metadata |

Built-in Analyzers

StandardAnalyzer

The default analyzer. Suitable for most Western languages.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::standard::StandardAnalyzer;

let analyzer = StandardAnalyzer::default();
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering)
}

JapaneseAnalyzer

Uses morphological analysis for Japanese text segmentation.

Pipeline: UnicodeNormalizationCharFilter (NFKC) → JapaneseIterationMarkCharFilter → LinderaTokenizer → LowercaseFilter → StopFilter (Japanese stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::japanese::JapaneseAnalyzer;

let analyzer = JapaneseAnalyzer::new()?;
// "東京都に住んでいる" → ["東京", "都", "に", "住ん", "で", "いる"]
}

KeywordAnalyzer

Treats the entire input as a single token. No tokenization or normalization.

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;

let analyzer = KeywordAnalyzer::new();
// "Hello World" → ["Hello World"]
}

Use this for fields that should match exactly (categories, tags, status codes).

SimpleAnalyzer

Tokenizes text without any filtering. The original case and all tokens are preserved. Useful when you need complete control over the analysis pipeline or want to test a tokenizer in isolation.

Pipeline: User-specified Tokenizer only (no char filters, no token filters)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::simple::SimpleAnalyzer;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use std::sync::Arc;

let tokenizer = Arc::new(RegexTokenizer::new()?);
let analyzer = SimpleAnalyzer::new(tokenizer);
// "Hello World" → ["Hello", "World"]
// (no lowercasing, no stop word removal)
}

Use this for testing tokenizers, or when you want to apply token filters manually in a separate step.

EnglishAnalyzer

An English-specific analyzer. Tokenizes, lowercases, and removes common English stop words.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::language::english::EnglishAnalyzer;

let analyzer = EnglishAnalyzer::new()?;
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering, remaining tokens are lowercased)
}

PipelineAnalyzer

Build a custom pipeline by combining any char filters, a tokenizer, and any sequence of token filters:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;
use laurus::analysis::token_filter::stop::StopFilter;
use laurus::analysis::token_filter::stem::StemFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC)))
    .add_filter(Arc::new(LowercaseFilter::new()))
    .add_filter(Arc::new(StopFilter::new()))
    .add_filter(Arc::new(StemFilter::new()));  // Porter stemmer
}

PerFieldAnalyzer

PerFieldAnalyzer lets you assign different analyzers to different fields within the same engine:

graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KW["KeywordAnalyzer"]
    PFA -->|"body"| STD["StandardAnalyzer"]
    PFA -->|"description_ja"| JP["JapaneseAnalyzer"]
    PFA -->|other fields| DEF["Default\n(StandardAnalyzer)"]

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Engine;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;

// Default analyzer for fields not explicitly configured
let mut per_field = PerFieldAnalyzer::new(
    Arc::new(StandardAnalyzer::default())
);

// Use KeywordAnalyzer for exact-match fields
per_field.add_analyzer("category", Arc::new(KeywordAnalyzer::new()));
per_field.add_analyzer("status", Arc::new(KeywordAnalyzer::new()));

let engine = Engine::builder(storage, schema)
    .analyzer(Arc::new(per_field))
    .build()
    .await?;
}

Note: The _id field is always analyzed with KeywordAnalyzer regardless of configuration.

Char Filters

Char filters operate on the raw input text before it reaches the tokenizer. They perform character-level normalization such as Unicode normalization, character mapping, and pattern-based replacement. This ensures that the tokenizer receives clean, normalized text.

All char filters implement the CharFilter trait:

#![allow(unused)]
fn main() {
pub trait CharFilter: Send + Sync {
    fn filter(&self, input: &str) -> (String, Vec<Transformation>);
    fn name(&self) -> &'static str;
}
}

The Transformation records describe how character positions shifted, allowing the engine to map token positions back to the original text.
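
The idea can be sketched in a few lines of plain Rust. Note that the `Shift` record below is a hypothetical simplification invented for illustration, not the actual Transformation type: it stores where the filter changed the text and by how many bytes, so a token offset in the filtered text can be shifted back into original coordinates.

```rust
// Hypothetical shift record: from byte position `at` in the filtered text
// onward, the filtered text is `delta` bytes longer (negative = shorter)
// than the original.
struct Shift {
    at: usize,
    delta: isize,
}

// Map a token offset in the filtered text back to the original text.
fn to_original(offset: usize, shifts: &[Shift]) -> usize {
    let mut adjusted = offset as isize;
    for s in shifts {
        if offset >= s.at {
            adjusted -= s.delta;
        }
    }
    adjusted as usize
}

fn main() {
    // "123-456" → hyphen removed → "123456": one byte lost at position 3
    let shifts = [Shift { at: 3, delta: -1 }];
    // token "456" starts at filtered offset 3, i.e. original offset 4
    println!("{}", to_original(3, &shifts));
}
```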

| Char Filter | Description |
|---|---|
| UnicodeNormalizationCharFilter | Unicode normalization (NFC, NFD, NFKC, NFKD) |
| MappingCharFilter | Replaces character sequences based on a mapping dictionary |
| PatternReplaceCharFilter | Replaces characters matching a regex pattern |
| JapaneseIterationMarkCharFilter | Expands Japanese iteration marks (踊り字) to their base characters |

UnicodeNormalizationCharFilter

Applies Unicode normalization to the input text. NFKC is recommended for search use cases because it normalizes both compatibility characters and composed forms.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};

let filter = UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC);
// "Ｓｏｎｙ" (fullwidth) → "Sony" (halfwidth)
// "㌂" → "アンペア"
}

| Form | Description |
|---|---|
| NFC | Canonical decomposition followed by canonical composition |
| NFD | Canonical decomposition |
| NFKC | Compatibility decomposition followed by canonical composition |
| NFKD | Compatibility decomposition |

MappingCharFilter

Replaces character sequences using a dictionary. Matches are found using the Aho-Corasick algorithm (leftmost-longest match).

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use laurus::analysis::char_filter::mapping::MappingCharFilter;

let mut mapping = HashMap::new();
mapping.insert("ph".to_string(), "f".to_string());
mapping.insert("qu".to_string(), "k".to_string());

let filter = MappingCharFilter::new(mapping)?;
// "phone queue" → "fone keue"
}
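
The leftmost-longest policy can be sketched naively in plain Rust (the real filter uses Aho-Corasick for efficiency, but the matching behavior is the same): at each position, try the longest key first; on a match, emit the replacement and skip past the matched key.

```rust
// Naive leftmost-longest replacement over a list of (key, replacement) pairs.
fn map_chars(input: &str, mapping: &[(&str, &str)]) -> String {
    let mut keys = mapping.to_vec();
    keys.sort_by_key(|(k, _)| std::cmp::Reverse(k.len())); // longest key first
    let mut out = String::new();
    let mut i = 0;
    while i < input.len() {
        if let Some(&(key, rep)) = keys.iter().find(|(k, _)| input[i..].starts_with(*k)) {
            out.push_str(rep); // match: emit replacement, skip the key
            i += key.len();
        } else {
            let ch = input[i..].chars().next().unwrap(); // no match: copy one char
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}

fn main() {
    let mapping = [("ph", "f"), ("qu", "k")];
    println!("{}", map_chars("phone queue", &mapping)); // "fone keue"
}
```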

PatternReplaceCharFilter

Replaces all occurrences of a regex pattern with a fixed string.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;

// Remove hyphens
let filter = PatternReplaceCharFilter::new(r"-", "")?;
// "123-456-789" → "123456789"

// Normalize numbers
let filter = PatternReplaceCharFilter::new(r"\d+", "NUM")?;
// "Year 2024" → "Year NUM"
}

JapaneseIterationMarkCharFilter

Expands Japanese iteration marks (踊り字) to their base characters. Supports kanji (々), hiragana (ゝ, ゞ), and katakana (ヽ, ヾ) iteration marks.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::japanese_iteration_mark::JapaneseIterationMarkCharFilter;

let filter = JapaneseIterationMarkCharFilter::new(
    true,  // normalize kanji iteration marks
    true,  // normalize kana iteration marks
);
// "佐々木" → "佐佐木"
// "いすゞ" → "いすず"
}

Using Char Filters in a Pipeline

Add char filters to a PipelineAnalyzer with add_char_filter(). Multiple char filters are applied in the order they are added, all before the tokenizer runs.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(
        UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC),
    ))
    .add_char_filter(Arc::new(
        PatternReplaceCharFilter::new(r"-", "")?,
    ))
    .add_filter(Arc::new(LowercaseFilter::new()));
// "Tokyo-2024" → NFKC → "Tokyo-2024" → remove hyphens → "Tokyo2024" → tokenize → lowercase → ["tokyo2024"]
}

Tokenizers

| Tokenizer | Description |
|---|---|
| RegexTokenizer | Unicode word boundaries; splits on whitespace and punctuation |
| UnicodeWordTokenizer | Splits on Unicode word boundaries |
| WhitespaceTokenizer | Splits on whitespace only |
| WholeTokenizer | Returns the entire input as a single token |
| LinderaTokenizer | Japanese morphological analysis (Lindera/MeCab) |
| NgramTokenizer | Generates n-gram tokens of configurable size |
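
For intuition, character n-grams of a single size can be generated like this in plain Rust (illustrative only; the real NgramTokenizer supports a configurable min/max size range):

```rust
// Character n-grams of a fixed size n, emitted left to right.
// Operates on chars, not bytes, so multibyte text is handled correctly.
fn ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    (0..=chars.len() - n)
        .map(|i| chars[i..i + n].iter().collect())
        .collect()
}

fn main() {
    println!("{:?}", ngrams("rust", 2)); // ["ru", "us", "st"]
}
```

N-grams trade index size for substring matching: "rus" matches any document containing those characters in sequence, which is useful for languages without word boundaries and for typo-tolerant search.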

Token Filters

| Filter | Description |
|---|---|
| LowercaseFilter | Converts tokens to lowercase |
| StopFilter | Removes common words ("the", "is", "a") |
| StemFilter | Reduces words to their root form ("running" → "run") |
| SynonymGraphFilter | Expands tokens with synonyms from a dictionary |
| BoostFilter | Adjusts token boost values |
| LimitFilter | Limits the number of tokens |
| StripFilter | Strips leading/trailing whitespace from tokens |
| FlattenGraphFilter | Flattens token graphs (for synonym expansion) |
| RemoveEmptyFilter | Removes empty tokens |

Synonym Expansion

The SynonymGraphFilter expands terms using a synonym dictionary:

#![allow(unused)]
fn main() {
use laurus::analysis::synonym::dictionary::SynonymDictionary;
use laurus::analysis::token_filter::synonym_graph::SynonymGraphFilter;

let mut dict = SynonymDictionary::new(None)?;
dict.add_synonym_group(vec!["ml".into(), "machine learning".into()]);
dict.add_synonym_group(vec!["ai".into(), "artificial intelligence".into()]);

// keep_original=true means original token is preserved alongside synonyms
let filter = SynonymGraphFilter::new(dict, true)
    .with_boost(0.8);  // synonyms get 80% weight
}

The boost parameter controls how much weight synonyms receive relative to original tokens. A value of 0.8 means synonym matches contribute 80% as much to the score as exact matches.

Embeddings

Embeddings convert text (or images) into dense numeric vectors that capture semantic meaning. Two texts with similar meanings produce vectors that are close together in vector space, enabling similarity-based search.

The Embedder Trait

All embedders implement the Embedder trait:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait Embedder: Send + Sync + Debug {
    async fn embed(&self, input: &EmbedInput<'_>) -> Result<Vector>;
    async fn embed_batch(&self, inputs: &[EmbedInput<'_>]) -> Result<Vec<Vector>>;
    fn supported_input_types(&self) -> Vec<EmbedInputType>;
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}
}

The embed() method returns a Vector (a struct wrapping Vec<f32>).

EmbedInput supports two modalities:

| Variant | Description |
| --- | --- |
| EmbedInput::Text(&str) | Text input |
| EmbedInput::Bytes(&[u8], Option<&str>) | Binary input with optional MIME type (for images) |

Built-in Embedders

CandleBertEmbedder

Runs a BERT model locally using Hugging Face Candle. No API key required.

Feature flag: embeddings-candle

#![allow(unused)]
fn main() {
use laurus::CandleBertEmbedder;

// Downloads model on first run (~80MB)
let embedder = CandleBertEmbedder::new(
    "sentence-transformers/all-MiniLM-L6-v2"  // model name
)?;
// Output: 384-dimensional vector
}

| Property | Value |
| --- | --- |
| Model | sentence-transformers/all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Runtime | Local (CPU) |
| First-run download | ~80 MB |

OpenAIEmbedder

Calls the OpenAI Embeddings API. Requires an API key.

Feature flag: embeddings-openai

#![allow(unused)]
fn main() {
use laurus::OpenAIEmbedder;

let embedder = OpenAIEmbedder::new(
    api_key,
    "text-embedding-3-small".to_string()
).await?;
// Output: 1536-dimensional vector
}

| Property | Value |
| --- | --- |
| Model | text-embedding-3-small (or any OpenAI model) |
| Dimensions | 1536 (for text-embedding-3-small) |
| Runtime | Remote API call |
| Requires | OPENAI_API_KEY environment variable |

CandleClipEmbedder

Runs a CLIP model locally for multimodal (text + image) embeddings.

Feature flag: embeddings-multimodal

#![allow(unused)]
fn main() {
use laurus::CandleClipEmbedder;

let embedder = CandleClipEmbedder::new(
    "openai/clip-vit-base-patch32"
)?;
// Text or images → 512-dimensional vector
}

| Property | Value |
| --- | --- |
| Model | openai/clip-vit-base-patch32 |
| Dimensions | 512 |
| Input types | Text AND images |
| Use case | Text-to-image search, image-to-image search |

PrecomputedEmbedder

Use pre-computed vectors directly without any embedding computation. Useful when vectors are generated externally.

#![allow(unused)]
fn main() {
use laurus::PrecomputedEmbedder;

let embedder = PrecomputedEmbedder::new();  // no parameters needed
}

When using PrecomputedEmbedder, you provide vectors directly in documents instead of text for embedding:

#![allow(unused)]
fn main() {
let doc = Document::builder()
    .add_vector("embedding", vec![0.1, 0.2, 0.3, ...])
    .build();
}

PerFieldEmbedder

PerFieldEmbedder routes embedding requests to field-specific embedders:

graph LR
    PFE["PerFieldEmbedder"]
    PFE -->|"text_vec"| BERT["CandleBertEmbedder\n(384 dim)"]
    PFE -->|"image_vec"| CLIP["CandleClipEmbedder\n(512 dim)"]
    PFE -->|other fields| DEF["Default Embedder"]

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::{CandleBertEmbedder, CandleClipEmbedder, Engine, PerFieldEmbedder};

let bert = Arc::new(CandleBertEmbedder::new("...")?);
let clip = Arc::new(CandleClipEmbedder::new("...")?);

let mut per_field = PerFieldEmbedder::new(bert.clone());
per_field.add_embedder("text_vec", bert.clone());
per_field.add_embedder("image_vec", clip.clone());

let engine = Engine::builder(storage, schema)
    .embedder(Arc::new(per_field))
    .build()
    .await?;
}

This is especially useful when:

  • Different vector fields need different models (e.g., BERT for text, CLIP for images)
  • Different fields have different vector dimensions
  • You want to mix local and remote embedders

How Embeddings Are Used

At Index Time

When you add a text value to a vector field, the engine automatically embeds it:

#![allow(unused)]
fn main() {
let doc = Document::builder()
    .add_text("text_vec", "Rust is a systems programming language")
    .build();
engine.add_document("doc-1", doc).await?;
// The embedder converts the text to a vector before indexing
}

At Search Time

When you search with text, the engine embeds the query text as well:

#![allow(unused)]
fn main() {
// Builder API
let request = VectorSearchRequestBuilder::new()
    .add_text("text_vec", "systems programming")
    .build();

// Query DSL
let request = vector_parser.parse(r#"text_vec:~"systems programming""#).await?;
}

Both approaches embed the query text using the same embedder that was used at index time, ensuring consistent vector spaces.

Choosing an Embedder

| Scenario | Recommended Embedder |
| --- | --- |
| Quick prototyping, offline use | CandleBertEmbedder |
| Production with high accuracy | OpenAIEmbedder |
| Text + image search | CandleClipEmbedder |
| Pre-computed vectors from external pipeline | PrecomputedEmbedder |
| Multiple models per field | PerFieldEmbedder wrapping others |

Storage

Laurus uses a pluggable storage layer that abstracts how and where index data is persisted. All components — lexical index, vector index, and document log — share a single storage backend.

The Storage Trait

All backends implement the Storage trait:

#![allow(unused)]
fn main() {
pub trait Storage: Send + Sync + Debug {
    fn loading_mode(&self) -> LoadingMode;
    fn open_input(&self, name: &str) -> Result<Box<dyn StorageInput>>;
    fn create_output(&self, name: &str) -> Result<Box<dyn StorageOutput>>;
    fn file_exists(&self, name: &str) -> bool;
    fn delete_file(&self, name: &str) -> Result<()>;
    fn list_files(&self) -> Result<Vec<String>>;
    fn file_size(&self, name: &str) -> Result<u64>;
    // ... additional methods
}
}

This interface is file-oriented: all data (index segments, metadata, WAL entries, documents) is stored as named files accessed through streaming StorageInput / StorageOutput handles.
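To make the file-oriented model concrete, here is a toy in-memory analog of such a backend: named files stored as byte buffers. This is an illustrative sketch only; the real Storage trait uses streaming input/output handles, loading modes, and the library's Result type rather than these simplified signatures.

```rust
use std::collections::HashMap;

// Toy file-oriented storage: every piece of data is a named byte buffer.
#[derive(Default, Debug)]
struct ToyStorage {
    files: HashMap<String, Vec<u8>>,
}

impl ToyStorage {
    // Write (or overwrite) a named file.
    fn create_output(&mut self, name: &str, data: &[u8]) {
        self.files.insert(name.to_string(), data.to_vec());
    }

    // Read a named file back, if it exists.
    fn open_input(&self, name: &str) -> Option<&[u8]> {
        self.files.get(name).map(|v| v.as_slice())
    }

    fn file_exists(&self, name: &str) -> bool {
        self.files.contains_key(name)
    }

    fn list_files(&self) -> Vec<String> {
        let mut names: Vec<String> = self.files.keys().cloned().collect();
        names.sort();
        names
    }

    fn file_size(&self, name: &str) -> Option<u64> {
        self.files.get(name).map(|v| v.len() as u64)
    }
}

fn main() {
    let mut storage = ToyStorage::default();
    storage.create_output("segments/seg-001.dict", b"term data");
    assert!(storage.file_exists("segments/seg-001.dict"));
    assert_eq!(storage.file_size("segments/seg-001.dict"), Some(9));
    assert_eq!(storage.open_input("missing"), None);
    println!("{:?}", storage.list_files());
}
```

Because every backend reduces to this "named files of bytes" model, the same index segments can be written to memory, plain files, or memory-mapped files without the index code changing.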

Storage Backends

MemoryStorage

All data lives in memory. Fast and simple, but not durable.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::memory::MemoryStorage;

let storage: Arc<dyn Storage> = Arc::new(
    MemoryStorage::new(Default::default())
);
}

| Property | Value |
| --- | --- |
| Durability | None (data lost on process exit) |
| Speed | Fastest |
| Use case | Testing, prototyping, ephemeral data |

FileStorage

Standard file-system based persistence. Each key maps to a file on disk.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::file::{FileStorage, FileStorageConfig};

let config = FileStorageConfig::new("/tmp/laurus-data");
let storage: Arc<dyn Storage> = Arc::new(FileStorage::new("/tmp/laurus-data", config)?);
}

| Property | Value |
| --- | --- |
| Durability | Full (persisted to disk) |
| Speed | Moderate (disk I/O) |
| Use case | General production use |

FileStorage with Memory Mapping

FileStorage supports memory-mapped file access via the use_mmap configuration flag. When enabled, the OS manages paging between memory and disk.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::file::{FileStorage, FileStorageConfig};

let mut config = FileStorageConfig::new("/tmp/laurus-data");
config.use_mmap = true;  // enable memory-mapped I/O
let storage: Arc<dyn Storage> = Arc::new(FileStorage::new("/tmp/laurus-data", config)?);
}

| Property | Value |
| --- | --- |
| Durability | Full (persisted to disk) |
| Speed | Fast (OS-managed memory mapping) |
| Use case | Large datasets, read-heavy workloads |

StorageFactory

You can also create storage via configuration:

#![allow(unused)]
fn main() {
use laurus::storage::{StorageConfig, StorageFactory};
use laurus::storage::memory::MemoryStorageConfig;

let storage = StorageFactory::create(
    StorageConfig::Memory(MemoryStorageConfig::default())
)?;
}

PrefixedStorage

The engine uses PrefixedStorage to isolate components within a single storage backend:

graph TB
    E["Engine"]
    E --> P1["PrefixedStorage\nprefix = 'lexical/'"]
    E --> P2["PrefixedStorage\nprefix = 'vector/'"]
    E --> P3["PrefixedStorage\nprefix = 'documents/'"]
    P1 --> S["Storage Backend"]
    P2 --> S
    P3 --> S

When the lexical store writes a key segments/seg-001.dict, it is actually stored as lexical/segments/seg-001.dict in the underlying backend. This ensures no key collisions between components.

You do not need to create PrefixedStorage yourself — the EngineBuilder handles this automatically.
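The prefixing idea itself is simple enough to sketch in a few lines. The helper names below are hypothetical, not the real PrefixedStorage API; they just show the key translation performed on every read and write.

```rust
// Translate a component-local key into the key the backend actually sees.
fn prefixed_key(prefix: &str, name: &str) -> String {
    format!("{prefix}{name}")
}

// Reverse translation, used when listing files so each component
// only sees keys inside its own namespace.
fn strip_prefix<'a>(prefix: &'a str, full: &'a str) -> Option<&'a str> {
    full.strip_prefix(prefix)
}

fn main() {
    // The lexical store writes "segments/seg-001.dict"; the shared backend
    // stores it under "lexical/segments/seg-001.dict".
    let stored = prefixed_key("lexical/", "segments/seg-001.dict");
    assert_eq!(stored, "lexical/segments/seg-001.dict");

    // When listing, the prefix is stripped again.
    assert_eq!(strip_prefix("lexical/", &stored), Some("segments/seg-001.dict"));

    // Keys from another component's namespace are filtered out entirely.
    assert_eq!(strip_prefix("vector/", &stored), None);
}
```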

Choosing a Backend

| Factor | MemoryStorage | FileStorage | FileStorage (mmap) |
| --- | --- | --- | --- |
| Durability | None | Full | Full |
| Read speed | Fastest | Moderate | Fast |
| Write speed | Fastest | Moderate | Moderate |
| Memory usage | Proportional to data size | Low | OS-managed |
| Max data size | Limited by RAM | Limited by disk | Limited by disk + address space |
| Best for | Tests, small datasets | General use | Large read-heavy datasets |

Recommendations

  • Development / Testing: Use MemoryStorage for fast iteration without file cleanup
  • Production (general): Use FileStorage for reliable persistence
  • Production (large scale): Use FileStorage with use_mmap = true when you have large indexes and want to leverage OS page cache

Next Steps

Indexing

This section explains how Laurus stores and organizes data internally. Understanding the indexing layer will help you choose the right field types and tune performance.

Topics

Lexical Indexing

How text, numeric, and geographic fields are indexed using an inverted index. Covers:

  • The inverted index structure (term dictionary, posting lists)
  • BKD trees for numeric range queries
  • Segment files and their formats
  • BM25 scoring

Vector Indexing

How vector fields are indexed for approximate nearest neighbor search. Covers:

  • Index types: Flat, HNSW, IVF
  • Parameter tuning (m, ef_construction, n_clusters, n_probe)
  • Distance metrics (Cosine, Euclidean, DotProduct)
  • Quantization (SQ8, PQ)

Lexical Indexing

Lexical indexing powers keyword-based search. When a document’s text field is indexed, Laurus builds an inverted index — a data structure that maps terms to the documents containing them.

How Lexical Indexing Works

sequenceDiagram
    participant Doc as Document
    participant Analyzer
    participant Writer as IndexWriter
    participant Seg as Segment

    Doc->>Analyzer: "The quick brown fox"
    Analyzer->>Analyzer: Tokenize + Filter
    Analyzer-->>Writer: ["quick", "brown", "fox"]
    Writer->>Writer: Buffer in memory
    Writer->>Seg: Flush to segment on commit()

Step by Step

  1. Analyze: The text passes through the configured analyzer (tokenizer + filters), producing a stream of normalized terms
  2. Buffer: Terms are stored in an in-memory write buffer, organized by field
  3. Commit: On commit(), the buffer is flushed to a new segment on storage

The Inverted Index

An inverted index is essentially a map from terms to document lists:

graph LR
    subgraph "Term Dictionary"
        T1["'brown'"]
        T2["'fox'"]
        T3["'quick'"]
        T4["'rust'"]
    end

    subgraph "Posting Lists"
        P1["doc_1, doc_3"]
        P2["doc_1"]
        P3["doc_1, doc_2"]
        P4["doc_2, doc_3"]
    end

    T1 --> P1
    T2 --> P2
    T3 --> P3
    T4 --> P4

| Component | Description |
| --- | --- |
| Term Dictionary | Sorted list of all unique terms in the index; supports fast prefix lookup |
| Posting Lists | For each term, a list of document IDs and metadata (term frequency, positions) |
| Doc Values | Column-oriented storage for sort/filter operations on numeric and date fields |

Posting List Contents

Each entry in a posting list contains:

| Field | Description |
| --- | --- |
| Document ID | Internal u64 identifier |
| Term Frequency | How many times the term appears in this document |
| Positions (optional) | Where in the document the term appears (needed for phrase queries) |
| Weight | Score weight for this posting |
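A posting entry can be pictured as a small struct mirroring that table. The field names here are illustrative, not Laurus's internal representation:

```rust
// Hypothetical shape of one posting-list entry.
#[derive(Debug)]
struct Posting {
    doc_id: u64,                 // internal document identifier
    term_frequency: u32,         // occurrences of the term in this document
    positions: Option<Vec<u32>>, // token positions; only stored when phrase queries are enabled
    weight: f32,                 // score weight for this posting
}

fn main() {
    // "quick" appears twice in doc 1, at token positions 1 and 7.
    let p = Posting {
        doc_id: 1,
        term_frequency: 2,
        positions: Some(vec![1, 7]),
        weight: 1.0,
    };
    // The number of recorded positions matches the term frequency.
    assert_eq!(
        p.positions.as_ref().map(|v| v.len()),
        Some(p.term_frequency as usize)
    );
    println!("{:?}", p);
}
```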

Numeric and Date Fields

Integer, float, and datetime fields are indexed using a BKD tree — a space-partitioning data structure optimized for range queries:

graph TB
    Root["BKD Root"]
    Root --> L["values < 50"]
    Root --> R["values >= 50"]
    L --> LL["values < 25"]
    L --> LR["25 <= values < 50"]
    R --> RL["50 <= values < 75"]
    R --> RR["values >= 75"]

BKD trees allow efficient evaluation of range queries like price:[10 TO 100] or date:[2024-01-01 TO 2024-12-31].

Geo Fields

Geographic fields store latitude/longitude pairs. They are indexed using a spatial data structure that supports:

  • Radius queries: find all points within N kilometers of a center point
  • Bounding box queries: find all points within a rectangular area

Segments

The lexical index is organized into segments. Each segment is an immutable, self-contained mini-index:

graph TB
    LI["Lexical Index"]
    LI --> S1["Segment 0"]
    LI --> S2["Segment 1"]
    LI --> S3["Segment 2"]

    S1 --- F1[".dict (terms)"]
    S1 --- F2[".post (postings)"]
    S1 --- F3[".bkd (numerics)"]
    S1 --- F4[".docs (doc store)"]
    S1 --- F5[".dv (doc values)"]
    S1 --- F6[".meta (metadata)"]
    S1 --- F7[".lens (field lengths)"]

| File Extension | Contents |
| --- | --- |
| .dict | Term dictionary (sorted terms + metadata offsets) |
| .post | Posting lists (document IDs, term frequencies, positions) |
| .bkd | BKD tree data for numeric and date fields |
| .docs | Stored field values (the original document content) |
| .dv | Doc values for sorting and filtering |
| .meta | Segment metadata (doc count, term count, etc.) |
| .lens | Field length norms (for BM25 scoring) |

Segment Lifecycle

  1. Create: A new segment is created each time commit() is called
  2. Search: All segments are searched in parallel and results are merged
  3. Merge: Periodically, multiple small segments are merged into larger ones to improve query performance
  4. Delete: When a document is deleted, its ID is added to a deletion bitmap rather than physically removed (see Deletions & Compaction)

BM25 Scoring

Laurus uses the BM25 algorithm to score lexical search results. BM25 considers:

  • Term Frequency (TF): how often the term appears in the document (more = better, with diminishing returns)
  • Inverse Document Frequency (IDF): how rare the term is across all documents (rarer = more important)
  • Field Length Normalization: shorter fields are boosted relative to longer ones

The formula:

score(q, d) = IDF(q) * (TF(q, d) * (k1 + 1)) / (TF(q, d) + k1 * (1 - b + b * |d| / avgdl))

Where k1 = 1.2 and b = 0.75 are the default tuning parameters.
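The formula above transcribes directly into code. The sketch below scores a single query term using the common BM25 IDF, ln(1 + (N - df + 0.5) / (df + 0.5)); the exact IDF variant Laurus uses internally may differ slightly.

```rust
// Single-term BM25 score with the default tuning parameters.
fn bm25_term_score(tf: f64, doc_len: f64, avgdl: f64, n_docs: f64, doc_freq: f64) -> f64 {
    let k1 = 1.2;
    let b = 0.75;
    // Rarer terms (smaller doc_freq) get a larger IDF.
    let idf = (1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln();
    // Term frequency saturates; field length normalizes via |d| / avgdl.
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avgdl))
}

fn main() {
    // TF has diminishing returns: tf=10 scores higher than tf=1,
    // but far less than 10x higher.
    let s1 = bm25_term_score(1.0, 100.0, 100.0, 1000.0, 50.0);
    let s10 = bm25_term_score(10.0, 100.0, 100.0, 1000.0, 50.0);
    assert!(s10 > s1);
    assert!(s10 < 10.0 * s1);

    // Rarer terms are worth more: df=5 beats df=50 at the same tf.
    assert!(bm25_term_score(1.0, 100.0, 100.0, 1000.0, 5.0) > s1);

    // Shorter documents are boosted relative to longer ones.
    assert!(bm25_term_score(1.0, 50.0, 100.0, 1000.0, 50.0) > s1);
}
```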

SIMD Optimization

Vector distance calculations leverage SIMD (Single Instruction, Multiple Data) instructions when available, providing significant speedups for similarity computations in vector search.

Code Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_integer_field("year", IntegerOption::default())
        .build();

    let engine = Engine::builder(storage, schema).build().await?;

    // Index documents
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("body", "Rust is a systems programming language.")
        .add_integer("year", 2024)
        .build()
    ).await?;

    // Commit to flush segments to storage
    engine.commit().await?;

    Ok(())
}

Next Steps

Vector Indexing

Vector indexing powers similarity-based search. When a document’s vector field is indexed, Laurus stores the embedding vector in a specialized index structure that enables fast approximate nearest neighbor (ANN) retrieval.

How Vector Indexing Works

sequenceDiagram
    participant Doc as Document
    participant Embedder
    participant Normalize as Normalizer
    participant Index as Vector Index

    Doc->>Embedder: "Rust is a systems language"
    Embedder-->>Normalize: [0.12, -0.45, 0.78, ...]
    Normalize->>Normalize: L2 normalize
    Normalize-->>Index: [0.14, -0.52, 0.90, ...]
    Index->>Index: Insert into index structure

Step by Step

  1. Embed: The text (or image) is converted to a vector by the configured embedder
  2. Normalize: The vector is L2-normalized (for cosine similarity)
  3. Index: The vector is inserted into the configured index structure (Flat, HNSW, or IVF)
  4. Commit: On commit(), the index is flushed to persistent storage

Index Types

Laurus supports three vector index types, each with different performance characteristics:

Comparison

| Property | Flat | HNSW | IVF |
| --- | --- | --- | --- |
| Accuracy | 100% (exact) | ~95-99% (approximate) | ~90-98% (approximate) |
| Search speed | O(n) linear scan | O(log n) graph walk | O(n/k) cluster scan |
| Memory usage | Low | Higher (graph edges) | Moderate (centroids) |
| Index build time | Fast | Moderate | Slower (clustering) |
| Best for | < 10K vectors | 10K - 10M vectors | > 1M vectors |

Flat Index

The simplest index. Compares the query vector against every stored vector (brute-force).

#![allow(unused)]
fn main() {
use laurus::vector::FlatOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = FlatOption {
    dimension: 384,
    distance: DistanceMetric::Cosine,
    ..Default::default()
};
}

  • Pros: 100% recall (exact results), simple, low memory
  • Cons: Slow for large datasets (linear scan)
  • Use when: You have fewer than ~10,000 vectors, or you need exact results

HNSW Index

Hierarchical Navigable Small World graph. The default and most commonly used index type.

graph TB
    subgraph "Layer 2 (sparse)"
        A2["A"] --- C2["C"]
    end

    subgraph "Layer 1 (medium)"
        A1["A"] --- B1["B"]
        A1 --- C1["C"]
        B1 --- D1["D"]
        C1 --- D1
    end

    subgraph "Layer 0 (dense - all vectors)"
        A0["A"] --- B0["B"]
        A0 --- C0["C"]
        B0 --- D0["D"]
        B0 --- E0["E"]
        C0 --- D0
        C0 --- F0["F"]
        D0 --- E0
        E0 --- F0
    end

    A2 -.->|"entry point"| A1
    A1 -.-> A0
    C2 -.-> C1
    C1 -.-> C0
    B1 -.-> B0
    D1 -.-> D0

The HNSW algorithm searches from the top (sparse) layer down to the bottom (dense) layer, narrowing the search space at each level.

#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = HnswOption {
    dimension: 384,
    distance: DistanceMetric::Cosine,
    m: 16,                  // max connections per node per layer
    ef_construction: 200,   // search width during index building
    ..Default::default()
};
}

HNSW Parameters

| Parameter | Default | Description | Impact |
| --- | --- | --- | --- |
| m | 16 | Max bi-directional connections per layer | Higher = better recall, more memory |
| ef_construction | 200 | Search width during index building | Higher = better recall, slower build |
| dimension | 128 | Vector dimensions | Must match embedder output |
| distance | Cosine | Distance metric | See Distance Metrics below |

Tuning tips:

  • Increase m (e.g., 32 or 64) for higher recall at the cost of memory
  • Increase ef_construction (e.g., 400) for better index quality at the cost of build time
  • At search time, the ef_search parameter (set in the search request) controls the search width

IVF Index

Inverted File Index. Partitions vectors into clusters, then only searches relevant clusters.

graph TB
    Q["Query Vector"]
    Q --> C1["Cluster 1\n(centroid)"]
    Q --> C2["Cluster 2\n(centroid)"]

    C1 --> V1["vec_3"]
    C1 --> V2["vec_7"]
    C1 --> V3["vec_12"]

    C2 --> V4["vec_1"]
    C2 --> V5["vec_9"]
    C2 --> V6["vec_15"]

    style C1 fill:#f9f,stroke:#333
    style C2 fill:#f9f,stroke:#333

#![allow(unused)]
fn main() {
use laurus::vector::IvfOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = IvfOption {
    dimension: 384,
    distance: DistanceMetric::Cosine,
    n_clusters: 100,   // number of clusters
    n_probe: 10,       // clusters to search at query time
    ..Default::default()
};
}

IVF Parameters

| Parameter | Default | Description | Impact |
| --- | --- | --- | --- |
| n_clusters | 100 | Number of Voronoi cells | More clusters = faster search, lower recall |
| n_probe | 1 | Clusters to search at query time | Higher = better recall, slower search |
| dimension | (required) | Vector dimensions | Must match embedder output |
| distance | Cosine | Distance metric | See Distance Metrics below |

Tuning tips:

  • Set n_clusters to roughly sqrt(n) where n is the number of vectors
  • Set n_probe to 5-20% of n_clusters for a good recall/speed trade-off
  • IVF requires a training phase — initial indexing may be slower
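
The rules of thumb above can be expressed as a small heuristic. This is an illustrative helper, not a Laurus API; it picks n_clusters ≈ sqrt(n) and an n_probe of 10%, which sits inside the suggested 5-20% band.

```rust
// Suggest (n_clusters, n_probe) for an IVF index over n_vectors vectors.
fn suggested_ivf_params(n_vectors: usize) -> (usize, usize) {
    // n_clusters ≈ sqrt(n), rounded to the nearest integer.
    let n_clusters = (n_vectors as f64).sqrt().round() as usize;
    // n_probe at 10% of n_clusters, but always at least 1.
    let n_probe = (n_clusters / 10).max(1);
    (n_clusters, n_probe)
}

fn main() {
    // 1M vectors → 1000 clusters, probe 100 of them per query.
    assert_eq!(suggested_ivf_params(1_000_000), (1000, 100));
    // 10K vectors → 100 clusters, probe 10.
    assert_eq!(suggested_ivf_params(10_000), (100, 10));
    // Tiny datasets still probe at least one cluster.
    assert_eq!(suggested_ivf_params(16), (4, 1));
}
```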

Distance Metrics

| Metric | Description | Range | Best For |
| --- | --- | --- | --- |
| Cosine | 1 - cosine similarity | [0, 2] | Text embeddings (most common) |
| Euclidean | L2 distance | [0, +inf) | Spatial data |
| Manhattan | L1 distance | [0, +inf) | Feature vectors |
| DotProduct | Negative inner product | (-inf, +inf) | Pre-normalized vectors |
| Angular | Angular distance | [0, pi] | Directional similarity |

#![allow(unused)]
fn main() {
use laurus::vector::core::distance::DistanceMetric;

let metric = DistanceMetric::Cosine;      // Default for text
let metric = DistanceMetric::Euclidean;    // For spatial data
let metric = DistanceMetric::Manhattan;    // L1 distance
let metric = DistanceMetric::DotProduct;   // For pre-normalized vectors
let metric = DistanceMetric::Angular;      // Angular distance
}

Note: For cosine similarity, vectors are automatically L2-normalized before indexing. Lower distance = more similar.
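The note above is worth seeing in code: once vectors are L2-normalized, cosine distance reduces to 1 - dot(a, b). A plain (non-SIMD) sketch:

```rust
// Scale a vector to unit length (L2 norm = 1).
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

// Cosine distance in [0, 2]: 0 = same direction, 1 = orthogonal, 2 = opposite.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let (a, b) = (l2_normalize(a), l2_normalize(b));
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    1.0 - dot
}

fn main() {
    // Same direction (magnitude is irrelevant after normalization) → distance ~0.
    assert!(cosine_distance(&[1.0, 0.0], &[2.0, 0.0]).abs() < 1e-6);
    // Orthogonal vectors → distance ~1.
    assert!((cosine_distance(&[1.0, 0.0], &[0.0, 3.0]) - 1.0).abs() < 1e-6);
    // Opposite direction → distance ~2.
    assert!((cosine_distance(&[1.0, 0.0], &[-1.0, 0.0]) - 2.0).abs() < 1e-6);
}
```

Normalizing at index time means the dot product alone suffices at query time, which is exactly the operation SIMD accelerates well.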

Quantization

Quantization reduces memory usage by compressing vectors at the cost of some accuracy:

| Method | Enum Variant | Description | Memory Reduction |
| --- | --- | --- | --- |
| Scalar 8-bit | Scalar8Bit | Scalar quantization to 8-bit integers | ~4x |
| Product Quantization | ProductQuantization { subvector_count } | Splits vectors into sub-vectors and quantizes each | ~16-64x |

#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::quantization::QuantizationMethod;

let opt = HnswOption {
    dimension: 384,
    quantizer: Some(QuantizationMethod::Scalar8Bit),
    ..Default::default()
};
}

Segment Files

Each vector index type stores its data in a single segment file:

| Index Type | File Extension | Contents |
| --- | --- | --- |
| HNSW | .hnsw | Graph structure, vectors, and metadata |
| Flat | .flat | Raw vectors and metadata |
| IVF | .ivf | Cluster centroids, assigned vectors, and metadata |

Code Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_hnsw_field("embedding", HnswOption {
            dimension: 384,
            distance: DistanceMetric::Cosine,
            m: 16,
            ef_construction: 200,
            ..Default::default()
        })
        .build();

    // With an embedder, text in vector fields is automatically embedded
    let engine = Engine::builder(storage, schema)
        .embedder(my_embedder)
        .build()
        .await?;

    // Add text to the vector field — it will be embedded automatically
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("embedding", "Rust is a systems programming language.")
        .build()
    ).await?;

    engine.commit().await?;

    Ok(())
}

Next Steps

Search

This section covers how to query your indexed data. Laurus supports three search modes that can be used independently or combined.

Topics

Lexical Search

Keyword-based search using an inverted index. Covers:

  • All query types: Term, Phrase, Boolean, Fuzzy, Wildcard, Range, Geo, Span
  • BM25 scoring and field boosts
  • Using the Query DSL for text-based queries

Vector Search

Semantic similarity search using vector embeddings. Covers:

  • VectorSearchRequestBuilder API
  • Multi-field vector search and score modes
  • Filtered vector search

Hybrid Search

Combining lexical and vector search for best-of-both-worlds results. Covers:

  • SearchRequestBuilder API
  • Fusion algorithms (RRF, WeightedSum)
  • Filtered hybrid search
  • Pagination with offset/limit

Spelling Correction

Suggest corrections for misspelled query terms. Covers:

  • SpellingCorrector and “Did you mean?” features
  • Custom dictionaries and configuration
  • Learning from index terms and user queries

Lexical Search

Lexical search finds documents by matching keywords against an inverted index. Laurus provides a rich set of query types that cover exact matching, phrase matching, fuzzy matching, and more.

Basic Usage

#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest};
use laurus::lexical::TermQuery;

let request = SearchRequestBuilder::new()
    .lexical_search_request(
        LexicalSearchRequest::new(
            Box::new(TermQuery::new("body", "rust"))
        )
    )
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

Query Types

TermQuery

Matches documents containing an exact term in a specific field.

#![allow(unused)]
fn main() {
use laurus::lexical::TermQuery;

// Find documents where "body" contains the term "rust"
let query = TermQuery::new("body", "rust");
}

Note: Terms are matched after analysis. If the field uses StandardAnalyzer, both the indexed text and the query term are lowercased, so TermQuery::new("body", "rust") will match “Rust” in the original text.

PhraseQuery

Matches documents containing an exact sequence of terms.

#![allow(unused)]
fn main() {
use laurus::lexical::query::phrase::PhraseQuery;

// Find documents containing the exact phrase "machine learning"
let query = PhraseQuery::new("body", vec!["machine".to_string(), "learning".to_string()]);

// Or use the convenience method from a phrase string:
let query = PhraseQuery::from_phrase("body", "machine learning");
}

Phrase queries require term positions to be stored (the default for TextOption).

BooleanQuery

Combines multiple queries with boolean logic.

#![allow(unused)]
fn main() {
use laurus::lexical::TermQuery;
use laurus::lexical::query::boolean::{BooleanQuery, BooleanQueryBuilder, Occur};

let query = BooleanQueryBuilder::new()
    .must(Box::new(TermQuery::new("body", "rust")))       // AND
    .must(Box::new(TermQuery::new("body", "programming"))) // AND
    .must_not(Box::new(TermQuery::new("body", "python")))  // NOT
    .build();
}

| Occur | Meaning | DSL Equivalent |
| --- | --- | --- |
| Must | Document MUST match | +term or AND |
| Should | Document SHOULD match (boosts score) | term or OR |
| MustNot | Document MUST NOT match | -term or NOT |
| Filter | MUST match, but does not affect score | (no DSL equivalent) |

FuzzyQuery

Matches terms within a specified edit distance (Levenshtein distance).

#![allow(unused)]
fn main() {
use laurus::lexical::query::fuzzy::FuzzyQuery;

// Find documents matching "programing" within edit distance 2
// This will match "programming", "programing", etc.
let query = FuzzyQuery::new("body", "programing");  // default max_edits = 2
}
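To make "edit distance 2" concrete, here is a standard Levenshtein distance sketch (insertions, deletions, substitutions). This is for illustration; a real search engine typically matches fuzzy terms with a Levenshtein automaton over the term dictionary rather than computing the distance pairwise.

```rust
// Classic dynamic-programming Levenshtein distance, one row at a time.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // "programing" is one insertion away from "programming",
    // so it matches under the default max_edits = 2.
    assert_eq!(levenshtein("programing", "programming"), 1);
    assert_eq!(levenshtein("color", "colour"), 1);
    assert_eq!(levenshtein("rust", "rust"), 0);
}
```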

WildcardQuery

Matches terms using wildcard patterns.

#![allow(unused)]
fn main() {
use laurus::lexical::query::wildcard::WildcardQuery;

// '?' matches exactly one character, '*' matches zero or more
let query = WildcardQuery::new("filename", "*.pdf")?;
let query = WildcardQuery::new("body", "pro*")?;
let query = WildcardQuery::new("body", "col?r")?;  // matches "color" and "colour"
}

PrefixQuery

Matches documents containing terms that start with a specific prefix.

#![allow(unused)]
fn main() {
use laurus::lexical::query::prefix::PrefixQuery;

// Find documents where "body" contains terms starting with "pro"
// This matches "programming", "program", "production", etc.
let query = PrefixQuery::new("body", "pro");
}

RegexpQuery

Matches documents containing terms that match a regular expression pattern.

#![allow(unused)]
fn main() {
use laurus::lexical::query::regexp::RegexpQuery;

// Find documents where "body" contains terms matching the regex
let query = RegexpQuery::new("body", "^pro.*ing$")?;

// Match version-like patterns
let query = RegexpQuery::new("version", r"^v\d+\.\d+")?;
}

Note: RegexpQuery::new() returns Result because the regex pattern is validated at construction time. Invalid patterns will produce an error.

NumericRangeQuery

Matches documents with numeric field values within a range.

#![allow(unused)]
fn main() {
use laurus::lexical::NumericRangeQuery;
use laurus::lexical::core::field::NumericType;

// Find documents where "price" is between 10.0 and 100.0 (inclusive)
let query = NumericRangeQuery::new(
    "price",
    NumericType::Float,
    Some(10.0),   // min
    Some(100.0),  // max
    true,         // include min
    true,         // include max
);

// Open-ended range: price >= 50
let query = NumericRangeQuery::new(
    "price",
    NumericType::Float,
    Some(50.0),
    None,     // no upper bound
    true,
    false,
);
}

GeoQuery

Matches documents by geographic location.

#![allow(unused)]
fn main() {
use laurus::lexical::query::geo::GeoQuery;

// Find documents within 10km of Tokyo Station (35.6812, 139.7671)
let query = GeoQuery::within_radius("location", 35.6812, 139.7671, 10.0)?; // radius in kilometers

// Find documents within a bounding box (min_lat, min_lon, max_lat, max_lon)
let query = GeoQuery::within_bounding_box(
    "location",
    35.0, 139.0,  // min (lat, lon)
    36.0, 140.0,  // max (lat, lon)
)?;
}

SpanQuery

Matches terms based on their proximity within a document. Use SpanTermQuery and SpanNearQuery to build proximity queries:

#![allow(unused)]
fn main() {
use laurus::lexical::query::span::{SpanQuery, SpanTermQuery, SpanNearQuery};

// Find documents where "quick" appears near "fox" (within 3 positions)
let query = SpanNearQuery::new(
    "body",
    vec![
        Box::new(SpanTermQuery::new("body", "quick")) as Box<dyn SpanQuery>,
        Box::new(SpanTermQuery::new("body", "fox")) as Box<dyn SpanQuery>,
    ],
    3,    // slop (max distance between terms)
    true, // in_order (terms must appear in order)
);
}

Scoring

Lexical search results are scored using BM25. The score reflects how relevant a document is to the query:

  • Higher term frequency in the document increases the score
  • Rarer terms across the index increase the score
  • Shorter documents are boosted relative to longer ones

Field Boosts

You can boost specific fields to influence relevance:

#![allow(unused)]
fn main() {
use laurus::LexicalSearchRequest;

let mut request = LexicalSearchRequest::new(Box::new(query));
request.field_boosts.insert("title".to_string(), 2.0);  // title matches count double
request.field_boosts.insert("body".to_string(), 1.0);
}

LexicalSearchRequest Options

| Option | Default | Description |
| --- | --- | --- |
| query | (required) | The query to execute |
| limit | 10 | Maximum number of results |
| load_documents | true | Whether to load full document content |
| min_score | 0.0 | Minimum score threshold |
| timeout_ms | None | Search timeout in milliseconds |
| parallel | false | Enable parallel search across segments |
| sort_by | Score | Sort by relevance score, or by a field (asc / desc) |
| field_boosts | empty | Per-field score multipliers |

Builder Methods

LexicalSearchRequest supports a builder-style API for setting options:

#![allow(unused)]
fn main() {
use laurus::LexicalSearchRequest;
use laurus::lexical::TermQuery;

let request = LexicalSearchRequest::new(Box::new(TermQuery::new("body", "rust")))
    .limit(20)
    .min_score(0.5)
    .timeout_ms(5000)
    .parallel(true)
    .sort_by_field_desc("date")
    .with_field_boost("title", 2.0)
    .with_field_boost("body", 1.0);
}

Using the Query DSL

Instead of building queries programmatically, you can use the text-based Query DSL:

#![allow(unused)]
fn main() {
use laurus::lexical::QueryParser;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use std::sync::Arc;

let analyzer = Arc::new(StandardAnalyzer::default());
let parser = QueryParser::new(analyzer).with_default_field("body");

// Simple term
let query = parser.parse("rust")?;

// Boolean
let query = parser.parse("rust AND programming")?;

// Phrase
let query = parser.parse("\"machine learning\"")?;

// Field-specific
let query = parser.parse("title:rust AND body:programming")?;

// Fuzzy
let query = parser.parse("programing~2")?;

// Range
let query = parser.parse("year:[2020 TO 2024]")?;
}

See Query DSL for the complete syntax reference.

Next Steps

Vector Search

Vector search finds documents by semantic similarity. Instead of matching keywords, it compares the meaning of the query against document embeddings in vector space.

Basic Usage

Builder API

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::vector::VectorSearchRequestBuilder;

let request = SearchRequestBuilder::new()
    .vector_search_request(
        VectorSearchRequestBuilder::new()
            .add_text("embedding", "systems programming language")
            .limit(10)
            .build()
    )
    .build();

let results = engine.search(request).await?;
}

The add_text() method stores the text as a query payload. At search time, the engine embeds it using the configured embedder and then searches the vector index.

Query DSL

#![allow(unused)]
fn main() {
use laurus::vector::VectorQueryParser;

let parser = VectorQueryParser::new(embedder.clone())
    .with_default_field("embedding");

let request = parser.parse(r#"embedding:~"systems programming""#).await?;
}

VectorSearchRequestBuilder

The builder API provides fine-grained control:

#![allow(unused)]
fn main() {
use laurus::vector::VectorSearchRequestBuilder;
use laurus::vector::store::request::QueryVector;

let request = VectorSearchRequestBuilder::new()
    // Text query (will be embedded at search time)
    .add_text("text_vec", "machine learning")

    // Or use a pre-computed vector directly
    .add_vector("embedding", vec![0.1, 0.2, 0.3, /* ... */])

    // Search parameters
    .limit(20)

    .build();
}

Methods

| Method | Description |
|---|---|
| add_text(field, text) | Add a text query for a specific field (embedded at search time) |
| add_vector(field, vector) | Add a pre-computed query vector for a specific field |
| add_vector_with_weight(field, vector, weight) | Add a pre-computed vector with an explicit weight |
| add_payload(field, payload) | Add a generic DataValue payload to be embedded |
| add_bytes(field, bytes, mime) | Add a binary payload (e.g., image bytes for multimodal) |
| field(name) | Restrict search to a specific field |
| fields(names) | Restrict search to multiple fields |
| limit(n) | Maximum number of results (default: 10) |
| score_mode(VectorScoreMode) | Score combination mode (WeightedSum, MaxSim, LateInteraction) |
| min_score(f32) | Minimum score threshold (default: 0.0) |
| overfetch(f32) | Overfetch factor for better result quality (default: 1.0) |
| build() | Build the VectorSearchRequest |

You can search across multiple vector fields in a single request:

#![allow(unused)]
fn main() {
let request = VectorSearchRequestBuilder::new()
    .add_text("text_vec", "cute kitten")
    .add_text("image_vec", "fluffy cat")
    .build();
}

Each clause produces a vector that is searched against its respective field. Results are combined using the configured score mode.

Score Modes

| Mode | Description |
|---|---|
| WeightedSum (default) | Sum of (similarity * weight) across all clauses |
| MaxSim | Maximum similarity score across clauses |
| LateInteraction | ColBERT-style late interaction scoring |
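The difference between the first two modes can be sketched with standalone functions over per-clause (similarity, weight) pairs (hypothetical helpers for illustration, not the library's internals):

```rust
/// WeightedSum: every clause contributes similarity * weight.
fn weighted_sum(clauses: &[(f32, f32)]) -> f32 {
    clauses.iter().map(|(sim, w)| sim * w).sum()
}

/// MaxSim: the single best clause similarity wins.
fn max_sim(clauses: &[(f32, f32)]) -> f32 {
    clauses.iter().map(|(sim, _)| *sim).fold(f32::NEG_INFINITY, f32::max)
}
```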

Weights

Use the ^ boost syntax in the Query DSL, or weight in QueryVector, to adjust how much each field contributes:

text_vec:~"cute kitten"^1.0 image_vec:~"fluffy cat"^0.5

This means text similarity counts twice as much as image similarity.

Filtering

You can apply lexical filters to narrow the vector search results:

#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest};
use laurus::lexical::TermQuery;
use laurus::vector::VectorSearchRequestBuilder;

// Vector search with a category filter
let request = SearchRequestBuilder::new()
    .vector_search_request(
        VectorSearchRequestBuilder::new()
            .add_text("embedding", "machine learning")
            .build()
    )
    .filter_query(Box::new(TermQuery::new("category", "tutorial")))
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

The filter query runs first on the lexical index to identify allowed document IDs, then the vector search is restricted to those IDs.

Filter with Numeric Range

#![allow(unused)]
fn main() {
use laurus::lexical::NumericRangeQuery;
use laurus::lexical::core::field::NumericType;

let request = SearchRequestBuilder::new()
    .vector_search_request(
        VectorSearchRequestBuilder::new()
            .add_text("embedding", "type systems")
            .build()
    )
    .filter_query(Box::new(NumericRangeQuery::new(
        "year", NumericType::Integer,
        Some(2020.0), Some(2024.0), true, true
    )))
    .limit(10)
    .build();
}

Distance Metrics

The distance metric is configured per field in the schema (see Vector Indexing):

| Metric | Description | Lower = More Similar |
|---|---|---|
| Cosine | 1 - cosine similarity | Yes |
| Euclidean | L2 distance | Yes |
| Manhattan | L1 distance | Yes |
| DotProduct | Negative inner product | Yes |
| Angular | Angular distance | Yes |
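For intuition, cosine distance over raw vectors can be computed like this (a standalone sketch, not the library's distance kernel):

```rust
/// Cosine distance: 0.0 for identical directions, 1.0 for orthogonal vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm(a) * norm(b))
}
```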
Complete Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema, SearchRequestBuilder, PerFieldEmbedder};
use laurus::lexical::TextOption;
use laurus::vector::HnswOption;
use laurus::vector::VectorSearchRequestBuilder;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_hnsw_field("text_vec", HnswOption {
            dimension: 384,
            ..Default::default()
        })
        .build();

    // Set up per-field embedder (assumes `my_embedder` implements the Embedder trait)
    let embedder = Arc::new(my_embedder);
    let mut pfe = PerFieldEmbedder::new(embedder.clone());
    pfe.add_embedder("text_vec", embedder.clone());

    let engine = Engine::builder(storage, schema)
        .embedder(Arc::new(pfe))
        .build()
        .await?;

    // Index documents (text in vector field is auto-embedded)
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("text_vec", "Rust is a systems programming language.")
        .build()
    ).await?;
    engine.commit().await?;

    // Search by semantic similarity
    let results = engine.search(
        SearchRequestBuilder::new()
            .vector_search_request(
                VectorSearchRequestBuilder::new()
                    .add_text("text_vec", "systems language")
                    .build()
            )
            .limit(5)
            .build()
    ).await?;

    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }

    Ok(())
}

Next Steps

Hybrid Search

Hybrid search combines lexical search (keyword matching) with vector search (semantic similarity) to deliver results that are both precise and semantically relevant. This is Laurus’s most powerful search mode.

| Search Type | Strengths | Weaknesses |
|---|---|---|
| Lexical only | Exact keyword matching, handles rare terms well | Misses synonyms and paraphrases |
| Vector only | Understands meaning, handles synonyms | May miss exact keywords, less precise |
| Hybrid | Best of both worlds | Slightly more complex to configure |

How It Works

sequenceDiagram
    participant User
    participant Engine
    participant Lexical as LexicalStore
    participant Vector as VectorStore
    participant Fusion

    User->>Engine: SearchRequest\n(lexical + vector)

    par Execute in parallel
        Engine->>Lexical: BM25 keyword search
        Lexical-->>Engine: Ranked hits (by relevance)
    and
        Engine->>Vector: ANN similarity search
        Vector-->>Engine: Ranked hits (by distance)
    end

    Engine->>Fusion: Merge two result sets
    Note over Fusion: RRF or WeightedSum
    Fusion-->>Engine: Unified ranked list
    Engine-->>User: Vec of SearchResult

Basic Usage

Builder API

#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest, FusionAlgorithm};
use laurus::lexical::TermQuery;
use laurus::vector::VectorSearchRequestBuilder;

let request = SearchRequestBuilder::new()
    // Lexical component
    .lexical_search_request(
        LexicalSearchRequest::new(
            Box::new(TermQuery::new("body", "rust"))
        )
    )
    // Vector component
    .vector_search_request(
        VectorSearchRequestBuilder::new()
            .add_text("text_vec", "systems programming")
            .build()
    )
    // Fusion algorithm
    .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

Query DSL

Mix lexical and vector clauses in a single query string:

#![allow(unused)]
fn main() {
use laurus::UnifiedQueryParser;
use laurus::lexical::QueryParser;
use laurus::vector::VectorQueryParser;

let unified = UnifiedQueryParser::new(
    QueryParser::new(analyzer).with_default_field("body"),
    VectorQueryParser::new(embedder),
);

// Lexical + vector in one query
let request = unified.parse(r#"body:rust text_vec:~"systems programming""#).await?;
let results = engine.search(request).await?;
}

The ~"..." syntax identifies vector clauses. Everything else is parsed as lexical.

Fusion Algorithms

When both lexical and vector results exist, they must be merged into a single ranked list. Laurus supports two fusion algorithms:

RRF (Reciprocal Rank Fusion)

The default algorithm. Combines results based on their rank positions rather than raw scores.

score(doc) = sum( 1 / (k + rank_i) )

Where rank_i is the position of the document in each result list, and k is a smoothing parameter (default 60).

#![allow(unused)]
fn main() {
use laurus::FusionAlgorithm;

let fusion = FusionAlgorithm::RRF { k: 60.0 };
}

Advantages:

  • Robust to different score distributions between lexical and vector results
  • No need to tune weights
  • Works well out of the box
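The formula is simple to implement. A standalone sketch of fusing ranked ID lists (illustrative only, not the engine's internal code):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document
/// (rank is 1-based), and documents appearing in several lists accumulate score.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Sort descending by fused score.
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

Note how "b", ranked second lexically but first in the vector list, outranks documents that appear in only one list.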

WeightedSum

Linearly combines normalized lexical and vector scores:

score(doc) = lexical_weight * lexical_score + vector_weight * vector_score

#![allow(unused)]
fn main() {
use laurus::FusionAlgorithm;

let fusion = FusionAlgorithm::WeightedSum {
    lexical_weight: 0.3,
    vector_weight: 0.7,
};
}

When to use:

  • When you want explicit control over the balance between lexical and vector relevance
  • When you know one signal is more important than the other
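A standalone sketch of this fusion (the min-max normalization here is an assumption for illustration; Laurus's exact normalization scheme may differ):

```rust
use std::collections::HashMap;

/// Min-max normalize scores into [0, 1].
fn normalize(scores: &HashMap<String, f64>) -> HashMap<String, f64> {
    let max = scores.values().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = scores.values().cloned().fold(f64::INFINITY, f64::min);
    let range = (max - min).max(f64::EPSILON);
    scores.iter().map(|(id, s)| (id.clone(), (s - min) / range)).collect()
}

/// Combine normalized lexical and vector scores with fixed weights.
fn weighted_sum_fuse(
    lexical: &HashMap<String, f64>,
    vector: &HashMap<String, f64>,
    lexical_weight: f64,
    vector_weight: f64,
) -> HashMap<String, f64> {
    let (lexical, vector) = (normalize(lexical), normalize(vector));
    let mut fused = HashMap::new();
    for (id, s) in &lexical {
        *fused.entry(id.clone()).or_insert(0.0) += lexical_weight * s;
    }
    for (id, s) in &vector {
        *fused.entry(id.clone()).or_insert(0.0) += vector_weight * s;
    }
    fused
}
```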

SearchRequest Options

| Option | Default | Description |
|---|---|---|
| lexical_search_request | None | Lexical query component |
| vector_search_request | None | Vector query component |
| filter_query | None | Pre-filter using a lexical query (restricts both lexical and vector results) |
| fusion_algorithm | None (uses RRF { k: 60.0 } when both results exist) | How to merge lexical and vector results |
| limit | 10 | Maximum number of results to return |
| offset | 0 | Number of results to skip (for pagination) |

SearchResult

Each result contains:

| Field | Type | Description |
|---|---|---|
| id | String | External document ID |
| score | f32 | Fused relevance score |
| document | Option<Document> | Full document content (if loaded) |

Filtering

Apply a filter to restrict both lexical and vector results:

#![allow(unused)]
fn main() {
let request = SearchRequestBuilder::new()
    .lexical_search_request(
        LexicalSearchRequest::new(Box::new(TermQuery::new("body", "rust")))
    )
    .vector_search_request(
        VectorSearchRequestBuilder::new()
            .add_text("text_vec", "systems programming")
            .build()
    )
    // Only search within "tutorial" category
    .filter_query(Box::new(TermQuery::new("category", "tutorial")))
    .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
    .limit(10)
    .build();
}

How Filtering Works

  1. The filter query runs on the lexical index to produce a set of allowed document IDs
  2. For lexical search: the filter is combined with the user query as a boolean AND
  3. For vector search: the allowed IDs are passed to restrict the ANN search
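Step 3 amounts to an ID-set membership check. A minimal sketch (hypothetical types, not the engine's internals):

```rust
use std::collections::HashSet;

/// Drop candidate hits whose IDs are not in the allowed set
/// produced by the filter query.
fn restrict_to_allowed<'a>(
    hits: Vec<(&'a str, f32)>,
    allowed: &HashSet<&str>,
) -> Vec<(&'a str, f32)> {
    hits.into_iter().filter(|(id, _)| allowed.contains(id)).collect()
}
```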

Pagination

Use offset and limit for pagination:

#![allow(unused)]
fn main() {
// Page 1: results 0-9
let page1 = SearchRequestBuilder::new()
    .lexical_search_request(/* ... */)
    .vector_search_request(/* ... */)
    .offset(0)
    .limit(10)
    .build();

// Page 2: results 10-19
let page2 = SearchRequestBuilder::new()
    .lexical_search_request(/* ... */)
    .vector_search_request(/* ... */)
    .offset(10)
    .limit(10)
    .build();
}

Complete Example

use std::sync::Arc;
use laurus::{
    Document, Engine, Schema, SearchRequestBuilder,
    LexicalSearchRequest, FusionAlgorithm, PerFieldEmbedder,
};
use laurus::lexical::{TextOption, TermQuery};
use laurus::lexical::core::field::IntegerOption;
use laurus::vector::{HnswOption, VectorSearchRequestBuilder};
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    // Schema with both lexical and vector fields
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_text_field("category", TextOption::default())
        .add_integer_field("year", IntegerOption::default())
        .add_hnsw_field("body_vec", HnswOption {
            dimension: 384,
            ..Default::default()
        })
        .build();

    // Configure analyzer and embedder (see Text Analysis and Embeddings docs)
    // let analyzer = Arc::new(StandardAnalyzer::new()?);
    // let embedder = Arc::new(CandleBertEmbedder::new("sentence-transformers/all-MiniLM-L6-v2")?);
    let engine = Engine::builder(storage, schema)
        // .analyzer(analyzer)
        // .embedder(embedder)
        .build()
        .await?;

    // Index documents with both text and vector fields
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming Guide")
        .add_text("body", "Rust is a systems programming language.")
        .add_text("category", "programming")
        .add_integer("year", 2024)
        .add_text("body_vec", "Rust is a systems programming language.")
        .build()
    ).await?;
    engine.commit().await?;

    // Hybrid search: keyword "rust" + semantic "systems language"
    let results = engine.search(
        SearchRequestBuilder::new()
            .lexical_search_request(
                LexicalSearchRequest::new(Box::new(TermQuery::new("body", "rust")))
            )
            .vector_search_request(
                VectorSearchRequestBuilder::new()
                    .add_text("body_vec", "systems language")
                    .build()
            )
            .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
            .limit(10)
            .build()
    ).await?;

    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }

    Ok(())
}

Next Steps

Spelling Correction

Laurus includes a built-in spelling correction system that can suggest corrections for misspelled query terms and provide “Did you mean?” functionality.

Overview

The spelling corrector uses edit distance (Levenshtein distance) combined with word frequency data to suggest corrections. It supports:

  • Word-level suggestions — correct individual misspelled words
  • Auto-correction — automatically apply high-confidence corrections
  • “Did you mean?” — suggest alternative queries to the user
  • Query learning — improve suggestions by learning from user queries
  • Custom dictionaries — use your own word lists
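The core primitive is plain Levenshtein distance. A standalone sketch of the classic dynamic-programming version (not the crate's implementation):

```rust
/// Minimum number of single-character insertions, deletions, and
/// substitutions needed to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between a[..i] and b[..j] for the previous row.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // Substitution (diagonal), deletion (up), insertion (left).
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}
```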

Basic Usage

SpellingCorrector

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::SpellingCorrector;

// Create a corrector with the built-in English dictionary
let mut corrector = SpellingCorrector::new();

// Correct a query
let result = corrector.correct("programing langauge");

// Check if suggestions are available
if result.has_suggestions() {
    for (word, suggestions) in &result.word_suggestions {
        println!("'{}' -> {:?}", word, suggestions);
    }
}

// Get the best corrected query
if let Some(corrected) = result.query() {
    println!("Corrected: {}", corrected);
}
}

“Did You Mean?”

The DidYouMean wrapper provides a higher-level interface for search UIs:

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::{SpellingCorrector, DidYouMean};

let corrector = SpellingCorrector::new();
let mut did_you_mean = DidYouMean::new(corrector);

if let Some(suggestion) = did_you_mean.suggest("programing") {
    println!("Did you mean: {}?", suggestion);
}
}

Configuration

Use CorrectorConfig to customize behavior:

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::{CorrectorConfig, SpellingCorrector};

let config = CorrectorConfig {
    max_distance: 2,              // Maximum edit distance (default: 2)
    max_suggestions: 5,           // Max suggestions per word (default: 5)
    min_frequency: 1,             // Minimum word frequency threshold (default: 1)
    auto_correct: false,          // Enable auto-correction (default: false)
    auto_correct_threshold: 0.8,  // Confidence threshold for auto-correction (default: 0.8)
    use_index_terms: true,        // Use indexed terms as dictionary (default: true)
    learn_from_queries: true,     // Learn from user queries (default: true)
};
}

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| max_distance | usize | 2 | Maximum Levenshtein edit distance for candidate suggestions |
| max_suggestions | usize | 5 | Maximum number of suggestions returned per word |
| min_frequency | u32 | 1 | Minimum frequency a word must have in the dictionary to be suggested |
| auto_correct | bool | false | When true, automatically apply corrections above the threshold |
| auto_correct_threshold | f64 | 0.8 | Confidence score (0.0–1.0) required for auto-correction |
| use_index_terms | bool | true | Use terms from the search index as dictionary words |
| learn_from_queries | bool | true | Learn new words from user search queries |

CorrectionResult

The correct() method returns a CorrectionResult with detailed information:

| Field | Type | Description |
|---|---|---|
| original | String | The original query string |
| corrected | Option<String> | The corrected query (if auto-correction was applied) |
| word_suggestions | HashMap<String, Vec<Suggestion>> | Suggestions grouped by misspelled word |
| confidence | f64 | Overall confidence score (0.0–1.0) |
| auto_corrected | bool | Whether auto-correction was applied |

Helper Methods

| Method | Returns | Description |
|---|---|---|
| has_suggestions() | bool | True if any word has suggestions |
| best_suggestion() | Option<&Suggestion> | The single highest-scoring suggestion |
| query() | Option<String> | The corrected query string, if corrections were made |
| should_show_did_you_mean() | bool | Whether to display a “Did you mean?” prompt |

Custom Dictionaries

You can provide your own dictionary instead of using the built-in English one:

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::SpellingCorrector;
use laurus::spelling::dictionary::SpellingDictionary;

// Build a custom dictionary
let mut dictionary = SpellingDictionary::new();
dictionary.add_word("elasticsearch", 100);
dictionary.add_word("lucene", 80);
dictionary.add_word("laurus", 90);

let corrector = SpellingCorrector::with_dictionary(dictionary);
}

Learning from Index Terms

When use_index_terms is enabled, the corrector can learn from terms in your search index:

#![allow(unused)]
fn main() {
let mut corrector = SpellingCorrector::new();

// Feed index terms to the corrector
let index_terms = vec!["rust", "programming", "search", "engine"];
corrector.learn_from_terms(&index_terms);
}

This improves suggestion quality by incorporating domain-specific vocabulary.

Statistics

Monitor the corrector’s state with stats():

#![allow(unused)]
fn main() {
let stats = corrector.stats();
println!("Dictionary words: {}", stats.dictionary_words);
println!("Total frequency: {}", stats.dictionary_total_frequency);
println!("Learned queries: {}", stats.queries_learned);
}

Next Steps

CLI (Command-Line Interface)

Laurus provides a command-line tool laurus that lets you create indexes, manage documents, and run search queries without writing code.

Features

  • Index management — Create and inspect indexes from TOML schema files, with an interactive schema generator
  • Document CRUD — Add, retrieve, and delete documents via JSON
  • Search — Execute queries using the Query DSL
  • Dual output — Human-readable tables or machine-parseable JSON
  • Interactive REPL — Explore your index in a live session
  • gRPC server — Start a gRPC server with laurus serve

Getting Started

# Install
cargo install laurus-cli

# Generate a schema interactively
laurus create schema

# Create an index from the schema
laurus --data-dir ./my_index create index --schema schema.toml

# Add a document
laurus --data-dir ./my_index add doc --id doc1 --data '{"title":"Hello","body":"World"}'

# Commit changes
laurus --data-dir ./my_index commit

# Search
laurus --data-dir ./my_index search "body:world"

See the sub-sections for detailed documentation:

Installation

From crates.io

cargo install laurus-cli

This installs the laurus binary to ~/.cargo/bin/.

From source

git clone https://github.com/mosuka/laurus.git
cd laurus
cargo install --path laurus-cli

Verify

laurus --version

Shell Completion

Dedicated completion scripts are not yet shipped. The CLI is built on clap, so shell completions can be generated with clap_complete if needed in a future release.

Command Reference

Global Options

Every command accepts these options:

| Option | Environment Variable | Default | Description |
|---|---|---|---|
| --data-dir <PATH> | LAURUS_DATA_DIR | ./laurus_data | Path to the index data directory |
| --format <FORMAT> | | table | Output format: table or json |

# Example: use JSON output with a custom data directory
laurus --data-dir /var/data/my_index --format json search "title:rust"

create — Create a Resource

create index

Create a new index from a schema TOML file.

laurus create index --schema <FILE>

Arguments:

| Flag | Required | Description |
|---|---|---|
| --schema <FILE> | Yes | Path to a TOML file defining the index schema |

Schema file format:

The schema file follows the same structure as the Schema type in the Laurus library. See Schema Format Reference for full details. Example:

default_fields = ["title", "body"]

[fields.title.Text]
stored = true
indexed = true

[fields.body.Text]
stored = true
indexed = true

[fields.category.Text]
stored = true
indexed = true

Example:

laurus --data-dir ./my_index create index --schema schema.toml
# Index created at ./my_index.

Note: An error is returned if the index already exists. Delete the data directory to recreate.

create schema

Interactively generate a schema TOML file through a guided wizard.

laurus create schema [--output <FILE>]

Arguments:

| Flag | Required | Default | Description |
|---|---|---|---|
| --output <FILE> | No | schema.toml | Output file path for the generated schema |

The wizard guides you through:

  1. Field definition — Enter a field name, select the type, and configure type-specific options
  2. Repeat — Add as many fields as needed
  3. Default fields — Select which lexical fields to use as default search fields
  4. Preview — Review the generated TOML before saving
  5. Save — Write the schema file

Supported field types:

| Type | Category | Options |
|---|---|---|
| Text | Lexical | indexed, stored, term_vectors |
| Integer | Lexical | indexed, stored |
| Float | Lexical | indexed, stored |
| Boolean | Lexical | indexed, stored |
| DateTime | Lexical | indexed, stored |
| Geo | Lexical | indexed, stored |
| Bytes | Lexical | stored |
| Hnsw | Vector | dimension, distance, m, ef_construction |
| Flat | Vector | dimension, distance |
| Ivf | Vector | dimension, distance, n_clusters, n_probe |

Example:

# Generate schema.toml interactively
laurus create schema

# Specify output path
laurus create schema --output my_schema.toml

# Then create an index from the generated schema
laurus create index --schema schema.toml

get — Get a Resource

get index

Display statistics about the index.

laurus get index

Table output example:

Document count: 42

Vector fields:
╭──────────┬─────────┬───────────╮
│ Field    │ Vectors │ Dimension │
├──────────┼─────────┼───────────┤
│ text_vec │ 42      │ 384       │
╰──────────┴─────────┴───────────╯

JSON output example:

laurus --format json get index
{
  "document_count": 42,
  "fields": {
    "text_vec": {
      "vector_count": 42,
      "dimension": 384
    }
  }
}

get doc

Retrieve a document (and all its chunks) by external ID.

laurus get doc --id <ID>

Table output example:

╭──────┬─────────────────────────────────────────╮
│ ID   │ Fields                                  │
├──────┼─────────────────────────────────────────┤
│ doc1 │ body: This is a test, title: Hello World │
╰──────┴─────────────────────────────────────────╯

JSON output example:

laurus --format json get doc --id doc1
[
  {
    "id": "doc1",
    "document": {
      "title": "Hello World",
      "body": "This is a test document."
    }
  }
]

add — Add a Resource

add doc

Add a document to the index. Documents are not searchable until commit is called.

laurus add doc --id <ID> --data <JSON>

Arguments:

| Flag | Required | Description |
|---|---|---|
| --id <ID> | Yes | External document ID (string) |
| --data <JSON> | Yes | Document fields as a JSON string |

The JSON format is a flat object mapping field names to values:

{
  "title": "Introduction to Rust",
  "body": "Rust is a systems programming language.",
  "category": "programming"
}

Example:

laurus add doc --id doc1 --data '{"title":"Hello World","body":"This is a test document."}'
# Document 'doc1' added. Run 'commit' to persist changes.

Tip: Multiple documents can share the same external ID (chunking pattern). Use add doc for each chunk.


delete — Delete a Resource

delete doc

Delete a document (and all its chunks) by external ID.

laurus delete doc --id <ID>

Example:

laurus delete doc --id doc1
# Document 'doc1' deleted. Run 'commit' to persist changes.

commit

Commit pending changes (additions and deletions) to the index. Until committed, changes are not visible to search.

laurus commit

Example:

laurus --data-dir ./my_index commit
# Changes committed successfully.

search

Execute a search query using the Query DSL.

laurus search <QUERY> [--limit <N>] [--offset <N>]

Arguments:

| Argument / Flag | Required | Default | Description |
|---|---|---|---|
| <QUERY> | Yes | | Query string in Laurus Query DSL |
| --limit <N> | No | 10 | Maximum number of results |
| --offset <N> | No | 0 | Number of results to skip |

Query syntax examples:

# Term query
laurus search "body:rust"

# Phrase query
laurus search 'body:"machine learning"'

# Boolean query
laurus search "+body:programming -body:python"

# Fuzzy query (typo tolerance)
laurus search "body:programing~2"

# Wildcard query
laurus search "title:intro*"

# Range query
laurus search "price:[10 TO 50]"

Table output example:

╭──────┬────────┬─────────────────────────────────────────╮
│ ID   │ Score  │ Fields                                  │
├──────┼────────┼─────────────────────────────────────────┤
│ doc1 │ 0.8532 │ body: Rust is a systems..., title: Intr │
│ doc3 │ 0.4210 │ body: JavaScript powers..., title: Web  │
╰──────┴────────┴─────────────────────────────────────────╯

JSON output example:

laurus --format json search "body:rust" --limit 5
[
  {
    "id": "doc1",
    "score": 0.8532,
    "document": {
      "title": "Introduction to Rust",
      "body": "Rust is a systems programming language."
    }
  }
]

repl

Start an interactive REPL session. See REPL for details.

laurus repl

serve

Start the gRPC server. See gRPC Server for full documentation.

laurus serve [OPTIONS]

Options:

| Option | Short | Env Variable | Default | Description |
|---|---|---|---|---|
| --config <PATH> | -c | LAURUS_CONFIG | | Path to a TOML configuration file |
| --host <HOST> | -H | LAURUS_HOST | 0.0.0.0 | Listen address |
| --port <PORT> | -p | LAURUS_PORT | 50051 | Listen port |
| --log-level <LEVEL> | -l | LAURUS_LOG_LEVEL | info | Log level (trace, debug, info, warn, error) |

Example:

# Start with defaults (port 50051)
laurus --data-dir ./my_index serve

# Custom port and log level
laurus serve --port 8080 --log-level debug

# Use a configuration file
laurus serve --config config.toml

# Use environment variables
LAURUS_DATA_DIR=./my_index LAURUS_PORT=8080 laurus serve

Schema Format Reference

The schema file defines the structure of your index — what fields exist, their types, and how they are indexed. Laurus uses TOML format for schema files.

Overview

A schema consists of two top-level elements:

# Fields to search by default when a query does not specify a field.
default_fields = ["title", "body"]

# Field definitions. Each field has a name and a typed configuration.
[fields.<field_name>.<FieldType>]
# ... type-specific options
  • default_fields — A list of field names used as default search targets by the Query DSL. Only lexical fields (Text, Integer, Float, etc.) can be default fields. This key is optional and defaults to an empty list.
  • fields — A map of field names to their typed configuration. Each field must specify exactly one field type.

Field Naming

  • Field names are arbitrary strings (e.g., title, body_vec, created_at).
  • The _id field is reserved by Laurus for internal document ID management — do not use it.
  • Field names must be unique within a schema.

Field Types

Fields fall into two categories: Lexical (for keyword/full-text search) and Vector (for similarity search). A single field cannot be both.

Lexical Fields

Text

Full-text searchable field. Text is processed by the analysis pipeline (tokenization, normalization, stemming, etc.).

[fields.title.Text]
indexed = true       # Whether to index this field for search
stored = true        # Whether to store the original value for retrieval
term_vectors = false # Whether to store term positions (for phrase queries, highlighting)

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables searching this field |
| stored | bool | true | Stores the original value so it can be returned in results |
| term_vectors | bool | true | Stores term positions for phrase queries, highlighting, and more-like-this |

Integer

64-bit signed integer field. Supports range queries and exact match.

[fields.year.Integer]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range and exact-match queries |
| stored | bool | true | Stores the original value |

Float

64-bit floating point field. Supports range queries.

[fields.rating.Float]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range queries |
| stored | bool | true | Stores the original value |

Boolean

Boolean field (true / false).

[fields.published.Boolean]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables filtering by boolean value |
| stored | bool | true | Stores the original value |

DateTime

UTC timestamp field. Supports range queries.

[fields.created_at.DateTime]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range queries on date/time |
| stored | bool | true | Stores the original value |

Geo

Geographic point field (latitude/longitude). Supports radius and bounding box queries.

[fields.location.Geo]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables geo queries (radius, bounding box) |
| stored | bool | true | Stores the original value |

Bytes

Raw binary data field. Not indexed — stored only.

[fields.thumbnail.Bytes]
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| stored | bool | true | Stores the binary data |

Vector Fields

Vector fields are indexed for approximate nearest neighbor (ANN) search. They require a dimension (the length of each vector) and a distance metric.

Hnsw

Hierarchical Navigable Small World graph index. Best for most use cases — offers a good balance of speed and recall.

[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | 128 | Vector dimensionality (must match your embedding model) |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| m | integer | 16 | Max bi-directional connections per node. Higher = better recall, more memory |
| ef_construction | integer | 200 | Search width during index construction. Higher = better quality, slower build |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | none | Optional quantization method (see Quantization) |

Tuning guidelines:

  • m: 12–48 is typical. Use higher values for higher-dimensional vectors.
  • ef_construction: 100–500. Higher values produce a better graph but increase build time.
  • dimension: Must exactly match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base, 1536 for text-embedding-3-small).

Flat

Brute-force linear scan index. Provides exact results with no approximation. Best for small datasets (< 10,000 vectors).

[fields.embedding.Flat]
dimension = 384
distance = "Cosine"
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | 128 | Vector dimensionality |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | none | Optional quantization method (see Quantization) |

Ivf

Inverted File Index. Clusters vectors and searches only a subset of clusters. Suitable for very large datasets.

[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 100
n_probe = 1
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | (required) | Vector dimensionality |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| n_clusters | integer | 100 | Number of clusters. More clusters = finer partitioning |
| n_probe | integer | 1 | Number of clusters to search at query time. Higher = better recall, slower |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | none | Optional quantization method (see Quantization) |

Note: Unlike Hnsw and Flat, the dimension field in Ivf is required and has no default value.

Tuning guidelines:

  • n_clusters: A common heuristic is sqrt(N) where N is the total number of vectors.
  • n_probe: Start with 1 and increase until recall is acceptable. Typical range is 1–20.
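The sqrt(N) heuristic above can be sketched in a few lines. This helper is illustrative only (the function name is not part of the Laurus API):

```rust
// Sketch of the common sqrt(N) heuristic for choosing `n_clusters`.
fn suggest_n_clusters(total_vectors: u64) -> u64 {
    // At least one cluster, even for an empty or tiny dataset.
    ((total_vectors as f64).sqrt().round() as u64).max(1)
}

fn main() {
    // ~1,000,000 vectors -> roughly 1,000 clusters
    println!("{}", suggest_n_clusters(1_000_000));
}
```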

Distance Metrics

The distance option for vector fields accepts the following values:

| Value | Description | Use When |
|---|---|---|
| "Cosine" | Cosine distance (1 - cosine similarity). Default. | Normalized text/image embeddings |
| "Euclidean" | L2 (Euclidean) distance | Spatial data, non-normalized vectors |
| "Manhattan" | L1 (Manhattan) distance | Sparse feature vectors |
| "DotProduct" | Dot product (higher = more similar) | Pre-normalized vectors where magnitude matters |
| "Angular" | Angular distance | Similar to cosine, but based on angle |

For most embedding models (BERT, Sentence Transformers, OpenAI, etc.), "Cosine" is the correct choice.
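To make the conventions above concrete, here is a minimal sketch of two of these metrics, using the stated definition cosine distance = 1 - cosine similarity. These helpers are illustrations, not the Laurus internals:

```rust
// Cosine distance: 1 - (a . b) / (|a| * |b|)
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

// Euclidean (L2) distance: sqrt(sum((a_i - b_i)^2))
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

fn main() {
    let a = [1.0, 0.0];
    let b = [0.0, 1.0];
    // Orthogonal unit vectors: cosine distance 1.0, euclidean sqrt(2)
    println!("{} {}", cosine_distance(&a, &b), euclidean_distance(&a, &b));
}
```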

Quantization

Vector fields optionally support quantization to reduce memory usage at the cost of some accuracy. Specify the quantizer option as a TOML table.

None (default)

No quantization — full precision 32-bit floats.

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
# quantizer is omitted (no quantization)

Scalar 8-bit

Compresses each float32 component to uint8 (~4x memory reduction).

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
quantizer = "Scalar8Bit"
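A minimal sketch of what scalar 8-bit quantization does conceptually: each f32 component is mapped into [0, 255] over the vector's min/max range, giving roughly a 4x size reduction. Laurus's actual codec is not specified here; this only illustrates the idea:

```rust
// Quantize: map each component into [0, 255] over the vector's range.
fn quantize(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|x| ((x - min) * scale).round() as u8).collect();
    (codes, min, max)
}

// Dequantize: recover an approximation of the original components.
fn dequantize(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = if max > min { (max - min) / 255.0 } else { 0.0 };
    codes.iter().map(|&c| min + c as f32 * step).collect()
}

fn main() {
    let v = [0.0_f32, 0.5, 1.0];
    let (codes, min, max) = quantize(&v);
    // Values are recovered within ~1/255 of the original range.
    println!("{:?}", dequantize(&codes, min, max));
}
```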

Product Quantization

Splits the vector into subvectors and quantizes each independently.

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"

[fields.embedding.Hnsw.quantizer.ProductQuantization]
subvector_count = 48
| Option | Type | Description |
|---|---|---|
| subvector_count | integer | Number of subvectors. Must evenly divide dimension. |
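The divisibility requirement exists because product quantization first splits each vector into subvector_count equal-length pieces. A sketch of that split (illustrative, not the Laurus implementation):

```rust
// Split a vector into `subvector_count` equal-length subvectors.
// Returns None when the configuration is invalid, which is why
// `subvector_count` must evenly divide `dimension`.
fn split_subvectors(v: &[f32], subvector_count: usize) -> Option<Vec<&[f32]>> {
    if subvector_count == 0 || v.len() % subvector_count != 0 {
        return None;
    }
    Some(v.chunks(v.len() / subvector_count).collect())
}

fn main() {
    let v = vec![0.0_f32; 384];
    // dimension 384 with subvector_count 48 -> 48 subvectors of length 8
    println!("{}", split_subvectors(&v, 48).unwrap().len());
}
```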

Complete Examples

Full-text search only

A simple blog post index with lexical search:

default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = false

[fields.category.Text]
indexed = true
stored = true
term_vectors = false

[fields.published_at.DateTime]
indexed = true
stored = true

Vector search only

A vector-only index for semantic similarity:

[fields.embedding.Hnsw]
dimension = 768
distance = "Cosine"
m = 16
ef_construction = 200

Hybrid search (lexical + vector)

Combine lexical and vector search for best-of-both-worlds retrieval:

default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = true

[fields.category.Text]
indexed = true
stored = true
term_vectors = false

[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200

Tip: A single field cannot be both lexical and vector. Use separate fields (e.g., body for text, body_vec for embedding) and map them both to the same source content.

E-commerce product index

A more complex schema with mixed field types:

default_fields = ["name", "description"]

[fields.name.Text]
indexed = true
stored = true
term_vectors = false

[fields.description.Text]
indexed = true
stored = true
term_vectors = true

[fields.price.Float]
indexed = true
stored = true

[fields.in_stock.Boolean]
indexed = true
stored = true

[fields.created_at.DateTime]
indexed = true
stored = true

[fields.location.Geo]
indexed = true
stored = true

[fields.description_vec.Hnsw]
dimension = 384
distance = "Cosine"

Generating a Schema

You can generate a schema TOML file interactively using the CLI:

laurus create schema
laurus create schema --output my_schema.toml

See create schema for details.

Using a Schema

Once you have a schema file, create an index from it:

laurus create index --schema schema.toml

Or load it programmatically in Rust:

#![allow(unused)]
fn main() {
use laurus::Schema;

let toml_str = std::fs::read_to_string("schema.toml")?;
let schema: Schema = toml::from_str(&toml_str)?;
}

REPL (Interactive Mode)

The REPL provides an interactive session for exploring your index without typing the full laurus command each time.

Starting the REPL

laurus --data-dir ./my_index repl
Laurus REPL (type 'help' for commands, 'quit' to exit)
laurus>

The REPL opens the index at startup and keeps it loaded throughout the session.

Available Commands

| Command | Description |
|---|---|
| `search <query> [limit]` | Search the index |
| `doc add <id> <json>` | Add a document |
| `doc get <id>` | Get a document by ID |
| `doc delete <id>` | Delete a document by ID |
| `commit` | Commit pending changes |
| `stats` | Show index statistics |
| `help` | Show available commands |
| `quit` / `exit` | Exit the REPL |

Usage Examples

Searching

laurus> search body:rust
╭──────┬────────┬────────────────────────────────────╮
│ ID   │ Score  │ Fields                             │
├──────┼────────┼────────────────────────────────────┤
│ doc1 │ 0.8532 │ body: Rust is a systems..., title… │
╰──────┴────────┴────────────────────────────────────╯

Adding and Committing Documents

laurus> doc add doc4 {"title":"New Document","body":"Some content here."}
Document 'doc4' added.
laurus> commit
Changes committed.

Retrieving Documents

laurus> doc get doc4
╭──────┬───────────────────────────────────────────────╮
│ ID   │ Fields                                        │
├──────┼───────────────────────────────────────────────┤
│ doc4 │ body: Some content here., title: New Document │
╰──────┴───────────────────────────────────────────────╯

Deleting Documents

laurus> doc delete doc4
Document 'doc4' deleted.
laurus> commit
Changes committed.

Viewing Statistics

laurus> stats
Document count: 3

Features

  • Line editing — Arrow keys, Home/End, and standard readline shortcuts
  • History — Use Up/Down arrows to recall previous commands
  • Ctrl+C / Ctrl+D — Exit the REPL gracefully

gRPC Server

Laurus includes a built-in gRPC server that keeps the search engine resident in memory, eliminating the per-command startup overhead of the CLI. This is the recommended way to run Laurus in production or when integrating with other services.

Features

  • Persistent engine — The index stays open across requests; no WAL replay on every call
  • Full gRPC API — Index management, document CRUD, commit, and search (unary + streaming)
  • Health checking — Standard health check endpoint for load balancers and orchestrators
  • Graceful shutdown — Pending changes are committed automatically on Ctrl+C / SIGINT
  • TOML configuration — Optional config file with CLI override support

Quick Start

# Start the server with default settings
laurus serve

# Start with a custom data directory and port
laurus --data-dir ./my_index serve --port 8080

# Start with a configuration file
laurus serve --config config.toml

Sections

Getting Started with the gRPC Server

Starting the Server

The gRPC server is started via the serve subcommand of the laurus CLI:

laurus serve [OPTIONS]

Options

| Option | Short | Env Variable | Default | Description |
|---|---|---|---|---|
| `--config <PATH>` | `-c` | LAURUS_CONFIG | (none) | Path to a TOML configuration file |
| `--host <HOST>` | `-H` | LAURUS_HOST | 0.0.0.0 | Listen address |
| `--port <PORT>` | `-p` | LAURUS_PORT | 50051 | Listen port |
| `--http-port <PORT>` | (none) | LAURUS_HTTP_PORT | (none) | HTTP Gateway port (enables the HTTP gateway when set) |
| `--log-level <LEVEL>` | `-l` | LAURUS_LOG_LEVEL | info | Log level (trace, debug, info, warn, error) |

The global --data-dir option (env: LAURUS_DATA_DIR) specifies the index data directory:

# Using CLI arguments
laurus --data-dir ./my_index serve --port 8080 --log-level debug

# Using environment variables
export LAURUS_DATA_DIR=./my_index
export LAURUS_PORT=8080
export LAURUS_LOG_LEVEL=debug
laurus serve

Startup Behavior

On startup, the server attempts to open an existing index at the configured data directory. If no index exists, the server starts without one — you can create an index later via the CreateIndex RPC.

Configuration File

You can use a TOML configuration file instead of (or in addition to) command-line options:

laurus serve --config config.toml

Format

[server]
host = "0.0.0.0"
port = 50051
http_port = 8080  # Optional: enables HTTP Gateway

[index]
data_dir = "./laurus_data"

[log]
level = "info"

Priority

Settings are resolved in the following order (highest priority first):

CLI arguments > Environment variables > Config file > Defaults

For example, if config.toml sets port = 50051, the environment variable LAURUS_PORT=4567 is set, and --port 1234 is passed on the command line:

LAURUS_PORT=4567 laurus serve --config config.toml --port 1234
# → Listens on port 1234 (CLI argument wins)

If the CLI argument is omitted:

LAURUS_PORT=4567 laurus serve --config config.toml
# → Listens on port 4567 (environment variable wins over config file)
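The resolution order can be sketched as a chain of fallbacks. The helper below is illustrative (not the server's actual internals), showing the CLI > env > config > default precedence:

```rust
// Resolve a setting from layered sources, highest priority first:
// CLI argument > environment variable > config file > default.
fn resolve_port(
    cli: Option<u16>,
    env: Option<u16>,
    config: Option<u16>,
    default: u16,
) -> u16 {
    cli.or(env).or(config).unwrap_or(default)
}

fn main() {
    // --port 1234 with LAURUS_PORT=4567 and config port=50051
    println!("{}", resolve_port(Some(1234), Some(4567), Some(50051), 50051));
}
```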

Graceful Shutdown

When the server receives a shutdown signal (Ctrl+C / SIGINT), it automatically:

  1. Stops accepting new connections
  2. Commits any pending changes to the index
  3. Exits cleanly

HTTP Gateway

When http_port is set, an HTTP/JSON gateway starts alongside the gRPC server. The gateway proxies HTTP requests to the gRPC server internally:

User Request (HTTP/JSON) → gRPC Gateway (axum) → gRPC Server (tonic) → Engine

If http_port is omitted, only the gRPC server starts (default behavior).

Starting with HTTP Gateway

# Via CLI
laurus serve --http-port 8080

# Via config file (set http_port in [server] section)
laurus serve --config config.toml

# Via environment variable
LAURUS_HTTP_PORT=8080 laurus serve

HTTP API Endpoints

| Method | Path | gRPC Method |
|---|---|---|
| GET | /v1/health | HealthService/Check |
| POST | /v1/index | IndexService/CreateIndex |
| GET | /v1/index | IndexService/GetIndex |
| GET | /v1/schema | IndexService/GetSchema |
| PUT | /v1/documents/:id | DocumentService/PutDocument |
| POST | /v1/documents/:id | DocumentService/AddDocument |
| GET | /v1/documents/:id | DocumentService/GetDocuments |
| DELETE | /v1/documents/:id | DocumentService/DeleteDocuments |
| POST | /v1/commit | DocumentService/Commit |
| POST | /v1/search | SearchService/Search |
| POST | /v1/search/stream | SearchService/SearchStream (SSE) |

HTTP API Examples

# Health check
curl http://localhost:8080/v1/health

# Create an index
curl -X POST http://localhost:8080/v1/index \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "fields": {
        "title": {"text": {"indexed": true, "stored": true}},
        "body": {"text": {"indexed": true, "stored": true}}
      },
      "default_fields": ["title", "body"]
    }
  }'

# Add a document
curl -X POST http://localhost:8080/v1/documents/doc1 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Hello World",
        "body": "This is a test document."
      }
    }
  }'

# Commit
curl -X POST http://localhost:8080/v1/commit

# Search
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "body:test", "limit": 10}'

# Streaming search (SSE)
curl -N -X POST http://localhost:8080/v1/search/stream \
  -H 'Content-Type: application/json' \
  -d '{"query": "body:test", "limit": 10}'

Connecting via gRPC

Any gRPC client can connect to the server. For quick testing, grpcurl is useful:

# Health check
grpcurl -plaintext localhost:50051 laurus.v1.HealthService/Check

# Create an index
grpcurl -plaintext -d '{
  "schema": {
    "fields": {
      "title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
      "body": {"text": {"indexed": true, "stored": true, "term_vectors": true}}
    },
    "default_fields": ["title", "body"]
  }
}' localhost:50051 laurus.v1.IndexService/CreateIndex

# Add a document
grpcurl -plaintext -d '{
  "id": "doc1",
  "document": {
    "fields": {
      "title": {"text_value": "Hello World"},
      "body": {"text_value": "This is a test document."}
    }
  }
}' localhost:50051 laurus.v1.DocumentService/AddDocument

# Commit
grpcurl -plaintext localhost:50051 laurus.v1.DocumentService/Commit

# Search
grpcurl -plaintext -d '{"query": "body:test", "limit": 10}' \
  localhost:50051 laurus.v1.SearchService/Search

gRPC API Reference

All services are defined under the laurus.v1 protobuf package.

Services Overview

| Service | RPCs | Description |
|---|---|---|
| HealthService | Check | Health checking |
| IndexService | CreateIndex, GetIndex, GetSchema | Index lifecycle and schema |
| DocumentService | PutDocument, AddDocument, GetDocuments, DeleteDocuments, Commit | Document CRUD and commit |
| SearchService | Search, SearchStream | Unary and streaming search |

HealthService

Check

Returns the current serving status of the server.

rpc Check(HealthCheckRequest) returns (HealthCheckResponse);

Response fields:

| Field | Type | Description |
|---|---|---|
| status | ServingStatus | SERVING_STATUS_SERVING when the server is ready |

IndexService

CreateIndex

Create a new index with the given schema. Fails with ALREADY_EXISTS if an index is already open.

rpc CreateIndex(CreateIndexRequest) returns (CreateIndexResponse);

Request fields:

| Field | Type | Required | Description |
|---|---|---|---|
| schema | Schema | Yes | Index schema definition |

Schema structure:

message Schema {
  map<string, FieldOption> fields = 1;
  repeated string default_fields = 2;
}

Each FieldOption is a oneof with one of the following field types:

| Lexical Fields | Vector Fields |
|---|---|
| TextOption (indexed, stored, term_vectors) | HnswOption (dimension, distance, m, ef_construction, base_weight, quantizer) |
| IntegerOption (indexed, stored) | FlatOption (dimension, distance, base_weight, quantizer) |
| FloatOption (indexed, stored) | IvfOption (dimension, distance, n_clusters, n_probe, base_weight, quantizer) |
| BooleanOption (indexed, stored) | |
| DateTimeOption (indexed, stored) | |
| GeoOption (indexed, stored) | |
| BytesOption (stored) | |

Distance metrics: COSINE, EUCLIDEAN, MANHATTAN, DOT_PRODUCT, ANGULAR

Quantization methods: NONE, SCALAR_8BIT, PRODUCT_QUANTIZATION

Example:

{
  "schema": {
    "fields": {
      "title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
      "embedding": {"hnsw": {"dimension": 384, "distance": "DISTANCE_METRIC_COSINE", "m": 16, "ef_construction": 200}}
    },
    "default_fields": ["title"]
  }
}

GetIndex

Get index statistics.

rpc GetIndex(GetIndexRequest) returns (GetIndexResponse);

Response fields:

| Field | Type | Description |
|---|---|---|
| document_count | uint64 | Total number of documents in the index |
| vector_fields | `map<string, VectorFieldStats>` | Per-field vector statistics |

Each VectorFieldStats contains vector_count and dimension.

GetSchema

Retrieve the current index schema.

rpc GetSchema(GetSchemaRequest) returns (GetSchemaResponse);

Response fields:

| Field | Type | Description |
|---|---|---|
| schema | Schema | The index schema |

DocumentService

PutDocument

Insert or replace a document by ID. If a document with the same ID already exists, it is replaced.

rpc PutDocument(PutDocumentRequest) returns (PutDocumentResponse);

Request fields:

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | External document ID |
| document | Document | Yes | Document content |

Document structure:

message Document {
  map<string, Value> fields = 1;
}

Each Value is a oneof with these types:

| Type | Proto Field | Description |
|---|---|---|
| Null | null_value | Null value |
| Boolean | bool_value | Boolean value |
| Integer | int64_value | 64-bit integer |
| Float | float64_value | 64-bit floating point |
| Text | text_value | UTF-8 string |
| Bytes | bytes_value | Raw bytes |
| Vector | vector_value | VectorValue (list of floats) |
| DateTime | datetime_value | Unix microseconds (UTC) |
| Geo | geo_value | GeoPoint (latitude, longitude) |

AddDocument

Add a document. Unlike PutDocument, this does not replace existing documents with the same ID — multiple documents can share an ID (chunking pattern).

rpc AddDocument(AddDocumentRequest) returns (AddDocumentResponse);

Request fields are the same as PutDocument.

GetDocuments

Retrieve all documents matching the given external ID.

rpc GetDocuments(GetDocumentsRequest) returns (GetDocumentsResponse);

Request fields:

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | External document ID |

Response fields:

| Field | Type | Description |
|---|---|---|
| documents | repeated Document | Matching documents |

DeleteDocuments

Delete all documents matching the given external ID.

rpc DeleteDocuments(DeleteDocumentsRequest) returns (DeleteDocumentsResponse);

Commit

Commit pending changes (additions and deletions) to the index. Changes are not visible to search until committed.

rpc Commit(CommitRequest) returns (CommitResponse);

SearchService

Search

Execute a search query and return results as a single response.

rpc Search(SearchRequest) returns (SearchResponse);

SearchStream

Execute a search query and stream results back one at a time.

rpc SearchStream(SearchRequest) returns (stream SearchResult);

SearchRequest Fields

| Field | Type | Required | Description |
|---|---|---|---|
| query | string | No | Lexical search query in Query DSL |
| query_vectors | repeated QueryVector | No | Vector search queries |
| limit | uint32 | No | Maximum number of results (default: engine default) |
| offset | uint32 | No | Number of results to skip |
| fusion | FusionAlgorithm | No | Fusion algorithm for hybrid search |
| lexical_params | LexicalParams | No | Lexical search parameters |
| vector_params | VectorParams | No | Vector search parameters |
| field_boosts | `map<string, float>` | No | Per-field score boosting |

At least one of query or query_vectors must be provided.

QueryVector

| Field | Type | Description |
|---|---|---|
| vector | repeated float | Query vector |
| weight | float | Weight for this vector (default: 1.0) |
| fields | repeated string | Target vector fields (empty = all) |

FusionAlgorithm

A oneof with two options:

  • RRF (Reciprocal Rank Fusion): k parameter (default: 60)
  • WeightedSum: lexical_weight and vector_weight

LexicalParams

| Field | Type | Description |
|---|---|---|
| min_score | float | Minimum score threshold |
| timeout_ms | uint64 | Search timeout in milliseconds |
| parallel | bool | Enable parallel search |
| sort_by | SortSpec | Sort by a field instead of score |

VectorParams

| Field | Type | Description |
|---|---|---|
| fields | repeated string | Target vector fields |
| score_mode | VectorScoreMode | WEIGHTED_SUM, MAX_SIM, or LATE_INTERACTION |
| overfetch | float | Overfetch factor (default: 2.0) |
| min_score | float | Minimum score threshold |

SearchResult

| Field | Type | Description |
|---|---|---|
| id | string | External document ID |
| score | float | Relevance score |
| document | Document | Document content |

Example

{
  "query": "body:rust",
  "query_vectors": [
    {"vector": [0.1, 0.2, 0.3], "weight": 1.0}
  ],
  "limit": 10,
  "fusion": {
    "rrf": {"k": 60}
  },
  "field_boosts": {
    "title": 2.0
  }
}

Error Handling

gRPC errors are returned as standard Status codes:

| Laurus Error | gRPC Status | When |
|---|---|---|
| Schema / Query / Field / JSON | INVALID_ARGUMENT | Malformed request or schema |
| No index open | FAILED_PRECONDITION | RPC called before CreateIndex |
| Index already exists | ALREADY_EXISTS | CreateIndex called twice |
| Not implemented | UNIMPLEMENTED | Feature not yet supported |
| Internal errors | INTERNAL | I/O, storage, or unexpected errors |

Advanced Features

This section covers advanced topics for users who want to go deeper into Laurus’s capabilities.

Topics

Query DSL

A human-readable query language for lexical, vector, and hybrid search. Supports boolean operators, phrase matching, fuzzy search, range queries, and more — all in a single query string.

ID Management

How Laurus manages document identity with a dual-tiered ID system:

  • External IDs (user-provided strings)
  • Internal IDs (shard-prefixed u64 for performance)

Persistence & WAL

How Laurus ensures data durability through Write-Ahead Logging (WAL) and the commit lifecycle.

Deletions & Compaction

How documents are deleted (logical deletion via bitmaps) and how space is reclaimed (compaction).

Error Handling

Understanding LaurusError and Result<T> for robust application development. Covers all error variants, matching patterns, and common error scenarios.

Extensibility

Implementing custom components by extending Laurus’s trait-based abstractions:

  • Custom Analyzer for text analysis
  • Custom Embedder for vector embeddings
  • Custom Storage for new backends

Query DSL

Laurus provides a unified query DSL (Domain Specific Language) that allows lexical (keyword) and vector (semantic) search in a single query string. The UnifiedQueryParser splits the input into lexical and vector portions and delegates to the appropriate sub-parser.

Overview

title:hello AND content:~"cute kitten"^0.8
|--- lexical --|    |--- vector --------|

The ~" pattern distinguishes vector clauses from lexical clauses. Everything else is treated as a lexical query.

Lexical Query Syntax

Lexical queries search the inverted index using exact or approximate keyword matching.

Term Query

Match a single term against a field (or the default field):

hello
title:hello

Boolean Operators

Combine clauses with AND and OR (case-insensitive):

title:hello AND body:world
title:hello OR title:goodbye

Space-separated clauses without an explicit operator use implicit boolean (behaves like OR with scoring).

Required / Prohibited Clauses

Use + (must match) and - (must not match):

+title:hello -title:goodbye

Phrase Query

Match an exact phrase using double quotes. Optional proximity (~N) allows N words between terms:

"hello world"
"hello world"~2

Fuzzy Query

Approximate matching with edit distance. Append ~ and optionally the maximum edit distance:

roam~
roam~2

Wildcard Query

Use ? (single character) and * (zero or more characters):

te?t
test*

Range Query

Inclusive [] or exclusive {} ranges, useful for numeric and date fields:

price:[100 TO 500]
date:{2024-01-01 TO 2024-12-31}
price:[* TO 100]

Boost

Increase the weight of a clause with ^:

title:hello^2
"important phrase"^1.5

Grouping

Use parentheses for sub-expressions:

(title:hello OR title:hi) AND body:world

PEG Grammar

The full lexical grammar (parser.pest):

query          = { SOI ~ boolean_query ~ EOI }
boolean_query  = { clause ~ (boolean_op ~ clause | clause)* }
clause         = { required_clause | prohibited_clause | sub_clause }
required_clause   = { "+" ~ sub_clause }
prohibited_clause = { "-" ~ sub_clause }
sub_clause     = { grouped_query | field_query | term_query }
grouped_query  = { "(" ~ boolean_query ~ ")" ~ boost? }
boolean_op     = { ^"AND" | ^"OR" }
field_query    = { field ~ ":" ~ field_value }
field_value    = { range_query | phrase_query | fuzzy_term
                 | wildcard_term | simple_term }
phrase_query   = { "\"" ~ phrase_content ~ "\"" ~ proximity? ~ boost? }
proximity      = { "~" ~ number }
fuzzy_term     = { term ~ "~" ~ fuzziness? ~ boost? }
wildcard_term  = { wildcard_pattern ~ boost? }
simple_term    = { term ~ boost? }
boost          = { "^" ~ boost_value }

Vector Query Syntax

Vector queries embed text into vectors at parse time and perform similarity search.

Basic Syntax

field:~"text"
field:~"text"^weight
| Element | Required | Description | Example |
|---|---|---|---|
| `field:` | No | Target vector field name | `content:` |
| `~` | Yes | Vector query marker | |
| `"text"` | Yes | Text to embed | `"cute kitten"` |
| `^weight` | No | Score weight (default: 1.0) | `^0.8` |

Examples

# Single field
content:~"cute kitten"

# With boost weight
content:~"cute kitten"^0.8

# Default field (when configured)
~"cute kitten"

# Multiple clauses
content:~"cats" image:~"dogs"^0.5

# Nested field name (dot notation)
metadata.embedding:~"text"

Multiple Clauses

Multiple vector clauses are space-separated. All clauses are executed and their scores are combined using the score_mode (default: WeightedSum):

content:~"cats" image:~"dogs"^0.5

This produces:

score = similarity("cats", content) * 1.0
      + similarity("dogs", image)   * 0.5

There are no AND/OR operators in the vector DSL. Vector search is inherently a ranking operation, and the weight (^) controls the contribution of each clause.
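The score formula above is a plain weighted sum over (similarity, weight) pairs. A sketch, with placeholder similarity values standing in for the embedder's output:

```rust
// WeightedSum over vector clauses: each clause contributes
// similarity * weight, matching the formula shown above.
fn weighted_sum(clauses: &[(f32, f32)]) -> f32 {
    // (similarity, weight) pairs
    clauses.iter().map(|(sim, w)| sim * w).sum()
}

fn main() {
    // similarity("cats", content) = 0.9 at weight 1.0,
    // similarity("dogs", image)   = 0.6 at weight 0.5
    println!("{}", weighted_sum(&[(0.9, 1.0), (0.6, 0.5)])); // ~1.2
}
```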

Score Modes

| Mode | Description |
|---|---|
| WeightedSum (default) | Sum of (similarity * weight) across all clauses |
| MaxSim | Maximum similarity score across clauses |
| LateInteraction | Late interaction scoring |

Score mode cannot be set from DSL syntax. Use the Rust API to override:

#![allow(unused)]
fn main() {
let mut request = parser.parse(r#"content:~"cats" image:~"dogs""#).await?;
request.score_mode = VectorScoreMode::MaxSim;
}

PEG Grammar

The full vector grammar (parser.pest):

query          = { SOI ~ vector_clause+ ~ EOI }
vector_clause  = { field_prefix? ~ "~" ~ quoted_text ~ boost? }
field_prefix   = { field_name ~ ":" }
field_name     = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHANUMERIC | "_" | ".")* }
quoted_text    = ${ "\"" ~ inner_text ~ "\"" }
inner_text     = @{ (!("\"") ~ ANY)* }
boost          = { "^" ~ float_value }
float_value    = @{ ASCII_DIGIT+ ~ ("." ~ ASCII_DIGIT+)? }

Unified (Hybrid) Query Syntax

The UnifiedQueryParser allows mixing lexical and vector clauses freely in a single query string:

title:hello content:~"cute kitten"^0.8

How It Works

  1. Split: Vector clauses (matching field:~"text"^boost pattern) are extracted via regex.
  2. Delegate: Vector portion goes to VectorQueryParser, remainder goes to lexical QueryParser.
  3. Fuse: If both lexical and vector results exist, they are combined using a fusion algorithm.

Disambiguation

The ~" pattern unambiguously identifies vector clauses because in lexical syntax, ~ only appears after a term or phrase (e.g., roam~2, "hello world"~10), never before a quote.
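A tiny sketch of this disambiguation rule (the real parser uses a PEG grammar and regex splitting; this scanner only illustrates why the rule is unambiguous):

```rust
// `~` immediately followed by a double quote marks a vector clause.
// In lexical syntax, `~` only ever appears *after* a term or phrase
// (fuzziness like roam~2, proximity like "hello world"~10).
fn contains_vector_clause(query: &str) -> bool {
    query.contains("~\"")
}

fn main() {
    println!("{}", contains_vector_clause(r#"content:~"cute kitten""#)); // true
    println!("{}", contains_vector_clause("roam~2"));                    // false
}
```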

Fusion Algorithms

When a query contains both lexical and vector clauses, results are fused:

| Algorithm | Formula | Description |
|---|---|---|
| RRF (default) | score = sum(1 / (k + rank)) | Reciprocal Rank Fusion. Robust to different score distributions. Default k=60. |
| WeightedSum | score = lexical * a + vector * b | Linear combination with configurable weights. |

Note: The fusion algorithm cannot be specified in the DSL syntax. It is configured when constructing the UnifiedQueryParser via .with_fusion(). The default is RRF (k=60). See Custom Fusion for a code example.
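The RRF formula is simple enough to compute by hand. A self-contained sketch over two ranked result lists (illustrative, not the engine's implementation):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank)
// per document, using 1-based ranks and the default k = 60.
fn rrf(lists: &[Vec<&str>], k: f64) -> HashMap<String, f64> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    scores
}

fn main() {
    let lexical = vec!["a", "b", "c"];
    let vector = vec!["b", "a"];
    // "a" and "b" each score 1/61 + 1/62; "c" scores 1/63.
    println!("{:?}", rrf(&[lexical, vector], 60.0));
}
```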

Examples

# Lexical only — no fusion
title:hello AND body:world

# Vector only — no fusion
content:~"cute kitten"

# Hybrid — fusion applied automatically
title:hello content:~"cute kitten"

# Hybrid with boolean operators
title:hello AND category:animal content:~"cute kitten"^0.8

# Multiple vector clauses + lexical
category:animal content:~"cats" image:~"dogs"^0.5

# Default fields (when configured)
hello ~"cats"

Code Examples

Lexical Search with DSL

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::lexical::query::QueryParser;

let analyzer = Arc::new(StandardAnalyzer::new()?);
let parser = QueryParser::new(analyzer)
    .with_default_field("title");

let query = parser.parse("title:hello AND body:world")?;
}

Vector Search with DSL

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::vector::query::VectorQueryParser;

let parser = VectorQueryParser::new(embedder)
    .with_default_field("content");

let request = parser.parse(r#"content:~"cute kitten"^0.8"#).await?;
}

Hybrid Search with Unified DSL

#![allow(unused)]
fn main() {
use laurus::engine::query::UnifiedQueryParser;

let unified = UnifiedQueryParser::new(lexical_parser, vector_parser);

let request = unified.parse(
    r#"title:hello content:~"cute kitten"^0.8"#
).await?;
// request.lexical_search_request  -> Some(...)  — lexical query
// request.vector_search_request   -> Some(...)  — vector query
// request.fusion_algorithm        -> Some(RRF)  — fusion algorithm
}

Custom Fusion

#![allow(unused)]
fn main() {
use laurus::engine::search::FusionAlgorithm;

let unified = UnifiedQueryParser::new(lexical_parser, vector_parser)
    .with_fusion(FusionAlgorithm::WeightedSum {
        lexical_weight: 0.3,
        vector_weight: 0.7,
    });
}

ID Management

Laurus uses a dual-tiered ID management strategy to ensure efficient document retrieval, updates, and aggregation in distributed environments.

1. External ID (String)

The External ID is a logical identifier used by users and applications to uniquely identify a document.

  • Type: String
  • Role: You can use any unique value, such as UUIDs, URLs, or database primary keys.
  • Storage: Persisted transparently as a reserved system field name _id within the Lexical Index.
  • Uniqueness: Expected to be unique across the entire system.
  • Updates: Indexing a document with an existing external_id triggers an automatic “Delete-then-Insert” (Upsert) operation, replacing the old version with the new one.

2. Internal ID (u64 / Stable ID)

The Internal ID is a physical handle used internally by Laurus’s engines (Lexical and Vector) for high-performance operations.

  • Type: Unsigned 64-bit Integer (u64)
  • Role: Used for bitmap operations, point references, and routing between distributed nodes.
  • Immutability (Stable): Once assigned, an Internal ID never changes due to index merges (segment compaction) or restarts. This prevents inconsistencies in deletion logs and caches.

ID Structure (Shard-Prefixed)

Laurus employs a Shard-Prefixed Stable ID scheme designed for multi-node distributed environments.

| Bit Range | Name | Description |
|---|---|---|
| Bits 48–63 | Shard ID | Prefix identifying the node or partition (up to 65,535 shards). |
| Bits 0–47 | Local ID | Monotonically increasing document number within a shard (up to ~281 trillion documents). |

Why this structure?

  1. Zero-Cost Aggregation: Since u64 IDs are globally unique, the aggregator can perform fast sorting and deduplication without worrying about ID collisions between nodes.
  2. Fast Routing: The aggregator can immediately identify the physical node responsible for a document just by looking at the upper bits, avoiding expensive hash lookups.
  3. High-Performance Fetching: Internal IDs map directly to physical data structures. This allows Laurus to skip the “External-to-Internal ID” conversion step during retrieval, achieving O(1) access speed.
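The layout above amounts to simple bit packing. A sketch of how such an ID could be packed and unpacked (function names are illustrative, not the Laurus API):

```rust
// Shard-prefixed stable ID: shard ID in the top 16 bits,
// local ID in the low 48 bits.
const LOCAL_BITS: u64 = 48;
const LOCAL_MASK: u64 = (1 << LOCAL_BITS) - 1;

fn pack(shard_id: u16, local_id: u64) -> u64 {
    assert!(local_id <= LOCAL_MASK, "local ID overflows 48 bits");
    ((shard_id as u64) << LOCAL_BITS) | local_id
}

fn unpack(internal_id: u64) -> (u16, u64) {
    // The aggregator can route by inspecting only the upper bits.
    ((internal_id >> LOCAL_BITS) as u16, internal_id & LOCAL_MASK)
}

fn main() {
    let id = pack(3, 42);
    println!("{:?}", unpack(id)); // (3, 42)
}
```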

ID Lifecycle

  1. Registration (engine.put_document() / engine.add_document()): User provides a document with an External ID.
  2. ID Assignment: The Engine combines the current shard_id with a new Local ID to issue a Shard-Prefixed Internal ID.
  3. Mapping: The engine maintains the relationship between the External ID and the new Internal ID.
  4. Search: Search results return the External ID (String), resolved from the Internal ID.
  5. Retrieval/Deletion: While the user-facing API accepts External IDs for convenience, the engine internally converts them to Internal IDs for near-instant processing.

Persistence & WAL

Laurus uses a Write-Ahead Log (WAL) to ensure data durability. Every write operation is persisted to the WAL before modifying in-memory structures, guaranteeing that no data is lost even if the process crashes.

Write Path

```mermaid
sequenceDiagram
    participant App as Application
    participant Engine
    participant WAL as DocumentLog (WAL)
    participant Mem as In-Memory Buffers
    participant Disk as Storage (segments)

    App->>Engine: add_document() / delete_documents()
    Engine->>WAL: 1. Append operation to WAL
    Engine->>Mem: 2. Update in-memory buffers

    Note over Mem: Document is buffered but\nNOT yet searchable

    App->>Engine: commit()
    Engine->>Disk: 3. Flush segments to storage
    Engine->>WAL: 4. Truncate WAL
    Note over Disk: Documents are now\nsearchable and durable
```

Key Principles

  1. WAL-first: Every write (add or delete) is appended to the WAL before updating in-memory structures
  2. Buffered writes: In-memory buffers accumulate changes until commit() is called
  3. Atomic commit: commit() flushes all buffered changes to segment files and truncates the WAL
  4. Crash safety: If the process crashes between writes and commit, the WAL is replayed on the next startup

Write-Ahead Log (WAL)

The WAL is managed by the DocumentLog component and stored at the root level of the storage backend (engine.wal).

WAL Entry Types

| Entry Type | Description |
|---|---|
| Upsert | Document content + external ID + assigned internal ID |
| Delete | External ID of the document to remove |

WAL File

The WAL file (engine.wal) is an append-only binary log. Each entry is self-contained with:

  • Operation type (add/delete)
  • Sequence number
  • Payload (document data or ID)
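
A length-prefixed framing like the following is one way to picture such self-contained entries (the actual on-disk layout is not documented here, so treat this as an illustration only):

```rust
/// Illustrative WAL record framing:
/// [op: u8][seq: u64 LE][len: u32 LE][payload bytes]
fn encode_entry(op: u8, seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(13 + payload.len());
    buf.push(op);
    buf.extend_from_slice(&seq.to_le_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Each entry can be decoded independently, which is what makes replay simple.
fn decode_entry(buf: &[u8]) -> Option<(u8, u64, &[u8])> {
    if buf.len() < 13 {
        return None;
    }
    let op = buf[0];
    let seq = u64::from_le_bytes(buf[1..9].try_into().ok()?);
    let len = u32::from_le_bytes(buf[9..13].try_into().ok()?) as usize;
    buf.get(13..13 + len).map(|payload| (op, seq, payload))
}
```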

Recovery

When an engine is built (Engine::builder(...).build().await), it automatically checks for remaining WAL entries and replays them (the WAL is truncated on commit, so any remaining entries are from a crashed session):

```mermaid
graph TD
    Start["Engine::build()"] --> Check["Check WAL for\nuncommitted entries"]
    Check -->|"Entries found"| Replay["Replay operations\ninto in-memory buffers"]
    Replay --> Ready["Engine ready"]
    Check -->|"No entries"| Ready
```

Recovery is transparent — you do not need to handle it manually.

The Commit Lifecycle

```rust
// 1. Add documents (buffered, not yet searchable)
engine.add_document("doc-1", doc1).await?;
engine.add_document("doc-2", doc2).await?;

// 2. Commit — flush to persistent storage
engine.commit().await?;
// Documents are now searchable

// 3. Add more documents
engine.add_document("doc-3", doc3).await?;

// 4. If the process crashes here, doc-3 is in the WAL
//    and will be recovered on next startup
```

When to Commit

| Strategy | Description | Use Case |
|---|---|---|
| After each document | Maximum durability, minimum search latency | Real-time search with few writes |
| After a batch | Good balance of throughput and latency | Bulk indexing |
| Periodically | Maximum write throughput | High-volume ingestion |

Tip: Commits are relatively expensive because they flush segments to storage. For bulk indexing, batch many documents before calling commit().

Storage Layout

The engine uses PrefixedStorage to organize data:

```text
<storage root>/
├── lexical/          # Inverted index segments
│   ├── seg-000/
│   │   ├── terms.dict
│   │   ├── postings.post
│   │   └── ...
│   └── metadata.json
├── vector/           # Vector index segments
│   ├── seg-000/
│   │   ├── graph.hnsw
│   │   ├── vectors.vecs
│   │   └── ...
│   └── metadata.json
├── documents/        # Document storage
│   └── ...
└── engine.wal        # Write-ahead log
```

Next Steps

Deletions & Compaction

Laurus uses a two-phase deletion strategy: fast logical deletion followed by periodic physical compaction.

Deleting Documents

```rust
// Delete a document by its external ID
engine.delete_documents("doc-1").await?;
engine.commit().await?;
```

Logical Deletion

When a document is deleted, it is not immediately removed from the index files. Instead:

```mermaid
graph LR
    Del["delete_documents('doc-1')"] --> Bitmap["Add internal ID\nto Deletion Bitmap"]
    Bitmap --> Search["Search skips\ndeleted IDs"]
```

  1. The document’s internal ID is added to a deletion bitmap
  2. The bitmap is checked during every search, filtering out deleted documents from results
  3. The original data remains in the segment files

Why Logical Deletion?

| Benefit | Description |
|---|---|
| Speed | O(1) — flipping a bit is instant |
| Immutable segments | Segment files are never modified in place, simplifying concurrency |
| Safe recovery | If a crash occurs, the deletion bitmap can be reconstructed from the WAL |
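
The mechanics are as simple as they sound. A minimal sketch, assuming a hash-set-backed bitmap as described later in this chapter (the type is illustrative, not the actual Laurus struct):

```rust
use std::collections::HashSet;

/// Illustrative logical-deletion tracker.
#[derive(Default)]
struct DeletionBitmap {
    deleted: HashSet<u64>,
}

impl DeletionBitmap {
    /// O(1): just record the internal ID; segment files stay untouched.
    fn delete(&mut self, internal_id: u64) {
        self.deleted.insert(internal_id);
    }

    /// Every search result passes through a check like this.
    fn is_live(&self, internal_id: u64) -> bool {
        !self.deleted.contains(&internal_id)
    }
}
```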

Upserts (Update = Delete + Insert)

When you index a document with an existing external ID, Laurus performs an automatic upsert:

  1. The old document is logically deleted (its ID is added to the deletion bitmap)
  2. A new document is inserted with a new internal ID
  3. The external-to-internal ID mapping is updated

```rust
// First insert
engine.put_document("doc-1", doc_v1).await?;
engine.commit().await?;

// Update: old version is logically deleted, new version is inserted
engine.put_document("doc-1", doc_v2).await?;
engine.commit().await?;
```

Physical Compaction

Over time, logically deleted documents accumulate and waste space. Compaction reclaims this space by rewriting segment files without the deleted entries.

```mermaid
graph LR
    subgraph "Before Compaction"
        S1["Segment 0\ndoc-1 (deleted)\ndoc-2\ndoc-3 (deleted)"]
        S2["Segment 1\ndoc-4\ndoc-5"]
    end

    Compact["Compaction"]

    subgraph "After Compaction"
        S3["Segment 0\ndoc-2\ndoc-4\ndoc-5"]
    end

    S1 --> Compact
    S2 --> Compact
    Compact --> S3
```

What Compaction Does

  1. Reads all live (non-deleted) documents from existing segments
  2. Rebuilds the inverted index and/or vector index without deleted entries
  3. Writes new, clean segment files
  4. Removes the old segment files
  5. Resets the deletion bitmap

Cost and Frequency

| Aspect | Detail |
|---|---|
| CPU cost | High — rebuilds index structures from scratch |
| I/O cost | High — reads all data, writes new segments |
| Blocking | Searches continue during compaction (reads see the old segments until the new ones are ready) |
| Frequency | Run when deleted documents exceed a threshold (e.g., 10-20% of total) |

When to Compact

  • Low-write workloads: Compact periodically (e.g., daily or weekly)
  • High-write workloads: Compact when the deletion ratio exceeds a threshold
  • After bulk updates: Compact after a large batch of upserts
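
A threshold check is all a compaction trigger needs. A sketch of the ratio-based policy above (the helper is illustrative, not a Laurus API):

```rust
/// Trigger compaction once the deletion ratio crosses a threshold (e.g. 0.15).
fn should_compact(total_docs: u64, deleted_docs: u64, threshold: f64) -> bool {
    total_docs > 0 && (deleted_docs as f64 / total_docs as f64) >= threshold
}
```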

Deletion Bitmap

The deletion bitmap tracks which internal IDs have been deleted:

  • Storage: HashSet of deleted document IDs (AHashSet<u64>)
  • Lookup: O(1) — hash set lookup

The bitmap is persisted alongside the index segments and is rebuilt from the WAL during recovery.

Next Steps

Error Handling

Laurus uses a unified error type for all operations. Understanding the error system helps you write robust applications that handle failures gracefully.

LaurusError

All Laurus operations return Result<T>, which is an alias for std::result::Result<T, LaurusError>.

LaurusError is an enum with variants for each category of failure:

| Variant | Description | Common Causes |
|---|---|---|
| `Io` | I/O errors | File not found, permission denied, disk full |
| `Index` | Index operation errors | Corrupt index, segment read failure |
| `Schema` | Schema-related errors | Unknown field name, type mismatch |
| `Analysis` | Text analysis errors | Tokenizer failure, invalid filter config |
| `Query` | Query parsing/execution errors | Malformed Query DSL, unknown field in query |
| `Storage` | Storage backend errors | Failed to open storage, write failure |
| `Field` | Field definition errors | Invalid field options, duplicate field name |
| `Json` | JSON serialization errors | Malformed document JSON |
| `InvalidOperation` | Invalid operation | Searching before commit, double close |
| `ResourceExhausted` | Resource limits exceeded | Out of memory, too many open files |
| `SerializationError` | Binary serialization errors | Corrupt data on disk |
| `OperationCancelled` | Operation was cancelled | Timeout, user cancellation |
| `NotImplemented` | Feature not available | Unimplemented operation |
| `Other` | Generic errors | Timeout, invalid config, invalid argument |

Basic Error Handling

Using the ? Operator

The simplest approach — propagate errors to the caller:

```rust
use laurus::{Engine, Result};

async fn index_documents(engine: &Engine) -> Result<()> {
    let doc = laurus::Document::builder()
        .add_text("title", "Rust Programming")
        .build();

    engine.put_document("doc1", doc).await?;
    engine.commit().await?;
    Ok(())
}
```

Matching on Error Variants

When you need different behavior for different error types:

```rust
use laurus::{Engine, LaurusError};

async fn safe_search(engine: &Engine, query: &str) {
    match engine.search(/* request */).await {
        Ok(results) => {
            for result in results {
                println!("{}: {}", result.id, result.score);
            }
        }
        Err(LaurusError::Query(msg)) => {
            eprintln!("Invalid query syntax: {}", msg);
        }
        Err(LaurusError::Io(e)) => {
            eprintln!("Storage I/O error: {}", e);
        }
        Err(e) => {
            eprintln!("Unexpected error: {}", e);
        }
    }
}
```

Classifying Error Variants

Since LaurusError implements std::error::Error, you can use standard error handling patterns:

```rust
use laurus::LaurusError;

fn is_retriable(error: &LaurusError) -> bool {
    matches!(error, LaurusError::Io(_) | LaurusError::ResourceExhausted(_))
}
```
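
Building on a predicate like `is_retriable`, a generic retry loop with exponential backoff might look like this (the helper and its names are illustrative, not part of Laurus):

```rust
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, backing off between attempts.
/// `retriable` decides whether a given error is worth retrying.
fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    retriable: impl Fn(&E) -> bool,
    max_attempts: u32,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(50);
    for attempt in 1..=max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts && retriable(&e) => {
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("max_attempts must be >= 1")
}
```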

Common Error Scenarios

Schema Mismatch

Adding a document with fields that don’t match the schema:

```rust
// Schema has "title" (Text) and "year" (Integer)
let doc = Document::builder()
    .add_text("title", "Hello")
    .add_text("unknown_field", "this field is not in schema")
    .build();

// Fields not in the schema are silently ignored during indexing.
// No error is raised — only schema-defined fields are processed.
```

Query Parsing Errors

Invalid Query DSL syntax returns a Query error:

```rust
use laurus::engine::query::UnifiedQueryParser;
use laurus::LaurusError;

let parser = UnifiedQueryParser::new();
match parser.parse("title:\"unclosed phrase") {
    Ok(request) => { /* ... */ }
    Err(LaurusError::Query(msg)) => {
        // msg contains details about the parse failure
        eprintln!("Bad query: {}", msg);
    }
    Err(e) => { /* other errors */ }
}
```

Storage I/O Errors

File-based storage may encounter I/O errors:

```rust
use laurus::storage::{StorageConfig, StorageFactory};
use laurus::LaurusError;

match StorageFactory::open(StorageConfig::File {
    path: "/nonexistent/path".into(),
    loading_mode: Default::default(),
}) {
    Ok(storage) => { /* ... */ }
    Err(LaurusError::Io(e)) => {
        eprintln!("Cannot open storage: {}", e);
    }
    Err(e) => { /* other errors */ }
}
```

Convenience Constructors

LaurusError provides factory methods for creating errors in custom implementations:

| Method | Creates |
|---|---|
| `LaurusError::index(msg)` | `Index` variant |
| `LaurusError::schema(msg)` | `Schema` variant |
| `LaurusError::analysis(msg)` | `Analysis` variant |
| `LaurusError::query(msg)` | `Query` variant |
| `LaurusError::storage(msg)` | `Storage` variant |
| `LaurusError::field(msg)` | `Field` variant |
| `LaurusError::other(msg)` | `Other` variant |
| `LaurusError::cancelled(msg)` | `OperationCancelled` variant |
| `LaurusError::invalid_argument(msg)` | `Other` with "Invalid argument" prefix |
| `LaurusError::invalid_config(msg)` | `Other` with "Invalid configuration" prefix |
| `LaurusError::not_found(msg)` | `Other` with "Not found" prefix |
| `LaurusError::timeout(msg)` | `Other` with "Timeout" prefix |

These are useful when implementing custom Analyzer, Embedder, or Storage traits:

```rust
use laurus::{LaurusError, Result};

fn validate_dimension(dim: usize) -> Result<()> {
    if dim == 0 {
        return Err(LaurusError::invalid_argument("dimension must be > 0"));
    }
    Ok(())
}
```

Automatic Conversions

LaurusError implements From for common error types, so they convert automatically with ?:

| Source Type | Target Variant |
|---|---|
| `std::io::Error` | `LaurusError::Io` |
| `serde_json::Error` | `LaurusError::Json` |
| `anyhow::Error` | `LaurusError::Anyhow` |
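
The same mechanism is easy to reproduce for your own error types: once a `From` impl exists, `?` performs the conversion automatically. A self-contained sketch (the `MyError` type is a made-up example, not part of Laurus):

```rust
use std::fs;

#[derive(Debug)]
enum MyError {
    Io(std::io::Error),
}

// With this impl, `?` converts std::io::Error into MyError automatically,
// which is the same pattern LaurusError uses for its Io/Json variants.
impl From<std::io::Error> for MyError {
    fn from(e: std::io::Error) -> Self {
        MyError::Io(e)
    }
}

fn read_config(path: &str) -> Result<String, MyError> {
    let contents = fs::read_to_string(path)?; // io::Error -> MyError via From
    Ok(contents)
}
```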

Next Steps

Extensibility

Laurus uses trait-based abstractions for its core components. You can implement these traits to provide custom analyzers, embedders, and storage backends.

Custom Analyzer

Implement the Analyzer trait to create a custom text analysis pipeline:

```rust
use laurus::analysis::analyzer::analyzer::Analyzer;
use laurus::analysis::token::{Token, TokenStream};
use laurus::Result;

#[derive(Debug)]
struct ReverseAnalyzer;

impl Analyzer for ReverseAnalyzer {
    fn analyze(&self, text: &str) -> Result<TokenStream> {
        let tokens: Vec<Token> = text
            .split_whitespace()
            .enumerate()
            .map(|(i, word)| Token {
                text: word.chars().rev().collect(),
                position: i,
                ..Default::default()
            })
            .collect();
        Ok(Box::new(tokens.into_iter()))
    }

    fn name(&self) -> &str {
        "reverse"
    }

    fn as_any(&self) -> &dyn std::any::Any {
        self
    }
}
```

Required Methods

| Method | Description |
|---|---|
| `analyze(&self, text: &str) -> Result<TokenStream>` | Process text into a stream of tokens |
| `name(&self) -> &str` | Return a unique identifier for this analyzer |
| `as_any(&self) -> &dyn Any` | Enable downcasting to the concrete type |

Using a Custom Analyzer

Pass your analyzer to EngineBuilder:

```rust
use std::sync::Arc;

let analyzer = Arc::new(ReverseAnalyzer);
let engine = Engine::builder(storage, schema)
    .analyzer(analyzer)
    .build()
    .await?;
```

For per-field analyzers, wrap with PerFieldAnalyzer:

```rust
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;
use laurus::analysis::analyzer::standard::StandardAnalyzer;

let mut per_field = PerFieldAnalyzer::new(Arc::new(StandardAnalyzer::new()?));
per_field.add_analyzer("custom_field", Arc::new(ReverseAnalyzer));

let engine = Engine::builder(storage, schema)
    .analyzer(Arc::new(per_field))
    .build()
    .await?;
```

Custom Embedder

Implement the Embedder trait to integrate your own vector embedding model:

```rust
use async_trait::async_trait;
use laurus::embedding::embedder::{Embedder, EmbedInput, EmbedInputType};
use laurus::vector::core::vector::Vector;
use laurus::{LaurusError, Result};

#[derive(Debug)]
struct MyEmbedder {
    dimension: usize,
}

#[async_trait]
impl Embedder for MyEmbedder {
    async fn embed(&self, input: &EmbedInput<'_>) -> Result<Vector> {
        match input {
            EmbedInput::Text(text) => {
                // Your embedding logic for `text` goes here
                let vector = vec![0.0f32; self.dimension];
                Ok(Vector::new(vector))
            }
            _ => Err(LaurusError::invalid_argument(
                "this embedder only supports text input",
            )),
        }
    }

    fn supported_input_types(&self) -> Vec<EmbedInputType> {
        vec![EmbedInputType::Text]
    }

    fn name(&self) -> &str {
        "my-embedder"
    }

    fn as_any(&self) -> &dyn std::any::Any {
        self
    }
}
```

Required Methods

| Method | Description |
|---|---|
| `async embed(&self, input: &EmbedInput) -> Result<Vector>` | Generate an embedding vector for the given input |
| `supported_input_types(&self) -> Vec<EmbedInputType>` | Declare supported input types (Text, Image) |
| `as_any(&self) -> &dyn Any` | Enable downcasting |

Optional Methods

| Method | Default | Description |
|---|---|---|
| `async embed_batch(&self, inputs) -> Result<Vec<Vector>>` | Sequential calls to `embed` | Override for batch optimization |
| `name(&self) -> &str` | `"unknown"` | Identifier for logging |
| `supports(&self, input_type) -> bool` | Checks `supported_input_types` | Input type support check |
| `supports_text() -> bool` | Checks for `Text` | Text support shorthand |
| `supports_image() -> bool` | Checks for `Image` | Image support shorthand |
| `is_multimodal() -> bool` | Both text and image | Multimodal check |

Using a Custom Embedder

```rust
let embedder = Arc::new(MyEmbedder { dimension: 384 });
let engine = Engine::builder(storage, schema)
    .embedder(embedder)
    .build()
    .await?;
```

For per-field embedders, wrap with PerFieldEmbedder:

```rust
use laurus::embedding::per_field::PerFieldEmbedder;

let mut per_field = PerFieldEmbedder::new(Arc::new(MyEmbedder { dimension: 384 }));
per_field.add_embedder("image_vec", Arc::new(ClipEmbedder::new()?));

let engine = Engine::builder(storage, schema)
    .embedder(Arc::new(per_field))
    .build()
    .await?;
```

Custom Storage

Implement the Storage trait to add a new storage backend:

```rust
use laurus::storage::{Storage, StorageInput, StorageOutput, LoadingMode, FileMetadata};
use laurus::Result;

#[derive(Debug)]
struct S3Storage {
    bucket: String,
    prefix: String,
}

impl Storage for S3Storage {
    fn loading_mode(&self) -> LoadingMode {
        LoadingMode::Eager  // S3 requires full download
    }

    fn open_input(&self, name: &str) -> Result<Box<dyn StorageInput>> {
        // Download from S3 and return a reader
        todo!()
    }

    fn create_output(&self, name: &str) -> Result<Box<dyn StorageOutput>> {
        // Create an upload stream to S3
        todo!()
    }

    fn create_output_append(&self, name: &str) -> Result<Box<dyn StorageOutput>> {
        todo!()
    }

    fn file_exists(&self, name: &str) -> bool {
        todo!()
    }

    fn delete_file(&self, name: &str) -> Result<()> {
        todo!()
    }

    fn list_files(&self) -> Result<Vec<String>> {
        todo!()
    }

    fn file_size(&self, name: &str) -> Result<u64> {
        todo!()
    }

    fn metadata(&self, name: &str) -> Result<FileMetadata> {
        todo!()
    }

    fn rename_file(&self, old_name: &str, new_name: &str) -> Result<()> {
        todo!()
    }

    fn create_temp_output(&self, prefix: &str) -> Result<(String, Box<dyn StorageOutput>)> {
        todo!()
    }

    fn sync(&self) -> Result<()> {
        todo!()
    }

    fn close(&mut self) -> Result<()> {
        todo!()
    }
}
```

Required Methods

| Method | Description |
|---|---|
| `open_input(name) -> Result<Box<dyn StorageInput>>` | Open a file for reading |
| `create_output(name) -> Result<Box<dyn StorageOutput>>` | Create a file for writing |
| `create_output_append(name) -> Result<Box<dyn StorageOutput>>` | Open a file for appending |
| `file_exists(name) -> bool` | Check if a file exists |
| `delete_file(name) -> Result<()>` | Delete a file |
| `list_files() -> Result<Vec<String>>` | List all files |
| `file_size(name) -> Result<u64>` | Get file size in bytes |
| `metadata(name) -> Result<FileMetadata>` | Get file metadata |
| `rename_file(old, new) -> Result<()>` | Rename a file |
| `create_temp_output(prefix) -> Result<(String, Box<dyn StorageOutput>)>` | Create a temporary file |
| `sync() -> Result<()>` | Flush all pending writes |
| `close(&mut self) -> Result<()>` | Close storage and release resources |

Optional Methods

| Method | Default | Description |
|---|---|---|
| `loading_mode() -> LoadingMode` | `LoadingMode::Eager` | Preferred data loading mode |

Thread Safety

All three traits require Send + Sync. This means your implementations must be safe to share across threads. Use Arc<Mutex<_>> or lock-free data structures for shared mutable state.
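
For example, a minimal shared-state pattern using `Arc<Mutex<_>>` (illustrative only, not an actual backend):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Sketch: an in-memory "file store" whose shared state is behind a Mutex,
/// making the containing type Send + Sync as the Storage trait requires.
#[derive(Default)]
struct SharedFiles {
    files: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

impl SharedFiles {
    fn write(&self, name: &str, data: Vec<u8>) {
        self.files.lock().unwrap().insert(name.to_string(), data);
    }

    fn exists(&self, name: &str) -> bool {
        self.files.lock().unwrap().contains_key(name)
    }
}
```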

Next Steps

Architecture

This page explains how Laurus is structured internally. Understanding the architecture will help you make better decisions about schema design, analyzer selection, and search strategies.

High-Level Overview

Laurus is organized around a single Engine that coordinates four internal components:

```mermaid
graph TB
    subgraph Engine
        SCH["Schema"]
        LS["LexicalStore\n(Inverted Index)"]
        VS["VectorStore\n(HNSW / Flat / IVF)"]
        DL["DocumentLog\n(WAL + Document Storage)"]
    end

    Storage["Storage (trait)\nMemory / File / File+Mmap"]

    LS --- Storage
    VS --- Storage
    DL --- Storage
```

| Component | Responsibility |
|---|---|
| `Schema` | Declares fields and their types; determines how each field is routed |
| `LexicalStore` | Inverted index for keyword search (BM25 scoring) |
| `VectorStore` | Vector index for similarity search (Flat, HNSW, or IVF) |
| `DocumentLog` | Write-ahead log (WAL) for durability + raw document storage |

All three stores share a single Storage backend, isolated by key prefixes (lexical/, vector/, documents/).

Engine Lifecycle

Building an Engine

The EngineBuilder assembles the engine from its parts:

```rust
let engine = Engine::builder(storage, schema)
    .analyzer(analyzer)      // optional: for text fields
    .embedder(embedder)      // optional: for vector fields
    .build()
    .await?;
```

```mermaid
sequenceDiagram
    participant User
    participant EngineBuilder
    participant Engine

    User->>EngineBuilder: new(storage, schema)
    User->>EngineBuilder: .analyzer(analyzer)
    User->>EngineBuilder: .embedder(embedder)
    User->>EngineBuilder: .build().await
    EngineBuilder->>EngineBuilder: split_schema()
    Note over EngineBuilder: Separate fields into\nLexicalIndexConfig\n+ VectorIndexConfig
    EngineBuilder->>Engine: Create LexicalStore
    EngineBuilder->>Engine: Create VectorStore
    EngineBuilder->>Engine: Create DocumentLog
    EngineBuilder->>Engine: Recover from WAL
    EngineBuilder-->>User: Engine ready
```

During build(), the engine:

  1. Splits the schema — lexical fields go to LexicalIndexConfig, vector fields go to VectorIndexConfig
  2. Creates prefixed storage — each component gets an isolated namespace (lexical/, vector/, documents/)
  3. Initializes storesLexicalStore and VectorStore are constructed with their configs
  4. Recovers from WAL — replays any uncommitted operations from a previous session

Schema Splitting

The Schema contains both lexical and vector fields. At build time, split_schema() separates them:

```mermaid
graph LR
    S["Schema\ntitle: Text\nbody: Text\ncategory: Text\npage: Integer\ncontent_vec: HNSW"]

    S --> LC["LexicalIndexConfig\ntitle: TextOption\nbody: TextOption\ncategory: TextOption\npage: IntegerOption\n_id: KeywordAnalyzer"]

    S --> VC["VectorIndexConfig\ncontent_vec: HnswOption\n(dim=384, m=16, ef=200)"]
```

Key details:

  • The reserved _id field is always added to the lexical config with KeywordAnalyzer (exact match)
  • A PerFieldAnalyzer wraps per-field analyzer settings; if you pass a simple StandardAnalyzer, it becomes the default for all text fields
  • A PerFieldEmbedder works the same way for vector fields

Indexing Data Flow

When you call engine.add_document(id, doc):

```mermaid
sequenceDiagram
    participant User
    participant Engine
    participant WAL as DocumentLog (WAL)
    participant Lexical as LexicalStore
    participant Vector as VectorStore

    User->>Engine: add_document("doc-1", doc)
    Engine->>WAL: Append to WAL
    Engine->>Engine: Assign internal ID (u64)

    loop For each field in document
        alt Lexical field (text, integer, etc.)
            Engine->>Lexical: Analyze + index field
        else Vector field
            Engine->>Vector: Embed + index field
        end
    end

    Note over Engine: Document is buffered\nbut NOT yet searchable

    User->>Engine: commit()
    Engine->>Lexical: Flush segments to storage
    Engine->>Vector: Flush segments to storage
    Engine->>WAL: Truncate WAL
    Note over Engine: Documents are\nnow searchable
```

Key points:

  • WAL-first: every write is logged before modifying in-memory structures
  • Dual indexing: each field is routed to either the lexical or vector store based on the schema
  • Commit required: documents become searchable only after commit()

Search Data Flow

When you call engine.search(request):

```mermaid
sequenceDiagram
    participant User
    participant Engine
    participant Lexical as LexicalStore
    participant Vector as VectorStore
    participant Fusion

    User->>Engine: search(request)

    opt Filter query present
        Engine->>Lexical: Execute filter query
        Lexical-->>Engine: Allowed document IDs
    end

    par Lexical search
        Engine->>Lexical: Execute lexical query
        Lexical-->>Engine: Ranked hits (BM25)
    and Vector search
        Engine->>Vector: Execute vector query
        Vector-->>Engine: Ranked hits (similarity)
    end

    alt Both lexical and vector results
        Engine->>Fusion: Fuse results (RRF or WeightedSum)
        Fusion-->>Engine: Merged ranked list
    end

    Engine->>Engine: Apply offset + limit
    Engine-->>User: Vec of SearchResult
```

The search pipeline has three stages:

  1. Filter (optional) — execute a filter query on the lexical index to get a set of allowed document IDs
  2. Search — run lexical and/or vector queries in parallel
  3. Fusion — if both query types are present, merge results using RRF (default, k=60) or WeightedSum
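
The default fusion algorithm, Reciprocal Rank Fusion, scores each document as the sum of 1 / (k + rank) over every result list it appears in. A standalone sketch (the rank convention of starting at 1 is an assumption; Laurus may offset ranks differently):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked ID lists, best-first output.
fn rrf(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (i, id) in list.iter().enumerate() {
            // rank is 1-based: the top hit contributes 1 / (k + 1)
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut out: Vec<_> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}
```

A document ranked well in both the lexical and the vector list accumulates two contributions and rises above documents that appear in only one list.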

Storage Architecture

All components share a single Storage trait implementation, but use key prefixes to isolate their data:

```mermaid
graph TB
    Engine --> PS1["PrefixedStorage\nprefix: 'lexical/'"]
    Engine --> PS2["PrefixedStorage\nprefix: 'vector/'"]
    Engine --> PS3["PrefixedStorage\nprefix: 'documents/'"]

    PS1 --> S["Storage Backend\n(Memory / File / File+Mmap)"]
    PS2 --> S
    PS3 --> S
```

| Backend | Description | Best For |
|---|---|---|
| `MemoryStorage` | All data in memory | Testing, small datasets, ephemeral use |
| `FileStorage` | Standard file I/O | General production use |
| `FileStorage` (mmap) | Memory-mapped files (`use_mmap = true`) | Large datasets, read-heavy workloads |

Per-Field Dispatch

When a PerFieldAnalyzer is provided, the engine dispatches analysis to field-specific analyzers. The same pattern applies to PerFieldEmbedder.

```mermaid
graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KA["KeywordAnalyzer"]
    PFA -->|"body"| SA["StandardAnalyzer"]
    PFA -->|"description"| JA["JapaneseAnalyzer"]
    PFA -->|"_id"| KA2["KeywordAnalyzer\n(always)"]
    PFA -->|other fields| DEF["Default Analyzer\n(StandardAnalyzer)"]
```

This allows different fields to use different analysis strategies within the same engine.

Summary

| Aspect | Detail |
|---|---|
| Core struct | `Engine` — coordinates all operations |
| Builder | `EngineBuilder` — assembles `Engine` from Storage + Schema + Analyzer + Embedder |
| Schema split | Lexical fields → `LexicalIndexConfig`, vector fields → `VectorIndexConfig` |
| Write path | WAL → in-memory buffers → `commit()` → persistent storage |
| Read path | Query → parallel lexical/vector search → fusion → ranked results |
| Storage isolation | `PrefixedStorage` with `lexical/`, `vector/`, `documents/` prefixes |
| Per-field dispatch | `PerFieldAnalyzer` and `PerFieldEmbedder` route to field-specific implementations |

Next Steps

API Reference

This page provides a quick reference of the most important types and methods in Laurus. For full details, generate the Rustdoc:

```shell
cargo doc --open
```

Engine

The central coordinator for all indexing and search operations.

| Method | Description |
|---|---|
| `Engine::builder(storage, schema)` | Create an `EngineBuilder` |
| `engine.put_document(id, doc).await?` | Upsert a document (replace if ID exists) |
| `engine.add_document(id, doc).await?` | Add a document as a chunk (multiple chunks can share an ID) |
| `engine.delete_documents(id).await?` | Delete all documents/chunks by external ID |
| `engine.get_documents(id).await?` | Get all documents/chunks by external ID |
| `engine.search(request).await?` | Execute a search request |
| `engine.commit().await?` | Flush all pending changes to storage |
| `engine.stats()?` | Get index statistics |

put_document vs add_document: put_document performs an upsert — if a document with the same external ID already exists, it is deleted and replaced. add_document always appends, allowing multiple document chunks to share the same external ID. See Schema & Fields — Indexing Documents for details.

EngineBuilder

| Method | Description |
|---|---|
| `EngineBuilder::new(storage, schema)` | Create a builder with storage and schema |
| `.analyzer(Arc<dyn Analyzer>)` | Set the text analyzer (default: `StandardAnalyzer`) |
| `.embedder(Arc<dyn Embedder>)` | Set the vector embedder (optional) |
| `.build().await?` | Build the `Engine` |

Schema

Defines document structure.

| Method | Description |
|---|---|
| `Schema::builder()` | Create a `SchemaBuilder` |

SchemaBuilder

MethodDescription
.add_text_field(name, TextOption)Add a full-text field
.add_integer_field(name, IntegerOption)Add an integer field
.add_float_field(name, FloatOption)Add a float field
.add_boolean_field(name, BooleanOption)Add a boolean field
.add_datetime_field(name, DateTimeOption)Add a datetime field
.add_geo_field(name, GeoOption)Add a geographic field
.add_bytes_field(name, BytesOption)Add a binary field
.add_hnsw_field(name, HnswOption)Add an HNSW vector field
.add_flat_field(name, FlatOption)Add a Flat vector field
.add_ivf_field(name, IvfOption)Add an IVF vector field
.add_default_field(name)Set a default search field
.build()Build the Schema

Document

A collection of named field values.

| Method | Description |
|---|---|
| `Document::builder()` | Create a `DocumentBuilder` |
| `doc.get(name)` | Get a field value by name |
| `doc.has_field(name)` | Check if a field exists |
| `doc.field_names()` | Get all field names |

DocumentBuilder

| Method | Description |
|---|---|
| `.add_text(name, value)` | Add a text field |
| `.add_integer(name, value)` | Add an integer field |
| `.add_float(name, value)` | Add a float field |
| `.add_boolean(name, value)` | Add a boolean field |
| `.add_datetime(name, value)` | Add a datetime field |
| `.add_vector(name, vec)` | Add a pre-computed vector |
| `.add_geo(name, lat, lon)` | Add a geographic point |
| `.add_bytes(name, data)` | Add binary data |
| `.build()` | Build the `Document` |

Search

SearchRequestBuilder

| Method | Description |
|---|---|
| `SearchRequestBuilder::new()` | Create a new builder |
| `.lexical_search_request(req)` | Set the lexical search component |
| `.vector_search_request(req)` | Set the vector search component |
| `.filter_query(query)` | Set a pre-filter query |
| `.fusion_algorithm(algo)` | Set the fusion algorithm (default: RRF) |
| `.limit(n)` | Maximum results (default: 10) |
| `.offset(n)` | Skip N results (default: 0) |
| `.build()` | Build the `SearchRequest` |

LexicalSearchRequest

| Method | Description |
|---|---|
| `LexicalSearchRequest::new(query)` | Create with a query |
| `LexicalSearchRequest::from_dsl(query_str)` | Create from a DSL query string |
| `.limit(n)` | Maximum results |
| `.load_documents(bool)` | Whether to load document content |
| `.min_score(f32)` | Minimum score threshold |
| `.timeout_ms(u64)` | Search timeout in milliseconds |
| `.parallel(bool)` | Enable parallel search |
| `.sort_by_field_asc(field)` | Sort by field ascending |
| `.sort_by_field_desc(field)` | Sort by field descending |
| `.sort_by_score()` | Sort by relevance score (default) |
| `.with_field_boost(field, boost)` | Add field-level boost |

VectorSearchRequestBuilder

| Method | Description |
|---|---|
| `VectorSearchRequestBuilder::new()` | Create a new builder |
| `.add_text(field, text)` | Add a text query for a field |
| `.add_vector(field, vector)` | Add a pre-computed query vector |
| `.add_bytes(field, bytes, mime)` | Add a binary payload (for multimodal) |
| `.limit(n)` | Maximum results |
| `.score_mode(VectorScoreMode)` | Score combination mode (WeightedSum, MaxSim) |
| `.min_score(f32)` | Minimum score threshold |
| `.field(name)` | Restrict search to a specific field |
| `.build()` | Build the request |

SearchResult

| Field | Type | Description |
|---|---|---|
| `id` | `String` | External document ID |
| `score` | `f32` | Relevance score |
| `document` | `Option<Document>` | Document content (if loaded) |

FusionAlgorithm

| Variant | Description |
|---|---|
| `RRF { k: f64 }` | Reciprocal Rank Fusion (default k=60.0) |
| `WeightedSum { lexical_weight, vector_weight }` | Linear combination of scores |

Query Types (Lexical)

| Query | Description | Example |
|---|---|---|
| `TermQuery::new(field, term)` | Exact term match | `TermQuery::new("body", "rust")` |
| `PhraseQuery::new(field, terms)` | Exact phrase | `PhraseQuery::new("body", vec!["machine".into(), "learning".into()])` |
| `BooleanQueryBuilder::new()` | Boolean combination | `.must(q1).should(q2).must_not(q3).build()` |
| `FuzzyQuery::new(field, term)` | Fuzzy match (default max_edits=2) | `FuzzyQuery::new("body", "programing").max_edits(1)` |
| `WildcardQuery::new(field, pattern)` | Wildcard | `WildcardQuery::new("file", "*.pdf")` |
| `NumericRangeQuery::new(...)` | Numeric range | See Lexical Search |
| `GeoQuery::within_radius(...)` | Geo radius | See Lexical Search |
| `SpanNearQuery::new(...)` | Proximity | See Lexical Search |
| `PrefixQuery::new(field, prefix)` | Prefix match | `PrefixQuery::new("body", "pro")` |
| `RegexpQuery::new(field, pattern)?` | Regex match | `RegexpQuery::new("body", "^pro.*ing$")?` |

Query Parsers

| Parser | Description |
|---|---|
| `QueryParser::new(analyzer)` | Parse lexical DSL queries |
| `VectorQueryParser::new(embedder)` | Parse vector DSL queries |
| `UnifiedQueryParser::new(lexical, vector)` | Parse hybrid DSL queries |

Analyzers

| Type | Description |
|---|---|
| `StandardAnalyzer` | RegexTokenizer + lowercase + stop words |
| `SimpleAnalyzer` | Tokenization only (no filtering) |
| `EnglishAnalyzer` | RegexTokenizer + lowercase + English stop words |
| `JapaneseAnalyzer` | Japanese morphological analysis |
| `KeywordAnalyzer` | No tokenization (exact match) |
| `PipelineAnalyzer` | Custom tokenizer + filter chain |
| `PerFieldAnalyzer` | Per-field analyzer dispatch |

Embedders

| Type | Feature Flag | Description |
|---|---|---|
| `CandleBertEmbedder` | `embeddings-candle` | Local BERT model |
| `OpenAIEmbedder` | `embeddings-openai` | OpenAI API |
| `CandleClipEmbedder` | `embeddings-multimodal` | Local CLIP model |
| `PrecomputedEmbedder` | (default) | Pre-computed vectors |
| `PerFieldEmbedder` | (default) | Per-field embedder dispatch |

Storage

| Type | Description |
|---|---|
| `MemoryStorage` | In-memory (non-durable) |
| `FileStorage` | File-system based (supports `use_mmap` for memory-mapped I/O) |
| `StorageFactory::create(config)` | Create from config |

DataValue

| Variant | Rust Type |
|---|---|
| `DataValue::Null` | (no payload) |
| `DataValue::Bool(bool)` | `bool` |
| `DataValue::Int64(i64)` | `i64` |
| `DataValue::Float64(f64)` | `f64` |
| `DataValue::Text(String)` | `String` |
| `DataValue::Bytes(Vec<u8>, Option<String>)` | `(data, mime_type)` |
| `DataValue::Vector(Vec<f32>)` | `Vec<f32>` |
| `DataValue::DateTime(DateTime<Utc>)` | `chrono::DateTime<Utc>` |
| `DataValue::Geo(f64, f64)` | `(latitude, longitude)` |