Laurus

A fast, featureful hybrid search library for Rust.

Laurus is a pure-Rust library that combines lexical search (keyword matching via inverted index) and vector search (semantic similarity via embeddings) into a single, unified engine. It is designed to be embedded directly into your Rust application — no external server required.

Key Features

Feature | Description
Lexical Search | Full-text search powered by an inverted index with BM25 scoring
Vector Search | Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes
Hybrid Search | Combine lexical and vector results with fusion algorithms (RRF, WeightedSum)
Text Analysis | Pluggable analyzer pipeline — tokenizers, filters, stemmers, synonyms
Embeddings | Built-in support for Candle (local BERT/CLIP), OpenAI API, or custom embedders
Storage | Pluggable backends — in-memory, file-based, or memory-mapped
Query DSL | Human-readable query syntax for lexical, vector, and hybrid search
Pure Rust | No C/C++ dependencies in the core — safe, portable, easy to build

How It Works

graph LR
    subgraph Your Application
        D["Document"]
        Q["Query"]
    end

    subgraph Laurus Engine
        SCH["Schema"]
        AN["Analyzer"]
        EM["Embedder"]
        LI["Lexical Index\n(Inverted Index)"]
        VI["Vector Index\n(HNSW / Flat / IVF)"]
        FU["Fusion\n(RRF / WeightedSum)"]
    end

    D --> SCH
    SCH --> AN --> LI
    SCH --> EM --> VI
    Q --> LI --> FU
    Q --> VI --> FU
    FU --> R["Ranked Results"]
  1. Define a Schema — declare your fields and their types (text, integer, vector, etc.)
  2. Build an Engine — attach an analyzer for text and an embedder for vectors
  3. Index Documents — the engine routes each field to the correct index automatically
  4. Search — run lexical, vector, or hybrid queries and get ranked results

Document Map

Section | What You Will Learn
Getting Started | Install Laurus and run your first search in minutes
Architecture | Understand the Engine, its components, and data flow
Core Concepts | Schema, text analysis, embeddings, and storage
Indexing | How inverted indexes and vector indexes work internally
Search | Query types, vector search, and hybrid fusion
Query DSL | Human-readable query syntax for all search types
Library (laurus) | Engine internals, scoring, faceting, and extensibility
CLI (laurus-cli) | Command-line tool for index management and search
Server (laurus-server) | gRPC server with HTTP Gateway
Development Guide | Build, test, and contribute to Laurus

Quick Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema, SearchRequestBuilder, Result};
use laurus::lexical::{TextOption, TermQuery};
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Storage
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    // 2. Schema
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_default_field("body")
        .build();

    // 3. Engine
    let engine = Engine::builder(storage, schema).build().await?;

    // 4. Index a document
    let doc = Document::builder()
        .add_text("title", "Hello Laurus")
        .add_text("body", "A fast search library for Rust")
        .build();
    engine.add_document("doc-1", doc).await?;
    engine.commit().await?;

    // 5. Search
    let request = SearchRequestBuilder::new()
        .lexical_search_request(
            laurus::LexicalSearchRequest::new(
                Box::new(TermQuery::new("body", "rust"))
            )
        )
        .limit(10)
        .build();
    let results = engine.search(request).await?;

    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }
    Ok(())
}

License

Laurus is licensed under the MIT License.

Architecture

This page explains how Laurus is structured internally. Understanding the architecture will help you make better decisions about schema design, analyzer selection, and search strategies.

Project Structure

Laurus is organized as a Cargo workspace with five crates:

graph TB
    CLI["laurus-cli\n(Binary Crate)\nCLI + REPL"]
    SRV["laurus-server\n(Library + Binary)\ngRPC Server + HTTP Gateway"]
    MCP["laurus-mcp\n(Library + Binary)\nMCP Server"]
    PY["laurus-python\n(cdylib)\nPython Bindings"]
    LIB["laurus\n(Library Crate)\nCore Search Engine"]

    CLI --> LIB
    CLI --> SRV
    CLI --> MCP
    SRV --> LIB
    MCP --> SRV
    MCP --> LIB
    PY --> LIB

Crate | Type | Description
laurus | Library | Core search engine – lexical, vector, and hybrid search
laurus-cli | Binary | Command-line interface for index management and search
laurus-server | Library + Binary | gRPC server with optional HTTP/JSON gateway
laurus-mcp | Library + Binary | MCP (Model Context Protocol) server
laurus-python | cdylib | Python bindings via PyO3

For details on each crate, see the corresponding chapters of this book.

High-Level Overview

Laurus is organized around a single Engine that coordinates four internal components:

graph TB
    subgraph Engine
        SCH["Schema"]
        LS["LexicalStore\n(Inverted Index)"]
        VS["VectorStore\n(HNSW / Flat / IVF)"]
        DL["DocumentLog\n(WAL + Document Storage)"]
    end

    Storage["Storage (trait)\nMemory / File / File+Mmap"]

    LS --- Storage
    VS --- Storage
    DL --- Storage

Component | Responsibility
Schema | Declares fields and their types; determines how each field is routed
LexicalStore | Inverted index for keyword search (BM25 scoring)
VectorStore | Vector index for similarity search (Flat, HNSW, or IVF)
DocumentLog | Write-ahead log (WAL) for durability + raw document storage

All three stores share a single Storage backend, isolated by key prefixes (lexical/, vector/, documents/).
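The prefix isolation can be sketched with a thin wrapper over a plain key-value map. `KvBackend` and the field layout here are illustrative stand-ins, not Laurus's actual `Storage` trait:

```rust
use std::collections::HashMap;

// A minimal key-value "backend" standing in for the Storage trait.
// Illustrative sketch only — not Laurus's actual API.
#[derive(Default)]
struct KvBackend {
    data: HashMap<String, Vec<u8>>,
}

// Namespaces every key under a fixed prefix, so multiple components
// can share one store without key collisions.
struct PrefixedStorage<'a> {
    prefix: &'a str,
    backend: &'a mut KvBackend,
}

impl<'a> PrefixedStorage<'a> {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.backend.data.insert(format!("{}{}", self.prefix, key), value);
    }
    fn get(&self, key: &str) -> Option<&Vec<u8>> {
        self.backend.data.get(&format!("{}{}", self.prefix, key))
    }
}

fn main() {
    let mut backend = KvBackend::default();
    {
        let mut lexical = PrefixedStorage { prefix: "lexical/", backend: &mut backend };
        lexical.put("postings", vec![1, 2, 3]);
    }
    {
        let mut vector = PrefixedStorage { prefix: "vector/", backend: &mut backend };
        vector.put("postings", vec![9]); // same logical key, different namespace
    }
    // Both values coexist because the full keys differ.
    assert!(backend.data.contains_key("lexical/postings"));
    assert!(backend.data.contains_key("vector/postings"));
}
```

Because the prefix is applied on every read and write, each component sees what looks like a private store.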

Engine Lifecycle

Building an Engine

The EngineBuilder assembles the engine from its parts:

#![allow(unused)]
fn main() {
let engine = Engine::builder(storage, schema)
    .analyzer(analyzer)      // optional: for text fields
    .embedder(embedder)      // optional: for vector fields
    .build()
    .await?;
}
sequenceDiagram
    participant User
    participant EngineBuilder
    participant Engine

    User->>EngineBuilder: new(storage, schema)
    User->>EngineBuilder: .analyzer(analyzer)
    User->>EngineBuilder: .embedder(embedder)
    User->>EngineBuilder: .build().await
    EngineBuilder->>EngineBuilder: split_schema()
    Note over EngineBuilder: Separate fields into\nLexicalIndexConfig\n+ VectorIndexConfig
    EngineBuilder->>Engine: Create LexicalStore
    EngineBuilder->>Engine: Create VectorStore
    EngineBuilder->>Engine: Create DocumentLog
    EngineBuilder->>Engine: Recover from WAL
    EngineBuilder-->>User: Engine ready

During build(), the engine:

  1. Splits the schema — lexical fields go to LexicalIndexConfig, vector fields go to VectorIndexConfig
  2. Creates prefixed storage — each component gets an isolated namespace (lexical/, vector/, documents/)
  3. Initializes stores — LexicalStore and VectorStore are constructed with their configs
  4. Recovers from WAL — replays any uncommitted operations from a previous session

Schema Splitting

The Schema contains both lexical and vector fields. At build time, split_schema() separates them:

graph LR
    S["Schema\ntitle: Text\nbody: Text\ncategory: Text\npage: Integer\ncontent_vec: HNSW"]

    S --> LC["LexicalIndexConfig\ntitle: TextOption\nbody: TextOption\ncategory: TextOption\npage: IntegerOption\n_id: KeywordAnalyzer"]

    S --> VC["VectorIndexConfig\ncontent_vec: HnswOption\n(dim=384, m=16, ef=200)"]

Key details:

  • The reserved _id field is always added to the lexical config with KeywordAnalyzer (exact match)
  • A PerFieldAnalyzer wraps per-field analyzer settings; if you pass a simple StandardAnalyzer, it becomes the default for all text fields
  • A PerFieldEmbedder works the same way for vector fields
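A minimal sketch of the splitting step, with a hypothetical `FieldKind` enum standing in for Laurus's field options:

```rust
// Illustrative schema splitting: partition declared fields into lexical
// and vector groups. FieldKind and split_schema are hypothetical names,
// not Laurus's actual types.
#[derive(Debug, PartialEq)]
enum FieldKind {
    Text,
    Integer,
    Hnsw { dimension: usize },
}

fn split_schema(
    fields: Vec<(String, FieldKind)>,
) -> (Vec<(String, FieldKind)>, Vec<(String, FieldKind)>) {
    let mut lexical = Vec::new();
    let mut vector = Vec::new();
    for (name, kind) in fields {
        match kind {
            FieldKind::Text | FieldKind::Integer => lexical.push((name, kind)),
            FieldKind::Hnsw { .. } => vector.push((name, kind)),
        }
    }
    // The reserved _id field always joins the lexical side (exact match).
    lexical.push(("_id".to_string(), FieldKind::Text));
    (lexical, vector)
}

fn main() {
    let fields = vec![
        ("title".to_string(), FieldKind::Text),
        ("page".to_string(), FieldKind::Integer),
        ("content_vec".to_string(), FieldKind::Hnsw { dimension: 384 }),
    ];
    let (lexical, vector) = split_schema(fields);
    assert_eq!(lexical.len(), 3); // title, page, _id
    assert_eq!(vector.len(), 1);  // content_vec
}
```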

Indexing Data Flow

When you call engine.add_document(id, doc):

sequenceDiagram
    participant User
    participant Engine
    participant WAL as DocumentLog (WAL)
    participant Lexical as LexicalStore
    participant Vector as VectorStore

    User->>Engine: add_document("doc-1", doc)
    Engine->>WAL: Append to WAL
    Engine->>Engine: Assign internal ID (u64)

    loop For each field in document
        alt Lexical field (text, integer, etc.)
            Engine->>Lexical: Analyze + index field
        else Vector field
            Engine->>Vector: Embed + index field
        end
    end

    Note over Engine: Document is buffered\nbut NOT yet searchable

    User->>Engine: commit()
    Engine->>Lexical: Flush segments to storage
    Engine->>Vector: Flush segments to storage
    Engine->>WAL: Truncate WAL
    Note over Engine: Documents are\nnow searchable

Key points:

  • WAL-first: every write is logged before modifying in-memory structures
  • Dual indexing: each field is routed to either the lexical or vector store based on the schema
  • Commit required: documents become searchable only after commit()
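The WAL-first, commit-to-publish behavior can be modeled in a few lines. `MiniEngine` below is a toy model of these semantics, not the real `Engine`:

```rust
// WAL-first write sketch: every operation is appended to a log before
// becoming visible; commit publishes the buffered operations.
// Illustrative only — Laurus's real DocumentLog persists to Storage.
#[derive(Debug)]
enum Op {
    Add { id: String, body: String },
}

#[derive(Default)]
struct MiniEngine {
    wal: Vec<Op>,                     // uncommitted operations
    committed: Vec<(String, String)>, // "searchable" documents
}

impl MiniEngine {
    fn add_document(&mut self, id: &str, body: &str) {
        // Log first (durability); the document is buffered, not searchable.
        self.wal.push(Op::Add { id: id.to_string(), body: body.to_string() });
    }
    fn commit(&mut self) {
        // Publish buffered operations, then truncate the WAL.
        for op in std::mem::take(&mut self.wal) {
            let Op::Add { id, body } = op;
            self.committed.push((id, body));
        }
    }
    fn searchable(&self, id: &str) -> bool {
        self.committed.iter().any(|(d, _)| d == id)
    }
}

fn main() {
    let mut e = MiniEngine::default();
    e.add_document("doc-1", "hello");
    assert!(!e.searchable("doc-1")); // buffered but not yet searchable
    e.commit();
    assert!(e.searchable("doc-1")); // visible after commit()
}
```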

Search Data Flow

When you call engine.search(request):

sequenceDiagram
    participant User
    participant Engine
    participant Lexical as LexicalStore
    participant Vector as VectorStore
    participant Fusion

    User->>Engine: search(request)

    opt Filter query present
        Engine->>Lexical: Execute filter query
        Lexical-->>Engine: Allowed document IDs
    end

    par Lexical search
        Engine->>Lexical: Execute lexical query
        Lexical-->>Engine: Ranked hits (BM25)
    and Vector search
        Engine->>Vector: Execute vector query
        Vector-->>Engine: Ranked hits (similarity)
    end

    alt Both lexical and vector results
        Engine->>Fusion: Fuse results (RRF or WeightedSum)
        Fusion-->>Engine: Merged ranked list
    end

    Engine->>Engine: Apply offset + limit
    Engine-->>User: Vec of SearchResult

The search pipeline has three stages:

  1. Filter (optional) — execute a filter query on the lexical index to get a set of allowed document IDs
  2. Search — run lexical and/or vector queries in parallel
  3. Fusion — if both query types are present, merge results using RRF (default, k=60) or WeightedSum
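RRF itself is easy to state: each document earns 1/(k + rank) from every result list it appears in (ranks starting at 1), and the lists are merged by summed score. A generic sketch of the algorithm, not Laurus's internal implementation:

```rust
use std::collections::HashMap;

// Generic Reciprocal Rank Fusion: each document scores 1/(k + rank) per
// ranked list it appears in; a higher fused score means a better rank.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // ranks start at 1
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let lexical = vec!["a", "b", "c"];
    let vector = vec!["b", "a", "d"];
    let fused = rrf_fuse(&[lexical, vector], 60.0);
    // "a" and "b" appear in both lists, so they outrank "c" and "d".
    assert!(fused[0].0 == "a" || fused[0].0 == "b");
    println!("{:?}", fused);
}
```

With k = 60 (the default mentioned above), rank differences within a list matter less than appearing in both lists, which is why RRF is a robust default for hybrid search.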

Storage Architecture

All components share a single Storage trait implementation, but use key prefixes to isolate their data:

graph TB
    Engine --> PS1["PrefixedStorage\nprefix: 'lexical/'"]
    Engine --> PS2["PrefixedStorage\nprefix: 'vector/'"]
    Engine --> PS3["PrefixedStorage\nprefix: 'documents/'"]

    PS1 --> S["Storage Backend\n(Memory / File / File+Mmap)"]
    PS2 --> S
    PS3 --> S

Backend | Description | Best For
MemoryStorage | All data in memory | Testing, small datasets, ephemeral use
FileStorage | Standard file I/O | General production use
FileStorage (mmap) | Memory-mapped files (use_mmap = true) | Large datasets, read-heavy workloads

Per-Field Dispatch

When a PerFieldAnalyzer is provided, the engine dispatches analysis to field-specific analyzers. The same pattern applies to PerFieldEmbedder.

graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KA["KeywordAnalyzer"]
    PFA -->|"body"| SA["StandardAnalyzer"]
    PFA -->|"description"| JA["JapaneseAnalyzer"]
    PFA -->|"_id"| KA2["KeywordAnalyzer\n(always)"]
    PFA -->|other fields| DEF["Default Analyzer\n(StandardAnalyzer)"]

This allows different fields to use different analysis strategies within the same engine.
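The dispatch logic amounts to a map lookup with a fallback. Here analyzers are modeled as plain functions for brevity; the real `PerFieldAnalyzer` wraps full `Analyzer` implementations:

```rust
use std::collections::HashMap;

// Per-field dispatch sketch: look up a field-specific analyzer, falling
// back to a default. Illustrative model, not Laurus's actual types.
type Analyze = fn(&str) -> Vec<String>;

fn keyword(text: &str) -> Vec<String> {
    vec![text.to_string()] // exact match: the whole value is one token
}

fn standard(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

struct PerField {
    overrides: HashMap<&'static str, Analyze>,
    default: Analyze,
}

impl PerField {
    fn analyze(&self, field: &str, text: &str) -> Vec<String> {
        let f = self.overrides.get(field).copied().unwrap_or(self.default);
        f(text)
    }
}

fn main() {
    let mut overrides: HashMap<&'static str, Analyze> = HashMap::new();
    overrides.insert("_id", keyword); // _id is always exact match
    let per_field = PerField { overrides, default: standard };

    assert_eq!(per_field.analyze("_id", "Doc 1"), vec!["Doc 1"]);
    assert_eq!(per_field.analyze("body", "Hello World"), vec!["hello", "world"]);
}
```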

Summary

Aspect | Detail
Core struct | Engine — coordinates all operations
Builder | EngineBuilder — assembles Engine from Storage + Schema + Analyzer + Embedder
Schema split | Lexical fields → LexicalIndexConfig, vector fields → VectorIndexConfig
Write path | WAL → in-memory buffers → commit() → persistent storage
Read path | Query → parallel lexical/vector search → fusion → ranked results
Storage isolation | PrefixedStorage with lexical/, vector/, documents/ prefixes
Per-field dispatch | PerFieldAnalyzer and PerFieldEmbedder route to field-specific implementations

Next Steps

Getting Started

Welcome to Laurus! This section will help you install the library and run your first search.

What You Will Build

By the end of this guide, you will have a working search engine that can:

  • Index text documents
  • Perform keyword (lexical) search
  • Perform semantic (vector) search
  • Combine both with hybrid search

Prerequisites

  • Rust 1.85 or later (edition 2024)
  • Cargo (included with Rust)
  • Tokio runtime (Laurus uses async APIs)

Steps

  1. Installation — Add Laurus to your project and choose feature flags
  2. Quick Start — Build a complete search engine in 5 steps

Workflow Overview

Building a search application with Laurus follows a consistent pattern:

graph LR
    A["1. Create\nStorage"] --> B["2. Define\nSchema"]
    B --> C["3. Build\nEngine"]
    C --> D["4. Index\nDocuments"]
    D --> E["5. Search"]

Step | What Happens
Create Storage | Choose where data lives — in memory, on disk, or memory-mapped
Define Schema | Declare fields and their types (text, integer, vector, etc.)
Build Engine | Attach an analyzer (for text) and an embedder (for vectors)
Index Documents | Add documents; the engine routes fields to the correct index
Search | Run lexical, vector, or hybrid queries and get ranked results

Installation

Add Laurus to Your Project

Add laurus and tokio (async runtime) to your Cargo.toml:

[dependencies]
laurus = "0.1.0"
tokio = { version = "1", features = ["full"] }

Feature Flags

Laurus ships with a minimal default feature set. Enable additional features as needed:

Feature | Description | Use Case
(default) | Core library (lexical search, storage, analyzers — no embedding) | Keyword search only
embeddings-candle | Local BERT embeddings via Hugging Face Candle | Vector search without external API
embeddings-openai | OpenAI API embeddings (text-embedding-3-small, etc.) | Cloud-based vector search
embeddings-multimodal | CLIP embeddings for text + image via Candle | Multimodal (text-to-image) search
embeddings-all | All embedding features above | Full embedding support

Examples

Lexical search only (no embeddings needed):

[dependencies]
laurus = "0.1.0"

Vector search with local model (no API key required):

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-candle"] }

Vector search with OpenAI:

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-openai"] }

Everything:

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-all"] }

Verify Installation

Create a minimal program to verify that Laurus compiles:

use laurus::Result;

#[tokio::main]
async fn main() -> Result<()> {
    println!("Laurus version: {}", laurus::VERSION);
    Ok(())
}

cargo run

If you see the version printed, you are ready to proceed to the Quick Start.

Quick Start

This tutorial walks you through building a complete search engine in 5 steps. By the end, you will be able to index documents and search them by keyword.

Step 1 — Create Storage

Storage determines where Laurus persists index data. For development and testing, use MemoryStorage:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::storage::memory::MemoryStorage;
use laurus::Storage;

let storage: Arc<dyn Storage> = Arc::new(
    MemoryStorage::new(Default::default())
);
}

Tip: For production, consider FileStorage (with optional use_mmap for memory-mapped I/O). See Storage for details.

Step 2 — Define a Schema

A Schema declares the fields in your documents and how each field should be indexed:

#![allow(unused)]
fn main() {
use laurus::Schema;
use laurus::lexical::TextOption;

let schema = Schema::builder()
    .add_text_field("title", TextOption::default())
    .add_text_field("body", TextOption::default())
    .add_default_field("body")  // used when no field is specified in a query
    .build();
}

Each field has a type. Common types include:

Method | Field Type | Example Values
add_text_field | Text (full-text searchable) | "Hello world"
add_integer_field | 64-bit integer | 42
add_float_field | 64-bit float | 3.14
add_boolean_field | Boolean | true / false
add_datetime_field | UTC datetime | 2024-01-15T10:30:00Z
add_hnsw_field | Vector (HNSW index) | [0.1, 0.2, ...]
add_flat_field | Vector (Flat index) | [0.1, 0.2, ...]

See Schema & Fields for the full list.

Step 3 — Build an Engine

The Engine ties storage, schema, and runtime components together:

#![allow(unused)]
fn main() {
use laurus::Engine;

let engine = Engine::builder(storage, schema)
    .build()
    .await?;
}

When you only use text fields, the default StandardAnalyzer is used automatically. To customize analysis or add vector embeddings, see Architecture.

Step 4 — Index Documents

Create documents with the DocumentBuilder and add them to the engine:

#![allow(unused)]
fn main() {
use laurus::Document;

// Each document needs a unique external ID (string)
let doc = Document::builder()
    .add_text("title", "Introduction to Rust")
    .add_text("body", "Rust is a systems programming language focused on safety and performance.")
    .build();
engine.add_document("doc-1", doc).await?;

let doc = Document::builder()
    .add_text("title", "Python for Data Science")
    .add_text("body", "Python is widely used in machine learning and data analysis.")
    .build();
engine.add_document("doc-2", doc).await?;

let doc = Document::builder()
    .add_text("title", "Web Development with JavaScript")
    .add_text("body", "JavaScript powers interactive web applications and server-side code with Node.js.")
    .build();
engine.add_document("doc-3", doc).await?;

// Commit to make documents searchable
engine.commit().await?;
}

Important: Documents are not searchable until commit() is called.

Step 5 — Search

Use SearchRequestBuilder with a query to search the index:

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::lexical::TermQuery;
use laurus::lexical::search::searcher::LexicalSearchQuery;

// Search for "rust" in the "body" field
let request = SearchRequestBuilder::new()
    .lexical_query(
        LexicalSearchQuery::Obj(
            Box::new(TermQuery::new("body", "rust"))
        )
    )
    .limit(10)
    .build();

let results = engine.search(request).await?;

for result in &results {
    println!("ID: {}, Score: {:.4}", result.id, result.score);
    if let Some(doc) = &result.document {
        if let Some(title) = doc.get("title") {
            println!("  Title: {:?}", title);
        }
    }
}
}

Complete Example

Here is the full program that you can copy, paste, and run:

use std::sync::Arc;
use laurus::{
    Document, Engine,
    Result, Schema, SearchRequestBuilder,
};
use laurus::lexical::{TextOption, TermQuery};
use laurus::lexical::search::searcher::LexicalSearchQuery;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Storage
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    // 2. Schema
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_default_field("body")
        .build();

    // 3. Engine
    let engine = Engine::builder(storage, schema).build().await?;

    // 4. Index documents
    for (id, title, body) in [
        ("doc-1", "Introduction to Rust", "Rust is a systems programming language focused on safety."),
        ("doc-2", "Python for Data Science", "Python is widely used in machine learning."),
        ("doc-3", "Web Development", "JavaScript powers interactive web applications."),
    ] {
        let doc = Document::builder()
            .add_text("title", title)
            .add_text("body", body)
            .build();
        engine.add_document(id, doc).await?;
    }
    engine.commit().await?;

    // 5. Search
    let request = SearchRequestBuilder::new()
        .lexical_query(
            LexicalSearchQuery::Obj(
                Box::new(TermQuery::new("body", "rust"))
            )
        )
        .limit(10)
        .build();

    let results = engine.search(request).await?;
    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }

    Ok(())
}

Next Steps

Examples

The laurus/examples/ directory contains runnable examples demonstrating different features of the library.

Running Examples

# Run an example without feature flags
cargo run --example <name>

# Run an example with a feature flag
cargo run --example <name> --features <flag>

Available Examples

quickstart

A minimal example showing the basic workflow: create storage, define a schema, build an engine, index documents, and search.

cargo run --example quickstart

Demonstrates: In-memory storage, TextOption, TermQuery, LexicalSearchRequest.

lexical_search

Comprehensive example of all lexical query types, using both the Builder API and the QueryParser DSL.

cargo run --example lexical_search

Demonstrates: TermQuery, PhraseQuery, FuzzyQuery, WildcardQuery, NumericRangeQuery, GeoQuery, BooleanQuery, SpanQuery.

vector_search

Vector search with a mock embedder, including filtered vector search and DSL syntax.

cargo run --example vector_search

Demonstrates: PerFieldEmbedder, VectorSearchRequestBuilder, filtered search, DSL syntax (field:"query").

hybrid_search

Combining lexical and vector search with different fusion algorithms.

cargo run --example hybrid_search

Demonstrates: Lexical-only, vector-only, and hybrid search. Both RRF and WeightedSum fusion algorithms. Builder API and DSL.

search_with_candle

Vector search using real BERT embeddings via Hugging Face Candle. The model is downloaded automatically on first run (~80 MB).

cargo run --example search_with_candle --features embeddings-candle

Requires: embeddings-candle feature flag.

Demonstrates: CandleBertEmbedder with sentence-transformers/all-MiniLM-L6-v2 (384 dimensions).

search_with_openai

Vector search using the OpenAI Embeddings API.

export OPENAI_API_KEY=your-api-key
cargo run --example search_with_openai --features embeddings-openai

Requires: embeddings-openai feature flag, OPENAI_API_KEY environment variable.

Demonstrates: OpenAIEmbedder with text-embedding-3-small (1536 dimensions).

multimodal_search

Multimodal (text + image) search using a CLIP model.

cargo run --example multimodal_search --features embeddings-multimodal

Requires: embeddings-multimodal feature flag.

Demonstrates: CandleClipEmbedder, indexing images from the filesystem, text-to-image and image-to-image queries.

synonym_graph_filter

Demonstrates the SynonymGraphFilter for token expansion during analysis.

cargo run --example synonym_graph_filter

Demonstrates: Synonym dictionary creation, synonym-based token expansion, boost application, token position and position_length attributes.

Helper Module: common.rs

The common.rs file provides shared utilities used by the examples:

  • memory_storage() – Create an in-memory storage instance
  • per_field_analyzer() – Create a PerFieldAnalyzer with KeywordAnalyzer for specific fields
  • MockEmbedder – A mock Embedder implementation for testing vector search without a real model

Schema & Fields

The Schema defines the structure of your documents — what fields exist and how each field is indexed. It is the single source of truth for the Engine.

For the TOML file format used by the CLI, see Schema Format Reference.

Schema

A Schema is a collection of named fields. Each field is either a lexical field (for keyword search) or a vector field (for similarity search).

#![allow(unused)]
fn main() {
use laurus::Schema;
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::vector::HnswOption;

let schema = Schema::builder()
    .add_text_field("title", TextOption::default())
    .add_text_field("body", TextOption::default())
    .add_integer_field("year", IntegerOption::default())
    .add_hnsw_field("embedding", HnswOption::default())
    .add_default_field("body")
    .build();
}

Default Fields

add_default_field() specifies which field(s) are searched when a query does not explicitly name a field. This is used by the Query DSL parser.

Field Types

graph TB
    FO["FieldOption"]

    FO --> T["Text"]
    FO --> I["Integer"]
    FO --> FL["Float"]
    FO --> B["Boolean"]
    FO --> DT["DateTime"]
    FO --> G["Geo"]
    FO --> BY["Bytes"]

    FO --> FLAT["Flat"]
    FO --> HNSW["HNSW"]
    FO --> IVF["IVF"]

Lexical Fields

Lexical fields are indexed using an inverted index and support keyword-based queries.

Type | Rust Type | SchemaBuilder Method | Description
Text | TextOption | add_text_field() | Full-text searchable; tokenized by the analyzer
Integer | IntegerOption | add_integer_field() | 64-bit signed integer; supports range queries
Float | FloatOption | add_float_field() | 64-bit floating point; supports range queries
Boolean | BooleanOption | add_boolean_field() | true / false
DateTime | DateTimeOption | add_datetime_field() | UTC timestamp; supports range queries
Geo | GeoOption | add_geo_field() | Latitude/longitude pair; supports radius and bounding box queries
Bytes | BytesOption | add_bytes_field() | Raw binary data
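Conceptually, the inverted index behind these fields maps each term to the set of documents containing it. A toy sketch of that structure (not Laurus's on-disk layout):

```rust
use std::collections::{HashMap, HashSet};

// Toy inverted index: term -> set of internal document IDs.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, HashSet<u64>>,
}

impl InvertedIndex {
    fn index(&mut self, doc_id: u64, text: &str) {
        // Naive analysis: split on whitespace and lowercase each term.
        for term in text.split_whitespace().map(str::to_lowercase) {
            self.postings.entry(term).or_default().insert(doc_id);
        }
    }

    // A term query returns every document whose postings list has the term.
    fn term_query(&self, term: &str) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .postings
            .get(&term.to_lowercase())
            .map(|s| s.iter().copied().collect())
            .unwrap_or_default();
        ids.sort();
        ids
    }
}

fn main() {
    let mut idx = InvertedIndex::default();
    idx.index(1, "Rust is fast");
    idx.index(2, "Python is popular");
    assert_eq!(idx.term_query("rust"), vec![1]);
    assert_eq!(idx.term_query("is"), vec![1, 2]);
}
```

Real inverted indexes also store term frequencies and positions (the raw material for BM25 scoring and phrase queries), but the term-to-documents mapping is the core idea.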

Text Field Options

TextOption controls how text is indexed:

#![allow(unused)]
fn main() {
use laurus::lexical::TextOption;

// Default: indexed + stored + term vectors (all true)
let opt = TextOption::default();

// Customize
let opt = TextOption::default()
    .indexed(true)
    .stored(true)
    .term_vectors(true);
}
Option | Default | Description
indexed | true | Whether the field is searchable
stored | true | Whether the original value is stored for retrieval
term_vectors | true | Whether term positions are stored (needed for phrase queries and highlighting)

Vector Fields

Vector fields are indexed using vector indexes for approximate nearest neighbor (ANN) search.

Type | Rust Type | SchemaBuilder Method | Description
Flat | FlatOption | add_flat_field() | Brute-force linear scan; exact results
HNSW | HnswOption | add_hnsw_field() | Hierarchical Navigable Small World graph; fast approximate search
IVF | IvfOption | add_ivf_field() | Inverted File Index; cluster-based approximate search

HNSW Field Options (most common)

#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = HnswOption {
    dimension: 384,                          // vector dimensions
    distance: DistanceMetric::Cosine,        // distance metric
    m: 16,                                   // max connections per layer
    ef_construction: 200,                    // construction search width
    base_weight: 1.0,                        // default scoring weight
    quantizer: None,                         // optional quantization
};
}

See Vector Indexing for detailed parameter guidance.
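To make `DistanceMetric::Cosine` and the Flat/HNSW distinction concrete, here is what cosine scoring and a Flat-style brute-force scan compute. This is a generic sketch, independent of Laurus internals:

```rust
// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// A Flat index is exactly this: score every stored vector, keep the best.
// HNSW and IVF approximate the same result while scoring far fewer vectors.
fn flat_search(query: &[f32], stored: &[(u64, Vec<f32>)], top_k: usize) -> Vec<(u64, f32)> {
    let mut hits: Vec<(u64, f32)> = stored
        .iter()
        .map(|(id, v)| (*id, cosine(query, v)))
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    hits.truncate(top_k);
    hits
}

fn main() {
    let stored = vec![
        (1, vec![1.0, 0.0]),
        (2, vec![0.0, 1.0]),
        (3, vec![0.7, 0.7]),
    ];
    // The query points almost exactly along document 1's direction.
    let hits = flat_search(&[1.0, 0.1], &stored, 2);
    assert_eq!(hits[0].0, 1);
    println!("{:?}", hits);
}
```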

Document

A Document is a collection of named field values. Use DocumentBuilder to construct documents:

#![allow(unused)]
fn main() {
use laurus::Document;

let doc = Document::builder()
    .add_text("title", "Introduction to Rust")
    .add_text("body", "Rust is a systems programming language.")
    .add_integer("year", 2024)
    .add_float("rating", 4.8)
    .add_boolean("published", true)
    .build();
}

Indexing Documents

The Engine provides two methods for adding documents, each with different semantics:

Method | Behavior | Use Case
put_document(id, doc) | Upsert — if a document with the same ID exists, it is replaced | Standard document indexing
add_document(id, doc) | Append — adds the document as a new chunk; multiple chunks can share the same ID | Chunked/split documents (e.g., long articles split into paragraphs)

#![allow(unused)]
fn main() {
// Upsert: replaces any existing document with id "doc1"
engine.put_document("doc1", doc).await?;

// Append: adds another chunk under the same id "doc1"
engine.add_document("doc1", chunk2).await?;

// Always commit after indexing
engine.commit().await?;
}
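The difference between the two methods can be modeled with a map from external ID to a list of chunks. `DocStore` below is illustrative only:

```rust
use std::collections::HashMap;

// Sketch of the two indexing semantics: put replaces all chunks under an
// external ID, add appends another chunk. Not Laurus's actual code.
#[derive(Default)]
struct DocStore {
    chunks: HashMap<String, Vec<String>>,
}

impl DocStore {
    fn put_document(&mut self, id: &str, body: &str) {
        // Upsert: any existing chunks for this ID are replaced.
        self.chunks.insert(id.to_string(), vec![body.to_string()]);
    }
    fn add_document(&mut self, id: &str, body: &str) {
        // Append: the new chunk joins existing chunks under the same ID.
        self.chunks.entry(id.to_string()).or_default().push(body.to_string());
    }
}

fn main() {
    let mut store = DocStore::default();
    store.add_document("doc1", "paragraph 1");
    store.add_document("doc1", "paragraph 2");
    assert_eq!(store.chunks["doc1"].len(), 2); // two chunks, one ID

    store.put_document("doc1", "rewritten");
    assert_eq!(store.chunks["doc1"], vec!["rewritten"]); // upsert replaced both
}
```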

Retrieving Documents

Use get_documents to retrieve all documents (including chunks) by external ID:

#![allow(unused)]
fn main() {
let docs = engine.get_documents("doc1").await?;
for doc in &docs {
    if let Some(title) = doc.get("title") {
        println!("Title: {:?}", title);
    }
}
}

Deleting Documents

Delete all documents and chunks sharing an external ID:

#![allow(unused)]
fn main() {
engine.delete_documents("doc1").await?;
engine.commit().await?;
}

Document Lifecycle

graph LR
    A["Build Document"] --> B["put/add_document()"]
    B --> C["WAL"]
    C --> D["commit()"]
    D --> E["Searchable"]
    E --> F["get_documents()"]
    E --> G["delete_documents()"]

Important: Documents are not searchable until commit() is called.

DocumentBuilder Methods

Method | Value Type | Description
add_text(name, value) | String | Add a text field
add_integer(name, value) | i64 | Add an integer field
add_float(name, value) | f64 | Add a float field
add_boolean(name, value) | bool | Add a boolean field
add_datetime(name, value) | DateTime<Utc> | Add a datetime field
add_vector(name, value) | Vec<f32> | Add a pre-computed vector field
add_geo(name, lat, lon) | (f64, f64) | Add a geographic point
add_bytes(name, data) | Vec<u8> | Add binary data
add_field(name, value) | DataValue | Add a value of any type

DataValue

DataValue is the unified value enum that represents any field value in Laurus:

#![allow(unused)]
fn main() {
pub enum DataValue {
    Null,
    Bool(bool),
    Int64(i64),
    Float64(f64),
    Text(String),
    Bytes(Vec<u8>, Option<String>),  // (data, optional MIME type)
    Vector(Vec<f32>),
    DateTime(DateTime<Utc>),
    Geo(f64, f64),          // (latitude, longitude)
}
}

DataValue implements From<T> for common types, so you can use .into() conversions:

#![allow(unused)]
fn main() {
use laurus::DataValue;

let v: DataValue = "hello".into();       // Text
let v: DataValue = 42i64.into();         // Int64
let v: DataValue = 3.14f64.into();       // Float64
let v: DataValue = true.into();          // Bool
let v: DataValue = vec![0.1f32, 0.2].into(); // Vector
}

Reserved Fields

The _id field is reserved by Laurus for internal use. It stores the external document ID and is always indexed with KeywordAnalyzer (exact match). You do not need to add it to your schema — it is managed automatically.

Dynamic Field Management

Fields can be added to or removed from a running engine. Type changes are not supported — remove the field and re-add it with the new type instead.

Adding a Field

Use Engine::add_field() to add a new field to the schema.

Adding a Lexical Field

let updated_schema = engine.add_field(
    "category",
    FieldOption::Text(TextOption::default()),
).await?;

Adding a Vector Field

let updated_schema = engine.add_field(
    "embedding",
    FieldOption::Flat(FlatOption::default().dimension(384)),
).await?;

Existing documents are unaffected — they simply have no value for the new field. The returned Schema should be persisted (e.g., to schema.toml) by the caller.

Removing a Field

Use Engine::delete_field() to remove a field from the schema.

let updated_schema = engine.delete_field("category").await?;

When a field is deleted:

  • The field definition is removed from the schema.
  • Existing indexed data for the field remains in the index but becomes inaccessible through queries.
  • If the field was listed in default_fields, it is automatically removed.
  • Any per-field analyzer or embedder registered for the field is unregistered.

Schema Design Tips

  1. Separate lexical and vector fields — a field is either lexical or vector, never both. For hybrid search, create separate fields (e.g., body for text, body_vec for vector).

  2. Use KeywordAnalyzer for exact-match fields — category, status, and tag fields should use KeywordAnalyzer via PerFieldAnalyzer to avoid tokenization.

  3. Choose the right vector index — use HNSW for most cases, Flat for small datasets, IVF for very large datasets. See Vector Indexing.

  4. Set default fields — if you use the Query DSL, set default fields so users can write hello instead of body:hello.

  5. Use the schema generator — run laurus create schema to interactively build a schema TOML file instead of writing it by hand. See CLI Commands.

Text Analysis

Text analysis is the process of converting raw text into searchable tokens. When a document is indexed, the analyzer breaks text fields into individual terms; when a query is executed, the same analyzer processes the query text to ensure consistency.

The Analysis Pipeline

graph LR
    Input["Raw Text\n'The quick brown FOX jumps!'"]
    CF["UnicodeNormalizationCharFilter"]
    T["Tokenizer\nSplit into words"]
    F1["LowercaseFilter"]
    F2["StopFilter"]
    F3["StemFilter"]
    Output["Terms\n'quick', 'brown', 'fox', 'jump'"]

    Input --> CF --> T --> F1 --> F2 --> F3 --> Output

The analysis pipeline consists of:

  1. Char Filters — normalize raw text at the character level before tokenization
  2. Tokenizer — splits text into raw tokens (words, characters, n-grams)
  3. Token Filters — transform, remove, or expand tokens (lowercase, stop words, stemming, synonyms)
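
The three stages can be mimicked with a short, self-contained sketch (plain Rust, no Laurus types — the real pipeline emits Token values carrying offsets and positions):

```rust
// Toy analysis pipeline: char filter -> tokenizer -> token filters.
// Illustrative only; names and stop-word list are made up for the example.
fn analyze(text: &str) -> Vec<String> {
    // 1. Char filter: strip hyphens at the character level
    let filtered: String = text.chars().filter(|c| *c != '-').collect();

    // 2. Tokenizer: split on non-alphanumeric boundaries
    let tokens = filtered
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty());

    // 3. Token filters: lowercase, then drop stop words
    let stop_words = ["the", "is", "a"];
    tokens
        .map(|t| t.to_lowercase())
        .filter(|t| !stop_words.contains(&t.as_str()))
        .collect()
}

fn main() {
    let terms = analyze("The quick-brown FOX jumps!");
    assert_eq!(terms, vec!["quickbrown", "fox", "jumps"]);
}
```

Note how the char filter runs before tokenization: removing the hyphen merges "quick-brown" into a single token.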

The Analyzer Trait

All analyzers implement the Analyzer trait:

#![allow(unused)]
fn main() {
pub trait Analyzer: Send + Sync + Debug {
    fn analyze(&self, text: &str) -> Result<TokenStream>;
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}
}

TokenStream is a Box<dyn Iterator<Item = Token> + Send> — a lazy iterator over tokens.

A Token contains:

Field | Type | Description
text | String | The token text
position | usize | Position in the original text
start_offset | usize | Start byte offset in original text
end_offset | usize | End byte offset in original text
position_increment | usize | Distance from previous token
position_length | usize | Span of the token (>1 for synonyms)
boost | f32 | Token-level scoring weight
stopped | bool | Whether marked as a stop word
metadata | Option<TokenMetadata> | Additional token metadata

Built-in Analyzers

StandardAnalyzer

The default analyzer. Suitable for most Western languages.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::standard::StandardAnalyzer;

let analyzer = StandardAnalyzer::default();
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering)
}

JapaneseAnalyzer

Uses morphological analysis for Japanese text segmentation.

Pipeline: UnicodeNormalizationCharFilter (NFKC) → JapaneseIterationMarkCharFilter → LinderaTokenizer → LowercaseFilter → StopFilter (Japanese stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::japanese::JapaneseAnalyzer;

let analyzer = JapaneseAnalyzer::new()?;
// "東京都に住んでいる" → ["東京", "都", "に", "住ん", "で", "いる"]
}

KeywordAnalyzer

Treats the entire input as a single token. No tokenization or normalization.

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;

let analyzer = KeywordAnalyzer::new();
// "Hello World" → ["Hello World"]
}

Use this for fields that should match exactly (categories, tags, status codes).

SimpleAnalyzer

Tokenizes text without any filtering. The original case and all tokens are preserved. Useful when you need complete control over the analysis pipeline or want to test a tokenizer in isolation.

Pipeline: User-specified Tokenizer only (no char filters, no token filters)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::simple::SimpleAnalyzer;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use std::sync::Arc;

let tokenizer = Arc::new(RegexTokenizer::new()?);
let analyzer = SimpleAnalyzer::new(tokenizer);
// "Hello World" → ["Hello", "World"]
// (no lowercasing, no stop word removal)
}

Use this for testing tokenizers, or when you want to apply token filters manually in a separate step.

EnglishAnalyzer

An English-specific analyzer. Tokenizes, lowercases, and removes common English stop words.

Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::language::english::EnglishAnalyzer;

let analyzer = EnglishAnalyzer::new()?;
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering, remaining tokens are lowercased)
}

PipelineAnalyzer

Build a custom pipeline by combining any char filters, a tokenizer, and any sequence of token filters:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;
use laurus::analysis::token_filter::stop::StopFilter;
use laurus::analysis::token_filter::stem::StemFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC)))
    .add_filter(Arc::new(LowercaseFilter::new()))
    .add_filter(Arc::new(StopFilter::new()))
    .add_filter(Arc::new(StemFilter::new()));  // Porter stemmer
}

PerFieldAnalyzer

PerFieldAnalyzer lets you assign different analyzers to different fields within the same engine:

graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KW["KeywordAnalyzer"]
    PFA -->|"body"| STD["StandardAnalyzer"]
    PFA -->|"description_ja"| JP["JapaneseAnalyzer"]
    PFA -->|other fields| DEF["Default\n(StandardAnalyzer)"]

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;

// Default analyzer for fields not explicitly configured
let per_field = PerFieldAnalyzer::new(
    Arc::new(StandardAnalyzer::default())
);

// Use KeywordAnalyzer for exact-match fields
per_field.add_analyzer("category", Arc::new(KeywordAnalyzer::new()));
per_field.add_analyzer("status", Arc::new(KeywordAnalyzer::new()));

let engine = Engine::builder(storage, schema)
    .analyzer(Arc::new(per_field))
    .build()
    .await?;
}

Note: The _id field is always analyzed with KeywordAnalyzer regardless of configuration.

Char Filters

Char filters operate on the raw input text before it reaches the tokenizer. They perform character-level normalization such as Unicode normalization, character mapping, and pattern-based replacement. This ensures that the tokenizer receives clean, normalized text.

All char filters implement the CharFilter trait:

#![allow(unused)]
fn main() {
pub trait CharFilter: Send + Sync {
    fn filter(&self, input: &str) -> (String, Vec<Transformation>);
    fn name(&self) -> &'static str;
}
}

The Transformation records describe how character positions shifted, allowing the engine to map token positions back to the original text.

Char Filter | Description
UnicodeNormalizationCharFilter | Unicode normalization (NFC, NFD, NFKC, NFKD)
MappingCharFilter | Replaces character sequences based on a mapping dictionary
PatternReplaceCharFilter | Replaces characters matching a regex pattern
JapaneseIterationMarkCharFilter | Expands Japanese iteration marks (踊り字) to their base characters

UnicodeNormalizationCharFilter

Applies Unicode normalization to the input text. NFKC is recommended for search use cases because it normalizes both compatibility characters and composed forms.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};

let filter = UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC);
// "Ｓｏｎｙ" (fullwidth) → "Sony" (halfwidth)
// "㌂" → "アンペア"
}

Form | Description
NFC | Canonical decomposition followed by canonical composition
NFD | Canonical decomposition
NFKC | Compatibility decomposition followed by canonical composition
NFKD | Compatibility decomposition

MappingCharFilter

Replaces character sequences using a dictionary. Matches are found using the Aho-Corasick algorithm (leftmost-longest match).

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use laurus::analysis::char_filter::mapping::MappingCharFilter;

let mut mapping = HashMap::new();
mapping.insert("ph".to_string(), "f".to_string());
mapping.insert("qu".to_string(), "k".to_string());

let filter = MappingCharFilter::new(mapping)?;
// "phone queue" → "fone keue"
}

PatternReplaceCharFilter

Replaces all occurrences of a regex pattern with a fixed string.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;

// Remove hyphens
let filter = PatternReplaceCharFilter::new(r"-", "")?;
// "123-456-789" → "123456789"

// Normalize numbers
let filter = PatternReplaceCharFilter::new(r"\d+", "NUM")?;
// "Year 2024" → "Year NUM"
}

JapaneseIterationMarkCharFilter

Expands Japanese iteration marks (踊り字) to their base characters. Supports kanji (々), hiragana (ゝ, ゞ), and katakana (ヽ, ヾ) iteration marks.

#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::japanese_iteration_mark::JapaneseIterationMarkCharFilter;

let filter = JapaneseIterationMarkCharFilter::new(
    true,  // normalize kanji iteration marks
    true,  // normalize kana iteration marks
);
// "佐々木" → "佐佐木"
// "いすゞ" → "いすず"
}

Using Char Filters in a Pipeline

Add char filters to a PipelineAnalyzer with add_char_filter(). Multiple char filters are applied in the order they are added, all before the tokenizer runs.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
    NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;

let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
    .add_char_filter(Arc::new(
        UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC),
    ))
    .add_char_filter(Arc::new(
        PatternReplaceCharFilter::new(r"-", "")?,
    ))
    .add_filter(Arc::new(LowercaseFilter::new()));
// "Tokyo-2024" → NFKC → "Tokyo-2024" → remove hyphens → "Tokyo2024" → tokenize → lowercase → ["tokyo2024"]
}

Tokenizers

Tokenizer | Description
RegexTokenizer | Unicode word boundaries; splits on whitespace and punctuation
UnicodeWordTokenizer | Splits on Unicode word boundaries
WhitespaceTokenizer | Splits on whitespace only
WholeTokenizer | Returns the entire input as a single token
LinderaTokenizer | Japanese morphological analysis (Lindera/MeCab)
NgramTokenizer | Generates n-gram tokens of configurable size

Token Filters

Filter | Description
LowercaseFilter | Converts tokens to lowercase
StopFilter | Removes common words (“the”, “is”, “a”)
StemFilter | Reduces words to their root form (“running” → “run”)
SynonymGraphFilter | Expands tokens with synonyms from a dictionary
BoostFilter | Adjusts token boost values
LimitFilter | Limits the number of tokens
StripFilter | Strips leading/trailing whitespace from tokens
FlattenGraphFilter | Flattens token graphs (for synonym expansion)
RemoveEmptyFilter | Removes empty tokens

Synonym Expansion

The SynonymGraphFilter expands terms using a synonym dictionary:

#![allow(unused)]
fn main() {
use laurus::analysis::synonym::dictionary::SynonymDictionary;
use laurus::analysis::token_filter::synonym_graph::SynonymGraphFilter;

let mut dict = SynonymDictionary::new(None)?;
dict.add_synonym_group(vec!["ml".into(), "machine learning".into()]);
dict.add_synonym_group(vec!["ai".into(), "artificial intelligence".into()]);

// keep_original=true means original token is preserved alongside synonyms
let filter = SynonymGraphFilter::new(dict, true)
    .with_boost(0.8);  // synonyms get 80% weight
}

The boost parameter controls how much weight synonyms receive relative to original tokens. A value of 0.8 means synonym matches contribute 80% as much to the score as exact matches.

Embeddings

Embeddings convert text (or images) into dense numeric vectors that capture semantic meaning. Two texts with similar meanings produce vectors that are close together in vector space, enabling similarity-based search.
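
"Close together" is typically measured with cosine similarity; a dependency-free sketch of the idea (not a Laurus API):

```rust
// Cosine similarity between two dense vectors: dot(a, b) / (|a| * |b|).
// Values near 1.0 indicate semantically similar embeddings.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    let b = [1.0, 0.0, 1.0];
    let c = [0.0, 1.0, 0.0];
    assert!((cosine_similarity(&a, &b) - 1.0).abs() < 1e-6); // identical vectors
    assert!(cosine_similarity(&a, &c).abs() < 1e-6);         // orthogonal vectors
}
```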

The Embedder Trait

All embedders implement the Embedder trait:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait Embedder: Send + Sync + Debug {
    async fn embed(&self, input: &EmbedInput<'_>) -> Result<Vector>;
    async fn embed_batch(&self, inputs: &[EmbedInput<'_>]) -> Result<Vec<Vector>>;
    fn supported_input_types(&self) -> Vec<EmbedInputType>;
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}
}

The embed() method returns a Vector (a struct wrapping Vec<f32>).

EmbedInput supports two modalities:

Variant | Description
EmbedInput::Text(&str) | Text input
EmbedInput::Bytes(&[u8], Option<&str>) | Binary input with optional MIME type (for images)

Built-in Embedders

CandleBertEmbedder

Runs a BERT model locally using Hugging Face Candle. No API key required.

Feature flag: embeddings-candle

#![allow(unused)]
fn main() {
use laurus::CandleBertEmbedder;

// Downloads model on first run (~80MB)
let embedder = CandleBertEmbedder::new(
    "sentence-transformers/all-MiniLM-L6-v2"  // model name
)?;
// Output: 384-dimensional vector
}

Property | Value
Model | sentence-transformers/all-MiniLM-L6-v2
Dimensions | 384
Runtime | Local (CPU)
First-run download | ~80 MB

OpenAIEmbedder

Calls the OpenAI Embeddings API. Requires an API key.

Feature flag: embeddings-openai

#![allow(unused)]
fn main() {
use laurus::OpenAIEmbedder;

let embedder = OpenAIEmbedder::new(
    api_key,
    "text-embedding-3-small".to_string()
).await?;
// Output: 1536-dimensional vector
}

Property | Value
Model | text-embedding-3-small (or any OpenAI model)
Dimensions | 1536 (for text-embedding-3-small)
Runtime | Remote API call
Requires | OPENAI_API_KEY environment variable

CandleClipEmbedder

Runs a CLIP model locally for multimodal (text + image) embeddings.

Feature flag: embeddings-multimodal

#![allow(unused)]
fn main() {
use laurus::CandleClipEmbedder;

let embedder = CandleClipEmbedder::new(
    "openai/clip-vit-base-patch32"
)?;
// Text or images → 512-dimensional vector
}

Property | Value
Model | openai/clip-vit-base-patch32
Dimensions | 512
Input types | Text AND images
Use case | Text-to-image search, image-to-image search

PrecomputedEmbedder

Use pre-computed vectors directly without any embedding computation. Useful when vectors are generated externally.

#![allow(unused)]
fn main() {
use laurus::PrecomputedEmbedder;

let embedder = PrecomputedEmbedder::new();  // no parameters needed
}

When using PrecomputedEmbedder, you provide vectors directly in documents instead of text for embedding:

#![allow(unused)]
fn main() {
let doc = Document::builder()
    .add_vector("embedding", vec![0.1, 0.2, 0.3, ...])
    .build();
}

PerFieldEmbedder

PerFieldEmbedder routes embedding requests to field-specific embedders:

graph LR
    PFE["PerFieldEmbedder"]
    PFE -->|"text_vec"| BERT["CandleBertEmbedder\n(384 dim)"]
    PFE -->|"image_vec"| CLIP["CandleClipEmbedder\n(512 dim)"]
    PFE -->|other fields| DEF["Default Embedder"]

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::{CandleBertEmbedder, CandleClipEmbedder, PerFieldEmbedder};

let bert = Arc::new(CandleBertEmbedder::new("...")?);
let clip = Arc::new(CandleClipEmbedder::new("...")?);

let per_field = PerFieldEmbedder::new(bert.clone());
per_field.add_embedder("text_vec", bert.clone());
per_field.add_embedder("image_vec", clip.clone());

let engine = Engine::builder(storage, schema)
    .embedder(Arc::new(per_field))
    .build()
    .await?;
}

This is especially useful when:

  • Different vector fields need different models (e.g., BERT for text, CLIP for images)
  • Different fields have different vector dimensions
  • You want to mix local and remote embedders

How Embeddings Are Used

At Index Time

When you add a text value to a vector field, the engine automatically embeds it:

#![allow(unused)]
fn main() {
let doc = Document::builder()
    .add_text("text_vec", "Rust is a systems programming language")
    .build();
engine.add_document("doc-1", doc).await?;
// The embedder converts the text to a vector before indexing
}

At Search Time

When you search with text, the engine embeds the query text as well:

#![allow(unused)]
fn main() {
// Builder API
let request = VectorSearchRequestBuilder::new()
    .add_text("text_vec", "systems programming")
    .build();

// Query DSL
let request = vector_parser.parse(r#"text_vec:"systems programming""#).await?;
}

Both approaches embed the query text using the same embedder that was used at index time, ensuring consistent vector spaces.

Feature Flags Summary

Each embedder requires a specific feature flag to be enabled in Cargo.toml:

Embedder | Feature Flag | Dependencies
CandleBertEmbedder | embeddings-candle | candle-core, candle-nn, candle-transformers, hf-hub, tokenizers
OpenAIEmbedder | embeddings-openai | reqwest
CandleClipEmbedder | embeddings-multimodal | image + embeddings-candle
PrecomputedEmbedder | (none – always available) | (none)

The embeddings-all feature enables all embedding features at once. See Feature Flags for details.

Choosing an Embedder

Scenario | Recommended Embedder
Quick prototyping, offline use | CandleBertEmbedder
Production with high accuracy | OpenAIEmbedder
Text + image search | CandleClipEmbedder
Pre-computed vectors from external pipeline | PrecomputedEmbedder
Multiple models per field | PerFieldEmbedder wrapping others

Storage

Laurus uses a pluggable storage layer that abstracts how and where index data is persisted. All components — lexical index, vector index, and document log — share a single storage backend.

The Storage Trait

All backends implement the Storage trait:

#![allow(unused)]
fn main() {
pub trait Storage: Send + Sync + Debug {
    fn loading_mode(&self) -> LoadingMode;
    fn open_input(&self, name: &str) -> Result<Box<dyn StorageInput>>;
    fn create_output(&self, name: &str) -> Result<Box<dyn StorageOutput>>;
    fn file_exists(&self, name: &str) -> bool;
    fn delete_file(&self, name: &str) -> Result<()>;
    fn list_files(&self) -> Result<Vec<String>>;
    fn file_size(&self, name: &str) -> Result<u64>;
    // ... additional methods
}
}

This interface is file-oriented: all data (index segments, metadata, WAL entries, documents) is stored as named files accessed through streaming StorageInput / StorageOutput handles.

Storage Backends

MemoryStorage

All data lives in memory. Fast and simple, but not durable.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::memory::MemoryStorage;

let storage: Arc<dyn Storage> = Arc::new(
    MemoryStorage::new(Default::default())
);
}

Property | Value
Durability | None (data lost on process exit)
Speed | Fastest
Use case | Testing, prototyping, ephemeral data

FileStorage

Standard file-system based persistence. Each key maps to a file on disk.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::file::{FileStorage, FileStorageConfig};

let config = FileStorageConfig::new("/tmp/laurus-data");
let storage: Arc<dyn Storage> = Arc::new(FileStorage::new("/tmp/laurus-data", config)?);
}

Property | Value
Durability | Full (persisted to disk)
Speed | Moderate (disk I/O)
Use case | General production use

FileStorage with Memory Mapping

FileStorage supports memory-mapped file access via the use_mmap configuration flag. When enabled, the OS manages paging between memory and disk.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::file::{FileStorage, FileStorageConfig};

let mut config = FileStorageConfig::new("/tmp/laurus-data");
config.use_mmap = true;  // enable memory-mapped I/O
let storage: Arc<dyn Storage> = Arc::new(FileStorage::new("/tmp/laurus-data", config)?);
}

Property | Value
Durability | Full (persisted to disk)
Speed | Fast (OS-managed memory mapping)
Use case | Large datasets, read-heavy workloads

StorageFactory

You can also create storage via configuration:

#![allow(unused)]
fn main() {
use laurus::storage::{StorageConfig, StorageFactory};
use laurus::storage::memory::MemoryStorageConfig;

let storage = StorageFactory::create(
    StorageConfig::Memory(MemoryStorageConfig::default())
)?;
}

PrefixedStorage

The engine uses PrefixedStorage to isolate components within a single storage backend:

graph TB
    E["Engine"]
    E --> P1["PrefixedStorage\nprefix = 'lexical/'"]
    E --> P2["PrefixedStorage\nprefix = 'vector/'"]
    E --> P3["PrefixedStorage\nprefix = 'documents/'"]
    P1 --> S["Storage Backend"]
    P2 --> S
    P3 --> S

When the lexical store writes a key segments/seg-001.dict, it is actually stored as lexical/segments/seg-001.dict in the underlying backend. This ensures no key collisions between components.

You do not need to create PrefixedStorage yourself — the EngineBuilder handles this automatically.
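
The prefixing idea itself is simple; a toy illustration with a hypothetical wrapper over a plain map (not the Laurus type):

```rust
use std::collections::HashMap;

// Toy illustration of key prefixing: every key is namespaced before it
// reaches the backing store. Hypothetical type, for illustration only.
struct Prefixed {
    prefix: &'static str,
    backend: HashMap<String, Vec<u8>>,
}

impl Prefixed {
    fn put(&mut self, name: &str, data: Vec<u8>) {
        self.backend.insert(format!("{}{}", self.prefix, name), data);
    }
    fn get(&self, name: &str) -> Option<&Vec<u8>> {
        self.backend.get(&format!("{}{}", self.prefix, name))
    }
}

fn main() {
    let mut lexical = Prefixed { prefix: "lexical/", backend: HashMap::new() };
    lexical.put("segments/seg-001.dict", vec![1, 2, 3]);
    // The underlying key carries the component prefix:
    assert!(lexical.backend.contains_key("lexical/segments/seg-001.dict"));
    assert_eq!(lexical.get("segments/seg-001.dict"), Some(&vec![1, 2, 3]));
}
```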

ColumnStorage

In addition to the primary storage backends, Laurus provides a ColumnStorage layer for fast field-level access. This is used internally for operations like faceting, sorting, and aggregation, where accessing individual field values without deserializing entire documents is important.

ColumnValue

ColumnValue represents a single stored column value:

Variant | Description
String(String) | UTF-8 text
I32(i32) | 32-bit signed integer
I64(i64) | 64-bit signed integer
U32(u32) | 32-bit unsigned integer
U64(u64) | 64-bit unsigned integer
F32(f32) | 32-bit floating point
F64(f64) | 64-bit floating point
Bool(bool) | Boolean
DateTime(i64) | Unix timestamp (seconds)
Null | Absent value

ColumnStorage is managed internally by the Engine – you do not need to interact with it directly.

Choosing a Backend

Factor | MemoryStorage | FileStorage | FileStorage (mmap)
Durability | None | Full | Full
Read speed | Fastest | Moderate | Fast
Write speed | Fastest | Moderate | Moderate
Memory usage | Proportional to data size | Low | OS-managed
Max data size | Limited by RAM | Limited by disk | Limited by disk + address space
Best for | Tests, small datasets | General use | Large read-heavy datasets

Recommendations

  • Development / Testing: Use MemoryStorage for fast iteration without file cleanup
  • Production (general): Use FileStorage for reliable persistence
  • Production (large scale): Use FileStorage with use_mmap = true when you have large indexes and want to leverage OS page cache

Next Steps

Indexing

This section explains how Laurus stores and organizes data internally. Understanding the indexing layer will help you choose the right field types and tune performance.

Topics

Lexical Indexing

How text, numeric, and geographic fields are indexed using an inverted index. Covers:

  • The inverted index structure (term dictionary, posting lists)
  • BKD trees for numeric range queries
  • Segment files and their formats
  • BM25 scoring

Vector Indexing

How vector fields are indexed for approximate nearest neighbor search. Covers:

  • Index types: Flat, HNSW, IVF
  • Parameter tuning (m, ef_construction, n_clusters, n_probe)
  • Distance metrics (Cosine, Euclidean, DotProduct)
  • Quantization (SQ8, PQ)

Lexical Indexing

Lexical indexing powers keyword-based search. When a document’s text field is indexed, Laurus builds an inverted index — a data structure that maps terms to the documents containing them.

How Lexical Indexing Works

sequenceDiagram
    participant Doc as Document
    participant Analyzer
    participant Writer as IndexWriter
    participant Seg as Segment

    Doc->>Analyzer: "The quick brown fox"
    Analyzer->>Analyzer: Tokenize + Filter
    Analyzer-->>Writer: ["quick", "brown", "fox"]
    Writer->>Writer: Buffer in memory
    Writer->>Seg: Flush to segment on commit()

Step by Step

  1. Analyze: The text passes through the configured analyzer (tokenizer + filters), producing a stream of normalized terms
  2. Buffer: Terms are stored in an in-memory write buffer, organized by field
  3. Commit: On commit(), the buffer is flushed to a new segment on storage

The Inverted Index

An inverted index is essentially a map from terms to document lists:

graph LR
    subgraph "Term Dictionary"
        T1["'brown'"]
        T2["'fox'"]
        T3["'quick'"]
        T4["'rust'"]
    end

    subgraph "Posting Lists"
        P1["doc_1, doc_3"]
        P2["doc_1"]
        P3["doc_1, doc_2"]
        P4["doc_2, doc_3"]
    end

    T1 --> P1
    T2 --> P2
    T3 --> P3
    T4 --> P4

Component | Description
Term Dictionary | Sorted list of all unique terms in the index; supports fast prefix lookup
Posting Lists | For each term, a list of document IDs and metadata (term frequency, positions)
Doc Values | Column-oriented storage for sort/filter operations on numeric and date fields
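
A toy version of this structure (plain Rust, omitting term frequencies and positions) makes the term-to-documents mapping concrete:

```rust
use std::collections::{BTreeMap, HashSet};

// Toy inverted index: a sorted map from term -> set of document IDs.
// Real posting lists also carry term frequencies and positions.
fn build_index(docs: &[(u64, &str)]) -> BTreeMap<String, HashSet<u64>> {
    let mut index: BTreeMap<String, HashSet<u64>> = BTreeMap::new();
    for (doc_id, text) in docs {
        for term in text.split_whitespace() {
            index.entry(term.to_lowercase()).or_default().insert(*doc_id);
        }
    }
    index
}

fn main() {
    let docs: [(u64, &str); 3] = [
        (1, "quick brown fox"),
        (2, "quick rust"),
        (3, "brown rust"),
    ];
    let index = build_index(&docs);
    // "quick" appears in docs 1 and 2, but not 3
    let quick = &index["quick"];
    assert!(quick.contains(&1) && quick.contains(&2) && !quick.contains(&3));
}
```

Using a sorted map mirrors the term dictionary: sorted keys are what make fast prefix lookup possible.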

Posting List Contents

Each entry in a posting list contains:

Field | Description
Document ID | Internal u64 identifier
Term Frequency | How many times the term appears in this document
Positions (optional) | Where in the document the term appears (needed for phrase queries)
Weight | Score weight for this posting

Numeric and Date Fields

Integer, float, and datetime fields are indexed using a BKD tree — a space-partitioning data structure optimized for range queries:

graph TB
    Root["BKD Root"]
    Root --> L["values < 50"]
    Root --> R["values >= 50"]
    L --> LL["values < 25"]
    L --> LR["25 <= values < 50"]
    R --> RL["50 <= values < 75"]
    R --> RR["values >= 75"]

BKD trees allow efficient evaluation of range queries like price:[10 TO 100] or date:[2024-01-01 TO 2024-12-31].
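
The one-dimensional intuition — keep values ordered so a range query touches only the matching span — can be sketched without the tree (illustrative only; a real BKD tree generalizes this to multiple dimensions and disk-resident blocks):

```rust
// 1-D analogue of a range query over partitioned values: binary-search
// the sorted values for both bounds, then take the slice between them.
fn range_query(sorted: &[i64], lo: i64, hi: i64) -> &[i64] {
    let start = sorted.partition_point(|&v| v < lo);
    let end = sorted.partition_point(|&v| v <= hi);
    &sorted[start..end]
}

fn main() {
    let prices: [i64; 8] = [5, 10, 25, 40, 50, 75, 100, 120];
    // the price:[10 TO 100] query from the example above
    assert_eq!(range_query(&prices, 10, 100), &[10, 25, 40, 50, 75, 100]);
}
```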

Geo Fields

Geographic fields store latitude/longitude pairs. They are indexed using a spatial data structure that supports:

  • Radius queries: find all points within N kilometers of a center point
  • Bounding box queries: find all points within a rectangular area

Segments

The lexical index is organized into segments. Each segment is an immutable, self-contained mini-index:

graph TB
    LI["Lexical Index"]
    LI --> S1["Segment 0"]
    LI --> S2["Segment 1"]
    LI --> S3["Segment 2"]

    S1 --- F1[".dict (terms)"]
    S1 --- F2[".post (postings)"]
    S1 --- F3[".bkd (numerics)"]
    S1 --- F4[".docs (doc store)"]
    S1 --- F5[".dv (doc values)"]
    S1 --- F6[".meta (metadata)"]
    S1 --- F7[".lens (field lengths)"]

File Extension | Contents
.dict | Term dictionary (sorted terms + metadata offsets)
.post | Posting lists (document IDs, term frequencies, positions)
.bkd | BKD tree data for numeric and date fields
.docs | Stored field values (the original document content)
.dv | Doc values for sorting and filtering
.meta | Segment metadata (doc count, term count, etc.)
.lens | Field length norms (for BM25 scoring)

Segment Lifecycle

  1. Create: A new segment is created each time commit() is called
  2. Search: All segments are searched in parallel and results are merged
  3. Merge: Periodically, multiple small segments are merged into larger ones to improve query performance
  4. Delete: When a document is deleted, its ID is added to a deletion bitmap rather than physically removed (see Deletions & Compaction)

BM25 Scoring

Laurus uses the BM25 algorithm to score lexical search results. BM25 considers:

  • Term Frequency (TF): how often the term appears in the document (more = better, with diminishing returns)
  • Inverse Document Frequency (IDF): how rare the term is across all documents (rarer = more important)
  • Field Length Normalization: shorter fields are boosted relative to longer ones

The formula:

score(q, d) = IDF(q) * (TF(q, d) * (k1 + 1)) / (TF(q, d) + k1 * (1 - b + b * |d| / avgdl))

Where k1 = 1.2 and b = 0.75 are the default tuning parameters.
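
The formula translates directly into code; a self-contained sketch for a single query term, using the default k1 and b (not the Laurus internals):

```rust
// BM25 score for one query term, matching the formula above.
// idf is precomputed over the corpus; doc_len is |d| and avg_doc_len is avgdl.
fn bm25(idf: f64, tf: f64, doc_len: f64, avg_doc_len: f64) -> f64 {
    let k1 = 1.2;
    let b = 0.75;
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // In an average-length document, a second occurrence of the term...
    let twice = bm25(1.0, 2.0, 100.0, 100.0);
    let once = bm25(1.0, 1.0, 100.0, 100.0);
    // ...raises the score, but by less than 2x: diminishing returns.
    assert!(twice > once);
    assert!(twice < 2.0 * once);
}
```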

SIMD Optimization

Vector distance calculations leverage SIMD (Single Instruction, Multiple Data) instructions when available, providing significant speedups for similarity computations in vector search.

Code Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_integer_field("year", IntegerOption::default())
        .build();

    let engine = Engine::builder(storage, schema).build().await?;

    // Index documents
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("body", "Rust is a systems programming language.")
        .add_integer("year", 2024)
        .build()
    ).await?;

    // Commit to flush segments to storage
    engine.commit().await?;

    Ok(())
}

Next Steps

Vector Indexing

Vector indexing powers similarity-based search. When a document’s vector field is indexed, Laurus stores the embedding vector in a specialized index structure that enables fast approximate nearest neighbor (ANN) retrieval.

How Vector Indexing Works

sequenceDiagram
    participant Doc as Document
    participant Embedder
    participant Normalize as Normalizer
    participant Index as Vector Index

    Doc->>Embedder: "Rust is a systems language"
    Embedder-->>Normalize: [0.12, -0.45, 0.78, ...]
    Normalize->>Normalize: L2 normalize
    Normalize-->>Index: [0.14, -0.52, 0.90, ...]
    Index->>Index: Insert into index structure

Step by Step

  1. Embed: The text (or image) is converted to a vector by the configured embedder
  2. Normalize: The vector is L2-normalized (for cosine similarity)
  3. Index: The vector is inserted into the configured index structure (Flat, HNSW, or IVF)
  4. Commit: On commit(), the index is flushed to persistent storage
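
Step 2, L2 normalization, is small enough to show in isolation (plain Rust sketch, not the Laurus internals):

```rust
// L2 normalization: scale a vector to unit length, so that cosine
// similarity reduces to a plain dot product.
fn l2_normalize(v: &mut [f32]) {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let mut v = [3.0, 4.0];
    l2_normalize(&mut v);
    assert!((v[0] - 0.6).abs() < 1e-6);
    assert!((v[1] - 0.8).abs() < 1e-6);
    // The vector now has unit length
    let len: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((len - 1.0).abs() < 1e-6);
}
```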

Index Types

Laurus supports three vector index types, each with different performance characteristics:

Comparison

Property | Flat | HNSW | IVF
Accuracy | 100% (exact) | ~95-99% (approximate) | ~90-98% (approximate)
Search speed | O(n) linear scan | O(log n) graph walk | O(n/k) cluster scan
Memory usage | Low | Higher (graph edges) | Moderate (centroids)
Index build time | Fast | Moderate | Slower (clustering)
Best for | < 10K vectors | 10K - 10M vectors | > 1M vectors

Flat Index

The simplest index. Compares the query vector against every stored vector (brute-force).

#![allow(unused)]
fn main() {
use laurus::vector::FlatOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = FlatOption {
    dimension: 384,
    distance: DistanceMetric::Cosine,
    ..Default::default()
};
}

  • Pros: 100% recall (exact results), simple, low memory
  • Cons: Slow for large datasets (linear scan)
  • Use when: You have fewer than ~10,000 vectors, or you need exact results
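
The brute-force scan is easy to sketch in a few lines of plain Rust (illustrative only, using squared Euclidean distance):

```rust
// Brute-force nearest-neighbor scan, as a Flat index performs it:
// compute the distance to every stored vector and keep the closest.
fn squared_euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn nearest(query: &[f32], vectors: &[Vec<f32>]) -> Option<usize> {
    vectors
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            squared_euclidean(query, a)
                .partial_cmp(&squared_euclidean(query, b))
                .unwrap()
        })
        .map(|(i, _)| i)
}

fn main() {
    let stored = vec![
        vec![0.0, 0.0],
        vec![1.0, 1.0],
        vec![5.0, 5.0],
    ];
    // Every stored vector is visited — O(n), but exact
    assert_eq!(nearest(&[0.9, 1.1], &stored), Some(1));
}
```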

HNSW Index

Hierarchical Navigable Small World graph. The default and most commonly used index type.

graph TB
    subgraph "Layer 2 (sparse)"
        A2["A"] --- C2["C"]
    end

    subgraph "Layer 1 (medium)"
        A1["A"] --- B1["B"]
        A1 --- C1["C"]
        B1 --- D1["D"]
        C1 --- D1
    end

    subgraph "Layer 0 (dense - all vectors)"
        A0["A"] --- B0["B"]
        A0 --- C0["C"]
        B0 --- D0["D"]
        B0 --- E0["E"]
        C0 --- D0
        C0 --- F0["F"]
        D0 --- E0
        E0 --- F0
    end

    A2 -.->|"entry point"| A1
    A1 -.-> A0
    C2 -.-> C1
    C1 -.-> C0
    B1 -.-> B0
    D1 -.-> D0

The HNSW algorithm searches from the top (sparse) layer down to the bottom (dense) layer, narrowing the search space at each level.

#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = HnswOption {
    dimension: 384,
    distance: DistanceMetric::Cosine,
    m: 16,                  // max connections per node per layer
    ef_construction: 200,   // search width during index building
    ..Default::default()
};
}

HNSW Parameters

Parameter | Default | Description | Impact
m | 16 | Max bi-directional connections per layer | Higher = better recall, more memory
ef_construction | 200 | Search width during index building | Higher = better recall, slower build
dimension | 128 | Vector dimensions | Must match embedder output
distance | Cosine | Distance metric | See Distance Metrics below

Tuning tips:

  • Increase m (e.g., 32 or 64) for higher recall at the cost of memory
  • Increase ef_construction (e.g., 400) for better index quality at the cost of build time
  • At search time, the ef_search parameter (set in the search request) controls the search width

IVF Index

Inverted File Index. Partitions vectors into clusters, then only searches relevant clusters.

graph TB
    Q["Query Vector"]
    Q --> C1["Cluster 1\n(centroid)"]
    Q --> C2["Cluster 2\n(centroid)"]

    C1 --> V1["vec_3"]
    C1 --> V2["vec_7"]
    C1 --> V3["vec_12"]

    C2 --> V4["vec_1"]
    C2 --> V5["vec_9"]
    C2 --> V6["vec_15"]

    style C1 fill:#f9f,stroke:#333
    style C2 fill:#f9f,stroke:#333

#![allow(unused)]
fn main() {
use laurus::vector::IvfOption;
use laurus::vector::core::distance::DistanceMetric;

let opt = IvfOption {
    dimension: 384,
    distance: DistanceMetric::Cosine,
    n_clusters: 100,   // number of clusters
    n_probe: 10,       // clusters to search at query time
    ..Default::default()
};
}

IVF Parameters

| Parameter | Default | Description | Impact |
|---|---|---|---|
| n_clusters | 100 | Number of Voronoi cells | More clusters = faster search, lower recall |
| n_probe | 1 | Clusters to search at query time | Higher = better recall, slower search |
| dimension | (required) | Vector dimensions | Must match embedder output |
| distance | Cosine | Distance metric | See Distance Metrics below |

Tuning tips:

  • Set n_clusters to roughly sqrt(n) where n is the number of vectors
  • Set n_probe to 5-20% of n_clusters for a good recall/speed trade-off
  • IVF requires a training phase — initial indexing may be slower
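The two sizing heuristics above can be captured in a few lines (an illustrative sketch, not a Laurus API):

```rust
// Heuristic IVF sizing: n_clusters ≈ sqrt(n), n_probe ≈ 10% of n_clusters.
fn ivf_params(n_vectors: usize) -> (usize, usize) {
    let n_clusters = (n_vectors as f64).sqrt().round() as usize;
    let n_probe = (n_clusters / 10).max(1); // ~10%, always probe at least one
    (n_clusters, n_probe)
}

fn main() {
    let (n_clusters, n_probe) = ivf_params(1_000_000);
    println!("n_clusters = {n_clusters}, n_probe = {n_probe}"); // 1000, 100
}
```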

Distance Metrics

| Metric | Description | Range | Best For |
|---|---|---|---|
| Cosine | 1 - cosine similarity | [0, 2] | Text embeddings (most common) |
| Euclidean | L2 distance | [0, +inf) | Spatial data |
| Manhattan | L1 distance | [0, +inf) | Feature vectors |
| DotProduct | Negative inner product | (-inf, +inf) | Pre-normalized vectors |
| Angular | Angular distance | [0, pi] | Directional similarity |

#![allow(unused)]
fn main() {
use laurus::vector::core::distance::DistanceMetric;

let metric = DistanceMetric::Cosine;      // Default for text
let metric = DistanceMetric::Euclidean;    // For spatial data
let metric = DistanceMetric::Manhattan;    // L1 distance
let metric = DistanceMetric::DotProduct;   // For pre-normalized vectors
let metric = DistanceMetric::Angular;      // Angular distance
}

Note: For cosine similarity, vectors are automatically L2-normalized before indexing. Lower distance = more similar.
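What "1 - cosine similarity" means with the L2 normalization the note describes, sketched in plain Rust (illustrative, not the Laurus implementation):

```rust
// Normalize a vector to unit length, then cosine distance is 1 - dot product.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let (a, b) = (l2_normalize(a), l2_normalize(b));
    let dot: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    1.0 - dot // 0 = same direction, 2 = opposite direction
}

fn main() {
    // Scaling is ignored: [1, 0] and [2, 0] point the same way
    assert!(cosine_distance(&[1.0, 0.0], &[2.0, 0.0]).abs() < 1e-6);
    // Orthogonal vectors sit exactly in the middle of the [0, 2] range
    assert!((cosine_distance(&[1.0, 0.0], &[0.0, 1.0]) - 1.0).abs() < 1e-6);
}
```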

Quantization

Quantization reduces memory usage by compressing vectors at the cost of some accuracy:

| Method | Enum Variant | Description | Memory Reduction |
|---|---|---|---|
| Scalar 8-bit | Scalar8Bit | Scalar quantization to 8-bit integers | ~4x |
| Product Quantization | ProductQuantization { subvector_count } | Splits vectors into sub-vectors and quantizes each | ~16-64x |

#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::quantization::QuantizationMethod;

let opt = HnswOption {
    dimension: 384,
    quantizer: Some(QuantizationMethod::Scalar8Bit),
    ..Default::default()
};
}

VectorQuantizer

The VectorQuantizer manages the quantization lifecycle:

| Method | Description |
|---|---|
| new(method, dimension) | Create an untrained quantizer |
| train(vectors) | Train on representative vectors (computes per-dimension min/max for Scalar8Bit) |
| quantize(vector) | Compress a vector using the trained parameters |
| dequantize(quantized) | Decompress a quantized vector back to full precision |

For Scalar8Bit, training computes the min and max value for each dimension. Each component is then mapped to the [0, 255] range. Dequantization reverses this mapping with some precision loss.
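The train/quantize/dequantize cycle described above can be sketched in plain Rust (a simplified illustration of per-dimension min/max scalar quantization, not Laurus's actual VectorQuantizer):

```rust
// Per-dimension scalar 8-bit quantization: train records min/max per
// dimension, quantize maps each component into [0, 255], dequantize
// reverses the mapping with some precision loss.
struct Scalar8Bit {
    min: Vec<f32>,
    max: Vec<f32>,
}

impl Scalar8Bit {
    fn train(vectors: &[Vec<f32>]) -> Self {
        let dim = vectors[0].len();
        let mut min = vec![f32::INFINITY; dim];
        let mut max = vec![f32::NEG_INFINITY; dim];
        for v in vectors {
            for (d, &x) in v.iter().enumerate() {
                min[d] = min[d].min(x);
                max[d] = max[d].max(x);
            }
        }
        Self { min, max }
    }

    fn quantize(&self, v: &[f32]) -> Vec<u8> {
        v.iter().enumerate()
            .map(|(d, &x)| {
                let range = (self.max[d] - self.min[d]).max(f32::EPSILON);
                (((x - self.min[d]) / range) * 255.0).round() as u8
            })
            .collect()
    }

    fn dequantize(&self, q: &[u8]) -> Vec<f32> {
        q.iter().enumerate()
            .map(|(d, &b)| {
                let range = (self.max[d] - self.min[d]).max(f32::EPSILON);
                self.min[d] + (b as f32 / 255.0) * range
            })
            .collect()
    }
}

fn main() {
    let data = vec![vec![0.0, -1.0], vec![1.0, 1.0], vec![0.5, 0.0]];
    let q = Scalar8Bit::train(&data);
    let restored = q.dequantize(&q.quantize(&data[2]));
    // Round-trip error is bounded by ~1/255 of each dimension's range
    assert!((restored[0] - 0.5).abs() < 0.01);
}
```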

Note: ProductQuantization is defined in the API but is currently unimplemented. Using it will return an error.

Segment Files

Each vector index type stores its data in a single segment file:

| Index Type | File Extension | Contents |
|---|---|---|
| HNSW | .hnsw | Graph structure, vectors, and metadata |
| Flat | .flat | Raw vectors and metadata |
| IVF | .ivf | Cluster centroids, assigned vectors, and metadata |

Code Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_hnsw_field("embedding", HnswOption {
            dimension: 384,
            distance: DistanceMetric::Cosine,
            m: 16,
            ef_construction: 200,
            ..Default::default()
        })
        .build();

    // With an embedder, text in vector fields is automatically embedded
    let engine = Engine::builder(storage, schema)
        .embedder(my_embedder)
        .build()
        .await?;

    // Add text to the vector field — it will be embedded automatically
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("embedding", "Rust is a systems programming language.")
        .build()
    ).await?;

    engine.commit().await?;

    Ok(())
}

Next Steps

Search

This section covers how to query your indexed data. Laurus supports three search modes that can be used independently or combined.

Topics

Keyword-based search using an inverted index. Covers:

  • All query types: Term, Phrase, Boolean, Fuzzy, Wildcard, Prefix, Regexp, Range, Geo, Span
  • BM25 scoring and field boosts
  • Using the Query DSL for text-based queries

Semantic similarity search using vector embeddings. Covers:

  • VectorSearchRequestBuilder API
  • Multi-field vector search and score modes
  • Filtered vector search

Combining lexical and vector search for best-of-both-worlds results. Covers:

  • SearchRequestBuilder API
  • Fusion algorithms (RRF, WeightedSum)
  • Filtered hybrid search
  • Pagination with offset/limit

For spelling correction, see Spelling Correction in the Library section.

Lexical Search

Lexical search finds documents by matching keywords against an inverted index. Laurus provides a rich set of query types that cover exact matching, phrase matching, fuzzy matching, and more.

Basic Usage

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::lexical::TermQuery;
use laurus::lexical::search::searcher::LexicalSearchQuery;

let request = SearchRequestBuilder::new()
    .lexical_query(
        LexicalSearchQuery::Obj(
            Box::new(TermQuery::new("body", "rust"))
        )
    )
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

Query Types

TermQuery

Matches documents containing an exact term in a specific field.

#![allow(unused)]
fn main() {
use laurus::lexical::TermQuery;

// Find documents where "body" contains the term "rust"
let query = TermQuery::new("body", "rust");
}

Note: Terms are matched after analysis. If the field uses StandardAnalyzer, both the indexed text and the query term are lowercased, so TermQuery::new("body", "rust") will match “Rust” in the original text.

PhraseQuery

Matches documents containing an exact sequence of terms.

#![allow(unused)]
fn main() {
use laurus::lexical::query::phrase::PhraseQuery;

// Find documents containing the exact phrase "machine learning"
let query = PhraseQuery::new("body", vec!["machine".to_string(), "learning".to_string()]);

// Or use the convenience method from a phrase string:
let query = PhraseQuery::from_phrase("body", "machine learning");
}

Phrase queries require term positions to be stored (the default for TextOption).

BooleanQuery

Combines multiple queries with boolean logic.

#![allow(unused)]
fn main() {
use laurus::lexical::query::boolean::{BooleanQuery, BooleanQueryBuilder, Occur};

let query = BooleanQueryBuilder::new()
    .must(Box::new(TermQuery::new("body", "rust")))       // AND
    .must(Box::new(TermQuery::new("body", "programming"))) // AND
    .must_not(Box::new(TermQuery::new("body", "python")))  // NOT
    .build();
}

| Occur | Meaning | DSL Equivalent |
|---|---|---|
| Must | Document MUST match | +term or AND |
| Should | Document SHOULD match (boosts score) | term or OR |
| MustNot | Document MUST NOT match | -term or NOT |
| Filter | MUST match, but does not affect score | (no DSL equivalent) |

FuzzyQuery

Matches terms within a specified edit distance (Levenshtein distance).

#![allow(unused)]
fn main() {
use laurus::lexical::query::fuzzy::FuzzyQuery;

// Find documents matching "programing" within edit distance 2
// This will match "programming", "programing", etc.
let query = FuzzyQuery::new("body", "programing");  // default max_edits = 2
}
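The edit distance FuzzyQuery measures is Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions. A plain-Rust sketch (illustrative, not Laurus's implementation):

```rust
// Classic dynamic-programming Levenshtein distance, one row at a time.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            curr.push(substitute.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

fn main() {
    // "programing" is one insertion away from "programming",
    // so it matches under the default max_edits = 2
    assert_eq!(levenshtein("programing", "programming"), 1);
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}
```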

WildcardQuery

Matches terms using wildcard patterns.

#![allow(unused)]
fn main() {
use laurus::lexical::query::wildcard::WildcardQuery;

// '?' matches exactly one character, '*' matches zero or more
let query = WildcardQuery::new("filename", "*.pdf")?;
let query = WildcardQuery::new("body", "pro*")?;
let query = WildcardQuery::new("body", "col?r")?;  // matches "color" and "colour"
}

PrefixQuery

Matches documents containing terms that start with a specific prefix.

#![allow(unused)]
fn main() {
use laurus::lexical::query::prefix::PrefixQuery;

// Find documents where "body" contains terms starting with "pro"
// This matches "programming", "program", "production", etc.
let query = PrefixQuery::new("body", "pro");
}

RegexpQuery

Matches documents containing terms that match a regular expression pattern.

#![allow(unused)]
fn main() {
use laurus::lexical::query::regexp::RegexpQuery;

// Find documents where "body" contains terms matching the regex
let query = RegexpQuery::new("body", "^pro.*ing$")?;

// Match version-like patterns
let query = RegexpQuery::new("version", r"^v\d+\.\d+")?;
}

Note: RegexpQuery::new() returns Result because the regex pattern is validated at construction time. Invalid patterns will produce an error.

NumericRangeQuery

Matches documents with numeric field values within a range.

#![allow(unused)]
fn main() {
use laurus::lexical::NumericRangeQuery;
use laurus::lexical::core::field::NumericType;

// Find documents where "price" is between 10.0 and 100.0 (inclusive)
let query = NumericRangeQuery::new(
    "price",
    NumericType::Float,
    Some(10.0),   // min
    Some(100.0),  // max
    true,         // include min
    true,         // include max
);

// Open-ended range: price >= 50
let query = NumericRangeQuery::new(
    "price",
    NumericType::Float,
    Some(50.0),
    None,     // no upper bound
    true,
    false,
);
}

GeoQuery

Matches documents by geographic location.

#![allow(unused)]
fn main() {
use laurus::lexical::query::geo::GeoQuery;

// Find documents within 10km of Tokyo Station (35.6812, 139.7671)
let query = GeoQuery::within_radius("location", 35.6812, 139.7671, 10.0)?; // radius in kilometers

// Find documents within a bounding box (min_lat, min_lon, max_lat, max_lon)
let query = GeoQuery::within_bounding_box(
    "location",
    35.0, 139.0,  // min (lat, lon)
    36.0, 140.0,  // max (lat, lon)
)?;
}

SpanQuery

Matches terms based on their proximity within a document. Use SpanTermQuery and SpanNearQuery to build proximity queries:

#![allow(unused)]
fn main() {
use laurus::lexical::query::span::{SpanQuery, SpanTermQuery, SpanNearQuery};

// Find documents where "quick" appears near "fox" (within 3 positions)
let query = SpanNearQuery::new(
    "body",
    vec![
        Box::new(SpanTermQuery::new("body", "quick")) as Box<dyn SpanQuery>,
        Box::new(SpanTermQuery::new("body", "fox")) as Box<dyn SpanQuery>,
    ],
    3,    // slop (max distance between terms)
    true, // in_order (terms must appear in order)
);
}

Scoring

Lexical search results are scored using BM25. The score reflects how relevant a document is to the query:

  • Higher term frequency in the document increases the score
  • Rarer terms across the index increase the score
  • Shorter documents are boosted relative to longer ones
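The three bullets above fall out of the standard BM25 term-scoring formula, sketched here with the common k1 = 1.2, b = 0.75 constants (an illustration of the formula's shape, not Laurus's exact implementation):

```rust
// BM25 score contribution of one term in one document.
// tf: term frequency in the doc; doc_freq: docs containing the term.
fn bm25_term(tf: f64, doc_len: f64, avg_doc_len: f64, n_docs: f64, doc_freq: f64) -> f64 {
    let k1 = 1.2;
    let b = 0.75;
    // Rare terms get a larger inverse document frequency
    let idf = ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    // tf saturates; longer-than-average docs are penalized via b
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // Higher term frequency -> higher score
    assert!(bm25_term(3.0, 100.0, 100.0, 1000.0, 50.0)
          > bm25_term(1.0, 100.0, 100.0, 1000.0, 50.0));
    // Rarer term (lower doc_freq) -> higher score
    assert!(bm25_term(1.0, 100.0, 100.0, 1000.0, 5.0)
          > bm25_term(1.0, 100.0, 100.0, 1000.0, 500.0));
    // Shorter document -> higher score
    assert!(bm25_term(1.0, 50.0, 100.0, 1000.0, 50.0)
          > bm25_term(1.0, 200.0, 100.0, 1000.0, 50.0));
}
```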

Field Boosts

You can boost specific fields to influence relevance using the SearchRequestBuilder:

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::lexical::TermQuery;
use laurus::lexical::search::searcher::LexicalSearchQuery;

let request = SearchRequestBuilder::new()
    .lexical_query(LexicalSearchQuery::Obj(Box::new(TermQuery::new("body", "rust"))))
    .add_field_boost("title", 2.0)  // title matches count double
    .add_field_boost("body", 1.0)
    .build();
}

Lexical Search Options

Lexical search behavior is controlled via LexicalSearchOptions on the SearchRequest, or by using builder methods on SearchRequestBuilder:

| Option | Default | Description |
|---|---|---|
| field_boosts | empty | Per-field score multipliers |
| min_score | 0.0 | Minimum score threshold |
| timeout_ms | None | Search timeout in milliseconds |
| parallel | false | Enable parallel search across segments |
| sort_by | Score | Sort by relevance score, or by a field (asc / desc) |

Builder Methods

SearchRequestBuilder provides convenience methods for lexical options:

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::lexical::TermQuery;
use laurus::lexical::search::searcher::{LexicalSearchQuery, SortField, SortOrder};

let request = SearchRequestBuilder::new()
    .lexical_query(LexicalSearchQuery::Obj(Box::new(TermQuery::new("body", "rust"))))
    .lexical_min_score(0.5)
    .lexical_timeout_ms(5000)
    .lexical_parallel(true)
    .sort_by(SortField::Field { name: "date".to_string(), order: SortOrder::Desc })
    .add_field_boost("title", 2.0)
    .add_field_boost("body", 1.0)
    .limit(20)
    .build();
}

Using the Query DSL

Instead of building queries programmatically, you can use the text-based Query DSL:

#![allow(unused)]
fn main() {
use laurus::lexical::QueryParser;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use std::sync::Arc;

let analyzer = Arc::new(StandardAnalyzer::default());
let parser = QueryParser::new(analyzer).with_default_field("body");

// Simple term
let query = parser.parse("rust")?;

// Boolean
let query = parser.parse("rust AND programming")?;

// Phrase
let query = parser.parse("\"machine learning\"")?;

// Field-specific
let query = parser.parse("title:rust AND body:programming")?;

// Fuzzy
let query = parser.parse("programing~2")?;

// Range
let query = parser.parse("year:[2020 TO 2024]")?;
}

See Query DSL for the complete syntax reference.

Next Steps

Vector Search

Vector search finds documents by semantic similarity. Instead of matching keywords, it compares the meaning of the query against document embeddings in vector space.

Basic Usage

Builder API

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::QueryPayload;
use laurus::data::DataValue;

let request = SearchRequestBuilder::new()
    .vector_query(
        VectorSearchQuery::Payloads(vec![
            QueryPayload {
                field: "embedding".to_string(),
                payload: DataValue::Text("systems programming language".to_string()),
                weight: 1.0,
            },
        ])
    )
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

The QueryPayload stores raw data (text, bytes, etc.) that will be embedded at search time using the configured embedder.

Query DSL

#![allow(unused)]
fn main() {
use laurus::vector::VectorQueryParser;

let parser = VectorQueryParser::new(embedder.clone())
    .with_default_field("embedding");

let request = parser.parse(r#"embedding:"systems programming""#).await?;
}

VectorSearchQuery

The vector search query is specified as a VectorSearchQuery enum:

| Variant | Description |
|---|---|
| Payloads(Vec<QueryPayload>) | Raw payloads (text, bytes, etc.) to be embedded at search time |
| Vectors(Vec<QueryVector>) | Pre-embedded query vectors ready for nearest-neighbor search |

QueryPayload

| Field | Type | Description |
|---|---|---|
| field | String | Target vector field name |
| payload | DataValue | The payload to embed (e.g., DataValue::Text(...)) |
| weight | f32 | Score weight (default: 1.0) |

QueryVector

| Field | Type | Description |
|---|---|---|
| vector | Vector | Pre-computed dense vector embedding |
| weight | f32 | Score weight (default: 1.0) |
| fields | Option<Vec<String>> | Optional field restriction |

Examples

#![allow(unused)]
fn main() {
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::{QueryPayload, QueryVector};
use laurus::vector::core::vector::Vector;
use laurus::data::DataValue;

// Text query (will be embedded at search time)
let query = VectorSearchQuery::Payloads(vec![
    QueryPayload {
        field: "text_vec".to_string(),
        payload: DataValue::Text("machine learning".to_string()),
        weight: 1.0,
    },
]);

// Pre-computed vector
let query = VectorSearchQuery::Vectors(vec![
    QueryVector {
        vector: Vector::from(vec![0.1, 0.2, 0.3]),
        weight: 1.0,
        fields: Some(vec!["embedding".to_string()]),
    },
]);
}

You can search across multiple vector fields in a single request:

#![allow(unused)]
fn main() {
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::QueryPayload;
use laurus::data::DataValue;

let query = VectorSearchQuery::Payloads(vec![
    QueryPayload {
        field: "text_vec".to_string(),
        payload: DataValue::Text("cute kitten".to_string()),
        weight: 1.0,
    },
    QueryPayload {
        field: "image_vec".to_string(),
        payload: DataValue::Text("fluffy cat".to_string()),
        weight: 1.0,
    },
]);
}

Each clause produces a vector that is searched against its respective field. Results are combined using the configured score mode.

Score Modes

| Mode | Description |
|---|---|
| WeightedSum (default) | Sum of (similarity * weight) across all clauses |
| MaxSim | Maximum similarity score across clauses |
| LateInteraction | ColBERT-style late interaction scoring |

Weights

Use the ^ boost syntax in the DSL, or the weight field on QueryPayload / QueryVector, to adjust how much each clause contributes:

text_vec:"cute kitten"^1.0 image_vec:"fluffy cat"^0.5

This means text similarity counts twice as much as image similarity.

Filtered Vector Search

You can apply lexical filters to narrow the vector search results:

#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::lexical::TermQuery;
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::QueryPayload;
use laurus::data::DataValue;

// Vector search with a category filter
let request = SearchRequestBuilder::new()
    .vector_query(
        VectorSearchQuery::Payloads(vec![
            QueryPayload {
                field: "embedding".to_string(),
                payload: DataValue::Text("machine learning".to_string()),
                weight: 1.0,
            },
        ])
    )
    .filter_query(Box::new(TermQuery::new("category", "tutorial")))
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

The filter query runs first on the lexical index to identify allowed document IDs, then the vector search is restricted to those IDs.

Filter with Numeric Range

#![allow(unused)]
fn main() {
use laurus::lexical::NumericRangeQuery;
use laurus::lexical::core::field::NumericType;

let request = SearchRequestBuilder::new()
    .vector_query(
        VectorSearchQuery::Payloads(vec![
            QueryPayload {
                field: "embedding".to_string(),
                payload: DataValue::Text("type systems".to_string()),
                weight: 1.0,
            },
        ])
    )
    .filter_query(Box::new(NumericRangeQuery::new(
        "year", NumericType::Integer,
        Some(2020.0), Some(2024.0), true, true
    )))
    .limit(10)
    .build();
}

Distance Metrics

The distance metric is configured per field in the schema (see Vector Indexing):

| Metric | Description | Lower = More Similar |
|---|---|---|
| Cosine | 1 - cosine similarity | Yes |
| Euclidean | L2 distance | Yes |
| Manhattan | L1 distance | Yes |
| DotProduct | Negative inner product | Yes |
| Angular | Angular distance | Yes |

Complete Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema, SearchRequestBuilder, PerFieldEmbedder};
use laurus::lexical::TextOption;
use laurus::vector::HnswOption;
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::QueryPayload;
use laurus::data::DataValue;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_hnsw_field("text_vec", HnswOption {
            dimension: 384,
            ..Default::default()
        })
        .build();

    // Set up per-field embedder
    let embedder = Arc::new(my_embedder);
    let pfe = PerFieldEmbedder::new(embedder.clone());
    pfe.add_embedder("text_vec", embedder.clone());

    let engine = Engine::builder(storage, schema)
        .embedder(Arc::new(pfe))
        .build()
        .await?;

    // Index documents (text in vector field is auto-embedded)
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("text_vec", "Rust is a systems programming language.")
        .build()
    ).await?;
    engine.commit().await?;

    // Search by semantic similarity
    let results = engine.search(
        SearchRequestBuilder::new()
            .vector_query(
                VectorSearchQuery::Payloads(vec![
                    QueryPayload {
                        field: "text_vec".to_string(),
                        payload: DataValue::Text("systems language".to_string()),
                        weight: 1.0,
                    },
                ])
            )
            .limit(5)
            .build()
    ).await?;

    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }

    Ok(())
}

Next Steps

Hybrid Search

Hybrid search combines lexical search (keyword matching) with vector search (semantic similarity) to deliver results that are both precise and semantically relevant. This is Laurus’s most powerful search mode.

| Search Type | Strengths | Weaknesses |
|---|---|---|
| Lexical only | Exact keyword matching, handles rare terms well | Misses synonyms and paraphrases |
| Vector only | Understands meaning, handles synonyms | May miss exact keywords, less precise |
| Hybrid | Best of both worlds | Slightly more complex to configure |

How It Works

sequenceDiagram
    participant User
    participant Engine
    participant Lexical as LexicalStore
    participant Vector as VectorStore
    participant Fusion

    User->>Engine: SearchRequest\n(lexical + vector)

    par Execute in parallel
        Engine->>Lexical: BM25 keyword search
        Lexical-->>Engine: Ranked hits (by relevance)
    and
        Engine->>Vector: ANN similarity search
        Vector-->>Engine: Ranked hits (by distance)
    end

    Engine->>Fusion: Merge two result sets
    Note over Fusion: RRF or WeightedSum
    Fusion-->>Engine: Unified ranked list
    Engine-->>User: Vec of SearchResult

Basic Usage

Builder API

#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, FusionAlgorithm};
use laurus::lexical::TermQuery;
use laurus::lexical::search::searcher::LexicalSearchQuery;
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::QueryPayload;
use laurus::data::DataValue;

let request = SearchRequestBuilder::new()
    // Lexical component
    .lexical_query(
        LexicalSearchQuery::Obj(
            Box::new(TermQuery::new("body", "rust"))
        )
    )
    // Vector component
    .vector_query(
        VectorSearchQuery::Payloads(vec![
            QueryPayload {
                field: "text_vec".to_string(),
                payload: DataValue::Text("systems programming".to_string()),
                weight: 1.0,
            },
        ])
    )
    // Fusion algorithm
    .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

Query DSL

Mix lexical and vector clauses in a single query string:

#![allow(unused)]
fn main() {
use laurus::UnifiedQueryParser;
use laurus::lexical::QueryParser;
use laurus::vector::VectorQueryParser;

let unified = UnifiedQueryParser::new(
    QueryParser::new(analyzer).with_default_field("body"),
    VectorQueryParser::new(embedder),
);

// Lexical + vector in one query
let request = unified.parse(r#"body:rust text_vec:"systems programming""#).await?;
let results = engine.search(request).await?;
}

The parser uses the schema to identify vector clauses by field type. Fields defined as vector fields (e.g., HNSW) are parsed as vector queries; everything else is parsed as lexical.

Fusion Algorithms

When both lexical and vector results exist, they must be merged into a single ranked list. Laurus supports two fusion algorithms:

RRF (Reciprocal Rank Fusion)

The default algorithm. Combines results based on their rank positions rather than raw scores.

score(doc) = sum( 1 / (k + rank_i) )

Where rank_i is the position of the document in each result list, and k is a smoothing parameter (default 60).
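Applying the formula by hand to two ranked ID lists shows why documents that appear high in both lists win (a plain-Rust sketch, not Laurus's fusion code):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion over any number of ranked ID lists.
// Each list contributes 1 / (k + rank) per document, rank starting at 1.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut out: Vec<_> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    let lexical = vec!["doc-1", "doc-2", "doc-3"];
    let vector = vec!["doc-3", "doc-1", "doc-4"];
    let fused = rrf(&[lexical, vector], 60.0);
    // doc-1 ranks high in both lists, so it tops the fused ranking
    assert_eq!(fused[0].0, "doc-1");
}
```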

#![allow(unused)]
fn main() {
use laurus::FusionAlgorithm;

let fusion = FusionAlgorithm::RRF { k: 60.0 };
}

Advantages:

  • Robust to different score distributions between lexical and vector results
  • No need to tune weights
  • Works well out of the box

WeightedSum

Linearly combines normalized lexical and vector scores:

score(doc) = lexical_weight * lexical_score + vector_weight * vector_score

#![allow(unused)]
fn main() {
use laurus::FusionAlgorithm;

let fusion = FusionAlgorithm::WeightedSum {
    lexical_weight: 0.3,
    vector_weight: 0.7,
};
}

When to use:

  • When you want explicit control over the balance between lexical and vector relevance
  • When you know one signal is more important than the other
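Because BM25 scores and vector similarities live on different scales, each side's scores are normalized before mixing. A sketch of the idea, assuming min-max normalization (the exact normalization Laurus uses may differ):

```rust
// Map a raw score into [0, 1] relative to its own result list.
fn min_max_normalize(score: f64, min: f64, max: f64) -> f64 {
    if max > min { (score - min) / (max - min) } else { 1.0 }
}

// Linear combination of the two normalized signals.
fn weighted_sum(lexical: f64, vector: f64, lexical_weight: f64, vector_weight: f64) -> f64 {
    lexical_weight * lexical + vector_weight * vector
}

fn main() {
    // A BM25 score of 8.4 in a list spanning [2.0, 10.0] and a cosine
    // similarity of 0.9 in a list spanning [0.5, 1.0] both normalize to 0.8
    let lexical = min_max_normalize(8.4, 2.0, 10.0);
    let vector = min_max_normalize(0.9, 0.5, 1.0);
    let fused = weighted_sum(lexical, vector, 0.3, 0.7);
    assert!((fused - 0.8).abs() < 1e-9);
}
```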

SearchRequest Fields

| Field | Type | Default | Description |
|---|---|---|---|
| query | SearchQuery | Dsl("") | Search query (Dsl, Lexical, Vector, or Hybrid) |
| limit | usize | 10 | Maximum number of results to return |
| offset | usize | 0 | Number of results to skip (for pagination) |
| fusion_algorithm | Option<FusionAlgorithm> | None (uses RRF { k: 60.0 } when both results exist) | How to merge lexical and vector results |
| filter_query | Option<Box<dyn Query>> | None | Pre-filter using a lexical query (restricts both lexical and vector results) |
| lexical_options | LexicalSearchOptions | Default | Parameters controlling lexical search behavior (field boosts, min score, timeout, etc.) |
| vector_options | VectorSearchOptions | Default | Parameters controlling vector search behavior (score mode, min score) |

SearchResult

Each result contains:

| Field | Type | Description |
|---|---|---|
| id | String | External document ID |
| score | f32 | Fused relevance score |
| document | Option<Document> | Full document content (if loaded) |

Filtered Hybrid Search

Apply a filter to restrict both lexical and vector results:

#![allow(unused)]
fn main() {
let request = SearchRequestBuilder::new()
    .lexical_query(
        LexicalSearchQuery::Obj(Box::new(TermQuery::new("body", "rust")))
    )
    .vector_query(
        VectorSearchQuery::Payloads(vec![
            QueryPayload {
                field: "text_vec".to_string(),
                payload: DataValue::Text("systems programming".to_string()),
                weight: 1.0,
            },
        ])
    )
    // Only search within "tutorial" category
    .filter_query(Box::new(TermQuery::new("category", "tutorial")))
    .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
    .limit(10)
    .build();
}

How Filtering Works

  1. The filter query runs on the lexical index to produce a set of allowed document IDs
  2. For lexical search: the filter is combined with the user query as a boolean AND
  3. For vector search: the allowed IDs are passed to restrict the ANN search

Pagination

Use offset and limit for pagination:

#![allow(unused)]
fn main() {
// Page 1: results 0-9
let page1 = SearchRequestBuilder::new()
    .lexical_query(/* ... */)
    .vector_query(/* ... */)
    .offset(0)
    .limit(10)
    .build();

// Page 2: results 10-19
let page2 = SearchRequestBuilder::new()
    .lexical_query(/* ... */)
    .vector_query(/* ... */)
    .offset(10)
    .limit(10)
    .build();
}

Complete Example

use std::sync::Arc;
use laurus::{
    Document, Engine, Schema, SearchRequestBuilder,
    FusionAlgorithm, PerFieldEmbedder,
};
use laurus::lexical::{TextOption, TermQuery};
use laurus::lexical::core::field::IntegerOption;
use laurus::lexical::search::searcher::LexicalSearchQuery;
use laurus::vector::HnswOption;
use laurus::vector::search::searcher::VectorSearchQuery;
use laurus::vector::store::request::QueryPayload;
use laurus::data::DataValue;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));

    // Schema with both lexical and vector fields
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_text_field("category", TextOption::default())
        .add_integer_field("year", IntegerOption::default())
        .add_hnsw_field("body_vec", HnswOption {
            dimension: 384,
            ..Default::default()
        })
        .build();

    // Configure analyzer and embedder (see Text Analysis and Embeddings docs)
    // let analyzer = Arc::new(StandardAnalyzer::new()?);
    // let embedder = Arc::new(CandleBertEmbedder::new("sentence-transformers/all-MiniLM-L6-v2")?);
    let engine = Engine::builder(storage, schema)
        // .analyzer(analyzer)
        // .embedder(embedder)
        .build()
        .await?;

    // Index documents with both text and vector fields
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming Guide")
        .add_text("body", "Rust is a systems programming language.")
        .add_text("category", "programming")
        .add_integer("year", 2024)
        .add_text("body_vec", "Rust is a systems programming language.")
        .build()
    ).await?;
    engine.commit().await?;

    // Hybrid search: keyword "rust" + semantic "systems language"
    let results = engine.search(
        SearchRequestBuilder::new()
            .lexical_query(
                LexicalSearchQuery::Obj(Box::new(TermQuery::new("body", "rust")))
            )
            .vector_query(
                VectorSearchQuery::Payloads(vec![
                    QueryPayload {
                        field: "body_vec".to_string(),
                        payload: DataValue::Text("systems language".to_string()),
                        weight: 1.0,
                    },
                ])
            )
            .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
            .limit(10)
            .build()
    ).await?;

    for r in &results {
        println!("{}: score={:.4}", r.id, r.score);
    }

    Ok(())
}

Next Steps

Query DSL

Laurus provides a unified query DSL (Domain Specific Language) that allows lexical (keyword) and vector (semantic) search in a single query string. The UnifiedQueryParser splits the input into lexical and vector portions and delegates to the appropriate sub-parser.

Overview

title:hello AND content:"cute kitten"^0.8
|--- lexical --|    |--- vector --------|

The field type in the schema determines whether a clause is lexical or vector. If the field is a vector field (e.g., HNSW), the clause is treated as a vector query. Everything else is treated as a lexical query.

Lexical Query Syntax

Lexical queries search the inverted index using exact or approximate keyword matching.

Term Query

Match a single term against a field (or the default field):

hello
title:hello

Boolean Operators

Combine clauses with AND and OR (case-insensitive):

title:hello AND body:world
title:hello OR title:goodbye

Space-separated clauses without an explicit operator use implicit boolean (behaves like OR with scoring).

Required / Prohibited Clauses

Use + (must match) and - (must not match):

+title:hello -title:goodbye

Phrase Query

Match an exact phrase using double quotes. Optional proximity (~N) allows N words between terms:

"hello world"
"hello world"~2

Fuzzy Query

Approximate matching with edit distance. Append ~ and optionally the maximum edit distance:

roam~
roam~2

Wildcard Query

Use ? (single character) and * (zero or more characters):

te?t
test*

Range Query

Inclusive [] or exclusive {} ranges, useful for numeric and date fields:

price:[100 TO 500]
date:{2024-01-01 TO 2024-12-31}
price:[* TO 100]

Boost

Increase the weight of a clause with ^:

title:hello^2
"important phrase"^1.5

Grouping

Use parentheses for sub-expressions:

(title:hello OR title:hi) AND body:world

Lexical PEG Grammar

The full lexical grammar (parser.pest):

query          = { SOI ~ boolean_query ~ EOI }
boolean_query  = { clause ~ (boolean_op ~ clause | clause)* }
clause         = { required_clause | prohibited_clause | sub_clause }
required_clause   = { "+" ~ sub_clause }
prohibited_clause = { "-" ~ sub_clause }
sub_clause     = { grouped_query | field_query | term_query }
grouped_query  = { "(" ~ boolean_query ~ ")" ~ boost? }
boolean_op     = { ^"AND" | ^"OR" }
field_query    = { field ~ ":" ~ field_value }
field_value    = { range_query | phrase_query | fuzzy_term
                 | wildcard_term | simple_term }
phrase_query   = { "\"" ~ phrase_content ~ "\"" ~ proximity? ~ boost? }
proximity      = { "~" ~ number }
fuzzy_term     = { term ~ "~" ~ fuzziness? ~ boost? }
wildcard_term  = { wildcard_pattern ~ boost? }
simple_term    = { term ~ boost? }
boost          = { "^" ~ boost_value }

Vector Query Syntax

Vector queries embed text into vectors at parse time and perform similarity search.

Basic Syntax

field:"text"
field:text
field:"text"^weight

The field name must refer to a vector field defined in the schema. The parser uses the schema to determine whether a clause is a vector query.

| Element | Required | Description | Example |
|---|---|---|---|
| field: | Yes | Target vector field name (must be a vector field in the schema) | content: |
| "text" or text | Yes | Text to embed (quoted or unquoted) | "cute kitten", python |
| ^weight | No | Score weight (default: 1.0) | ^0.8 |

Vector Query Examples

# Single field (quoted text)
content:"cute kitten"

# Unquoted text
content:python

# With boost weight
content:"cute kitten"^0.8

# Multiple clauses
content:"cats" image:"dogs"^0.5

# Nested field name (dot notation)
metadata.embedding:"text"

Multiple Clauses

Multiple vector clauses are space-separated. All clauses are executed and their scores are combined using the score_mode (default: WeightedSum):

content:"cats" image:"dogs"^0.5

This produces:

score = similarity("cats", content) * 1.0
      + similarity("dogs", image)   * 0.5

There are no AND/OR operators in the vector DSL. Vector search is inherently a ranking operation, and the weight (^) controls the contribution of each clause.

Score Modes

| Mode | Description |
|---|---|
| WeightedSum (default) | Sum of (similarity * weight) across all clauses |
| MaxSim | Maximum similarity score across clauses |
| LateInteraction | Late interaction scoring |

Score mode cannot be set from DSL syntax. Use the Rust API to override:

#![allow(unused)]
fn main() {
let mut request = parser.parse(r#"content:"cats" image:"dogs""#).await?;
request.vector_options.score_mode = VectorScoreMode::MaxSim;
}

Vector PEG Grammar

The full vector grammar (parser.pest):

query          = { SOI ~ vector_clause+ ~ EOI }
vector_clause  = { field_prefix ~ (quoted_text | unquoted_text) ~ boost? }
field_prefix   = { field_name ~ ":" }
field_name     = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHANUMERIC | "_" | ".")* }
quoted_text    = ${ "\"" ~ inner_text ~ "\"" }
inner_text     = @{ (!("\"") ~ ANY)* }
unquoted_text  = @{ (!(" " | "^" | "\"") ~ ANY)+ }
boost          = { "^" ~ float_value }
float_value    = @{ ASCII_DIGIT+ ~ ("." ~ ASCII_DIGIT+)? }

Unified (Hybrid) Query Syntax

The UnifiedQueryParser allows mixing lexical and vector clauses freely in a single query string:

title:hello content:"cute kitten"^0.8

How It Works

  1. Split: The parser checks each field name against the schema. Fields defined as vector fields (e.g., HNSW, Flat, IVF) are routed to the vector parser; all other fields are routed to the lexical parser.
  2. Delegate: Vector portion goes to VectorQueryParser, remainder goes to lexical QueryParser.
  3. Fuse: If both lexical and vector results exist, they are combined using a fusion algorithm.

Disambiguation

The parser uses the schema’s field type information to distinguish vector clauses from lexical clauses. A clause like content:"cute kitten" is a vector query if content is a vector field, or a phrase query if content is a text field. Lexical ~ syntax (e.g., roam~2 for fuzzy, "hello world"~10 for proximity) is unaffected.

Fusion Algorithms

When a query contains both lexical and vector clauses, results are fused:

| Algorithm | Formula | Description |
|---|---|---|
| RRF (default) | score = sum(1 / (k + rank)) | Reciprocal Rank Fusion. Robust to different score distributions. Default k=60. |
| WeightedSum | score = lexical * a + vector * b | Linear combination with configurable weights. |

Note: The fusion algorithm cannot be specified in the DSL syntax. It is configured when constructing the UnifiedQueryParser via .with_fusion(). The default is RRF (k=60). See Custom Fusion for a code example.

Hybrid AND/OR Semantics (the + Prefix)

By default, hybrid queries use union (OR) — documents appearing in either the lexical results or the vector results are included. You can switch to intersection (AND) by prefixing a vector clause with +, which requires documents to appear in both result sets.

| Syntax | Mode | Behaviour |
|---|---|---|
| title:Rust content:"system process" | OR (union) | Documents matching the lexical query or the vector query are returned. |
| title:Rust +content:"system process" | AND (intersection) | Only documents appearing in both the lexical and the vector result sets are returned. |
| +title:Rust +content:"system process" | AND (intersection) | Both clauses required. + on the lexical field is handled by the lexical parser as a required clause. |

Rules:

  • When no vector clause carries the + prefix, the fusion produces a union (OR) of lexical and vector results.
  • When at least one vector clause carries the + prefix, the fusion switches to intersection (AND) — only documents present in both the lexical and vector result sets are returned.
  • + on a lexical field (e.g., +title:Rust) is interpreted by the lexical query parser as a required clause, which is the existing Tantivy/Lucene-style behaviour. It does not by itself trigger intersection mode for the hybrid fusion.
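The union/intersection switch described by these rules can be sketched as a set operation over the two result-ID sets (a hypothetical helper, not the Laurus API):

```rust
use std::collections::HashSet;

/// Merge lexical and vector hit IDs. With a required (+) vector clause,
/// keep only IDs present in both result sets; otherwise take the union.
fn merge_ids(lexical: &[u64], vector: &[u64], require_both: bool) -> Vec<u64> {
    let lex: HashSet<u64> = lexical.iter().copied().collect();
    let vec_set: HashSet<u64> = vector.iter().copied().collect();
    let mut ids: Vec<u64> = if require_both {
        lex.intersection(&vec_set).copied().collect()
    } else {
        lex.union(&vec_set).copied().collect()
    };
    ids.sort();
    ids
}
```

The real engine then fuses scores for the surviving documents; this sketch only shows which documents survive.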

Unified Query Examples

# Lexical only — no fusion
title:hello AND body:world

# Vector only — no fusion
content:"cute kitten"

# Hybrid — fusion applied automatically (OR / union)
title:hello content:"cute kitten"

# Hybrid with AND / intersection — only docs in both result sets
title:hello +content:"cute kitten"

# Hybrid with boolean operators
title:hello AND category:animal content:"cute kitten"^0.8

# Multiple vector clauses + lexical
category:animal content:"cats" image:"dogs"^0.5

# Unquoted vector text
category:animal content:python

Code Examples

Lexical Search with DSL

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::lexical::query::QueryParser;

let analyzer = Arc::new(StandardAnalyzer::new()?);
let parser = QueryParser::new(analyzer)
    .with_default_field("title");

let query = parser.parse("title:hello AND body:world")?;
}

Vector Search with DSL

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::vector::query::VectorQueryParser;

let parser = VectorQueryParser::new(embedder)
    .with_default_field("content");

let request = parser.parse(r#"content:"cute kitten"^0.8"#).await?;
}

Hybrid Search with Unified DSL

#![allow(unused)]
fn main() {
use laurus::engine::query::UnifiedQueryParser;

let unified = UnifiedQueryParser::new(lexical_parser, vector_parser);

let request = unified.parse(
    r#"title:hello content:"cute kitten"^0.8"#
).await?;
// request.query              -> SearchQuery::Hybrid { lexical, vector }
// request.fusion_algorithm   -> Some(RRF)  — fusion algorithm
}

Custom Fusion

#![allow(unused)]
fn main() {
use laurus::engine::search::FusionAlgorithm;

let unified = UnifiedQueryParser::new(lexical_parser, vector_parser)
    .with_fusion(FusionAlgorithm::WeightedSum {
        lexical_weight: 0.3,
        vector_weight: 0.7,
    });
}

Library Overview

The laurus crate is the core search engine library. It provides lexical search (keyword matching via inverted index), vector search (semantic similarity via embeddings), and hybrid search (combining both) through a unified API.

Module Structure

graph TB
    LIB["laurus (lib.rs)"]

    LIB --> engine["engine\nEngine, EngineBuilder\nSearchRequest, FusionAlgorithm"]
    LIB --> analysis["analysis\nAnalyzer, Tokenizer\nToken Filters, Char Filters"]
    LIB --> lexical["lexical\nInverted Index, BM25\nQuery Types, Faceting, Highlighting"]
    LIB --> vector["vector\nFlat, HNSW, IVF\nDistance Metrics, Quantization"]
    LIB --> embedding["embedding\nCandle BERT, OpenAI\nCLIP, Precomputed"]
    LIB --> storage["storage\nMemory, File, Mmap\nColumnStorage"]
    LIB --> store["store\nDocumentLog (WAL)"]
    LIB --> spelling["spelling\nSpelling Correction\nSuggestion Engine"]
    LIB --> data["data\nDataValue, Document"]
    LIB --> error["error\nLaurusError, Result"]

Key Types

| Type | Module | Description |
|---|---|---|
| Engine | engine | Unified search engine coordinating lexical and vector search |
| EngineBuilder | engine | Builder pattern for configuring and creating an Engine |
| Schema | engine | Field definitions and routing configuration |
| SearchRequest | engine | Unified search request (lexical, vector, or hybrid) |
| FusionAlgorithm | engine | Result merging strategy (RRF or WeightedSum) |
| Document | data | Collection of named field values |
| DataValue | data | Unified value enum for all field types |
| LaurusError | error | Comprehensive error type with variants for each subsystem |

Feature Flags

The laurus crate has no default features enabled. Enable embedding support as needed:

| Feature | Description | Dependencies |
|---|---|---|
| embeddings-candle | Local BERT embeddings via Hugging Face Candle | candle-core, candle-nn, candle-transformers, hf-hub, tokenizers |
| embeddings-openai | OpenAI API embeddings | reqwest |
| embeddings-multimodal | CLIP multimodal embeddings (text + image) | image, embeddings-candle |
| embeddings-all | All embedding features combined | All of the above |
# Lexical search only (no embedding)
[dependencies]
laurus = "0.1.0"

# With local BERT embeddings
[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-candle"] }

# All features
[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-all"] }


Engine

The Engine is the central type in Laurus. It coordinates the lexical index, vector index, and document log behind a single async API.

Engine Struct

#![allow(unused)]
fn main() {
pub struct Engine {
    schema: Schema,
    lexical: LexicalStore,
    vector: VectorStore,
    log: Arc<DocumentLog>,
}
}
| Field | Type | Description |
|---|---|---|
| schema | Schema | Field definitions and routing rules |
| lexical | LexicalStore | Inverted index for keyword search |
| vector | VectorStore | Vector index for similarity search |
| log | Arc<DocumentLog> | Write-ahead log for crash recovery and document storage |

EngineBuilder

Use EngineBuilder to configure and create an Engine:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::{Engine, Schema};
use laurus::lexical::TextOption;
use laurus::storage::memory::MemoryStorage;

let storage = Arc::new(MemoryStorage::new(Default::default()));
let schema = Schema::builder()
    .add_text_field("title", TextOption::default())
    .add_text_field("body", TextOption::default())
    .add_default_field("body")
    .build();

let engine = Engine::builder(storage, schema)
    .analyzer(my_analyzer)    // optional: custom text analyzer
    .embedder(my_embedder)    // optional: vector embedder
    .build()
    .await?;
}

Builder Methods

| Method | Parameter | Default | Description |
|---|---|---|---|
| analyzer() | Arc<dyn Analyzer> | StandardAnalyzer | Text analysis pipeline for lexical fields |
| embedder() | Arc<dyn Embedder> | None | Embedding model for vector fields |
| build() | – | – | Create the Engine (async) |

Build Lifecycle

When build() is called, the following steps occur:

sequenceDiagram
    participant User
    participant EngineBuilder
    participant Engine

    User->>EngineBuilder: .build().await
    EngineBuilder->>EngineBuilder: split_schema()
    Note over EngineBuilder: Separate fields into<br/>LexicalIndexConfig<br/>+ VectorIndexConfig
    EngineBuilder->>Engine: Create PrefixedStorage (lexical/, vector/, documents/)
    EngineBuilder->>Engine: Create LexicalStore
    EngineBuilder->>Engine: Create VectorStore
    EngineBuilder->>Engine: Create DocumentLog
    EngineBuilder->>Engine: Recover from WAL
    EngineBuilder-->>User: Engine ready
  1. Split schema – Lexical fields (Text, Integer, Float, etc.) go to LexicalIndexConfig, vector fields (HNSW, Flat, IVF) go to VectorIndexConfig
  2. Create prefixed storage – Each component gets an isolated namespace (lexical/, vector/, documents/)
  3. Initialize stores – LexicalStore and VectorStore are created with their respective configs
  4. Recover from WAL – Any uncommitted operations from a previous session are replayed

Schema Splitting

The Schema contains both lexical and vector fields. At build time, split_schema() separates them:

graph LR
    S["Schema<br/>title: Text<br/>body: Text<br/>page: Integer<br/>content_vec: HNSW"]

    S --> LC["LexicalIndexConfig<br/>title: TextOption<br/>body: TextOption<br/>page: IntegerOption<br/>_id: KeywordAnalyzer"]

    S --> VC["VectorIndexConfig<br/>content_vec: HnswOption<br/>(dim=384, m=16, ef=200)"]

The reserved _id field is always added to the lexical config with KeywordAnalyzer for exact match lookups.

Per-Field Dispatch

PerFieldAnalyzer

When a PerFieldAnalyzer is provided, text analysis is dispatched to field-specific analyzers:

graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KA["KeywordAnalyzer"]
    PFA -->|"body"| SA["StandardAnalyzer"]
    PFA -->|"description"| JA["JapaneseAnalyzer"]
    PFA -->|"_id"| KA2["KeywordAnalyzer<br/>(always)"]
    PFA -->|other fields| DEF["Default Analyzer"]

PerFieldEmbedder

Similarly, PerFieldEmbedder routes embedding to field-specific embedders:

graph LR
    PFE["PerFieldEmbedder"]
    PFE -->|"text_vec"| BERT["CandleBertEmbedder<br/>(384 dim)"]
    PFE -->|"image_vec"| CLIP["CandleClipEmbedder<br/>(512 dim)"]
    PFE -->|other fields| DEF["Default Embedder"]

Engine Methods

Document Operations

| Method | Description |
|---|---|
| put_document(id, doc) | Upsert – replaces any existing document with the same ID |
| add_document(id, doc) | Append – adds as a new chunk (multiple chunks can share an ID) |
| get_documents(id) | Retrieve all documents/chunks by external ID |
| delete_documents(id) | Delete all documents/chunks by external ID |
| commit() | Flush pending changes to storage (makes documents searchable) |
| recover() | Replay the WAL to restore uncommitted state after a crash |
| add_field(name, field_option) | Dynamically add a new field to the schema at runtime |
| delete_field(name) | Remove a field from the schema at runtime |
| schema() | Return the current Schema |

Search

| Method | Description |
|---|---|
| search(request) | Execute a unified search (lexical, vector, or hybrid) |

The search() method accepts a SearchRequest which can contain a lexical query, a vector query, or both. When both are present, results are merged using the specified FusionAlgorithm.

#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, FusionAlgorithm};
use laurus::lexical::TermQuery;
use laurus::lexical::search::searcher::LexicalSearchQuery;

// Lexical-only search
let request = SearchRequestBuilder::new()
    .lexical_query(
        LexicalSearchQuery::Obj(Box::new(TermQuery::new("body", "rust")))
    )
    .limit(10)
    .build();

// Hybrid search with RRF fusion
let request = SearchRequestBuilder::new()
    .lexical_query(lexical_query)
    .vector_query(vector_query)
    .fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
    .limit(10)
    .build();

let results = engine.search(request).await?;
}

SearchRequest

| Field | Type | Default | Description |
|---|---|---|---|
| query | SearchQuery | Dsl("") | Search query specification (Dsl, Lexical, Vector, or Hybrid) |
| limit | usize | 10 | Maximum results to return |
| offset | usize | 0 | Pagination offset |
| fusion_algorithm | Option<FusionAlgorithm> | RRF (k=60) | How to merge lexical + vector results |
| filter_query | Option<Box<dyn Query>> | None | Filter applied to both search types |
| lexical_options | LexicalSearchOptions | Default | Parameters controlling lexical search behavior |
| vector_options | VectorSearchOptions | Default | Parameters controlling vector search behavior |

FusionAlgorithm

| Variant | Description |
|---|---|
| RRF { k: f64 } | Reciprocal Rank Fusion – rank-based combining. Score = sum(1 / (k + rank)). Handles incomparable score magnitudes. |
| WeightedSum { lexical_weight, vector_weight } | Weighted combination with min-max score normalization. Weights clamped to [0.0, 1.0]. |

See also: Architecture for the high-level data flow diagrams.

Scoring & Ranking

Laurus provides multiple scoring algorithms for lexical search and uses distance-based similarity for vector search. This page covers all scoring mechanisms and how they interact in hybrid search.

Lexical Scoring

BM25 (Default)

BM25 is the default scoring function for lexical search. It balances term frequency with document length normalization:

score = IDF * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_len / avg_doc_len)))

Where:

  • tf – term frequency in the document
  • IDF – inverse document frequency (rarity of the term across all documents)
  • k1 – term frequency saturation parameter
  • b – document length normalization factor
  • doc_len / avg_doc_len – ratio of document length to average document length
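The formula above can be written directly as a small function. This is an illustrative sketch (names are hypothetical, not the Laurus internals):

```rust
/// Per-term BM25 score, matching the formula above.
/// tf: term frequency in the document; idf: inverse document frequency;
/// k1, b: the saturation and length-normalization parameters.
fn bm25_term_score(tf: f64, idf: f64, doc_len: f64, avg_doc_len: f64, k1: f64, b: f64) -> f64 {
    let norm = 1.0 - b + b * (doc_len / avg_doc_len);
    idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}
```

With the defaults (k1 = 1.2, b = 0.75), a term that occurs more often scores higher, but with diminishing returns, and the same term frequency in a longer-than-average document scores lower.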

ScoringConfig

ScoringConfig controls BM25 and other scoring parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| k1 | f32 | 1.2 | Term frequency saturation. Higher values give more weight to term frequency. |
| b | f32 | 0.75 | Field length normalization. 0.0 = no normalization, 1.0 = full normalization. |
| tf_idf_boost | f32 | 1.0 | Global TF-IDF boost factor |
| enable_field_norm | bool | true | Enable field length normalization |
| field_boosts | HashMap<String, f32> | empty | Per-field score multipliers |
| enable_coord | bool | true | Enable query coordination factor (matched_terms / total_query_terms) |

Alternative Scoring Functions

| Function | Description |
|---|---|
| BM25ScoringFunction | BM25 with configurable k1 and b (default) |
| TfIdfScoringFunction | Log-normalized TF-IDF with field length normalization |
| VectorSpaceScoringFunction | Cosine similarity over document term vector space |
| CustomScoringFunction | User-provided closure for custom scoring logic |

ScoringRegistry

The ScoringRegistry provides a central registry for scoring algorithms:

#![allow(unused)]
fn main() {
// Pre-registered algorithms:
// - "bm25"          -> BM25ScoringFunction
// - "tf_idf"        -> TfIdfScoringFunction
// - "vector_space"  -> VectorSpaceScoringFunction
}

Field Boosts

Field boosts multiply the score contribution from specific fields. This is useful when some fields are more important than others:

#![allow(unused)]
fn main() {
use std::collections::HashMap;

let mut field_boosts = HashMap::new();
field_boosts.insert("title".to_string(), 2.0);  // title matches score 2x
field_boosts.insert("body".to_string(), 1.0);   // body matches score 1x
}

Coordination Factor

When enable_coord is true, the AdvancedScorer applies a coordination factor:

coord = matched_query_terms / total_query_terms

This rewards documents that match more query terms. For example, if the query has 3 terms and a document matches 2 of them, the coordination factor is 2/3 = 0.667.

Vector Scoring

Vector search ranks results by distance-based similarity:

similarity = 1 / (1 + distance)

The distance is computed using the configured distance metric:

| Metric | Description | Best For |
|---|---|---|
| Cosine | 1 - cosine similarity | Text embeddings (most common) |
| Euclidean | L2 distance | Spatial data |
| Manhattan | L1 distance | Feature vectors |
| DotProduct | Negated dot product | Pre-normalized vectors |
| Angular | Angular distance | Directional similarity |
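As a concrete example of the distance-to-similarity mapping, here is cosine distance fed through the formula above (an illustrative sketch; the engine computes this internally):

```rust
/// Cosine distance: 1 - cos(a, b). 0.0 for identical directions,
/// 1.0 for orthogonal vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

/// similarity = 1 / (1 + distance), as defined above.
fn similarity(distance: f32) -> f32 {
    1.0 / (1.0 + distance)
}
```

Identical vectors give distance 0.0 and similarity 1.0; orthogonal vectors give distance 1.0 and similarity 0.5.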

Hybrid Search Score Normalization

When lexical and vector results are combined, their scores must be made comparable.

RRF (Reciprocal Rank Fusion)

RRF avoids score normalization entirely by using ranks instead of raw scores:

rrf_score = sum(1 / (k + rank))

The k parameter (default: 60) controls smoothing. Higher values give less weight to top-ranked results.
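The rank-based fusion can be sketched in a few lines (a hypothetical helper, not the Laurus API):

```rust
use std::collections::HashMap;

/// Fuse two ranked ID lists with Reciprocal Rank Fusion.
/// rank is 1-based; each list contributes 1 / (k + rank) per document.
fn rrf_fuse(lexical: &[u64], vector: &[u64], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in [lexical, vector].iter() {
        for (i, &id) in list.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

A document ranked well in both lists accumulates two reciprocal-rank terms, so it naturally rises above documents that appear in only one list.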

WeightedSum

WeightedSum normalizes scores from each search type independently using min-max normalization, then combines them:

norm_score = (score - min_score) / (max_score - min_score)
final_score = (norm_lexical * lexical_weight) + (norm_vector * vector_weight)

Both weights are clamped to [0.0, 1.0].
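The normalization and combination steps above look roughly like this (an illustrative sketch, not the Laurus internals; the degenerate all-equal-scores case is handled with an assumed fallback of 1.0):

```rust
/// Min-max normalize a score list into [0.0, 1.0].
fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if (max - min).abs() < f64::EPSILON {
        return vec![1.0; scores.len()]; // assumed fallback when all scores are equal
    }
    scores.iter().map(|s| (s - min) / (max - min)).collect()
}

/// Combine one document's normalized lexical and vector scores.
fn weighted_sum(norm_lexical: f64, norm_vector: f64, lw: f64, vw: f64) -> f64 {
    norm_lexical * lw.clamp(0.0, 1.0) + norm_vector * vw.clamp(0.0, 1.0)
}
```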

Faceting

Faceting enables counting and categorizing search results by field values. It is commonly used to build navigation filters in search UIs (e.g., “Electronics (42)”, “Books (18)”).

Concepts

FacetPath

A FacetPath represents a hierarchical facet value. For example, a product category “Electronics > Computers > Laptops” is a facet path with three levels.

#![allow(unused)]
fn main() {
use laurus::lexical::search::features::facet::FacetPath;

// Single-level facet
let facet = FacetPath::from_value("category", "Electronics");

// Hierarchical facet from components
let facet = FacetPath::new("category", vec![
    "Electronics".to_string(),
    "Computers".to_string(),
    "Laptops".to_string(),
]);

// From a delimited string
let facet = FacetPath::from_delimited("category", "Electronics/Computers/Laptops", "/");
}

FacetPath Methods

| Method | Description |
|---|---|
| new(field, path) | Create a facet path from a field name and path components |
| from_value(field, value) | Create a single-level facet |
| from_delimited(field, path_str, delimiter) | Parse a delimited path string |
| depth() | Number of levels in the path |
| is_parent_of(other) | Check if this path is a parent of another |
| parent() | Get the parent path (one level up) |
| child(component) | Create a child path by appending a component |
| to_string_with_delimiter(delimiter) | Convert to a delimited string |

FacetCount

FacetCount represents the result of a facet aggregation:

#![allow(unused)]
fn main() {
pub struct FacetCount {
    pub path: FacetPath,
    pub count: u64,
    pub children: Vec<FacetCount>,
}
}
| Field | Type | Description |
|---|---|---|
| path | FacetPath | The facet value |
| count | u64 | Number of matching documents |
| children | Vec<FacetCount> | Child facets for hierarchical drill-down |

Example: Hierarchical Facets

Category
├── Electronics (42)
│   ├── Computers (18)
│   │   ├── Laptops (12)
│   │   └── Desktops (6)
│   └── Phones (24)
└── Books (35)
    ├── Fiction (20)
    └── Non-Fiction (15)

Each node in this tree corresponds to a FacetCount with its children populated for drill-down navigation.

Use Cases

  • E-commerce: Filter by category, brand, price range, rating
  • Document search: Filter by author, department, date range, document type
  • Content management: Filter by tags, topics, content status

Highlighting

Highlighting marks matching terms in search results, helping users see why a document matched their query. Laurus generates highlighted text fragments with configurable HTML tags.

HighlightConfig

HighlightConfig controls how highlights are generated:

#![allow(unused)]
fn main() {
use laurus::lexical::search::features::highlight::HighlightConfig;

let config = HighlightConfig::default()
    .tag("mark")
    .css_class("highlight")
    .max_fragments(3)
    .fragment_size(200);
}

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| tag | String | "mark" | HTML tag used for highlighting |
| css_class | Option<String> | None | Optional CSS class added to the tag |
| max_fragments | usize | 5 | Maximum number of fragments to return |
| fragment_size | usize | 150 | Target fragment length in characters |
| fragment_overlap | usize | 20 | Character overlap between adjacent fragments |
| fragment_separator | String | " ... " | Separator between fragments |
| return_entire_field_if_no_highlight | bool | false | Return the full field value if no matches are found |
| max_analyzed_chars | usize | 1,000,000 | Maximum characters to analyze for highlights |

Builder Methods

| Method | Description |
|---|---|
| tag(tag) | Set the HTML tag (e.g., "em", "strong", "mark") |
| css_class(class) | Set the CSS class for the tag |
| max_fragments(count) | Set the maximum fragment count |
| fragment_size(size) | Set the target fragment size in characters |
| opening_tag() | Get the opening HTML tag string (e.g., <mark class="highlight">) |
| closing_tag() | Get the closing HTML tag string (e.g., </mark>) |

HighlightFragment

Each highlight result is a HighlightFragment:

#![allow(unused)]
fn main() {
pub struct HighlightFragment {
    pub text: String,
}
}

The text field contains the fragment with matching terms wrapped in the configured HTML tags.

Output Example

Given a document with body = "Rust is a systems programming language focused on safety and performance." and a search for “rust programming”:

<mark>Rust</mark> is a systems <mark>programming</mark> language focused on safety and performance.

With css_class("highlight"):

<mark class="highlight">Rust</mark> is a systems <mark class="highlight">programming</mark> language focused on safety and performance.

Fragment Selection

When a field is long, Laurus selects the most relevant fragments:

  1. The text is split into overlapping windows of fragment_size characters
  2. Each fragment is scored by how many query terms it contains
  3. The top max_fragments fragments are returned, joined by fragment_separator

If no fragments contain matches and return_entire_field_if_no_highlight is true, the full field value is returned instead.

Spelling Correction

Laurus includes a built-in spelling correction system that can suggest corrections for misspelled query terms and provide “Did you mean?” functionality.

Overview

The spelling corrector uses edit distance (Levenshtein distance) combined with word frequency data to suggest corrections. It supports:

  • Word-level suggestions — correct individual misspelled words
  • Auto-correction — automatically apply high-confidence corrections
  • “Did you mean?” — suggest alternative queries to the user
  • Query learning — improve suggestions by learning from user queries
  • Custom dictionaries — use your own word lists
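The edit-distance measure the corrector ranks candidates by is classic Levenshtein distance. A minimal dynamic-programming sketch (illustrative, not the Laurus internals):

```rust
/// Levenshtein distance: minimum number of single-character insertions,
/// deletions, and substitutions needed to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev holds the previous row of the DP table.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}
```

With the default max_distance of 2, "programing" (distance 1 from "programming") and "langauge" (distance 2 from "language") would both yield candidates.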

Basic Usage

SpellingCorrector

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::SpellingCorrector;

// Create a corrector with the built-in English dictionary
let mut corrector = SpellingCorrector::new();

// Correct a query
let result = corrector.correct("programing langauge");

// Check if suggestions are available
if result.has_suggestions() {
    for (word, suggestions) in &result.word_suggestions {
        println!("'{}' -> {:?}", word, suggestions);
    }
}

// Get the best corrected query
if let Some(corrected) = result.query() {
    println!("Corrected: {}", corrected);
}
}

“Did You Mean?”

The DidYouMean wrapper provides a higher-level interface for search UIs:

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::{SpellingCorrector, DidYouMean};

let corrector = SpellingCorrector::new();
let mut did_you_mean = DidYouMean::new(corrector);

if let Some(suggestion) = did_you_mean.suggest("programing") {
    println!("Did you mean: {}?", suggestion);
}
}

Configuration

Use CorrectorConfig to customize behavior:

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::{CorrectorConfig, SpellingCorrector};

let config = CorrectorConfig {
    max_distance: 2,              // Maximum edit distance (default: 2)
    max_suggestions: 5,           // Max suggestions per word (default: 5)
    min_frequency: 1,             // Minimum word frequency threshold (default: 1)
    auto_correct: false,          // Enable auto-correction (default: false)
    auto_correct_threshold: 0.8,  // Confidence threshold for auto-correction (default: 0.8)
    use_index_terms: true,        // Use indexed terms as dictionary (default: true)
    learn_from_queries: true,     // Learn from user queries (default: true)
};
}

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| max_distance | usize | 2 | Maximum Levenshtein edit distance for candidate suggestions |
| max_suggestions | usize | 5 | Maximum number of suggestions returned per word |
| min_frequency | u32 | 1 | Minimum frequency a word must have in the dictionary to be suggested |
| auto_correct | bool | false | When true, automatically apply corrections above the threshold |
| auto_correct_threshold | f64 | 0.8 | Confidence score (0.0–1.0) required for auto-correction |
| use_index_terms | bool | true | Use terms from the search index as dictionary words |
| learn_from_queries | bool | true | Learn new words from user search queries |

CorrectionResult

The correct() method returns a CorrectionResult with detailed information:

| Field | Type | Description |
|---|---|---|
| original | String | The original query string |
| corrected | Option<String> | The corrected query (if auto-correction was applied) |
| word_suggestions | HashMap<String, Vec<Suggestion>> | Suggestions grouped by misspelled word |
| confidence | f64 | Overall confidence score (0.0–1.0) |
| auto_corrected | bool | Whether auto-correction was applied |

Helper Methods

| Method | Returns | Description |
|---|---|---|
| has_suggestions() | bool | True if any word has suggestions |
| best_suggestion() | Option<&Suggestion> | The single highest-scoring suggestion |
| query() | Option<String> | The corrected query string, if corrections were made |
| should_show_did_you_mean() | bool | Whether to display a “Did you mean?” prompt |

Custom Dictionaries

You can provide your own dictionary instead of using the built-in English one:

#![allow(unused)]
fn main() {
use laurus::spelling::corrector::SpellingCorrector;
use laurus::spelling::dictionary::SpellingDictionary;

// Build a custom dictionary
let mut dictionary = SpellingDictionary::new();
dictionary.add_word("elasticsearch", 100);
dictionary.add_word("lucene", 80);
dictionary.add_word("laurus", 90);

let corrector = SpellingCorrector::with_dictionary(dictionary);
}

Learning from Index Terms

When use_index_terms is enabled, the corrector can learn from terms in your search index:

#![allow(unused)]
fn main() {
let mut corrector = SpellingCorrector::new();

// Feed index terms to the corrector
let index_terms = vec!["rust", "programming", "search", "engine"];
corrector.learn_from_terms(&index_terms);
}

This improves suggestion quality by incorporating domain-specific vocabulary.

Statistics

Monitor the corrector’s state with stats():

#![allow(unused)]
fn main() {
let stats = corrector.stats();
println!("Dictionary words: {}", stats.dictionary_words);
println!("Total frequency: {}", stats.dictionary_total_frequency);
println!("Learned queries: {}", stats.queries_learned);
}


ID Management

Laurus uses a dual-tiered ID management strategy to ensure efficient document retrieval, updates, and aggregation in distributed environments.

1. External ID (String)

The External ID is a logical identifier used by users and applications to uniquely identify a document.

  • Type: String
  • Role: You can use any unique value, such as UUIDs, URLs, or database primary keys.
  • Storage: Persisted transparently as a reserved system field name _id within the Lexical Index.
  • Uniqueness: Expected to be unique across the entire system.
  • Updates: Indexing a document with an existing external_id triggers an automatic “Delete-then-Insert” (Upsert) operation, replacing the old version with the newest.

2. Internal ID (u64 / Stable ID)

The Internal ID is a physical handle used internally by Laurus’s engines (Lexical and Vector) for high-performance operations.

  • Type: Unsigned 64-bit Integer (u64)
  • Role: Used for bitmap operations, point references, and routing between distributed nodes.
  • Immutability (Stable): Once assigned, an Internal ID never changes due to index merges (segment compaction) or restarts. This prevents inconsistencies in deletion logs and caches.

ID Structure (Shard-Prefixed)

Laurus employs a Shard-Prefixed Stable ID scheme designed for multi-node distributed environments.

| Bit Range | Name | Description |
|---|---|---|
| Bits 48–63 | Shard ID | Prefix identifying the node or partition (up to 65,535 shards). |
| Bits 0–47 | Local ID | Monotonically increasing document number within a shard (up to ~281 trillion documents). |

Why this structure?

  1. Zero-Cost Aggregation: Since u64 IDs are globally unique, the aggregator can perform fast sorting and deduplication without worrying about ID collisions between nodes.
  2. Fast Routing: The aggregator can immediately identify the physical node responsible for a document just by looking at the upper bits, avoiding expensive hash lookups.
  3. High-Performance Fetching: Internal IDs map directly to physical data structures. This allows Laurus to skip the “External-to-Internal ID” conversion step during retrieval, achieving O(1) access speed.
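The bit layout described above amounts to simple shift-and-mask packing. A sketch of what assembling and decomposing such an ID looks like (names are hypothetical, not the Laurus API):

```rust
/// Upper 16 bits hold the shard ID, lower 48 bits the local document number.
const LOCAL_BITS: u32 = 48;
const LOCAL_MASK: u64 = (1 << LOCAL_BITS) - 1;

fn pack_id(shard_id: u16, local_id: u64) -> u64 {
    debug_assert!(local_id <= LOCAL_MASK);
    ((shard_id as u64) << LOCAL_BITS) | (local_id & LOCAL_MASK)
}

fn shard_of(id: u64) -> u16 {
    (id >> LOCAL_BITS) as u16
}

fn local_of(id: u64) -> u64 {
    id & LOCAL_MASK
}
```

Because the shard ID occupies the high bits, an aggregator can recover the owning node with a single shift, and IDs within one shard stay monotonically ordered.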

ID Lifecycle

  1. Registration (engine.put_document() / engine.add_document()): User provides a document with an External ID.
  2. ID Assignment: The Engine combines the current shard_id with a new Local ID to issue a Shard-Prefixed Internal ID.
  3. Mapping: The engine maintains the relationship between the External ID and the new Internal ID.
  4. Search: Search results return the External ID (String), resolved from the Internal ID.
  5. Retrieval/Deletion: While the user-facing API accepts External IDs for convenience, the engine internally converts them to Internal IDs for near-instant processing.
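Steps 2 and 3 of the lifecycle can be sketched as follows; the allocator type and its field names are hypothetical, not Laurus's internals:

```rust
use std::collections::HashMap;

// Illustrative sketch of ID assignment and external-to-internal mapping.
struct IdAllocator {
    shard_id: u16,
    next_local: u64,
    mapping: HashMap<String, u64>, // External ID -> Internal ID
}

impl IdAllocator {
    fn assign(&mut self, external_id: &str) -> u64 {
        // Combine the shard prefix with a fresh monotonically increasing local ID.
        let internal = ((self.shard_id as u64) << 48) | self.next_local;
        self.next_local += 1;
        // An upsert overwrites the previous mapping for this external ID.
        self.mapping.insert(external_id.to_string(), internal);
        internal
    }
}

fn main() {
    let mut alloc = IdAllocator { shard_id: 1, next_local: 0, mapping: HashMap::new() };
    let a = alloc.assign("doc-1");
    let b = alloc.assign("doc-1"); // upsert: new internal ID, same external ID
    assert_ne!(a, b);
    assert_eq!(alloc.mapping["doc-1"], b);
}
```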

Persistence & WAL

Laurus uses a Write-Ahead Log (WAL) to ensure data durability. Every write operation is persisted to the WAL before modifying in-memory structures, guaranteeing that no data is lost even if the process crashes.

Write Path

sequenceDiagram
    participant App as Application
    participant Engine
    participant WAL as DocumentLog (WAL)
    participant Mem as In-Memory Buffers
    participant Disk as Storage (segments)

    App->>Engine: add_document() / delete_documents()
    Engine->>WAL: 1. Append operation to WAL
    Engine->>Mem: 2. Update in-memory buffers

    Note over Mem: Document is buffered but\nNOT yet searchable

    App->>Engine: commit()
    Engine->>Disk: 3. Flush segments to storage
    Engine->>WAL: 4. Truncate WAL
    Note over Disk: Documents are now\nsearchable and durable

Key Principles

  1. WAL-first: Every write (add or delete) is appended to the WAL before updating in-memory structures
  2. Buffered writes: In-memory buffers accumulate changes until commit() is called
  3. Atomic commit: commit() flushes all buffered changes to segment files and truncates the WAL
  4. Crash safety: If the process crashes between writes and commit, the WAL is replayed on the next startup
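A minimal sketch of principles 1-3, with a toy engine whose fields stand in for the real WAL, buffers, and segment storage:

```rust
// Toy model of the WAL-first write path: append to the log before touching
// memory, so a crash between the two steps loses nothing. All types are
// illustrative, not Laurus's internals.
struct Engine {
    wal: Vec<String>,       // stand-in for the append-only WAL
    buffer: Vec<String>,    // in-memory buffer (not yet searchable)
    committed: Vec<String>, // stand-in for segment files
}

impl Engine {
    fn add_document(&mut self, doc: &str) {
        self.wal.push(doc.to_string());    // 1. WAL first (durable)
        self.buffer.push(doc.to_string()); // 2. then the in-memory buffer
    }

    fn commit(&mut self) {
        self.committed.append(&mut self.buffer); // 3. flush to segments
        self.wal.clear();                        // 4. truncate the WAL
    }
}

fn main() {
    let mut e = Engine { wal: vec![], buffer: vec![], committed: vec![] };
    e.add_document("d1");
    assert_eq!(e.wal.len(), 1); // durable before commit
    e.commit();
    assert!(e.wal.is_empty()); // WAL truncated atomically with the flush
    assert_eq!(e.committed, vec!["d1".to_string()]);
}
```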

Write-Ahead Log (WAL)

The WAL is managed by the DocumentLog component and stored at the root level of the storage backend (engine.wal).

WAL Entry Types

| Entry Type | Description |
|---|---|
| Upsert | Document content + external ID + assigned internal ID |
| Delete | External ID of the document to remove |

WAL File

The WAL file (engine.wal) is an append-only binary log. Each entry is self-contained with:

  • Operation type (add/delete)
  • Sequence number
  • Payload (document data or ID)

Recovery

When an engine is built (Engine::builder(...).build().await), it automatically checks for remaining WAL entries and replays them (the WAL is truncated on commit, so any remaining entries are from a crashed session):

graph TD
    Start["Engine::build()"] --> Check["Check WAL for\nuncommitted entries"]
    Check -->|"Entries found"| Replay["Replay operations\ninto in-memory buffers"]
    Replay --> Ready["Engine ready"]
    Check -->|"No entries"| Ready

Recovery is transparent — you do not need to handle it manually.
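Replay itself is simple to picture. A toy sketch, with an illustrative entry type (not Laurus's real WAL format):

```rust
// Sketch of WAL replay: entries left over from a crashed session are
// re-applied, in order, to the in-memory buffers.
enum WalEntry {
    Upsert { external_id: String },
    Delete { external_id: String },
}

fn replay(wal: &[WalEntry], buffer: &mut Vec<String>) {
    for entry in wal {
        match entry {
            WalEntry::Upsert { external_id } => buffer.push(external_id.clone()),
            WalEntry::Delete { external_id } => buffer.retain(|id| id != external_id),
        }
    }
}

fn main() {
    let wal = vec![
        WalEntry::Upsert { external_id: "doc-3".into() },
        WalEntry::Upsert { external_id: "doc-4".into() },
        WalEntry::Delete { external_id: "doc-3".into() },
    ];
    let mut buffer = Vec::new();
    replay(&wal, &mut buffer);
    assert_eq!(buffer, vec!["doc-4".to_string()]);
}
```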

The Commit Lifecycle

#![allow(unused)]
fn main() {
// 1. Add documents (buffered, not yet searchable)
engine.add_document("doc-1", doc1).await?;
engine.add_document("doc-2", doc2).await?;

// 2. Commit — flush to persistent storage
engine.commit().await?;
// Documents are now searchable

// 3. Add more documents
engine.add_document("doc-3", doc3).await?;

// 4. If the process crashes here, doc-3 is in the WAL
//    and will be recovered on next startup
}

When to Commit

| Strategy | Description | Use Case |
|---|---|---|
| After each document | Maximum durability; documents become searchable immediately | Real-time search with few writes |
| After a batch | Good balance of throughput and latency | Bulk indexing |
| Periodically | Maximum write throughput | High-volume ingestion |

Tip: Commits are relatively expensive because they flush segments to storage. For bulk indexing, batch many documents before calling commit().

Storage Layout

The engine uses PrefixedStorage to organize data:

<storage root>/
├── lexical/          # Inverted index segments
│   ├── seg-000/
│   │   ├── terms.dict
│   │   ├── postings.post
│   │   └── ...
│   └── metadata.json
├── vector/           # Vector index segments
│   ├── seg-000/
│   │   ├── graph.hnsw
│   │   ├── vectors.vecs
│   │   └── ...
│   └── metadata.json
├── documents/        # Document storage
│   └── ...
└── engine.wal        # Write-ahead log

Next Steps

Deletions & Compaction

Laurus uses a two-phase deletion strategy: fast logical deletion followed by periodic physical compaction.

Deleting Documents

#![allow(unused)]
fn main() {
// Delete a document by its external ID
engine.delete_documents("doc-1").await?;
engine.commit().await?;
}

Logical Deletion

When a document is deleted, it is not immediately removed from the index files. Instead:

graph LR
    Del["delete_documents('doc-1')"] --> Bitmap["Add internal ID\nto Deletion Bitmap"]
    Bitmap --> Search["Search skips\ndeleted IDs"]
  1. The document’s internal ID is added to a deletion bitmap
  2. The bitmap is checked during every search, filtering out deleted documents from results
  3. The original data remains in the segment files

Why Logical Deletion?

| Benefit | Description |
|---|---|
| Speed | O(1) — flipping a bit is instant |
| Immutable segments | Segment files are never modified in place, simplifying concurrency |
| Safe recovery | If a crash occurs, the deletion bitmap can be reconstructed from the WAL |
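The mechanism can be sketched in a few lines; the Segment type below is illustrative, not Laurus's internal representation:

```rust
use std::collections::HashSet;

// Minimal sketch of logical deletion: search filters results against a set
// of deleted internal IDs instead of rewriting segment files.
struct Segment {
    docs: Vec<(u64, &'static str)>, // (internal ID, content)
    deleted: HashSet<u64>,
}

impl Segment {
    fn delete(&mut self, internal_id: u64) {
        self.deleted.insert(internal_id); // O(1); files stay untouched
    }

    fn search(&self) -> Vec<&'static str> {
        self.docs
            .iter()
            .filter(|(id, _)| !self.deleted.contains(id)) // skip deleted IDs
            .map(|&(_, text)| text)
            .collect()
    }
}

fn main() {
    let mut seg = Segment { docs: vec![(1, "a"), (2, "b")], deleted: HashSet::new() };
    seg.delete(1);
    assert_eq!(seg.search(), vec!["b"]);
}
```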

Upserts (Update = Delete + Insert)

When you index a document with an existing external ID, Laurus performs an automatic upsert:

  1. The old document is logically deleted (its ID is added to the deletion bitmap)
  2. A new document is inserted with a new internal ID
  3. The external-to-internal ID mapping is updated
#![allow(unused)]
fn main() {
// First insert
engine.put_document("doc-1", doc_v1).await?;
engine.commit().await?;

// Update: old version is logically deleted, new version is inserted
engine.put_document("doc-1", doc_v2).await?;
engine.commit().await?;
}

Physical Compaction

Over time, logically deleted documents accumulate and waste space. Compaction reclaims this space by rewriting segment files without the deleted entries.

graph LR
    subgraph "Before Compaction"
        S1["Segment 0\ndoc-1 (deleted)\ndoc-2\ndoc-3 (deleted)"]
        S2["Segment 1\ndoc-4\ndoc-5"]
    end

    Compact["Compaction"]

    subgraph "After Compaction"
        S3["Segment 0\ndoc-2\ndoc-4\ndoc-5"]
    end

    S1 --> Compact
    S2 --> Compact
    Compact --> S3

What Compaction Does

  1. Reads all live (non-deleted) documents from existing segments
  2. Rebuilds the inverted index and/or vector index without deleted entries
  3. Writes new, clean segment files
  4. Removes the old segment files
  5. Resets the deletion bitmap
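The steps above can be sketched as one function over toy segment data (illustrative types, not Laurus's internals):

```rust
use std::collections::HashSet;

// Sketch of compaction: merge segments keeping only live documents, then
// reset the deletion set, since no deleted entries survive the rewrite.
fn compact(segments: Vec<Vec<u64>>, deleted: &mut HashSet<u64>) -> Vec<u64> {
    let merged: Vec<u64> = segments
        .into_iter()
        .flatten()
        .filter(|id| !deleted.contains(id)) // drop logically deleted docs
        .collect();
    deleted.clear(); // step 5: reset the deletion bitmap
    merged
}

fn main() {
    let mut deleted: HashSet<u64> = [1, 3].into_iter().collect();
    let segments = vec![vec![1, 2, 3], vec![4, 5]];
    assert_eq!(compact(segments, &mut deleted), vec![2, 4, 5]);
    assert!(deleted.is_empty());
}
```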

Cost and Frequency

| Aspect | Detail |
|---|---|
| CPU cost | High — rebuilds index structures from scratch |
| I/O cost | High — reads all data, writes new segments |
| Blocking | Searches continue during compaction (reads see the old segments until the new ones are ready) |
| Frequency | Run when deleted documents exceed a threshold (e.g., 10-20% of total) |

When to Compact

  • Low-write workloads: Compact periodically (e.g., daily or weekly)
  • High-write workloads: Compact when the deletion ratio exceeds a threshold
  • After bulk updates: Compact after a large batch of upserts

Deletion Bitmap

The deletion bitmap tracks which internal IDs have been deleted:

  • Storage: HashSet of deleted document IDs (AHashSet<u64>)
  • Lookup: O(1) — hash set lookup

The bitmap is persisted alongside the index segments and is rebuilt from the WAL during recovery.

Next Steps

Error Handling

Laurus uses a unified error type for all operations. Understanding the error system helps you write robust applications that handle failures gracefully.

LaurusError

All Laurus operations return Result<T>, which is an alias for std::result::Result<T, LaurusError>.

LaurusError is an enum with variants for each category of failure:

| Variant | Description | Common Causes |
|---|---|---|
| Io | I/O errors | File not found, permission denied, disk full |
| Index | Index operation errors | Corrupt index, segment read failure |
| Schema | Schema-related errors | Unknown field name, type mismatch |
| Analysis | Text analysis errors | Tokenizer failure, invalid filter config |
| Query | Query parsing/execution errors | Malformed Query DSL, unknown field in query |
| Storage | Storage backend errors | Failed to open storage, write failure |
| Field | Field definition errors | Invalid field options, duplicate field name |
| BenchmarkFailed | Benchmark errors | Benchmark execution failure |
| ThreadJoinError | Thread join errors | Panic in a worker thread |
| Json | JSON serialization errors | Malformed document JSON |
| Anyhow | Wrapped anyhow errors | Errors from third-party crates via anyhow |
| InvalidOperation | Invalid operation | Searching before commit, double close |
| ResourceExhausted | Resource limits exceeded | Out of memory, too many open files |
| SerializationError | Binary serialization errors | Corrupt data on disk |
| OperationCancelled | Operation was cancelled | Timeout, user cancellation |
| NotImplemented | Feature not available | Unimplemented operation |
| Other | Generic errors | Timeout, invalid config, invalid argument |

Basic Error Handling

Using the ? Operator

The simplest approach — propagate errors to the caller:

#![allow(unused)]
fn main() {
use laurus::{Engine, Result};

async fn index_documents(engine: &Engine) -> Result<()> {
    let doc = laurus::Document::builder()
        .add_text("title", "Rust Programming")
        .build();

    engine.put_document("doc1", doc).await?;
    engine.commit().await?;
    Ok(())
}
}

Matching on Error Variants

When you need different behavior for different error types:

#![allow(unused)]
fn main() {
use laurus::{Engine, LaurusError};

async fn safe_search(engine: &Engine, query: &str) {
    match engine.search(/* request */).await {
        Ok(results) => {
            for result in results {
                println!("{}: {}", result.id, result.score);
            }
        }
        Err(LaurusError::Query(msg)) => {
            eprintln!("Invalid query syntax: {}", msg);
        }
        Err(LaurusError::Io(e)) => {
            eprintln!("Storage I/O error: {}", e);
        }
        Err(e) => {
            eprintln!("Unexpected error: {}", e);
        }
    }
}
}

Classifying Error Types

Since LaurusError implements std::error::Error and is a plain enum, you can classify errors with ordinary pattern matching:

#![allow(unused)]
fn main() {
use laurus::LaurusError;

fn is_retriable(error: &LaurusError) -> bool {
    matches!(error, LaurusError::Io(_) | LaurusError::ResourceExhausted(_))
}
}

Common Error Scenarios

Schema Mismatch

Adding a document with fields that don’t match the schema:

#![allow(unused)]
fn main() {
// Schema has "title" (Text) and "year" (Integer)
let doc = Document::builder()
    .add_text("title", "Hello")
    .add_text("unknown_field", "this field is not in schema")
    .build();

// Fields not in the schema are silently ignored during indexing.
// No error is raised — only schema-defined fields are processed.
}

Query Parsing Errors

Invalid Query DSL syntax returns a Query error:

#![allow(unused)]
fn main() {
use laurus::engine::query::UnifiedQueryParser;

let parser = UnifiedQueryParser::new();
match parser.parse("title:\"unclosed phrase") {
    Ok(request) => { /* ... */ }
    Err(LaurusError::Query(msg)) => {
        // msg contains details about the parse failure
        eprintln!("Bad query: {}", msg);
    }
    Err(e) => { /* other errors */ }
}
}

Storage I/O Errors

File-based storage may encounter I/O errors:

#![allow(unused)]
fn main() {
use laurus::storage::{StorageConfig, StorageFactory};

match StorageFactory::open(StorageConfig::File {
    path: "/nonexistent/path".into(),
    loading_mode: Default::default(),
}) {
    Ok(storage) => { /* ... */ }
    Err(LaurusError::Io(e)) => {
        eprintln!("Cannot open storage: {}", e);
    }
    Err(e) => { /* other errors */ }
}
}

Convenience Constructors

LaurusError provides factory methods for creating errors in custom implementations:

| Method | Creates |
|---|---|
| LaurusError::index(msg) | Index variant |
| LaurusError::schema(msg) | Schema variant |
| LaurusError::analysis(msg) | Analysis variant |
| LaurusError::query(msg) | Query variant |
| LaurusError::storage(msg) | Storage variant |
| LaurusError::field(msg) | Field variant |
| LaurusError::other(msg) | Other variant |
| LaurusError::cancelled(msg) | OperationCancelled variant |
| LaurusError::invalid_argument(msg) | Other with “Invalid argument” prefix |
| LaurusError::invalid_config(msg) | Other with “Invalid configuration” prefix |
| LaurusError::not_found(msg) | Other with “Not found” prefix |
| LaurusError::timeout(msg) | Other with “Timeout” prefix |

These are useful when implementing custom Analyzer, Embedder, or Storage traits:

#![allow(unused)]
fn main() {
use laurus::{LaurusError, Result};

fn validate_dimension(dim: usize) -> Result<()> {
    if dim == 0 {
        return Err(LaurusError::invalid_argument("dimension must be > 0"));
    }
    Ok(())
}
}

Automatic Conversions

LaurusError implements From for common error types, so they convert automatically with ?:

| Source Type | Target Variant |
|---|---|
| std::io::Error | LaurusError::Io |
| serde_json::Error | LaurusError::Json |
| anyhow::Error | LaurusError::Anyhow |
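The same From-based mechanism is easy to reproduce for your own wrapper errors; MyError below is a stand-in for illustration, not part of Laurus:

```rust
// Sketch of the From-based conversion that powers `?`: a std::io::Error
// converts into the wrapper variant automatically at the `?` site.
#[derive(Debug)]
enum MyError {
    Io(std::io::Error),
}

impl From<std::io::Error> for MyError {
    fn from(e: std::io::Error) -> Self {
        MyError::Io(e)
    }
}

fn read_config(path: &str) -> Result<String, MyError> {
    // On failure, `?` calls MyError::from(io_error) before returning.
    let data = std::fs::read_to_string(path)?;
    Ok(data)
}

fn main() {
    assert!(matches!(read_config("/no/such/file"), Err(MyError::Io(_))));
}
```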

Next Steps

Extensibility

Laurus uses trait-based abstractions for its core components. You can implement these traits to provide custom analyzers, embedders, and storage backends.

Custom Analyzer

Implement the Analyzer trait to create a custom text analysis pipeline:

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::analyzer::Analyzer;
use laurus::analysis::token::{Token, TokenStream};
use laurus::Result;

#[derive(Debug)]
struct ReverseAnalyzer;

impl Analyzer for ReverseAnalyzer {
    fn analyze(&self, text: &str) -> Result<TokenStream> {
        let tokens: Vec<Token> = text
            .split_whitespace()
            .enumerate()
            .map(|(i, word)| Token {
                text: word.chars().rev().collect(),
                position: i,
                ..Default::default()
            })
            .collect();
        Ok(Box::new(tokens.into_iter()))
    }

    fn name(&self) -> &str {
        "reverse"
    }

    fn as_any(&self) -> &dyn std::any::Any {
        self
    }
}
}

Required Methods

MethodDescription
analyze(&self, text: &str) -> Result<TokenStream>Process text into a stream of tokens
name(&self) -> &strReturn a unique identifier for this analyzer
as_any(&self) -> &dyn AnyEnable downcasting to the concrete type

Using a Custom Analyzer

Pass your analyzer to EngineBuilder:

#![allow(unused)]
fn main() {
use std::sync::Arc;

let analyzer = Arc::new(ReverseAnalyzer);
let engine = Engine::builder(storage, schema)
    .analyzer(analyzer)
    .build()
    .await?;
}

For per-field analyzers, wrap with PerFieldAnalyzer:

#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;
use laurus::analysis::analyzer::standard::StandardAnalyzer;

let per_field = PerFieldAnalyzer::new(Arc::new(StandardAnalyzer::new()?));
per_field.add_analyzer("custom_field", Arc::new(ReverseAnalyzer));

let engine = Engine::builder(storage, schema)
    .analyzer(Arc::new(per_field))
    .build()
    .await?;
}

Custom Embedder

Implement the Embedder trait to integrate your own vector embedding model:

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use laurus::embedding::embedder::{Embedder, EmbedInput, EmbedInputType};
use laurus::vector::core::vector::Vector;
use laurus::{LaurusError, Result};

#[derive(Debug)]
struct MyEmbedder {
    dimension: usize,
}

#[async_trait]
impl Embedder for MyEmbedder {
    async fn embed(&self, input: &EmbedInput<'_>) -> Result<Vector> {
        match input {
            EmbedInput::Text(text) => {
                // Your embedding logic here
                let vector = vec![0.0f32; self.dimension];
                Ok(Vector::new(vector))
            }
            _ => Err(LaurusError::invalid_argument(
                "this embedder only supports text input",
            )),
        }
    }

    fn supported_input_types(&self) -> Vec<EmbedInputType> {
        vec![EmbedInputType::Text]
    }

    fn name(&self) -> &str {
        "my-embedder"
    }

    fn as_any(&self) -> &dyn std::any::Any {
        self
    }
}
}

Required Methods

MethodDescription
async embed(&self, input: &EmbedInput) -> Result<Vector>Generate an embedding vector for the given input
supported_input_types(&self) -> Vec<EmbedInputType>Declare supported input types (Text, Image)
as_any(&self) -> &dyn AnyEnable downcasting

Optional Methods

| Method | Default | Description |
|---|---|---|
| async embed_batch(&self, inputs) -> Result<Vec<Vector>> | Sequential calls to embed | Override for batch optimization |
| name(&self) -> &str | "unknown" | Identifier for logging |
| supports(&self, input_type) -> bool | Checks supported_input_types | Input type support check |
| supports_text() -> bool | Checks for Text | Text support shorthand |
| supports_image() -> bool | Checks for Image | Image support shorthand |
| is_multimodal() -> bool | Both text and image | Multimodal check |

Using a Custom Embedder

#![allow(unused)]
fn main() {
let embedder = Arc::new(MyEmbedder { dimension: 384 });
let engine = Engine::builder(storage, schema)
    .embedder(embedder)
    .build()
    .await?;
}

For per-field embedders, wrap with PerFieldEmbedder:

#![allow(unused)]
fn main() {
use laurus::embedding::per_field::PerFieldEmbedder;

let per_field = PerFieldEmbedder::new(Arc::new(MyEmbedder { dimension: 384 }));
per_field.add_embedder("image_vec", Arc::new(ClipEmbedder::new()?));

let engine = Engine::builder(storage, schema)
    .embedder(Arc::new(per_field))
    .build()
    .await?;
}

Custom Storage

Implement the Storage trait to add a new storage backend:

#![allow(unused)]
fn main() {
use laurus::storage::{Storage, StorageInput, StorageOutput, LoadingMode, FileMetadata};
use laurus::Result;

#[derive(Debug)]
struct S3Storage {
    bucket: String,
    prefix: String,
}

impl Storage for S3Storage {
    fn loading_mode(&self) -> LoadingMode {
        LoadingMode::Eager  // S3 requires full download
    }

    fn open_input(&self, name: &str) -> Result<Box<dyn StorageInput>> {
        // Download from S3 and return a reader
        todo!()
    }

    fn create_output(&self, name: &str) -> Result<Box<dyn StorageOutput>> {
        // Create an upload stream to S3
        todo!()
    }

    fn create_output_append(&self, name: &str) -> Result<Box<dyn StorageOutput>> {
        todo!()
    }

    fn file_exists(&self, name: &str) -> bool {
        todo!()
    }

    fn delete_file(&self, name: &str) -> Result<()> {
        todo!()
    }

    fn list_files(&self) -> Result<Vec<String>> {
        todo!()
    }

    fn file_size(&self, name: &str) -> Result<u64> {
        todo!()
    }

    fn metadata(&self, name: &str) -> Result<FileMetadata> {
        todo!()
    }

    fn rename_file(&self, old_name: &str, new_name: &str) -> Result<()> {
        todo!()
    }

    fn create_temp_output(&self, prefix: &str) -> Result<(String, Box<dyn StorageOutput>)> {
        todo!()
    }

    fn sync(&self) -> Result<()> {
        todo!()
    }

    fn close(&mut self) -> Result<()> {
        todo!()
    }
}
}

Required Methods

MethodDescription
open_input(name) -> Result<Box<dyn StorageInput>>Open a file for reading
create_output(name) -> Result<Box<dyn StorageOutput>>Create a file for writing
create_output_append(name) -> Result<Box<dyn StorageOutput>>Open a file for appending
file_exists(name) -> boolCheck if a file exists
delete_file(name) -> Result<()>Delete a file
list_files() -> Result<Vec<String>>List all files
file_size(name) -> Result<u64>Get file size in bytes
metadata(name) -> Result<FileMetadata>Get file metadata
rename_file(old, new) -> Result<()>Rename a file
create_temp_output(prefix) -> Result<(String, Box<dyn StorageOutput>)>Create a temporary file
sync() -> Result<()>Flush all pending writes
close(&mut self) -> Result<()>Close storage and release resources

Optional Methods

| Method | Default | Description |
|---|---|---|
| loading_mode() -> LoadingMode | LoadingMode::Eager | Preferred data loading mode |

Thread Safety

All three traits require Send + Sync. This means your implementations must be safe to share across threads. Use Arc<Mutex<_>> or lock-free data structures for shared mutable state.
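For example, a thread-safe hit counter that a custom Storage or Embedder implementation might keep as shared state (the counter itself is illustrative):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Shared mutable state behind Arc<Mutex<_>> keeps a component Send + Sync.
#[derive(Clone)]
struct HitCounter {
    hits: Arc<Mutex<u64>>,
}

impl HitCounter {
    fn record(&self) {
        // &self is enough: the Mutex provides interior mutability safely.
        *self.hits.lock().unwrap() += 1;
    }
}

fn main() {
    let counter = HitCounter { hits: Arc::new(Mutex::new(0)) };
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = counter.clone(); // each thread gets its own Arc handle
            thread::spawn(move || {
                for _ in 0..1_000 {
                    c.record();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.hits.lock().unwrap(), 4_000);
}
```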

Next Steps

API Reference

This page provides a quick reference of the most important types and methods in Laurus. For full details, generate the Rustdoc:

cargo doc --open

Engine

The central coordinator for all indexing and search operations.

MethodDescription
Engine::builder(storage, schema)Create an EngineBuilder
engine.put_document(id, doc).await?Upsert a document (replace if ID exists)
engine.add_document(id, doc).await?Add a document as a chunk (multiple chunks can share an ID)
engine.delete_documents(id).await?Delete all documents/chunks by external ID
engine.get_documents(id).await?Get all documents/chunks by external ID
engine.search(request).await?Execute a search request
engine.commit().await?Flush all pending changes to storage
engine.add_field(name, field_option).await?Dynamically add a new field to the schema at runtime
engine.delete_field(name).await?Remove a field from the schema at runtime
engine.schema()Return the current Schema
engine.stats()?Get index statistics

put_document vs add_document: put_document performs an upsert — if a document with the same external ID already exists, it is deleted and replaced. add_document always appends, allowing multiple document chunks to share the same external ID. See Schema & Fields — Indexing Documents for details.

EngineBuilder

MethodDescription
EngineBuilder::new(storage, schema)Create a builder with storage and schema
.analyzer(Arc<dyn Analyzer>)Set the text analyzer (default: StandardAnalyzer)
.embedder(Arc<dyn Embedder>)Set the vector embedder (optional)
.build().await?Build the Engine

Schema

Defines document structure.

MethodDescription
Schema::builder()Create a SchemaBuilder

SchemaBuilder

MethodDescription
.add_text_field(name, TextOption)Add a full-text field
.add_integer_field(name, IntegerOption)Add an integer field
.add_float_field(name, FloatOption)Add a float field
.add_boolean_field(name, BooleanOption)Add a boolean field
.add_datetime_field(name, DateTimeOption)Add a datetime field
.add_geo_field(name, GeoOption)Add a geographic field
.add_bytes_field(name, BytesOption)Add a binary field
.add_hnsw_field(name, HnswOption)Add an HNSW vector field
.add_flat_field(name, FlatOption)Add a Flat vector field
.add_ivf_field(name, IvfOption)Add an IVF vector field
.add_default_field(name)Set a default search field
.build()Build the Schema

Document

A collection of named field values.

MethodDescription
Document::builder()Create a DocumentBuilder
doc.get(name)Get a field value by name
doc.has_field(name)Check if a field exists
doc.field_names()Get all field names

DocumentBuilder

MethodDescription
.add_text(name, value)Add a text field
.add_integer(name, value)Add an integer field
.add_float(name, value)Add a float field
.add_boolean(name, value)Add a boolean field
.add_datetime(name, value)Add a datetime field
.add_vector(name, vec)Add a pre-computed vector
.add_geo(name, lat, lon)Add a geographic point
.add_bytes(name, data)Add binary data
.build()Build the Document

Search

SearchRequestBuilder

MethodDescription
SearchRequestBuilder::new()Create a new builder
.query_dsl(dsl)Set a unified DSL string (parsed at search time)
.lexical_query(query)Set the lexical search query (LexicalSearchQuery)
.vector_query(query)Set the vector search query (VectorSearchQuery)
.filter_query(query)Set a pre-filter query
.fusion_algorithm(algo)Set the fusion algorithm (default: RRF)
.limit(n)Maximum results (default: 10)
.offset(n)Skip N results (default: 0)
.add_field_boost(field, boost)Add a field-level boost for lexical search
.lexical_min_score(f32)Set minimum score threshold for lexical search
.lexical_timeout_ms(u64)Set lexical search timeout in milliseconds
.lexical_parallel(bool)Enable parallel lexical search
.sort_by(SortField)Set sort order for lexical search results
.vector_score_mode(VectorScoreMode)Set score combination mode for vector search
.vector_min_score(f32)Set minimum score threshold for vector search
.build()Build the SearchRequest

LexicalSearchQuery

VariantDescription
LexicalSearchQuery::Dsl(String)Query specified as a DSL string (parsed at search time)
LexicalSearchQuery::Obj(Box<dyn Query>)Query specified as a pre-built Query object

VectorSearchQuery

VariantDescription
VectorSearchQuery::Payloads(Vec<QueryPayload>)Raw payloads (text, bytes, etc.) to be embedded at search time
VectorSearchQuery::Vectors(Vec<QueryVector>)Pre-embedded query vectors ready for nearest-neighbor search

SearchResult

| Field | Type | Description |
|---|---|---|
| id | String | External document ID |
| score | f32 | Relevance score |
| document | Option<Document> | Document content (if loaded) |

FusionAlgorithm

| Variant | Description |
|---|---|
| RRF { k: f64 } | Reciprocal Rank Fusion (default k=60.0) |
| WeightedSum { lexical_weight, vector_weight } | Linear combination of scores |
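The RRF formula itself is small; a sketch using the standard RRF definition with 1-based ranks (this mirrors the algorithm, not Laurus's exact implementation):

```rust
// Reciprocal Rank Fusion: a document's fused score is the sum of
// 1 / (k + rank_i) over every result list it appears in.
fn rrf_score(ranks: &[usize], k: f64) -> f64 {
    ranks.iter().map(|&r| 1.0 / (k + r as f64)).sum()
}

fn main() {
    // Ranked 1st in the lexical list and 3rd in the vector list, default k = 60
    let fused = rrf_score(&[1, 3], 60.0);
    assert!((fused - (1.0 / 61.0 + 1.0 / 63.0)).abs() < 1e-12);
}
```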

Query Types (Lexical)

| Query | Description | Example |
|---|---|---|
| TermQuery::new(field, term) | Exact term match | TermQuery::new("body", "rust") |
| PhraseQuery::new(field, terms) | Exact phrase | PhraseQuery::new("body", vec!["machine".into(), "learning".into()]) |
| BooleanQueryBuilder::new() | Boolean combination | .must(q1).should(q2).must_not(q3).build() |
| FuzzyQuery::new(field, term) | Fuzzy match (default max_edits=2) | FuzzyQuery::new("body", "programing").max_edits(1) |
| WildcardQuery::new(field, pattern) | Wildcard | WildcardQuery::new("file", "*.pdf") |
| NumericRangeQuery::new(...) | Numeric range | See Lexical Search |
| GeoQuery::within_radius(...) | Geo radius | See Lexical Search |
| SpanNearQuery::new(...) | Proximity | See Lexical Search |
| PrefixQuery::new(field, prefix) | Prefix match | PrefixQuery::new("body", "pro") |
| RegexpQuery::new(field, pattern)? | Regex match | RegexpQuery::new("body", "^pro.*ing$")? |

Query Parsers

ParserDescription
QueryParser::new(analyzer)Parse lexical DSL queries
VectorQueryParser::new(embedder)Parse vector DSL queries
UnifiedQueryParser::new(lexical, vector)Parse hybrid DSL queries

Analyzers

TypeDescription
StandardAnalyzerRegexTokenizer + lowercase + stop words
SimpleAnalyzerTokenization only (no filtering)
EnglishAnalyzerRegexTokenizer + lowercase + English stop words
JapaneseAnalyzerJapanese morphological analysis
KeywordAnalyzerNo tokenization (exact match)
PipelineAnalyzerCustom tokenizer + filter chain
PerFieldAnalyzerPer-field analyzer dispatch

Embedders

| Type | Feature Flag | Description |
|---|---|---|
| CandleBertEmbedder | embeddings-candle | Local BERT model |
| OpenAIEmbedder | embeddings-openai | OpenAI API |
| CandleClipEmbedder | embeddings-multimodal | Local CLIP model |
| PrecomputedEmbedder | (default) | Pre-computed vectors |
| PerFieldEmbedder | (default) | Per-field embedder dispatch |

Storage

TypeDescription
MemoryStorageIn-memory (non-durable)
FileStorageFile-system based (supports use_mmap for memory-mapped I/O)
StorageFactory::create(config)Create from config

DataValue

| Variant | Rust Type |
|---|---|
| DataValue::Null | |
| DataValue::Bool(bool) | bool |
| DataValue::Int64(i64) | i64 |
| DataValue::Float64(f64) | f64 |
| DataValue::Text(String) | String |
| DataValue::Bytes(Vec<u8>, Option<String>) | (data, mime_type) |
| DataValue::Vector(Vector) | Vector |
| DataValue::DateTime(DateTime<Utc>) | chrono::DateTime<Utc> |
| DataValue::Geo(f64, f64) | (latitude, longitude) |

CLI Overview

Laurus provides a command-line tool laurus that lets you create indexes, manage documents, and run search queries without writing code.

Features

  • Index management – Create and inspect indexes from TOML schema files, with an interactive schema generator
  • Document CRUD – Add, retrieve, and delete documents via JSON
  • Search – Execute queries using the Query DSL
  • Dual output – Human-readable tables or machine-parseable JSON
  • Interactive REPL – Explore your index in a live session
  • gRPC server – Start a gRPC server with laurus serve

Getting Started

# Install
cargo install laurus-cli

# Generate a schema interactively
laurus create schema

# Create an index from the schema
laurus --index-dir ./my_index create index --schema schema.toml

# Add a document
laurus --index-dir ./my_index add doc --id doc1 --data '{"title":"Hello","body":"World"}'

# Commit changes
laurus --index-dir ./my_index commit

# Search
laurus --index-dir ./my_index search "body:world"

See the sub-sections for detailed documentation:

Installation

From crates.io

cargo install laurus-cli

This installs the laurus binary to ~/.cargo/bin/.

From source

git clone https://github.com/mosuka/laurus.git
cd laurus
cargo install --path laurus-cli

Verify

laurus --version

Hands-on Tutorial

This tutorial walks you through a complete workflow using the laurus CLI: creating a schema, building an index, adding documents, searching, updating, deleting, and using the interactive REPL.

Prerequisites

Step 1: Create a Schema

First, create a schema file that defines your index structure. You can generate one interactively:

laurus create schema

The interactive wizard guides you through defining fields, their types, and options. For this tutorial, create a schema file manually instead:

cat > schema.toml << 'EOF'
default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = false

[fields.category.Text]
indexed = true
stored = true
term_vectors = false
EOF

This defines three text fields. The default_fields setting means queries without a field prefix will search title and body.

Step 2: Create an Index

Create an index using the schema:

laurus --index-dir ./tutorial_data create index --schema schema.toml

Verify the index was created:

laurus --index-dir ./tutorial_data get stats

The output shows the document count is 0.

Step 3: Add Documents

Add documents to the index. Each document needs an ID and a JSON object with field values:

laurus --index-dir ./tutorial_data add doc \
  --id doc001 \
  --data '{"title":"Introduction to Rust Programming","body":"Rust is a modern systems programming language that focuses on safety, speed, and concurrency.","category":"programming"}'
laurus --index-dir ./tutorial_data add doc \
  --id doc002 \
  --data '{"title":"Web Development with Rust","body":"Building web applications with Rust has become increasingly popular. Frameworks like Actix and Rocket make it easy to create fast and secure web services.","category":"web-development"}'
laurus --index-dir ./tutorial_data add doc \
  --id doc003 \
  --data '{"title":"Python for Data Science","body":"Python is the most popular language for data science and machine learning. Libraries like NumPy and Pandas provide powerful tools for data analysis.","category":"data-science"}'

Step 4: Commit Changes

Documents are not searchable until committed:

laurus --index-dir ./tutorial_data commit

Step 5: Search Documents

Search for documents containing “rust”:

laurus --index-dir ./tutorial_data search "rust"

This searches the default fields (title and body). Results show doc001 and doc002.

Search only in the title field:

laurus --index-dir ./tutorial_data search "title:python"

Only doc003 is returned.

laurus --index-dir ./tutorial_data search "category:programming"

Only doc001 is returned.

Boolean Queries

Combine conditions with + (must) and - (must not):

laurus --index-dir ./tutorial_data search "+body:rust -body:web"

Only doc001 is returned (contains “rust” but not “web”).

Search for an exact phrase:

laurus --index-dir ./tutorial_data search 'body:"data science"'

Only doc003 is returned.

Search with typo tolerance using ~:

laurus --index-dir ./tutorial_data search "body:programing~1"

Matches “programming” despite the typo.

JSON Output

Get results in JSON format for programmatic use:

laurus --index-dir ./tutorial_data --format json search "rust"

Step 6: Retrieve a Document

Fetch a specific document by ID:

laurus --index-dir ./tutorial_data get docs --id doc001

Step 7: Delete a Document

Delete a document and commit the change:

laurus --index-dir ./tutorial_data delete docs --id doc003
laurus --index-dir ./tutorial_data commit

Verify it was deleted:

laurus --index-dir ./tutorial_data search "python"

No results are returned.

Step 8: Use the REPL

The REPL provides an interactive session for exploring your index:

laurus --index-dir ./tutorial_data repl

Try these commands in the REPL:

> get stats
> search rust
> add doc doc004 {"title":"Go Programming","body":"Go is a statically typed language designed for simplicity and efficiency.","category":"programming"}
> commit
> search programming
> get docs doc004
> delete docs doc004
> commit
> quit

The REPL supports command history (Up/Down arrows) and line editing.

Step 9: Clean Up

Remove the tutorial data:

rm -rf ./tutorial_data schema.toml

Next Steps

Command Reference

Global Options

Every command accepts these options:

| Option | Environment Variable | Default | Description |
|---|---|---|---|
| `--index-dir <PATH>` | `LAURUS_INDEX_DIR` | `./laurus_index` | Path to the index data directory |
| `--format <FORMAT>` | | `table` | Output format: `table` or `json` |

# Example: use JSON output with a custom data directory
laurus --index-dir /var/data/my_index --format json search "title:rust"

create — Create a Resource

create index

Create a new index. If --schema is given, the command uses that TOML file; otherwise it launches the interactive schema wizard.

laurus create index [--schema <FILE>]

Arguments:

| Flag | Required | Description |
|---|---|---|
| `--schema <FILE>` | No | Path to a TOML file defining the index schema. When omitted, the command checks if a schema.toml already exists in the index directory and uses it; otherwise the interactive wizard is launched. |

Schema file format:

The schema file follows the same structure as the Schema type in the Laurus library. See Schema Format Reference for full details. Example:

default_fields = ["title", "body"]

[fields.title.Text]
stored = true
indexed = true

[fields.body.Text]
stored = true
indexed = true

[fields.category.Text]
stored = true
indexed = true

Examples:

# From a schema file
laurus --index-dir ./my_index create index --schema schema.toml
# Index created at ./my_index.

# Interactive wizard (no --schema flag)
laurus --index-dir ./my_index create index
# === Laurus Schema Generator ===
# Field name: title
# ...
# Index created at ./my_index.

Note: If both schema.toml and store/ already exist, an error is returned. Delete the index directory to recreate it. If only schema.toml exists (e.g. after an interrupted creation), running create index without --schema recovers the index by creating the missing storage from the existing schema.

create schema

Interactively generate a schema TOML file through a guided wizard.

laurus create schema [--output <FILE>]

Arguments:

| Flag | Required | Default | Description |
|---|---|---|---|
| `--output <FILE>` | No | `schema.toml` | Output file path for the generated schema |

The wizard guides you through:

  1. Field definition — Enter a field name, select the type, and configure type-specific options
  2. Repeat — Add as many fields as needed
  3. Default fields — Select which lexical fields to use as default search fields
  4. Preview — Review the generated TOML before saving
  5. Save — Write the schema file

Supported field types:

| Type | Category | Options |
|---|---|---|
| Text | Lexical | indexed, stored, term_vectors |
| Integer | Lexical | indexed, stored |
| Float | Lexical | indexed, stored |
| Boolean | Lexical | indexed, stored |
| DateTime | Lexical | indexed, stored |
| Geo | Lexical | indexed, stored |
| Bytes | Lexical | stored |
| Hnsw | Vector | dimension, distance, m, ef_construction |
| Flat | Vector | dimension, distance |
| Ivf | Vector | dimension, distance, n_clusters, n_probe |

Example:

# Generate schema.toml interactively
laurus create schema

# Specify output path
laurus create schema --output my_schema.toml

# Then create an index from the generated schema
laurus create index --schema schema.toml

get — Get a Resource

get stats

Display statistics about the index.

laurus get stats

Table output example:

Document count: 42

Vector fields:
╭──────────┬─────────┬───────────╮
│ Field    │ Vectors │ Dimension │
├──────────┼─────────┼───────────┤
│ text_vec │ 42      │ 384       │
╰──────────┴─────────┴───────────╯

JSON output example:

laurus --format json get stats
{
  "document_count": 42,
  "fields": {
    "text_vec": {
      "vector_count": 42,
      "dimension": 384
    }
  }
}

get schema

Display the current index schema as JSON.

laurus get schema

Example:

laurus get schema
# {
#   "fields": { ... },
#   "default_fields": ["title", "body"],
#   ...
# }

get docs

Retrieve all documents (including chunks) by external ID.

laurus get docs --id <ID>

Table output example:

╭──────┬─────────────────────────────────────────╮
│ ID   │ Fields                                  │
├──────┼─────────────────────────────────────────┤
│ doc1 │ body: This is a test, title: Hello World │
╰──────┴─────────────────────────────────────────╯

JSON output example:

laurus --format json get docs --id doc1
[
  {
    "id": "doc1",
    "document": {
      "title": "Hello World",
      "body": "This is a test document."
    }
  }
]

add — Add a Resource

add doc

Add a document to the index. Documents are not searchable until commit is called.

laurus add doc --id <ID> --data <JSON>

Arguments:

| Flag | Required | Description |
|---|---|---|
| `--id <ID>` | Yes | External document ID (string) |
| `--data <JSON>` | Yes | Document fields as a JSON string |

The JSON format is a flat object mapping field names to values:

{
  "title": "Introduction to Rust",
  "body": "Rust is a systems programming language.",
  "category": "programming"
}

Example:

laurus add doc --id doc1 --data '{"title":"Hello World","body":"This is a test document."}'
# Document 'doc1' added. Run 'commit' to persist changes.

Tip: Multiple documents can share the same external ID (chunking pattern). Use add doc for each chunk.
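For example, a long article can be split into chunks that share one external ID (a usage sketch; how you split the text is up to you):

```sh
# Two chunks of the same logical document share the ID 'doc9'
laurus add doc --id doc9 --data '{"title":"Long Article (1/2)","body":"First half of the article."}'
laurus add doc --id doc9 --data '{"title":"Long Article (2/2)","body":"Second half of the article."}'
laurus commit
```

`get docs --id doc9` and `delete docs --id doc9` then operate on both chunks at once.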


put — Put (Upsert) a Resource

put doc

Put (upsert) a document into the index. If a document with the same ID already exists, all its chunks are deleted before the new document is indexed. Documents are not searchable until commit is called.

laurus put doc --id <ID> --data <JSON>

Arguments:

| Flag | Required | Description |
|---|---|---|
| `--id <ID>` | Yes | External document ID (string) |
| `--data <JSON>` | Yes | Document fields as a JSON string |

Example:

laurus put doc --id doc1 --data '{"title":"Updated Title","body":"This replaces the existing document."}'
# Document 'doc1' put (upserted). Run 'commit' to persist changes.

Note: Unlike add doc, put doc replaces all existing chunks for the given ID. Use add doc when you want to append chunks, and put doc when you want to replace the entire document.


add field

Dynamically add a new field to an existing index.

laurus add field --index-dir ./data \
    --name category \
    --field-option '{"Text": {"indexed": true, "stored": true}}'

The --field-option argument accepts a JSON string using the same externally-tagged format as the schema file. The schema is automatically persisted after the field is added.


delete — Delete a Resource

delete docs

Delete all documents (including chunks) by external ID.

laurus delete docs --id <ID>

Example:

laurus delete docs --id doc1
# Documents 'doc1' deleted. Run 'commit' to persist changes.

delete field

Remove a field from the index schema.

laurus delete field --name <FIELD_NAME>

Example:

laurus delete field --name category
# Field 'category' deleted.

Existing indexed data for the field remains in storage but becomes inaccessible. Per-field analyzers and embedders are unregistered.


commit

Commit pending changes (additions and deletions) to the index. Until committed, changes are not visible to search.

laurus commit

Example:

laurus --index-dir ./my_index commit
# Changes committed successfully.

search

Execute a search query using the Query DSL.

laurus search <QUERY> [--limit <N>] [--offset <N>]

Arguments:

| Argument / Flag | Required | Default | Description |
|---|---|---|---|
| `<QUERY>` | Yes | | Query string in Laurus Query DSL |
| `--limit <N>` | No | 10 | Maximum number of results |
| `--offset <N>` | No | 0 | Number of results to skip |

Query syntax examples:

# Term query
laurus search "body:rust"

# Phrase query
laurus search 'body:"machine learning"'

# Boolean query
laurus search "+body:programming -body:python"

# Fuzzy query (typo tolerance)
laurus search "body:programing~2"

# Wildcard query
laurus search "title:intro*"

# Range query
laurus search "price:[10 TO 50]"

Table output example:

╭──────┬────────┬─────────────────────────────────────────╮
│ ID   │ Score  │ Fields                                  │
├──────┼────────┼─────────────────────────────────────────┤
│ doc1 │ 0.8532 │ body: Rust is a systems..., title: Intr │
│ doc3 │ 0.4210 │ body: JavaScript powers..., title: Web  │
╰──────┴────────┴─────────────────────────────────────────╯

JSON output example:

laurus --format json search "body:rust" --limit 5
[
  {
    "id": "doc1",
    "score": 0.8532,
    "document": {
      "title": "Introduction to Rust",
      "body": "Rust is a systems programming language."
    }
  }
]

repl

Start an interactive REPL session. See REPL for details.

laurus repl

serve

Start the gRPC server (and optionally the HTTP Gateway).

laurus serve [OPTIONS]

For startup options, configuration, and usage examples, see the laurus-server documentation:

Schema Format Reference

The schema file defines the structure of your index — what fields exist, their types, and how they are indexed. Laurus uses TOML format for schema files.

Overview

A schema consists of two top-level elements:

# Fields to search by default when a query does not specify a field.
default_fields = ["title", "body"]

# Field definitions. Each field has a name and a typed configuration.
[fields.<field_name>.<FieldType>]
# ... type-specific options
  • default_fields — A list of field names used as default search targets by the Query DSL. Only lexical fields (Text, Integer, Float, etc.) can be default fields. This key is optional and defaults to an empty list.
  • fields — A map of field names to their typed configuration. Each field must specify exactly one field type.

Field Naming

  • Field names are arbitrary strings (e.g., title, body_vec, created_at).
  • The _id field is reserved by Laurus for internal document ID management — do not use it.
  • Field names must be unique within a schema.

Field Types

Fields fall into two categories: Lexical (for keyword/full-text search) and Vector (for similarity search). A single field cannot be both.

Lexical Fields

Text

Full-text searchable field. Text is processed by the analysis pipeline (tokenization, normalization, stemming, etc.).

[fields.title.Text]
indexed = true       # Whether to index this field for search
stored = true        # Whether to store the original value for retrieval
term_vectors = false # Whether to store term positions (for phrase queries, highlighting)

| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | `true` | Enables searching this field |
| `stored` | bool | `true` | Stores the original value so it can be returned in results |
| `term_vectors` | bool | `true` | Stores term positions for phrase queries, highlighting, and more-like-this |

Integer

64-bit signed integer field. Supports range queries and exact match.

[fields.year.Integer]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | `true` | Enables range and exact-match queries |
| `stored` | bool | `true` | Stores the original value |

Float

64-bit floating point field. Supports range queries.

[fields.rating.Float]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | `true` | Enables range queries |
| `stored` | bool | `true` | Stores the original value |

Boolean

Boolean field (true / false).

[fields.published.Boolean]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | `true` | Enables filtering by boolean value |
| `stored` | bool | `true` | Stores the original value |

DateTime

UTC timestamp field. Supports range queries.

[fields.created_at.DateTime]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | `true` | Enables range queries on date/time |
| `stored` | bool | `true` | Stores the original value |

Geo

Geographic point field (latitude/longitude). Supports radius and bounding box queries.

[fields.location.Geo]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | `true` | Enables geo queries (radius, bounding box) |
| `stored` | bool | `true` | Stores the original value |

Bytes

Raw binary data field. Not indexed — stored only.

[fields.thumbnail.Bytes]
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| `stored` | bool | `true` | Stores the binary data |

Vector Fields

Vector fields are indexed for approximate nearest neighbor (ANN) search. They require a dimension (the length of each vector) and a distance metric.

Hnsw

Hierarchical Navigable Small World graph index. Best for most use cases — offers a good balance of speed and recall.

[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| `dimension` | integer | 128 | Vector dimensionality (must match your embedding model) |
| `distance` | string | `"Cosine"` | Distance metric (see Distance Metrics) |
| `m` | integer | 16 | Max bi-directional connections per node. Higher = better recall, more memory |
| `ef_construction` | integer | 200 | Search width during index construction. Higher = better quality, slower build |
| `base_weight` | float | 1.0 | Scoring weight in hybrid search fusion |
| `quantizer` | object | none | Optional quantization method (see Quantization) |

Tuning guidelines:

  • m: 12–48 is typical. Use higher values for higher-dimensional vectors.
  • ef_construction: 100–500. Higher values produce a better graph but increase build time.
  • dimension: Must exactly match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base, 1536 for text-embedding-3-small).
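Applying these guidelines, a field for 768-dimensional BERT-base embeddings might look like this (illustrative values, not library recommendations):

```toml
[fields.body_vec.Hnsw]
dimension = 768       # matches BERT-base output
distance = "Cosine"
m = 32                # raised from the default 16 for higher-dimensional vectors
ef_construction = 400 # better graph quality at the cost of build time
```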

Flat

Brute-force linear scan index. Provides exact results with no approximation. Best for small datasets (< 10,000 vectors).

[fields.embedding.Flat]
dimension = 384
distance = "Cosine"
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| `dimension` | integer | 128 | Vector dimensionality |
| `distance` | string | `"Cosine"` | Distance metric (see Distance Metrics) |
| `base_weight` | float | 1.0 | Scoring weight in hybrid search fusion |
| `quantizer` | object | none | Optional quantization method (see Quantization) |

Ivf

Inverted File Index. Clusters vectors and searches only a subset of clusters. Suitable for very large datasets.

[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 100
n_probe = 1
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| `dimension` | integer | (required) | Vector dimensionality |
| `distance` | string | `"Cosine"` | Distance metric (see Distance Metrics) |
| `n_clusters` | integer | 100 | Number of clusters. More clusters = finer partitioning |
| `n_probe` | integer | 1 | Number of clusters to search at query time. Higher = better recall, slower |
| `base_weight` | float | 1.0 | Scoring weight in hybrid search fusion |
| `quantizer` | object | none | Optional quantization method (see Quantization) |

Note: Unlike Hnsw and Flat, the dimension field in Ivf is required and has no default value.

Tuning guidelines:

  • n_clusters: A common heuristic is sqrt(N) where N is the total number of vectors.
  • n_probe: Start with 1 and increase until recall is acceptable. Typical range is 1–20.
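For example, with roughly 1,000,000 vectors the sqrt(N) heuristic suggests n_clusters = 1000; n_probe is then raised step by step until recall is acceptable (illustrative values):

```toml
[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 1000 # ~ sqrt(1_000_000)
n_probe = 8       # tuned upward from 1 until recall was acceptable
```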

Distance Metrics

The distance option for vector fields accepts the following values:

| Value | Description | Use When |
|---|---|---|
| `"Cosine"` | Cosine distance (1 - cosine similarity). Default. | Normalized text/image embeddings |
| `"Euclidean"` | L2 (Euclidean) distance | Spatial data, non-normalized vectors |
| `"Manhattan"` | L1 (Manhattan) distance | Sparse feature vectors |
| `"DotProduct"` | Dot product (higher = more similar) | Pre-normalized vectors where magnitude matters |
| `"Angular"` | Angular distance | Similar to cosine, but based on angle |

For most embedding models (BERT, Sentence Transformers, OpenAI, etc.), "Cosine" is the correct choice.
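The "Cosine" metric scores by 1 minus cosine similarity. The formula in plain Rust (an illustration of the math, not Laurus's internal code):

```rust
// Cosine distance: 1 - (a . b) / (|a| * |b|).
// Illustrative only; not Laurus's internal implementation.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm(a) * norm(b))
}

fn main() {
    // Same direction -> 0.0; orthogonal -> 1.0 (magnitude is ignored).
    assert_eq!(cosine_distance(&[1.0, 0.0], &[3.0, 0.0]), 0.0);
    assert_eq!(cosine_distance(&[1.0, 0.0], &[0.0, 2.0]), 1.0);
    println!("ok");
}
```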

Quantization

Vector fields optionally support quantization to reduce memory usage at the cost of some accuracy. Specify the quantizer option as a TOML table.

None (default)

No quantization — full precision 32-bit floats.

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
# quantizer is omitted (no quantization)

Scalar 8-bit

Compresses each float32 component to uint8 (~4x memory reduction).

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
quantizer = "Scalar8Bit"
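The idea behind scalar 8-bit quantization can be sketched in a few lines: map each component to a byte using the vector's min/max range, and reverse the mapping at search time. A hypothetical sketch, not Laurus's actual quantizer:

```rust
// Scalar quantization sketch: compress f32 components to u8 over the
// vector's [min, max] range (~4x smaller). Not Laurus's internal code.
fn quantize(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let q = v.iter().map(|&x| ((x - min) * scale).round() as u8).collect();
    (q, min, max)
}

fn dequantize(q: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = (max - min) / 255.0;
    q.iter().map(|&b| min + b as f32 * step).collect()
}

fn main() {
    let v = [0.12f32, -0.40, 0.88, 0.05];
    let (q, min, max) = quantize(&v);
    let r = dequantize(&q, min, max);
    // Each component is recovered to within one quantization step.
    for (a, b) in v.iter().zip(&r) {
        assert!((a - b).abs() <= (max - min) / 255.0);
    }
    println!("{:?} -> {:?}", v, q);
}
```

This is the accuracy trade-off the docs mention: distances are computed over slightly lossy values in exchange for the memory savings.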

Product Quantization

Splits the vector into subvectors and quantizes each independently.

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"

[fields.embedding.Hnsw.quantizer.ProductQuantization]
subvector_count = 48

| Option | Type | Description |
|---|---|---|
| `subvector_count` | integer | Number of subvectors. Must evenly divide `dimension`. |

Complete Examples

Full-text search only

A simple blog post index with lexical search:

default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = false

[fields.category.Text]
indexed = true
stored = true
term_vectors = false

[fields.published_at.DateTime]
indexed = true
stored = true

Vector search only

A vector-only index for semantic similarity:

[fields.embedding.Hnsw]
dimension = 768
distance = "Cosine"
m = 16
ef_construction = 200

Hybrid search (lexical + vector)

Combine lexical and vector search for best-of-both-worlds retrieval:

default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = true

[fields.category.Text]
indexed = true
stored = true
term_vectors = false

[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200

Tip: A single field cannot be both lexical and vector. Use separate fields (e.g., body for text, body_vec for embedding) and map them both to the same source content.

E-commerce product index

A more complex schema with mixed field types:

default_fields = ["name", "description"]

[fields.name.Text]
indexed = true
stored = true
term_vectors = false

[fields.description.Text]
indexed = true
stored = true
term_vectors = true

[fields.price.Float]
indexed = true
stored = true

[fields.in_stock.Boolean]
indexed = true
stored = true

[fields.created_at.DateTime]
indexed = true
stored = true

[fields.location.Geo]
indexed = true
stored = true

[fields.description_vec.Hnsw]
dimension = 384
distance = "Cosine"

Generating a Schema

You can generate a schema TOML file interactively using the CLI:

laurus create schema
laurus create schema --output my_schema.toml

See create schema for details.

Using a Schema

Once you have a schema file, create an index from it:

laurus create index --schema schema.toml

Or load it programmatically in Rust:

use laurus::Schema;

let toml_str = std::fs::read_to_string("schema.toml")?;
let schema: Schema = toml::from_str(&toml_str)?;

REPL (Interactive Mode)

The REPL provides an interactive session for exploring your index without typing the full laurus command each time.

Starting the REPL

laurus --index-dir ./my_index repl

If an index already exists at the specified directory, it is opened automatically:

Laurus REPL (type 'help' for commands, 'quit' to exit)
laurus>

If no index exists yet, the REPL starts without a loaded index and guides you to create one:

Laurus REPL — no index found at ./my_index.
Use 'create index <schema_path>' to create one, or 'help' for commands.
laurus>

Available Commands

Commands follow the same <operation> <resource> ordering as the CLI.

| Command | Description |
|---|---|
| `create index [schema_path]` | Create a new index (interactive wizard if no path given) |
| `create schema <output_path>` | Interactive schema generation wizard |
| `search <query>` | Search the index |
| `add field <name> <json>` | Add a field to the schema |
| `add doc <id> <json>` | Add a document (append, allows multiple chunks per ID) |
| `put doc <id> <json>` | Put (upsert) a document (replaces existing with same ID) |
| `get stats` | Show index statistics |
| `get schema` | Show the current schema |
| `get docs <id>` | Get all documents (including chunks) by ID |
| `delete field <name>` | Remove a field from the schema |
| `delete docs <id>` | Delete all documents (including chunks) by ID |
| `commit` | Commit pending changes |
| `help` | Show available commands |
| `quit` / `exit` | Exit the REPL |

Note: Commands other than create, help, and quit require a loaded index. If no index is loaded, the REPL displays a message asking you to run create index first.

Usage Examples

Creating an Index

laurus> create index ./schema.toml
Index created at ./my_index.
laurus> add doc doc1 {"title":"Hello","body":"World"}
Document 'doc1' added.

Searching

laurus> search body:rust
╭──────┬────────┬────────────────────────────────────╮
│ ID   │ Score  │ Fields                             │
├──────┼────────┼────────────────────────────────────┤
│ doc1 │ 0.8532 │ body: Rust is a systems..., title… │
╰──────┴────────┴────────────────────────────────────╯

Managing Fields

laurus> add field category {"Text": {"indexed": true, "stored": true}}
Field 'category' added.
laurus> delete field category
Field 'category' deleted.

Adding and Committing Documents

laurus> add doc doc4 {"title":"New Document","body":"Some content here."}
Document 'doc4' added.
laurus> commit
Changes committed.

Retrieving Information

laurus> get stats
Document count: 3

laurus> get schema
{
  "fields": { ... },
  "default_fields": ["title", "body"]
}

laurus> get docs doc4
╭──────┬───────────────────────────────────────────────╮
│ ID   │ Fields                                        │
├──────┼───────────────────────────────────────────────┤
│ doc4 │ body: Some content here., title: New Document │
╰──────┴───────────────────────────────────────────────╯

Deleting Documents

laurus> delete docs doc4
Documents 'doc4' deleted.
laurus> commit
Changes committed.

Features

  • Line editing — Arrow keys, Home/End, and standard readline shortcuts
  • History — Use Up/Down arrows to recall previous commands
  • Ctrl+C / Ctrl+D — Exit the REPL gracefully

Server Overview

The laurus-server crate provides a gRPC server with an optional HTTP/JSON gateway for the Laurus search engine. It keeps the engine resident in memory, eliminating per-command startup overhead.

Features

  • Persistent engine – The index stays open across requests; no WAL replay on every call
  • Full gRPC API – Index management, document CRUD, commit, and search (unary + streaming)
  • HTTP Gateway – Optional HTTP/JSON gateway alongside gRPC for REST-style access
  • Health checking – Standard health check endpoint for load balancers and orchestrators
  • Graceful shutdown – Pending changes are committed automatically on Ctrl+C / SIGINT
  • TOML configuration – Optional config file with CLI and environment variable overrides

Architecture

graph LR
    subgraph "laurus-server"
        GW["HTTP Gateway\n(axum)"]
        GRPC["gRPC Server\n(tonic)"]
        ENG["Engine\n(Arc&lt;RwLock&gt;)"]
    end

    Client1["HTTP Client"] --> GW
    Client2["gRPC Client"] --> GRPC
    GW --> GRPC
    GRPC --> ENG

The gRPC server always runs. The HTTP Gateway is optional and proxies HTTP/JSON requests to the gRPC server internally.

Quick Start

# Start with default settings (gRPC on port 50051)
laurus serve

# Start with HTTP Gateway
laurus serve --http-port 8080

# Start with a configuration file
laurus serve --config config.toml

Sections

Getting Started with the gRPC Server

Starting the Server

The gRPC server is started via the serve subcommand of the laurus CLI:

laurus serve [OPTIONS]

Options

| Option | Short | Env Variable | Default | Description |
|---|---|---|---|---|
| `--config <PATH>` | `-c` | `LAURUS_CONFIG` | | Path to a TOML configuration file |
| `--host <HOST>` | `-H` | `LAURUS_HOST` | `0.0.0.0` | Listen address |
| `--port <PORT>` | `-p` | `LAURUS_PORT` | `50051` | Listen port |
| `--http-port <PORT>` | | `LAURUS_HTTP_PORT` | | HTTP Gateway port (enables HTTP gateway when set) |

Log verbosity is controlled by the standard RUST_LOG environment variable (default: info). See env_logger syntax for filter directives such as RUST_LOG=laurus=debug,tonic=warn.

The global --index-dir option (env: LAURUS_INDEX_DIR) specifies the index data directory:

# Using CLI arguments
laurus --index-dir ./my_index serve --port 8080

# Using environment variables
export LAURUS_INDEX_DIR=./my_index
export LAURUS_PORT=8080
export RUST_LOG=debug
laurus serve

Startup Behavior

On startup, the server attempts to open an existing index at the configured data directory. If no index exists, the server starts without one – you can create an index later via the CreateIndex RPC.

Configuration

You can use a TOML configuration file instead of (or in addition to) command-line options. See Configuration for the full reference.

laurus serve --config config.toml

HTTP Gateway

When --http-port is set, an HTTP/JSON gateway starts alongside the gRPC server. See HTTP Gateway for the full endpoint reference and examples.

laurus serve --http-port 8080

Graceful Shutdown

When the server receives a shutdown signal (Ctrl+C / SIGINT), it automatically:

  1. Stops accepting new connections
  2. Commits any pending changes to the index
  3. Exits cleanly

Connecting via gRPC

Any gRPC client can connect to the server. For quick testing, grpcurl is useful:

# Health check
grpcurl -plaintext localhost:50051 laurus.v1.HealthService/Check

# Create an index
grpcurl -plaintext -d '{
  "schema": {
    "fields": {
      "title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
      "body": {"text": {"indexed": true, "stored": true, "term_vectors": true}}
    },
    "default_fields": ["title", "body"]
  }
}' localhost:50051 laurus.v1.IndexService/CreateIndex

# Add a document
grpcurl -plaintext -d '{
  "id": "doc1",
  "document": {
    "fields": {
      "title": {"text_value": "Hello World"},
      "body": {"text_value": "This is a test document."}
    }
  }
}' localhost:50051 laurus.v1.DocumentService/AddDocument

# Commit
grpcurl -plaintext localhost:50051 laurus.v1.DocumentService/Commit

# Search
grpcurl -plaintext -d '{"query": "body:test", "limit": 10}' \
  localhost:50051 laurus.v1.SearchService/Search

See gRPC API Reference for the full API documentation, or try the Hands-on Tutorial for a step-by-step walkthrough using the HTTP Gateway.

Hands-on Tutorial

This tutorial walks you through a complete workflow with laurus-server: starting the server, creating an index, adding documents, searching, updating, and deleting. All examples use curl via the HTTP Gateway.

Prerequisites

  • laurus CLI installed (see Installation)
  • curl available on your system

Step 1: Start the Server

Start laurus-server with the HTTP Gateway enabled:

laurus --index-dir /tmp/laurus/tutorial serve --port 50051 --http-port 8080

You should see log output indicating the gRPC server (port 50051) and the HTTP Gateway (port 8080) have started.

Verify the server is running:

curl http://localhost:8080/v1/health

Expected response:

{"status":"SERVING_STATUS_SERVING"}

Step 2: Create an Index

Create an index with a schema that defines text fields for lexical search and a vector field for vector search. This example demonstrates custom analyzers, embedder definitions, and per-field configuration:

curl -X POST http://localhost:8080/v1/index \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "analyzers": {
        "body_analyzer": {
          "char_filters": [{"type": "unicode_normalization", "form": "nfkc"}],
          "tokenizer": {"type": "regex"},
          "token_filters": [
            {"type": "lowercase"},
            {"type": "stop", "words": ["the", "a", "an", "is", "it"]}
          ]
        }
      },
      "embedders": {
        "my_embedder": {"type": "precomputed"}
      },
      "fields": {
        "title": {"text": {"indexed": true, "stored": true, "term_vectors": false, "analyzer": "standard"}},
        "body": {"text": {"indexed": true, "stored": true, "term_vectors": false, "analyzer": "body_analyzer"}},
        "category": {"text": {"indexed": true, "stored": true, "term_vectors": false, "analyzer": "keyword"}},
        "embedding": {"hnsw": {"dimension": 4, "distance": "DISTANCE_METRIC_COSINE", "m": 16, "ef_construction": 200, "embedder": "my_embedder"}}
      },
      "default_fields": ["title", "body"]
    }
  }'

This creates an index with three text fields and one vector field:

  • title — uses the built-in standard analyzer (tokenizes and lowercases).
  • body — uses the custom body_analyzer defined in the analyzers section (NFKC normalization + regex tokenizer + lowercase + custom stop words).
  • category — uses the keyword analyzer (treats the entire value as a single token for exact matching).
  • embedding — HNSW vector index with 4 dimensions, cosine distance, using the my_embedder embedder defined in embedders. In this tutorial we use precomputed (vectors supplied externally). In production, use a dimension matching your embedding model (e.g. 384 or 768).

The default_fields setting means that queries without a field prefix will search both title and body.

Built-in analyzers

standard, keyword, english, japanese, simple, noop. If omitted, the engine default (standard) is used.

Custom analyzer components

You can compose custom analyzers from the following components:

  • Tokenizers: whitespace, unicode_word, regex, ngram, lindera, whole
  • Char filters: unicode_normalization, pattern_replace, mapping, japanese_iteration_mark
  • Token filters: lowercase, stop, stem, boost, limit, strip, remove_empty, flatten_graph

Embedders

The embedders section defines how vectors are generated. Each vector field can reference an embedder by name via the embedder option. Available types:

  • precomputed — vectors are supplied externally (no automatic embedding).
  • candle_bert — local BERT model via Candle. Params: model (HuggingFace model ID). Requires embeddings-candle feature.
  • candle_clip — local CLIP multimodal model. Params: model (HuggingFace model ID). Requires embeddings-multimodal feature.
  • openai — OpenAI API. Params: model (e.g. "text-embedding-3-small"). Requires embeddings-openai feature and OPENAI_API_KEY env var.

Example with a BERT embedder (requires the embeddings-candle feature):

{
  "embedders": {
    "bert": {"type": "candle_bert", "model": "sentence-transformers/all-MiniLM-L6-v2"}
  },
  "fields": {
    "embedding": {"hnsw": {"dimension": 384, "embedder": "bert"}}
  }
}

Verify the index was created:

curl http://localhost:8080/v1/index

Expected response:

{"document_count":0,"vector_fields":{}}

Step 3: Add Documents

Add a few documents to the index. Use PUT to upsert documents by ID. Each document includes text fields and an embedding vector (in production, these vectors would come from an embedding model):

curl -X PUT http://localhost:8080/v1/documents/doc001 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Introduction to Rust Programming",
        "body": "Rust is a modern systems programming language that focuses on safety, speed, and concurrency.",
        "category": "programming",
        "embedding": [0.9, 0.1, 0.2, 0.0]
      }
    }
  }'
curl -X PUT http://localhost:8080/v1/documents/doc002 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Web Development with Rust",
        "body": "Building web applications with Rust has become increasingly popular. Frameworks like Actix and Rocket make it easy to create fast and secure web services.",
        "category": "web-development",
        "embedding": [0.7, 0.3, 0.5, 0.1]
      }
    }
  }'
curl -X PUT http://localhost:8080/v1/documents/doc003 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Python for Data Science",
        "body": "Python is the most popular language for data science and machine learning. Libraries like NumPy and Pandas provide powerful tools for data analysis.",
        "category": "data-science",
        "embedding": [0.1, 0.8, 0.1, 0.9]
      }
    }
  }'

Vector fields are specified as JSON arrays of numbers. The array length must match the dimension configured in the schema (4 in this tutorial).

Step 4: Commit Changes

Documents are not searchable until committed. Commit the pending changes:

curl -X POST http://localhost:8080/v1/commit

Step 5: Search Documents

Basic Search

Search for documents containing “rust”:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "rust", "limit": 10}'

This searches the default fields (title and body). Expected result: doc001 and doc002 are returned.

Field-Specific Search

Search only in the title field:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "title:python", "limit": 10}'

Expected result: only doc003 is returned.

Search by Category

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "category:programming", "limit": 10}'

Expected result: only doc001 is returned.

Boolean Queries

Combine conditions with AND, OR, and NOT:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "rust AND web", "limit": 10}'

Expected result: only doc002 is returned (contains both “rust” and “web”).

Field Boosting

Boost the title field to prioritize title matches:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "rust",
    "limit": 10,
    "field_boosts": {"title": 2.0}
  }'

Vector Search

Search by vector similarity. Provide a query vector in query_vectors and specify which field to search:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query_vectors": [
      {
        "vector": [0.85, 0.15, 0.2, 0.05],
        "fields": ["embedding"]
      }
    ],
    "limit": 10
  }'

This finds documents whose embedding vectors are closest to the query vector. Expected result: doc001 ranks highest (most similar vector).
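To see why doc001 is expected to rank first, the similarities can be computed by hand. A minimal sketch, assuming the cosine metric used in this tutorial's schema:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [0.85, 0.15, 0.2, 0.05]
docs = {
    "doc001": [0.9, 0.1, 0.2, 0.0],
    "doc002": [0.7, 0.3, 0.5, 0.1],
    "doc003": [0.1, 0.8, 0.1, 0.9],
}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['doc001', 'doc002', 'doc003']
```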

Hybrid Search

Combine lexical search and vector search for best results. The fusion parameter controls how scores from both searches are merged:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "rust",
    "query_vectors": [
      {
        "vector": [0.85, 0.15, 0.2, 0.05],
        "fields": ["embedding"]
      }
    ],
    "fusion": {"rrf": {"k": 60.0}},
    "limit": 10
  }'

This uses Reciprocal Rank Fusion (RRF) to merge lexical and vector search results. You can also use weighted sum fusion:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "programming",
    "query_vectors": [
      {
        "vector": [0.85, 0.15, 0.2, 0.05],
        "fields": ["embedding"]
      }
    ],
    "fusion": {"weighted_sum": {"lexical_weight": 0.3, "vector_weight": 0.7}},
    "limit": 10
  }'
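Both fusion modes are easy to sketch. The following is an illustrative reimplementation of the two ideas, not Laurus's internal code (the engine's exact score normalization may differ): RRF sums 1/(k + rank) across result lists, while weighted sum blends per-document scores:

```python
def rrf(rankings, k=60.0):
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document.
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_sum(lexical, vector, lexical_weight=0.3, vector_weight=0.7):
    # Weighted sum: blend raw scores from each search per document.
    docs = set(lexical) | set(vector)
    return {d: lexical_weight * lexical.get(d, 0.0)
               + vector_weight * vector.get(d, 0.0) for d in docs}

lexical_ranking = ["doc001", "doc002"]            # BM25 hits for "rust"
vector_ranking = ["doc001", "doc002", "doc003"]   # nearest neighbors
print(rrf([lexical_ranking, vector_ranking]))     # doc001 leads both lists
```

A document that appears high in both lists accumulates contributions from both, which is what makes hybrid search robust to either signal being noisy.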

Step 6: Retrieve a Document

Fetch a specific document by its ID:

curl http://localhost:8080/v1/documents/doc001

Expected response (includes the stored vector field):

{
  "documents": [
    {
      "fields": {
        "title": "Introduction to Rust Programming",
        "body": "Rust is a modern systems programming language that focuses on safety, speed, and concurrency.",
        "category": "programming",
        "embedding": [0.9, 0.1, 0.2, 0.0]
      }
    }
  ]
}

Step 7: Update a Document

Update a document by PUT-ing with the same ID. This replaces the entire document:

curl -X PUT http://localhost:8080/v1/documents/doc001 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Introduction to Rust Programming",
        "body": "Rust is a modern systems programming language that focuses on safety, speed, and concurrency. It provides memory safety without garbage collection.",
        "category": "programming",
        "embedding": [0.9, 0.1, 0.2, 0.0]
      }
    }
  }'

Commit and verify:

curl -X POST http://localhost:8080/v1/commit
curl http://localhost:8080/v1/documents/doc001

The updated body text is now stored.

Step 8: Delete a Document

Delete a document by its ID:

curl -X DELETE http://localhost:8080/v1/documents/doc003

Commit and verify:

curl -X POST http://localhost:8080/v1/commit

Confirm the document was deleted:

curl http://localhost:8080/v1/documents/doc003

Expected response:

{"documents":[]}

Search results will no longer include the deleted document:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "python", "limit": 10}'

Expected result: no results returned.

Step 9: Check Index Statistics

View the current index statistics:

curl http://localhost:8080/v1/index

The document_count should reflect the remaining documents after the deletion.

Step 10: Clean Up

Stop the server with Ctrl+C. The server performs a graceful shutdown, committing any pending changes before exiting.

To remove the tutorial data:

rm -rf /tmp/laurus/tutorial

Going Further: Using a Real Embedding Model

The tutorial above uses precomputed vectors for simplicity. In production, you typically use an embedding model to automatically convert text into vectors. Here is how to set up a BERT-based embedder.

Prerequisites

Build laurus with the embeddings-candle feature:

cargo build --release --features embeddings-candle

Schema with BERT Embedder

Create an index:

curl -X POST http://localhost:8080/v1/index \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "embedders": {
        "bert": {
          "type": "candle_bert",
          "model": "sentence-transformers/all-MiniLM-L6-v2"
        }
      },
      "fields": {
        "title": {"text": {"indexed": true, "stored": true, "analyzer": "standard"}},
        "body": {"text": {"indexed": true, "stored": true, "analyzer": "standard"}},
        "embedding": {"hnsw": {"dimension": 384, "distance": "DISTANCE_METRIC_COSINE", "m": 16, "ef_construction": 200, "embedder": "bert"}}
      },
      "default_fields": ["title", "body"]
    }
  }'

The model is automatically downloaded from HuggingFace Hub on first use. The dimension (384) must match the model’s output dimension.

Add documents. Pass text to the embedding field — the embedder automatically converts it to a vector:

curl -X PUT http://localhost:8080/v1/documents/doc001 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Introduction to Rust Programming",
        "body": "Rust is a modern systems programming language.",
        "embedding": "Rust is a modern systems programming language."
      }
    }
  }'
curl -X PUT http://localhost:8080/v1/documents/doc002 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Web Development with Rust",
        "body": "Building web applications with Rust using Actix and Rocket.",
        "embedding": "Building web applications with Rust using Actix and Rocket."
      }
    }
  }'

Commit:

curl -X POST http://localhost:8080/v1/commit

Search with both lexical and semantic queries. The embedder also handles text-to-vector conversion at search time:

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "systems programming",
    "query_vectors": [
      {
        "vector": "systems programming language",
        "fields": ["embedding"]
      }
    ],
    "fusion": {"rrf": {"k": 60.0}},
    "limit": 10
  }'

With the precomputed embedder you must pass raw vectors, but text-capable embedders like candle_bert accept text directly for both indexing and searching.

Using OpenAI Embeddings

For OpenAI’s embedding API, set the OPENAI_API_KEY environment variable and build with the embeddings-openai feature:

cargo build --release --features embeddings-openai
export OPENAI_API_KEY="sk-..."

Create an index:

curl -X POST http://localhost:8080/v1/index \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "embedders": {
        "openai": {
          "type": "openai",
          "model": "text-embedding-3-small"
        }
      },
      "fields": {
        "title": {"text": {"indexed": true, "stored": true}},
        "embedding": {"hnsw": {"dimension": 1536, "distance": "DISTANCE_METRIC_COSINE", "embedder": "openai"}}
      },
      "default_fields": ["title"]
    }
  }'

The text-embedding-3-small model outputs 1536-dimensional vectors.

Available Embedding Models

| Type | Feature Flag | Example Model | Dimension |
|------|--------------|---------------|-----------|
| `candle_bert` | `embeddings-candle` | `sentence-transformers/all-MiniLM-L6-v2` | 384 |
| `candle_clip` | `embeddings-multimodal` | `openai/clip-vit-base-patch32` | 512 |
| `openai` | `embeddings-openai` | `text-embedding-3-small` | 1536 |

Next Steps

Configuration

The laurus-server can be configured through CLI arguments, environment variables, and a TOML configuration file.

Configuration Priority

Server and index settings are resolved in the following order (highest priority first):

CLI arguments > Environment variables > Config file > Defaults

Log verbosity is controlled exclusively by the RUST_LOG environment variable (default: info).

For example:

# CLI argument wins over environment variable and config file
LAURUS_PORT=4567 laurus serve --config config.toml --port 1234
# -> Listens on port 1234

# Environment variable wins over config file
LAURUS_PORT=4567 laurus serve --config config.toml
# -> Listens on port 4567

# Config file value is used when no CLI argument or env var is set
laurus serve --config config.toml
# -> Uses port from config.toml (or default 50051 if not set)
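The resolution order can be sketched as a first-non-empty lookup (illustrative only; `resolve_port` is not a Laurus function):

```python
DEFAULT_PORT = 50051

def resolve_port(cli=None, env=None, config_file=None):
    # Highest-priority non-empty source wins:
    # CLI argument > environment variable > config file > default.
    for value in (cli, env, config_file):
        if value is not None:
            return value
    return DEFAULT_PORT

print(resolve_port(cli=1234, env=4567))  # 1234: CLI wins
print(resolve_port(env=4567))            # 4567: env wins over defaults
print(resolve_port())                    # 50051: built-in default
```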

TOML Configuration File

Format

[server]
host = "0.0.0.0"
port = 50051
http_port = 8080  # Optional: enables HTTP Gateway

[index]
data_dir = "./laurus_index"

Log verbosity is controlled by the RUST_LOG environment variable (default: info), not through the config file.

Field Reference

[server] Section

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `host` | String | `"0.0.0.0"` | Listen address for the gRPC server |
| `port` | Integer | `50051` | Listen port for the gRPC server |
| `http_port` | Integer | | HTTP Gateway port. When set, the HTTP/JSON gateway starts alongside gRPC. |

[index] Section

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `data_dir` | String | `"./laurus_index"` | Path to the index data directory |

Environment Variables

| Variable | Maps To | Description |
|----------|---------|-------------|
| `LAURUS_HOST` | `server.host` | Listen address |
| `LAURUS_PORT` | `server.port` | gRPC listen port |
| `LAURUS_HTTP_PORT` | `server.http_port` | HTTP Gateway port |
| `LAURUS_INDEX_DIR` | `index.data_dir` | Index data directory |
| `RUST_LOG` | | Log filter directive (e.g. `info`, `debug`, `laurus=debug,tonic=warn`) |
| `LAURUS_CONFIG` | | Path to TOML config file |

CLI Arguments

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--config <PATH>` | `-c` | | Path to TOML configuration file |
| `--host <HOST>` | `-H` | `0.0.0.0` | Listen address |
| `--port <PORT>` | `-p` | `50051` | gRPC listen port |
| `--http-port <PORT>` | | | HTTP Gateway port |
| `--index-dir <PATH>` | | `./laurus_index` | Index data directory (global option) |

Common Configurations

Development (gRPC only)

[server]
host = "127.0.0.1"
port = 50051

[index]
data_dir = "./dev_data"

RUST_LOG=debug laurus serve --config config.toml

Production (gRPC + HTTP Gateway)

[server]
host = "0.0.0.0"
port = 50051
http_port = 8080

[index]
data_dir = "/var/lib/laurus/data"

Minimal (environment variables only)

export LAURUS_INDEX_DIR=/var/lib/laurus/data
export LAURUS_PORT=50051
export LAURUS_HTTP_PORT=8080
export RUST_LOG=info
laurus serve

gRPC API Reference

All services are defined under the laurus.v1 protobuf package.

Services Overview

| Service | RPCs | Description |
|---------|------|-------------|
| HealthService | Check | Health checking |
| IndexService | CreateIndex, GetIndex, GetSchema, AddField, DeleteField | Index lifecycle and schema |
| DocumentService | PutDocument, AddDocument, GetDocuments, DeleteDocuments, Commit | Document CRUD and commit |
| SearchService | Search, SearchStream | Unary and streaming search |

HealthService

Check

Returns the current serving status of the server.

rpc Check(HealthCheckRequest) returns (HealthCheckResponse);

Response fields:

| Field | Type | Description |
|-------|------|-------------|
| `status` | ServingStatus | `SERVING_STATUS_SERVING` when the server is ready |

IndexService

CreateIndex

Create a new index with the given schema. Fails with ALREADY_EXISTS if an index is already open.

rpc CreateIndex(CreateIndexRequest) returns (CreateIndexResponse);

Request fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `schema` | Schema | Yes | Index schema definition |

Schema structure:

message Schema {
  map<string, FieldOption> fields = 1;
  repeated string default_fields = 2;
  map<string, AnalyzerDefinition> analyzers = 3;
  map<string, EmbedderConfig> embedders = 4;
}

  • fields — Field definitions keyed by field name.
  • default_fields — Field names used as default search targets when a query does not specify a field.
  • analyzers — Custom analyzer pipelines keyed by name. Referenced by TextOption.analyzer.
  • embedders — Embedder configurations keyed by name. Referenced by vector field options (HnswOption.embedder, etc.).

AnalyzerDefinition:

message AnalyzerDefinition {
  repeated ComponentConfig char_filters = 1;
  ComponentConfig tokenizer = 2;
  repeated ComponentConfig token_filters = 3;
}

ComponentConfig (used for char filters, tokenizer, and token filters):

| Field | Type | Description |
|-------|------|-------------|
| `type` | string | Component type name (e.g. `"whitespace"`, `"lowercase"`, `"unicode_normalization"`) |
| `params` | `map<string, string>` | Type-specific parameters as string key-value pairs |

EmbedderConfig:

| Field | Type | Description |
|-------|------|-------------|
| `type` | string | Embedder type name (e.g. `"precomputed"`, `"candle_bert"`, `"openai"`) |
| `params` | `map<string, string>` | Type-specific parameters (e.g. `"model": "sentence-transformers/all-MiniLM-L6-v2"`) |

Each FieldOption is a oneof with one of the following field types:

| Lexical Fields | Vector Fields |
|----------------|---------------|
| TextOption (indexed, stored, term_vectors, analyzer) | HnswOption (dimension, distance, m, ef_construction, base_weight, quantizer, embedder) |
| IntegerOption (indexed, stored) | FlatOption (dimension, distance, base_weight, quantizer, embedder) |
| FloatOption (indexed, stored) | IvfOption (dimension, distance, n_clusters, n_probe, base_weight, quantizer, embedder) |
| BooleanOption (indexed, stored) | |
| DateTimeOption (indexed, stored) | |
| GeoOption (indexed, stored) | |
| BytesOption (stored) | |

The embedder field in vector options specifies the name of an embedder defined in Schema.embedders. When set, the server automatically generates vectors from document text fields at index time. Leave empty to supply pre-computed vectors directly.

Distance metrics: COSINE, EUCLIDEAN, MANHATTAN, DOT_PRODUCT, ANGULAR

Quantization methods: NONE, SCALAR_8BIT, PRODUCT_QUANTIZATION

QuantizationConfig structure:

| Field | Type | Description |
|-------|------|-------------|
| `method` | QuantizationMethod | Quantization method (`QUANTIZATION_METHOD_NONE`, `QUANTIZATION_METHOD_SCALAR_8BIT`, or `QUANTIZATION_METHOD_PRODUCT_QUANTIZATION`) |
| `subvector_count` | uint32 | Number of subvectors (only used when method is `PRODUCT_QUANTIZATION`; must evenly divide dimension) |

Example:

{
  "schema": {
    "fields": {
      "title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
      "embedding": {"hnsw": {"dimension": 384, "distance": "DISTANCE_METRIC_COSINE", "m": 16, "ef_construction": 200}}
    },
    "default_fields": ["title"]
  }
}
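The `subvector_count` divisibility rule above is easy to check client-side. A hypothetical sketch of how product quantization partitions a vector (illustrative, not Laurus's implementation):

```python
def split_subvectors(vector, subvector_count):
    # Product quantization partitions the vector into equal-sized subvectors,
    # so subvector_count must evenly divide the dimension.
    dim = len(vector)
    if dim % subvector_count != 0:
        raise ValueError(f"{subvector_count} does not evenly divide dimension {dim}")
    size = dim // subvector_count
    return [vector[i:i + size] for i in range(0, dim, size)]

parts = split_subvectors([0.0] * 384, 8)  # 8 subvectors of 48 dims each
```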

GetIndex

Get index statistics.

rpc GetIndex(GetIndexRequest) returns (GetIndexResponse);

Response fields:

| Field | Type | Description |
|-------|------|-------------|
| `document_count` | uint64 | Total number of documents in the index |
| `vector_fields` | `map<string, VectorFieldStats>` | Per-field vector statistics |

Each VectorFieldStats contains vector_count and dimension.

AddField

Add a new field to the running index at runtime.

Request fields:

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | The field name |
| `field_option` | FieldOption | The field configuration |

Response: Returns the updated Schema.

DeleteField

Remove a field from the running index schema.

rpc DeleteField(DeleteFieldRequest) returns (DeleteFieldResponse);

Request fields:

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | The field name to remove |

Response fields:

| Field | Type | Description |
|-------|------|-------------|
| `schema` | Schema | The updated schema after removal |

Existing indexed data for the field remains in storage but becomes inaccessible. Per-field analyzers and embedders are unregistered.

HTTP gateway: DELETE /v1/schema/fields/{name}

GetSchema

Retrieve the current index schema.

rpc GetSchema(GetSchemaRequest) returns (GetSchemaResponse);

Response fields:

| Field | Type | Description |
|-------|------|-------------|
| `schema` | Schema | The index schema |

DocumentService

PutDocument

Insert or replace a document by ID. If a document with the same ID already exists, it is replaced.

rpc PutDocument(PutDocumentRequest) returns (PutDocumentResponse);

Request fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | External document ID |
| `document` | Document | Yes | Document content |

Document structure:

message Document {
  map<string, Value> fields = 1;
}

Each Value is a oneof with these types:

| Type | Proto Field | Description |
|------|-------------|-------------|
| Null | `null_value` | Null value |
| Boolean | `bool_value` | Boolean value |
| Integer | `int64_value` | 64-bit integer |
| Float | `float64_value` | 64-bit floating point |
| Text | `text_value` | UTF-8 string |
| Bytes | `bytes_value` | Raw bytes |
| Vector | `vector_value` | VectorValue (list of floats) |
| DateTime | `datetime_value` | Unix microseconds (UTC) |
| Geo | `geo_value` | GeoPoint (latitude, longitude) |
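For example, a client can convert a timestamp to the expected Unix-microseconds representation like this:

```python
from datetime import datetime, timezone

# DateTime values are Unix microseconds in UTC.
dt = datetime(2024, 1, 1, tzinfo=timezone.utc)
micros = int(dt.timestamp() * 1_000_000)
print(micros)  # 1704067200000000
```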

AddDocument

Add a document. Unlike PutDocument, this does not replace existing documents with the same ID — multiple documents can share an ID (chunking pattern).

rpc AddDocument(AddDocumentRequest) returns (AddDocumentResponse);

Request fields are the same as PutDocument.

GetDocuments

Retrieve all documents matching the given external ID.

rpc GetDocuments(GetDocumentsRequest) returns (GetDocumentsResponse);

Request fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | External document ID |

Response fields:

| Field | Type | Description |
|-------|------|-------------|
| `documents` | repeated Document | Matching documents |

DeleteDocuments

Delete all documents matching the given external ID.

rpc DeleteDocuments(DeleteDocumentsRequest) returns (DeleteDocumentsResponse);

Commit

Commit pending changes (additions and deletions) to the index. Changes are not visible to search until committed.

rpc Commit(CommitRequest) returns (CommitResponse);

SearchService

Search

Execute a search query and return results as a single response.

rpc Search(SearchRequest) returns (SearchResponse);

Response fields:

| Field | Type | Description |
|-------|------|-------------|
| `results` | repeated SearchResult | Search results ordered by relevance |
| `total_hits` | uint64 | Total number of matching documents (before limit/offset) |

SearchStream

Execute a search query and stream results back one at a time.

rpc SearchStream(SearchRequest) returns (stream SearchResult);

SearchRequest Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `query` | string | No | Lexical search query in Query DSL |
| `query_vectors` | repeated QueryVector | No | Vector search queries |
| `limit` | uint32 | No | Maximum number of results (default: engine default) |
| `offset` | uint32 | No | Number of results to skip |
| `fusion` | FusionAlgorithm | No | Fusion algorithm for hybrid search |
| `lexical_params` | LexicalParams | No | Lexical search parameters |
| `vector_params` | VectorParams | No | Vector search parameters |
| `field_boosts` | `map<string, float>` | No | Per-field score boosting |
At least one of query or query_vectors must be provided.

QueryVector

| Field | Type | Description |
|-------|------|-------------|
| `vector` | repeated float | Query vector |
| `weight` | float | Weight for this vector (default: 1.0) |
| `fields` | repeated string | Target vector fields (empty = all) |

FusionAlgorithm

A oneof with two options:

  • RRF (Reciprocal Rank Fusion): k parameter (default: 60)
  • WeightedSum: lexical_weight and vector_weight

LexicalParams

| Field | Type | Description |
|-------|------|-------------|
| `min_score` | float | Minimum score threshold |
| `timeout_ms` | uint64 | Search timeout in milliseconds |
| `parallel` | bool | Enable parallel search |
| `sort_by` | SortSpec | Sort by a field instead of score |

SortSpec

| Field | Type | Description |
|-------|------|-------------|
| `field` | string | Field name to sort by. Empty string means sort by relevance score |
| `order` | SortOrder | `SORT_ORDER_ASC` (ascending) or `SORT_ORDER_DESC` (descending) |

VectorParams

| Field | Type | Description |
|-------|------|-------------|
| `fields` | repeated string | Target vector fields |
| `score_mode` | VectorScoreMode | `WEIGHTED_SUM`, `MAX_SIM`, or `LATE_INTERACTION` |
| `overfetch` | float | Overfetch factor (default: 2.0) |
| `min_score` | float | Minimum score threshold |

SearchResult

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | External document ID |
| `score` | float | Relevance score |
| `document` | Document | Document content |

Example

{
  "query": "body:rust",
  "query_vectors": [
    {"vector": [0.1, 0.2, 0.3], "weight": 1.0}
  ],
  "limit": 10,
  "fusion": {
    "rrf": {"k": 60}
  },
  "field_boosts": {
    "title": 2.0
  }
}

Error Handling

gRPC errors are returned as standard Status codes:

| Laurus Error | gRPC Status | When |
|--------------|-------------|------|
| Schema / Query / Field / JSON | INVALID_ARGUMENT | Malformed request or schema |
| No index open | FAILED_PRECONDITION | RPC called before CreateIndex |
| Index already exists | ALREADY_EXISTS | CreateIndex called twice |
| Not implemented | UNIMPLEMENTED | Feature not yet supported |
| Internal errors | INTERNAL | I/O, storage, or unexpected errors |

HTTP Gateway

The HTTP Gateway provides a RESTful HTTP/JSON interface to the Laurus search engine. It runs alongside the gRPC server and proxies requests internally:

Client (HTTP/JSON) --> HTTP Gateway (axum) --> gRPC Server (tonic) --> Engine

Enabling the HTTP Gateway

The gateway starts when http_port is configured:

# Via CLI argument
laurus serve --http-port 8080

# Via environment variable
LAURUS_HTTP_PORT=8080 laurus serve

# Via config file
laurus serve --config config.toml
# (set http_port in [server] section)

If http_port is not set, only the gRPC server starts.

Endpoints

| Method | Path | gRPC Method | Description |
|--------|------|-------------|-------------|
| GET | `/v1/health` | HealthService/Check | Health check |
| POST | `/v1/index` | IndexService/CreateIndex | Create a new index |
| GET | `/v1/index` | IndexService/GetIndex | Get index statistics |
| GET | `/v1/schema` | IndexService/GetSchema | Get the index schema |
| PUT | `/v1/documents/:id` | DocumentService/PutDocument | Upsert a document |
| POST | `/v1/documents/:id` | DocumentService/AddDocument | Add a document (chunk) |
| GET | `/v1/documents/:id` | DocumentService/GetDocuments | Get documents by ID |
| DELETE | `/v1/documents/:id` | DocumentService/DeleteDocuments | Delete documents by ID |
| POST | `/v1/commit` | DocumentService/Commit | Commit pending changes |
| POST | `/v1/search` | SearchService/Search | Search (unary) |
| POST | `/v1/search/stream` | SearchService/SearchStream | Search (Server-Sent Events) |

API Examples

Health Check

curl http://localhost:8080/v1/health

Create an Index

curl -X POST http://localhost:8080/v1/index \
  -H 'Content-Type: application/json' \
  -d '{
    "schema": {
      "fields": {
        "title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
        "body": {"text": {"indexed": true, "stored": true, "term_vectors": true}}
      },
      "default_fields": ["title", "body"]
    }
  }'

Get Index Statistics

curl http://localhost:8080/v1/index

Get Schema

curl http://localhost:8080/v1/schema

Upsert a Document (PUT)

Replaces the document if it already exists:

curl -X PUT http://localhost:8080/v1/documents/doc1 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Hello World",
        "body": "This is a test document."
      }
    }
  }'

Add a Document (POST)

Adds a new chunk without replacing existing documents with the same ID:

curl -X POST http://localhost:8080/v1/documents/doc1 \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "fields": {
        "title": "Hello World",
        "body": "This is a test document."
      }
    }
  }'

Get Documents

curl http://localhost:8080/v1/documents/doc1

Delete Documents

curl -X DELETE http://localhost:8080/v1/documents/doc1

Commit

curl -X POST http://localhost:8080/v1/commit

Search

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "body:test", "limit": 10}'

Search with Field Boosts

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "rust programming",
    "limit": 10,
    "field_boosts": {"title": 2.0}
  }'

Hybrid Search

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "body:rust",
    "query_vectors": [{"vector": [0.1, 0.2, 0.3], "weight": 1.0}],
    "limit": 10,
    "fusion": {"rrf": {"k": 60}}
  }'

Streaming Search (SSE)

The /v1/search/stream endpoint returns results as Server-Sent Events (SSE). Each result is sent as a separate event:

curl -N -X POST http://localhost:8080/v1/search/stream \
  -H 'Content-Type: application/json' \
  -d '{"query": "body:test", "limit": 10}'

The response is a stream of SSE events:

data: {"id":"doc1","score":0.8532,"document":{...}}

data: {"id":"doc2","score":0.4210,"document":{...}}
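A minimal consumer sketch that extracts results from the `data:` lines (a production client should use a proper SSE library; the field names follow the example above):

```python
import json

def parse_sse(stream_text):
    # Each event line starts with "data: " followed by one JSON search result.
    results = []
    for line in stream_text.splitlines():
        if line.startswith("data: "):
            results.append(json.loads(line[len("data: "):]))
    return results

sse = 'data: {"id":"doc1","score":0.8532}\n\ndata: {"id":"doc2","score":0.4210}\n'
hits = parse_sse(sse)
print([h["id"] for h in hits])  # ['doc1', 'doc2']
```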

Request/Response Format

All request and response bodies use JSON. The JSON structure mirrors the gRPC protobuf messages. See gRPC API Reference for the full message definitions.

MCP Server Overview

The laurus-mcp crate provides a Model Context Protocol (MCP) server for the Laurus search engine. It acts as a gRPC client to a running laurus-server instance, enabling AI assistants such as Claude to index documents and perform searches through the standard MCP stdio transport.

Features

  • MCP stdio transport — Runs as a subprocess; communicates with the AI client via stdin/stdout
  • gRPC client — Proxies all tool calls to a running laurus-server instance
  • All laurus search modes — Lexical (BM25), vector (HNSW/Flat/IVF), and hybrid search
  • Dynamic connection — Connect to any laurus-server endpoint via the connect tool
  • Document lifecycle — Add, update, delete, and retrieve documents through MCP tools

Architecture

graph LR
    subgraph "laurus-mcp"
        MCP["MCP Server\n(stdio)"]
    end

    AI["AI Client\n(Claude, etc.)"] -->|"stdio (JSON-RPC)"| MCP
    MCP -->|"gRPC"| SRV["laurus-server\n(always running)"]
    SRV --> Disk["Index on Disk"]

The MCP server runs as a child process launched by the AI client. It proxies all tool calls to a laurus-server instance via gRPC. The laurus-server must be started separately before (or after) the MCP server.

Quick Start

# Step 1: Start the laurus-server
laurus serve --port 50051

# Step 2: Configure Claude Code and start the MCP server
claude mcp add laurus -- laurus mcp --endpoint http://localhost:50051

Or with a manual configuration:

{
  "mcpServers": {
    "laurus": {
      "command": "laurus",
      "args": ["mcp", "--endpoint", "http://localhost:50051"]
    }
  }
}

Sections

Getting Started with laurus-mcp

Prerequisites

  • The laurus CLI binary installed (cargo install laurus-cli)
  • A running laurus-server instance (see laurus-server getting started)
  • An AI client that supports MCP (Claude Desktop, Claude Code, etc.)

Configuration

Step 1: Start laurus-server

laurus serve --port 50051

Step 2: Configure the MCP client

Claude Code

Use the CLI command (recommended):

claude mcp add laurus -- laurus mcp --endpoint http://localhost:50051

Or edit ~/.claude/settings.json directly:

{
  "mcpServers": {
    "laurus": {
      "command": "laurus",
      "args": ["mcp", "--endpoint", "http://localhost:50051"]
    }
  }
}

Claude Desktop

Edit the configuration file for your platform:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "laurus": {
      "command": "laurus",
      "args": ["mcp", "--endpoint", "http://localhost:50051"]
    }
  }
}

Usage Workflows

Workflow 1: Pre-created index

Create the index using the CLI first, then use the MCP server to query it:

# Step 1: Create a schema file
cat > schema.toml << 'EOF'
[fields.title]
Text = { indexed = true, stored = true }

[fields.body]
Text = { indexed = true, stored = true }
EOF

# Step 2: Start the server and create the index
laurus serve --port 50051 &
laurus create index --schema schema.toml

# Step 3: Register the MCP server with Claude Code
claude mcp add laurus -- laurus mcp --endpoint http://localhost:50051

Workflow 2: AI-driven index creation

Start laurus-server first, then register the MCP server and let the AI create the index:

# Step 1: Start laurus-server (no index required)
laurus serve --port 50051

# Step 2: Register the MCP server with Claude Code
claude mcp add laurus -- laurus mcp --endpoint http://localhost:50051

Then ask Claude:

“Create a search index for blog posts. I need to search by title and body text, and I want to store the author and publication date.”

Claude will design the schema and call create_index automatically.

Workflow 3: Connect at runtime

Register the MCP server without specifying an endpoint:

claude mcp add laurus -- laurus mcp

Or edit the settings file directly:

{
  "mcpServers": {
    "laurus": {
      "command": "laurus",
      "args": ["mcp"]
    }
  }
}

Then ask Claude to connect:

“Connect to the laurus server at http://localhost:50051”

Claude will call connect(endpoint: "http://localhost:50051") before using other tools.

Removing the MCP Server

To remove the registered MCP server from Claude Code:

claude mcp remove laurus

For Claude Desktop, remove the laurus entry from the configuration file and restart the application.

Lifecycle

laurus-server starts (separate process)
  └─ listens on gRPC port 50051

Claude starts
  └─ spawns: laurus mcp --endpoint http://localhost:50051
       └─ enters stdio event loop
            ├─ receives tool calls via stdin
            ├─ proxies calls to laurus-server via gRPC
            └─ sends results via stdout
Claude exits
  └─ laurus-mcp process terminates
  └─ laurus-server continues running

MCP Tools Reference

The laurus MCP server exposes the following tools.

connect

Connect to a running laurus-server gRPC endpoint. Call this before using other tools if the server was started without the --endpoint flag, or to switch to a different laurus-server at runtime.

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `endpoint` | string | Yes | gRPC endpoint URL (e.g. `http://localhost:50051`) |

Example

Tool: connect
endpoint: "http://localhost:50051"

Result: Connected to laurus-server at http://localhost:50051.


create_index

Create a new search index with the provided schema.

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `schema_json` | string | Yes | Schema definition as a JSON string |

Schema JSON format

FieldOption uses serde’s externally-tagged representation where the variant name is the key:

{
  "fields": {
    "title":     { "Text":    { "indexed": true, "stored": true } },
    "body":      { "Text":    {} },
    "score":     { "Float":   {} },
    "count":     { "Integer": {} },
    "active":    { "Boolean": {} },
    "created":   { "DateTime": {} },
    "embedding": { "Hnsw":    { "dimension": 384 } }
  }
}

Example

Tool: create_index
schema_json: {"fields": {"title": {"Text": {}}, "body": {"Text": {}}}}

Result: Index created successfully at /path/to/index.


add_field

Add a new field to the index.

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `name` | string | Yes | The field name |
| `field_option_json` | string | Yes | Field configuration as JSON |

Example

{
  "name": "category",
  "field_option_json": "{\"Text\": {\"indexed\": true, \"stored\": true}}"
}

Result: Field 'category' added successfully.


delete_field

Remove a field from the index schema. Existing indexed data remains in storage but becomes inaccessible. Per-field analyzers and embedders are unregistered.

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `name` | string | Yes | The name of the field to remove |

Example

{
  "name": "category"
}

Result: Field 'category' deleted successfully.


get_stats

Get statistics for the current search index, including document count and vector field information.

Parameters

None.

Result

{
  "document_count": 42,
  "vector_fields": ["embedding"]
}

get_schema

Get the current index schema, including all field definitions and their configurations.

Parameters

None.

Result

{
  "fields": {
    "title": { "Text": { "indexed": true, "stored": true } },
    "body": { "Text": {} },
    "embedding": { "Hnsw": { "dimension": 384 } }
  },
  "default_fields": ["title", "body"]
}

put_document

Put (upsert) a document into the index. If a document with the same ID already exists, all its chunks are deleted before the new document is indexed. Call commit after adding documents.

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `id` | string | Yes | External document identifier |
| `document` | object | Yes | Document fields as a JSON object |

Example

Tool: put_document
id: "doc-1"
document: {"title": "Hello World", "body": "This is a test document."}

Result: Document 'doc-1' put (upserted). Call commit to persist changes.


add_document

Add a document as a new chunk to the index. Unlike put_document, this appends without deleting existing documents with the same ID. Useful for splitting large documents into chunks. Call commit after adding documents.

Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `id` | string | Yes | External document identifier |
| `document` | object | Yes | Document fields as a JSON object |

Example

Tool: add_document
id: "doc-1"
document: {"title": "Hello World - Part 2", "body": "This is a continuation."}

Result: Document 'doc-1' added as chunk. Call commit to persist changes.


get_documents

Retrieve all stored documents (including chunks) by external ID.

Parameters

Name  Type    Required  Description
id    string  Yes       External document identifier

Result

{
  "id": "doc-1",
  "documents": [
    { "title": "Hello World", "body": "This is a test document." }
  ]
}

delete_documents

Delete all documents and chunks sharing the given external ID from the index. Call commit after deletion.

Parameters

Name  Type    Required  Description
id    string  Yes       External document identifier

Result: Documents 'doc-1' deleted. Call commit to persist changes.


commit

Commit pending changes to disk. Must be called after put_document, add_document, or delete_documents to make changes searchable and durable.

Parameters

None.

Result: Changes committed successfully.


search

Search documents using the Laurus unified query DSL. Supports lexical search, vector search, and hybrid search in a single query string.

Parameters

Name          Type     Required  Description
query         string   Yes       Search query in the Laurus unified query DSL
limit         integer  No        Maximum results (default: 10)
offset        integer  No        Results to skip for pagination (default: 0)
fusion        string   No        Fusion algorithm as JSON (for hybrid search)
field_boosts  string   No        Per-field boost factors as JSON

Query DSL examples

Lexical search

Query                       Description
hello                       Term search across default fields
title:hello                 Field-scoped term search
title:hello AND body:world  Boolean AND
"exact phrase"              Phrase search
roam~2                      Fuzzy search (edit distance 2)
count:[1 TO 10]             Range search
title:helo~1                Fuzzy field search

Vector search

Query                      Description
content:"cute kitten"      Vector search on a field (field must be a vector field in the schema)
content:python             Vector search with unquoted text
content:"cute kitten"^0.8  Vector search with a weight/boost
a:"cats" b:"dogs"^0.5      Multiple vector queries

Hybrid search

Query                                          Description
title:hello content:"cute kitten"              Lexical + vector (OR/union: results from either)
title:hello +content:"cute kitten"             Lexical + vector (AND/intersection: only results in both)
+title:hello +content:"cute kitten"            Both required (AND); + on a lexical field marks a required clause
title:hello AND body:world content:"cats"^0.8  Boolean lexical + weighted vector

Fusion algorithm examples

{"rrf": {"k": 60.0}}
{"weighted_sum": {"lexical_weight": 0.7, "vector_weight": 0.3}}

Field boosts example

{"title": 2.0, "body": 1.0}

Result

{
  "total": 2,
  "results": [
    {
      "id": "doc-1",
      "score": 3.14,
      "document": { "title": "Hello World", "body": "..." }
    },
    {
      "id": "doc-2",
      "score": 1.57,
      "document": { "title": "Hello Again", "body": "..." }
    }
  ]
}

Typical Workflow

1. connect          → connect to a running laurus-server
2. create_index     → define the schema (if index does not exist)
3. add_field        → dynamically add fields (optional)
   delete_field     → remove fields (optional)
4. put_document     → upsert documents (repeat as needed)
   add_document     → append document chunks (optional)
5. commit           → persist changes to disk
6. search           → query the index
7. get_documents    → retrieve documents by ID
8. delete_documents → remove documents
9. commit           → persist changes

Python Binding Overview

The laurus-python package provides Python bindings for the Laurus search engine. It is built as a native Rust extension using PyO3 and Maturin, giving Python programs direct access to Laurus’s lexical, vector, and hybrid search capabilities with near-native performance.

Features

  • Lexical Search – Full-text search powered by an inverted index with BM25 scoring
  • Vector Search – Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes
  • Hybrid Search – Combine lexical and vector results with fusion algorithms (RRF, WeightedSum)
  • Rich Query DSL – Term, Phrase, Fuzzy, Wildcard, NumericRange, Geo, Boolean, Span queries
  • Text Analysis – Tokenizers, filters, stemmers, and synonym expansion
  • Flexible Storage – In-memory (ephemeral) or file-based (persistent) indexes
  • Pythonic API – Clean, intuitive Python classes with full type information

Architecture

graph LR
    subgraph "laurus-python"
        PyIndex["Index\n(Python class)"]
        PyQuery["Query classes"]
        PySearch["SearchRequest\n/ SearchResult"]
    end

    Python["Python application"] -->|"method calls"| PyIndex
    Python -->|"query objects"| PyQuery
    PyIndex -->|"PyO3 FFI"| Engine["laurus::Engine\n(Rust)"]
    PyQuery -->|"PyO3 FFI"| Engine
    Engine --> Storage["Storage\n(Memory / File)"]

The Python classes are thin wrappers around the Rust engine. Each call crosses the PyO3 FFI boundary once; the Rust engine then executes the operation entirely in native code.

Although the Rust engine uses async I/O internally, all Python methods are exposed as synchronous functions. This is because Python’s GIL (Global Interpreter Lock) prevents true concurrent execution within a single interpreter, making an async API cumbersome (it would require asyncio.run() everywhere). Instead, each method calls tokio::runtime::Runtime::block_on() under the hood to bridge async Rust to synchronous Python.

Note: The Node.js binding (laurus-nodejs) exposes the same Rust engine methods as native async / Promise APIs, since Node.js’s event loop supports async natively.

Quick Start

import laurus

# Create an in-memory index
index = laurus.Index()

# Index documents
index.put_document("doc1", {"title": "Introduction to Rust", "body": "Systems programming language."})
index.put_document("doc2", {"title": "Python for Data Science", "body": "Data analysis with Python."})
index.commit()

# Search
results = index.search("title:rust", limit=5)
for r in results:
    print(f"[{r.id}] score={r.score:.4f}  {r.document['title']}")

Sections

Installation

From PyPI

pip install laurus

From source

Building from source requires a Rust toolchain (1.75 or later) and Maturin.

# Install Maturin
pip install maturin

# Clone the repository
git clone https://github.com/mosuka/laurus.git
cd laurus/laurus-python

# Build and install in development mode
maturin develop

# Or build a release wheel
maturin build --release
pip install target/wheels/laurus-*.whl

Verify

import laurus
index = laurus.Index()
print(index)  # Index()

Requirements

  • Python 3.8 or later
  • No runtime dependencies beyond the compiled native extension

Quick Start

1. Create an index

import laurus

# In-memory index (ephemeral, useful for prototyping)
index = laurus.Index()

# File-based index (persistent)
schema = laurus.Schema()
schema.add_text_field("title")
schema.add_text_field("body")
index = laurus.Index(path="./myindex", schema=schema)

2. Index documents

index.put_document("doc1", {
    "title": "Introduction to Rust",
    "body": "Rust is a systems programming language focused on safety and performance.",
})
index.put_document("doc2", {
    "title": "Python for Data Science",
    "body": "Python is widely used for data analysis and machine learning.",
})
index.commit()

3. Lexical search

# DSL string
results = index.search("title:rust", limit=5)

# Query object
results = index.search(laurus.TermQuery("body", "python"), limit=5)

# Print results
for r in results:
    print(f"[{r.id}] score={r.score:.4f}  {r.document['title']}")

4. Vector search

Vector search requires a schema with a vector field and pre-computed embeddings.

import laurus
import numpy as np

schema = laurus.Schema()
schema.add_text_field("title")
schema.add_hnsw_field("embedding", dimension=4)

index = laurus.Index(schema=schema)
index.put_document("doc1", {"title": "Rust", "embedding": [0.1, 0.2, 0.3, 0.4]})
index.put_document("doc2", {"title": "Python", "embedding": [0.9, 0.8, 0.7, 0.6]})
index.commit()

query_vec = [0.1, 0.2, 0.3, 0.4]
results = index.search(laurus.VectorQuery("embedding", query_vec), limit=3)

5. Hybrid search

request = laurus.SearchRequest(
    lexical_query=laurus.TermQuery("title", "rust"),
    vector_query=laurus.VectorQuery("embedding", query_vec),
    fusion=laurus.RRF(k=60.0),
    limit=5,
)
results = index.search(request)

6. Update and delete

# Update: put_document replaces all existing versions
index.put_document("doc1", {"title": "Updated Title", "body": "New content."})
index.commit()

# Append a new version without removing existing ones (RAG chunking pattern)
index.add_document("doc1", {"title": "Chunk 2", "body": "Additional chunk."})
index.commit()

# Retrieve all versions
docs = index.get_documents("doc1")

# Delete
index.delete_documents("doc1")
index.commit()

7. Schema management

schema = laurus.Schema()
schema.add_text_field("title")
schema.add_text_field("body")
schema.add_int_field("year")
schema.add_float_field("score")
schema.add_bool_field("published")
schema.add_bytes_field("thumbnail")
schema.add_geo_field("location")
schema.add_datetime_field("created_at")
schema.add_hnsw_field("embedding", dimension=384)
schema.add_flat_field("small_vec", dimension=64)
schema.add_ivf_field("ivf_vec", dimension=128, n_clusters=100)

8. Index statistics

stats = index.stats()
print(stats["document_count"])
print(stats["vector_fields"])

API Reference

Index

The primary entry point. Wraps the Laurus search engine.

class Index:
    def __init__(self, path: str | None = None, schema: Schema | None = None) -> None: ...

Constructor

Parameter  Type           Default  Description
path       str | None     None     Directory path for persistent storage. None creates an in-memory index.
schema     Schema | None  None     Schema definition. An empty schema is used when omitted.

Methods

Method                                                      Description
put_document(id, doc)                                       Upsert a document. Replaces all existing versions with the same ID.
add_document(id, doc)                                       Append a document chunk without removing existing versions.
get_documents(id) -> list[dict]                             Return all stored versions for the given ID.
delete_documents(id)                                        Delete all versions for the given ID.
commit()                                                    Flush buffered writes and make all pending changes searchable.
search(query, *, limit=10, offset=0) -> list[SearchResult]  Execute a search query.
stats() -> dict                                             Return index statistics (document_count, vector_fields).

search query argument

The query parameter accepts any of the following:

  • A DSL string (e.g. "title:hello", "embedding:\"memory safety\"")
  • A lexical query object (TermQuery, PhraseQuery, BooleanQuery, …)
  • A vector query object (VectorQuery, VectorTextQuery)
  • A SearchRequest for full control

Schema

Defines the fields and index types for an Index.

class Schema:
    def __init__(self) -> None: ...

Field methods

Method                                                                            Description
add_text_field(name)                                                              Full-text field (inverted index, BM25).
add_int_field(name)                                                               64-bit integer field.
add_float_field(name)                                                             64-bit float field.
add_bool_field(name)                                                              Boolean field.
add_bytes_field(name)                                                             Raw bytes field.
add_geo_field(name)                                                               Geographic coordinate field (lat/lon).
add_datetime_field(name)                                                          UTC datetime field.
add_hnsw_field(name, dimension, *, distance="cosine", m=16, ef_construction=100)  HNSW approximate nearest-neighbor vector field.
add_flat_field(name, dimension, *, distance="cosine")                             Flat (brute-force) vector field.
add_ivf_field(name, dimension, *, distance="cosine", n_clusters=100, n_probe=1)   IVF approximate nearest-neighbor vector field.

Distance metrics

Value          Description
"cosine"       Cosine similarity (default)
"euclidean"    Euclidean distance
"dot_product"  Dot product

Query classes

TermQuery

TermQuery(field: str, term: str)

Matches documents containing the exact term in the given field.

PhraseQuery

PhraseQuery(field: str, terms: list[str])

Matches documents containing the terms in order.

FuzzyQuery

FuzzyQuery(field: str, term: str, max_edits: int = 1)

Approximate match allowing up to max_edits edit-distance errors.
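Here "edit distance" means Levenshtein distance: single-character insertions, deletions, and substitutions. A plain-Python reference implementation for intuition only; the engine's internal matching is more efficient than this dynamic-programming sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# FuzzyQuery("title", "helo", max_edits=1) matches "hello":
# one insertion brings the distance to 1.
print(levenshtein("helo", "hello"))  # 1
```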

WildcardQuery

WildcardQuery(field: str, pattern: str)

Pattern match. * matches any sequence of characters, ? matches any single character.

NumericRangeQuery

NumericRangeQuery(field: str, min: int | float | None, max: int | float | None)

Matches numeric values in the range [min, max]. Pass None for an open bound.

GeoQuery

GeoQuery(field: str, lat: float, lon: float, radius_km: float)

Geo-distance search. Returns documents whose (lat, lon) coordinate is within radius_km of the given point.
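The distance involved is great-circle distance. A haversine sketch illustrates the kind of check the engine performs; the exact formula and earth radius used internally may differ.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(doc_lat, doc_lon, lat, lon, radius_km):
    """The predicate GeoQuery evaluates for each candidate document."""
    return haversine_km(doc_lat, doc_lon, lat, lon) <= radius_km
```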

BooleanQuery

BooleanQuery(
    must: list[Query] | None = None,
    should: list[Query] | None = None,
    must_not: list[Query] | None = None,
)

Compound boolean query: all must clauses have to match, at least one should clause must match, and no must_not clause may match.

SpanNearQuery

SpanNearQuery(field: str, terms: list[str], slop: int = 0, in_order: bool = True)

Matches documents where the terms appear within slop positions of each other.
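Intuitively, slop is the number of extra positions the matched span may cover beyond the terms themselves. A simplified position check (single-position tokens assumed; the engine's span evaluation is more general):

```python
from itertools import product

def span_near_matches(positions_by_term, slop=0, in_order=True):
    """positions_by_term: one list of token positions per query term.
    Returns True if some choice of positions fits within `slop`."""
    for combo in product(*positions_by_term):
        if in_order and list(combo) != sorted(combo):
            continue
        width = max(combo) - min(combo) + 1
        # positions covered beyond the terms themselves must not exceed slop
        if width - len(combo) <= slop:
            return True
    return False
```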

VectorQuery

VectorQuery(field: str, vector: list[float])

Approximate nearest-neighbor search using a pre-computed embedding vector.

VectorTextQuery

VectorTextQuery(field: str, text: str)

Converts text to an embedding at query time and runs vector search. Requires an embedder configured on the index.


SearchRequest

Full-featured search request for advanced control.

class SearchRequest:
    def __init__(
        self,
        *,
        query=None,
        lexical_query=None,
        vector_query=None,
        filter_query=None,
        fusion=None,
        limit: int = 10,
        offset: int = 0,
    ) -> None: ...

Parameter      Description
query          A DSL string or single query object. Mutually exclusive with lexical_query / vector_query.
lexical_query  Lexical component for explicit hybrid search.
vector_query   Vector component for explicit hybrid search.
filter_query   Lexical filter applied after scoring.
fusion         Fusion algorithm (RRF or WeightedSum). Defaults to RRF(k=60) when both components are set.
limit          Maximum number of results (default 10).
offset         Pagination offset (default 0).

SearchResult

Returned by Index.search().

class SearchResult:
    id: str          # External document identifier
    score: float     # Relevance score
    document: dict | None  # Retrieved field values, or None if deleted

Fusion algorithms

RRF

RRF(k: float = 60.0)

Reciprocal Rank Fusion. Merges lexical and vector result lists by rank position. k is a smoothing constant; higher values reduce the influence of top-ranked results.
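The rank-based formula can be sketched in plain Python (illustrative only; `rrf_fuse` is a made-up helper, not part of the laurus API, and the engine's internal implementation may differ):

```python
def rrf_fuse(lexical_ids, vector_ids, k=60.0):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fused = rrf_fuse(["doc1", "doc2", "doc3"], ["doc2", "doc3"])
# doc2 appears near the top of both lists, so it wins;
# doc1 appears in only one list and falls behind.
```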

WeightedSum

WeightedSum(lexical_weight: float = 0.5, vector_weight: float = 0.5)

Normalises both score lists independently, then combines them as lexical_weight * lexical_score + vector_weight * vector_score.
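A min-max normalisation sketch shows the idea (the helper names are illustrative and the engine may use a different normalisation scheme):

```python
def min_max(scores):
    """Normalise a {doc_id: raw_score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_sum(lexical, vector, lexical_weight=0.5, vector_weight=0.5):
    """Combine normalised lexical and vector scores per document."""
    lex, vec = min_max(lexical), min_max(vector)
    return {d: lexical_weight * lex.get(d, 0.0) + vector_weight * vec.get(d, 0.0)
            for d in set(lex) | set(vec)}
```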


Text analysis

SynonymDictionary

class SynonymDictionary:
    def __init__(self) -> None: ...
    def add_synonym_group(self, synonyms: list[str]) -> None: ...

WhitespaceTokenizer

class WhitespaceTokenizer:
    def __init__(self) -> None: ...
    def tokenize(self, text: str) -> list[Token]: ...

SynonymGraphFilter

class SynonymGraphFilter:
    def __init__(
        self,
        dictionary: SynonymDictionary,
        keep_original: bool = True,
        boost: float = 1.0,
    ) -> None: ...
    def apply(self, tokens: list[Token]) -> list[Token]: ...

Token

class Token:
    text: str
    position: int
    position_increment: int
    position_length: int
    boost: float
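For intuition: a synonym inserted by SynonymGraphFilter typically shares the position of the original token via position_increment=0. A hedged sketch using plain dicts (the real Token class is the native one above; this only illustrates the position semantics):

```python
def expand_synonyms(tokens, synonyms):
    """tokens: list of {"text", "position_increment"} dicts.
    synonyms: {term: [alternatives]}.
    Inserts each synonym at the same graph position as the original token."""
    out = []
    for tok in tokens:
        out.append(tok)
        for alt in synonyms.get(tok["text"], []):
            # position_increment 0 means "same position as the previous token"
            out.append({"text": alt, "position_increment": 0})
    return out

tokens = [{"text": "fast", "position_increment": 1},
          {"text": "car", "position_increment": 1}]
expanded = expand_synonyms(tokens, {"fast": ["quick"]})
# "quick" occupies the same position as "fast", so a phrase
# query for "quick car" can still match the document.
```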

Field value types

Python values are automatically converted to Laurus DataValue types:

Python type        Laurus type  Notes
None               Null
bool               Bool         Checked before int
int                Int64
float              Float64
str                Text
bytes              Bytes
list[float]        Vector       Elements coerced to f32
(lat, lon) tuple   Geo          Two float values
datetime.datetime  DateTime     Converted via isoformat()

Node.js Binding Overview

The laurus-nodejs package provides Node.js/TypeScript bindings for the Laurus search engine. It is built as a native addon using napi-rs, giving Node.js programs direct access to Laurus’s lexical, vector, and hybrid search capabilities with near-native performance.

Features

  • Lexical Search – Full-text search powered by an inverted index with BM25 scoring
  • Vector Search – Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes
  • Hybrid Search – Combine lexical and vector results with fusion algorithms (RRF, WeightedSum)
  • Rich Query DSL – Term, Phrase, Fuzzy, Wildcard, NumericRange, Geo, Boolean, Span queries
  • Text Analysis – Tokenizers, filters, stemmers, and synonym expansion
  • Flexible Storage – In-memory (ephemeral) or file-based (persistent) indexes
  • TypeScript Types – Auto-generated .d.ts type definitions
  • Async API – All I/O operations return Promises

Architecture

graph LR
    subgraph "laurus-nodejs"
        JsIndex["Index\n(JS class)"]
        JsQuery["Query classes"]
        JsSearch["SearchRequest\n/ SearchResult"]
    end

    Node["Node.js application"] -->|"method calls"| JsIndex
    Node -->|"query objects"| JsQuery
    JsIndex -->|"napi-rs FFI"| Engine["laurus::Engine\n(Rust)"]
    JsQuery -->|"napi-rs FFI"| Engine
    Engine --> Storage["Storage\n(Memory / File)"]

The JavaScript classes are thin wrappers around the Rust engine. Each call crosses the napi-rs FFI boundary once; the Rust engine then executes the operation entirely in native code.

All I/O methods (search, commit, putDocument, etc.) are async and return Promises. They run on napi-rs’s built-in tokio runtime and return results to the Node.js event loop without blocking it. Schema construction, query creation, and stats() are synchronous since they involve no I/O.

Note: The Python binding (laurus-python) exposes the same Rust engine methods as synchronous functions because Python’s GIL (Global Interpreter Lock) makes an async API cumbersome. Node.js has no such constraint, so the async Rust engine is exposed directly as Promises.

Quick Start

import { Index, Schema } from "laurus-nodejs";

// Create an in-memory index
const schema = new Schema();
schema.addTextField("name");
schema.addTextField("description");
schema.setDefaultFields(["name", "description"]);

const index = await Index.create(null, schema);

// Index documents
await index.putDocument("express", {
  name: "Express",
  description: "Fast minimalist web framework for Node.js.",
});
await index.putDocument("fastify", {
  name: "Fastify",
  description: "Fast and low overhead web framework.",
});
await index.commit();

// Search
const results = await index.search("framework", 5);
for (const r of results) {
  console.log(`[${r.id}] score=${r.score.toFixed(4)}  ${r.document.name}`);
}

Sections

Installation

From npm

npm install laurus-nodejs

From source

Building from source requires a Rust toolchain (1.85 or later) and Node.js 18+.

# Clone the repository
git clone https://github.com/mosuka/laurus.git
cd laurus/laurus-nodejs

# Install dependencies
npm install

# Build the native module (release)
npm run build

# Or build in debug mode (faster builds)
npm run build:debug

Verify

import { Index } from "laurus-nodejs";
const index = await Index.create();
console.log(index.stats());
// { documentCount: 0, vectorFields: {} }

Requirements

  • Node.js 18 or later
  • No runtime dependencies beyond the compiled native addon

Quick Start

1. Create an index

import { Index, Schema } from "laurus-nodejs";

// In-memory index (ephemeral, useful for prototyping)
const index = await Index.create();

// File-based index (persistent)
const schema = new Schema();
schema.addTextField("name");
schema.addTextField("description");
const persistentIndex = await Index.create("./myindex", schema);

2. Index documents

await index.putDocument("express", {
  name: "Express",
  description: "Fast minimalist web framework for Node.js.",
});
await index.putDocument("fastify", {
  name: "Fastify",
  description: "Fast and low overhead web framework.",
});
await index.commit();

3. Lexical search

// DSL string
const results = await index.search("name:express", 5);

// Term query
const results2 = await index.searchTerm(
  "description", "framework", 5,
);

// Print results
for (const r of results) {
  console.log(`[${r.id}] score=${r.score.toFixed(4)}  ${r.document.name}`);
}

4. Vector search

Vector search requires a schema with a vector field and pre-computed embeddings.

import { Index, Schema } from "laurus-nodejs";

const schema = new Schema();
schema.addTextField("name");
schema.addHnswField("embedding", 4);

const index = await Index.create(null, schema);
await index.putDocument("express", {
  name: "Express",
  embedding: [0.1, 0.2, 0.3, 0.4],
});
await index.putDocument("pg", {
  name: "pg",
  embedding: [0.9, 0.8, 0.7, 0.6],
});
await index.commit();

const results = await index.searchVector(
  "embedding", [0.1, 0.2, 0.3, 0.4], 3,
);

5. Hybrid search

import { SearchRequest } from "laurus-nodejs";

const req = new SearchRequest(5);
req.setLexicalTermQuery("name", "express");
req.setVectorQuery("embedding", [0.1, 0.2, 0.3, 0.4]);
req.setRrfFusion(60.0);

const results = await index.searchWithRequest(req);

6. Update and delete

// Update: putDocument replaces all existing versions
await index.putDocument("express", {
  name: "Express v5",
  description: "Updated content.",
});
await index.commit();

// Append a new version (RAG chunking pattern)
await index.addDocument("express", {
  name: "Express chunk 2",
  description: "Additional chunk.",
});
await index.commit();

// Retrieve all versions
const docs = await index.getDocuments("express");

// Delete
await index.deleteDocuments("express");
await index.commit();

7. Schema management

const schema = new Schema();
schema.addTextField("name");
schema.addTextField("description");
schema.addIntegerField("stars");
schema.addFloatField("score");
schema.addBooleanField("published");
schema.addBytesField("thumbnail");
schema.addGeoField("location");
schema.addDatetimeField("createdAt");
schema.addHnswField("embedding", 384);
schema.addFlatField("smallVec", 64);
schema.addIvfField("ivfVec", 128, "cosine", 100, 1);

8. Index statistics

const stats = index.stats();
console.log(stats.documentCount);
console.log(stats.vectorFields);

API Reference

Index

The primary entry point. Wraps the Laurus search engine.

class Index {
  static create(
    path?: string | null,
    schema?: Schema,
  ): Promise<Index>;
}

Factory method

Parameter  Type           Default  Description
path       string | null  null     Directory for persistent storage. null creates an in-memory index.
schema     Schema         empty    Schema definition.

Methods

Method                                          Description
putDocument(id, doc)                            Upsert a document. Replaces all existing versions.
addDocument(id, doc)                            Append a document chunk without removing existing versions.
getDocuments(id)                                Return all stored versions for the given ID.
deleteDocuments(id)                             Delete all versions for the given ID.
commit()                                        Flush writes and make pending changes searchable.
search(query, limit?, offset?)                  Search with a DSL string.
searchTerm(field, term, limit?, offset?)        Search with an exact term match.
searchVector(field, vector, limit?, offset?)    Search with a pre-computed vector.
searchVectorText(field, text, limit?, offset?)  Search with text (auto-embedded).
searchWithRequest(request)                      Search with a SearchRequest.
stats()                                         Return index statistics.

All document methods and search methods are async and return Promises. stats() is synchronous.


Schema

Defines the fields and index types for an Index.

class Schema {
  constructor();
}

Field methods

Method                                                                    Description
addTextField(name, stored?, indexed?, termVectors?, analyzer?)            Full-text field (inverted index, BM25).
addIntegerField(name, stored?, indexed?)                                  64-bit integer field.
addFloatField(name, stored?, indexed?)                                    64-bit float field.
addBooleanField(name, stored?, indexed?)                                  Boolean field.
addBytesField(name, stored?)                                              Raw bytes field.
addGeoField(name, stored?, indexed?)                                      Geographic coordinate field.
addDatetimeField(name, stored?, indexed?)                                 UTC datetime field.
addHnswField(name, dimension, distance?, m?, efConstruction?, embedder?)  HNSW vector field.
addFlatField(name, dimension, distance?, embedder?)                       Flat (brute-force) vector field.
addIvfField(name, dimension, distance?, nClusters?, nProbe?, embedder?)   IVF vector field.
addEmbedder(name, config)                                                 Register a named embedder.
setDefaultFields(fields)                                                  Set default search fields.
fieldNames()                                                              Return all field names.

Distance metrics

Value          Description
"cosine"       Cosine similarity (default)
"euclidean"    Euclidean distance
"dot_product"  Dot product
"manhattan"    Manhattan distance
"angular"      Angular distance

Query classes

TermQuery

new TermQuery(field: string, term: string)

Matches documents containing the exact term in the given field.

PhraseQuery

new PhraseQuery(field: string, terms: string[])

Matches documents containing the terms in order.

FuzzyQuery

new FuzzyQuery(field: string, term: string, maxEdits?: number)

Approximate match allowing up to maxEdits edit-distance errors (default 2).

WildcardQuery

new WildcardQuery(field: string, pattern: string)

Pattern match. * matches any sequence, ? matches one character.

NumericRangeQuery

new NumericRangeQuery(
  field: string,
  min?: number | null,
  max?: number | null,
  isFloat?: boolean,
)

Matches numeric values in [min, max]. Pass null for an open bound.

GeoQuery

GeoQuery.withinRadius(
  field: string, lat: number, lon: number, distanceKm: number,
): GeoQuery

GeoQuery.withinBoundingBox(
  field: string,
  minLat: number, minLon: number,
  maxLat: number, maxLon: number,
): GeoQuery

Geographic search by radius or bounding box.

BooleanQuery

class BooleanQuery {
  constructor();
  mustTerm(field: string, term: string): void;
  shouldTerm(field: string, term: string): void;
  mustNotTerm(field: string, term: string): void;
}

Compound boolean query with MUST / SHOULD / MUST_NOT clauses.

SpanQuery

SpanQuery.term(field: string, term: string): SpanQuery
SpanQuery.near(
  field: string, terms: string[],
  slop?: number, ordered?: boolean,
): SpanQuery
SpanQuery.nearSpans(
  field: string, clauses: SpanQuery[],
  slop?: number, ordered?: boolean,
): SpanQuery
SpanQuery.containing(
  field: string, big: SpanQuery, little: SpanQuery,
): SpanQuery
SpanQuery.within(
  field: string,
  include: SpanQuery, exclude: SpanQuery, distance: number,
): SpanQuery

Positional/proximity span queries.

VectorQuery

new VectorQuery(field: string, vector: number[])

Nearest-neighbor search using a pre-computed embedding vector.

VectorTextQuery

new VectorTextQuery(field: string, text: string)

Converts text to an embedding at query time. Requires an embedder configured on the index.


SearchRequest

Full-featured search request for advanced control.

class SearchRequest {
  constructor(limit?: number, offset?: number);
}

Setter methods

Method                                               Description
setQueryDsl(dsl)                                     Set a DSL string query.
setLexicalTermQuery(field, term)                     Set a term-based lexical query.
setLexicalPhraseQuery(field, terms)                  Set a phrase-based lexical query.
setVectorQuery(field, vector)                        Set a pre-computed vector query.
setVectorTextQuery(field, text)                      Set a text-based vector query.
setFilterQuery(field, term)                          Set a post-scoring filter.
setRrfFusion(k?)                                     Use RRF fusion (default k=60).
setWeightedSumFusion(lexicalWeight?, vectorWeight?)  Use weighted sum fusion.

SearchResult

Returned by search methods as an array.

interface SearchResult {
  id: string;        // External document identifier
  score: number;     // Relevance score
  document: object | null; // Retrieved fields, or null
}

Fusion algorithms

RRF

new RRF(k?: number)  // default 60.0

Reciprocal Rank Fusion. Merges lexical and vector result lists by rank position.

WeightedSum

new WeightedSum(
  lexicalWeight?: number,  // default 0.5
  vectorWeight?: number,   // default 0.5
)

Normalises both score lists independently, then combines them.


Text analysis

SynonymDictionary

class SynonymDictionary {
  constructor();
  addSynonymGroup(terms: string[]): void;
}

WhitespaceTokenizer

class WhitespaceTokenizer {
  constructor();
  tokenize(text: string): Token[];
}

SynonymGraphFilter

class SynonymGraphFilter {
  constructor(
    dictionary: SynonymDictionary,
    keepOriginal?: boolean,  // default true
    boost?: number,          // default 1.0
  );
  apply(tokens: Token[]): Token[];
}

Token

interface Token {
  text: string;
  position: number;
  startOffset: number;
  endOffset: number;
  boost: number;
  stopped: boolean;
  positionIncrement: number;
  positionLength: number;
}

Field value types

JavaScript values are automatically converted to Laurus DataValue types:

JavaScript type   Laurus type  Notes
null              Null
boolean           Bool
number (integer)  Int64
number (float)    Float64
string            Text         ISO8601 strings become DateTime
number[]          Vector       Coerced to f32
{ lat, lon }      Geo          Two number values
Date              DateTime     Via timestamp
Buffer            Bytes

Development Setup

This page covers how to set up a local development environment for the laurus-nodejs binding, build it, and run the test suite.

Prerequisites

  • Rust 1.85 or later with Cargo
  • Node.js 18 or later with npm
  • Repository cloned locally
git clone https://github.com/mosuka/laurus.git
cd laurus

Build

Development build

Compiles the Rust native addon in debug mode. Re-run after any Rust source change.

cd laurus-nodejs
npm install
npm run build:debug

Release build

npm run build

Verify the build

node -e "
const { Index } = require('./index.js');
Index.create().then(idx => console.log(idx.stats()));
"
// { documentCount: 0, vectorFields: {} }

Testing

Tests use Vitest and are located in __tests__/.

# Run all tests
npm test

To run a specific test by name:

npx vitest run -t "searches with DSL string"

Linting and formatting

# Rust lint (Clippy)
cargo clippy -p laurus-nodejs -- -D warnings

# Rust formatting
cargo fmt -p laurus-nodejs --check

# Apply formatting
cargo fmt -p laurus-nodejs

Cleaning up

# Remove build artifacts
rm -f *.node index.js index.d.ts

# Remove node_modules
rm -rf node_modules

Project layout

laurus-nodejs/
├── Cargo.toml          # Rust crate manifest
├── build.rs            # napi-build setup
├── package.json        # npm package metadata
├── README.md           # English README
├── README_ja.md        # Japanese README
├── src/                # Rust source (napi-rs binding)
│   ├── lib.rs          # Module registration
│   ├── index.rs        # Index class
│   ├── schema.rs       # Schema class
│   ├── query.rs        # Query classes
│   ├── search.rs       # SearchRequest / SearchResult / Fusion
│   ├── analysis.rs     # Tokenizer / Filter / Token
│   ├── convert.rs      # JS ↔ DataValue conversion
│   └── errors.rs       # Error mapping
├── __tests__/          # Vitest integration tests
│   └── index.spec.mjs
└── examples/           # Runnable Node.js examples
    ├── quickstart.mjs
    ├── lexical-search.mjs
    ├── vector-search.mjs
    └── hybrid-search.mjs

WASM Binding Overview

The laurus-wasm package provides WebAssembly bindings for the Laurus search engine. It enables lexical, vector, and hybrid search directly in browsers and edge runtimes (Cloudflare Workers, Vercel Edge Functions, Deno Deploy) without a server.

Features

  • Lexical Search – Full-text search powered by an inverted index with BM25 scoring
  • Vector Search – Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes
  • Hybrid Search – Combine lexical and vector results with fusion algorithms (RRF, WeightedSum)
  • Rich Query DSL – Term, Phrase, Fuzzy, Wildcard, NumericRange, Geo, Boolean, Span queries
  • Text Analysis – Tokenizers, filters, and synonym expansion
  • In-memory Storage – Fast ephemeral indexes
  • OPFS Persistence – Indexes survive page reloads via the Origin Private File System
  • TypeScript Types – Auto-generated .d.ts type definitions
  • Async API – All I/O operations return Promises

Architecture

graph LR
    subgraph "laurus-wasm"
        WASM[wasm-bindgen API]
    end
    subgraph "laurus (core)"
        Engine
        MemoryStorage
    end
    subgraph "Browser"
        JS[JavaScript / TypeScript]
        OPFS[Origin Private File System]
    end
    JS --> WASM
    WASM --> Engine
    Engine --> MemoryStorage
    WASM -.->|persist| OPFS

Limitation: No In-Engine Auto-Embedding

One of Laurus’s key features on native platforms is automatic embedding – when a document is indexed, the engine can automatically convert text fields into vector embeddings using a registered embedder (Candle BERT, Candle CLIP, or OpenAI API). This allows searchVectorText("field", "query text") to work seamlessly without the caller computing embeddings.

In laurus-wasm, this feature is not available. Only the precomputed embedder type is supported. The reasons are:

| Embedder | Dependency | Why it cannot run in WASM |
|---|---|---|
| candle_bert | candle (GPU/SIMD) | Requires native SIMD intrinsics and file system for models |
| candle_clip | candle | Same as above |
| openai | reqwest (HTTP) | Requires a full async HTTP client (tokio + TLS) |

These dependencies are excluded from the WASM build via feature flags (embeddings-candle, embeddings-openai), which depend on the native feature that is disabled for wasm32-unknown-unknown.

Instead, compute embeddings on the JavaScript side and pass the precomputed vectors to putDocument() and searchVector():

// Using Transformers.js (all-MiniLM-L6-v2, 384-dim)
import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embed(text) {
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

// Index with precomputed embedding
const vec = await embed("Introduction to Rust");
await index.putDocument("doc1", { title: "Introduction to Rust", embedding: vec });
await index.commit();

// Search with precomputed query embedding
const queryVec = await embed("safe systems programming");
const results = await index.searchVector("embedding", queryVec);

This approach gives you real semantic search in the browser using the same sentence-transformer models available on native platforms, with the embedding computation handled by Transformers.js (ONNX Runtime Web) instead of candle.

When to Use laurus-wasm vs laurus-nodejs

| Criterion | laurus-wasm | laurus-nodejs |
|---|---|---|
| Environment | Browser, Edge Runtime | Node.js server |
| Performance | Good (single-threaded) | Best (native, multi-threaded) |
| Storage | In-memory + OPFS | In-memory + File system |
| Embedding | Precomputed only | Candle, OpenAI, Precomputed |
| Package | npm install laurus-wasm | npm install laurus-nodejs |
| Binary size | ~5-10 MB (WASM) | Platform-native |

Installation

npm / yarn / pnpm

npm install laurus-wasm
# or
yarn add laurus-wasm
# or
pnpm add laurus-wasm

CDN (ES Module)

<script type="module">
  import init, { Index, Schema } from 'https://unpkg.com/laurus-wasm/laurus_wasm.js';
  await init();
  // ...
</script>

Build from Source

Prerequisites: a Rust toolchain with the wasm32-unknown-unknown target, and wasm-pack.

git clone https://github.com/mosuka/laurus.git
cd laurus/laurus-wasm

# For use with bundlers (webpack, vite, etc.)
wasm-pack build --target bundler --release

# For direct browser use (<script type="module">)
wasm-pack build --target web --release

The output will be in the pkg/ directory.

Browser Compatibility

laurus-wasm requires a browser that supports:

  • WebAssembly (all modern browsers)
  • ES Modules

For OPFS persistence, the following browsers are supported:

| Browser | Minimum Version |
|---|---|
| Chrome | 102+ |
| Firefox | 111+ |
| Safari | 15.2+ |
| Edge | 102+ |

Quick Start

Basic Usage (In-memory)

import init, { Index, Schema } from 'laurus-wasm';

// Initialize the WASM module
await init();

// Define a schema
const schema = new Schema();
schema.addTextField("title");
schema.addTextField("body");
schema.setDefaultFields(["title", "body"]);

// Create an in-memory index
const index = await Index.create(schema);

// Add documents
await index.putDocument("doc1", {
  title: "Introduction to Rust",
  body: "Rust is a systems programming language"
});
await index.putDocument("doc2", {
  title: "WebAssembly Guide",
  body: "WASM enables near-native performance in the browser"
});
await index.commit();

// Search
const results = await index.search("rust programming");
for (const result of results) {
  console.log(`${result.id}: ${result.score}`);
  console.log(result.document);
}

Persistent Storage (OPFS)

import init, { Index, Schema } from 'laurus-wasm';

await init();

const schema = new Schema();
schema.addTextField("title");
schema.addTextField("body");

// Open a persistent index (data survives page reloads)
const index = await Index.open("my-index", schema);

// Add documents
await index.putDocument("doc1", {
  title: "Hello",
  body: "World"
});

// commit() persists to OPFS automatically
await index.commit();

// On next page load, Index.open("my-index") will restore the data

Vector Search

import init, { Index, Schema } from 'laurus-wasm';

await init();

const schema = new Schema();
schema.addTextField("title");
schema.addHnswField("embedding", 3); // 3-dimensional vectors

const index = await Index.create(schema);

await index.putDocument("doc1", {
  title: "Rust",
  embedding: [1.0, 0.0, 0.0]
});
await index.putDocument("doc2", {
  title: "Python",
  embedding: [0.0, 1.0, 0.0]
});
await index.commit();

// Search by vector similarity
const results = await index.searchVector("embedding", [0.9, 0.1, 0.0]);
console.log(results[0].document.title); // "Rust"

Usage with Bundlers

Vite

// vite.config.js
import wasm from 'vite-plugin-wasm';

export default {
  plugins: [wasm()]
};

Webpack 5

Webpack 5 supports WASM natively with asyncWebAssembly:

// webpack.config.js
module.exports = {
  experiments: {
    asyncWebAssembly: true
  }
};

API Reference

Index

The main entry point for creating and querying search indexes.

Static Methods

Index.create(schema?)

Create a new in-memory (ephemeral) index.

  • Parameters:
    • schema (Schema, optional) – Schema definition.
  • Returns: Promise<Index>

Index.open(name, schema?)

Open or create a persistent index backed by OPFS.

  • Parameters:
    • name (string) – Index name (OPFS subdirectory).
    • schema (Schema, optional) – Schema definition.
  • Returns: Promise<Index>

Instance Methods

putDocument(id, document)

Replace a document (upsert).

  • Parameters:
    • id (string) – Document identifier.
    • document (object) – Key-value pairs matching schema fields.
  • Returns: Promise<void>

addDocument(id, document)

Append a document version (multi-version RAG pattern).

  • Parameters / Returns: Same as putDocument.

getDocuments(id)

Retrieve all versions of a document.

  • Parameters:
    • id (string)
  • Returns: Promise<object[]>

deleteDocuments(id)

Delete all versions of a document.

  • Parameters:
    • id (string)
  • Returns: Promise<void>

commit()

Flush writes and make changes searchable. If opened with Index.open(), data is also persisted to OPFS.

  • Returns: Promise<void>

search(query, limit?, offset?)

Search using a DSL string query.

  • Parameters:
    • query (string) – Query DSL (e.g. "title:hello").
    • limit (number, default 10)
    • offset (number, default 0)
  • Returns: Promise<SearchResult[]>

searchTerm(field, term, limit?, offset?)

Search for an exact term.

  • Parameters:
    • field (string) – Field name.
    • term (string) – Exact term.
    • limit, offset (number, optional)
  • Returns: Promise<SearchResult[]>

searchVector(field, vector, limit?, offset?)

Search by vector similarity.

  • Parameters:
    • field (string) – Vector field name.
    • vector (number[]) – Query embedding.
    • limit, offset (number, optional)
  • Returns: Promise<SearchResult[]>

searchVectorText(field, text, limit?, offset?)

Search by text (embedded by the registered embedder).

  • Parameters:
    • field (string) – Vector field name.
    • text (string) – Text to embed.
    • limit, offset (number, optional)
  • Returns: Promise<SearchResult[]>

stats()

Return index statistics.

  • Returns: { documentCount: number, vectorFields: { [name]: { count, dimension } } }

Schema

Builder for defining index fields and embedders.

Constructor

new Schema()

Create an empty schema.

Methods

addTextField(name, stored?, indexed?, termVectors?, analyzer?)

Add a full-text field.

addIntegerField(name, stored?, indexed?)

Add an integer field.

addFloatField(name, stored?, indexed?)

Add a float field.

addBooleanField(name, stored?, indexed?)

Add a boolean field.

addDateTimeField(name, stored?, indexed?)

Add a date/time field.

addGeoField(name, stored?, indexed?)

Add a geographic coordinate field.

addBytesField(name, stored?)

Add a binary data field.

addHnswField(name, dimension, distance?, m?, efConstruction?, embedder?)

Add an HNSW vector index field.

  • distance: "cosine" (default), "euclidean", "dot_product", "manhattan", "angular"
  • m: Branching factor (default 16)
  • efConstruction: Build-time expansion (default 200)

addFlatField(name, dimension, distance?, embedder?)

Add a brute-force vector index field.

addIvfField(name, dimension, distance?, nClusters?, nProbe?, embedder?)

Add an IVF vector index field.

addEmbedder(name, config)

Register a named embedder. In WASM, only "precomputed" type is supported.

schema.addEmbedder("my-embedder", { type: "precomputed" });

setDefaultFields(fields)

Set the default search fields.

fieldNames()

Returns an array of defined field names.

SearchResult

interface SearchResult {
  id: string;
  score: number;
  document: object | null;
}

Analysis

WhitespaceTokenizer

const tokenizer = new WhitespaceTokenizer();
const tokens = tokenizer.tokenize("hello world");
// [{ text: "hello", position: 0, ... }, { text: "world", position: 1, ... }]

SynonymDictionary

const dict = new SynonymDictionary();
dict.addSynonymGroup(["ml", "machine learning"]);

SynonymGraphFilter

const filter = new SynonymGraphFilter(dict, true, 0.8);
const expanded = filter.apply(tokens);

Development

Prerequisites

rustup target add wasm32-unknown-unknown
cargo install wasm-pack

Build

cd laurus-wasm

# Debug build (faster compilation)
wasm-pack build --target web --dev

# Release build (optimized)
wasm-pack build --target web --release

# For bundler targets (webpack, vite, etc.)
wasm-pack build --target bundler --release

Project Structure

laurus-wasm/
├── Cargo.toml          # Rust dependencies (wasm-bindgen, laurus core)
├── package.json        # npm package metadata
├── src/
│   ├── lib.rs          # Module declarations
│   ├── index.rs        # Index class (CRUD + search)
│   ├── schema.rs       # Schema builder
│   ├── search.rs       # SearchRequest / SearchResult
│   ├── query.rs        # Query type definitions
│   ├── convert.rs      # JsValue ↔ Document conversion
│   ├── analysis.rs     # Tokenizer / Filter wrappers
│   ├── errors.rs       # LaurusError → JsValue conversion
│   └── storage.rs      # OPFS persistence layer
└── js/
    └── opfs_bridge.js  # JS glue for Origin Private File System

Architecture Notes

Storage Strategy

laurus-wasm uses a two-layer storage approach:

  1. MemoryStorage (runtime) – All read/write operations go through Laurus’s in-memory storage, which satisfies the Storage trait’s Send + Sync requirement.

  2. OPFS (persistence) – On commit(), the entire MemoryStorage state is serialized to OPFS files. On Index.open(), OPFS files are loaded back into MemoryStorage.

This avoids the Send + Sync incompatibility of JS handles while keeping the core engine unchanged.

Feature Flags

The laurus core uses feature flags to support WASM:

# laurus-wasm depends on laurus without default features
laurus = { workspace = true, default-features = false }

This excludes native-only dependencies (tokio/full, rayon, memmap2, etc.) and uses #[cfg(target_arch = "wasm32")] fallbacks for parallelism.

Testing

# Build check
cargo build -p laurus-wasm --target wasm32-unknown-unknown

# Clippy
cargo clippy -p laurus-wasm --target wasm32-unknown-unknown -- -D warnings

Browser tests can be run with wasm-pack test:

wasm-pack test --headless --chrome

Ruby Binding Overview

The laurus gem provides Ruby bindings for the Laurus search engine. It is built as a native Rust extension using Magnus and rb_sys, giving Ruby programs direct access to Laurus’s lexical, vector, and hybrid search capabilities with near-native performance.

Features

  • Lexical Search – Full-text search powered by an inverted index with BM25 scoring
  • Vector Search – Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes
  • Hybrid Search – Combine lexical and vector results with fusion algorithms (RRF, WeightedSum)
  • Rich Query DSL – Term, Phrase, Fuzzy, Wildcard, NumericRange, Geo, Boolean, Span queries
  • Text Analysis – Tokenizers, filters, stemmers, and synonym expansion
  • Flexible Storage – In-memory (ephemeral) or file-based (persistent) indexes
  • Idiomatic Ruby API – Clean, intuitive Ruby classes under the Laurus:: namespace

Architecture

graph LR
    subgraph "laurus-ruby (gem)"
        RbIndex["Index\n(Ruby class)"]
        RbQuery["Query classes"]
        RbSearch["SearchRequest\n/ SearchResult"]
    end

    Ruby["Ruby application"] -->|"method calls"| RbIndex
    Ruby -->|"query objects"| RbQuery
    RbIndex -->|"Magnus FFI"| Engine["laurus::Engine\n(Rust)"]
    RbQuery -->|"Magnus FFI"| Engine
    Engine --> Storage["Storage\n(Memory / File)"]

The Ruby classes are thin wrappers around the Rust engine. Each call crosses the Magnus FFI boundary once; the Rust engine then executes the operation entirely in native code.

Although the Rust engine uses async I/O internally, all Ruby methods are exposed as synchronous functions. Each method calls tokio::Runtime::block_on() under the hood to bridge async Rust to synchronous Ruby.

Quick Start

require "laurus"

# Create an in-memory index
index = Laurus::Index.new

# Index documents
index.put_document("doc1", { "title" => "Introduction to Rust", "body" => "Systems programming language." })
index.put_document("doc2", { "title" => "Ruby for Web Development", "body" => "Web applications with Ruby." })
index.commit

# Search
results = index.search("title:rust", limit: 5)
results.each do |r|
  puts "[#{r.id}] score=#{format('%.4f', r.score)}  #{r.document['title']}"
end

Installation

From RubyGems

gem install laurus

Or add it to your Gemfile:

gem "laurus"

Then run:

bundle install

From source

Building from source requires a Rust toolchain (1.85 or later) and rb_sys.

# Clone the repository
git clone https://github.com/mosuka/laurus.git
cd laurus/laurus-ruby

# Install dependencies
bundle install

# Compile the native extension
bundle exec rake compile

# Or install the gem locally
gem build laurus.gemspec
gem install laurus-*.gem

Verify

require "laurus"
index = Laurus::Index.new
puts index  # Index()

Requirements

  • Ruby 3.1 or later
  • Rust toolchain (automatically invoked during gem install via rb_sys)
  • No runtime dependencies beyond the compiled native extension

Quick Start

1. Create an index

require "laurus"

# In-memory index (ephemeral, useful for prototyping)
index = Laurus::Index.new

# File-based index (persistent)
schema = Laurus::Schema.new
schema.add_text_field("title")
schema.add_text_field("body")
index = Laurus::Index.new(path: "./myindex", schema: schema)

2. Index documents

index.put_document("doc1", {
  "title" => "Introduction to Rust",
  "body" => "Rust is a systems programming language focused on safety and performance.",
})
index.put_document("doc2", {
  "title" => "Ruby for Web Development",
  "body" => "Ruby is widely used for web applications and rapid prototyping.",
})
index.commit

3. Lexical search

# DSL string
results = index.search("title:rust", limit: 5)

# Query object
results = index.search(Laurus::TermQuery.new("body", "ruby"), limit: 5)

# Print results
results.each do |r|
  puts "[#{r.id}] score=#{format('%.4f', r.score)}  #{r.document['title']}"
end

4. Vector search

Vector search requires a schema with a vector field and pre-computed embeddings.

require "laurus"

schema = Laurus::Schema.new
schema.add_text_field("title")
schema.add_hnsw_field("embedding", 4)

index = Laurus::Index.new(schema: schema)
index.put_document("doc1", { "title" => "Rust", "embedding" => [0.1, 0.2, 0.3, 0.4] })
index.put_document("doc2", { "title" => "Ruby", "embedding" => [0.9, 0.8, 0.7, 0.6] })
index.commit

query_vec = [0.1, 0.2, 0.3, 0.4]
results = index.search(Laurus::VectorQuery.new("embedding", query_vec), limit: 3)

5. Hybrid search

request = Laurus::SearchRequest.new(
  lexical_query: Laurus::TermQuery.new("title", "rust"),
  vector_query: Laurus::VectorQuery.new("embedding", query_vec),
  fusion: Laurus::RRF.new(k: 60.0),
  limit: 5,
)
results = index.search(request)

6. Update and delete

# Update: put_document replaces all existing versions
index.put_document("doc1", { "title" => "Updated Title", "body" => "New content." })
index.commit

# Append a new version without removing existing ones (RAG chunking pattern)
index.add_document("doc1", { "title" => "Chunk 2", "body" => "Additional chunk." })
index.commit

# Retrieve all versions
docs = index.get_documents("doc1")

# Delete
index.delete_documents("doc1")
index.commit

7. Schema management

schema = Laurus::Schema.new
schema.add_text_field("title")
schema.add_text_field("body")
schema.add_integer_field("year")
schema.add_float_field("score")
schema.add_boolean_field("published")
schema.add_bytes_field("thumbnail")
schema.add_geo_field("location")
schema.add_datetime_field("created_at")
schema.add_hnsw_field("embedding", 384)
schema.add_flat_field("small_vec", 64)
schema.add_ivf_field("ivf_vec", 128, n_clusters: 100)

8. Index statistics

stats = index.stats
puts stats["document_count"]
puts stats["vector_fields"]

API Reference

Index

The primary entry point. Wraps the Laurus search engine.

Laurus::Index.new(path: nil, schema: nil)

Constructor

| Parameter | Type | Default | Description |
|---|---|---|---|
| path: | String \| nil | nil | Directory path for persistent storage. nil creates an in-memory index. |
| schema: | Schema \| nil | nil | Schema definition. An empty schema is used when omitted. |

Methods

| Method | Description |
|---|---|
| put_document(id, doc) | Upsert a document. Replaces all existing versions with the same ID. |
| add_document(id, doc) | Append a document chunk without removing existing versions. |
| get_documents(id) -> Array<Hash> | Return all stored versions for the given ID. |
| delete_documents(id) | Delete all versions for the given ID. |
| commit | Flush buffered writes and make all pending changes searchable. |
| search(query, limit: 10, offset: 0) -> Array<SearchResult> | Execute a search query. |
| stats -> Hash | Return index statistics ("document_count", "vector_fields"). |

search query argument

The query parameter accepts any of the following:

  • A DSL string (e.g. "title:hello", "embedding:\"memory safety\"")
  • A lexical query object (TermQuery, PhraseQuery, BooleanQuery, …)
  • A vector query object (VectorQuery, VectorTextQuery)
  • A SearchRequest for full control

Schema

Defines the fields and index types for an Index.

Laurus::Schema.new

Field methods

| Method | Description |
|---|---|
| add_text_field(name, stored: true, indexed: true, term_vectors: false, analyzer: nil) | Full-text field (inverted index, BM25). |
| add_integer_field(name, stored: true, indexed: true) | 64-bit integer field. |
| add_float_field(name, stored: true, indexed: true) | 64-bit float field. |
| add_boolean_field(name, stored: true, indexed: true) | Boolean field. |
| add_bytes_field(name, stored: true) | Raw bytes field. |
| add_geo_field(name, stored: true, indexed: true) | Geographic coordinate field (lat/lon). |
| add_datetime_field(name, stored: true, indexed: true) | UTC datetime field. |
| add_hnsw_field(name, dimension, distance: "cosine", m: 16, ef_construction: 200, embedder: nil) | HNSW approximate nearest-neighbor vector field. |
| add_flat_field(name, dimension, distance: "cosine", embedder: nil) | Flat (brute-force) vector field. |
| add_ivf_field(name, dimension, distance: "cosine", n_clusters: 100, n_probe: 1, embedder: nil) | IVF approximate nearest-neighbor vector field. |

Other methods

| Method | Description |
|---|---|
| add_embedder(name, config) | Register a named embedder definition. config is a Hash with a "type" key (see below). |
| set_default_fields(fields) | Set the default fields used when no field is specified in a query. fields is an Array of Strings. |
| field_names -> Array<String> | Return the list of field names defined in this schema. |

Embedder types

"type"Required keysFeature flag
"precomputed"(always available)
"candle_bert""model"embeddings-candle
"candle_clip""model"embeddings-multimodal
"openai""model"embeddings-openai

Distance metrics

| Value | Description |
|---|---|
| "cosine" | Cosine similarity (default) |
| "euclidean" | Euclidean distance |
| "dot_product" | Dot product |
| "manhattan" | Manhattan distance |
| "angular" | Angular distance |

Query classes

TermQuery

Laurus::TermQuery.new(field, term)

Matches documents containing the exact term in the given field.

PhraseQuery

Laurus::PhraseQuery.new(field, terms)

Matches documents containing the terms in order. terms is an Array of Strings.

FuzzyQuery

Laurus::FuzzyQuery.new(field, term, max_edits: 2)

Approximate match allowing up to max_edits edit-distance errors.
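The edit distance here is standard Levenshtein distance (insertions, deletions, substitutions). A plain-Ruby sketch of what max_edits: 2 tolerates, shown for illustration rather than as the engine's implementation:

```ruby
# Dynamic-programming Levenshtein distance between two strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,        # deletion
               curr[j - 1] + 1,    # insertion
               prev[j - 1] + cost] # substitution
              .min
    end
    prev = curr
  end
  prev.last
end

levenshtein("rust", "rusty")  # => 1 (one insertion)
levenshtein("rust", "ruts")   # => 2 (two substitutions)
```

With max_edits: 2, both "rusty" and "ruts" would match a query term of "rust".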

WildcardQuery

Laurus::WildcardQuery.new(field, pattern)

Pattern match. * matches any sequence of characters, ? matches any single character.
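The pattern semantics map directly onto a regular expression, which this illustrative pure-Ruby helper makes explicit (the engine matches patterns against terms in the inverted index, not with Ruby regexes):

```ruby
# Translate a wildcard pattern into an anchored Regexp:
#   *  -> .*  (any sequence of characters, including empty)
#   ?  -> .   (exactly one character)
def wildcard_to_regex(pattern)
  body = pattern.chars.map do |c|
    case c
    when "*" then ".*"
    when "?" then "."
    else Regexp.escape(c)
    end
  end.join
  /\A#{body}\z/
end

wildcard_to_regex("ru*").match?("rust")    # => true
wildcard_to_regex("r?st").match?("rust")   # => true
wildcard_to_regex("r?st").match?("roast")  # => false ("?" matches exactly one character)
```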

NumericRangeQuery

Laurus::NumericRangeQuery.new(field, min: nil, max: nil)

Matches numeric values in the range [min, max]. Pass nil for an open bound. The type (integer or float) is inferred from the Ruby type of min/max.

GeoQuery

# Radius search
Laurus::GeoQuery.within_radius(field, lat, lon, distance_km)

# Bounding box search
Laurus::GeoQuery.within_bounding_box(field, min_lat, min_lon, max_lat, max_lon)

within_radius returns documents whose coordinate is within distance_km of the given point. within_bounding_box returns documents within the specified bounding box.

BooleanQuery

bq = Laurus::BooleanQuery.new
bq.must(query)
bq.should(query)
bq.must_not(query)

Compound boolean query. must clauses all have to match; at least one should clause must match; must_not clauses must not match.
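The clause semantics can be summarized with a small predicate over a document's term set. This is a sketch of the matching logic only, not how the engine evaluates queries:

```ruby
# Illustrative boolean clause semantics over a flat list of terms.
def boolean_match?(doc_terms, must: [], should: [], must_not: [])
  return false unless must.all? { |t| doc_terms.include?(t) }      # all must clauses match
  return false if must_not.any? { |t| doc_terms.include?(t) }      # no must_not clause matches
  return false unless should.empty? ||
                      should.any? { |t| doc_terms.include?(t) }    # at least one should clause matches
  true
end

doc = ["rust", "memory", "safety"]
boolean_match?(doc, must: ["rust"], should: ["memory", "gc"])  # => true
boolean_match?(doc, must: ["rust"], must_not: ["safety"])      # => false
```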

SpanQuery

# Single term
Laurus::SpanQuery.term(field, term)

# Near: terms within slop positions
Laurus::SpanQuery.near(field, terms, slop: 0, ordered: true)

# Near with nested SpanQuery clauses
Laurus::SpanQuery.near_spans(field, clauses, slop: 0, ordered: true)

# Containing: big span contains little span
Laurus::SpanQuery.containing(field, big, little)

# Within: include span within exclude span at max distance
Laurus::SpanQuery.within(field, include_span, exclude_span, distance)

Positional / proximity span queries. near takes an Array of term Strings, while near_spans takes an Array of SpanQuery objects for nested expressions.

VectorQuery

Laurus::VectorQuery.new(field, vector)

Approximate nearest-neighbor search using a pre-computed embedding vector. vector is an Array of Floats.

VectorTextQuery

Laurus::VectorTextQuery.new(field, text)

Converts text to an embedding at query time and runs vector search. Requires an embedder configured on the index.


SearchRequest

Full-featured search request for advanced control.

Laurus::SearchRequest.new(
  query: nil,
  lexical_query: nil,
  vector_query: nil,
  filter_query: nil,
  fusion: nil,
  limit: 10,
  offset: 0,
)

| Parameter | Description |
|---|---|
| query: | A DSL string or single query object. Mutually exclusive with lexical_query: / vector_query:. |
| lexical_query: | Lexical component for explicit hybrid search. |
| vector_query: | Vector component for explicit hybrid search. |
| filter_query: | Lexical filter applied after scoring. |
| fusion: | Fusion algorithm (RRF or WeightedSum). Defaults to RRF(k: 60) when both components are set. |
| limit: | Maximum number of results (default 10). |
| offset: | Pagination offset (default 0). |

SearchResult

Returned by Index#search.

result.id        # => String   -- External document identifier
result.score     # => Float    -- Relevance score
result.document  # => Hash|nil -- Retrieved field values, or nil if deleted

Fusion algorithms

RRF

Laurus::RRF.new(k: 60.0)

Reciprocal Rank Fusion. Merges lexical and vector result lists by rank position. k is a smoothing constant; higher values reduce the influence of top-ranked results.
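As an illustration, RRF assigns each document the sum of 1 / (k + rank) over every result list it appears in. A plain-Ruby sketch, assuming 1-based ranks (the rank convention inside Laurus is an implementation detail):

```ruby
# Reciprocal Rank Fusion over lists of document IDs ordered by relevance.
def rrf(result_lists, k: 60.0)
  scores = Hash.new(0.0)
  result_lists.each do |ids|
    ids.each_with_index do |id, i|
      scores[id] += 1.0 / (k + i + 1)  # rank = i + 1
    end
  end
  scores.sort_by { |_, s| -s }.to_h
end

lexical = ["doc1", "doc2", "doc3"]
vector  = ["doc2", "doc3", "doc1"]
rrf([lexical, vector]).keys.first  # => "doc2" (ranked high in both lists)
```

Because only rank positions matter, RRF needs no score normalization, which makes it robust when lexical (BM25) and vector (similarity) scores live on different scales.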

WeightedSum

Laurus::WeightedSum.new(lexical_weight: 0.5, vector_weight: 0.5)

Normalizes both score lists independently, then combines them as lexical_weight * lexical_score + vector_weight * vector_score.
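A sketch of the combination step in plain Ruby, assuming min-max normalization (the exact normalization Laurus applies is an implementation detail):

```ruby
# Scale a {id => score} hash into [0, 1] via min-max normalization.
def min_max(scores)
  min, max = scores.values.minmax
  span = max - min
  scores.transform_values { |s| span.zero? ? 1.0 : (s - min) / span }
end

def weighted_sum(lexical, vector, lexical_weight: 0.5, vector_weight: 0.5)
  ln = min_max(lexical)
  vn = min_max(vector)
  (ln.keys | vn.keys)
    .map { |id| [id, lexical_weight * ln.fetch(id, 0.0) + vector_weight * vn.fetch(id, 0.0)] }
    .sort_by { |_, s| -s }
    .to_h
end

lexical = { "doc1" => 8.2, "doc2" => 5.0, "doc3" => 3.1 }  # e.g. BM25 scores
vector  = { "doc2" => 0.95, "doc3" => 0.80 }               # e.g. cosine similarities
weighted_sum(lexical, vector).keys.first  # => "doc2"
```

Unlike RRF, WeightedSum is sensitive to score distributions, so normalization is essential before combining lexical and vector scores.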


Text analysis

SynonymDictionary

dict = Laurus::SynonymDictionary.new
dict.add_synonym_group(["fast", "quick", "rapid"])

A dictionary of synonym groups. All terms in a group are treated as synonyms of each other.

WhitespaceTokenizer

tokenizer = Laurus::WhitespaceTokenizer.new
tokens = tokenizer.tokenize("hello world")

Splits text on whitespace boundaries and returns an Array of Token objects.

SynonymGraphFilter

filter = Laurus::SynonymGraphFilter.new(dictionary, keep_original: true, boost: 1.0)
expanded = filter.apply(tokens)

Token filter that expands tokens with their synonyms from a SynonymDictionary.

Token

token.text                # => String  -- The token text
token.position            # => Integer -- Position in the token stream
token.start_offset        # => Integer -- Character start offset in the original text
token.end_offset          # => Integer -- Character end offset in the original text
token.boost               # => Float   -- Score boost factor (1.0 = no adjustment)
token.stopped             # => Boolean -- Whether removed by a stop filter
token.position_increment  # => Integer -- Difference from the previous token's position
token.position_length     # => Integer -- Number of positions spanned

Field value types

Ruby values are automatically converted to Laurus DataValue types:

| Ruby type | Laurus type | Notes |
|---|---|---|
| nil | Null | |
| true / false | Bool | |
| Integer | Int64 | |
| Float | Float64 | |
| String | Text | |
| Array of numerics | Vector | Elements coerced to f32 |
| Hash with "lat", "lon" | Geo | Two Float values |
| Time (responds to iso8601) | DateTime | Converted via iso8601 |
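For illustration, a document hash exercising each conversion (plain Ruby; the field names are arbitrary and the comments restate the table above):

```ruby
require "time"  # adds Time#iso8601

doc = {
  "title"      => "Hello",                             # String  -> Text
  "year"       => 2024,                                # Integer -> Int64
  "score"      => 0.87,                                # Float   -> Float64
  "published"  => true,                                # Bool
  "subtitle"   => nil,                                 # Null
  "embedding"  => [0.1, 0.2, 0.3],                     # Vector (elements coerced to f32)
  "location"   => { "lat" => 35.68, "lon" => 139.69 }, # Geo (two Float values)
  "created_at" => Time.utc(2024, 1, 1),                # DateTime, converted via iso8601
}

doc["created_at"].iso8601  # => "2024-01-01T00:00:00Z"
```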

Development Setup

This page covers how to set up a local development environment for the laurus-ruby binding, build it, and run the test suite.

Prerequisites

  • Rust 1.85 or later with Cargo
  • Ruby 3.1 or later with Bundler
  • Repository cloned locally
git clone https://github.com/mosuka/laurus.git
cd laurus

Build

Development build

Compiles the Rust native extension in debug mode. Re-run after any Rust source change.

cd laurus-ruby
bundle install
bundle exec rake compile

Release build

gem build laurus.gemspec

Verify the build

ruby -e "
require 'laurus'
index = Laurus::Index.new
puts index.stats
"
# {"document_count"=>0, "vector_fields"=>{}}

Testing

Tests use Minitest and are located in test/.

# Run all tests
bundle exec rake test

To run a specific test file:

bundle exec ruby -Ilib -Itest test/test_index.rb

Linting and formatting

# Rust lint (Clippy)
cargo clippy -p laurus-ruby -- -D warnings

# Rust formatting
cargo fmt -p laurus-ruby --check

# Apply formatting
cargo fmt -p laurus-ruby

Cleaning up

# Remove build artifacts
bundle exec rake clean

# Remove installed gems
rm -rf vendor/bundle

Project layout

laurus-ruby/
├── Cargo.toml          # Rust crate manifest
├── laurus.gemspec      # Gem specification
├── Gemfile             # Bundler dependency file
├── Rakefile            # Rake tasks (compile, test, clean)
├── lib/
│   └── laurus.rb       # Ruby entrypoint (loads native extension)
├── ext/
│   └── laurus_ruby/    # Native extension build configuration
│       └── extconf.rb  # rb_sys extension configuration
├── src/                # Rust source (Magnus binding)
│   ├── lib.rs          # Module registration
│   ├── index.rs        # Index class
│   ├── schema.rs       # Schema class
│   ├── query.rs        # Query classes
│   ├── search.rs       # SearchRequest / SearchResult / Fusion
│   ├── analysis.rs     # Tokenizer / Filter / Token
│   ├── convert.rs      # Ruby ↔ DataValue conversion
│   └── errors.rs       # Error mapping
├── test/               # Minitest tests
│   ├── test_helper.rb
│   └── test_index.rb
└── examples/           # Runnable Ruby examples

Build & Test

Prerequisites

  • Rust 1.85 or later (edition 2024)
  • Cargo (included with Rust)
  • protobuf compiler (protoc) – required for building laurus-server

Building

# Build all crates
cargo build

# Build with specific features
cargo build --features embeddings-candle

# Build in release mode
cargo build --release

Testing

# Run all tests
cargo test

# Run a specific test by name
cargo test <test_name>

# Run tests for a specific crate
cargo test -p laurus
cargo test -p laurus-cli
cargo test -p laurus-server

Linting

# Run clippy with warnings as errors
cargo clippy -- -D warnings

Formatting

# Check formatting
cargo fmt --check

# Apply formatting
cargo fmt

Documentation

API Documentation

# Generate and open Rust API docs
cargo doc --no-deps --open

mdBook Documentation

# Build the documentation site
mdbook build docs

# Start a local preview server (http://localhost:3000)
mdbook serve docs

# Lint markdown files
markdownlint-cli2 "docs/src/**/*.md"

Feature Flags

The laurus crate ships with no default features. Enable embedding support as needed.

Available Flags

| Feature | Description | Key Dependencies |
|---|---|---|
| embeddings-candle | Local BERT embeddings via Hugging Face Candle | candle-core, candle-nn, candle-transformers, hf-hub, tokenizers |
| embeddings-openai | OpenAI API embeddings | reqwest |
| embeddings-multimodal | CLIP multimodal embeddings (text + image) | image, embeddings-candle |
| embeddings-all | All embedding features combined | All of the above |

What Each Flag Enables

embeddings-candle

Enables CandleBertEmbedder for running BERT models locally on the CPU. Models are downloaded from Hugging Face Hub on first use.

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-candle"] }

embeddings-openai

Enables OpenAIEmbedder for calling the OpenAI Embeddings API. Requires an OPENAI_API_KEY environment variable at runtime.

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-openai"] }

embeddings-multimodal

Enables CandleClipEmbedder for CLIP-based text and image embeddings. Implies embeddings-candle.

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-multimodal"] }

embeddings-all

Convenience flag that enables all embedding features.

[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-all"] }

Feature Flag Impact on Binary Size

Enabling embedding features adds dependencies that increase compile time and binary size:

| Configuration | Approximate Impact |
|---|---|
| No features (lexical only) | Baseline |
| embeddings-candle | + Candle ML framework |
| embeddings-openai | + reqwest HTTP client |
| embeddings-multimodal | + image processing + Candle |
| embeddings-all | All of the above |

If you only need lexical (keyword) search, you can use Laurus with no features enabled for the smallest binary and fastest compile time.

Project Structure

Laurus is organized as a Cargo workspace with three crates.

Workspace Layout

laurus/                          # Repository root
├── Cargo.toml                   # Workspace definition
├── laurus/                      # Core search engine library
│   ├── Cargo.toml
│   ├── src/
│   │   ├── lib.rs               # Public API and module declarations
│   │   ├── engine.rs            # Engine, EngineBuilder, SearchRequest
│   │   ├── analysis/            # Text analysis pipeline
│   │   ├── lexical/             # Inverted index and lexical search
│   │   ├── vector/              # Vector indexes (Flat, HNSW, IVF)
│   │   ├── embedding/           # Embedder implementations
│   │   ├── storage/             # Storage backends (memory, file, mmap)
│   │   ├── store/               # Document log (WAL)
│   │   ├── spelling/            # Spelling correction
│   │   ├── data/                # DataValue, Document types
│   │   └── error.rs             # LaurusError type
│   └── examples/                # Runnable examples
├── laurus-cli/                  # Command-line interface
│   ├── Cargo.toml
│   └── src/
│       └── main.rs              # CLI entry point (clap)
├── laurus-server/               # gRPC server + HTTP gateway
│   ├── Cargo.toml
│   ├── proto/                   # Protobuf service definitions
│   └── src/
│       ├── lib.rs               # Server library
│       ├── config.rs            # TOML configuration
│       ├── grpc/                # gRPC service implementations
│       └── gateway/             # HTTP/JSON gateway (axum)
└── docs/                        # mdBook documentation
    ├── book.toml
    └── src/
        └── SUMMARY.md           # Table of contents

Crate Responsibilities

| Crate | Type | Description |
|---|---|---|
| laurus | Library | Core search engine with lexical, vector, and hybrid search |
| laurus-cli | Binary | CLI tool for index management, document CRUD, search, and REPL |
| laurus-server | Library + Binary | gRPC server with optional HTTP/JSON gateway |

Both laurus-cli and laurus-server depend on the laurus library crate.

Design Conventions

  • Module style: File-based modules (Rust 2018 edition style), not mod.rs
    • src/tokenizer.rs + src/tokenizer/dictionary.rs
    • Not: src/tokenizer/mod.rs
  • Error handling: thiserror for library error types, anyhow only in binary crates
  • No unwrap() / expect() in production code (allowed in tests)
  • Async: All public APIs use async/await with Tokio runtime
  • Unsafe: Every unsafe block must have a // SAFETY: ... comment
  • Documentation: All public types, functions, and enums must have doc comments (///)
  • Licensing: Dependencies must be MIT or Apache-2.0 compatible