Laurus
A fast, featureful hybrid search library for Rust.
Laurus is a pure-Rust library that combines lexical search (keyword matching via inverted index) and vector search (semantic similarity via embeddings) into a single, unified engine. It is designed to be embedded directly into your Rust application — no external server required.
Key Features
| Feature | Description |
|---|---|
| Lexical Search | Full-text search powered by an inverted index with BM25 scoring |
| Vector Search | Approximate nearest neighbor (ANN) search using Flat, HNSW, or IVF indexes |
| Hybrid Search | Combine lexical and vector results with fusion algorithms (RRF, WeightedSum) |
| Text Analysis | Pluggable analyzer pipeline — tokenizers, filters, stemmers, synonyms |
| Embeddings | Built-in support for Candle (local BERT/CLIP), OpenAI API, or custom embedders |
| Storage | Pluggable backends — in-memory, file-based, or memory-mapped |
| Query DSL | Human-readable query syntax for lexical, vector, and hybrid search |
| Pure Rust | No C/C++ dependencies in the core — safe, portable, easy to build |
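For reference, BM25 follows a standard formula: a term's contribution grows with its frequency in a document but saturates (controlled by k1), and is normalized by document length relative to the average (controlled by b). A sketch of that formula — not Laurus internals:

```rust
// Illustrative BM25 term score (standard formula, not Laurus's implementation).
// `idf` and `tf` are assumed precomputed; k1 and b use common defaults.
fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64) -> f64 {
    let k1 = 1.2; // term-frequency saturation
    let b = 0.75; // length-normalization strength
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // A term appearing 3 times in an average-length document:
    let s = bm25_term_score(3.0, 100.0, 100.0, 2.0);
    println!("{s:.4}");
}
```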
How It Works
graph LR
subgraph Your Application
D["Document"]
Q["Query"]
end
subgraph Laurus Engine
SCH["Schema"]
AN["Analyzer"]
EM["Embedder"]
LI["Lexical Index\n(Inverted Index)"]
VI["Vector Index\n(HNSW / Flat / IVF)"]
FU["Fusion\n(RRF / WeightedSum)"]
end
D --> SCH
SCH --> AN --> LI
SCH --> EM --> VI
Q --> LI --> FU
Q --> VI --> FU
FU --> R["Ranked Results"]
- Define a Schema — declare your fields and their types (text, integer, vector, etc.)
- Build an Engine — attach an analyzer for text and an embedder for vectors
- Index Documents — the engine routes each field to the correct index automatically
- Search — run lexical, vector, or hybrid queries and get ranked results
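The fusion step can be illustrated with Reciprocal Rank Fusion as it is commonly defined: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k = 60 as the usual default. A self-contained sketch, not the Laurus API:

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion over ranked ID lists (standard definition:
// score(d) = sum over lists of 1 / (k + rank), rank starting at 1).
fn rrf(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }
    let mut out: Vec<_> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    let lexical = vec!["doc-1", "doc-2", "doc-3"];
    let vector = vec!["doc-2", "doc-4", "doc-1"];
    let fused = rrf(&[lexical, vector], 60.0);
    // doc-2 ranks first: it scores highly in both lists.
    for (id, score) in &fused {
        println!("{id}: {score:.5}");
    }
}
```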
Document Map
| Section | What You Will Learn |
|---|---|
| Getting Started | Install Laurus and run your first search in minutes |
| Architecture | Understand the Engine, its components, and data flow |
| Core Concepts | Schema, text analysis, embeddings, and storage |
| Indexing | How inverted indexes and vector indexes work internally |
| Search | Query types, vector search, and hybrid fusion |
| Advanced Features | Query DSL, ID management, WAL, and compaction |
| API Reference | Key types and methods at a glance |
Quick Example
use std::sync::Arc;
use laurus::{Document, Engine, Schema, SearchRequestBuilder, Result};
use laurus::lexical::{TextOption, TermQuery};
use laurus::storage::memory::MemoryStorage;
#[tokio::main]
async fn main() -> Result<()> {
// 1. Storage
let storage = Arc::new(MemoryStorage::new(Default::default()));
// 2. Schema
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_text_field("body", TextOption::default())
.add_default_field("body")
.build();
// 3. Engine
let engine = Engine::builder(storage, schema).build().await?;
// 4. Index a document
let doc = Document::builder()
.add_text("title", "Hello Laurus")
.add_text("body", "A fast search library for Rust")
.build();
engine.add_document("doc-1", doc).await?;
engine.commit().await?;
// 5. Search
let request = SearchRequestBuilder::new()
.lexical_search_request(
laurus::LexicalSearchRequest::new(
Box::new(TermQuery::new("body", "rust"))
)
)
.limit(10)
.build();
let results = engine.search(request).await?;
for r in &results {
println!("{}: score={:.4}", r.id, r.score);
}
Ok(())
}
License
Laurus is dual-licensed under MIT and Apache 2.0.
Getting Started
Welcome to Laurus! This section will help you install the library and run your first search.
What You Will Build
By the end of this guide, you will have a working search engine that can:
- Index text documents
- Perform keyword (lexical) search
- Perform semantic (vector) search
- Combine both with hybrid search
Prerequisites
- Rust 1.85 or later (edition 2024)
- Cargo (included with Rust)
- Tokio runtime (Laurus uses async APIs)
Steps
- Installation — Add Laurus to your project and choose feature flags
- Quick Start — Build a complete search engine in 5 steps
Workflow Overview
Building a search application with Laurus follows a consistent pattern:
graph LR
A["1. Create\nStorage"] --> B["2. Define\nSchema"]
B --> C["3. Build\nEngine"]
C --> D["4. Index\nDocuments"]
D --> E["5. Search"]
| Step | What Happens |
|---|---|
| Create Storage | Choose where data lives — in memory, on disk, or memory-mapped |
| Define Schema | Declare fields and their types (text, integer, vector, etc.) |
| Build Engine | Attach an analyzer (for text) and an embedder (for vectors) |
| Index Documents | Add documents; the engine routes fields to the correct index |
| Search | Run lexical, vector, or hybrid queries and get ranked results |
Installation
Add Laurus to Your Project
Add laurus and tokio (async runtime) to your Cargo.toml:
[dependencies]
laurus = "0.1.0"
tokio = { version = "1", features = ["full"] }
Feature Flags
Laurus ships with a minimal default feature set. Enable additional features as needed:
| Feature | Description | Use Case |
|---|---|---|
| (default) | Core library (lexical search, storage, analyzers — no embedding) | Keyword search only |
| `embeddings-candle` | Local BERT embeddings via Hugging Face Candle | Vector search without external API |
| `embeddings-openai` | OpenAI API embeddings (text-embedding-3-small, etc.) | Cloud-based vector search |
| `embeddings-multimodal` | CLIP embeddings for text + image via Candle | Multimodal (text-to-image) search |
| `embeddings-all` | All embedding features above | Full embedding support |
Examples
Lexical search only (no embeddings needed):
[dependencies]
laurus = "0.1.0"
Vector search with local model (no API key required):
[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-candle"] }
Vector search with OpenAI:
[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-openai"] }
Everything:
[dependencies]
laurus = { version = "0.1.0", features = ["embeddings-all"] }
Verify Installation
Create a minimal program to verify that Laurus compiles:
use laurus::Result;
#[tokio::main]
async fn main() -> Result<()> {
println!("Laurus version: {}", laurus::VERSION);
Ok(())
}
cargo run
If you see the version printed, you are ready to proceed to the Quick Start.
Quick Start
This tutorial walks you through building a complete search engine in 5 steps. By the end, you will be able to index documents and search them by keyword.
Step 1 — Create Storage
Storage determines where Laurus persists index data. For development and testing, use MemoryStorage:
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::storage::memory::MemoryStorage;
use laurus::Storage;
let storage: Arc<dyn Storage> = Arc::new(
MemoryStorage::new(Default::default())
);
}
Tip: For production, consider `FileStorage` (with optional `use_mmap` for memory-mapped I/O). See Storage for details.
Step 2 — Define a Schema
A Schema declares the fields in your documents and how each field should be indexed:
#![allow(unused)]
fn main() {
use laurus::Schema;
use laurus::lexical::TextOption;
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_text_field("body", TextOption::default())
.add_default_field("body") // used when no field is specified in a query
.build();
}
Each field has a type. Common types include:
| Method | Field Type | Example Values |
|---|---|---|
| `add_text_field` | Text (full-text searchable) | "Hello world" |
| `add_integer_field` | 64-bit integer | 42 |
| `add_float_field` | 64-bit float | 3.14 |
| `add_boolean_field` | Boolean | true / false |
| `add_datetime_field` | UTC datetime | 2024-01-15T10:30:00Z |
| `add_hnsw_field` | Vector (HNSW index) | [0.1, 0.2, ...] |
| `add_flat_field` | Vector (Flat index) | [0.1, 0.2, ...] |
See Schema & Fields for the full list.
Step 3 — Build an Engine
The Engine ties storage, schema, and runtime components together:
#![allow(unused)]
fn main() {
use laurus::Engine;
let engine = Engine::builder(storage, schema)
.build()
.await?;
}
When you only use text fields, the default StandardAnalyzer is used automatically. To customize analysis or add vector embeddings, see Architecture.
Step 4 — Index Documents
Create documents with the DocumentBuilder and add them to the engine:
#![allow(unused)]
fn main() {
use laurus::Document;
// Each document needs a unique external ID (string)
let doc = Document::builder()
.add_text("title", "Introduction to Rust")
.add_text("body", "Rust is a systems programming language focused on safety and performance.")
.build();
engine.add_document("doc-1", doc).await?;
let doc = Document::builder()
.add_text("title", "Python for Data Science")
.add_text("body", "Python is widely used in machine learning and data analysis.")
.build();
engine.add_document("doc-2", doc).await?;
let doc = Document::builder()
.add_text("title", "Web Development with JavaScript")
.add_text("body", "JavaScript powers interactive web applications and server-side code with Node.js.")
.build();
engine.add_document("doc-3", doc).await?;
// Commit to make documents searchable
engine.commit().await?;
}
Important: Documents are not searchable until `commit()` is called.
Step 5 — Search
Use SearchRequestBuilder with a query to search the index:
#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest};
use laurus::lexical::TermQuery;
// Search for "rust" in the "body" field
let request = SearchRequestBuilder::new()
.lexical_search_request(
LexicalSearchRequest::new(
Box::new(TermQuery::new("body", "rust"))
)
)
.limit(10)
.build();
let results = engine.search(request).await?;
for result in &results {
println!("ID: {}, Score: {:.4}", result.id, result.score);
if let Some(doc) = &result.document {
if let Some(title) = doc.get("title") {
println!(" Title: {:?}", title);
}
}
}
}
Complete Example
Here is the full program that you can copy, paste, and run:
use std::sync::Arc;
use laurus::{
Document, Engine, LexicalSearchRequest,
Result, Schema, SearchRequestBuilder,
};
use laurus::lexical::{TextOption, TermQuery};
use laurus::storage::memory::MemoryStorage;
#[tokio::main]
async fn main() -> Result<()> {
// 1. Storage
let storage = Arc::new(MemoryStorage::new(Default::default()));
// 2. Schema
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_text_field("body", TextOption::default())
.add_default_field("body")
.build();
// 3. Engine
let engine = Engine::builder(storage, schema).build().await?;
// 4. Index documents
for (id, title, body) in [
("doc-1", "Introduction to Rust", "Rust is a systems programming language focused on safety."),
("doc-2", "Python for Data Science", "Python is widely used in machine learning."),
("doc-3", "Web Development", "JavaScript powers interactive web applications."),
] {
let doc = Document::builder()
.add_text("title", title)
.add_text("body", body)
.build();
engine.add_document(id, doc).await?;
}
engine.commit().await?;
// 5. Search
let request = SearchRequestBuilder::new()
.lexical_search_request(
LexicalSearchRequest::new(
Box::new(TermQuery::new("body", "rust"))
)
)
.limit(10)
.build();
let results = engine.search(request).await?;
for r in &results {
println!("{}: score={:.4}", r.id, r.score);
}
Ok(())
}
Next Steps
- Learn how the Engine works internally: Architecture
- Understand Schema and field types: Schema & Fields
- Add vector search: Vector Search
- Combine lexical + vector: Hybrid Search
Core Concepts
This section covers the foundational building blocks of Laurus. Understanding these concepts will help you design effective schemas and configure your search engine.
Topics
Schema & Fields
How to define the structure of your documents. Covers:
- `Schema` and `SchemaBuilder`
- Lexical field types (Text, Integer, Float, Boolean, DateTime, Geo, Bytes)
- Vector field types (Flat, HNSW, IVF)
- `Document` and `DocumentBuilder`
- `DataValue` — the unified value type
Text Analysis
How text is processed before indexing. Covers:
- The `Analyzer` trait and the analysis pipeline
- Built-in analyzers (Standard, Japanese, Keyword, Pipeline)
- `PerFieldAnalyzer` — different analyzers for different fields
- Tokenizers and token filters
Embeddings
How text and images are converted to vectors. Covers:
- The `Embedder` trait
- Built-in embedders (Candle BERT, OpenAI, CLIP, Precomputed)
- `PerFieldEmbedder` — different embedders for different fields
Storage
Where index data is stored. Covers:
- The `Storage` trait
- Storage backends (Memory, File, Mmap)
- `PrefixedStorage` for component isolation
- Choosing the right backend for your use case
Schema & Fields
The Schema defines the structure of your documents — what fields exist and how each field is indexed. It is the single source of truth for the Engine.
For the TOML file format used by the CLI, see Schema Format Reference.
Schema
A Schema is a collection of named fields. Each field is either a lexical field (for keyword search) or a vector field (for similarity search).
#![allow(unused)]
fn main() {
use laurus::Schema;
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::vector::HnswOption;
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_text_field("body", TextOption::default())
.add_integer_field("year", IntegerOption::default())
.add_hnsw_field("embedding", HnswOption::default())
.add_default_field("body")
.build();
}
Default Fields
add_default_field() specifies which field(s) are searched when a query does not explicitly name a field. This is used by the Query DSL parser.
Field Types
graph TB
FO["FieldOption"]
FO --> T["Text"]
FO --> I["Integer"]
FO --> FL["Float"]
FO --> B["Boolean"]
FO --> DT["DateTime"]
FO --> G["Geo"]
FO --> BY["Bytes"]
FO --> FLAT["Flat"]
FO --> HNSW["HNSW"]
FO --> IVF["IVF"]
Lexical Fields
Lexical fields are indexed using an inverted index and support keyword-based queries.
| Type | Rust Type | SchemaBuilder Method | Description |
|---|---|---|---|
| Text | TextOption | add_text_field() | Full-text searchable; tokenized by the analyzer |
| Integer | IntegerOption | add_integer_field() | 64-bit signed integer; supports range queries |
| Float | FloatOption | add_float_field() | 64-bit floating point; supports range queries |
| Boolean | BooleanOption | add_boolean_field() | true / false |
| DateTime | DateTimeOption | add_datetime_field() | UTC timestamp; supports range queries |
| Geo | GeoOption | add_geo_field() | Latitude/longitude pair; supports radius and bounding box queries |
| Bytes | BytesOption | add_bytes_field() | Raw binary data |
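Behind these field types, an inverted index maps each term to the documents that contain it (covered in depth in the Indexing section). A toy sketch of the idea — not Laurus's actual layout:

```rust
use std::collections::HashMap;

// Minimal inverted index: term -> postings list of document IDs.
// (Illustrative only; a real index also stores frequencies, positions, etc.)
fn build_index(docs: &[(&str, &str)]) -> HashMap<String, Vec<String>> {
    let mut index: HashMap<String, Vec<String>> = HashMap::new();
    for (id, text) in docs {
        for term in text.to_lowercase().split_whitespace() {
            let postings = index.entry(term.to_string()).or_default();
            // Avoid consecutive duplicates from the same document.
            if postings.last().map(String::as_str) != Some(*id) {
                postings.push(id.to_string());
            }
        }
    }
    index
}

fn main() {
    let docs = [("doc-1", "Rust is fast"), ("doc-2", "Rust is safe")];
    let index = build_index(&docs);
    println!("{:?}", index.get("rust")); // both documents contain "rust"
}
```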
Text Field Options
TextOption controls how text is indexed:
#![allow(unused)]
fn main() {
use laurus::lexical::TextOption;
// Default: indexed + stored
let opt = TextOption::default();
// Customize: indexed + stored + term vectors
let opt = TextOption::default()
.set_indexed(true)
.set_stored(true)
.set_term_vectors(true);
}
| Option | Default | Description |
|---|---|---|
| `indexed` | true | Whether the field is searchable |
| `stored` | true | Whether the original value is stored for retrieval |
| `term_vectors` | false | Whether term positions are stored (needed for phrase queries) |
Vector Fields
Vector fields are indexed using vector indexes for approximate nearest neighbor (ANN) search.
| Type | Rust Type | SchemaBuilder Method | Description |
|---|---|---|---|
| Flat | FlatOption | add_flat_field() | Brute-force linear scan; exact results |
| HNSW | HnswOption | add_hnsw_field() | Hierarchical Navigable Small World graph; fast approximate |
| IVF | IvfOption | add_ivf_field() | Inverted File Index; cluster-based approximate |
HNSW Field Options (most common)
#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;
let opt = HnswOption {
dimension: 384, // vector dimensions
distance: DistanceMetric::Cosine, // distance metric
m: 16, // max connections per layer
ef_construction: 200, // construction search width
base_weight: 1.0, // default scoring weight
quantizer: None, // optional quantization
};
}
See Vector Indexing for detailed parameter guidance.
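For reference, `DistanceMetric::Cosine` corresponds to the standard cosine distance: 1 minus the cosine similarity of the two vectors. A sketch of the math, not the Laurus implementation:

```rust
// Cosine distance = 1 - cosine similarity (standard definition).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

fn main() {
    // Identical direction -> distance 0; orthogonal -> distance 1.
    println!("{:.1}", cosine_distance(&[1.0, 0.0], &[2.0, 0.0])); // 0.0
    println!("{:.1}", cosine_distance(&[1.0, 0.0], &[0.0, 1.0])); // 1.0
}
```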
Document
A Document is a collection of named field values. Use DocumentBuilder to construct documents:
#![allow(unused)]
fn main() {
use laurus::Document;
let doc = Document::builder()
.add_text("title", "Introduction to Rust")
.add_text("body", "Rust is a systems programming language.")
.add_integer("year", 2024)
.add_float("rating", 4.8)
.add_boolean("published", true)
.build();
}
Indexing Documents
The Engine provides two methods for adding documents, each with different semantics:
| Method | Behavior | Use Case |
|---|---|---|
| `put_document(id, doc)` | Upsert — if a document with the same ID exists, it is replaced | Standard document indexing |
| `add_document(id, doc)` | Append — adds the document as a new chunk; multiple chunks can share the same ID | Chunked/split documents (e.g., long articles split into paragraphs) |
#![allow(unused)]
fn main() {
// Upsert: replaces any existing document with id "doc1"
engine.put_document("doc1", doc).await?;
// Append: adds another chunk under the same id "doc1"
engine.add_document("doc1", chunk2).await?;
// Always commit after indexing
engine.commit().await?;
}
Retrieving Documents
Use get_documents to retrieve all documents (including chunks) by external ID:
#![allow(unused)]
fn main() {
let docs = engine.get_documents("doc1").await?;
for doc in &docs {
if let Some(title) = doc.get("title") {
println!("Title: {:?}", title);
}
}
}
Deleting Documents
Delete all documents and chunks sharing an external ID:
#![allow(unused)]
fn main() {
engine.delete_documents("doc1").await?;
engine.commit().await?;
}
Document Lifecycle
graph LR
A["Build Document"] --> B["put/add_document()"]
B --> C["WAL"]
C --> D["commit()"]
D --> E["Searchable"]
E --> F["get_documents()"]
E --> G["delete_documents()"]
Important: Documents are not searchable until `commit()` is called.
DocumentBuilder Methods
| Method | Value Type | Description |
|---|---|---|
| `add_text(name, value)` | `String` | Add a text field |
| `add_integer(name, value)` | `i64` | Add an integer field |
| `add_float(name, value)` | `f64` | Add a float field |
| `add_boolean(name, value)` | `bool` | Add a boolean field |
| `add_datetime(name, value)` | `DateTime<Utc>` | Add a datetime field |
| `add_vector(name, value)` | `Vec<f32>` | Add a pre-computed vector field |
| `add_geo(name, lat, lon)` | `(f64, f64)` | Add a geographic point |
| `add_bytes(name, data)` | `Vec<u8>` | Add binary data |
| `add_field(name, value)` | `DataValue` | Add any value type |
DataValue
DataValue is the unified value enum that represents any field value in Laurus:
#![allow(unused)]
fn main() {
pub enum DataValue {
Null,
Bool(bool),
Int64(i64),
Float64(f64),
Text(String),
Bytes(Vec<u8>, Option<String>), // (data, optional MIME type)
Vector(Vec<f32>),
DateTime(DateTime<Utc>),
Geo(f64, f64), // (latitude, longitude)
}
}
DataValue implements From<T> for common types, so you can use .into() conversions:
#![allow(unused)]
fn main() {
use laurus::DataValue;
let v: DataValue = "hello".into(); // Text
let v: DataValue = 42i64.into(); // Int64
let v: DataValue = 3.14f64.into(); // Float64
let v: DataValue = true.into(); // Bool
let v: DataValue = vec![0.1f32, 0.2].into(); // Vector
}
Reserved Fields
The _id field is reserved by Laurus for internal use. It stores the external document ID and is always indexed with KeywordAnalyzer (exact match). You do not need to add it to your schema — it is managed automatically.
Schema Design Tips
- **Separate lexical and vector fields** — a field is either lexical or vector, never both. For hybrid search, create separate fields (e.g., `body` for text, `body_vec` for vector).
- **Use `KeywordAnalyzer` for exact-match fields** — category, status, and tag fields should use `KeywordAnalyzer` via `PerFieldAnalyzer` to avoid tokenization.
- **Choose the right vector index** — use HNSW for most cases, Flat for small datasets, IVF for very large datasets. See Vector Indexing.
- **Set default fields** — if you use the Query DSL, set default fields so users can write `hello` instead of `body:hello`.
- **Use the schema generator** — run `laurus create schema` to interactively build a schema TOML file instead of writing it by hand. See CLI Commands.
Text Analysis
Text analysis is the process of converting raw text into searchable tokens. When a document is indexed, the analyzer breaks text fields into individual terms; when a query is executed, the same analyzer processes the query text to ensure consistency.
The Analysis Pipeline
graph LR
Input["Raw Text\n'The quick brown FOX jumps!'"]
CF["UnicodeNormalizationCharFilter"]
T["Tokenizer\nSplit into words"]
F1["LowercaseFilter"]
F2["StopFilter"]
F3["StemFilter"]
Output["Terms\n'quick', 'brown', 'fox', 'jump'"]
Input --> CF --> T --> F1 --> F2 --> F3 --> Output
The analysis pipeline consists of:
- Char Filters — normalize raw text at the character level before tokenization
- Tokenizer — splits text into raw tokens (words, characters, n-grams)
- Token Filters — transform, remove, or expand tokens (lowercase, stop words, stemming, synonyms)
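The three stages can be sketched as a chain of plain functions — a toy illustration of the pipeline (no stemming step here), not Laurus's trait-based design:

```rust
// Illustrative analysis pipeline: char filter -> tokenizer -> token filters.
fn analyze(text: &str) -> Vec<String> {
    let stop_words = ["the", "is", "a"];
    let filtered = text.replace('!', " "); // char filter: strip punctuation (toy)
    filtered
        .split_whitespace()                // tokenizer: split on whitespace
        .map(|t| t.to_lowercase())         // token filter: lowercase
        .filter(|t| !stop_words.contains(&t.as_str())) // token filter: stop words
        .collect()
}

fn main() {
    println!("{:?}", analyze("The quick brown FOX jumps!"));
    // → ["quick", "brown", "fox", "jumps"]
}
```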
The Analyzer Trait
All analyzers implement the Analyzer trait:
#![allow(unused)]
fn main() {
pub trait Analyzer: Send + Sync + Debug {
fn analyze(&self, text: &str) -> Result<TokenStream>;
fn name(&self) -> &str;
fn as_any(&self) -> &dyn Any;
}
}
TokenStream is a Box<dyn Iterator<Item = Token> + Send> — a lazy iterator over tokens.
A Token contains:
| Field | Type | Description |
|---|---|---|
| `text` | `String` | The token text |
| `position` | `usize` | Position in the original text |
| `start_offset` | `usize` | Start byte offset in original text |
| `end_offset` | `usize` | End byte offset in original text |
| `position_increment` | `usize` | Distance from previous token |
| `position_length` | `usize` | Span of the token (>1 for synonyms) |
| `boost` | `f32` | Token-level scoring weight |
| `stopped` | `bool` | Whether marked as a stop word |
| `metadata` | `Option<TokenMetadata>` | Additional token metadata |
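Absolute token positions are recovered by summing position_increment values, so an increment of 0 stacks a token (such as a synonym) on the previous position. A sketch assuming the usual Lucene-style convention (start below zero, add each increment):

```rust
// Absolute token positions from position_increment values.
// Increment 0 places a token (e.g. a synonym) at the same position
// as the previous token.
fn absolute_positions(increments: &[usize]) -> Vec<i64> {
    let mut pos: i64 = -1;
    increments
        .iter()
        .map(|&inc| {
            pos += inc as i64;
            pos
        })
        .collect()
}

fn main() {
    // Tokens ["machine", "ml", "learning"], where "ml" is a synonym
    // emitted at the same position as "machine": increments [1, 0, 1].
    println!("{:?}", absolute_positions(&[1, 0, 1])); // [0, 0, 1]
}
```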
Built-in Analyzers
StandardAnalyzer
The default analyzer. Suitable for most Western languages.
Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::standard::StandardAnalyzer;
let analyzer = StandardAnalyzer::default();
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering)
}
JapaneseAnalyzer
Uses morphological analysis for Japanese text segmentation.
Pipeline: UnicodeNormalizationCharFilter (NFKC) → JapaneseIterationMarkCharFilter → LinderaTokenizer → LowercaseFilter → StopFilter (Japanese stop words)
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::japanese::JapaneseAnalyzer;
let analyzer = JapaneseAnalyzer::new()?;
// "東京都に住んでいる" → ["東京", "都", "に", "住ん", "で", "いる"]
}
KeywordAnalyzer
Treats the entire input as a single token. No tokenization or normalization.
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;
let analyzer = KeywordAnalyzer::new();
// "Hello World" → ["Hello World"]
}
Use this for fields that should match exactly (categories, tags, status codes).
SimpleAnalyzer
Tokenizes text without any filtering. The original case and all tokens are preserved. Useful when you need complete control over the analysis pipeline or want to test a tokenizer in isolation.
Pipeline: User-specified Tokenizer only (no char filters, no token filters)
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::simple::SimpleAnalyzer;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use std::sync::Arc;
let tokenizer = Arc::new(RegexTokenizer::new()?);
let analyzer = SimpleAnalyzer::new(tokenizer);
// "Hello World" → ["Hello", "World"]
// (no lowercasing, no stop word removal)
}
Use this for testing tokenizers, or when you want to apply token filters manually in a separate step.
EnglishAnalyzer
An English-specific analyzer. Tokenizes, lowercases, and removes common English stop words.
Pipeline: RegexTokenizer (Unicode word boundaries) → LowercaseFilter → StopFilter (128 common English stop words)
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::language::english::EnglishAnalyzer;
let analyzer = EnglishAnalyzer::new()?;
// "The Quick Brown Fox" → ["quick", "brown", "fox"]
// ("The" is removed by stop word filtering, remaining tokens are lowercased)
}
PipelineAnalyzer
Build a custom pipeline by combining any char filters, a tokenizer, and any sequence of token filters:
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;
use laurus::analysis::token_filter::stop::StopFilter;
use laurus::analysis::token_filter::stem::StemFilter;
let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
.add_char_filter(Arc::new(UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC)))
.add_filter(Arc::new(LowercaseFilter::new()))
.add_filter(Arc::new(StopFilter::new()))
.add_filter(Arc::new(StemFilter::new())); // Porter stemmer
}
PerFieldAnalyzer
PerFieldAnalyzer lets you assign different analyzers to different fields within the same engine:
graph LR
PFA["PerFieldAnalyzer"]
PFA -->|"title"| KW["KeywordAnalyzer"]
PFA -->|"body"| STD["StandardAnalyzer"]
PFA -->|"description_ja"| JP["JapaneseAnalyzer"]
PFA -->|other fields| DEF["Default\n(StandardAnalyzer)"]
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::analysis::analyzer::keyword::KeywordAnalyzer;
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;
// Default analyzer for fields not explicitly configured
let mut per_field = PerFieldAnalyzer::new(
Arc::new(StandardAnalyzer::default())
);
// Use KeywordAnalyzer for exact-match fields
per_field.add_analyzer("category", Arc::new(KeywordAnalyzer::new()));
per_field.add_analyzer("status", Arc::new(KeywordAnalyzer::new()));
let engine = Engine::builder(storage, schema)
.analyzer(Arc::new(per_field))
.build()
.await?;
}
Note: The `_id` field is always analyzed with `KeywordAnalyzer` regardless of configuration.
Char Filters
Char filters operate on the raw input text before it reaches the tokenizer. They perform character-level normalization such as Unicode normalization, character mapping, and pattern-based replacement. This ensures that the tokenizer receives clean, normalized text.
All char filters implement the CharFilter trait:
#![allow(unused)]
fn main() {
pub trait CharFilter: Send + Sync {
fn filter(&self, input: &str) -> (String, Vec<Transformation>);
fn name(&self) -> &'static str;
}
}
The Transformation records describe how character positions shifted, allowing the engine to map token positions back to the original text.
| Char Filter | Description |
|---|---|
| `UnicodeNormalizationCharFilter` | Unicode normalization (NFC, NFD, NFKC, NFKD) |
| `MappingCharFilter` | Replaces character sequences based on a mapping dictionary |
| `PatternReplaceCharFilter` | Replaces characters matching a regex pattern |
| `JapaneseIterationMarkCharFilter` | Expands Japanese iteration marks (踊り字) to their base characters |
UnicodeNormalizationCharFilter
Applies Unicode normalization to the input text. NFKC is recommended for search use cases because it normalizes both compatibility characters and composed forms.
#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::unicode_normalize::{
NormalizationForm, UnicodeNormalizationCharFilter,
};
let filter = UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC);
// "Ｓｏｎｙ" (fullwidth) → "Sony" (halfwidth)
// "㌂" → "アンペア"
}
| Form | Description |
|---|---|
| NFC | Canonical decomposition followed by canonical composition |
| NFD | Canonical decomposition |
| NFKC | Compatibility decomposition followed by canonical composition |
| NFKD | Compatibility decomposition |
MappingCharFilter
Replaces character sequences using a dictionary. Matches are found using the Aho-Corasick algorithm (leftmost-longest match).
#![allow(unused)]
fn main() {
use std::collections::HashMap;
use laurus::analysis::char_filter::mapping::MappingCharFilter;
let mut mapping = HashMap::new();
mapping.insert("ph".to_string(), "f".to_string());
mapping.insert("qu".to_string(), "k".to_string());
let filter = MappingCharFilter::new(mapping)?;
// "phone queue" → "fone keue"
}
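Leftmost-longest matching means that at each input position the longest matching key wins. A naive sketch of that behavior (Aho-Corasick produces the same result far more efficiently; the `map_chars` helper here is hypothetical, not the Laurus API):

```rust
use std::collections::HashMap;

// Naive leftmost-longest replacement: at each position, apply the
// longest matching key, otherwise copy one character through.
fn map_chars(input: &str, mapping: &HashMap<&str, &str>) -> String {
    let mut keys: Vec<&&str> = mapping.keys().collect();
    keys.sort_by_key(|k| std::cmp::Reverse(k.len())); // longest first
    let mut out = String::new();
    let mut rest = input;
    'outer: while !rest.is_empty() {
        for key in &keys {
            if let Some(stripped) = rest.strip_prefix(**key) {
                out.push_str(mapping[**key]);
                rest = stripped;
                continue 'outer;
            }
        }
        let ch = rest.chars().next().unwrap();
        out.push(ch);
        rest = &rest[ch.len_utf8()..];
    }
    out
}

fn main() {
    let mapping = HashMap::from([("ph", "f"), ("qu", "k")]);
    println!("{}", map_chars("phone queue", &mapping)); // "fone keue"
}
```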
PatternReplaceCharFilter
Replaces all occurrences of a regex pattern with a fixed string.
#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;
// Remove hyphens
let filter = PatternReplaceCharFilter::new(r"-", "")?;
// "123-456-789" → "123456789"
// Normalize numbers
let filter = PatternReplaceCharFilter::new(r"\d+", "NUM")?;
// "Year 2024" → "Year NUM"
}
JapaneseIterationMarkCharFilter
Expands Japanese iteration marks (踊り字) to their base characters. Supports kanji (々), hiragana (ゝ, ゞ), and katakana (ヽ, ヾ) iteration marks.
#![allow(unused)]
fn main() {
use laurus::analysis::char_filter::japanese_iteration_mark::JapaneseIterationMarkCharFilter;
let filter = JapaneseIterationMarkCharFilter::new(
true, // normalize kanji iteration marks
true, // normalize kana iteration marks
);
// "佐々木" → "佐佐木"
// "いすゞ" → "いすず"
}
Using Char Filters in a Pipeline
Add char filters to a PipelineAnalyzer with add_char_filter(). Multiple char filters are applied in the order they are added, all before the tokenizer runs.
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::pipeline::PipelineAnalyzer;
use laurus::analysis::char_filter::unicode_normalize::{
NormalizationForm, UnicodeNormalizationCharFilter,
};
use laurus::analysis::char_filter::pattern_replace::PatternReplaceCharFilter;
use laurus::analysis::tokenizer::regex::RegexTokenizer;
use laurus::analysis::token_filter::lowercase::LowercaseFilter;
let analyzer = PipelineAnalyzer::new(Arc::new(RegexTokenizer::new()?))
.add_char_filter(Arc::new(
UnicodeNormalizationCharFilter::new(NormalizationForm::NFKC),
))
.add_char_filter(Arc::new(
PatternReplaceCharFilter::new(r"-", "")?,
))
.add_filter(Arc::new(LowercaseFilter::new()));
// "Tokyo-2024" → NFKC → "Tokyo-2024" → remove hyphens → "Tokyo2024" → tokenize → lowercase → ["tokyo2024"]
}
Tokenizers
| Tokenizer | Description |
|---|---|
| `RegexTokenizer` | Unicode word boundaries; splits on whitespace and punctuation |
| `UnicodeWordTokenizer` | Splits on Unicode word boundaries |
| `WhitespaceTokenizer` | Splits on whitespace only |
| `WholeTokenizer` | Returns the entire input as a single token |
| `LinderaTokenizer` | Japanese morphological analysis (Lindera/MeCab) |
| `NgramTokenizer` | Generates n-gram tokens of configurable size |
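N-gram tokenization emits overlapping character windows; for size 2, "rust" yields "ru", "us", "st". A toy sketch of fixed-size character n-grams (illustrative only, not the NgramTokenizer API):

```rust
// Overlapping character n-grams of a fixed size n (toy sketch).
fn ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    println!("{:?}", ngrams("rust", 2)); // ["ru", "us", "st"]
}
```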
Token Filters
| Filter | Description |
|---|---|
| `LowercaseFilter` | Converts tokens to lowercase |
| `StopFilter` | Removes common words ("the", "is", "a") |
| `StemFilter` | Reduces words to their root form ("running" → "run") |
| `SynonymGraphFilter` | Expands tokens with synonyms from a dictionary |
| `BoostFilter` | Adjusts token boost values |
| `LimitFilter` | Limits the number of tokens |
| `StripFilter` | Strips leading/trailing whitespace from tokens |
| `FlattenGraphFilter` | Flattens token graphs (for synonym expansion) |
| `RemoveEmptyFilter` | Removes empty tokens |
Synonym Expansion
The SynonymGraphFilter expands terms using a synonym dictionary:
#![allow(unused)]
fn main() {
use laurus::analysis::synonym::dictionary::SynonymDictionary;
use laurus::analysis::token_filter::synonym_graph::SynonymGraphFilter;
let mut dict = SynonymDictionary::new(None)?;
dict.add_synonym_group(vec!["ml".into(), "machine learning".into()]);
dict.add_synonym_group(vec!["ai".into(), "artificial intelligence".into()]);
// keep_original=true means original token is preserved alongside synonyms
let filter = SynonymGraphFilter::new(dict, true)
.with_boost(0.8); // synonyms get 80% weight
}
The boost parameter controls how much weight synonyms receive relative to original tokens. A value of 0.8 means synonym matches contribute 80% as much to the score as exact matches.
Embeddings
Embeddings convert text (or images) into dense numeric vectors that capture semantic meaning. Two texts with similar meanings produce vectors that are close together in vector space, enabling similarity-based search.
The Embedder Trait
All embedders implement the Embedder trait:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait Embedder: Send + Sync + Debug {
async fn embed(&self, input: &EmbedInput<'_>) -> Result<Vector>;
async fn embed_batch(&self, inputs: &[EmbedInput<'_>]) -> Result<Vec<Vector>>;
fn supported_input_types(&self) -> Vec<EmbedInputType>;
fn name(&self) -> &str;
fn as_any(&self) -> &dyn Any;
}
}
The embed() method returns a Vector (a struct wrapping Vec<f32>).
EmbedInput supports two modalities:
| Variant | Description |
|---|---|
| `EmbedInput::Text(&str)` | Text input |
| `EmbedInput::Bytes(&[u8], Option<&str>)` | Binary input with optional MIME type (for images) |
Built-in Embedders
CandleBertEmbedder
Runs a BERT model locally using Hugging Face Candle. No API key required.
Feature flag: embeddings-candle
#![allow(unused)]
fn main() {
use laurus::CandleBertEmbedder;
// Downloads model on first run (~80MB)
let embedder = CandleBertEmbedder::new(
"sentence-transformers/all-MiniLM-L6-v2" // model name
)?;
// Output: 384-dimensional vector
}
| Property | Value |
|---|---|
| Model | sentence-transformers/all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Runtime | Local (CPU) |
| First-run download | ~80 MB |
OpenAIEmbedder
Calls the OpenAI Embeddings API. Requires an API key.
Feature flag: embeddings-openai
#![allow(unused)]
fn main() {
use laurus::OpenAIEmbedder;
let embedder = OpenAIEmbedder::new(
api_key,
"text-embedding-3-small".to_string()
).await?;
// Output: 1536-dimensional vector
}
| Property | Value |
|---|---|
| Model | text-embedding-3-small (or any OpenAI model) |
| Dimensions | 1536 (for text-embedding-3-small) |
| Runtime | Remote API call |
| Requires | OPENAI_API_KEY environment variable |
CandleClipEmbedder
Runs a CLIP model locally for multimodal (text + image) embeddings.
Feature flag: embeddings-multimodal
#![allow(unused)]
fn main() {
use laurus::CandleClipEmbedder;
let embedder = CandleClipEmbedder::new(
"openai/clip-vit-base-patch32"
)?;
// Text or images → 512-dimensional vector
}
| Property | Value |
|---|---|
| Model | openai/clip-vit-base-patch32 |
| Dimensions | 512 |
| Input types | Text AND images |
| Use case | Text-to-image search, image-to-image search |
PrecomputedEmbedder
Use pre-computed vectors directly without any embedding computation. Useful when vectors are generated externally.
#![allow(unused)]
fn main() {
use laurus::PrecomputedEmbedder;
let embedder = PrecomputedEmbedder::new(); // no parameters needed
}
When using PrecomputedEmbedder, you provide vectors directly in documents instead of text for embedding:
#![allow(unused)]
fn main() {
let doc = Document::builder()
.add_vector("embedding", vec![0.1, 0.2, 0.3, /* ... */])
.build();
}
PerFieldEmbedder
PerFieldEmbedder routes embedding requests to field-specific embedders:
graph LR
PFE["PerFieldEmbedder"]
PFE -->|"text_vec"| BERT["CandleBertEmbedder\n(384 dim)"]
PFE -->|"image_vec"| CLIP["CandleClipEmbedder\n(512 dim)"]
PFE -->|other fields| DEF["Default Embedder"]
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::PerFieldEmbedder;
let bert = Arc::new(CandleBertEmbedder::new("...")?);
let clip = Arc::new(CandleClipEmbedder::new("...")?);
let mut per_field = PerFieldEmbedder::new(bert.clone());
per_field.add_embedder("text_vec", bert.clone());
per_field.add_embedder("image_vec", clip.clone());
let engine = Engine::builder(storage, schema)
.embedder(Arc::new(per_field))
.build()
.await?;
}
This is especially useful when:
- Different vector fields need different models (e.g., BERT for text, CLIP for images)
- Different fields have different vector dimensions
- You want to mix local and remote embedders
How Embeddings Are Used
At Index Time
When you add a text value to a vector field, the engine automatically embeds it:
#![allow(unused)]
fn main() {
let doc = Document::builder()
.add_text("text_vec", "Rust is a systems programming language")
.build();
engine.add_document("doc-1", doc).await?;
// The embedder converts the text to a vector before indexing
}
At Search Time
When you search with text, the engine embeds the query text as well:
#![allow(unused)]
fn main() {
// Builder API
let request = VectorSearchRequestBuilder::new()
.add_text("text_vec", "systems programming")
.build();
// Query DSL
let request = vector_parser.parse(r#"text_vec:~"systems programming""#).await?;
}
Both approaches embed the query text using the same embedder that was used at index time, ensuring consistent vector spaces.
Choosing an Embedder
| Scenario | Recommended Embedder |
|---|---|
| Quick prototyping, offline use | CandleBertEmbedder |
| Production with high accuracy | OpenAIEmbedder |
| Text + image search | CandleClipEmbedder |
| Pre-computed vectors from external pipeline | PrecomputedEmbedder |
| Multiple models per field | PerFieldEmbedder wrapping others |
Storage
Laurus uses a pluggable storage layer that abstracts how and where index data is persisted. All components — lexical index, vector index, and document log — share a single storage backend.
The Storage Trait
All backends implement the Storage trait:
#![allow(unused)]
fn main() {
pub trait Storage: Send + Sync + Debug {
fn loading_mode(&self) -> LoadingMode;
fn open_input(&self, name: &str) -> Result<Box<dyn StorageInput>>;
fn create_output(&self, name: &str) -> Result<Box<dyn StorageOutput>>;
fn file_exists(&self, name: &str) -> bool;
fn delete_file(&self, name: &str) -> Result<()>;
fn list_files(&self) -> Result<Vec<String>>;
fn file_size(&self, name: &str) -> Result<u64>;
// ... additional methods
}
}
This interface is file-oriented: all data (index segments, metadata, WAL entries, documents) is stored as named files accessed through streaming StorageInput / StorageOutput handles.
Storage Backends
MemoryStorage
All data lives in memory. Fast and simple, but not durable.
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::memory::MemoryStorage;
let storage: Arc<dyn Storage> = Arc::new(
MemoryStorage::new(Default::default())
);
}
| Property | Value |
|---|---|
| Durability | None (data lost on process exit) |
| Speed | Fastest |
| Use case | Testing, prototyping, ephemeral data |
FileStorage
Standard file-system based persistence. Each key maps to a file on disk.
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::file::{FileStorage, FileStorageConfig};
let config = FileStorageConfig::new("/tmp/laurus-data");
let storage: Arc<dyn Storage> = Arc::new(FileStorage::new("/tmp/laurus-data", config)?);
}
| Property | Value |
|---|---|
| Durability | Full (persisted to disk) |
| Speed | Moderate (disk I/O) |
| Use case | General production use |
FileStorage with Memory Mapping
FileStorage supports memory-mapped file access via the use_mmap configuration flag. When enabled, the OS manages paging between memory and disk.
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::Storage;
use laurus::storage::file::{FileStorage, FileStorageConfig};
let mut config = FileStorageConfig::new("/tmp/laurus-data");
config.use_mmap = true; // enable memory-mapped I/O
let storage: Arc<dyn Storage> = Arc::new(FileStorage::new("/tmp/laurus-data", config)?);
}
| Property | Value |
|---|---|
| Durability | Full (persisted to disk) |
| Speed | Fast (OS-managed memory mapping) |
| Use case | Large datasets, read-heavy workloads |
StorageFactory
You can also create storage via configuration:
#![allow(unused)]
fn main() {
use laurus::storage::{StorageConfig, StorageFactory};
use laurus::storage::memory::MemoryStorageConfig;
let storage = StorageFactory::create(
StorageConfig::Memory(MemoryStorageConfig::default())
)?;
}
PrefixedStorage
The engine uses PrefixedStorage to isolate components within a single storage backend:
graph TB
E["Engine"]
E --> P1["PrefixedStorage\nprefix = 'lexical/'"]
E --> P2["PrefixedStorage\nprefix = 'vector/'"]
E --> P3["PrefixedStorage\nprefix = 'documents/'"]
P1 --> S["Storage Backend"]
P2 --> S
P3 --> S
When the lexical store writes a key `segments/seg-001.dict`, it is actually stored as `lexical/segments/seg-001.dict` in the underlying backend. This ensures no key collisions between components.
You do not need to create PrefixedStorage yourself — the EngineBuilder handles this automatically.
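The key-mapping idea can be sketched in a few lines. The real `PrefixedStorage` wraps a full `Storage` backend; this toy type (a name invented for illustration) only shows how component keys are rewritten:

```rust
// Sketch of key prefixing: each component's keys are rewritten onto a
// shared namespace. `PrefixedKeys` is a made-up name for illustration;
// the real PrefixedStorage wraps an entire Storage backend.
struct PrefixedKeys {
    prefix: String,
}

impl PrefixedKeys {
    fn full_key(&self, name: &str) -> String {
        format!("{}{}", self.prefix, name)
    }
}

fn main() {
    let lexical = PrefixedKeys { prefix: "lexical/".into() };
    // A lexical-store write of "segments/seg-001.dict" lands here:
    println!("{}", lexical.full_key("segments/seg-001.dict"));
    // lexical/segments/seg-001.dict
}
```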
Choosing a Backend
| Factor | MemoryStorage | FileStorage | FileStorage (mmap) |
|---|---|---|---|
| Durability | None | Full | Full |
| Read speed | Fastest | Moderate | Fast |
| Write speed | Fastest | Moderate | Moderate |
| Memory usage | Proportional to data size | Low | OS-managed |
| Max data size | Limited by RAM | Limited by disk | Limited by disk + address space |
| Best for | Tests, small datasets | General use | Large read-heavy datasets |
Recommendations
- Development / Testing: Use `MemoryStorage` for fast iteration without file cleanup
- Production (general): Use `FileStorage` for reliable persistence
- Production (large scale): Use `FileStorage` with `use_mmap = true` when you have large indexes and want to leverage the OS page cache
Next Steps
- Learn how the lexical index works: Lexical Indexing
- Learn how the vector index works: Vector Indexing
Indexing
This section explains how Laurus stores and organizes data internally. Understanding the indexing layer will help you choose the right field types and tune performance.
Topics
Lexical Indexing
How text, numeric, and geographic fields are indexed using an inverted index. Covers:
- The inverted index structure (term dictionary, posting lists)
- BKD trees for numeric range queries
- Segment files and their formats
- BM25 scoring
Vector Indexing
How vector fields are indexed for approximate nearest neighbor search. Covers:
- Index types: Flat, HNSW, IVF
- Parameter tuning (m, ef_construction, n_clusters, n_probe)
- Distance metrics (Cosine, Euclidean, DotProduct)
- Quantization (SQ8, PQ)
Lexical Indexing
Lexical indexing powers keyword-based search. When a document’s text field is indexed, Laurus builds an inverted index — a data structure that maps terms to the documents containing them.
How Lexical Indexing Works
sequenceDiagram
participant Doc as Document
participant Analyzer
participant Writer as IndexWriter
participant Seg as Segment
Doc->>Analyzer: "The quick brown fox"
Analyzer->>Analyzer: Tokenize + Filter
Analyzer-->>Writer: ["quick", "brown", "fox"]
Writer->>Writer: Buffer in memory
Writer->>Seg: Flush to segment on commit()
Step by Step
- Analyze: The text passes through the configured analyzer (tokenizer + filters), producing a stream of normalized terms
- Buffer: Terms are stored in an in-memory write buffer, organized by field
- Commit: On `commit()`, the buffer is flushed to a new segment on storage
The Inverted Index
An inverted index is essentially a map from terms to document lists:
graph LR
subgraph "Term Dictionary"
T1["'brown'"]
T2["'fox'"]
T3["'quick'"]
T4["'rust'"]
end
subgraph "Posting Lists"
P1["doc_1, doc_3"]
P2["doc_1"]
P3["doc_1, doc_2"]
P4["doc_2, doc_3"]
end
T1 --> P1
T2 --> P2
T3 --> P3
T4 --> P4
| Component | Description |
|---|---|
| Term Dictionary | Sorted list of all unique terms in the index; supports fast prefix lookup |
| Posting Lists | For each term, a list of document IDs and metadata (term frequency, positions) |
| Doc Values | Column-oriented storage for sort/filter operations on numeric and date fields |
Posting List Contents
Each entry in a posting list contains:
| Field | Description |
|---|---|
| Document ID | Internal u64 identifier |
| Term Frequency | How many times the term appears in this document |
| Positions (optional) | Where in the document the term appears (needed for phrase queries) |
| Weight | Score weight for this posting |
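The term-dictionary and posting-list structure above can be sketched in plain Rust. This toy builder records `(doc_id, term_frequency)` pairs per term; it illustrates the data structure, not Laurus's on-disk format:

```rust
use std::collections::BTreeMap;

// Minimal inverted index: term -> posting list of (doc_id, term_freq).
// BTreeMap keeps terms sorted, like a term dictionary.
// Illustrative only, not Laurus's segment format.
fn build_index(docs: &[(u64, &str)]) -> BTreeMap<String, Vec<(u64, u32)>> {
    let mut index: BTreeMap<String, Vec<(u64, u32)>> = BTreeMap::new();
    for &(doc_id, text) in docs {
        // count term frequencies within this document
        let mut counts: BTreeMap<String, u32> = BTreeMap::new();
        for tok in text.split_whitespace() {
            *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
        }
        // append one posting per (term, doc)
        for (term, tf) in counts {
            index.entry(term).or_default().push((doc_id, tf));
        }
    }
    index
}

fn main() {
    let index = build_index(&[(1, "quick brown fox"), (2, "quick quick rust")]);
    println!("{:?}", index.get("quick")); // posting list for "quick"
}
```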
Numeric and Date Fields
Integer, float, and datetime fields are indexed using a BKD tree — a space-partitioning data structure optimized for range queries:
graph TB
Root["BKD Root"]
Root --> L["values < 50"]
Root --> R["values >= 50"]
L --> LL["values < 25"]
L --> LR["25 <= values < 50"]
R --> RL["50 <= values < 75"]
R --> RR["values >= 75"]
BKD trees allow efficient evaluation of range queries like price:[10 TO 100] or date:[2024-01-01 TO 2024-12-31].
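The contract a BKD tree serves, "return all doc IDs whose value falls in [min, max]", can be illustrated with binary search over a sorted `(value, doc_id)` column. A real BKD tree generalizes this to multiple dimensions and disk-resident blocks; this is only a one-dimensional sketch:

```rust
// One-dimensional range lookup over a column sorted by value.
// A sketch of the query a BKD tree answers, not a BKD tree itself.
fn range_query(sorted: &[(i64, u64)], min: i64, max: i64) -> Vec<u64> {
    // binary-search the inclusive [min, max] window
    let start = sorted.partition_point(|&(v, _)| v < min);
    let end = sorted.partition_point(|&(v, _)| v <= max);
    sorted[start..end].iter().map(|&(_, doc)| doc).collect()
}

fn main() {
    // (value, doc_id) pairs sorted by value, e.g. a "price" field
    let prices = [(5, 10), (10, 11), (42, 12), (100, 13), (250, 14)];
    // price:[10 TO 100]
    println!("{:?}", range_query(&prices, 10, 100)); // [11, 12, 13]
}
```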
Geo Fields
Geographic fields store latitude/longitude pairs. They are indexed using a spatial data structure that supports:
- Radius queries: find all points within N kilometers of a center point
- Bounding box queries: find all points within a rectangular area
Segments
The lexical index is organized into segments. Each segment is an immutable, self-contained mini-index:
graph TB
LI["Lexical Index"]
LI --> S1["Segment 0"]
LI --> S2["Segment 1"]
LI --> S3["Segment 2"]
S1 --- F1[".dict (terms)"]
S1 --- F2[".post (postings)"]
S1 --- F3[".bkd (numerics)"]
S1 --- F4[".docs (doc store)"]
S1 --- F5[".dv (doc values)"]
S1 --- F6[".meta (metadata)"]
S1 --- F7[".lens (field lengths)"]
| File Extension | Contents |
|---|---|
| `.dict` | Term dictionary (sorted terms + metadata offsets) |
| `.post` | Posting lists (document IDs, term frequencies, positions) |
| `.bkd` | BKD tree data for numeric and date fields |
| `.docs` | Stored field values (the original document content) |
| `.dv` | Doc values for sorting and filtering |
| `.meta` | Segment metadata (doc count, term count, etc.) |
| `.lens` | Field length norms (for BM25 scoring) |
Segment Lifecycle
- Create: A new segment is created each time `commit()` is called
- Search: All segments are searched in parallel and results are merged
- Merge: Periodically, multiple small segments are merged into larger ones to improve query performance
- Delete: When a document is deleted, its ID is added to a deletion bitmap rather than physically removed (see Deletions & Compaction)
BM25 Scoring
Laurus uses the BM25 algorithm to score lexical search results. BM25 considers:
- Term Frequency (TF): how often the term appears in the document (more = better, with diminishing returns)
- Inverse Document Frequency (IDF): how rare the term is across all documents (rarer = more important)
- Field Length Normalization: shorter fields are boosted relative to longer ones
The formula:
score(q, d) = IDF(q) * (TF(q, d) * (k1 + 1)) / (TF(q, d) + k1 * (1 - b + b * |d| / avgdl))
Where k1 = 1.2 and b = 0.75 are the default tuning parameters.
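Plugging the defaults into the formula above, a single per-term score can be computed directly. This is a standalone sketch of the formula, not Laurus's internal scorer:

```rust
// Per-term BM25 score, matching the formula above with the default
// tuning parameters k1 = 1.2 and b = 0.75.
fn bm25_term_score(tf: f64, idf: f64, doc_len: f64, avg_doc_len: f64) -> f64 {
    let k1 = 1.2;
    let b = 0.75;
    // length norm: < 1 boosts short docs, > 1 penalizes long docs
    let norm = 1.0 - b + b * doc_len / avg_doc_len;
    idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}

fn main() {
    // A document of exactly average length: the norm term reduces to 1.
    println!("{:.4}", bm25_term_score(3.0, 1.5, 100.0, 100.0));
    // Diminishing returns: tf = 6 scores less than twice tf = 3.
    println!("{:.4}", bm25_term_score(6.0, 1.5, 100.0, 100.0));
}
```

Note the diminishing returns: doubling the term frequency does not double the score, because `tf` also appears in the denominator.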
SIMD Optimization
Vector distance calculations leverage SIMD (Single Instruction, Multiple Data) instructions when available, providing significant speedups for similarity computations in vector search.
Code Example
use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::storage::memory::MemoryStorage;
#[tokio::main]
async fn main() -> laurus::Result<()> {
let storage = Arc::new(MemoryStorage::new(Default::default()));
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_text_field("body", TextOption::default())
.add_integer_field("year", IntegerOption::default())
.build();
let engine = Engine::builder(storage, schema).build().await?;
// Index documents
engine.add_document("doc-1", Document::builder()
.add_text("title", "Rust Programming")
.add_text("body", "Rust is a systems programming language.")
.add_integer("year", 2024)
.build()
).await?;
// Commit to flush segments to storage
engine.commit().await?;
Ok(())
}
Next Steps
- Learn how vector indexes work: Vector Indexing
- Run queries against the lexical index: Lexical Search
Vector Indexing
Vector indexing powers similarity-based search. When a document’s vector field is indexed, Laurus stores the embedding vector in a specialized index structure that enables fast approximate nearest neighbor (ANN) retrieval.
How Vector Indexing Works
sequenceDiagram
participant Doc as Document
participant Embedder
participant Normalize as Normalizer
participant Index as Vector Index
Doc->>Embedder: "Rust is a systems language"
Embedder-->>Normalize: [0.12, -0.45, 0.78, ...]
Normalize->>Normalize: L2 normalize
Normalize-->>Index: [0.14, -0.52, 0.90, ...]
Index->>Index: Insert into index structure
Step by Step
- Embed: The text (or image) is converted to a vector by the configured embedder
- Normalize: The vector is L2-normalized (for cosine similarity)
- Index: The vector is inserted into the configured index structure (Flat, HNSW, or IVF)
- Commit: On `commit()`, the index is flushed to persistent storage
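Step 2, L2 normalization, is simple enough to show standalone (a sketch of the operation, not the Laurus internals):

```rust
// L2 normalization: scale a vector to unit length, as applied before
// cosine-distance indexing. Standalone sketch.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec(); // leave the zero vector unchanged
    }
    v.iter().map(|x| x / norm).collect()
}

fn main() {
    // [3, 4] has norm 5, so it normalizes to [0.6, 0.8]
    println!("{:?}", l2_normalize(&[3.0, 4.0]));
}
```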
Index Types
Laurus supports three vector index types, each with different performance characteristics:
Comparison
| Property | Flat | HNSW | IVF |
|---|---|---|---|
| Accuracy | 100% (exact) | ~95-99% (approximate) | ~90-98% (approximate) |
| Search speed | O(n) linear scan | O(log n) graph walk | O(n/k) cluster scan |
| Memory usage | Low | Higher (graph edges) | Moderate (centroids) |
| Index build time | Fast | Moderate | Slower (clustering) |
| Best for | < 10K vectors | 10K - 10M vectors | > 1M vectors |
Flat Index
The simplest index. Compares the query vector against every stored vector (brute-force).
#![allow(unused)]
fn main() {
use laurus::vector::FlatOption;
use laurus::vector::core::distance::DistanceMetric;
let opt = FlatOption {
dimension: 384,
distance: DistanceMetric::Cosine,
..Default::default()
};
}
- Pros: 100% recall (exact results), simple, low memory
- Cons: Slow for large datasets (linear scan)
- Use when: You have fewer than ~10,000 vectors, or you need exact results
HNSW Index
Hierarchical Navigable Small World graph. The default and most commonly used index type.
graph TB
subgraph "Layer 2 (sparse)"
A2["A"] --- C2["C"]
end
subgraph "Layer 1 (medium)"
A1["A"] --- B1["B"]
A1 --- C1["C"]
B1 --- D1["D"]
C1 --- D1
end
subgraph "Layer 0 (dense - all vectors)"
A0["A"] --- B0["B"]
A0 --- C0["C"]
B0 --- D0["D"]
B0 --- E0["E"]
C0 --- D0
C0 --- F0["F"]
D0 --- E0
E0 --- F0
end
A2 -.->|"entry point"| A1
A1 -.-> A0
C2 -.-> C1
C1 -.-> C0
B1 -.-> B0
D1 -.-> D0
The HNSW algorithm searches from the top (sparse) layer down to the bottom (dense) layer, narrowing the search space at each level.
#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;
let opt = HnswOption {
dimension: 384,
distance: DistanceMetric::Cosine,
m: 16, // max connections per node per layer
ef_construction: 200, // search width during index building
..Default::default()
};
}
HNSW Parameters
| Parameter | Default | Description | Impact |
|---|---|---|---|
| `m` | 16 | Max bi-directional connections per layer | Higher = better recall, more memory |
| `ef_construction` | 200 | Search width during index building | Higher = better recall, slower build |
| `dimension` | 128 | Vector dimensions | Must match embedder output |
| `distance` | Cosine | Distance metric | See Distance Metrics below |
Tuning tips:
- Increase `m` (e.g., 32 or 64) for higher recall at the cost of memory
- Increase `ef_construction` (e.g., 400) for better index quality at the cost of build time
- At search time, the `ef_search` parameter (set in the search request) controls the search width
IVF Index
Inverted File Index. Partitions vectors into clusters, then only searches relevant clusters.
graph TB
Q["Query Vector"]
Q --> C1["Cluster 1\n(centroid)"]
Q --> C2["Cluster 2\n(centroid)"]
C1 --> V1["vec_3"]
C1 --> V2["vec_7"]
C1 --> V3["vec_12"]
C2 --> V4["vec_1"]
C2 --> V5["vec_9"]
C2 --> V6["vec_15"]
style C1 fill:#f9f,stroke:#333
style C2 fill:#f9f,stroke:#333
#![allow(unused)]
fn main() {
use laurus::vector::IvfOption;
use laurus::vector::core::distance::DistanceMetric;
let opt = IvfOption {
dimension: 384,
distance: DistanceMetric::Cosine,
n_clusters: 100, // number of clusters
n_probe: 10, // clusters to search at query time
..Default::default()
};
}
IVF Parameters
| Parameter | Default | Description | Impact |
|---|---|---|---|
| `n_clusters` | 100 | Number of Voronoi cells | More clusters = faster search, lower recall |
| `n_probe` | 1 | Clusters to search at query time | Higher = better recall, slower search |
| `dimension` | (required) | Vector dimensions | Must match embedder output |
| `distance` | Cosine | Distance metric | See Distance Metrics below |
Tuning tips:
- Set `n_clusters` to roughly `sqrt(n)`, where `n` is the number of vectors
- Set `n_probe` to 5-20% of `n_clusters` for a good recall/speed trade-off
- IVF requires a training phase — initial indexing may be slower
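The `sqrt(n)` heuristic from the tuning tips, as a quick calculation. The helper name is invented for illustration, and the 10% `n_probe` choice is just the midpoint of the 5-20% guidance:

```rust
// Starting-point IVF parameters from the sqrt(n) rule of thumb.
// `suggest_ivf_params` is a hypothetical helper, not a Laurus API.
fn suggest_ivf_params(n_vectors: usize) -> (usize, usize) {
    // n_clusters ~ sqrt(n)
    let n_clusters = (n_vectors as f64).sqrt().round() as usize;
    // n_probe ~ 10% of n_clusters (midpoint of the 5-20% guidance)
    let n_probe = (n_clusters as f64 * 0.10).ceil() as usize;
    (n_clusters.max(1), n_probe.max(1))
}

fn main() {
    // 1M vectors -> about 1000 clusters, probe about 100 of them
    println!("{:?}", suggest_ivf_params(1_000_000));
}
```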
Distance Metrics
| Metric | Description | Range | Best For |
|---|---|---|---|
| `Cosine` | 1 - cosine similarity | [0, 2] | Text embeddings (most common) |
| `Euclidean` | L2 distance | [0, +inf) | Spatial data |
| `Manhattan` | L1 distance | [0, +inf) | Feature vectors |
| `DotProduct` | Negative inner product | (-inf, +inf) | Pre-normalized vectors |
| `Angular` | Angular distance | [0, pi] | Directional similarity |
#![allow(unused)]
fn main() {
use laurus::vector::core::distance::DistanceMetric;
let metric = DistanceMetric::Cosine; // Default for text
let metric = DistanceMetric::Euclidean; // For spatial data
let metric = DistanceMetric::Manhattan; // L1 distance
let metric = DistanceMetric::DotProduct; // For pre-normalized vectors
let metric = DistanceMetric::Angular; // Angular distance
}
Note: For cosine similarity, vectors are automatically L2-normalized before indexing. Lower distance = more similar.
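As a sketch, the cosine metric from the table computed directly. For L2-normalized vectors the denominator is 1, so the distance reduces to `1 - dot`:

```rust
// Cosine distance (1 - cosine similarity), the default metric for text.
// Standalone sketch, not the Laurus implementation.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

fn main() {
    // Range [0, 2]: 0 = identical direction, 1 = orthogonal, 2 = opposite
    println!("{}", cosine_distance(&[1.0, 0.0], &[1.0, 0.0]));
    println!("{}", cosine_distance(&[1.0, 0.0], &[0.0, 1.0]));
    println!("{}", cosine_distance(&[1.0, 0.0], &[-1.0, 0.0]));
}
```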
Quantization
Quantization reduces memory usage by compressing vectors at the cost of some accuracy:
| Method | Enum Variant | Description | Memory Reduction |
|---|---|---|---|
| Scalar 8-bit | Scalar8Bit | Scalar quantization to 8-bit integers | ~4x |
| Product Quantization | ProductQuantization { subvector_count } | Splits vectors into sub-vectors and quantizes each | ~16-64x |
#![allow(unused)]
fn main() {
use laurus::vector::HnswOption;
use laurus::vector::core::quantization::QuantizationMethod;
let opt = HnswOption {
dimension: 384,
quantizer: Some(QuantizationMethod::Scalar8Bit),
..Default::default()
};
}
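A sketch of what scalar 8-bit quantization does: map each `f32` onto a `u8` bucket over the vector's `[min, max]` range, trading a little precision for the ~4x memory reduction noted above. This is illustrative only; it is not Laurus's actual codec, and real implementations often train the range over many vectors:

```rust
// Scalar 8-bit quantization sketch: 256 evenly spaced buckets over
// this vector's [min, max] range. Illustrative, not Laurus's codec.
fn quantize_sq8(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|x| ((x - min) * scale).round() as u8).collect();
    (codes, min, max) // min/max must be kept to decode
}

fn dequantize_sq8(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = (max - min) / 255.0;
    codes.iter().map(|&c| min + c as f32 * step).collect()
}

fn main() {
    let (codes, min, max) = quantize_sq8(&[-1.0, 0.0, 1.0]);
    println!("{:?}", codes); // 1 byte per dimension instead of 4
    println!("{:?}", dequantize_sq8(&codes, min, max)); // approximate originals
}
```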
Segment Files
Each vector index type stores its data in a single segment file:
| Index Type | File Extension | Contents |
|---|---|---|
| HNSW | .hnsw | Graph structure, vectors, and metadata |
| Flat | .flat | Raw vectors and metadata |
| IVF | .ivf | Cluster centroids, assigned vectors, and metadata |
Code Example
use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::vector::HnswOption;
use laurus::vector::core::distance::DistanceMetric;
use laurus::storage::memory::MemoryStorage;
#[tokio::main]
async fn main() -> laurus::Result<()> {
let storage = Arc::new(MemoryStorage::new(Default::default()));
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_hnsw_field("embedding", HnswOption {
dimension: 384,
distance: DistanceMetric::Cosine,
m: 16,
ef_construction: 200,
..Default::default()
})
.build();
// With an embedder, text in vector fields is automatically embedded.
// `my_embedder` is any Arc<dyn Embedder>, e.g. a CandleBertEmbedder.
let engine = Engine::builder(storage, schema)
.embedder(my_embedder)
.build()
.await?;
// Add text to the vector field — it will be embedded automatically
engine.add_document("doc-1", Document::builder()
.add_text("title", "Rust Programming")
.add_text("embedding", "Rust is a systems programming language.")
.build()
).await?;
engine.commit().await?;
Ok(())
}
Next Steps
- Search the vector index: Vector Search
- Combine with lexical search: Hybrid Search
Search
This section covers how to query your indexed data. Laurus supports three search modes that can be used independently or combined.
Topics
Lexical Search
Keyword-based search using an inverted index. Covers:
- All query types: Term, Phrase, Boolean, Fuzzy, Wildcard, Range, Geo, Span
- BM25 scoring and field boosts
- Using the Query DSL for text-based queries
Vector Search
Semantic similarity search using vector embeddings. Covers:
- VectorSearchRequestBuilder API
- Multi-field vector search and score modes
- Filtered vector search
Hybrid Search
Combining lexical and vector search for best-of-both-worlds results. Covers:
- SearchRequestBuilder API
- Fusion algorithms (RRF, WeightedSum)
- Filtered hybrid search
- Pagination with offset/limit
Spelling Correction
Suggest corrections for misspelled query terms. Covers:
- SpellingCorrector and “Did you mean?” features
- Custom dictionaries and configuration
- Learning from index terms and user queries
Lexical Search
Lexical search finds documents by matching keywords against an inverted index. Laurus provides a rich set of query types that cover exact matching, phrase matching, fuzzy matching, and more.
Basic Usage
#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest};
use laurus::lexical::TermQuery;
let request = SearchRequestBuilder::new()
.lexical_search_request(
LexicalSearchRequest::new(
Box::new(TermQuery::new("body", "rust"))
)
)
.limit(10)
.build();
let results = engine.search(request).await?;
}
Query Types
TermQuery
Matches documents containing an exact term in a specific field.
#![allow(unused)]
fn main() {
use laurus::lexical::TermQuery;
// Find documents where "body" contains the term "rust"
let query = TermQuery::new("body", "rust");
}
Note: Terms are matched after analysis. If the field uses `StandardAnalyzer`, both the indexed text and the query term are lowercased, so `TermQuery::new("body", "rust")` will match “Rust” in the original text.
PhraseQuery
Matches documents containing an exact sequence of terms.
#![allow(unused)]
fn main() {
use laurus::lexical::query::phrase::PhraseQuery;
// Find documents containing the exact phrase "machine learning"
let query = PhraseQuery::new("body", vec!["machine".to_string(), "learning".to_string()]);
// Or use the convenience method from a phrase string:
let query = PhraseQuery::from_phrase("body", "machine learning");
}
Phrase queries require term positions to be stored (the default for TextOption).
BooleanQuery
Combines multiple queries with boolean logic.
#![allow(unused)]
fn main() {
use laurus::lexical::query::boolean::{BooleanQuery, BooleanQueryBuilder, Occur};
let query = BooleanQueryBuilder::new()
.must(Box::new(TermQuery::new("body", "rust"))) // AND
.must(Box::new(TermQuery::new("body", "programming"))) // AND
.must_not(Box::new(TermQuery::new("body", "python"))) // NOT
.build();
}
| Occur | Meaning | DSL Equivalent |
|---|---|---|
| `Must` | Document MUST match | `+term` or `AND` |
| `Should` | Document SHOULD match (boosts score) | `term` or `OR` |
| `MustNot` | Document MUST NOT match | `-term` or `NOT` |
| `Filter` | MUST match, but does not affect score | (no DSL equivalent) |
FuzzyQuery
Matches terms within a specified edit distance (Levenshtein distance).
#![allow(unused)]
fn main() {
use laurus::lexical::query::fuzzy::FuzzyQuery;
// Find documents matching "programing" within edit distance 2
// This will match "programming", "programing", etc.
let query = FuzzyQuery::new("body", "programing"); // default max_edits = 2
}
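Edit distance here is Levenshtein distance: insertions, deletions, and substitutions each cost 1. A standalone implementation, for intuition about what a `max_edits` of 2 admits:

```rust
// Levenshtein edit distance via the classic rolling-row DP.
// Standalone sketch of the measure FuzzyQuery uses.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j]
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // one inserted 'm' -> distance 1, within the default max_edits = 2
    println!("{}", edit_distance("programing", "programming"));
}
```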
WildcardQuery
Matches terms using wildcard patterns.
#![allow(unused)]
fn main() {
use laurus::lexical::query::wildcard::WildcardQuery;
// '?' matches exactly one character, '*' matches zero or more
let query = WildcardQuery::new("filename", "*.pdf")?;
let query = WildcardQuery::new("body", "pro*")?;
let query = WildcardQuery::new("body", "col?r")?; // matches "color" but not "colour" ('?' is exactly one character)
}
PrefixQuery
Matches documents containing terms that start with a specific prefix.
#![allow(unused)]
fn main() {
use laurus::lexical::query::prefix::PrefixQuery;
// Find documents where "body" contains terms starting with "pro"
// This matches "programming", "program", "production", etc.
let query = PrefixQuery::new("body", "pro");
}
RegexpQuery
Matches documents containing terms that match a regular expression pattern.
#![allow(unused)]
fn main() {
use laurus::lexical::query::regexp::RegexpQuery;
// Find documents where "body" contains terms matching the regex
let query = RegexpQuery::new("body", "^pro.*ing$")?;
// Match version-like patterns
let query = RegexpQuery::new("version", r"^v\d+\.\d+")?;
}
Note: `RegexpQuery::new()` returns `Result` because the regex pattern is validated at construction time. Invalid patterns will produce an error.
NumericRangeQuery
Matches documents with numeric field values within a range.
#![allow(unused)]
fn main() {
use laurus::lexical::NumericRangeQuery;
use laurus::lexical::core::field::NumericType;
// Find documents where "price" is between 10.0 and 100.0 (inclusive)
let query = NumericRangeQuery::new(
"price",
NumericType::Float,
Some(10.0), // min
Some(100.0), // max
true, // include min
true, // include max
);
// Open-ended range: price >= 50
let query = NumericRangeQuery::new(
"price",
NumericType::Float,
Some(50.0),
None, // no upper bound
true,
false,
);
}
GeoQuery
Matches documents by geographic location.
#![allow(unused)]
fn main() {
use laurus::lexical::query::geo::GeoQuery;
// Find documents within 10km of Tokyo Station (35.6812, 139.7671)
let query = GeoQuery::within_radius("location", 35.6812, 139.7671, 10.0)?; // radius in kilometers
// Find documents within a bounding box (min_lat, min_lon, max_lat, max_lon)
let query = GeoQuery::within_bounding_box(
"location",
35.0, 139.0, // min (lat, lon)
36.0, 140.0, // max (lat, lon)
)?;
}
SpanQuery
Matches terms based on their proximity within a document. Use SpanTermQuery and SpanNearQuery to build proximity queries:
#![allow(unused)]
fn main() {
use laurus::lexical::query::span::{SpanQuery, SpanTermQuery, SpanNearQuery};
// Find documents where "quick" appears near "fox" (within 3 positions)
let query = SpanNearQuery::new(
"body",
vec![
Box::new(SpanTermQuery::new("body", "quick")) as Box<dyn SpanQuery>,
Box::new(SpanTermQuery::new("body", "fox")) as Box<dyn SpanQuery>,
],
3, // slop (max distance between terms)
true, // in_order (terms must appear in order)
);
}
Scoring
Lexical search results are scored using BM25. The score reflects how relevant a document is to the query:
- Higher term frequency in the document increases the score
- Rarer terms across the index increase the score
- Shorter documents are boosted relative to longer ones
Field Boosts
You can boost specific fields to influence relevance:
#![allow(unused)]
fn main() {
use laurus::LexicalSearchRequest;
let mut request = LexicalSearchRequest::new(Box::new(query));
request.field_boosts.insert("title".to_string(), 2.0); // title matches count double
request.field_boosts.insert("body".to_string(), 1.0);
}
LexicalSearchRequest Options
| Option | Default | Description |
|---|---|---|
| `query` | (required) | The query to execute |
| `limit` | 10 | Maximum number of results |
| `load_documents` | true | Whether to load full document content |
| `min_score` | 0.0 | Minimum score threshold |
| `timeout_ms` | None | Search timeout in milliseconds |
| `parallel` | false | Enable parallel search across segments |
| `sort_by` | Score | Sort by relevance score, or by a field (asc / desc) |
| `field_boosts` | empty | Per-field score multipliers |
Builder Methods
LexicalSearchRequest supports a builder-style API for setting options:
#![allow(unused)]
fn main() {
use laurus::LexicalSearchRequest;
use laurus::lexical::TermQuery;
let request = LexicalSearchRequest::new(Box::new(TermQuery::new("body", "rust")))
.limit(20)
.min_score(0.5)
.timeout_ms(5000)
.parallel(true)
.sort_by_field_desc("date")
.with_field_boost("title", 2.0)
.with_field_boost("body", 1.0);
}
Using the Query DSL
Instead of building queries programmatically, you can use the text-based Query DSL:
#![allow(unused)]
fn main() {
use laurus::lexical::QueryParser;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use std::sync::Arc;
let analyzer = Arc::new(StandardAnalyzer::default());
let parser = QueryParser::new(analyzer).with_default_field("body");
// Simple term
let query = parser.parse("rust")?;
// Boolean
let query = parser.parse("rust AND programming")?;
// Phrase
let query = parser.parse("\"machine learning\"")?;
// Field-specific
let query = parser.parse("title:rust AND body:programming")?;
// Fuzzy
let query = parser.parse("programing~2")?;
// Range
let query = parser.parse("year:[2020 TO 2024]")?;
}
See Query DSL for the complete syntax reference.
Next Steps
- Semantic similarity search: Vector Search
- Combine lexical + vector: Hybrid Search
- Full DSL syntax reference: Query DSL
Vector Search
Vector search finds documents by semantic similarity. Instead of matching keywords, it compares the meaning of the query against document embeddings in vector space.
Basic Usage
Builder API
#![allow(unused)]
fn main() {
use laurus::SearchRequestBuilder;
use laurus::vector::VectorSearchRequestBuilder;
let request = SearchRequestBuilder::new()
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("embedding", "systems programming language")
.limit(10)
.build()
)
.build();
let results = engine.search(request).await?;
}
The add_text() method stores the text as a query payload. At search time, the engine embeds it using the configured embedder and then searches the vector index.
Query DSL
#![allow(unused)]
fn main() {
use laurus::vector::VectorQueryParser;
let parser = VectorQueryParser::new(embedder.clone())
.with_default_field("embedding");
let request = parser.parse(r#"embedding:~"systems programming""#).await?;
}
VectorSearchRequestBuilder
The builder API provides fine-grained control:
#![allow(unused)]
fn main() {
use laurus::vector::VectorSearchRequestBuilder;
use laurus::vector::store::request::QueryVector;
let request = VectorSearchRequestBuilder::new()
// Text query (will be embedded at search time)
.add_text("text_vec", "machine learning")
// Or use a pre-computed vector directly
.add_vector("embedding", vec![0.1, 0.2, 0.3, /* ... */])
// Search parameters
.limit(20)
.build();
}
Methods
| Method | Description |
|---|---|
add_text(field, text) | Add a text query for a specific field (embedded at search time) |
add_vector(field, vector) | Add a pre-computed query vector for a specific field |
add_vector_with_weight(field, vector, weight) | Add a pre-computed vector with an explicit weight |
add_payload(field, payload) | Add a generic DataValue payload to be embedded |
add_bytes(field, bytes, mime) | Add a binary payload (e.g., image bytes for multimodal) |
field(name) | Restrict search to a specific field |
fields(names) | Restrict search to multiple fields |
limit(n) | Maximum number of results (default: 10) |
score_mode(VectorScoreMode) | Score combination mode (WeightedSum, MaxSim, LateInteraction) |
min_score(f32) | Minimum score threshold (default: 0.0) |
overfetch(f32) | Overfetch factor for better result quality (default: 1.0) |
build() | Build the VectorSearchRequest |
Multi-Field Vector Search
You can search across multiple vector fields in a single request:
#![allow(unused)]
fn main() {
let request = VectorSearchRequestBuilder::new()
.add_text("text_vec", "cute kitten")
.add_text("image_vec", "fluffy cat")
.build();
}
Each clause produces a vector that is searched against its respective field. Results are combined using the configured score mode.
Score Modes
| Mode | Description |
|---|---|
WeightedSum (default) | Sum of (similarity * weight) across all clauses |
MaxSim | Maximum similarity score across clauses |
LateInteraction | ColBERT-style late interaction scoring |
Weights
Use the ^ boost syntax in DSL or weight in QueryVector to adjust how much each field contributes:
text_vec:~"cute kitten"^1.0 image_vec:~"fluffy cat"^0.5
This means text similarity counts twice as much as image similarity.
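Under the default WeightedSum score mode, those boosts reduce to a plain weighted sum of clause similarities; a minimal sketch:

```rust
// WeightedSum across clauses: each clause's similarity is scaled by its
// weight (the ^ boost) and the results are summed.
fn weighted_sum(clauses: &[(f32, f32)]) -> f32 {
    clauses.iter().map(|(sim, w)| sim * w).sum()
}

fn main() {
    // text_vec similarity 0.9 at weight 1.0, image_vec similarity 0.9 at weight 0.5
    let score = weighted_sum(&[(0.9, 1.0), (0.9, 0.5)]);
    assert!((score - 1.35).abs() < 1e-6);
}
```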
Filtered Vector Search
You can apply lexical filters to narrow the vector search results:
#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest};
use laurus::lexical::TermQuery;
use laurus::vector::VectorSearchRequestBuilder;
// Vector search with a category filter
let request = SearchRequestBuilder::new()
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("embedding", "machine learning")
.build()
)
.filter_query(Box::new(TermQuery::new("category", "tutorial")))
.limit(10)
.build();
let results = engine.search(request).await?;
}
The filter query runs first on the lexical index to identify allowed document IDs, then the vector search is restricted to those IDs.
Filter with Numeric Range
#![allow(unused)]
fn main() {
use laurus::lexical::NumericRangeQuery;
use laurus::lexical::core::field::NumericType;
let request = SearchRequestBuilder::new()
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("embedding", "type systems")
.build()
)
.filter_query(Box::new(NumericRangeQuery::new(
"year", NumericType::Integer,
Some(2020.0), Some(2024.0), true, true
)))
.limit(10)
.build();
}
Distance Metrics
The distance metric is configured per field in the schema (see Vector Indexing):
| Metric | Description | Lower = More Similar |
|---|---|---|
| Cosine | 1 - cosine similarity | Yes |
| Euclidean | L2 distance | Yes |
| Manhattan | L1 distance | Yes |
| DotProduct | Negative inner product | Yes |
| Angular | Angular distance | Yes |
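All five metrics are oriented so that lower means more similar. For reference, here are standalone implementations of two of them as described in the table (Cosine as 1 - cosine similarity, DotProduct as the negated inner product):

```rust
// Cosine distance: 1 - cosine similarity, so identical directions give 0.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

// DotProduct distance: negated inner product, so larger inner products
// (more similar vectors) yield smaller distances.
fn neg_dot_product(a: &[f32], b: &[f32]) -> f32 {
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let (a, b) = ([1.0f32, 0.0], [1.0f32, 0.0]);
    assert!(cosine_distance(&a, &b).abs() < 1e-6); // identical => distance 0
    assert!(neg_dot_product(&a, &b) < neg_dot_product(&a, &[0.5, 0.0]));
}
```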
Code Example: Complete Vector Search
use std::sync::Arc;
use laurus::{Document, Engine, Schema, SearchRequestBuilder, PerFieldEmbedder};
use laurus::lexical::TextOption;
use laurus::vector::HnswOption;
use laurus::vector::VectorSearchRequestBuilder;
use laurus::storage::memory::MemoryStorage;
#[tokio::main]
async fn main() -> laurus::Result<()> {
let storage = Arc::new(MemoryStorage::new(Default::default()));
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_hnsw_field("text_vec", HnswOption {
dimension: 384,
..Default::default()
})
.build();
// Set up per-field embedder (my_embedder is a placeholder for your embedder; see Embeddings)
let embedder = Arc::new(my_embedder);
let mut pfe = PerFieldEmbedder::new(embedder.clone());
pfe.add_embedder("text_vec", embedder.clone());
let engine = Engine::builder(storage, schema)
.embedder(Arc::new(pfe))
.build()
.await?;
// Index documents (text in vector field is auto-embedded)
engine.add_document("doc-1", Document::builder()
.add_text("title", "Rust Programming")
.add_text("text_vec", "Rust is a systems programming language.")
.build()
).await?;
engine.commit().await?;
// Search by semantic similarity
let results = engine.search(
SearchRequestBuilder::new()
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("text_vec", "systems language")
.build()
)
.limit(5)
.build()
).await?;
for r in &results {
println!("{}: score={:.4}", r.id, r.score);
}
Ok(())
}
Next Steps
- Combine with keyword search: Hybrid Search
- DSL syntax for vector queries: Query DSL
Hybrid Search
Hybrid search combines lexical search (keyword matching) with vector search (semantic similarity) to deliver results that are both precise and semantically relevant. This is Laurus’s most powerful search mode.
Why Hybrid Search?
| Search Type | Strengths | Weaknesses |
|---|---|---|
| Lexical only | Exact keyword matching, handles rare terms well | Misses synonyms and paraphrases |
| Vector only | Understands meaning, handles synonyms | May miss exact keywords, less precise |
| Hybrid | Best of both worlds | Slightly more complex to configure |
How It Works
sequenceDiagram
participant User
participant Engine
participant Lexical as LexicalStore
participant Vector as VectorStore
participant Fusion
User->>Engine: SearchRequest\n(lexical + vector)
par Execute in parallel
Engine->>Lexical: BM25 keyword search
Lexical-->>Engine: Ranked hits (by relevance)
and
Engine->>Vector: ANN similarity search
Vector-->>Engine: Ranked hits (by distance)
end
Engine->>Fusion: Merge two result sets
Note over Fusion: RRF or WeightedSum
Fusion-->>Engine: Unified ranked list
Engine-->>User: Vec of SearchResult
Basic Usage
Builder API
#![allow(unused)]
fn main() {
use laurus::{SearchRequestBuilder, LexicalSearchRequest, FusionAlgorithm};
use laurus::lexical::TermQuery;
use laurus::vector::VectorSearchRequestBuilder;
let request = SearchRequestBuilder::new()
// Lexical component
.lexical_search_request(
LexicalSearchRequest::new(
Box::new(TermQuery::new("body", "rust"))
)
)
// Vector component
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("text_vec", "systems programming")
.build()
)
// Fusion algorithm
.fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
.limit(10)
.build();
let results = engine.search(request).await?;
}
Query DSL
Mix lexical and vector clauses in a single query string:
#![allow(unused)]
fn main() {
use laurus::UnifiedQueryParser;
use laurus::lexical::QueryParser;
use laurus::vector::VectorQueryParser;
let unified = UnifiedQueryParser::new(
QueryParser::new(analyzer).with_default_field("body"),
VectorQueryParser::new(embedder),
);
// Lexical + vector in one query
let request = unified.parse(r#"body:rust text_vec:~"systems programming""#).await?;
let results = engine.search(request).await?;
}
The ~"..." syntax identifies vector clauses. Everything else is parsed as lexical.
Fusion Algorithms
When both lexical and vector results exist, they must be merged into a single ranked list. Laurus supports two fusion algorithms:
RRF (Reciprocal Rank Fusion)
The default algorithm. Combines results based on their rank positions rather than raw scores.
score(doc) = sum( 1 / (k + rank_i) )
Where rank_i is the position of the document in each result list, and k is a smoothing parameter (default 60).
#![allow(unused)]
fn main() {
use laurus::FusionAlgorithm;
let fusion = FusionAlgorithm::RRF { k: 60.0 };
}
Advantages:
- Robust to different score distributions between lexical and vector results
- No need to tune weights
- Works well out of the box
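A self-contained sketch of RRF over two ranked ID lists (ranks are counted from 1 here; whether Laurus counts from 0 or 1 is not documented in this section):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document.
// k = 60.0 matches the Laurus default.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in lists {
        for (rank, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut out: Vec<_> = scores.into_iter().map(|(id, s)| (id.to_string(), s)).collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    let lexical = vec!["doc1", "doc2", "doc3"];
    let vector = vec!["doc1", "doc2", "doc4"];
    let fused = rrf(&[lexical, vector], 60.0);
    // doc1 is ranked first in both lists, so it tops the fused list.
    assert_eq!(fused[0].0, "doc1");
}
```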
WeightedSum
Linearly combines normalized lexical and vector scores:
score(doc) = lexical_weight * lexical_score + vector_weight * vector_score
#![allow(unused)]
fn main() {
use laurus::FusionAlgorithm;
let fusion = FusionAlgorithm::WeightedSum {
lexical_weight: 0.3,
vector_weight: 0.7,
};
}
When to use:
- When you want explicit control over the balance between lexical and vector relevance
- When you know one signal is more important than the other
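WeightedSum assumes the two score ranges are comparable, so some normalization happens before blending. The sketch below uses min-max normalization as an illustrative assumption; it is not necessarily Laurus's documented scheme:

```rust
// Min-max normalize a raw score into [0, 1] given the observed range.
fn minmax(s: f32, lo: f32, hi: f32) -> f32 {
    if hi > lo { (s - lo) / (hi - lo) } else { 1.0 }
}

// Linear blend of normalized lexical and vector scores.
fn fuse(lexical: f32, vector: f32, lw: f32, vw: f32) -> f32 {
    lw * lexical + vw * vector
}

fn main() {
    let lex = minmax(8.0, 2.0, 10.0); // raw BM25 score 8 in [2, 10] => 0.75
    let vec_ = minmax(0.9, 0.1, 0.9); // similarity 0.9 in [0.1, 0.9] => 1.0
    let score = fuse(lex, vec_, 0.3, 0.7);
    assert!((score - 0.925).abs() < 1e-6);
}
```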
SearchRequest Options
| Option | Default | Description |
|---|---|---|
lexical_search_request | None | Lexical query component |
vector_search_request | None | Vector query component |
filter_query | None | Pre-filter using a lexical query (restricts both lexical and vector results) |
fusion_algorithm | None (uses RRF { k: 60.0 } when both results exist) | How to merge lexical and vector results |
limit | 10 | Maximum number of results to return |
offset | 0 | Number of results to skip (for pagination) |
SearchResult
Each result contains:
| Field | Type | Description |
|---|---|---|
id | String | External document ID |
score | f32 | Fused relevance score |
document | Option<Document> | Full document content (if loaded) |
Filtered Hybrid Search
Apply a filter to restrict both lexical and vector results:
#![allow(unused)]
fn main() {
let request = SearchRequestBuilder::new()
.lexical_search_request(
LexicalSearchRequest::new(Box::new(TermQuery::new("body", "rust")))
)
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("text_vec", "systems programming")
.build()
)
// Only search within "tutorial" category
.filter_query(Box::new(TermQuery::new("category", "tutorial")))
.fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
.limit(10)
.build();
}
How Filtering Works
- The filter query runs on the lexical index to produce a set of allowed document IDs
- For lexical search: the filter is combined with the user query as a boolean AND
- For vector search: the allowed IDs are passed to restrict the ANN search
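The restriction in step 3 amounts to an ID-set membership test. A minimal sketch follows; note that real ANN indexes typically apply the allowed-ID mask during graph traversal rather than post-filtering as shown here for clarity:

```rust
use std::collections::HashSet;

// Keep only ANN candidates whose ID is in the allowed set produced by
// the lexical filter query.
fn restrict<'a>(
    ann_hits: &'a [(&'a str, f32)],
    allowed: &HashSet<&str>,
) -> Vec<(&'a str, f32)> {
    ann_hits
        .iter()
        .filter(|(id, _)| allowed.contains(id))
        .copied()
        .collect()
}

fn main() {
    let allowed: HashSet<&str> = HashSet::from(["doc1", "doc3"]);
    let hits = [("doc1", 0.95), ("doc2", 0.90), ("doc3", 0.80)];
    let kept = restrict(&hits, &allowed);
    assert_eq!(kept.len(), 2);
    assert_eq!(kept[0].0, "doc1");
}
```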
Pagination
Use offset and limit for pagination:
#![allow(unused)]
fn main() {
// Page 1: results 0-9
let page1 = SearchRequestBuilder::new()
.lexical_search_request(/* ... */)
.vector_search_request(/* ... */)
.offset(0)
.limit(10)
.build();
// Page 2: results 10-19
let page2 = SearchRequestBuilder::new()
.lexical_search_request(/* ... */)
.vector_search_request(/* ... */)
.offset(10)
.limit(10)
.build();
}
Complete Example
use std::sync::Arc;
use laurus::{
Document, Engine, Schema, SearchRequestBuilder,
LexicalSearchRequest, FusionAlgorithm, PerFieldEmbedder,
};
use laurus::lexical::{TextOption, TermQuery};
use laurus::lexical::core::field::IntegerOption;
use laurus::vector::{HnswOption, VectorSearchRequestBuilder};
use laurus::storage::memory::MemoryStorage;
#[tokio::main]
async fn main() -> laurus::Result<()> {
let storage = Arc::new(MemoryStorage::new(Default::default()));
// Schema with both lexical and vector fields
let schema = Schema::builder()
.add_text_field("title", TextOption::default())
.add_text_field("body", TextOption::default())
.add_text_field("category", TextOption::default())
.add_integer_field("year", IntegerOption::default())
.add_hnsw_field("body_vec", HnswOption {
dimension: 384,
..Default::default()
})
.build();
// Configure analyzer and embedder (see Text Analysis and Embeddings docs)
// let analyzer = Arc::new(StandardAnalyzer::new()?);
// let embedder = Arc::new(CandleBertEmbedder::new("sentence-transformers/all-MiniLM-L6-v2")?);
let engine = Engine::builder(storage, schema)
// .analyzer(analyzer)
// .embedder(embedder)
.build()
.await?;
// Index documents with both text and vector fields
engine.add_document("doc-1", Document::builder()
.add_text("title", "Rust Programming Guide")
.add_text("body", "Rust is a systems programming language.")
.add_text("category", "programming")
.add_integer("year", 2024)
.add_text("body_vec", "Rust is a systems programming language.")
.build()
).await?;
engine.commit().await?;
// Hybrid search: keyword "rust" + semantic "systems language"
let results = engine.search(
SearchRequestBuilder::new()
.lexical_search_request(
LexicalSearchRequest::new(Box::new(TermQuery::new("body", "rust")))
)
.vector_search_request(
VectorSearchRequestBuilder::new()
.add_text("body_vec", "systems language")
.build()
)
.fusion_algorithm(FusionAlgorithm::RRF { k: 60.0 })
.limit(10)
.build()
).await?;
for r in &results {
println!("{}: score={:.4}", r.id, r.score);
}
Ok(())
}
Next Steps
- Full query syntax reference: Query DSL
- Understand ID resolution: ID Management
- Data durability: Persistence & WAL
Spelling Correction
Laurus includes a built-in spelling correction system that can suggest corrections for misspelled query terms and provide “Did you mean?” functionality.
Overview
The spelling corrector uses edit distance (Levenshtein distance) combined with word frequency data to suggest corrections. It supports:
- Word-level suggestions — correct individual misspelled words
- Auto-correction — automatically apply high-confidence corrections
- “Did you mean?” — suggest alternative queries to the user
- Query learning — improve suggestions by learning from user queries
- Custom dictionaries — use your own word lists
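The Levenshtein distance underpinning all of this is standard; here is a self-contained implementation, exercised on the misspellings used later in this section:

```rust
// Classic dynamic-programming Levenshtein edit distance: the minimum number
// of insertions, deletions, and substitutions to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("programing", "programming"), 1); // one missing 'm'
    assert_eq!(levenshtein("langauge", "language"), 2);      // transposed pair = 2 edits
    assert_eq!(levenshtein("rust", "rust"), 0);
}
```

Both example typos fall within the default max_distance of 2, so both would be correctable.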
Basic Usage
SpellingCorrector
#![allow(unused)]
fn main() {
use laurus::spelling::corrector::SpellingCorrector;
// Create a corrector with the built-in English dictionary
let mut corrector = SpellingCorrector::new();
// Correct a query
let result = corrector.correct("programing langauge");
// Check if suggestions are available
if result.has_suggestions() {
for (word, suggestions) in &result.word_suggestions {
println!("'{}' -> {:?}", word, suggestions);
}
}
// Get the best corrected query
if let Some(corrected) = result.query() {
println!("Corrected: {}", corrected);
}
}
“Did You Mean?”
The DidYouMean wrapper provides a higher-level interface for search UIs:
#![allow(unused)]
fn main() {
use laurus::spelling::corrector::{SpellingCorrector, DidYouMean};
let corrector = SpellingCorrector::new();
let mut did_you_mean = DidYouMean::new(corrector);
if let Some(suggestion) = did_you_mean.suggest("programing") {
println!("Did you mean: {}?", suggestion);
}
}
Configuration
Use CorrectorConfig to customize behavior:
#![allow(unused)]
fn main() {
use laurus::spelling::corrector::{CorrectorConfig, SpellingCorrector};
let config = CorrectorConfig {
max_distance: 2, // Maximum edit distance (default: 2)
max_suggestions: 5, // Max suggestions per word (default: 5)
min_frequency: 1, // Minimum word frequency threshold (default: 1)
auto_correct: false, // Enable auto-correction (default: false)
auto_correct_threshold: 0.8, // Confidence threshold for auto-correction (default: 0.8)
use_index_terms: true, // Use indexed terms as dictionary (default: true)
learn_from_queries: true, // Learn from user queries (default: true)
};
}
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
max_distance | usize | 2 | Maximum Levenshtein edit distance for candidate suggestions |
max_suggestions | usize | 5 | Maximum number of suggestions returned per word |
min_frequency | u32 | 1 | Minimum frequency a word must have in the dictionary to be suggested |
auto_correct | bool | false | When true, automatically apply corrections above the threshold |
auto_correct_threshold | f64 | 0.8 | Confidence score (0.0–1.0) required for auto-correction |
use_index_terms | bool | true | Use terms from the search index as dictionary words |
learn_from_queries | bool | true | Learn new words from user search queries |
CorrectionResult
The correct() method returns a CorrectionResult with detailed information:
| Field | Type | Description |
|---|---|---|
original | String | The original query string |
corrected | Option<String> | The corrected query (if auto-correction was applied) |
word_suggestions | HashMap<String, Vec<Suggestion>> | Suggestions grouped by misspelled word |
confidence | f64 | Overall confidence score (0.0–1.0) |
auto_corrected | bool | Whether auto-correction was applied |
Helper Methods
| Method | Returns | Description |
|---|---|---|
has_suggestions() | bool | True if any word has suggestions |
best_suggestion() | Option<&Suggestion> | The single highest-scoring suggestion |
query() | Option<String> | The corrected query string, if corrections were made |
should_show_did_you_mean() | bool | Whether to display a “Did you mean?” prompt |
Custom Dictionaries
You can provide your own dictionary instead of using the built-in English one:
#![allow(unused)]
fn main() {
use laurus::spelling::corrector::SpellingCorrector;
use laurus::spelling::dictionary::SpellingDictionary;
// Build a custom dictionary
let mut dictionary = SpellingDictionary::new();
dictionary.add_word("elasticsearch", 100);
dictionary.add_word("lucene", 80);
dictionary.add_word("laurus", 90);
let corrector = SpellingCorrector::with_dictionary(dictionary);
}
Learning from Index Terms
When use_index_terms is enabled, the corrector can learn from terms in your search index:
#![allow(unused)]
fn main() {
let mut corrector = SpellingCorrector::new();
// Feed index terms to the corrector
let index_terms = vec!["rust", "programming", "search", "engine"];
corrector.learn_from_terms(&index_terms);
}
This improves suggestion quality by incorporating domain-specific vocabulary.
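One plausible way to combine the two documented signals is to rank candidates by edit distance first and corpus frequency second. The exact ordering Laurus uses is an assumption here; this sketch only illustrates why frequency data matters as a tiebreaker:

```rust
// Hypothetical candidate ranking: (word, edit_distance, frequency) tuples
// sorted by smaller distance, then by higher frequency.
fn rank(mut candidates: Vec<(&str, usize, u32)>) -> Vec<&str> {
    candidates.sort_by(|a, b| a.1.cmp(&b.1).then(b.2.cmp(&a.2)));
    candidates.into_iter().map(|(w, _, _)| w).collect()
}

fn main() {
    let ranked = rank(vec![
        ("programming", 1, 500),
        ("programs", 2, 900),
        ("programmings", 2, 3),
    ]);
    // Closest edit wins; frequency breaks the tie between the distance-2 words.
    assert_eq!(ranked, vec!["programming", "programs", "programmings"]);
}
```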
Statistics
Monitor the corrector’s state with stats():
#![allow(unused)]
fn main() {
let stats = corrector.stats();
println!("Dictionary words: {}", stats.dictionary_words);
println!("Total frequency: {}", stats.dictionary_total_frequency);
println!("Learned queries: {}", stats.queries_learned);
}
Next Steps
- Lexical Search — full-text search with query types
- Query DSL — human-readable query syntax
CLI (Command-Line Interface)
Laurus provides a command-line tool laurus that lets you create indexes, manage documents, and run search queries without writing code.
Features
- Index management — Create and inspect indexes from TOML schema files, with an interactive schema generator
- Document CRUD — Add, retrieve, and delete documents via JSON
- Search — Execute queries using the Query DSL
- Dual output — Human-readable tables or machine-parseable JSON
- Interactive REPL — Explore your index in a live session
- gRPC server — Start a gRPC server with laurus serve
Getting Started
# Install
cargo install laurus-cli
# Generate a schema interactively
laurus create schema
# Create an index from the schema
laurus --data-dir ./my_index create index --schema schema.toml
# Add a document
laurus --data-dir ./my_index add doc --id doc1 --data '{"title":"Hello","body":"World"}'
# Commit changes
laurus --data-dir ./my_index commit
# Search
laurus --data-dir ./my_index search "body:world"
See the sub-sections for detailed documentation:
- Installation — How to install the CLI
- Commands — Full command reference
- Schema Format — Schema TOML format reference
- REPL — Interactive mode
Installation
From crates.io
cargo install laurus-cli
This installs the laurus binary to ~/.cargo/bin/.
From source
git clone https://github.com/mosuka/laurus.git
cd laurus
cargo install --path laurus-cli
Verify
laurus --version
Shell Completion
Dedicated shell completion scripts are not yet shipped. The CLI is built on clap, so completions can be generated with clap_complete in a future release; until then, use laurus --help to explore commands.
Command Reference
Global Options
Every command accepts these options:
| Option | Environment Variable | Default | Description |
|---|---|---|---|
--data-dir <PATH> | LAURUS_DATA_DIR | ./laurus_data | Path to the index data directory |
--format <FORMAT> | — | table | Output format: table or json |
# Example: use JSON output with a custom data directory
laurus --data-dir /var/data/my_index --format json search "title:rust"
create — Create a Resource
create index
Create a new index from a schema TOML file.
laurus create index --schema <FILE>
Arguments:
| Flag | Required | Description |
|---|---|---|
--schema <FILE> | Yes | Path to a TOML file defining the index schema |
Schema file format:
The schema file follows the same structure as the Schema type in the Laurus library. See Schema Format Reference for full details. Example:
default_fields = ["title", "body"]
[fields.title.Text]
stored = true
indexed = true
[fields.body.Text]
stored = true
indexed = true
[fields.category.Text]
stored = true
indexed = true
Example:
laurus --data-dir ./my_index create index --schema schema.toml
# Index created at ./my_index.
Note: An error is returned if the index already exists. Delete the data directory to recreate it.
create schema
Interactively generate a schema TOML file through a guided wizard.
laurus create schema [--output <FILE>]
Arguments:
| Flag | Required | Default | Description |
|---|---|---|---|
--output <FILE> | No | schema.toml | Output file path for the generated schema |
The wizard guides you through:
- Field definition — Enter a field name, select the type, and configure type-specific options
- Repeat — Add as many fields as needed
- Default fields — Select which lexical fields to use as default search fields
- Preview — Review the generated TOML before saving
- Save — Write the schema file
Supported field types:
| Type | Category | Options |
|---|---|---|
Text | Lexical | indexed, stored, term_vectors |
Integer | Lexical | indexed, stored |
Float | Lexical | indexed, stored |
Boolean | Lexical | indexed, stored |
DateTime | Lexical | indexed, stored |
Geo | Lexical | indexed, stored |
Bytes | Lexical | stored |
Hnsw | Vector | dimension, distance, m, ef_construction |
Flat | Vector | dimension, distance |
Ivf | Vector | dimension, distance, n_clusters, n_probe |
Example:
# Generate schema.toml interactively
laurus create schema
# Specify output path
laurus create schema --output my_schema.toml
# Then create an index from the generated schema
laurus create index --schema schema.toml
get — Get a Resource
get index
Display statistics about the index.
laurus get index
Table output example:
Document count: 42
Vector fields:
╭──────────┬─────────┬───────────╮
│ Field │ Vectors │ Dimension │
├──────────┼─────────┼───────────┤
│ text_vec │ 42 │ 384 │
╰──────────┴─────────┴───────────╯
JSON output example:
laurus --format json get index
{
"document_count": 42,
"fields": {
"text_vec": {
"vector_count": 42,
"dimension": 384
}
}
}
get doc
Retrieve a document (and all its chunks) by external ID.
laurus get doc --id <ID>
Table output example:
╭──────┬─────────────────────────────────────────╮
│ ID │ Fields │
├──────┼─────────────────────────────────────────┤
│ doc1 │ body: This is a test, title: Hello World │
╰──────┴─────────────────────────────────────────╯
JSON output example:
laurus --format json get doc --id doc1
[
{
"id": "doc1",
"document": {
"title": "Hello World",
"body": "This is a test document."
}
}
]
add — Add a Resource
add doc
Add a document to the index. Documents are not searchable until commit is called.
laurus add doc --id <ID> --data <JSON>
Arguments:
| Flag | Required | Description |
|---|---|---|
--id <ID> | Yes | External document ID (string) |
--data <JSON> | Yes | Document fields as a JSON string |
The JSON format is a flat object mapping field names to values:
{
"title": "Introduction to Rust",
"body": "Rust is a systems programming language.",
"category": "programming"
}
Example:
laurus add doc --id doc1 --data '{"title":"Hello World","body":"This is a test document."}'
# Document 'doc1' added. Run 'commit' to persist changes.
Tip: Multiple documents can share the same external ID (chunking pattern). Use add doc for each chunk.
delete — Delete a Resource
delete doc
Delete a document (and all its chunks) by external ID.
laurus delete doc --id <ID>
Example:
laurus delete doc --id doc1
# Document 'doc1' deleted. Run 'commit' to persist changes.
commit
Commit pending changes (additions and deletions) to the index. Until committed, changes are not visible to search.
laurus commit
Example:
laurus --data-dir ./my_index commit
# Changes committed successfully.
search
Execute a search query using the Query DSL.
laurus search <QUERY> [--limit <N>] [--offset <N>]
Arguments:
| Argument / Flag | Required | Default | Description |
|---|---|---|---|
<QUERY> | Yes | — | Query string in Laurus Query DSL |
--limit <N> | No | 10 | Maximum number of results |
--offset <N> | No | 0 | Number of results to skip |
Query syntax examples:
# Term query
laurus search "body:rust"
# Phrase query
laurus search 'body:"machine learning"'
# Boolean query
laurus search "+body:programming -body:python"
# Fuzzy query (typo tolerance)
laurus search "body:programing~2"
# Wildcard query
laurus search "title:intro*"
# Range query
laurus search "price:[10 TO 50]"
Table output example:
╭──────┬────────┬─────────────────────────────────────────╮
│ ID │ Score │ Fields │
├──────┼────────┼─────────────────────────────────────────┤
│ doc1 │ 0.8532 │ body: Rust is a systems..., title: Intr │
│ doc3 │ 0.4210 │ body: JavaScript powers..., title: Web │
╰──────┴────────┴─────────────────────────────────────────╯
JSON output example:
laurus --format json search "body:rust" --limit 5
[
{
"id": "doc1",
"score": 0.8532,
"document": {
"title": "Introduction to Rust",
"body": "Rust is a systems programming language."
}
}
]
repl
Start an interactive REPL session. See REPL for details.
laurus repl
serve
Start the gRPC server. See gRPC Server for full documentation.
laurus serve [OPTIONS]
Options:
| Option | Short | Env Variable | Default | Description |
|---|---|---|---|---|
--config <PATH> | -c | LAURUS_CONFIG | — | Path to a TOML configuration file |
--host <HOST> | -H | LAURUS_HOST | 0.0.0.0 | Listen address |
--port <PORT> | -p | LAURUS_PORT | 50051 | Listen port |
--log-level <LEVEL> | -l | LAURUS_LOG_LEVEL | info | Log level (trace, debug, info, warn, error) |
Example:
# Start with defaults (port 50051)
laurus --data-dir ./my_index serve
# Custom port and log level
laurus serve --port 8080 --log-level debug
# Use a configuration file
laurus serve --config config.toml
# Use environment variables
LAURUS_DATA_DIR=./my_index LAURUS_PORT=8080 laurus serve
Schema Format Reference
The schema file defines the structure of your index — what fields exist, their types, and how they are indexed. Laurus uses TOML format for schema files.
Overview
A schema consists of two top-level elements:
# Fields to search by default when a query does not specify a field.
default_fields = ["title", "body"]
# Field definitions. Each field has a name and a typed configuration.
[fields.<field_name>.<FieldType>]
# ... type-specific options
- default_fields — A list of field names used as default search targets by the Query DSL. Only lexical fields (Text, Integer, Float, etc.) can be default fields. This key is optional and defaults to an empty list.
- fields — A map of field names to their typed configuration. Each field must specify exactly one field type.
Field Naming
- Field names are arbitrary strings (e.g., title, body_vec, created_at).
- The _id field is reserved by Laurus for internal document ID management — do not use it.
- Field names must be unique within a schema.
Field Types
Fields fall into two categories: Lexical (for keyword/full-text search) and Vector (for similarity search). A single field cannot be both.
Lexical Fields
Text
Full-text searchable field. Text is processed by the analysis pipeline (tokenization, normalization, stemming, etc.).
[fields.title.Text]
indexed = true # Whether to index this field for search
stored = true # Whether to store the original value for retrieval
term_vectors = false # Whether to store term positions (for phrase queries, highlighting)
| Option | Type | Default | Description |
|---|---|---|---|
indexed | bool | true | Enables searching this field |
stored | bool | true | Stores the original value so it can be returned in results |
term_vectors | bool | true | Stores term positions for phrase queries, highlighting, and more-like-this |
Integer
64-bit signed integer field. Supports range queries and exact match.
[fields.year.Integer]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
indexed | bool | true | Enables range and exact-match queries |
stored | bool | true | Stores the original value |
Float
64-bit floating point field. Supports range queries.
[fields.rating.Float]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
indexed | bool | true | Enables range queries |
stored | bool | true | Stores the original value |
Boolean
Boolean field (true / false).
[fields.published.Boolean]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
indexed | bool | true | Enables filtering by boolean value |
stored | bool | true | Stores the original value |
DateTime
UTC timestamp field. Supports range queries.
[fields.created_at.DateTime]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
indexed | bool | true | Enables range queries on date/time |
stored | bool | true | Stores the original value |
Geo
Geographic point field (latitude/longitude). Supports radius and bounding box queries.
[fields.location.Geo]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
indexed | bool | true | Enables geo queries (radius, bounding box) |
stored | bool | true | Stores the original value |
Bytes
Raw binary data field. Not indexed — stored only.
[fields.thumbnail.Bytes]
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
stored | bool | true | Stores the binary data |
Vector Fields
Vector fields are indexed for approximate nearest neighbor (ANN) search. They require a dimension (the length of each vector) and a distance metric.
Hnsw
Hierarchical Navigable Small World graph index. Best for most use cases — offers a good balance of speed and recall.
[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
dimension | integer | 128 | Vector dimensionality (must match your embedding model) |
distance | string | "Cosine" | Distance metric (see Distance Metrics) |
m | integer | 16 | Max bi-directional connections per node. Higher = better recall, more memory |
ef_construction | integer | 200 | Search width during index construction. Higher = better quality, slower build |
base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
quantizer | object | none | Optional quantization method (see Quantization) |
Tuning guidelines:
- m: 12–48 is typical. Use higher values for higher-dimensional vectors.
- ef_construction: 100–500. Higher values produce a better graph but increase build time.
- dimension: Must exactly match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base, 1536 for text-embedding-3-small).
Flat
Brute-force linear scan index. Provides exact results with no approximation. Best for small datasets (< 10,000 vectors).
[fields.embedding.Flat]
dimension = 384
distance = "Cosine"
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
dimension | integer | 128 | Vector dimensionality |
distance | string | "Cosine" | Distance metric (see Distance Metrics) |
base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
quantizer | object | none | Optional quantization method (see Quantization) |
Ivf
Inverted File Index. Clusters vectors and searches only a subset of clusters. Suitable for very large datasets.
[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 100
n_probe = 1
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
dimension | integer | (required) | Vector dimensionality |
distance | string | "Cosine" | Distance metric (see Distance Metrics) |
n_clusters | integer | 100 | Number of clusters. More clusters = finer partitioning |
n_probe | integer | 1 | Number of clusters to search at query time. Higher = better recall, slower |
base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
quantizer | object | none | Optional quantization method (see Quantization) |
Note: Unlike Hnsw and Flat, the `dimension` field in Ivf is required and has no default value.
Tuning guidelines:
- `n_clusters`: A common heuristic is `sqrt(N)`, where N is the total number of vectors.
- `n_probe`: Start with 1 and increase until recall is acceptable. Typical range is 1–20.
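For example, applying these heuristics to a hypothetical corpus of about 1,000,000 vectors (`sqrt(1_000_000)` = 1000 clusters, with `n_probe` raised from 1 until recall was acceptable):

```toml
# Hypothetical Ivf tuning for ~1,000,000 vectors of dimension 384.
[fields.embedding.Ivf]
dimension = 384     # required for Ivf — no default
distance = "Cosine"
n_clusters = 1000   # heuristic: sqrt(1_000_000)
n_probe = 10        # raised from 1 until recall was acceptable
```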
Distance Metrics
The distance option for vector fields accepts the following values:
| Value | Description | Use When |
|---|---|---|
"Cosine" | Cosine distance (1 - cosine similarity). Default. | Normalized text/image embeddings |
"Euclidean" | L2 (Euclidean) distance | Spatial data, non-normalized vectors |
"Manhattan" | L1 (Manhattan) distance | Sparse feature vectors |
"DotProduct" | Dot product (higher = more similar) | Pre-normalized vectors where magnitude matters |
"Angular" | Angular distance | Similar to cosine, but based on angle |
For most embedding models (BERT, Sentence Transformers, OpenAI, etc.), "Cosine" is the correct choice.
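As a point of reference, "Cosine" distance is simply 1 minus cosine similarity. A minimal sketch (illustrative only, not Laurus's internal implementation):

```rust
// Cosine distance = 1 - (a . b) / (|a| * |b|).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

fn main() {
    // Identical direction -> distance 0; orthogonal -> distance 1.
    assert!(cosine_distance(&[1.0, 0.0], &[2.0, 0.0]).abs() < 1e-6);
    assert!((cosine_distance(&[1.0, 0.0], &[0.0, 1.0]) - 1.0).abs() < 1e-6);
}
```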
Quantization
Vector fields optionally support quantization to reduce memory usage at the cost of some accuracy. Specify the quantizer option as a TOML table.
None (default)
No quantization — full precision 32-bit floats.
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
# quantizer is omitted (no quantization)
Scalar 8-bit
Compresses each float32 component to uint8 (~4x memory reduction).
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
quantizer = "Scalar8Bit"
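The idea behind scalar quantization can be sketched with a simple per-vector min/max scheme (an illustration of the concept — Laurus's exact encoding may differ):

```rust
// Map each f32 component onto 0..=255 using the vector's min/max range.
// One u8 per component gives the ~4x memory reduction noted above.
fn quantize(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|x| ((x - min) * scale).round() as u8).collect();
    (codes, min, max)
}

fn dequantize(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = (max - min) / 255.0;
    codes.iter().map(|&c| min + c as f32 * step).collect()
}

fn main() {
    let v = vec![-1.0f32, 0.0, 1.0];
    let (codes, min, max) = quantize(&v);
    let restored = dequantize(&codes, min, max);
    // Each component is recovered within one quantization step.
    for (a, b) in v.iter().zip(&restored) {
        assert!((a - b).abs() <= (max - min) / 255.0);
    }
}
```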
Product Quantization
Splits the vector into subvectors and quantizes each independently.
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
[fields.embedding.Hnsw.quantizer.ProductQuantization]
subvector_count = 48
| Option | Type | Description |
|---|---|---|
subvector_count | integer | Number of subvectors. Must evenly divide dimension. |
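With `dimension = 384` and `subvector_count = 48`, each subvector has 384 / 48 = 8 components. A sketch of just the split step (codebook training is omitted, and this is illustrative rather than Laurus's internal code):

```rust
// Split a vector into subvector_count equal-length slices.
fn split(v: &[f32], subvector_count: usize) -> Vec<&[f32]> {
    assert_eq!(
        v.len() % subvector_count,
        0,
        "subvector_count must evenly divide dimension"
    );
    let sub_dim = v.len() / subvector_count;
    v.chunks(sub_dim).collect()
}

fn main() {
    let v = vec![0.0f32; 384];
    let subs = split(&v, 48);
    assert_eq!(subs.len(), 48); // 48 subvectors...
    assert_eq!(subs[0].len(), 8); // ...of 8 components each
}
```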
Complete Examples
Full-text search only
A simple blog post index with lexical search:
default_fields = ["title", "body"]
[fields.title.Text]
indexed = true
stored = true
term_vectors = false
[fields.body.Text]
indexed = true
stored = true
term_vectors = false
[fields.category.Text]
indexed = true
stored = true
term_vectors = false
[fields.published_at.DateTime]
indexed = true
stored = true
Vector search only
A vector-only index for semantic similarity:
[fields.embedding.Hnsw]
dimension = 768
distance = "Cosine"
m = 16
ef_construction = 200
Hybrid search (lexical + vector)
Combine lexical and vector search for best-of-both-worlds retrieval:
default_fields = ["title", "body"]
[fields.title.Text]
indexed = true
stored = true
term_vectors = false
[fields.body.Text]
indexed = true
stored = true
term_vectors = true
[fields.category.Text]
indexed = true
stored = true
term_vectors = false
[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
Tip: A single field cannot be both lexical and vector. Use separate fields (e.g., `body` for text, `body_vec` for the embedding) and map them both to the same source content.
E-commerce product index
A more complex schema with mixed field types:
default_fields = ["name", "description"]
[fields.name.Text]
indexed = true
stored = true
term_vectors = false
[fields.description.Text]
indexed = true
stored = true
term_vectors = true
[fields.price.Float]
indexed = true
stored = true
[fields.in_stock.Boolean]
indexed = true
stored = true
[fields.created_at.DateTime]
indexed = true
stored = true
[fields.location.Geo]
indexed = true
stored = true
[fields.description_vec.Hnsw]
dimension = 384
distance = "Cosine"
Generating a Schema
You can generate a schema TOML file interactively using the CLI:
laurus create schema
laurus create schema --output my_schema.toml
See create schema for details.
Using a Schema
Once you have a schema file, create an index from it:
laurus create index --schema schema.toml
Or load it programmatically in Rust:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use laurus::Schema;
let toml_str = std::fs::read_to_string("schema.toml")?;
let schema: Schema = toml::from_str(&toml_str)?;
Ok(())
}
REPL (Interactive Mode)
The REPL provides an interactive session for exploring your index without typing the full laurus command each time.
Starting the REPL
laurus --data-dir ./my_index repl
Laurus REPL (type 'help' for commands, 'quit' to exit)
laurus>
The REPL opens the index at startup and keeps it loaded throughout the session.
Available Commands
| Command | Description |
|---|---|
search <query> [limit] | Search the index |
doc add <id> <json> | Add a document |
doc get <id> | Get a document by ID |
doc delete <id> | Delete a document by ID |
commit | Commit pending changes |
stats | Show index statistics |
help | Show available commands |
quit / exit | Exit the REPL |
Usage Examples
Searching
laurus> search body:rust
╭──────┬────────┬────────────────────────────────────╮
│ ID │ Score │ Fields │
├──────┼────────┼────────────────────────────────────┤
│ doc1 │ 0.8532 │ body: Rust is a systems..., title… │
╰──────┴────────┴────────────────────────────────────╯
Adding and Committing Documents
laurus> doc add doc4 {"title":"New Document","body":"Some content here."}
Document 'doc4' added.
laurus> commit
Changes committed.
Retrieving Documents
laurus> doc get doc4
╭──────┬───────────────────────────────────────────────╮
│ ID │ Fields │
├──────┼───────────────────────────────────────────────┤
│ doc4 │ body: Some content here., title: New Document │
╰──────┴───────────────────────────────────────────────╯
Deleting Documents
laurus> doc delete doc4
Document 'doc4' deleted.
laurus> commit
Changes committed.
Viewing Statistics
laurus> stats
Document count: 3
Features
- Line editing — Arrow keys, Home/End, and standard readline shortcuts
- History — Use Up/Down arrows to recall previous commands
- Ctrl+C / Ctrl+D — Exit the REPL gracefully
gRPC Server
Laurus includes a built-in gRPC server that keeps the search engine resident in memory, eliminating the per-command startup overhead of the CLI. This is the recommended way to run Laurus in production or when integrating with other services.
Features
- Persistent engine — The index stays open across requests; no WAL replay on every call
- Full gRPC API — Index management, document CRUD, commit, and search (unary + streaming)
- Health checking — Standard health check endpoint for load balancers and orchestrators
- Graceful shutdown — Pending changes are committed automatically on Ctrl+C / SIGINT
- TOML configuration — Optional config file with CLI override support
Quick Start
# Start the server with default settings
laurus serve
# Start with a custom data directory and port
laurus --data-dir ./my_index serve --port 8080
# Start with a configuration file
laurus serve --config config.toml
Sections
- Getting Started — Installation, startup options, and configuration
- gRPC API Reference — Full API documentation for all services and RPCs
Getting Started with the gRPC Server
Starting the Server
The gRPC server is started via the serve subcommand of the laurus CLI:
laurus serve [OPTIONS]
Options
| Option | Short | Env Variable | Default | Description |
|---|---|---|---|---|
--config <PATH> | -c | LAURUS_CONFIG | — | Path to a TOML configuration file |
--host <HOST> | -H | LAURUS_HOST | 0.0.0.0 | Listen address |
--port <PORT> | -p | LAURUS_PORT | 50051 | Listen port |
--http-port <PORT> | — | LAURUS_HTTP_PORT | — | HTTP Gateway port (enables HTTP gateway when set) |
--log-level <LEVEL> | -l | LAURUS_LOG_LEVEL | info | Log level (trace, debug, info, warn, error) |
The global --data-dir option (env: LAURUS_DATA_DIR) specifies the index data directory:
# Using CLI arguments
laurus --data-dir ./my_index serve --port 8080 --log-level debug
# Using environment variables
export LAURUS_DATA_DIR=./my_index
export LAURUS_PORT=8080
export LAURUS_LOG_LEVEL=debug
laurus serve
Startup Behavior
On startup, the server attempts to open an existing index at the configured data directory. If no index exists, the server starts without one — you can create an index later via the CreateIndex RPC.
Configuration File
You can use a TOML configuration file instead of (or in addition to) command-line options:
laurus serve --config config.toml
Format
[server]
host = "0.0.0.0"
port = 50051
http_port = 8080 # Optional: enables HTTP Gateway
[index]
data_dir = "./laurus_data"
[log]
level = "info"
Priority
Settings are resolved in the following order (highest priority first):
CLI arguments > Environment variables > Config file > Defaults
For example, if config.toml sets port = 50051, the environment variable LAURUS_PORT=4567 is set, and --port 1234 is passed on the command line:
LAURUS_PORT=4567 laurus serve --config config.toml --port 1234
# → Listens on port 1234 (CLI argument wins)
If the CLI argument is omitted:
LAURUS_PORT=4567 laurus serve --config config.toml
# → Listens on port 4567 (environment variable wins over config file)
Graceful Shutdown
When the server receives a shutdown signal (Ctrl+C / SIGINT), it automatically:
- Stops accepting new connections
- Commits any pending changes to the index
- Exits cleanly
HTTP Gateway
When http_port is set, an HTTP/JSON gateway starts alongside the gRPC server. The gateway proxies HTTP requests to the gRPC server internally:
User Request (HTTP/JSON) → gRPC Gateway (axum) → gRPC Server (tonic) → Engine
If http_port is omitted, only the gRPC server starts (default behavior).
Starting with HTTP Gateway
# Via CLI
laurus serve --http-port 8080
# Via config file (set http_port in [server] section)
laurus serve --config config.toml
# Via environment variable
LAURUS_HTTP_PORT=8080 laurus serve
HTTP API Endpoints
| Method | Path | gRPC Method |
|---|---|---|
| GET | /v1/health | HealthService/Check |
| POST | /v1/index | IndexService/CreateIndex |
| GET | /v1/index | IndexService/GetIndex |
| GET | /v1/schema | IndexService/GetSchema |
| PUT | /v1/documents/:id | DocumentService/PutDocument |
| POST | /v1/documents/:id | DocumentService/AddDocument |
| GET | /v1/documents/:id | DocumentService/GetDocuments |
| DELETE | /v1/documents/:id | DocumentService/DeleteDocuments |
| POST | /v1/commit | DocumentService/Commit |
| POST | /v1/search | SearchService/Search |
| POST | /v1/search/stream | SearchService/SearchStream (SSE) |
HTTP API Examples
# Health check
curl http://localhost:8080/v1/health
# Create an index
curl -X POST http://localhost:8080/v1/index \
-H 'Content-Type: application/json' \
-d '{
"schema": {
"fields": {
"title": {"text": {"indexed": true, "stored": true}},
"body": {"text": {"indexed": true, "stored": true}}
},
"default_fields": ["title", "body"]
}
}'
# Add a document
curl -X POST http://localhost:8080/v1/documents/doc1 \
-H 'Content-Type: application/json' \
-d '{
"document": {
"fields": {
"title": "Hello World",
"body": "This is a test document."
}
}
}'
# Commit
curl -X POST http://localhost:8080/v1/commit
# Search
curl -X POST http://localhost:8080/v1/search \
-H 'Content-Type: application/json' \
-d '{"query": "body:test", "limit": 10}'
# Streaming search (SSE)
curl -N -X POST http://localhost:8080/v1/search/stream \
-H 'Content-Type: application/json' \
-d '{"query": "body:test", "limit": 10}'
Connecting via gRPC
Any gRPC client can connect to the server. For quick testing, grpcurl is useful:
# Health check
grpcurl -plaintext localhost:50051 laurus.v1.HealthService/Check
# Create an index
grpcurl -plaintext -d '{
"schema": {
"fields": {
"title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
"body": {"text": {"indexed": true, "stored": true, "term_vectors": true}}
},
"default_fields": ["title", "body"]
}
}' localhost:50051 laurus.v1.IndexService/CreateIndex
# Add a document
grpcurl -plaintext -d '{
"id": "doc1",
"document": {
"fields": {
"title": {"text_value": "Hello World"},
"body": {"text_value": "This is a test document."}
}
}
}' localhost:50051 laurus.v1.DocumentService/AddDocument
# Commit
grpcurl -plaintext localhost:50051 laurus.v1.DocumentService/Commit
# Search
grpcurl -plaintext -d '{"query": "body:test", "limit": 10}' \
localhost:50051 laurus.v1.SearchService/Search
gRPC API Reference
All services are defined under the laurus.v1 protobuf package.
Services Overview
| Service | RPCs | Description |
|---|---|---|
HealthService | Check | Health checking |
IndexService | CreateIndex, GetIndex, GetSchema | Index lifecycle and schema |
DocumentService | PutDocument, AddDocument, GetDocuments, DeleteDocuments, Commit | Document CRUD and commit |
SearchService | Search, SearchStream | Unary and streaming search |
HealthService
Check
Returns the current serving status of the server.
rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
Response fields:
| Field | Type | Description |
|---|---|---|
status | ServingStatus | SERVING_STATUS_SERVING when the server is ready |
IndexService
CreateIndex
Create a new index with the given schema. Fails with ALREADY_EXISTS if an index is already open.
rpc CreateIndex(CreateIndexRequest) returns (CreateIndexResponse);
Request fields:
| Field | Type | Required | Description |
|---|---|---|---|
schema | Schema | Yes | Index schema definition |
Schema structure:
message Schema {
map<string, FieldOption> fields = 1;
repeated string default_fields = 2;
}
Each FieldOption is a oneof with one of the following field types:
| Lexical Fields | Vector Fields |
|---|---|
TextOption (indexed, stored, term_vectors) | HnswOption (dimension, distance, m, ef_construction, base_weight, quantizer) |
IntegerOption (indexed, stored) | FlatOption (dimension, distance, base_weight, quantizer) |
FloatOption (indexed, stored) | IvfOption (dimension, distance, n_clusters, n_probe, base_weight, quantizer) |
BooleanOption (indexed, stored) | |
DateTimeOption (indexed, stored) | |
GeoOption (indexed, stored) | |
BytesOption (stored) |
Distance metrics: COSINE, EUCLIDEAN, MANHATTAN, DOT_PRODUCT, ANGULAR
Quantization methods: NONE, SCALAR_8BIT, PRODUCT_QUANTIZATION
Example:
{
"schema": {
"fields": {
"title": {"text": {"indexed": true, "stored": true, "term_vectors": true}},
"embedding": {"hnsw": {"dimension": 384, "distance": "DISTANCE_METRIC_COSINE", "m": 16, "ef_construction": 200}}
},
"default_fields": ["title"]
}
}
GetIndex
Get index statistics.
rpc GetIndex(GetIndexRequest) returns (GetIndexResponse);
Response fields:
| Field | Type | Description |
|---|---|---|
document_count | uint64 | Total number of documents in the index |
vector_fields | map<string, VectorFieldStats> | Per-field vector statistics |
Each VectorFieldStats contains vector_count and dimension.
GetSchema
Retrieve the current index schema.
rpc GetSchema(GetSchemaRequest) returns (GetSchemaResponse);
Response fields:
| Field | Type | Description |
|---|---|---|
schema | Schema | The index schema |
DocumentService
PutDocument
Insert or replace a document by ID. If a document with the same ID already exists, it is replaced.
rpc PutDocument(PutDocumentRequest) returns (PutDocumentResponse);
Request fields:
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | External document ID |
document | Document | Yes | Document content |
Document structure:
message Document {
map<string, Value> fields = 1;
}
Each Value is a oneof with these types:
| Type | Proto Field | Description |
|---|---|---|
| Null | null_value | Null value |
| Boolean | bool_value | Boolean value |
| Integer | int64_value | 64-bit integer |
| Float | float64_value | 64-bit floating point |
| Text | text_value | UTF-8 string |
| Bytes | bytes_value | Raw bytes |
| Vector | vector_value | VectorValue (list of floats) |
| DateTime | datetime_value | Unix microseconds (UTC) |
| Geo | geo_value | GeoPoint (latitude, longitude) |
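As an illustration, a `PutDocument`-style JSON body exercising several value types. The field names are hypothetical, and the exact JSON shapes of the nested `VectorValue` and `GeoPoint` messages are assumptions based on the table above:

```json
{
  "id": "product-42",
  "document": {
    "fields": {
      "name": {"text_value": "Trail Camera"},
      "price": {"float64_value": 199.99},
      "in_stock": {"bool_value": true},
      "created_at": {"datetime_value": "1704067200000000"},
      "location": {"geo_value": {"latitude": 35.68, "longitude": 139.69}},
      "embedding": {"vector_value": {"values": [0.12, -0.03, 0.88]}}
    }
  }
}
```

Note that in the standard protobuf JSON mapping, 64-bit integers such as the microsecond timestamp are encoded as strings.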
AddDocument
Add a document. Unlike PutDocument, this does not replace existing documents with the same ID — multiple documents can share an ID (chunking pattern).
rpc AddDocument(AddDocumentRequest) returns (AddDocumentResponse);
Request fields are the same as PutDocument.
GetDocuments
Retrieve all documents matching the given external ID.
rpc GetDocuments(GetDocumentsRequest) returns (GetDocumentsResponse);
Request fields:
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | External document ID |
Response fields:
| Field | Type | Description |
|---|---|---|
documents | repeated Document | Matching documents |
DeleteDocuments
Delete all documents matching the given external ID.
rpc DeleteDocuments(DeleteDocumentsRequest) returns (DeleteDocumentsResponse);
Commit
Commit pending changes (additions and deletions) to the index. Changes are not visible to search until committed.
rpc Commit(CommitRequest) returns (CommitResponse);
SearchService
Search
Execute a search query and return results as a single response.
rpc Search(SearchRequest) returns (SearchResponse);
SearchStream
Execute a search query and stream results back one at a time.
rpc SearchStream(SearchRequest) returns (stream SearchResult);
SearchRequest Fields
| Field | Type | Required | Description |
|---|---|---|---|
query | string | No | Lexical search query in Query DSL |
query_vectors | repeated QueryVector | No | Vector search queries |
limit | uint32 | No | Maximum number of results (default: engine default) |
offset | uint32 | No | Number of results to skip |
fusion | FusionAlgorithm | No | Fusion algorithm for hybrid search |
lexical_params | LexicalParams | No | Lexical search parameters |
vector_params | VectorParams | No | Vector search parameters |
field_boosts | map<string, float> | No | Per-field score boosting |
At least one of query or query_vectors must be provided.
QueryVector
| Field | Type | Description |
|---|---|---|
vector | repeated float | Query vector |
weight | float | Weight for this vector (default: 1.0) |
fields | repeated string | Target vector fields (empty = all) |
FusionAlgorithm
A oneof with two options:
- RRF (Reciprocal Rank Fusion): `k` parameter (default: 60)
- WeightedSum: `lexical_weight` and `vector_weight`
LexicalParams
| Field | Type | Description |
|---|---|---|
min_score | float | Minimum score threshold |
timeout_ms | uint64 | Search timeout in milliseconds |
parallel | bool | Enable parallel search |
sort_by | SortSpec | Sort by a field instead of score |
VectorParams
| Field | Type | Description |
|---|---|---|
fields | repeated string | Target vector fields |
score_mode | VectorScoreMode | WEIGHTED_SUM, MAX_SIM, or LATE_INTERACTION |
overfetch | float | Overfetch factor (default: 2.0) |
min_score | float | Minimum score threshold |
SearchResult
| Field | Type | Description |
|---|---|---|
id | string | External document ID |
score | float | Relevance score |
document | Document | Document content |
Example
{
"query": "body:rust",
"query_vectors": [
{"vector": [0.1, 0.2, 0.3], "weight": 1.0}
],
"limit": 10,
"fusion": {
"rrf": {"k": 60}
},
"field_boosts": {
"title": 2.0
}
}
Error Handling
gRPC errors are returned as standard Status codes:
| Laurus Error | gRPC Status | When |
|---|---|---|
| Schema / Query / Field / JSON | INVALID_ARGUMENT | Malformed request or schema |
| No index open | FAILED_PRECONDITION | RPC called before CreateIndex |
| Index already exists | ALREADY_EXISTS | CreateIndex called twice |
| Not implemented | UNIMPLEMENTED | Feature not yet supported |
| Internal errors | INTERNAL | I/O, storage, or unexpected errors |
Advanced Features
This section covers advanced topics for users who want to go deeper into Laurus’s capabilities.
Topics
Query DSL
A human-readable query language for lexical, vector, and hybrid search. Supports boolean operators, phrase matching, fuzzy search, range queries, and more — all in a single query string.
ID Management
How Laurus manages document identity with a dual-tiered ID system:
- External IDs (user-provided strings)
- Internal IDs (shard-prefixed `u64` for performance)
Persistence & WAL
How Laurus ensures data durability through Write-Ahead Logging (WAL) and the commit lifecycle.
Deletions & Compaction
How documents are deleted (logical deletion via bitmaps) and how space is reclaimed (compaction).
Error Handling
Understanding LaurusError and Result<T> for robust application development. Covers all error variants, matching patterns, and common error scenarios.
Extensibility
Implementing custom components by extending Laurus’s trait-based abstractions:
- Custom `Analyzer` for text analysis
- Custom `Embedder` for vector embeddings
- Custom `Storage` for new backends
Query DSL
Laurus provides a unified query DSL (Domain Specific Language) that allows lexical (keyword) and vector (semantic) search in a single query string. The UnifiedQueryParser splits the input into lexical and vector portions and delegates to the appropriate sub-parser.
Overview
title:hello AND content:~"cute kitten"^0.8
|--- lexical --| |--- vector --------|
The ~" pattern distinguishes vector clauses from lexical clauses. Everything else is treated as a lexical query.
Lexical Query Syntax
Lexical queries search the inverted index using exact or approximate keyword matching.
Term Query
Match a single term against a field (or the default field):
hello
title:hello
Boolean Operators
Combine clauses with AND and OR (case-insensitive):
title:hello AND body:world
title:hello OR title:goodbye
Space-separated clauses without an explicit operator use implicit boolean (behaves like OR with scoring).
Required / Prohibited Clauses
Use + (must match) and - (must not match):
+title:hello -title:goodbye
Phrase Query
Match an exact phrase using double quotes. Optional proximity (~N) allows N words between terms:
"hello world"
"hello world"~2
Fuzzy Query
Approximate matching with edit distance. Append ~ and optionally the maximum edit distance:
roam~
roam~2
Wildcard Query
Use ? (single character) and * (zero or more characters):
te?t
test*
Range Query
Inclusive [] or exclusive {} ranges, useful for numeric and date fields:
price:[100 TO 500]
date:{2024-01-01 TO 2024-12-31}
price:[* TO 100]
Boost
Increase the weight of a clause with ^:
title:hello^2
"important phrase"^1.5
Grouping
Use parentheses for sub-expressions:
(title:hello OR title:hi) AND body:world
PEG Grammar
The full lexical grammar (parser.pest):
query = { SOI ~ boolean_query ~ EOI }
boolean_query = { clause ~ (boolean_op ~ clause | clause)* }
clause = { required_clause | prohibited_clause | sub_clause }
required_clause = { "+" ~ sub_clause }
prohibited_clause = { "-" ~ sub_clause }
sub_clause = { grouped_query | field_query | term_query }
grouped_query = { "(" ~ boolean_query ~ ")" ~ boost? }
boolean_op = { ^"AND" | ^"OR" }
field_query = { field ~ ":" ~ field_value }
field_value = { range_query | phrase_query | fuzzy_term
| wildcard_term | simple_term }
phrase_query = { "\"" ~ phrase_content ~ "\"" ~ proximity? ~ boost? }
proximity = { "~" ~ number }
fuzzy_term = { term ~ "~" ~ fuzziness? ~ boost? }
wildcard_term = { wildcard_pattern ~ boost? }
simple_term = { term ~ boost? }
boost = { "^" ~ boost_value }
Vector Query Syntax
Vector queries embed text into vectors at parse time and perform similarity search.
Basic Syntax
field:~"text"
field:~"text"^weight
| Element | Required | Description | Example |
|---|---|---|---|
field: | No | Target vector field name | content: |
~ | Yes | Vector query marker | |
"text" | Yes | Text to embed | "cute kitten" |
^weight | No | Score weight (default: 1.0) | ^0.8 |
Examples
# Single field
content:~"cute kitten"
# With boost weight
content:~"cute kitten"^0.8
# Default field (when configured)
~"cute kitten"
# Multiple clauses
content:~"cats" image:~"dogs"^0.5
# Nested field name (dot notation)
metadata.embedding:~"text"
Multiple Clauses
Multiple vector clauses are space-separated. All clauses are executed and their scores are combined using the score_mode (default: WeightedSum):
content:~"cats" image:~"dogs"^0.5
This produces:
score = similarity("cats", content) * 1.0
+ similarity("dogs", image) * 0.5
There are no AND/OR operators in the vector DSL. Vector search is inherently a ranking operation, and the weight (^) controls the contribution of each clause.
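The weighted combination above can be sketched directly, using hypothetical per-clause similarity values (this illustrates the formula, not Laurus's internal code):

```rust
// WeightedSum over vector clauses: each contributes similarity * weight.
fn weighted_sum(clauses: &[(f32, f32)]) -> f32 {
    clauses.iter().map(|(sim, w)| sim * w).sum()
}

fn main() {
    // content:~"cats" (weight 1.0) with similarity 0.9,
    // image:~"dogs"^0.5 with similarity 0.6.
    let score = weighted_sum(&[(0.9, 1.0), (0.6, 0.5)]);
    assert!((score - 1.2).abs() < 1e-6); // 0.9 * 1.0 + 0.6 * 0.5
}
```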
Score Modes
| Mode | Description |
|---|---|
WeightedSum (default) | Sum of (similarity * weight) across all clauses |
MaxSim | Maximum similarity score across clauses |
LateInteraction | Late interaction scoring |
Score mode cannot be set from DSL syntax. Use the Rust API to override:
#![allow(unused)]
fn main() {
let mut request = parser.parse(r#"content:~"cats" image:~"dogs""#).await?;
request.score_mode = VectorScoreMode::MaxSim;
}
PEG Grammar
The full vector grammar (parser.pest):
query = { SOI ~ vector_clause+ ~ EOI }
vector_clause = { field_prefix? ~ "~" ~ quoted_text ~ boost? }
field_prefix = { field_name ~ ":" }
field_name = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHANUMERIC | "_" | ".")* }
quoted_text = ${ "\"" ~ inner_text ~ "\"" }
inner_text = @{ (!("\"") ~ ANY)* }
boost = { "^" ~ float_value }
float_value = @{ ASCII_DIGIT+ ~ ("." ~ ASCII_DIGIT+)? }
Unified (Hybrid) Query Syntax
The UnifiedQueryParser allows mixing lexical and vector clauses freely in a single query string:
title:hello content:~"cute kitten"^0.8
How It Works
- Split: Vector clauses (matching the `field:~"text"^boost` pattern) are extracted via regex.
- Delegate: The vector portion goes to `VectorQueryParser`; the remainder goes to the lexical `QueryParser`.
- Fuse: If both lexical and vector results exist, they are combined using a fusion algorithm.
Disambiguation
The ~" pattern unambiguously identifies vector clauses because in lexical syntax, ~ only appears after a term or phrase (e.g., roam~2, "hello world"~10), never before a quote.
Fusion Algorithms
When a query contains both lexical and vector clauses, results are fused:
| Algorithm | Formula | Description |
|---|---|---|
| RRF (default) | score = sum(1 / (k + rank)) | Reciprocal Rank Fusion. Robust to different score distributions. Default k=60. |
| WeightedSum | score = lexical * a + vector * b | Linear combination with configurable weights. |
Note: The fusion algorithm cannot be specified in the DSL syntax. It is configured when constructing the `UnifiedQueryParser` via `.with_fusion()`. The default is RRF (k=60). See Custom Fusion for a code example.
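The RRF formula above can be sketched in a few lines (an illustration of the algorithm, not Laurus's internal implementation):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank_d),
// with 1-based ranks. Documents appearing high in multiple lists rise to the top.
fn rrf(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let lexical = vec!["doc1", "doc2", "doc3"];
    let vector = vec!["doc2", "doc1", "doc4"];
    let fused = rrf(&[lexical, vector], 60.0);
    // doc1 and doc2 each appear in both lists (ranks 1 and 2), so they tie at the top.
    assert!((fused[0].1 - (1.0 / 61.0 + 1.0 / 62.0)).abs() < 1e-12);
}
```

Because RRF uses only ranks, it is robust when lexical (BM25) and vector (similarity) scores live on very different scales.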
Examples
# Lexical only — no fusion
title:hello AND body:world
# Vector only — no fusion
content:~"cute kitten"
# Hybrid — fusion applied automatically
title:hello content:~"cute kitten"
# Hybrid with boolean operators
title:hello AND category:animal content:~"cute kitten"^0.8
# Multiple vector clauses + lexical
category:animal content:~"cats" image:~"dogs"^0.5
# Default fields (when configured)
hello ~"cats"
Code Examples
Lexical Search with DSL
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
use laurus::lexical::query::QueryParser;
let analyzer = Arc::new(StandardAnalyzer::new()?);
let parser = QueryParser::new(analyzer)
.with_default_field("title");
let query = parser.parse("title:hello AND body:world")?;
}
Vector Search with DSL
#![allow(unused)]
fn main() {
use std::sync::Arc;
use laurus::vector::query::VectorQueryParser;
let parser = VectorQueryParser::new(embedder)
.with_default_field("content");
let request = parser.parse(r#"content:~"cute kitten"^0.8"#).await?;
}
Hybrid Search with Unified DSL
#![allow(unused)]
fn main() {
use laurus::engine::query::UnifiedQueryParser;
let unified = UnifiedQueryParser::new(lexical_parser, vector_parser);
let request = unified.parse(
r#"title:hello content:~"cute kitten"^0.8"#
).await?;
// request.lexical_search_request -> Some(...) — lexical query
// request.vector_search_request -> Some(...) — vector query
// request.fusion_algorithm -> Some(RRF) — fusion algorithm
}
Custom Fusion
#![allow(unused)]
fn main() {
use laurus::engine::search::FusionAlgorithm;
let unified = UnifiedQueryParser::new(lexical_parser, vector_parser)
.with_fusion(FusionAlgorithm::WeightedSum {
lexical_weight: 0.3,
vector_weight: 0.7,
});
}
ID Management
Laurus uses a dual-tiered ID management strategy to ensure efficient document retrieval, updates, and aggregation in distributed environments.
1. External ID (String)
The External ID is a logical identifier used by users and applications to uniquely identify a document.
- Type: `String`
- Role: You can use any unique value, such as UUIDs, URLs, or database primary keys.
- Storage: Persisted transparently as a reserved system field named `_id` within the Lexical Index.
- Uniqueness: Expected to be unique across the entire system.
- Updates: Indexing a document with an existing `external_id` triggers an automatic “Delete-then-Insert” (Upsert) operation, replacing the old version with the newest.
2. Internal ID (u64 / Stable ID)
The Internal ID is a physical handle used internally by Laurus’s engines (Lexical and Vector) for high-performance operations.
- Type: Unsigned 64-bit Integer (`u64`)
- Role: Used for bitmap operations, point references, and routing between distributed nodes.
- Immutability (Stable): Once assigned, an Internal ID never changes due to index merges (segment compaction) or restarts. This prevents inconsistencies in deletion logs and caches.
ID Structure (Shard-Prefixed)
Laurus employs a Shard-Prefixed Stable ID scheme designed for multi-node distributed environments.
| Bit Range | Name | Description |
|---|---|---|
| 48-63 bit | Shard ID | Prefix identifying the node or partition (up to 65,535 shards). |
| 0-47 bit | Local ID | Monotonically increasing document number within a shard (up to ~281 trillion documents). |
Why this structure?
- Zero-Cost Aggregation: Since `u64` IDs are globally unique, the aggregator can perform fast sorting and deduplication without worrying about ID collisions between nodes.
- Fast Routing: The aggregator can immediately identify the physical node responsible for a document just by looking at the upper bits, avoiding expensive hash lookups.
- High-Performance Fetching: Internal IDs map directly to physical data structures. This allows Laurus to skip the “External-to-Internal ID” conversion step during retrieval, achieving O(1) access speed.
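The bit layout described above can be sketched as follows (illustrative only — not Laurus's internal code):

```rust
// Shard-Prefixed Stable ID: upper 16 bits = shard ID, lower 48 bits = local ID.
const LOCAL_BITS: u32 = 48;
const LOCAL_MASK: u64 = (1u64 << LOCAL_BITS) - 1;

fn compose(shard_id: u16, local_id: u64) -> u64 {
    assert!(local_id <= LOCAL_MASK, "local ID overflows 48 bits");
    ((shard_id as u64) << LOCAL_BITS) | local_id
}

// Routing: the owning shard is recovered from the upper bits alone.
fn shard_of(internal_id: u64) -> u16 {
    (internal_id >> LOCAL_BITS) as u16
}

fn local_of(internal_id: u64) -> u64 {
    internal_id & LOCAL_MASK
}

fn main() {
    let id = compose(3, 42);
    assert_eq!(shard_of(id), 3);
    assert_eq!(local_of(id), 42);
}
```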
ID Lifecycle
- Registration (`engine.put_document()` / `engine.add_document()`): The user provides a document with an External ID.
- ID Assignment: The `Engine` combines the current `shard_id` with a new Local ID to issue a Shard-Prefixed Internal ID.
- Mapping: The engine maintains the relationship between the External ID and the new Internal ID.
- Search: Search results return the External ID (`String`), resolved from the Internal ID.
- Retrieval/Deletion: While the user-facing API accepts External IDs for convenience, the engine internally converts them to Internal IDs for near-instant processing.
Persistence & WAL
Laurus uses a Write-Ahead Log (WAL) to ensure data durability. Every write operation is persisted to the WAL before modifying in-memory structures, guaranteeing that no data is lost even if the process crashes.
Write Path
sequenceDiagram
participant App as Application
participant Engine
participant WAL as DocumentLog (WAL)
participant Mem as In-Memory Buffers
participant Disk as Storage (segments)
App->>Engine: add_document() / delete_documents()
Engine->>WAL: 1. Append operation to WAL
Engine->>Mem: 2. Update in-memory buffers
Note over Mem: Document is buffered but\nNOT yet searchable
App->>Engine: commit()
Engine->>Disk: 3. Flush segments to storage
Engine->>WAL: 4. Truncate WAL
Note over Disk: Documents are now\nsearchable and durable
Key Principles
- WAL-first: Every write (add or delete) is appended to the WAL before updating in-memory structures
- Buffered writes: In-memory buffers accumulate changes until commit() is called
- Atomic commit: commit() flushes all buffered changes to segment files and truncates the WAL
- Crash safety: If the process crashes between writes and commit, the WAL is replayed on the next startup
Write-Ahead Log (WAL)
The WAL is managed by the DocumentLog component and stored at the root level of the storage backend (engine.wal).
WAL Entry Types
| Entry Type | Description |
|---|---|
| Upsert | Document content + external ID + assigned internal ID |
| Delete | External ID of the document to remove |
WAL File
The WAL file (engine.wal) is an append-only binary log. Each entry is self-contained with:
- Operation type (add/delete)
- Sequence number
- Payload (document data or ID)
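A self-contained, append-only record format like this can be sketched with an op tag, a sequence number, and a length-prefixed payload. The exact on-disk encoding of engine.wal is an internal detail, so the layout below is an assumption for illustration only:

```rust
// Hypothetical WAL record layout: [tag: u8][seq: u64 LE][len: u32 LE][body].
// Illustrative only; Laurus's actual engine.wal encoding may differ.

#[derive(Debug, PartialEq)]
enum WalEntry {
    Upsert { seq: u64, payload: Vec<u8> },
    Delete { seq: u64, external_id: String },
}

fn encode(entry: &WalEntry, out: &mut Vec<u8>) {
    let (tag, seq, body): (u8, u64, &[u8]) = match entry {
        WalEntry::Upsert { seq, payload } => (0, *seq, payload),
        WalEntry::Delete { seq, external_id } => (1, *seq, external_id.as_bytes()),
    };
    out.push(tag);
    out.extend_from_slice(&seq.to_le_bytes());
    out.extend_from_slice(&(body.len() as u32).to_le_bytes());
    out.extend_from_slice(body);
}

/// Decode one entry from the front of `buf`, returning it and the bytes consumed.
fn decode(buf: &[u8]) -> Option<(WalEntry, usize)> {
    let tag = *buf.first()?;
    let seq = u64::from_le_bytes(buf.get(1..9)?.try_into().ok()?);
    let len = u32::from_le_bytes(buf.get(9..13)?.try_into().ok()?) as usize;
    let body = buf.get(13..13 + len)?.to_vec();
    let entry = match tag {
        0 => WalEntry::Upsert { seq, payload: body },
        1 => WalEntry::Delete { seq, external_id: String::from_utf8(body).ok()? },
        _ => return None,
    };
    Some((entry, 13 + len))
}

fn main() {
    let mut log = Vec::new();
    encode(&WalEntry::Upsert { seq: 1, payload: b"doc".to_vec() }, &mut log);
    encode(&WalEntry::Delete { seq: 2, external_id: "doc-1".into() }, &mut log);
    // Replay: walk the log entry by entry, as recovery would.
    let mut offset = 0;
    let mut replayed = Vec::new();
    while let Some((entry, used)) = decode(&log[offset..]) {
        replayed.push(entry);
        offset += used;
    }
    assert_eq!(replayed.len(), 2);
}
```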
Recovery
When an engine is built (Engine::builder(...).build().await), it automatically checks for remaining WAL entries and replays them (the WAL is truncated on commit, so any remaining entries are from a crashed session):
graph TD
Start["Engine::build()"] --> Check["Check WAL for\nuncommitted entries"]
Check -->|"Entries found"| Replay["Replay operations\ninto in-memory buffers"]
Replay --> Ready["Engine ready"]
Check -->|"No entries"| Ready
Recovery is transparent — you do not need to handle it manually.
The Commit Lifecycle
#![allow(unused)]
fn main() {
// 1. Add documents (buffered, not yet searchable)
engine.add_document("doc-1", doc1).await?;
engine.add_document("doc-2", doc2).await?;
// 2. Commit — flush to persistent storage
engine.commit().await?;
// Documents are now searchable
// 3. Add more documents
engine.add_document("doc-3", doc3).await?;
// 4. If the process crashes here, doc-3 is in the WAL
// and will be recovered on next startup
}
When to Commit
| Strategy | Description | Use Case |
|---|---|---|
| After each document | Maximum durability, minimum search latency | Real-time search with few writes |
| After a batch | Good balance of throughput and latency | Bulk indexing |
| Periodically | Maximum write throughput | High-volume ingestion |
Tip: Commits are relatively expensive because they flush segments to storage. For bulk indexing, batch many documents before calling commit().
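The strategies in the table above can be expressed as a simple policy check. The CommitPolicy type below is a hypothetical sketch, not a Laurus API:

```rust
// Hypothetical commit policy mirroring the three strategies above.
// Not a Laurus API; shown only to make the trade-off concrete.

enum CommitPolicy {
    /// Maximum durability: commit after every document.
    PerDocument,
    /// Bulk indexing: commit after every N buffered documents.
    EveryN(usize),
    /// High-volume ingestion: commit on a timer.
    PeriodicMs(u64),
}

fn should_commit(policy: &CommitPolicy, buffered_docs: usize, elapsed_ms: u64) -> bool {
    match policy {
        CommitPolicy::PerDocument => buffered_docs >= 1,
        CommitPolicy::EveryN(n) => buffered_docs >= *n,
        CommitPolicy::PeriodicMs(ms) => elapsed_ms >= *ms,
    }
}

fn main() {
    // Bulk indexing: batch 1000 documents per commit.
    let policy = CommitPolicy::EveryN(1000);
    assert!(!should_commit(&policy, 999, 0));
    assert!(should_commit(&policy, 1000, 0));
}
```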
Storage Layout
The engine uses PrefixedStorage to organize data:
<storage root>/
├── lexical/ # Inverted index segments
│ ├── seg-000/
│ │ ├── terms.dict
│ │ ├── postings.post
│ │ └── ...
│ └── metadata.json
├── vector/ # Vector index segments
│ ├── seg-000/
│ │ ├── graph.hnsw
│ │ ├── vectors.vecs
│ │ └── ...
│ └── metadata.json
├── documents/ # Document storage
│ └── ...
└── engine.wal # Write-ahead log
Next Steps
- How deletions are handled: Deletions & Compaction
- Storage backends: Storage
Deletions & Compaction
Laurus uses a two-phase deletion strategy: fast logical deletion followed by periodic physical compaction.
Deleting Documents
#![allow(unused)]
fn main() {
// Delete a document by its external ID
engine.delete_documents("doc-1").await?;
engine.commit().await?;
}
Logical Deletion
When a document is deleted, it is not immediately removed from the index files. Instead:
graph LR
Del["delete_documents('doc-1')"] --> Bitmap["Add internal ID\nto Deletion Bitmap"]
Bitmap --> Search["Search skips\ndeleted IDs"]
- The document’s internal ID is added to a deletion bitmap
- The bitmap is checked during every search, filtering out deleted documents from results
- The original data remains in the segment files
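The filtering step amounts to a membership check against the set of deleted internal IDs. In the sketch below, std's HashSet stands in for the AHashSet<u64> Laurus uses internally:

```rust
use std::collections::HashSet;

// Sketch of how a deletion set filters search hits before they are returned.
// Each hit is (internal ID, score); deleted IDs are simply skipped.

fn filter_deleted(hits: Vec<(u64, f32)>, deleted: &HashSet<u64>) -> Vec<(u64, f32)> {
    hits.into_iter()
        .filter(|(internal_id, _)| !deleted.contains(internal_id))
        .collect()
}

fn main() {
    let deleted: HashSet<u64> = [2].into_iter().collect();
    let hits = vec![(1, 0.9), (2, 0.8), (3, 0.7)];
    // doc 2 was logically deleted; it never reaches the caller.
    let live = filter_deleted(hits, &deleted);
    assert_eq!(live, vec![(1, 0.9), (3, 0.7)]);
}
```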
Why Logical Deletion?
| Benefit | Description |
|---|---|
| Speed | O(1) — flipping a bit is instant |
| Immutable segments | Segment files are never modified in place, simplifying concurrency |
| Safe recovery | If a crash occurs, the deletion bitmap can be reconstructed from the WAL |
Upserts (Update = Delete + Insert)
When you index a document with an existing external ID, Laurus performs an automatic upsert:
- The old document is logically deleted (its ID is added to the deletion bitmap)
- A new document is inserted with a new internal ID
- The external-to-internal ID mapping is updated
#![allow(unused)]
fn main() {
// First insert
engine.put_document("doc-1", doc_v1).await?;
engine.commit().await?;
// Update: old version is logically deleted, new version is inserted
engine.put_document("doc-1", doc_v2).await?;
engine.commit().await?;
}
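Under the hood, an upsert amounts to updating the external-to-internal mapping and marking the old internal ID deleted. The IdTable type below is an illustrative model of that bookkeeping, not the engine's actual data structure:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative model of upsert = logical delete + insert.
// Not the Laurus internals; shown to make the three steps above concrete.

struct IdTable {
    next_local: u64,
    mapping: HashMap<String, u64>, // external ID -> current internal ID
    deleted: HashSet<u64>,         // logically deleted internal IDs
}

impl IdTable {
    fn upsert(&mut self, external: &str) -> u64 {
        // Step 1: if the external ID exists, logically delete the old version.
        if let Some(old) = self.mapping.get(external) {
            self.deleted.insert(*old);
        }
        // Step 2: issue a fresh internal ID for the new version.
        let new_id = self.next_local;
        self.next_local += 1;
        // Step 3: update the external-to-internal mapping.
        self.mapping.insert(external.to_string(), new_id);
        new_id
    }
}

fn main() {
    let mut t = IdTable { next_local: 0, mapping: HashMap::new(), deleted: HashSet::new() };
    let v1 = t.upsert("doc-1");
    let v2 = t.upsert("doc-1"); // update: v1 is logically deleted
    assert_ne!(v1, v2);
    assert!(t.deleted.contains(&v1));
    assert_eq!(t.mapping["doc-1"], v2);
}
```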
Physical Compaction
Over time, logically deleted documents accumulate and waste space. Compaction reclaims this space by rewriting segment files without the deleted entries.
graph LR
subgraph "Before Compaction"
S1["Segment 0\ndoc-1 (deleted)\ndoc-2\ndoc-3 (deleted)"]
S2["Segment 1\ndoc-4\ndoc-5"]
end
Compact["Compaction"]
subgraph "After Compaction"
S3["Segment 0\ndoc-2\ndoc-4\ndoc-5"]
end
S1 --> Compact
S2 --> Compact
Compact --> S3
What Compaction Does
- Reads all live (non-deleted) documents from existing segments
- Rebuilds the inverted index and/or vector index without deleted entries
- Writes new, clean segment files
- Removes the old segment files
- Resets the deletion bitmap
Cost and Frequency
| Aspect | Detail |
|---|---|
| CPU cost | High — rebuilds index structures from scratch |
| I/O cost | High — reads all data, writes new segments |
| Blocking | Searches continue during compaction (reads see the old segments until the new ones are ready) |
| Frequency | Run when deleted documents exceed a threshold (e.g., 10-20% of total) |
When to Compact
- Low-write workloads: Compact periodically (e.g., daily or weekly)
- High-write workloads: Compact when the deletion ratio exceeds a threshold
- After bulk updates: Compact after a large batch of upserts
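A threshold-based trigger can be sketched as follows. The function and the 20% threshold are illustrative; Laurus does not expose this exact hook, so you would drive it from your own counters or index statistics:

```rust
// Hypothetical compaction trigger based on the deletion ratio,
// per the "10-20% of total" guidance above. Not a Laurus API.

fn should_compact(deleted_docs: u64, total_docs: u64, threshold: f64) -> bool {
    if total_docs == 0 {
        return false; // nothing to compact
    }
    (deleted_docs as f64) / (total_docs as f64) >= threshold
}

fn main() {
    assert!(!should_compact(5, 100, 0.2)); // 5% deleted: not worth it yet
    assert!(should_compact(25, 100, 0.2)); // 25% deleted: reclaim the space
}
```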
Deletion Bitmap
The deletion bitmap tracks which internal IDs have been deleted:
- Storage: Hash set of deleted document IDs (AHashSet<u64>)
- Lookup: O(1) — hash set membership check
The bitmap is persisted alongside the index segments and is rebuilt from the WAL during recovery.
Next Steps
- How data is persisted: Persistence & WAL
- ID management and internal/external ID mapping: ID Management
Error Handling
Laurus uses a unified error type for all operations. Understanding the error system helps you write robust applications that handle failures gracefully.
LaurusError
All Laurus operations return Result<T>, which is an alias for std::result::Result<T, LaurusError>.
LaurusError is an enum with variants for each category of failure:
| Variant | Description | Common Causes |
|---|---|---|
Io | I/O errors | File not found, permission denied, disk full |
Index | Index operation errors | Corrupt index, segment read failure |
Schema | Schema-related errors | Unknown field name, type mismatch |
Analysis | Text analysis errors | Tokenizer failure, invalid filter config |
Query | Query parsing/execution errors | Malformed Query DSL, unknown field in query |
Storage | Storage backend errors | Failed to open storage, write failure |
Field | Field definition errors | Invalid field options, duplicate field name |
Json | JSON serialization errors | Malformed document JSON |
InvalidOperation | Invalid operation | Searching before commit, double close |
ResourceExhausted | Resource limits exceeded | Out of memory, too many open files |
SerializationError | Binary serialization errors | Corrupt data on disk |
OperationCancelled | Operation was cancelled | Timeout, user cancellation |
NotImplemented | Feature not available | Unimplemented operation |
Other | Generic errors | Timeout, invalid config, invalid argument |
Basic Error Handling
Using the ? Operator
The simplest approach — propagate errors to the caller:
#![allow(unused)]
fn main() {
use laurus::{Engine, Result};
async fn index_documents(engine: &Engine) -> Result<()> {
let doc = laurus::Document::builder()
.add_text("title", "Rust Programming")
.build();
engine.put_document("doc1", doc).await?;
engine.commit().await?;
Ok(())
}
}
Matching on Error Variants
When you need different behavior for different error types:
#![allow(unused)]
fn main() {
use laurus::{Engine, LaurusError};
async fn safe_search(engine: &Engine, query: &str) {
match engine.search(/* request */).await {
Ok(results) => {
for result in results {
println!("{}: {}", result.id, result.score);
}
}
Err(LaurusError::Query(msg)) => {
eprintln!("Invalid query syntax: {}", msg);
}
Err(LaurusError::Io(e)) => {
eprintln!("Storage I/O error: {}", e);
}
Err(e) => {
eprintln!("Unexpected error: {}", e);
}
}
}
}
Checking Error Types
Since LaurusError implements std::error::Error, you can use standard error handling patterns:
#![allow(unused)]
fn main() {
use laurus::LaurusError;
fn is_retriable(error: &LaurusError) -> bool {
matches!(error, LaurusError::Io(_) | LaurusError::ResourceExhausted(_))
}
}
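A predicate like is_retriable pairs naturally with a generic retry loop. The helper below is a sketch, not a Laurus API, and works with any error type:

```rust
// Generic retry helper: re-run `op` while the error is retriable,
// up to `max_attempts`. Sketch only; add backoff/jitter for production use.

fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    retriable: impl Fn(&E) -> bool,
    max_attempts: usize,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(v) => return Ok(v),
            // Retry only transient failures, and only while attempts remain.
            Err(e) if attempt < max_attempts && retriable(&e) => continue,
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let mut calls = 0;
    let result: Result<u32, &str> = retry(
        || {
            calls += 1;
            if calls < 3 { Err("transient") } else { Ok(42) }
        },
        |_| true,
        5,
    );
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
}
```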
Common Error Scenarios
Schema Mismatch
Adding a document with fields that don’t match the schema:
#![allow(unused)]
fn main() {
// Schema has "title" (Text) and "year" (Integer)
let doc = Document::builder()
.add_text("title", "Hello")
.add_text("unknown_field", "this field is not in schema")
.build();
// Fields not in the schema are silently ignored during indexing.
// No error is raised — only schema-defined fields are processed.
}
Query Parsing Errors
Invalid Query DSL syntax returns a Query error:
#![allow(unused)]
fn main() {
use laurus::engine::query::UnifiedQueryParser;
let parser = UnifiedQueryParser::new();
match parser.parse("title:\"unclosed phrase") {
Ok(request) => { /* ... */ }
Err(LaurusError::Query(msg)) => {
// msg contains details about the parse failure
eprintln!("Bad query: {}", msg);
}
Err(e) => { /* other errors */ }
}
}
Storage I/O Errors
File-based storage may encounter I/O errors:
#![allow(unused)]
fn main() {
use laurus::storage::{StorageConfig, StorageFactory};
match StorageFactory::open(StorageConfig::File {
path: "/nonexistent/path".into(),
loading_mode: Default::default(),
}) {
Ok(storage) => { /* ... */ }
Err(LaurusError::Io(e)) => {
eprintln!("Cannot open storage: {}", e);
}
Err(e) => { /* other errors */ }
}
}
Convenience Constructors
LaurusError provides factory methods for creating errors in custom implementations:
| Method | Creates |
|---|---|
LaurusError::index(msg) | Index variant |
LaurusError::schema(msg) | Schema variant |
LaurusError::analysis(msg) | Analysis variant |
LaurusError::query(msg) | Query variant |
LaurusError::storage(msg) | Storage variant |
LaurusError::field(msg) | Field variant |
LaurusError::other(msg) | Other variant |
LaurusError::cancelled(msg) | OperationCancelled variant |
LaurusError::invalid_argument(msg) | Other with “Invalid argument” prefix |
LaurusError::invalid_config(msg) | Other with “Invalid configuration” prefix |
LaurusError::not_found(msg) | Other with “Not found” prefix |
LaurusError::timeout(msg) | Other with “Timeout” prefix |
These are useful when implementing custom Analyzer, Embedder, or Storage traits:
#![allow(unused)]
fn main() {
use laurus::{LaurusError, Result};
fn validate_dimension(dim: usize) -> Result<()> {
if dim == 0 {
return Err(LaurusError::invalid_argument("dimension must be > 0"));
}
Ok(())
}
}
Automatic Conversions
LaurusError implements From for common error types, so they convert automatically with ?:
| Source Type | Target Variant |
|---|---|
std::io::Error | LaurusError::Io |
serde_json::Error | LaurusError::Json |
anyhow::Error | LaurusError::Anyhow |
Next Steps
- Extensibility — implement custom traits with proper error handling
- API Reference — full method signatures and return types
Extensibility
Laurus uses trait-based abstractions for its core components. You can implement these traits to provide custom analyzers, embedders, and storage backends.
Custom Analyzer
Implement the Analyzer trait to create a custom text analysis pipeline:
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::analyzer::Analyzer;
use laurus::analysis::token::{Token, TokenStream};
use laurus::Result;
#[derive(Debug)]
struct ReverseAnalyzer;
impl Analyzer for ReverseAnalyzer {
fn analyze(&self, text: &str) -> Result<TokenStream> {
let tokens: Vec<Token> = text
.split_whitespace()
.enumerate()
.map(|(i, word)| Token {
text: word.chars().rev().collect(),
position: i,
..Default::default()
})
.collect();
Ok(Box::new(tokens.into_iter()))
}
fn name(&self) -> &str {
"reverse"
}
fn as_any(&self) -> &dyn std::any::Any {
self
}
}
}
Required Methods
| Method | Description |
|---|---|
analyze(&self, text: &str) -> Result<TokenStream> | Process text into a stream of tokens |
name(&self) -> &str | Return a unique identifier for this analyzer |
as_any(&self) -> &dyn Any | Enable downcasting to the concrete type |
Using a Custom Analyzer
Pass your analyzer to EngineBuilder:
#![allow(unused)]
fn main() {
use std::sync::Arc;
let analyzer = Arc::new(ReverseAnalyzer);
let engine = Engine::builder(storage, schema)
.analyzer(analyzer)
.build()
.await?;
}
For per-field analyzers, wrap with PerFieldAnalyzer:
#![allow(unused)]
fn main() {
use laurus::analysis::analyzer::per_field::PerFieldAnalyzer;
use laurus::analysis::analyzer::standard::StandardAnalyzer;
let mut per_field = PerFieldAnalyzer::new(Arc::new(StandardAnalyzer::new()?));
per_field.add_analyzer("custom_field", Arc::new(ReverseAnalyzer));
let engine = Engine::builder(storage, schema)
.analyzer(Arc::new(per_field))
.build()
.await?;
}
Custom Embedder
Implement the Embedder trait to integrate your own vector embedding model:
#![allow(unused)]
fn main() {
use async_trait::async_trait;
use laurus::embedding::embedder::{Embedder, EmbedInput, EmbedInputType};
use laurus::vector::core::vector::Vector;
use laurus::{LaurusError, Result};
#[derive(Debug)]
struct MyEmbedder {
dimension: usize,
}
#[async_trait]
impl Embedder for MyEmbedder {
async fn embed(&self, input: &EmbedInput<'_>) -> Result<Vector> {
match input {
EmbedInput::Text(text) => {
// Your embedding logic here
let vector = vec![0.0f32; self.dimension];
Ok(Vector::new(vector))
}
_ => Err(LaurusError::invalid_argument(
"this embedder only supports text input",
)),
}
}
fn supported_input_types(&self) -> Vec<EmbedInputType> {
vec![EmbedInputType::Text]
}
fn name(&self) -> &str {
"my-embedder"
}
fn as_any(&self) -> &dyn std::any::Any {
self
}
}
}
Required Methods
| Method | Description |
|---|---|
async embed(&self, input: &EmbedInput) -> Result<Vector> | Generate an embedding vector for the given input |
supported_input_types(&self) -> Vec<EmbedInputType> | Declare supported input types (Text, Image) |
as_any(&self) -> &dyn Any | Enable downcasting |
Optional Methods
| Method | Default | Description |
|---|---|---|
async embed_batch(&self, inputs) -> Result<Vec<Vector>> | Sequential calls to embed | Override for batch optimization |
name(&self) -> &str | "unknown" | Identifier for logging |
supports(&self, input_type) -> bool | Checks supported_input_types | Input type support check |
supports_text() -> bool | Checks for Text | Text support shorthand |
supports_image() -> bool | Checks for Image | Image support shorthand |
is_multimodal() -> bool | Both text and image | Multimodal check |
Using a Custom Embedder
#![allow(unused)]
fn main() {
let embedder = Arc::new(MyEmbedder { dimension: 384 });
let engine = Engine::builder(storage, schema)
.embedder(embedder)
.build()
.await?;
}
For per-field embedders, wrap with PerFieldEmbedder:
#![allow(unused)]
fn main() {
use laurus::embedding::per_field::PerFieldEmbedder;
let mut per_field = PerFieldEmbedder::new(Arc::new(MyEmbedder { dimension: 384 }));
per_field.add_embedder("image_vec", Arc::new(ClipEmbedder::new()?));
let engine = Engine::builder(storage, schema)
.embedder(Arc::new(per_field))
.build()
.await?;
}
Custom Storage
Implement the Storage trait to add a new storage backend:
#![allow(unused)]
fn main() {
use laurus::storage::{Storage, StorageInput, StorageOutput, LoadingMode, FileMetadata};
use laurus::Result;
#[derive(Debug)]
struct S3Storage {
bucket: String,
prefix: String,
}
impl Storage for S3Storage {
fn loading_mode(&self) -> LoadingMode {
LoadingMode::Eager // S3 requires full download
}
fn open_input(&self, name: &str) -> Result<Box<dyn StorageInput>> {
// Download from S3 and return a reader
todo!()
}
fn create_output(&self, name: &str) -> Result<Box<dyn StorageOutput>> {
// Create an upload stream to S3
todo!()
}
fn create_output_append(&self, name: &str) -> Result<Box<dyn StorageOutput>> {
todo!()
}
fn file_exists(&self, name: &str) -> bool {
todo!()
}
fn delete_file(&self, name: &str) -> Result<()> {
todo!()
}
fn list_files(&self) -> Result<Vec<String>> {
todo!()
}
fn file_size(&self, name: &str) -> Result<u64> {
todo!()
}
fn metadata(&self, name: &str) -> Result<FileMetadata> {
todo!()
}
fn rename_file(&self, old_name: &str, new_name: &str) -> Result<()> {
todo!()
}
fn create_temp_output(&self, prefix: &str) -> Result<(String, Box<dyn StorageOutput>)> {
todo!()
}
fn sync(&self) -> Result<()> {
todo!()
}
fn close(&mut self) -> Result<()> {
todo!()
}
}
}
Required Methods
| Method | Description |
|---|---|
open_input(name) -> Result<Box<dyn StorageInput>> | Open a file for reading |
create_output(name) -> Result<Box<dyn StorageOutput>> | Create a file for writing |
create_output_append(name) -> Result<Box<dyn StorageOutput>> | Open a file for appending |
file_exists(name) -> bool | Check if a file exists |
delete_file(name) -> Result<()> | Delete a file |
list_files() -> Result<Vec<String>> | List all files |
file_size(name) -> Result<u64> | Get file size in bytes |
metadata(name) -> Result<FileMetadata> | Get file metadata |
rename_file(old, new) -> Result<()> | Rename a file |
create_temp_output(prefix) -> Result<(String, Box<dyn StorageOutput>)> | Create a temporary file |
sync() -> Result<()> | Flush all pending writes |
close(&mut self) -> Result<()> | Close storage and release resources |
Optional Methods
| Method | Default | Description |
|---|---|---|
loading_mode() -> LoadingMode | LoadingMode::Eager | Preferred data loading mode |
Thread Safety
All three traits require Send + Sync. This means your implementations must be safe to share across threads. Use Arc<Mutex<_>> or lock-free data structures for shared mutable state.
Next Steps
- Error Handling — handle errors in custom implementations
- Text Analysis — built-in analyzers and pipeline components
- Embeddings — built-in embedder options
- Storage — built-in storage backends
Architecture
This page explains how Laurus is structured internally. Understanding the architecture will help you make better decisions about schema design, analyzer selection, and search strategies.
High-Level Overview
Laurus is organized around a single Engine that coordinates four internal components:
graph TB
subgraph Engine
SCH["Schema"]
LS["LexicalStore\n(Inverted Index)"]
VS["VectorStore\n(HNSW / Flat / IVF)"]
DL["DocumentLog\n(WAL + Document Storage)"]
end
Storage["Storage (trait)\nMemory / File / File+Mmap"]
LS --- Storage
VS --- Storage
DL --- Storage
| Component | Responsibility |
|---|---|
| Schema | Declares fields and their types; determines how each field is routed |
| LexicalStore | Inverted index for keyword search (BM25 scoring) |
| VectorStore | Vector index for similarity search (Flat, HNSW, or IVF) |
| DocumentLog | Write-ahead log (WAL) for durability + raw document storage |
All three stores share a single Storage backend, isolated by key prefixes (lexical/, vector/, documents/).
Engine Lifecycle
Building an Engine
The EngineBuilder assembles the engine from its parts:
#![allow(unused)]
fn main() {
let engine = Engine::builder(storage, schema)
.analyzer(analyzer) // optional: for text fields
.embedder(embedder) // optional: for vector fields
.build()
.await?;
}
sequenceDiagram
participant User
participant EngineBuilder
participant Engine
User->>EngineBuilder: new(storage, schema)
User->>EngineBuilder: .analyzer(analyzer)
User->>EngineBuilder: .embedder(embedder)
User->>EngineBuilder: .build().await
EngineBuilder->>EngineBuilder: split_schema()
Note over EngineBuilder: Separate fields into\nLexicalIndexConfig\n+ VectorIndexConfig
EngineBuilder->>Engine: Create LexicalStore
EngineBuilder->>Engine: Create VectorStore
EngineBuilder->>Engine: Create DocumentLog
EngineBuilder->>Engine: Recover from WAL
EngineBuilder-->>User: Engine ready
During build(), the engine:
- Splits the schema — lexical fields go to LexicalIndexConfig, vector fields go to VectorIndexConfig
- Creates prefixed storage — each component gets an isolated namespace (lexical/, vector/, documents/)
- Initializes stores — LexicalStore and VectorStore are constructed with their configs
- Recovers from WAL — replays any uncommitted operations from a previous session
Schema Splitting
The Schema contains both lexical and vector fields. At build time, split_schema() separates them:
graph LR
S["Schema\ntitle: Text\nbody: Text\ncategory: Text\npage: Integer\ncontent_vec: HNSW"]
S --> LC["LexicalIndexConfig\ntitle: TextOption\nbody: TextOption\ncategory: TextOption\npage: IntegerOption\n_id: KeywordAnalyzer"]
S --> VC["VectorIndexConfig\ncontent_vec: HnswOption\n(dim=384, m=16, ef=200)"]
Key details:
- The reserved _id field is always added to the lexical config with KeywordAnalyzer (exact match)
- A PerFieldAnalyzer wraps per-field analyzer settings; if you pass a simple StandardAnalyzer, it becomes the default for all text fields
- A PerFieldEmbedder works the same way for vector fields
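The split can be modeled as routing each field by its type, with the reserved _id field always landing in the lexical config. The simplified FieldType enum below is a stand-in for the real schema types, for illustration only:

```rust
// Simplified model of split_schema(): vector-typed fields go to the vector
// config, everything else (plus the reserved _id) goes to the lexical config.
// FieldType here is a stand-in, not the real Laurus schema type.

#[derive(Debug)]
enum FieldType { Text, Integer, Hnsw, Flat, Ivf }

fn split_schema(fields: Vec<(String, FieldType)>) -> (Vec<String>, Vec<String>) {
    let mut lexical = vec!["_id".to_string()]; // reserved, always lexical
    let mut vector = Vec::new();
    for (name, ty) in fields {
        match ty {
            FieldType::Hnsw | FieldType::Flat | FieldType::Ivf => vector.push(name),
            _ => lexical.push(name),
        }
    }
    (lexical, vector)
}

fn main() {
    let (lex, vecf) = split_schema(vec![
        ("title".into(), FieldType::Text),
        ("page".into(), FieldType::Integer),
        ("content_vec".into(), FieldType::Hnsw),
    ]);
    assert_eq!(lex, vec!["_id", "title", "page"]);
    assert_eq!(vecf, vec!["content_vec"]);
}
```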
Indexing Data Flow
When you call engine.add_document(id, doc):
sequenceDiagram
participant User
participant Engine
participant WAL as DocumentLog (WAL)
participant Lexical as LexicalStore
participant Vector as VectorStore
User->>Engine: add_document("doc-1", doc)
Engine->>WAL: Append to WAL
Engine->>Engine: Assign internal ID (u64)
loop For each field in document
alt Lexical field (text, integer, etc.)
Engine->>Lexical: Analyze + index field
else Vector field
Engine->>Vector: Embed + index field
end
end
Note over Engine: Document is buffered\nbut NOT yet searchable
User->>Engine: commit()
Engine->>Lexical: Flush segments to storage
Engine->>Vector: Flush segments to storage
Engine->>WAL: Truncate WAL
Note over Engine: Documents are\nnow searchable
Key points:
- WAL-first: every write is logged before modifying in-memory structures
- Dual indexing: each field is routed to either the lexical or vector store based on the schema
- Commit required: documents become searchable only after
commit()
Search Data Flow
When you call engine.search(request):
sequenceDiagram
participant User
participant Engine
participant Lexical as LexicalStore
participant Vector as VectorStore
participant Fusion
User->>Engine: search(request)
opt Filter query present
Engine->>Lexical: Execute filter query
Lexical-->>Engine: Allowed document IDs
end
par Lexical search
Engine->>Lexical: Execute lexical query
Lexical-->>Engine: Ranked hits (BM25)
and Vector search
Engine->>Vector: Execute vector query
Vector-->>Engine: Ranked hits (similarity)
end
alt Both lexical and vector results
Engine->>Fusion: Fuse results (RRF or WeightedSum)
Fusion-->>Engine: Merged ranked list
end
Engine->>Engine: Apply offset + limit
Engine-->>User: Vec of SearchResult
The search pipeline has three stages:
- Filter (optional) — execute a filter query on the lexical index to get a set of allowed document IDs
- Search — run lexical and/or vector queries in parallel
- Fusion — if both query types are present, merge results using RRF (default, k=60) or WeightedSum
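RRF scores each document as the sum of 1 / (k + rank) over every result list it appears in, with rank counted from 1. A minimal sketch of the computation (not the Laurus internals):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// with 1-based ranks and k = 60 by default. Illustrative sketch only.

fn rrf(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> =
        scores.into_iter().map(|(id, s)| (id.to_string(), s)).collect();
    // Highest fused score first.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let lexical = vec!["a", "b", "c"];
    let vector = vec!["b", "a", "d"];
    let fused = rrf(&[lexical, vector], 60.0);
    // "a" and "b" each appear at ranks 1 and 2, so they tie at the top.
    assert_eq!(fused[0].1, fused[1].1);
    assert_eq!(fused.len(), 4);
}
```

Because RRF uses only ranks, not raw scores, it needs no score normalization between the BM25 and cosine-similarity scales.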
Storage Architecture
All components share a single Storage trait implementation, but use key prefixes to isolate their data:
graph TB
Engine --> PS1["PrefixedStorage\nprefix: 'lexical/'"]
Engine --> PS2["PrefixedStorage\nprefix: 'vector/'"]
Engine --> PS3["PrefixedStorage\nprefix: 'documents/'"]
PS1 --> S["Storage Backend\n(Memory / File / File+Mmap)"]
PS2 --> S
PS3 --> S
| Backend | Description | Best For |
|---|---|---|
MemoryStorage | All data in memory | Testing, small datasets, ephemeral use |
FileStorage | Standard file I/O | General production use |
FileStorage (mmap) | Memory-mapped files (use_mmap = true) | Large datasets, read-heavy workloads |
Per-Field Dispatch
When a PerFieldAnalyzer is provided, the engine dispatches analysis to field-specific analyzers. The same pattern applies to PerFieldEmbedder.
graph LR
PFA["PerFieldAnalyzer"]
PFA -->|"title"| KA["KeywordAnalyzer"]
PFA -->|"body"| SA["StandardAnalyzer"]
PFA -->|"description"| JA["JapaneseAnalyzer"]
PFA -->|"_id"| KA2["KeywordAnalyzer\n(always)"]
PFA -->|other fields| DEF["Default Analyzer\n(StandardAnalyzer)"]
This allows different fields to use different analysis strategies within the same engine.
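The dispatch itself is just a map lookup with a default fallback. In the sketch below, analyzers are simplified to plain functions for illustration; the real trait objects work the same way:

```rust
use std::collections::HashMap;

// Simplified per-field dispatch: a field-specific analyzer if registered,
// otherwise the default. Analyzers are modeled as plain functions here.

type Analyze = fn(&str) -> Vec<String>;

fn standard(text: &str) -> Vec<String> {
    // Stand-in for StandardAnalyzer: split + lowercase.
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

fn keyword(text: &str) -> Vec<String> {
    vec![text.to_string()] // no tokenization: exact match
}

struct PerField {
    default: Analyze,
    overrides: HashMap<&'static str, Analyze>,
}

impl PerField {
    fn analyze(&self, field: &str, text: &str) -> Vec<String> {
        let f = self.overrides.get(field).copied().unwrap_or(self.default);
        f(text)
    }
}

fn main() {
    let mut overrides: HashMap<&'static str, Analyze> = HashMap::new();
    overrides.insert("_id", keyword); // _id is always exact-match
    let per_field = PerField { default: standard, overrides };
    assert_eq!(per_field.analyze("body", "Hello World"), vec!["hello", "world"]);
    assert_eq!(per_field.analyze("_id", "Doc 1"), vec!["Doc 1"]);
}
```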
Summary
| Aspect | Detail |
|---|---|
| Core struct | Engine — coordinates all operations |
| Builder | EngineBuilder — assembles Engine from Storage + Schema + Analyzer + Embedder |
| Schema split | Lexical fields → LexicalIndexConfig, Vector fields → VectorIndexConfig |
| Write path | WAL → in-memory buffers → commit() → persistent storage |
| Read path | Query → parallel lexical/vector search → fusion → ranked results |
| Storage isolation | PrefixedStorage with lexical/, vector/, documents/ prefixes |
| Per-field dispatch | PerFieldAnalyzer and PerFieldEmbedder route to field-specific implementations |
Next Steps
- Understand field types and schema design: Schema & Fields
- Learn about text analysis: Text Analysis
- Learn about embeddings: Embeddings
API Reference
This page provides a quick reference of the most important types and methods in Laurus. For full details, generate the Rustdoc:
cargo doc --open
Engine
The central coordinator for all indexing and search operations.
| Method | Description |
|---|---|
Engine::builder(storage, schema) | Create an EngineBuilder |
engine.put_document(id, doc).await? | Upsert a document (replace if ID exists) |
engine.add_document(id, doc).await? | Add a document as a chunk (multiple chunks can share an ID) |
engine.delete_documents(id).await? | Delete all documents/chunks by external ID |
engine.get_documents(id).await? | Get all documents/chunks by external ID |
engine.search(request).await? | Execute a search request |
engine.commit().await? | Flush all pending changes to storage |
engine.stats()? | Get index statistics |
put_document vs add_document: put_document performs an upsert — if a document with the same external ID already exists, it is deleted and replaced. add_document always appends, allowing multiple document chunks to share the same external ID. See Schema & Fields — Indexing Documents for details.
EngineBuilder
| Method | Description |
|---|---|
EngineBuilder::new(storage, schema) | Create a builder with storage and schema |
.analyzer(Arc<dyn Analyzer>) | Set the text analyzer (default: StandardAnalyzer) |
.embedder(Arc<dyn Embedder>) | Set the vector embedder (optional) |
.build().await? | Build the Engine |
Schema
Defines document structure.
| Method | Description |
|---|---|
Schema::builder() | Create a SchemaBuilder |
SchemaBuilder
| Method | Description |
|---|---|
.add_text_field(name, TextOption) | Add a full-text field |
.add_integer_field(name, IntegerOption) | Add an integer field |
.add_float_field(name, FloatOption) | Add a float field |
.add_boolean_field(name, BooleanOption) | Add a boolean field |
.add_datetime_field(name, DateTimeOption) | Add a datetime field |
.add_geo_field(name, GeoOption) | Add a geographic field |
.add_bytes_field(name, BytesOption) | Add a binary field |
.add_hnsw_field(name, HnswOption) | Add an HNSW vector field |
.add_flat_field(name, FlatOption) | Add a Flat vector field |
.add_ivf_field(name, IvfOption) | Add an IVF vector field |
.add_default_field(name) | Set a default search field |
.build() | Build the Schema |
Document
A collection of named field values.
| Method | Description |
|---|---|
Document::builder() | Create a DocumentBuilder |
doc.get(name) | Get a field value by name |
doc.has_field(name) | Check if a field exists |
doc.field_names() | Get all field names |
DocumentBuilder
| Method | Description |
|---|---|
.add_text(name, value) | Add a text field |
.add_integer(name, value) | Add an integer field |
.add_float(name, value) | Add a float field |
.add_boolean(name, value) | Add a boolean field |
.add_datetime(name, value) | Add a datetime field |
.add_vector(name, vec) | Add a pre-computed vector |
.add_geo(name, lat, lon) | Add a geographic point |
.add_bytes(name, data) | Add binary data |
.build() | Build the Document |
Search
SearchRequestBuilder
| Method | Description |
|---|---|
SearchRequestBuilder::new() | Create a new builder |
.lexical_search_request(req) | Set the lexical search component |
.vector_search_request(req) | Set the vector search component |
.filter_query(query) | Set a pre-filter query |
.fusion_algorithm(algo) | Set the fusion algorithm (default: RRF) |
.limit(n) | Maximum results (default: 10) |
.offset(n) | Skip N results (default: 0) |
.build() | Build the SearchRequest |
LexicalSearchRequest
| Method | Description |
|---|---|
LexicalSearchRequest::new(query) | Create with a query |
LexicalSearchRequest::from_dsl(query_str) | Create from a DSL query string |
.limit(n) | Maximum results |
.load_documents(bool) | Whether to load document content |
.min_score(f32) | Minimum score threshold |
.timeout_ms(u64) | Search timeout in milliseconds |
.parallel(bool) | Enable parallel search |
.sort_by_field_asc(field) | Sort by field ascending |
.sort_by_field_desc(field) | Sort by field descending |
.sort_by_score() | Sort by relevance score (default) |
.with_field_boost(field, boost) | Add field-level boost |
VectorSearchRequestBuilder
| Method | Description |
|---|---|
VectorSearchRequestBuilder::new() | Create a new builder |
.add_text(field, text) | Add a text query for a field |
.add_vector(field, vector) | Add a pre-computed query vector |
.add_bytes(field, bytes, mime) | Add a binary payload (for multimodal) |
.limit(n) | Maximum results |
.score_mode(VectorScoreMode) | Score combination mode (WeightedSum, MaxSim) |
.min_score(f32) | Minimum score threshold |
.field(name) | Restrict search to a specific field |
.build() | Build the request |
SearchResult
| Field | Type | Description |
|---|---|---|
id | String | External document ID |
score | f32 | Relevance score |
document | Option<Document> | Document content (if loaded) |
FusionAlgorithm
| Variant | Description |
|---|---|
RRF { k: f64 } | Reciprocal Rank Fusion (default k=60.0) |
WeightedSum { lexical_weight, vector_weight } | Linear combination of scores |
Query Types (Lexical)
| Query | Description | Example |
|---|---|---|
TermQuery::new(field, term) | Exact term match | TermQuery::new("body", "rust") |
PhraseQuery::new(field, terms) | Exact phrase | PhraseQuery::new("body", vec!["machine".into(), "learning".into()]) |
BooleanQueryBuilder::new() | Boolean combination | .must(q1).should(q2).must_not(q3).build() |
FuzzyQuery::new(field, term) | Fuzzy match (default max_edits=2) | FuzzyQuery::new("body", "programing").max_edits(1) |
WildcardQuery::new(field, pattern) | Wildcard | WildcardQuery::new("file", "*.pdf") |
NumericRangeQuery::new(...) | Numeric range | See Lexical Search |
GeoQuery::within_radius(...) | Geo radius | See Lexical Search |
SpanNearQuery::new(...) | Proximity | See Lexical Search |
PrefixQuery::new(field, prefix) | Prefix match | PrefixQuery::new("body", "pro") |
RegexpQuery::new(field, pattern)? | Regex match | RegexpQuery::new("body", "^pro.*ing$")? |
Query Parsers
| Parser | Description |
|---|---|
QueryParser::new(analyzer) | Parse lexical DSL queries |
VectorQueryParser::new(embedder) | Parse vector DSL queries |
UnifiedQueryParser::new(lexical, vector) | Parse hybrid DSL queries |
Analyzers
| Type | Description |
|---|---|
StandardAnalyzer | RegexTokenizer + lowercase + stop words |
SimpleAnalyzer | Tokenization only (no filtering) |
EnglishAnalyzer | RegexTokenizer + lowercase + English stop words |
JapaneseAnalyzer | Japanese morphological analysis |
KeywordAnalyzer | No tokenization (exact match) |
PipelineAnalyzer | Custom tokenizer + filter chain |
PerFieldAnalyzer | Per-field analyzer dispatch |
Embedders
| Type | Feature Flag | Description |
|---|---|---|
CandleBertEmbedder | embeddings-candle | Local BERT model |
OpenAIEmbedder | embeddings-openai | OpenAI API |
CandleClipEmbedder | embeddings-multimodal | Local CLIP model |
PrecomputedEmbedder | (default) | Pre-computed vectors |
PerFieldEmbedder | (default) | Per-field embedder dispatch |
Storage
| Type | Description |
|---|---|
MemoryStorage | In-memory (non-durable) |
FileStorage | File-system based (supports use_mmap for memory-mapped I/O) |
StorageFactory::create(config) | Create from config |
DataValue
| Variant | Rust Type |
|---|---|
DataValue::Null | — |
DataValue::Bool(bool) | bool |
DataValue::Int64(i64) | i64 |
DataValue::Float64(f64) | f64 |
DataValue::Text(String) | String |
DataValue::Bytes(Vec<u8>, Option<String>) | (data, mime_type) |
DataValue::Vector(Vec<f32>) | Vec<f32> |
DataValue::DateTime(DateTime<Utc>) | chrono::DateTime<Utc> |
DataValue::Geo(f64, f64) | (latitude, longitude) |