# Architecture
This page explains how Laurus is structured internally. Understanding the architecture will help you make better decisions about schema design, analyzer selection, and search strategies.
## Project Structure

Laurus is organized as a Cargo workspace with five crates:
```mermaid
graph TB
    CLI["laurus-cli\n(Binary Crate)\nCLI + REPL"]
    SRV["laurus-server\n(Library + Binary)\ngRPC Server + HTTP Gateway"]
    MCP["laurus-mcp\n(Library + Binary)\nMCP Server"]
    PY["laurus-python\n(cdylib)\nPython Bindings"]
    LIB["laurus\n(Library Crate)\nCore Search Engine"]

    CLI --> LIB
    CLI --> SRV
    CLI --> MCP
    SRV --> LIB
    MCP --> SRV
    MCP --> LIB
    PY --> LIB
```
| Crate | Type | Description |
|---|---|---|
| `laurus` | Library | Core search engine – lexical, vector, and hybrid search |
| `laurus-cli` | Binary | Command-line interface for index management and search |
| `laurus-server` | Library + Binary | gRPC server with optional HTTP/JSON gateway |
| `laurus-mcp` | Library + Binary | MCP (Model Context Protocol) server |
| `laurus-python` | cdylib | Python bindings via PyO3 |
Each crate is covered in more detail on its own page.
## High-Level Overview

Laurus is organized around a single `Engine` that coordinates four internal components:
```mermaid
graph TB
    subgraph Engine
        SCH["Schema"]
        LS["LexicalStore\n(Inverted Index)"]
        VS["VectorStore\n(HNSW / Flat / IVF)"]
        DL["DocumentLog\n(WAL + Document Storage)"]
    end
    Storage["Storage (trait)\nMemory / File / File+Mmap"]
    LS --- Storage
    VS --- Storage
    DL --- Storage
```
| Component | Responsibility |
|---|---|
| `Schema` | Declares fields and their types; determines how each field is routed |
| `LexicalStore` | Inverted index for keyword search (BM25 scoring) |
| `VectorStore` | Vector index for similarity search (Flat, HNSW, or IVF) |
| `DocumentLog` | Write-ahead log (WAL) for durability, plus raw document storage |
All three stores share a single `Storage` backend, isolated by key prefixes (`lexical/`, `vector/`, `documents/`).
## Engine Lifecycle

### Building an Engine

The `EngineBuilder` assembles the engine from its parts:
```rust
let engine = Engine::builder(storage, schema)
    .analyzer(analyzer) // optional: for text fields
    .embedder(embedder) // optional: for vector fields
    .build()
    .await?;
```
```mermaid
sequenceDiagram
    participant User
    participant EngineBuilder
    participant Engine
    User->>EngineBuilder: new(storage, schema)
    User->>EngineBuilder: .analyzer(analyzer)
    User->>EngineBuilder: .embedder(embedder)
    User->>EngineBuilder: .build().await
    EngineBuilder->>EngineBuilder: split_schema()
    Note over EngineBuilder: Separate fields into\nLexicalIndexConfig\n+ VectorIndexConfig
    EngineBuilder->>Engine: Create LexicalStore
    EngineBuilder->>Engine: Create VectorStore
    EngineBuilder->>Engine: Create DocumentLog
    EngineBuilder->>Engine: Recover from WAL
    EngineBuilder-->>User: Engine ready
```
During `build()`, the engine:

1. **Splits the schema** — lexical fields go to `LexicalIndexConfig`, vector fields go to `VectorIndexConfig`
2. **Creates prefixed storage** — each component gets an isolated namespace (`lexical/`, `vector/`, `documents/`)
3. **Initializes stores** — `LexicalStore` and `VectorStore` are constructed with their configs
4. **Recovers from WAL** — replays any uncommitted operations from a previous session
### Schema Splitting

The `Schema` contains both lexical and vector fields. At build time, `split_schema()` separates them:
```mermaid
graph LR
    S["Schema\ntitle: Text\nbody: Text\ncategory: Text\npage: Integer\ncontent_vec: HNSW"]
    S --> LC["LexicalIndexConfig\ntitle: TextOption\nbody: TextOption\ncategory: TextOption\npage: IntegerOption\n_id: KeywordAnalyzer"]
    S --> VC["VectorIndexConfig\ncontent_vec: HnswOption\n(dim=384, m=16, ef=200)"]
```
Key details:

- The reserved `_id` field is always added to the lexical config with `KeywordAnalyzer` (exact match)
- A `PerFieldAnalyzer` wraps per-field analyzer settings; if you pass a plain `StandardAnalyzer`, it becomes the default for all text fields
- A `PerFieldEmbedder` works the same way for vector fields
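The routing logic above can be sketched as a small standalone function. This is illustrative only: the `FieldKind` enum and the string-based "configs" are stand-ins invented for this sketch, not Laurus's actual types.

```rust
// Illustrative sketch of schema splitting; not Laurus's real code.
#[derive(Clone, Copy)]
enum FieldKind {
    Text,
    Integer,
    Vector,
}

/// Route each field into the lexical or vector config by kind.
fn split_schema(fields: &[(&str, FieldKind)]) -> (Vec<String>, Vec<String>) {
    // The reserved _id field always lands in the lexical config.
    let mut lexical = vec!["_id".to_string()];
    let mut vector = Vec::new();
    for (name, kind) in fields {
        match kind {
            FieldKind::Text | FieldKind::Integer => lexical.push((*name).to_string()),
            FieldKind::Vector => vector.push((*name).to_string()),
        }
    }
    (lexical, vector)
}

fn main() {
    let fields = [
        ("title", FieldKind::Text),
        ("page", FieldKind::Integer),
        ("content_vec", FieldKind::Vector),
    ];
    let (lexical, vector) = split_schema(&fields);
    println!("lexical: {:?}, vector: {:?}", lexical, vector);
}
```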
## Indexing Data Flow

When you call `engine.add_document(id, doc)`:
```mermaid
sequenceDiagram
    participant User
    participant Engine
    participant WAL as DocumentLog (WAL)
    participant Lexical as LexicalStore
    participant Vector as VectorStore
    User->>Engine: add_document("doc-1", doc)
    Engine->>WAL: Append to WAL
    Engine->>Engine: Assign internal ID (u64)
    loop For each field in document
        alt Lexical field (text, integer, etc.)
            Engine->>Lexical: Analyze + index field
        else Vector field
            Engine->>Vector: Embed + index field
        end
    end
    Note over Engine: Document is buffered\nbut NOT yet searchable
    User->>Engine: commit()
    Engine->>Lexical: Flush segments to storage
    Engine->>Vector: Flush segments to storage
    Engine->>WAL: Truncate WAL
    Note over Engine: Documents are\nnow searchable
```
Key points:

- **WAL-first**: every write is logged before modifying in-memory structures
- **Dual indexing**: each field is routed to either the lexical or vector store based on the schema
- **Commit required**: documents become searchable only after `commit()`
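The "commit required" behavior can be modeled with a toy engine: writes go to a WAL and an in-memory buffer first, and only `commit()` moves them into the searchable set. This is a purely illustrative model, not Laurus's actual implementation.

```rust
// Toy model of the write path: WAL-first logging, an in-memory buffer,
// and commit() as the visibility boundary. Illustrative only.
#[derive(Default)]
struct ToyEngine {
    wal: Vec<String>,        // append-only log, written before anything else
    buffered: Vec<String>,   // indexed but not yet searchable
    searchable: Vec<String>, // visible to queries after commit()
}

impl ToyEngine {
    fn add_document(&mut self, id: &str) {
        self.wal.push(format!("add {id}")); // 1. WAL-first
        self.buffered.push(id.to_string()); // 2. buffer in memory
    }

    fn commit(&mut self) {
        self.searchable.append(&mut self.buffered); // flush to "storage"
        self.wal.clear(); // truncate the WAL
    }

    fn is_searchable(&self, id: &str) -> bool {
        self.searchable.iter().any(|d| d == id)
    }
}

fn main() {
    let mut engine = ToyEngine::default();
    engine.add_document("doc-1");
    assert!(!engine.is_searchable("doc-1")); // buffered, not yet visible
    engine.commit();
    assert!(engine.is_searchable("doc-1")); // visible after commit
    println!("ok");
}
```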
## Search Data Flow

When you call `engine.search(request)`:
```mermaid
sequenceDiagram
    participant User
    participant Engine
    participant Lexical as LexicalStore
    participant Vector as VectorStore
    participant Fusion
    User->>Engine: search(request)
    opt Filter query present
        Engine->>Lexical: Execute filter query
        Lexical-->>Engine: Allowed document IDs
    end
    par Lexical search
        Engine->>Lexical: Execute lexical query
        Lexical-->>Engine: Ranked hits (BM25)
    and Vector search
        Engine->>Vector: Execute vector query
        Vector-->>Engine: Ranked hits (similarity)
    end
    alt Both lexical and vector results
        Engine->>Fusion: Fuse results (RRF or WeightedSum)
        Fusion-->>Engine: Merged ranked list
    end
    Engine->>Engine: Apply offset + limit
    Engine-->>User: Vec of SearchResult
```
The search pipeline has three stages:

1. **Filter (optional)** — execute a filter query on the lexical index to get the set of allowed document IDs
2. **Search** — run lexical and/or vector queries in parallel
3. **Fusion** — if both query types are present, merge results using RRF (default, k=60) or WeightedSum
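To make the fusion stage concrete, here is a minimal standalone sketch of Reciprocal Rank Fusion, the standard formulation where each document scores the sum of 1/(k + rank) over the lists it appears in (ranks 1-based, k = 60 by default). Laurus's internal fusion code may differ in detail.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = Σ over lists of 1 / (k + rank(d)).
/// A minimal sketch of the fusion step, not Laurus's actual implementation.
fn rrf_fuse(lexical: &[&str], vector: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [lexical, vector] {
        for (rank, id) in list.iter().enumerate() {
            // ranks are 1-based in the usual RRF formulation
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (rank as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // sort by descending fused score
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let lexical = ["doc-1", "doc-2", "doc-3"]; // BM25 ranking
    let vector = ["doc-2", "doc-4"];           // similarity ranking
    let fused = rrf_fuse(&lexical, &vector, 60.0);
    // doc-2 appears in both lists, so it rises to the top
    println!("{:?}", fused);
}
```

Note the design property that makes RRF a good default: it needs only ranks, not scores, so it never has to reconcile BM25 scores with cosine similarities on a common scale.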
## Storage Architecture

All components share a single `Storage` trait implementation, but use key prefixes to isolate their data:
```mermaid
graph TB
    Engine --> PS1["PrefixedStorage\nprefix: 'lexical/'"]
    Engine --> PS2["PrefixedStorage\nprefix: 'vector/'"]
    Engine --> PS3["PrefixedStorage\nprefix: 'documents/'"]
    PS1 --> S["Storage Backend\n(Memory / File / File+Mmap)"]
    PS2 --> S
    PS3 --> S
```
| Backend | Description | Best For |
|---|---|---|
| `MemoryStorage` | All data in memory | Testing, small datasets, ephemeral use |
| `FileStorage` | Standard file I/O | General production use |
| `FileStorage` (mmap) | Memory-mapped files (`use_mmap = true`) | Large datasets, read-heavy workloads |
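The prefixing idea can be sketched with a toy key-value store. The names `MemoryKv` and `Prefixed` are invented for this sketch; Laurus's real `Storage` trait is richer than this.

```rust
use std::collections::HashMap;

// Toy key-value backend standing in for a Storage implementation.
struct MemoryKv {
    data: HashMap<String, Vec<u8>>,
}

// A prefixing view: every key is read and written as "<prefix><key>",
// so components sharing one backend cannot collide.
struct Prefixed {
    prefix: String,
}

impl Prefixed {
    fn put(&self, kv: &mut MemoryKv, key: &str, value: &[u8]) {
        kv.data.insert(format!("{}{}", self.prefix, key), value.to_vec());
    }

    fn get<'a>(&self, kv: &'a MemoryKv, key: &str) -> Option<&'a [u8]> {
        kv.data.get(&format!("{}{}", self.prefix, key)).map(Vec::as_slice)
    }
}

fn main() {
    let mut kv = MemoryKv { data: HashMap::new() };
    let lexical = Prefixed { prefix: "lexical/".to_string() };
    let vector = Prefixed { prefix: "vector/".to_string() };

    // The same logical key lives in two isolated namespaces.
    lexical.put(&mut kv, "meta", b"postings");
    vector.put(&mut kv, "meta", b"hnsw");

    assert_eq!(lexical.get(&kv, "meta"), Some(&b"postings"[..]));
    assert_eq!(vector.get(&kv, "meta"), Some(&b"hnsw"[..]));
    println!("ok");
}
```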
## Per-Field Dispatch

When a `PerFieldAnalyzer` is provided, the engine dispatches analysis to field-specific analyzers. The same pattern applies to `PerFieldEmbedder`.
```mermaid
graph LR
    PFA["PerFieldAnalyzer"]
    PFA -->|"title"| KA["KeywordAnalyzer"]
    PFA -->|"body"| SA["StandardAnalyzer"]
    PFA -->|"description"| JA["JapaneseAnalyzer"]
    PFA -->|"_id"| KA2["KeywordAnalyzer\n(always)"]
    PFA -->|other fields| DEF["Default Analyzer\n(StandardAnalyzer)"]
```
This allows different fields to use different analysis strategies within the same engine.
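The dispatch pattern amounts to a map from field name to analyzer with a default fallback. The sketch below models it with function pointers; the names and the toy `keyword`/`standard` analyzers are illustrative, not Laurus's actual API.

```rust
use std::collections::HashMap;

// Toy per-field dispatch: field-specific analyzers with a default
// fallback. Illustrative only; not Laurus's real types.
type Analyzer = fn(&str) -> Vec<String>;

/// Keyword analysis: the whole value becomes a single token (exact match).
fn keyword(text: &str) -> Vec<String> {
    vec![text.to_string()]
}

/// Standard analysis (simplified): lowercase and split on whitespace.
fn standard(text: &str) -> Vec<String> {
    text.to_lowercase().split_whitespace().map(str::to_string).collect()
}

struct PerFieldAnalyzer {
    per_field: HashMap<&'static str, Analyzer>,
    default: Analyzer,
}

impl PerFieldAnalyzer {
    fn analyze(&self, field: &str, text: &str) -> Vec<String> {
        // Use the field-specific analyzer if registered, else the default.
        let analyzer = self.per_field.get(field).copied().unwrap_or(self.default);
        analyzer(text)
    }
}

fn main() {
    let pfa = PerFieldAnalyzer {
        per_field: HashMap::from([("_id", keyword as Analyzer)]),
        default: standard,
    };
    assert_eq!(pfa.analyze("_id", "Doc-1"), vec!["Doc-1"]); // exact match
    assert_eq!(pfa.analyze("body", "Hello World"), vec!["hello", "world"]);
    println!("ok");
}
```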
## Summary

| Aspect | Detail |
|---|---|
| Core struct | `Engine` — coordinates all operations |
| Builder | `EngineBuilder` — assembles `Engine` from `Storage` + `Schema` + `Analyzer` + `Embedder` |
| Schema split | Lexical fields → `LexicalIndexConfig`, vector fields → `VectorIndexConfig` |
| Write path | WAL → in-memory buffers → `commit()` → persistent storage |
| Read path | Query → parallel lexical/vector search → fusion → ranked results |
| Storage isolation | `PrefixedStorage` with `lexical/`, `vector/`, `documents/` prefixes |
| Per-field dispatch | `PerFieldAnalyzer` and `PerFieldEmbedder` route to field-specific implementations |
## Next Steps
- Understand field types and schema design: Schema & Fields
- Learn about text analysis: Text Analysis
- Learn about embeddings: Embeddings