Lexical Indexing

Lexical indexing powers keyword-based search. When a document’s text field is indexed, Laurus builds an inverted index — a data structure that maps terms to the documents containing them.

How Lexical Indexing Works

sequenceDiagram
    participant Doc as Document
    participant Analyzer
    participant Writer as IndexWriter
    participant Seg as Segment

    Doc->>Analyzer: "The quick brown fox"
    Analyzer->>Analyzer: Tokenize + Filter
    Analyzer-->>Writer: ["quick", "brown", "fox"]
    Writer->>Writer: Buffer in memory
    Writer->>Seg: Flush to segment on commit()

Step by Step

Analyze: The text passes through the configured analyzer (tokenizer + filters), producing a stream of normalized terms
Buffer: Terms are stored in an in-memory write buffer, organized by field
Commit: On commit(), the buffer is flushed to a new segment on storage

The Inverted Index

An inverted index is essentially a map from terms to document lists:

graph LR
    subgraph "Term Dictionary"
        T1["'brown'"]
        T2["'fox'"]
        T3["'quick'"]
        T4["'rust'"]
    end

    subgraph "Posting Lists"
        P1["doc_1, doc_3"]
        P2["doc_1"]
        P3["doc_1, doc_2"]
        P4["doc_2, doc_3"]
    end

    T1 --> P1
    T2 --> P2
    T3 --> P3
    T4 --> P4

Component	Description
Term Dictionary	Sorted list of all unique terms in the index; supports fast prefix lookup
Posting Lists	For each term, a list of document IDs and metadata (term frequency, positions)
Doc Values	Column-oriented storage for sort/filter operations on numeric and date fields

Posting List Contents

Each entry in a posting list contains:

Field	Description
Document ID	Internal `u64` identifier
Term Frequency	How many times the term appears in this document
Positions (optional)	Where in the document the term appears (needed for phrase queries)
Weight	Score weight for this posting

Numeric and Date Fields

Integer, float, and datetime fields are indexed using a BKD tree — a space-partitioning data structure optimized for range queries:

graph TB
    Root["BKD Root"]
    Root --> L["values < 50"]
    Root --> R["values >= 50"]
    L --> LL["values < 25"]
    L --> LR["25 <= values < 50"]
    R --> RL["50 <= values < 75"]
    R --> RR["values >= 75"]

BKD trees allow efficient evaluation of range queries like price:[10 TO 100] or date:[2024-01-01 TO 2024-12-31].

Geo Fields

Geographic fields store latitude/longitude pairs. They are indexed using a spatial data structure that supports:

Radius queries: find all points within N kilometers of a center point
Bounding box queries: find all points within a rectangular area

Segments

The lexical index is organized into segments. Each segment is an immutable, self-contained mini-index:

graph TB
    LI["Lexical Index"]
    LI --> S1["Segment 0"]
    LI --> S2["Segment 1"]
    LI --> S3["Segment 2"]

    S1 --- F1[".dict (terms)"]
    S1 --- F2[".post (postings)"]
    S1 --- F3[".bkd (numerics)"]
    S1 --- F4[".docs (doc store)"]
    S1 --- F5[".dv (doc values)"]
    S1 --- F6[".meta (metadata)"]
    S1 --- F7[".lens (field lengths)"]

File Extension	Contents
`.dict`	Term dictionary (sorted terms + metadata offsets)
`.post`	Posting lists (document IDs, term frequencies, positions)
`.bkd`	BKD tree data for numeric and date fields
`.docs`	Stored field values (the original document content)
`.dv`	Doc values for sorting and filtering
`.meta`	Segment metadata (doc count, term count, etc.)
`.lens`	Field length norms (for BM25 scoring)

Segment Lifecycle

Create: A new segment is created each time commit() is called
Search: All segments are searched in parallel and results are merged
Merge: Periodically, multiple small segments are merged into larger ones to improve query performance
Delete: When a document is deleted, its ID is added to a deletion bitmap rather than physically removed (see Deletions & Compaction)

BM25 Scoring

Laurus uses the BM25 algorithm to score lexical search results. BM25 considers:

Term Frequency (TF): how often the term appears in the document (more = better, with diminishing returns)
Inverse Document Frequency (IDF): how rare the term is across all documents (rarer = more important)
Field Length Normalization: shorter fields are boosted relative to longer ones

The formula:

score(q, d) = IDF(q) * (TF(q, d) * (k1 + 1)) / (TF(q, d) + k1 * (1 - b + b * |d| / avgdl))

Where k1 = 1.2 and b = 0.75 are the default tuning parameters.

SIMD Optimization

Vector distance calculations leverage SIMD (Single Instruction, Multiple Data) instructions when available, providing significant speedups for similarity computations in vector search.

Code Example

use std::sync::Arc;
use laurus::{Document, Engine, Schema};
use laurus::lexical::TextOption;
use laurus::lexical::core::field::IntegerOption;
use laurus::storage::memory::MemoryStorage;

#[tokio::main]
async fn main() -> laurus::Result<()> {
    let storage = Arc::new(MemoryStorage::new(Default::default()));
    let schema = Schema::builder()
        .add_text_field("title", TextOption::default())
        .add_text_field("body", TextOption::default())
        .add_integer_field("year", IntegerOption::default())
        .build();

    let engine = Engine::builder(storage, schema).build().await?;

    // Index documents
    engine.add_document("doc-1", Document::builder()
        .add_text("title", "Rust Programming")
        .add_text("body", "Rust is a systems programming language.")
        .add_integer("year", 2024)
        .build()
    ).await?;

    // Commit to flush segments to storage
    engine.commit().await?;

    Ok(())
}

Next Steps

Learn how vector indexes work: Vector Indexing
Run queries against the lexical index: Lexical Search

Keyboard shortcuts

Laurus Documentation