IRiS

IRiS is a high-performance search core library written in Rust, designed for Information Retrieval with Semantics.

IRiS provides the foundational mechanisms essential for advanced search capabilities:

  • Lexical search primitives for precise, exact-match retrieval
  • Vector-based similarity search for deep semantic understanding
  • Hybrid scoring and ranking to synthesize multiple signals into coherent results

Rather than functioning as a monolithic search engine, IRiS is architected as a composable search core — a suite of modular building blocks designed to be embedded into applications, extended with custom logic, or orchestrated within distributed systems.

Documentation

Comprehensive documentation is available in the docs/ directory and online at https://mosuka.github.io/iris/.

Features

  • Pure Rust Implementation: Memory-safe and fast performance with zero-cost abstractions.
  • Hybrid Search: Seamlessly combine BM25 lexical search with HNSW vector search using configurable fusion strategies.
  • Multimodal capabilities: Native support for text-to-image and image-to-image search via CLIP embeddings.
  • Rich Query DSL: Term, phrase, boolean, fuzzy, wildcard, range, and geographic queries.
  • Flexible Analysis: Configurable pipelines for tokenization, normalization, and stemming (including CJK support).
  • Pluggable Storage: Interfaces for in-memory, file-system, and memory-mapped storage backends.

Quick Start

use iris::{Document, Engine, FieldOption, FusionAlgorithm, Schema, SearchRequestBuilder};
use iris::analysis::analyzer::standard::StandardAnalyzer;
use iris::lexical::{FieldOption as LexicalFieldOption, TextOption, TermQuery};
use iris::vector::{FlatOption, VectorOption, VectorSearchRequestBuilder};
use iris::storage::{StorageConfig, StorageFactory};
use iris::storage::memory::MemoryStorageConfig;
use std::sync::Arc;

fn main() -> iris::Result<()> {
    // 1. Create storage
    let storage = StorageFactory::create(StorageConfig::Memory(MemoryStorageConfig::default()))?;

    // 2. Define schema with separate lexical and vector fields
    let schema = Schema::builder()
        .add_field("content", FieldOption::Lexical(LexicalFieldOption::Text(TextOption::default())))
        .add_field("content_vec", FieldOption::Vector(VectorOption::Flat(FlatOption { dimension: 384, ..Default::default() })))
        .build();

    // 3. Create engine with analyzer and embedder
    let engine = Engine::builder(storage, schema)
        .analyzer(Arc::new(StandardAnalyzer::default()))
        .embedder(Arc::new(MyEmbedder))  // Your embedder implementation
        .build()?;

    engine.index(
        Document::new_with_id("doc1")
            .add_text("content", "Rust is a systems programming language")
            .add_text("content_vec", "Rust is a systems programming language")
    )?;
    engine.index(
        Document::new_with_id("doc2")
            .add_text("content", "Python is great for machine learning")
            .add_text("content_vec", "Python is great for machine learning")
    )?;
    engine.commit()?;

    // 4. Hybrid search (combines lexical keyword match + semantic similarity)
    let results = engine.search(
        SearchRequestBuilder::new()
            .with_lexical(Box::new(TermQuery::new("content", "programming")))
            .with_vector(VectorSearchRequestBuilder::new().add_text("content_vec", "systems language").build())
            .fusion(FusionAlgorithm::RRF { k: 60.0 })
            .build()
    )?;

    // 5. Display results with document content
    for hit in results {
        if let Ok(Some(doc)) = engine.get_document(hit.doc_id) {
            let id = doc.id().unwrap_or("unknown");
            let content = doc.fields.get("content").and_then(|v| v.as_text()).unwrap_or("");
            println!("[{}] {} (internal_id={}, score={:.4})", id, content, hit.doc_id, hit.score);
        }
    }

    Ok(())
}
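
The code above passes a MyEmbedder placeholder to the builder. The actual Embedder trait is defined by the iris crate (consult the Rustdocs for its exact signature); conceptually, an embedder just turns text (or images) into a fixed-length vector. Purely as a self-contained illustration — not the crate's API — here is a deterministic toy that hashes tokens into a 384-dimensional vector:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for a text embedder: hashes each whitespace token into
/// one of `dim` buckets and L2-normalizes the result. A real embedder
/// (BERT, CLIP, OpenAI) produces semantically meaningful vectors instead.
fn toy_embed(text: &str, dim: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dim];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.to_lowercase().hash(&mut h);
        v[(h.finish() as usize) % dim] += 1.0;
    }
    // L2-normalize so cosine similarity and dot product agree on unit vectors
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}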

Examples

You can find usage examples in the examples/ directory, including examples for query types and embeddings.

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under either of the licenses included in the repository, at your option.

Getting Started

Welcome to the Iris getting started guide. This section is designed to help you try out Iris quickly.

Workflow Overview

Building a search application with Iris typically involves the following steps:

  1. Installation: Adding iris to your project dependencies.
  2. Configuration: Setting up the Engine with Schema and choosing a storage backend (Memory, File, or Mmap).
  3. Indexing: Inserting documents that contain both text (for lexical search) and vectors (for semantic search).
  4. Searching: Executing queries to retrieve relevant results.

In this Section

  • Installation: Learn how to add Iris to your Rust project and configure necessary feature flags (e.g., for different tokenizer support).

Quick Example

For a complete, runnable example of how to set up a Hybrid Search (combining vector and text search), please refer to the Unified Search Example in the repository.

use iris::{Document, Engine, FieldOption, FusionAlgorithm, Schema, SearchRequestBuilder};
use iris::analysis::analyzer::standard::StandardAnalyzer;
use iris::lexical::{FieldOption as LexicalFieldOption, TextOption, TermQuery};
use iris::vector::{FlatOption, VectorOption, VectorSearchRequestBuilder};
use iris::storage::{StorageConfig, StorageFactory};
use iris::storage::memory::MemoryStorageConfig;
use std::sync::Arc;

fn main() -> iris::Result<()> {
    // 1. Create storage
    let storage = StorageFactory::create(StorageConfig::Memory(MemoryStorageConfig::default()))?;

    // 2. Define schema with separate lexical and vector fields
    let schema = Schema::builder()
        .add_field("content", FieldOption::Lexical(LexicalFieldOption::Text(TextOption::default())))
        .add_field("content_vec", FieldOption::Vector(VectorOption::Flat(FlatOption { dimension: 384, ..Default::default() })))
        .build();

    // 3. Create engine with analyzer and embedder
    let engine = Engine::builder(storage, schema)
        .analyzer(Arc::new(StandardAnalyzer::default()))
        .embedder(Arc::new(MyEmbedder))  // Your embedder implementation
        .build()?;

    engine.put_document("doc1",
        Document::new()
            .add_text("content", "Rust is a systems programming language")
            .add_text("content_vec", "Rust is a systems programming language")
    )?;
    engine.put_document("doc2",
        Document::new()
            .add_text("content", "Python is great for machine learning")
            .add_text("content_vec", "Python is great for machine learning")
    )?;
    engine.commit()?;

    // 4. Hybrid search (combines lexical keyword match + semantic similarity)
    let results = engine.search(
        SearchRequestBuilder::new()
            .with_lexical(Box::new(TermQuery::new("content", "programming")))
            .with_vector(VectorSearchRequestBuilder::new().add_text("content_vec", "systems language").build())
            .fusion(FusionAlgorithm::RRF { k: 60.0 })
            .build()
    )?;

    // 5. Display results
    for hit in results {
        println!("[{}] score={:.4}", hit.id, hit.score);
    }

    Ok(())
}

Installation

Add iris to your Cargo.toml:

[dependencies]
iris = "0.1.0"

Feature Flags

Iris provides several feature flags to enable optional functionalities, particularly for embedding generation:

  • embeddings-candle: Enables Hugging Face Candle integration for running models locally.
  • embeddings-openai: Enables OpenAI API integration.
  • embeddings-multimodal: Enables multimodal embedding support (image + text) via Candle.
  • embeddings-all: Enables all embedding features.

Enable a feature in your Cargo.toml, for example:

[dependencies]
# Example: interacting with OpenAI
iris = { version = "0.1.0", features = ["embeddings-openai"] }

Core Concepts

This section details the internal architecture and design philosophy of IRiS, based on its current Rust implementation.

Unified Vector Engine Architecture

At the heart of IRiS is the VectorEngine (src/vector/engine.rs), which acts as the unified coordinator for all search operations. Unlike traditional systems that treat vector search and keyword search as separate silos, IRiS integrates them into a single cohesive system.

Key Components

  1. VectorEngine:
    • Responsibility: Manages the lifecycle of documents, handles persistence (WAL & Snapshots), and coordinates search queries.
    • Unified Indexing: When a document is indexed, VectorEngine splits it into:
      • Vector Data: Stored in field-specific indices (HNSW, Flat, IVF).
      • Lexical Data: Stored in a managed LexicalEngine instance (metadata_index).
    • Implicit Schema: By default, fields are registered dynamically upon insertion (implicit_schema: true), allowing for a schemaless-like experience while maintaining strict typing internally.
  2. LexicalEngine (src/lexical/engine.rs):
    • Role: Serves as an internal component of VectorEngine to handle:
      • Inverted Indices: For Term, Phrase, and Boolean queries.
      • ID Mapping: Maps external string IDs (e.g., “product-123”) to internal 64-bit integer IDs (u64).
      • Metadata Storage: Stores non-vector document fields (JSON-like metadata).
    • Design: Uses a “Near Real-Time” (NRT) architecture with a writer_cache for uncommitted changes and a searcher_cache for committed views.
  3. Storage Abstraction (src/storage.rs):
    • All components interact with data through the Storage trait, enabling seamless switching between backends:
      • MemoryStorage: Pure in-memory usage (great for testing/embedded).
      • FileStorage: Standard disk-based persistence.
      • MmapStorage: Memory-mapped files for high-performance large datasets.

Data Model

IRiS uses a flexible data model centered around the DocumentVector structure.

pub struct DocumentVector {
    /// Vector fields (e.g., "embedding", "image_vec")
    pub fields: HashMap<String, StoredVector>,
    /// Metadata fields (e.g., "title", "category", "_id")
    pub metadata: HashMap<String, String>,
}

  • External ID (_id): Every document has a unique string ID. Internally, this is mapped to a dense u64 ID for performance.
  • Vector Fields: Store high-dimensional vectors. Supported formats include:
    • Flat: Brute-force exact search.
    • HNSW: Hierarchical Navigable Small World graphs for approximate nearest neighbor search.
    • IVF: Inverted File Index for quantized search.
  • Payloads: You can index raw text or images. The engine uses configured Embedders (e.g., CLIP, BERT) to convert these payloads into vectors on the fly.

Hybrid Search & Fusion

One of IRiS’s core strengths is its ability to perform Hybrid Search—combining semantic similarity (Vector) with keyword relevance (Lexical).

Search Flow

  1. Request: The user sends a VectorSearchRequest containing both a query vector and a lexical query (e.g., “find red shoes” + category:"sale").
  2. Parallel Execution:
    • The Vector Searcher scans the HNSW index to find nearest neighbors.
    • The Lexical Searcher scans the Inverted Index to find matching terms.
  3. Fusion: The results are merged using a configurable strategy (FusionAlgorithm):
    • RRF (Reciprocal Rank Fusion): Ranks documents based on their positional rank in each result set. Robust and parameter-free.
    • Weighted Sum: Linearly combines normalized scores (alpha * vector_score + beta * lexical_score).

Persistence & Durability

IRiS ensures data safety through a combination of Write-Ahead Logging (WAL) and Snapshots.

  • WAL (src/vector/index/wal.rs): Every write operation (Upsert/Delete) is appended to a log file immediately. This ensures that even if the process crashes, recent changes can be replayed on startup.
  • Snapshots: Periodically, the in-memory state of the registry and documents is serialized to disk (document_snapshot.bin). This speeds up recovery by avoiding full WAL replay.
  • Commit: Calling commit() forces a flush of all in-memory buffers to persistent storage and rotates the logs.

graph TD
    User[User Update] -->|1. Write| WAL[Write-Ahead Log]
    User -->|2. Update| Mem[In-Memory Index]
    
    subgraph "Persistence"
        WAL
        Snapshot[Snapshot File]
    end
    
    Mem -->|Commit/Flush| Snapshot
    WAL -->|Recovery| Mem
    Snapshot -->|Recovery| Mem

Architecture

Iris is built on a unified modular architecture where the Engine serves as the core orchestrator.

1. Engine (Unified)

The Engine is the primary entry point of the library. It unifies vector similarity search with full-text search capabilities.

  • Orchestration: Manages both VectorStore (HNSW/IVF/Flat index) and LexicalStore (Inverted Index).
  • Hybrid Search: Performs unified queries combining vector similarity and keyword relevance.
  • ID Management: Manages external ID to internal integer ID mapping.

2. LexicalStore (Component)

Operates as a component managed by the Engine, handling full-text search.

  • Inverted Index: Standard posting lists for term lookups.
  • Analyzers: Tokenization and normalization pipeline.
  • Query Parser: Supports boolean, phrase, and structured queries.

3. VectorStore (Component)

Operates as a component managed by the Engine, handling vector similarity search.

  • Vector Index: Supports HNSW, IVF, and Flat index types.
  • Embedder: Automatic text/image to vector embedding.
  • Distance Metrics: Cosine, Euclidean, and DotProduct similarity.

graph TD
    subgraph "Application Layer"
        User[User / App]
        Req[SearchRequest]
    end

    subgraph "Iris Engine"
        E[Engine]

        subgraph "Components"
            VS[VectorStore]
            LS[LexicalStore]
            DS[DocumentStore]
            WAL[Write-Ahead Log]
        end

        Fusion[Result Fusion]
    end

    subgraph "Storage Layer"
        FS[FileStorage / Mmap]
    end

    %% Flows
    User -->|index/search| E
    E --> VS
    E --> LS
    E --> DS
    E --> WAL

    LS --> FS
    VS --> FS
    DS --> FS
    WAL --> FS

    %% Search Flow
    Req --> E
    E -->|Vector Query| VS
    E -->|Keyword Query| LS

    VS -->|Hits| Fusion
    LS -->|Hits| Fusion

    Fusion -->|Unified Results| User

Storage Layer

All components abstract their storage through a Storage trait, allowing seamless switching between:

  • Memory: For testing and ephemeral data.
  • File: For persistent on-disk storage.
  • Mmap: For high-performance memory-mapped file access.

Component Structure

Each store follows a simplified 4-member structure pattern:

pub struct LexicalStore {
    index: Box<dyn LexicalIndex>,
    writer_cache: Mutex<Option<Box<dyn LexicalIndexWriter>>>,
    searcher_cache: RwLock<Option<Box<dyn LexicalIndexSearcher>>>,
    doc_store: Arc<RwLock<UnifiedDocumentStore>>,
}

pub struct VectorStore {
    index: Box<dyn VectorIndex>,
    writer_cache: Mutex<Option<Box<dyn VectorIndexWriter>>>,
    searcher_cache: RwLock<Option<Box<dyn VectorIndexSearcher>>>,
    doc_store: Arc<RwLock<UnifiedDocumentStore>>,
}

This pattern provides:

  • Lazy Initialization: Writers and searchers are created on-demand.
  • Cache Invalidation: Searcher cache is invalidated after commit/optimize.
  • Shared Document Store: Both stores share the same document storage.

Lexical Search

Lexical search matches documents based on exact or approximate keyword matches. It is the traditional “search engine” functionality found in Lucene or Elasticsearch.

Note: In the unified Iris architecture, Lexical Search is handled by the Engine, which orchestrates both lexical (LexicalStore) and vector (VectorStore) components concurrently.

Document Structure

In Iris, a Document is the fundamental unit of indexing. It follows a schema-less design, allowing fields to be added dynamically without defining a schema upfront.

Each Document consists of multiple Fields stored in a Map where the key is the field name. Each Field has a Value and Options defining how it should be indexed.

flowchart LR
    IntID1("Internal ID<br>1") --> Document_Container

    subgraph Document_Container [Document]
        direction TB
        
        ExtID1["Field (External ID)<br>Name: '_id'<br>Value: 'product_123'<br>Type: Text"]
        F11["Field<br>Name: 'title'<br>Value: 'Apple'<br>Type: Text"]
        F12["Field<br>Name: 'price'<br>Value: 10.00<br>Type: Float"]
    end
    
    IntID2("Internal ID<br>2") --> Document_Container2

    subgraph Document_Container2 [Document]
        direction TB
        
        ExtID2["Field (External ID)<br>Name: '_id'<br>Value: 'product_456'<br>Type: Text"]
        F21["Field<br>Name: 'title'<br>Value: 'Orange'<br>Type: Text"]
        F22["Field<br>Name: 'price'<br>Value: 11.00<br>Type: Float"]
    end

Document

The fundamental unit of indexing in Iris.

  • Schema-less: Fields can be added dynamically without a predefined schema.
  • Map Structure: Fields are stored in a HashMap where the key is the field name (String).
  • Flexible: A single document can contain a mix of different field types (Text, Integer, Blob, etc.).

Field

A container representing a single data point within a document.

  • Value: The actual data content (e.g., “Hello World”, 123, true). Defined by FieldValue.
  • Option: Configuration for how this data should be handled (e.g., indexed, stored). Defined by FieldOption.

Field Values

  • Text: UTF-8 string. Typically analyzed and indexed for full-text search.
  • Integer / Float: Numeric values. Used for range queries (BKD Tree) and sorting.
  • Boolean: True/False values.
  • DateTime: UTC timestamps.
  • Geo: Latitude/Longitude coordinates. Indexed in a 2D BKD tree for efficient spatial queries (distance and bounding box) and stored for precise calculations.
  • Blob: Raw byte data with MIME type. Used for storing binary content (images, etc.) or vector source data. Stored only, never indexed by the lexical engine.

Field Options

Configuration for the field defining how it should be indexed and stored.

  • TextOption:
    • indexed: If true, the text is analyzed and added to the inverted index (searchable).
    • stored: If true, the original text is stored in the doc store (retrievable).
    • term_vectors: If true, stores term positions and offsets (needed for highlighting and “More Like This”).
  • IntegerOption / FloatOption:
    • indexed: If true, the value is added to the BKD tree (range searchable).
    • stored: If true, the original value is stored.
  • BooleanOption:
    • indexed: If true, the value is indexed.
    • stored: If true, the original value is stored.
  • DateTimeOption:
    • indexed: If true, the timestamp is added to the BKD tree (range searchable).
    • stored: If true, the original timestamp is stored.
  • GeoOption:
    • indexed: If true, the coordinates are added to the 2D BKD tree (efficient spatial search).
    • stored: If true, the original coordinates are stored.
  • BlobOption:
    • stored: If true, the binary data is stored. Note: Blobs cannot be indexed by the lexical engine.

Indexing Process

The lexical indexing process translates documents into inverted indexes and BKD trees.

graph TD
    subgraph "Indexing Flow"
        Input["Raw Data"] --> DocBuilder["Document Construction"]
        
        subgraph "Processing (InvertedIndexWriter)"
            DocBuilder -->|Text| CharFilter["Char Filter"]
            DocBuilder -->|Numeric/Date/Geo| Normalizer["String Normalizer"]
            DocBuilder -->|Numeric/Date/Geo| PtExt["Point Extractor"]
            DocBuilder -->|Stored Field| StoreProc["Field Values Collector"]
            DocBuilder -->|All Fields| LenTracker["Field Length Tracker"]
            DocBuilder -->|Doc Values| DVTracker["Doc Values Collector"]
            
            subgraph "Analysis Chain"
                CharFilter --> Tokenizer["Tokenizer"]
                Tokenizer --> TokenFilter["Token Filter"]
            end
        end
        
        subgraph "In-Memory Buffering"
            TokenFilter -->|Terms| InvBuffer["Term Posting Index"]
            Normalizer -->|Terms| InvBuffer
            PtExt -->|Points| BkdBuffer["Point Values Buffer"]
            StoreProc -->|Data| DocsBuffer["Stored Docs Buffer"]
        end
        
        subgraph "Segment Flushing (Disk)"
            InvBuffer -->|Write| Postings[".dict / .post"]
            BkdBuffer -->|Sort & Write| BKD[".bkd"]
            DocsBuffer -->|Write| DOCS[".docs"]
            DVTracker -->|Write| DV[".dv"]
            LenTracker -->|Write| LENS[".lens"]
            InvBuffer -.->|Stats| Meta[".meta / .fstats"]
        end
    end

  1. Document Processing:
    • Analysis & Normalization: Text is processed through the Analysis Chain (Char Filter, Tokenizer, Token Filter). Non-text fields are handled by the String Normalizer.
    • Point Extraction: Multidimensional values (Numeric, Date, and Geo) are extracted by the Point Extractor for spatial indexing (BKD Tree).
    • Tracking & Collection: Field Length Tracker and Doc Values Collector gather metadata and columnar data.
  2. In-Memory Buffering:
    • Terms are added to the Term Posting Index.
    • Extracted points and stored fields are staged in the Point Values Buffer and Stored Docs Buffer.
  3. Segment Flushing:
    • Buffered data is periodically sorted and serialized into immutable Segment files on disk.
  4. Merging:
    • A background process automatically merges smaller segments into larger ones to optimize read performance and reclaim space from deleted documents.

Analyzers

Text analysis is the process of converting raw text into tokens. An Analyzer is typically composed of a pipeline:

  1. Char Filters: Transform the raw character stream (e.g., removing HTML tags).
  2. Tokenizer: Splits the character stream into a token stream (e.g., splitting by whitespace).
  3. Token Filters: Modify the token stream (e.g., lowercasing, stemming, removing stop words).

Iris provides several built-in analyzers:

  • StandardAnalyzer: Good default for most European languages.
  • JapaneseAnalyzer: Optimized for Japanese text using Lindera (morphological analysis).
  • KeywordAnalyzer: Treats the entire input as a single token.
  • PipelineAnalyzer: A flexible builder for creating custom analysis pipelines.
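
To make the three stages concrete, here is a dependency-free conceptual sketch (the actual Analyzer traits live in iris::analysis; the filters here are deliberately trivial):

/// Conceptual pipeline: char filter -> tokenizer -> token filters.
fn analyze(raw: &str) -> Vec<String> {
    // 1. Char filter: strip a trivial "HTML tag"-like pattern (illustrative only)
    let filtered: String = raw.replace("<b>", " ").replace("</b>", " ");

    // 2. Tokenizer: split on whitespace
    // 3. Token filters: lowercase, then drop stop words
    const STOP_WORDS: [&str; 3] = ["a", "the", "of"];
    filtered
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

// analyze("The <b>Quick</b> Brown Fox") yields ["quick", "brown", "fox"]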

Core Concepts

Inverted Index

The inverted index is the fundamental structure for full-text search. While a traditional database maps documents to their terms, an inverted index maps terms to the list of documents containing them.

  • Term Dictionary: A sorted repository of all unique terms across the index.
  • Postings Lists: For each term, a list of document IDs (postings) where the term appears, along with frequency and position data for scoring.
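
A minimal in-memory illustration of this structure (conceptual only; Iris's real postings also carry frequencies and positions and live in segment files):

use std::collections::BTreeMap;

/// Sorted term dictionary: term -> list of internal doc IDs containing it.
type InvertedIndex = BTreeMap<String, Vec<u64>>;

fn index_document(index: &mut InvertedIndex, doc_id: u64, tokens: &[&str]) {
    for t in tokens {
        let postings = index.entry(t.to_string()).or_default();
        // Doc IDs arrive in increasing order, so dedup by checking the tail.
        if postings.last() != Some(&doc_id) {
            postings.push(doc_id);
        }
    }
}

fn lookup<'a>(index: &'a InvertedIndex, term: &str) -> &'a [u64] {
    index.get(term).map(Vec::as_slice).unwrap_or(&[])
}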

BKD Tree

For non-textual data like numbers, dates, and geographic coordinates, Iris uses a BKD Tree. It is a multi-dimensional tree structure optimized for block-based storage on disk. Unlike an inverted index, a BKD tree is designed for range search and spatial search. It effectively partitions the data space into hierarchical blocks, allowing the search engine to skip large portions of irrelevant data.

SIMD Optimization

Iris uses SIMD-accelerated batch scoring for high-throughput ranking. The BM25 scoring algorithm is optimized to process multiple documents simultaneously, leveraging modern CPU instructions to provide a several-fold increase in performance compared to scalar processing.

Engine Architecture

LexicalStore

The store component that manages indexing and searching for text data. It coordinates between LexicalIndexWriter and LexicalIndexSearcher. In the unified architecture, LexicalStore operates as a sub-component managed by the Engine, handling the inverted index portions of hybrid documents.

Index Components

  • InvertedIndexWriter: The primary interface for adding documents. It orchestrates analysis, point extraction, and buffering.
  • Segment Manager: Controls the lifecycle and visibility of segments, maintaining the manifest and tracking deletions.
  • In-Memory Buffering: High-performance mapping of terms and staged BKD/Stored data before merging into disk segments.

Index Segment Files

A single segment is composed of several specialized files:

| Extension | Component | Description |
|---|---|---|
| .dict | Term Dictionary | Maps terms to their locations in the postings list. |
| .post | Postings Lists | Stores document IDs, frequencies, and positions for each term. |
| .bkd | BKD Tree | Provides multidimensional indexing for numeric and geospatial fields. |
| .docs | Document Store | Stores the original (stored) field values in a compressed format. |
| .dv | Doc Values | Columnar storage for fast sorting and aggregations. |
| .meta | Segment Metadata | Statistics, document count, and configuration. |
| .lens | Field Lengths | Token counts per field per document (used for scoring). |

Search Process

The search process involves structure-aware traversal and weighted scoring.

graph TD
    subgraph "Search Flow"
        UserQuery["User Query"] --> Parser
        
        subgraph "Searcher"
            Parser["Query Parser"] --> QueryObj["Query"]
            QueryObj --> WeightObj["Weight"]
            WeightObj --> MatcherObj["Matcher"]
            WeightObj --> ScorerObj["Scorer"]
            
            subgraph "Index Access"
                MatcherObj -.->|Look up| II["Inverted Index"]
                MatcherObj -.->|Range Scan| BKD["BKD Tree"]
            end
            
            MatcherObj -->|Doc IDs| CollectorObj["Collector"]
            ScorerObj -->|Scores| CollectorObj
            CollectorObj -.->|Sort by Field| DV["Doc Values"]
            CollectorObj -->|Top Doc IDs| Fetcher["Fetcher"]
            Fetcher -.->|Retrieve Fields| Docs
        end
        
        Fetcher --> Result["Search Results"]
    end

  1. Query Parsing: Translates a human-friendly string or DSL into a structured Query tree.
  2. Weight Creation: Precomputes global statistics (like IDF) to prepare for execution across multiple segments.
  3. Matching & Scoring:
    • Matcher: Navigates the Term Dictionary or BKD Tree to identify document IDs.
    • Scorer: Computes the relevance score (BM25) using precomputed weights and segment-local frequencies.
  4. Collection & Fetching: Aggregates top results into a sorted list and retrieves original field data for the final response.

Query Types

Iris supports a wide range of queries for different information needs.

  • Term Query: Match a single analyzed term exactly.
  • Boolean Query: Logical combinations (MUST, SHOULD, MUST_NOT).
  • Approximate Queries: Fuzzy, Prefix, Wildcard, and Regexp queries.
  • Phrase Query: Matches terms in a specific order with optional “slop”.
  • Numeric Range Query: High-performance range search using the BKD tree.
  • Geospatial Queries: Distance-based or bounding-box search for geographic points.

Scoring (BM25)

Iris uses Okapi BM25 as its default scoring function. It improves results by prioritizing rare terms and normalizing for document length, ensuring that matches in shorter, focused documents are ranked appropriately.
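
For reference, the standard Okapi BM25 score of document $D$ for query $Q$ with terms $q_1, \dots, q_n$ is:

$$ \text{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)} $$

where $f(q_i, D)$ is the term frequency in $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length, and $k_1$, $b$ are the usual free parameters (commonly $k_1 \approx 1.2$, $b \approx 0.75$). The rare-term boost comes from the IDF factor; the length normalization comes from the $b$ term.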

Code Examples

1. Engine Setup

Setting up an engine with a lexical field and default analyzer.

use std::sync::Arc;
use iris::{Engine, Schema};
use iris::analysis::analyzer::standard::StandardAnalyzer;
use iris::lexical::{FieldOption, TextOption};
use iris::storage::{StorageConfig, StorageFactory};
use iris::storage::memory::MemoryStorageConfig;

fn setup_engine() -> iris::Result<Engine> {
    let storage = StorageFactory::create(StorageConfig::Memory(MemoryStorageConfig::default()))?;

    let schema = Schema::builder()
        .add_lexical_field("title", FieldOption::Text(TextOption::default()))
        .add_lexical_field("content", FieldOption::Text(TextOption::default()))
        .build();

    Engine::builder(storage, schema)
        .analyzer(Arc::new(StandardAnalyzer::default()))
        .build()
}

2. Adding Documents

Creating and indexing documents with various field types.

use iris::{DataValue, Document, Engine};

fn add_documents(engine: &Engine) -> iris::Result<()> {
    let doc = Document::new()
        .add_text("title", "Iris Search")
        .add_text("content", "Fast and semantic search engine in Rust")
        .add_field("price", DataValue::Integer(100));

    engine.put_document("doc1", doc)?;
    engine.commit()?; // Flush and commit to make searchable
    Ok(())
}

3. Searching with Term Query

Executing a simple search using a term query.

use iris::{Engine, SearchRequestBuilder};
use iris::lexical::TermQuery;

fn search(engine: &Engine) -> iris::Result<()> {
    let results = engine.search(
        SearchRequestBuilder::new()
            .with_lexical(Box::new(TermQuery::new("content", "rust")))
            .limit(10)
            .build()
    )?;

    for hit in results {
        println!("[{}] Score: {:.4}", hit.id, hit.score);
    }
    Ok(())
}

4. Custom Analyzer Setup

Configuring a Japanese analyzer for specific fields.

use std::sync::Arc;
use iris::{Engine, Schema};
use iris::analysis::analyzer::japanese::JapaneseAnalyzer;
use iris::lexical::{FieldOption, TextOption};
use iris::storage::{StorageConfig, StorageFactory};
use iris::storage::memory::MemoryStorageConfig;

fn setup_japanese_engine() -> iris::Result<Engine> {
    let storage = StorageFactory::create(StorageConfig::Memory(MemoryStorageConfig::default()))?;

    // Configure default analyzer to Japanese
    let analyzer = Arc::new(JapaneseAnalyzer::default());
    let schema = Schema::builder()
        .add_lexical_field("content", FieldOption::Text(TextOption::default()))
        .build();

    Engine::builder(storage, schema)
        .analyzer(analyzer)
        .build()
}

Future Outlook

  • Advanced Scoring Functions: Support for BM25F and custom script-based scoring.
  • Improved NRT (Near-Real-Time): Faster segment flushing and background merging optimizations.
  • Multilingual Support: Integration with more language-specific tokenizers and dictionaries.
  • Tiered Storage: Support for moving older segments to slower/cheaper storage automatically.

Vector Search

Vector search (finding “nearest neighbors”) enables semantic retrieval where matches are based on meaning rather than exact keywords. Iris provides a Unified Engine that combines this semantic search with traditional lexical (keyword) search capabilities.

Document Structure

With the unified engine, a document can contain both vector fields (for semantic search) and lexical fields (for keyword search/filtering).

flowchart LR
    IntID1("Internal ID<br>1") --> DocContainer1_Vec
    IntID1 --> DocContainer1_Lex

    subgraph DocContainer1_Vec [Vector Document]
        direction TB
        subgraph VecField1 [Vector Field]
            direction TB
            F11["Vector Field<br>Name: 'image_vec'<br>Value: [0.12, 0.05, ...]<br>Type: HNSW"]
        end
        subgraph Meta1 [Metadata]
            direction TB
            F12["Metadata Field<br>Name: '_id'<br>Value: 'img_001'"]
            F13["Metadata Field<br>Name: '_mime_type'<br>Value: 'image/jpeg'"]
        end
        VecField1 --> Meta1

        subgraph VecField1_2 [Vector Field]
            direction TB
            F11_2["Vector Field<br>Name: 'text_vec'<br>Value: [0.33, 0.44, ...]<br>Type: HNSW"]
        end
        subgraph Meta1_2 [Metadata]
            direction TB
            F12_2["Metadata Field<br>Name: '_id'<br>Value: 'img_001'"]
            F13_2["Metadata Field<br>Name: '_mime_type'<br>Value: 'text/plain'"]
        end
        VecField1_2 --> Meta1_2
    end

    subgraph DocContainer1_Lex [Lexical Document]
        direction TB
        ExtID1_Lex["Lexical Field (External ID)<br>Name: '_id'<br>Value: 'img_001'<br>Type: Text"]
        L11["Lexical Field<br>Name: 'description'<br>Value: 'A cute cat'<br>Type: Text"]
        L12["Lexical Field<br>Name: 'like'<br>Value: 53<br>Type: Integer"]
    end

    IntID2("Internal ID<br>2") --> DocContainer2_Vec
    IntID2 --> DocContainer2_Lex

    subgraph DocContainer2_Vec [Vector Document]
        direction TB
        subgraph VecField2 [Vector Field]
            direction TB
            F21["Vector Field<br>Name: 'image_vec'<br>Value: [0.88, 0.91, ...]<br>Type: HNSW"]
        end
        subgraph Meta2 [Metadata]
            direction TB
            F22["Metadata Field<br>Name: '_id'<br>Value: 'img_002'"]
            F23["Metadata Field<br>Name: '_mime_type'<br>Value: 'image/jpeg'"]
        end
        VecField2 --> Meta2

        subgraph VecField2_2 [Vector Field]
            direction TB
            F21_2["Vector Field<br>Name: 'text_vec'<br>Value: [0.11, 0.99, ...]<br>Type: HNSW"]
        end
        subgraph Meta2_2 [Metadata]
            direction TB
            F22_2["Metadata Field<br>Name: '_id'<br>Value: 'img_002'"]
            F23_2["Metadata Field<br>Name: '_mime_type'<br>Value: 'text/plain'"]
        end
        VecField2_2 --> Meta2_2
    end

    subgraph DocContainer2_Lex [Lexical Document]
        direction TB
        ExtID2_Lex["Lexical Field (External ID)<br>Name: '_id'<br>Value: 'img_002'<br>Type: Text"]
        L21["Lexical Field<br>Name: 'description'<br>Value: 'A loyal dog'<br>Type: Text"]
        L22["Lexical Field<br>Name: 'like'<br>Value: 42<br>Type: Integer"]
    end

Vector

A mathematical representation of an object (text, image, audio) in a multi-dimensional space.

  • Dimension: The number of elements in the vector (e.g., 384, 768, 1536).
  • Normalization: Vectors can be normalized (e.g., to unit length) to optimize distance calculations.

Vector Field Configuration

Defines how vectors in a specific field are indexed and queried.

  • Distance Metric: The formula used to calculate “similarity” between vectors.
  • Index Type: The algorithm used for storage and retrieval (HNSW, IVF, Flat).
  • Quantization: Compression techniques to reduce memory usage.

Indexing Process

The vector indexing process transforms raw data or pre-computed vectors into efficient, searchable structures.

graph TD
    subgraph "Vector Indexing Flow"
        Input["Raw Input (Text/Image)"] --> Embedder["Embedding Model"]
        Embedder -->|Vector| Norm["Normalization"]
        PreComp["Pre-computed Vector"] --> Norm
        
        subgraph "VectorEngine"
            Norm --> Quant["Quantizer (PQ/SQ)"]
            Quant -->|Quantized| Buffer["In-memory Buffer"]
            Norm -->|Raw| Buffer
            
            subgraph "Index Building"
                Buffer -->|HNSW| GraphBuilder["Graph Builder"]
                Buffer -->|IVF| Clustering["K-Means Clustering"]
                Buffer -->|Flat| ArrayBuilder["Linear Array Builder"]
            end
        end
        
        subgraph "Segment Flushing"
            GraphBuilder -->|Write| HNSWFiles[".hnsw / .vecs"]
            Clustering -->|Write| IVFFiles[".ivf / .vecs"]
            ArrayBuilder -->|Write| FlatFiles[".vecs"]
            Quant -.->|Codebook| QMeta[".quant"]
        end
    end

  1. Vector Acquisition: Vectors are either provided directly or generated from text/images using an Embedder.
  2. Processing:
    • Normalization: Adjusting vectors to a consistent scale (e.g., unit norm for Cosine similarity).
    • Quantization: Optional compression (e.g., Product Quantization) to reduce the memory footprint.
  3. Index Construction:
    • HNSW: Builds a hierarchical graph structure for sub-linear search time.
    • IVF: Clusters vectors into partitions to restrict the search space.
  4. Segment Flushing: Serializes the in-memory structures into immutable files on disk.

Core Concepts

Approximate Nearest Neighbor (ANN)

In large-scale vector search, calculating exact distances to every vector is too slow. ANN algorithms provide high-speed search with a small, controllable loss in accuracy (recall).

Index Types

Flat

Stores all vectors directly in an array and calculates distances between the query and every vector during search.

  • Implementation: FlatIndexWriter, FlatVectorIndexReader
  • Characteristics: 100% precision (Exact Search), but search speed decreases linearly with data volume.
  • Use Cases: Small datasets or as a baseline for ANN precision.

HNSW (Hierarchical Navigable Small World)

Iris’s primary ANN algorithm. It constructs a multi-layered graph where the top layers are sparse (long-distance “express” links) and bottom layers are dense (short-distance local links).

  • Efficiency: Search time is logarithmic $O(\log N)$.
  • Implementation: HnswIndexWriter, HnswIndexReader
  • Parameters: m (links per node) and ef_construction control the trade-off between index quality and build speed.

IVF (Inverted File Index)

Clusters vectors into $K$ Voronoi cells. During search, only the nearest n_probe cells are scanned.

  • Centroids: Calculated during a Training phase using K-Means.
  • Implementation: IvfIndexWriter, IvfIndexReader
  • Use Case: Efficient for extremely large datasets where HNSW memory overhead becomes prohibitive. Works best when combined with PQ quantization.

Distance Metrics

Iris leverages Rust’s SIMD (Single Instruction Multiple Data) instructions to maximize performance for distance calculations.

| Metric | Description | Rust Implementation | Features |
|---|---|---|---|
| Cosine | Measures the angle between vectors. | DistanceMetric::Cosine | Ideal for semantic text similarity. |
| Euclidean | Measures straight-line distance. | DistanceMetric::Euclidean | Suitable for image retrieval and physical proximity. |
| DotProduct | Calculates the dot product. | DistanceMetric::DotProduct | Extremely fast for pre-normalized vectors. |
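
Plain scalar reference versions of the three metrics (Iris itself uses SIMD-accelerated implementations) look like:

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity: dot product divided by the vectors' norms.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let na = dot(a, a).sqrt();
    let nb = dot(b, b).sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot(a, b) / (na * nb) }
}

/// Euclidean (L2) distance: straight-line distance between points.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

Note that for unit-normalized vectors, cosine similarity reduces to the dot product, which is why DotProduct is the fastest choice for pre-normalized embeddings.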

Quantization

To reduce memory usage and improve search speed, Iris supports several quantization methods:

  • Scalar 8-bit (SQ8): Maps 32-bit floats to 8-bit integers (4x compression).
  • Product Quantization (PQ): Decomposes vectors into sub-vectors and clusters each sub-space (16x-64x compression).
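
A minimal sketch of the SQ8 idea — mapping each f32 to a u8 against a min/max range (a real implementation typically trains the range over the dataset and stores it in the codebook rather than computing it per vector):

/// Quantize to 8 bits: each 4-byte component shrinks to 1 byte (4x compression).
fn sq8_encode(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|&x| ((x - min) * scale).round() as u8).collect();
    (codes, min, max)
}

/// Reconstruct approximate f32 values from the 8-bit codes.
fn sq8_decode(codes: &[u8], min: f32, max: f32) -> Vec<f32> {
    let step = if max > min { (max - min) / 255.0 } else { 0.0 };
    codes.iter().map(|&c| min + c as f32 * step).collect()
}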

Engine Architecture

VectorStore

The store component that manages vector indexing and searching. It follows a simplified 4-member structure:

  • index: The underlying vector index (HNSW, IVF, or Flat)
  • writer_cache: Cached writer for write operations
  • searcher_cache: Cached searcher for search operations
  • doc_store: Shared document storage

Index Components

  • VectorIndex: Trait for vector index implementations (HnswIndex, IvfIndex, FlatIndex).
  • VectorIndexWriter: Handles vector insertion and embedding.
  • VectorIndexSearcher: Performs nearest neighbor search.
  • EmbeddingVectorIndexWriter: Wrapper that automatically embeds text/images before indexing.

Index Segment Files

A vector segment consists of several specialized files:

ExtensionComponentDescription
.hnswHNSW GraphAdjacency lists for hierarchical navigation.
.vecsRaw VectorsStored raw floating-point vectors (f32).
.quantCodebookTrained centroids and parameters for quantization.
.idxQuantized IDsCompressed vector representations.
.metaMetadataSegment statistics, dimension, and configuration.

Search Process

Finding the nearest neighbors involves navigating the index structure to minimize distance calculations.

graph TD
    subgraph "Vector Search Flow"
        Query["Query Vector"] --> Quant["Quantization (Encoding)"]
        
        subgraph "Segment Search"
            Quant -->|HNSW| HNSWNav["Graph Navigation"]
            Quant -->|IVF| CentroidScan["Nearest Centroid Probe"]
            
            HNSWNav -->|Top-K| ResBuffer["Candidate Buffer"]
            CentroidScan -->|Top-K| ResBuffer
        end
        
        ResBuffer -->|Re-ranking| Refine["Precision Scoring (Raw Vectors)"]
        Refine --> Final["Sorted Hits"]
    end

  1. Preparation: The query vector is normalized and/or quantized to match the index format.
  2. Navigation:
    • In HNSW, the search starts at the top layer and descends toward the target vector through graph neighbors.
    • In IVF, the nearest cluster centroids are identified, and search is restricted to those cells.
  3. Refinement: (Optional) If quantization was used, raw vectors may be accessed to re-rank the top candidates for higher precision.

Query Types

K-NN Search (K-Nearest Neighbors)

The basic vector search query.

  • Parameters: K (the number of neighbors to return).
  • Recall vs. Speed: Adjusted via search parameters like ef_search for HNSW.

Filtered Search

Combines vector search with boolean filters. Iris supports pre-filtering using metadata filters (backed by LexicalEngine) to restrict the search space to documents matching specific metadata criteria.

Hybrid Search

Leverages both the Lexical and Vector engines simultaneously. Results are combined using algorithms like Reciprocal Rank Fusion (RRF) to produce a single, high-quality ranked list.

Fusion Strategies

Results from the vector and lexical searches are combined using fusion strategies.

  1. Weighted Sum: Scores are normalized and combined using linear weights. FinalScore = (LexicalScore * alpha) + (VectorScore * beta)

  2. RRF (Reciprocal Rank Fusion): Calculates scores based on rank position, robust to different score distributions. Score = Σ_i (1 / (k + rank_i))
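
As a compact sketch of RRF over ranked ID lists (a hit at 1-based rank r contributes 1/(k + r); k = 60 is the conventional default):

use std::collections::HashMap;

/// Fuse ranked result lists with Reciprocal Rank Fusion.
/// Each list is ordered best-first; ranks are 1-based.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Sort descending by fused score.
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

// rrf_fuse(&[vec!["a", "b"], vec!["b", "c"]], 60.0) ranks "b" first,
// because it appears in both result lists.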

Search Process for Hybrid Queries

graph TD
    Query["SearchRequest"] --> Engine["Engine"]
    Engine -->|Lexical Query| LexSearch["LexicalStore"]
    Engine -->|Vector Query| VecSearch["VectorStore"]

    LexSearch --> LexHits["Lexical Hits"]
    VecSearch --> VecHits["Vector Hits"]

    LexHits --> Fusion["Result Fusion"]
    VecHits --> Fusion

    Fusion --> Combine["Score Combination"]
    Combine --> TopDocs["Final Top Results"]

Code Examples

1. Engine Setup

Example of creating an engine with an embedder and vector field configurations.

use std::sync::Arc;
use iris::{Engine, Schema};
use iris::vector::{DistanceMetric, HnswOption, VectorOption};
use iris::storage::{StorageConfig, StorageFactory};
use iris::storage::memory::MemoryStorageConfig;

fn setup_engine() -> iris::Result<Engine> {
    let storage = StorageFactory::create(StorageConfig::Memory(MemoryStorageConfig::default()))?;

    let schema = Schema::builder()
        .add_vector_field(
            "embedding",
            VectorOption::Hnsw(HnswOption {
                dimension: 384,
                distance: DistanceMetric::Cosine,
                m: 16,
                ef_construction: 200,
                ..Default::default()
            }),
        )
        .build();

    Engine::builder(storage, schema)
        .embedder(Arc::new(MyEmbedder))  // Your embedder implementation
        .build()
}

2. Adding Documents

Example of indexing a document with text that gets automatically embedded.

use iris::{DataValue, Document, Engine};

fn add_document(engine: &Engine) -> iris::Result<()> {
    // Text is automatically embedded by the configured embedder
    let doc = Document::new()
        .add_text("embedding", "Fast semantic search in Rust")
        .add_field("category", DataValue::Text("technology".into()));

    engine.put_document("doc_001", doc)?;
    engine.commit()?;

    Ok(())
}

3. Vector Search

Example of performing a search using VectorSearchRequestBuilder.

use iris::{Engine, SearchRequestBuilder};
use iris::vector::VectorSearchRequestBuilder;

fn search(engine: &Engine) -> iris::Result<()> {
    let results = engine.search(
        SearchRequestBuilder::new()
            .with_vector(
                VectorSearchRequestBuilder::new()
                    .add_text("embedding", "semantic search")
                    .build()
            )
            .limit(10)
            .build()
    )?;

    for hit in results {
        println!("[{}] Score: {:.4}", hit.id, hit.score);
    }

    Ok(())
}

4. Hybrid Search

Example of combining vector and keyword search. Note that vector and lexical searches use separate fields.

use iris::{Engine, FusionAlgorithm, SearchRequestBuilder};
use iris::lexical::TermQuery;
use iris::vector::VectorSearchRequestBuilder;

fn hybrid_search(engine: &Engine) -> iris::Result<()> {
    let results = engine.search(
        SearchRequestBuilder::new()
            // Vector search (semantic) on vector field
            .with_vector(
                VectorSearchRequestBuilder::new()
                    .add_text("content_vec", "fast semantic search")
                    .build()
            )
            // Lexical search (keyword) on lexical field
            .with_lexical(Box::new(TermQuery::new("content", "rust")))
            // Fusion strategy
            .fusion(FusionAlgorithm::RRF { k: 60.0 })
            .limit(10)
            .build()
    )?;

    for hit in results {
        println!("[{}] score={:.4}", hit.id, hit.score);
    }

    Ok(())
}

5. Weighted Sum Fusion

Example using weighted sum fusion for fine-grained control.

use iris::{Engine, FusionAlgorithm, SearchRequestBuilder};
use iris::lexical::TermQuery;
use iris::vector::VectorSearchRequestBuilder;

fn weighted_hybrid_search(engine: &Engine) -> iris::Result<()> {
    let results = engine.search(
        SearchRequestBuilder::new()
            .with_vector(
                VectorSearchRequestBuilder::new()
                    .add_text("content_vec", "machine learning")
                    .build()
            )
            .with_lexical(Box::new(TermQuery::new("content", "python")))
            .fusion(FusionAlgorithm::WeightedSum {
                vector_weight: 0.7,  // 70% semantic
                lexical_weight: 0.3, // 30% keyword
            })
            .limit(10)
            .build()
    )?;

    for hit in results {
        println!("[{}] score={:.4}", hit.id, hit.score);
    }

    Ok(())
}

Future Outlook

  • Full Implementation of Product Quantization (PQ): PQ clustering is currently a placeholder; a fully optimized implementation is planned.
  • GPU Acceleration: Offloading distance calculations to GPUs, in addition to model inference.
  • Disk-ANN Support: Mechanisms to efficiently search large indexes stored on SSDs when they exceed memory capacity.

Advanced Features

ID Management

Iris uses a dual-tiered ID management strategy to ensure efficient document retrieval, updates, and aggregation in distributed environments.

1. External ID (String)

The External ID is a logical identifier used by users and applications to uniquely identify a document.

  • Type: String
  • Role: You can use any unique value, such as UUIDs, URLs, or database primary keys.
  • Storage: Persisted transparently as a reserved system field name _id within the Lexical Index.
  • Uniqueness: Expected to be unique across the entire system.
  • Updates: Indexing a document with an existing external_id triggers an automatic “Delete-then-Insert” (Upsert) operation, replacing the old version with the new one.

2. Internal ID (u64 / Stable ID)

The Internal ID is a physical handle used internally by Iris’s engines (Lexical and Vector) for high-performance operations.

  • Type: Unsigned 64-bit Integer (u64)
  • Role: Used for bitmap operations, point references, and routing between distributed nodes.
  • Immutability (Stable): Once assigned, an Internal ID never changes due to index merges (segment compaction) or restarts. This prevents inconsistencies in deletion logs and caches.

ID Structure (Shard-Prefixed)

Iris employs a Shard-Prefixed Stable ID scheme designed for multi-node distributed environments.

| Bit Range | Name | Description |
|---|---|---|
| Bits 48-63 | Shard ID | Prefix identifying the node or partition (up to 65,535 shards). |
| Bits 0-47 | Local ID | Monotonically increasing document number within a shard (up to ~281 trillion documents). |

Why this structure?

  1. Zero-Cost Aggregation: Since u64 IDs are globally unique, the aggregator can perform fast sorting and deduplication without worrying about ID collisions between nodes.
  2. Fast Routing: The aggregator can immediately identify the physical node responsible for a document just by looking at the upper bits, avoiding expensive hash lookups.
  3. High-Performance Fetching: Internal IDs map directly to physical data structures. This allows Iris to skip the “External-to-Internal ID” conversion step during retrieval, achieving O(1) access speed.
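
The bit layout above can be packed and unpacked with simple shifts and masks. A sketch with illustrative helper names (the actual functions in the codebase may differ):

const LOCAL_BITS: u32 = 48;
const LOCAL_MASK: u64 = (1u64 << LOCAL_BITS) - 1;

/// Pack a 16-bit shard ID (bits 48-63) with a 48-bit local ID (bits 0-47).
fn make_internal_id(shard_id: u16, local_id: u64) -> u64 {
    debug_assert!(local_id <= LOCAL_MASK, "local ID overflow");
    ((shard_id as u64) << LOCAL_BITS) | local_id
}

/// Route a result to its owning node by reading the upper 16 bits.
fn shard_of(internal_id: u64) -> u16 {
    (internal_id >> LOCAL_BITS) as u16
}

fn local_of(internal_id: u64) -> u64 {
    internal_id & LOCAL_MASK
}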

ID Lifecycle

  1. Registration (engine.index()): User provides a document with an External ID.
  2. ID Assignment: The Engine combines the current shard_id with a new Local ID to issue a Shard-Prefixed Internal ID.
  3. Mapping: The engine maintains the relationship between the External ID and the new Internal ID.
  4. Search: Search results return the u64 Internal ID for efficiency.
  5. Retrieval/Deletion: While the user-facing API accepts External IDs for convenience, the engine internally converts them to Internal IDs for near-instant processing.

Persistence & WAL

To ensure data durability and fast recovery, Iris implements a Write-Ahead Log (WAL) system.

Write-Ahead Log (WAL)

  • All incoming write operations (Add, Delete) are immediately appended to a disk-based log file.
  • This happens before memory structures (like HNSW graph or Inverted Index) are updated.
  • In case of a crash, Iris replays the WAL on startup to restore the in-memory state.
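
Conceptually, the WAL is just an append-only file of operations that is replayed in order on startup. A simplified sketch of the idea (Iris's actual on-disk format lives in src/vector/index/wal.rs and differs from this):

use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};

/// Append one operation per line, e.g. "ADD <id>" or "DEL <id>".
fn wal_append(path: &str, op: &str, doc_id: &str) -> std::io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;
    // The append (and, in a real system, an fsync) happens BEFORE the
    // in-memory index is touched -- that is what makes it "write-ahead".
    writeln!(log, "{op} {doc_id}")
}

/// Replay the log on startup to rebuild the in-memory state.
fn wal_replay(path: &str, mut apply: impl FnMut(&str, &str)) -> std::io::Result<()> {
    for line in BufReader::new(File::open(path)?).lines() {
        let line = line?;
        if let Some((op, id)) = line.split_once(' ') {
            apply(op, id);
        }
    }
    Ok(())
}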

Segments

Indexes can be split into segments (though currently, the implementation focuses on a global segment model with potential for expansion).

  • Large indexes are easier to manage as sets of smaller, immutable segments that are periodically merged.

Checkpointing

Currently, explicit commits flush the in-memory state to durable index files.

engine.commit()?;  // Flush and persist all changes

Deletions & Compaction

Logical Deletion

When a document is deleted:

  1. It is not immediately removed from the physical files.
  2. Its ID is added to a Deletion Bitmap.
  3. Subsequent searches check this bitmap and filter out deleted IDs from results.
  4. This makes deletion a fast O(1) operation.
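
As a sketch, a deletion bitmap can be as simple as one bit per internal document ID (real implementations often use compressed bitmaps such as Roaring):

/// One bit per local document ID; a set bit means "logically deleted".
struct DeletionBitmap {
    bits: Vec<u64>,
}

impl DeletionBitmap {
    fn new(capacity: usize) -> Self {
        Self { bits: vec![0; (capacity + 63) / 64] }
    }

    /// O(1): flip the document's bit instead of rewriting index files.
    fn delete(&mut self, doc_id: usize) {
        self.bits[doc_id / 64] |= 1u64 << (doc_id % 64);
    }

    /// O(1): searches call this to filter deleted IDs out of results.
    fn is_deleted(&self, doc_id: usize) -> bool {
        self.bits[doc_id / 64] & (1u64 << (doc_id % 64)) != 0
    }
}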

Physical Deletion (Compaction)

Over time, deleted documents accumulate and waste space.

  • Compaction (Vacuuming) is the process of rewriting the index files to exclude logically deleted data.
  • It rebuilds the HNSW graph or Inverted Index segments without the deleted entries.
  • This is an expensive operation and should be run periodically (e.g., nightly).

// Example of triggering manual compaction
engine.optimize()?;

API Reference

For detailed API documentation, please refer to the auto-generated Rustdocs.

You can generate them locally by running:

cargo doc --open