Schema Format Reference
The schema file defines the structure of your index — what fields exist, their types, and how they are indexed. Laurus uses TOML format for schema files.
Overview
A schema consists of three top-level elements:
# Policy for fields not declared below. Optional — defaults to "dynamic".
dynamic_field_policy = "dynamic"
# Fields to search by default when a query does not specify a field.
default_fields = ["title", "body"]
# Field definitions. Each field has a name and a typed configuration.
[fields.<field_name>.<FieldType>]
# ... type-specific options
- `dynamic_field_policy`: How the engine treats fields present in an ingested document but absent from this schema. Accepted values: `"strict"`, `"dynamic"`, `"ignore"`. Defaults to `"dynamic"`. See Dynamic Schema for the full semantics and the warning about silent truncation under `"dynamic"`.
- `default_fields`: A list of field names used as default search targets by the Query DSL. Only lexical fields (Text, Integer, Float, etc.) can be default fields. This key is optional and defaults to an empty list.
- `fields`: A map of field names to their typed configuration. Each field must specify exactly one field type.
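The accepted values and the default for `dynamic_field_policy` can be modeled as a small enum. This is a minimal sketch only: `DynamicFieldPolicy` and `resolve_policy` are hypothetical names, not the laurus API, and case-sensitive matching is an assumption.

```rust
use std::str::FromStr;

/// Hypothetical model of the three accepted policy values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum DynamicFieldPolicy {
    Strict,
    /// Used when the key is omitted from the schema file.
    #[default]
    Dynamic,
    Ignore,
}

impl FromStr for DynamicFieldPolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: values are matched case-sensitively.
        match s {
            "strict" => Ok(Self::Strict),
            "dynamic" => Ok(Self::Dynamic),
            "ignore" => Ok(Self::Ignore),
            other => Err(format!("unknown dynamic_field_policy: {other}")),
        }
    }
}

/// Resolves an optional raw value, falling back to the documented default.
fn resolve_policy(raw: Option<&str>) -> Result<DynamicFieldPolicy, String> {
    match raw {
        Some(s) => s.parse(),
        None => Ok(DynamicFieldPolicy::default()),
    }
}
```

Calling `resolve_policy(None)` yields the `Dynamic` variant, mirroring a schema file that omits the key.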
Field Naming
- Field names are arbitrary strings (e.g., `title`, `body_vec`, `created_at`).
- Field names starting with `_` are reserved for the engine. The only allow-listed name is `_id` (managed automatically). Attempting to declare any other `_`-prefixed field results in an error.
- Field names must be unique within a schema.
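The naming rules above can be expressed as a short pre-flight check. This is an illustrative sketch: `validate_field_names` is a hypothetical helper, not a laurus function, and the error messages are invented for the example.

```rust
use std::collections::HashSet;

/// Hypothetical helper: checks a list of declared field names against the
/// naming rules (reserved `_` prefix, `_id` allow-listed, names unique).
fn validate_field_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for &name in names {
        // `_`-prefixed names are reserved; `_id` is the only allow-listed one.
        if name.starts_with('_') && name != "_id" {
            return Err(format!("field name '{name}' is reserved for the engine"));
        }
        // `insert` returns false when the name was already present.
        if !seen.insert(name) {
            return Err(format!("duplicate field name '{name}'"));
        }
    }
    Ok(())
}
```

For example, `validate_field_names(&["title", "body_vec"])` succeeds, while a list containing `_score` is rejected.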
Field Types
Fields fall into two categories: Lexical (for keyword/full-text search) and Vector (for similarity search). A single field cannot be both.
Lexical Fields
Text
Full-text searchable field. Text is processed by the analysis pipeline (tokenization, normalization, stemming, etc.).
[fields.title.Text]
indexed = true # Whether to index this field for search
stored = true # Whether to store the original value for retrieval
term_vectors = false # Whether to store term positions (for phrase queries, highlighting)
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables searching this field |
| `stored` | bool | true | Stores the original value so it can be returned in results |
| `term_vectors` | bool | true | Stores term positions for phrase queries, highlighting, and more-like-this |
Integer
64-bit signed integer field. Supports range queries and exact match.
[fields.year.Integer]
indexed = true
stored = true
multi_valued = false
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables range and exact-match queries |
| `stored` | bool | true | Stores the original value |
| `multi_valued` | bool | false | Accept arrays of integers; range queries match if any value satisfies the predicate (Lucene-style “any match” with constant scoring) |
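The "any match" rule for multi-valued fields can be sketched in a few lines. `matches_range_any` is a hypothetical function for illustration, and inclusive bounds are an assumption; the engine's actual bound handling is not specified here.

```rust
/// Hypothetical illustration of "any match": a document matches when at
/// least one of its values lies in the range. Bounds are assumed inclusive.
fn matches_range_any(values: &[i64], min: i64, max: i64) -> bool {
    values.iter().any(|&v| min <= v && v <= max)
}
```

A document with `year = [1990, 2005]` would match the range 2000 to 2010 because one value (2005) satisfies it. Under constant scoring, the number of matching values does not change the score.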
Float
64-bit floating point field. Supports range queries.
[fields.rating.Float]
indexed = true
stored = true
multi_valued = false
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables range queries |
| `stored` | bool | true | Stores the original value |
| `multi_valued` | bool | false | Accept arrays of floats; range queries match if any value satisfies the predicate (Lucene-style “any match” with constant scoring) |
Boolean
Boolean field (true / false).
[fields.published.Boolean]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables filtering by boolean value |
| `stored` | bool | true | Stores the original value |
DateTime
UTC timestamp field. Supports range queries.
[fields.created_at.DateTime]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables range queries on date/time |
| `stored` | bool | true | Stores the original value |
Geo
Geographic point field (latitude/longitude). Supports radius and bounding box queries.
[fields.location.Geo]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables geo queries (radius, bounding box) |
| `stored` | bool | true | Stores the original value |
Geo3d
3D Earth-Centered Earth-Fixed (ECEF) Cartesian point field (x / y / z in meters). Supports the geo3d_distance (sphere), geo3d_bbox (3D AABB), and geo3d_nearest (k-NN) queries. See 3D Geographic Search (ECEF) for the coordinate system and the wgs84_to_ecef / ecef_to_wgs84 conversion utilities.
[fields.position.Geo3d]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| `indexed` | bool | true | Enables 3D geo queries (`geo3d_distance`, `geo3d_bbox`, `geo3d_nearest`) |
| `stored` | bool | true | Stores the original (x, y, z) value |
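For readers unfamiliar with ECEF, the standard WGS84 geodetic-to-ECEF conversion looks roughly like this. The function below is a standalone sketch named `wgs84_to_ecef_sketch` to make clear it is not the laurus `wgs84_to_ecef` utility, whose exact signature is not shown in this reference.

```rust
/// WGS84 semi-major axis in meters.
const WGS84_A: f64 = 6_378_137.0;
/// WGS84 flattening.
const WGS84_F: f64 = 1.0 / 298.257_223_563;

/// Converts geodetic latitude/longitude (degrees) and ellipsoidal height
/// (meters) to ECEF (x, y, z) in meters using the standard closed form.
fn wgs84_to_ecef_sketch(lat_deg: f64, lon_deg: f64, height_m: f64) -> (f64, f64, f64) {
    // First eccentricity squared: e^2 = f * (2 - f).
    let e2 = WGS84_F * (2.0 - WGS84_F);
    let (lat, lon) = (lat_deg.to_radians(), lon_deg.to_radians());
    // Prime vertical radius of curvature at this latitude.
    let n = WGS84_A / (1.0 - e2 * lat.sin().powi(2)).sqrt();
    let x = (n + height_m) * lat.cos() * lon.cos();
    let y = (n + height_m) * lat.cos() * lon.sin();
    let z = (n * (1.0 - e2) + height_m) * lat.sin();
    (x, y, z)
}
```

A point on the equator at the prime meridian with zero height maps to approximately (6378137, 0, 0), which is the kind of value a `Geo3d` field stores.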
Bytes
Raw binary data field. Not indexed — stored only.
[fields.thumbnail.Bytes]
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| `stored` | bool | true | Stores the binary data |
Vector Fields
Vector fields are indexed for approximate nearest neighbor (ANN) search. They require a dimension (the length of each vector) and a distance metric.
Hnsw
Hierarchical Navigable Small World graph index. Best for most use cases — offers a good balance of speed and recall.
[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| `dimension` | integer | 128 | Vector dimensionality (must match your embedding model) |
| `distance` | string | "Cosine" | Distance metric (see Distance Metrics) |
| `m` | integer | 16 | Max bi-directional connections per node. Higher = better recall, more memory |
| `ef_construction` | integer | 200 | Search width during index construction. Higher = better quality, slower build |
| `base_weight` | float | 1.0 | Scoring weight in hybrid search fusion |
| `quantizer` | object | "Scalar8Bit" | Quantization method (see Quantization). Mandatory; default keeps the int8 format introduced in Issue #481 Stage 1. |
| `rerank_storage` | string | (omit) | Optional Stage 2 rerank sidecar (see Rerank Storage). "F32" enables a per-field f32 sidecar so search can rescore int8 candidates against the original vectors. Omit to keep Stage 1 int8-only behavior. |
Tuning guidelines:
- `m`: 12–48 is typical. Use higher values for higher-dimensional vectors.
- `ef_construction`: 100–500. Higher values produce a better graph but increase build time.
- `dimension`: Must exactly match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base, 1536 for text-embedding-3-small).
Flat
Brute-force linear scan index. Provides exact results with no approximation. Best for small datasets (< 10,000 vectors).
[fields.embedding.Flat]
dimension = 384
distance = "Cosine"
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| `dimension` | integer | 128 | Vector dimensionality |
| `distance` | string | "Cosine" | Distance metric (see Distance Metrics) |
| `base_weight` | float | 1.0 | Scoring weight in hybrid search fusion |
| `quantizer` | object | "Scalar8Bit" | Quantization method (see Quantization). Mandatory; default keeps the int8 format introduced in Issue #481 Stage 1. |
| `rerank_storage` | string | (omit) | Reserved for Rerank Storage. Currently emitted only by the HNSW writer; Flat / IVF accept the field for schema symmetry but do not yet write or consume the sidecar. |
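Conceptually, a brute-force index compares the query against every stored vector and keeps the closest `k`. The sketch below illustrates that idea with squared Euclidean distance over plain `f32` vectors; `flat_search` is a hypothetical function, and it ignores quantization and the engine's actual storage layout.

```rust
/// Squared Euclidean distance (monotonic in L2, so ranking is unchanged).
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical linear scan: scores every vector, then returns the `k`
/// nearest as (index, distance) pairs, closest first.
fn flat_search(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, squared_l2(v, query)))
        .collect();
    // total_cmp gives a total order on f32, so the sort cannot panic on NaN.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Because every vector is scored, cost grows linearly with the number of vectors, which is why this index type is suggested only for small datasets.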
Ivf
Inverted File Index. Clusters vectors and searches only a subset of clusters. Suitable for very large datasets.
[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 100
n_probe = 1
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| `dimension` | integer | (required) | Vector dimensionality |
| `distance` | string | "Cosine" | Distance metric (see Distance Metrics) |
| `n_clusters` | integer | 100 | Number of clusters. More clusters = finer partitioning |
| `n_probe` | integer | 1 | Number of clusters to search at query time. Higher = better recall, slower |
| `base_weight` | float | 1.0 | Scoring weight in hybrid search fusion |
| `quantizer` | object | "Scalar8Bit" | Quantization method (see Quantization). Mandatory; default keeps the int8 format introduced in Issue #481 Stage 1. |
| `rerank_storage` | string | (omit) | Reserved for Rerank Storage. Currently emitted only by the HNSW writer; Flat / IVF accept the field for schema symmetry but do not yet write or consume the sidecar. |
Note: Unlike Hnsw and Flat, the `dimension` field in Ivf is required and has no default value.
Tuning guidelines:
- `n_clusters`: A common heuristic is `sqrt(N)`, where N is the total number of vectors.
- `n_probe`: Start with 1 and increase until recall is acceptable. Typical range is 1–20.
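The `sqrt(N)` heuristic is simple arithmetic. A small helper such as the one below could produce a starting value; `suggested_n_clusters` is a hypothetical name, and rounding to the nearest integer with a floor of 1 is a choice made for this sketch.

```rust
/// Hypothetical helper: starting point for `n_clusters` via sqrt(N),
/// rounded to the nearest integer and never less than 1.
fn suggested_n_clusters(total_vectors: u64) -> u64 {
    ((total_vectors as f64).sqrt().round() as u64).max(1)
}
```

For 1,000,000 vectors this suggests `n_clusters = 1000`. Treat the result as a first guess to validate against measured recall and latency.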
Distance Metrics
The distance option for vector fields accepts the following values:
| Value | Description | Use When |
|---|---|---|
| "Cosine" | Cosine distance (1 - cosine similarity). Default. | Normalized text/image embeddings |
| "Euclidean" | L2 (Euclidean) distance | Spatial data, non-normalized vectors |
| "Manhattan" | L1 (Manhattan) distance | Sparse feature vectors |
| "DotProduct" | Dot product (higher = more similar) | Pre-normalized vectors, or vectors whose magnitude carries meaning |
| "Angular" | Angular distance | Similar to cosine, but based on angle |
For most embedding models (BERT, Sentence Transformers, OpenAI, etc.), "Cosine" is the correct choice.
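To make the differences concrete, here is a minimal sketch of the textbook definitions behind each metric. These are not the laurus implementations. In particular, the `angular_distance` normalization (angle divided by π) is an assumption, and zero-length vectors are not handled.

```rust
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn norm(a: &[f32]) -> f32 {
    dot_product(a, a).sqrt()
}

/// Undefined (NaN) when either vector has zero length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot_product(a, b) / (norm(a) * norm(b))
}

/// 1 - cosine similarity, as described in the table above.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    1.0 - cosine_similarity(a, b)
}

fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

fn manhattan_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Assumed definition: the angle between the vectors, scaled to [0, 1].
fn angular_distance(a: &[f32], b: &[f32]) -> f32 {
    cosine_similarity(a, b).clamp(-1.0, 1.0).acos() / std::f32::consts::PI
}
```

On unit-length vectors, `dot_product` and `cosine_similarity` return the same value, which is why the two metrics rank normalized embeddings identically.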
Quantization
Vector fields are stored on disk as 8-bit scalar-quantized integers
(Issue #481 Stage 1). Quantization is mandatory; the previous “no
quantization” mode no longer exists. The quantizer option defaults to
Scalar8Bit and can be omitted from TOML.
Scalar 8-bit (default)
Per-segment global affine quantization to u8. Compresses each f32
component to a single byte (~4x memory reduction) with negligible
recall loss in practice.
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
# quantizer = "Scalar8Bit" # implicit default; can be omitted
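The idea behind affine quantization to `u8` can be sketched as follows: one `(min, scale)` pair is fitted over a set of values and every component is mapped onto 0 to 255. This is an illustration of the general technique under that assumption, not the engine's on-disk format; `AffineParams`, `fit_params`, `quantize`, and `dequantize` are hypothetical names.

```rust
/// Hypothetical affine parameters shared by all values in one segment.
struct AffineParams {
    min: f32,
    scale: f32,
}

/// Fits parameters over a non-empty slice of components.
fn fit_params(values: &[f32]) -> AffineParams {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a zero range so the division below stays finite.
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    AffineParams { min, scale }
}

/// Maps an f32 component to one byte.
fn quantize(v: f32, p: &AffineParams) -> u8 {
    ((v - p.min) / p.scale).round().clamp(0.0, 255.0) as u8
}

/// Recovers an approximation of the original component.
fn dequantize(q: u8, p: &AffineParams) -> f32 {
    p.min + q as f32 * p.scale
}
```

Each component shrinks from 4 bytes to 1, and the round-trip error for in-range values is at most about half of `scale`.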
Product Quantization (reserved)
Reserved for Issue #481 Stage 3. Currently the writer / searcher
return NotImplemented if selected; the variant is kept here so
schemas can pre-declare without further TOML changes once Stage 3
lands.
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
[fields.embedding.Hnsw.quantizer.ProductQuantization]
subvector_count = 48
| Option | Type | Description |
|---|---|---|
| `subvector_count` | integer | Number of subvectors. Must evenly divide `dimension`. |
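The divisibility constraint is easy to check ahead of time. In the example above, 384 / 48 gives subvectors of 8 components each. `subvector_len` below is a hypothetical helper for illustration only.

```rust
/// Hypothetical check: returns the per-subvector length when
/// `subvector_count` evenly divides `dimension`, otherwise `None`.
fn subvector_len(dimension: usize, subvector_count: usize) -> Option<usize> {
    if subvector_count == 0 || dimension % subvector_count != 0 {
        None
    } else {
        Some(dimension / subvector_count)
    }
}
```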
Breaking change (Issue #481 Stage 1): schemas that explicitly set `quantizer` to a “none” value are no longer valid. Existing vector indexes built with a pre-Stage-1 laurus build cannot be read; rebuild from source data after upgrading.
Rerank Storage
Optional Stage 2 sidecar (Issue #481) that keeps the original
full-precision vectors alongside the int8 segment so the HNSW
searcher can do a wide candidate fetch over int8 (cheap) and then
rescore the top `top_k * rerank_factor` candidates against the
exact f32 values (accurate).
The sidecar is configured per field with rerank_storage:
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
rerank_storage = "F32" # opt-in; omit for Stage 1 int8-only behavior
| Value | On-disk overhead | Description |
|---|---|---|
| "F32" | +4 bytes/dim per vector | IEEE-754 single-precision sidecar (Lucene 99 / FAISS convention). |
When omitted, no sidecar is written and the field stays on the
Stage 1 int8-only search path. Queries that pass rerank_factor
against a field without rerank_storage silently fall back to
Stage 1 ranking — the searcher cannot recover f32 information that
was discarded at index time.
Scope: Stage 2 lands HNSW only. Flat / IVF accept the field for schema symmetry but currently neither emit nor consume the sidecar.
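The two-stage flow can be sketched as follows. This is a simplified illustration, not the laurus searcher: Stage 1 here is a brute-force scan over approximate (dequantized) vectors standing in for the HNSW traversal, and `two_stage_search` and its parameters are hypothetical names.

```rust
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical two-stage search returning the ids of the `top_k` nearest
/// vectors. `approx` holds lossy vectors, `exact` the f32 sidecar values.
fn two_stage_search(
    approx: &[Vec<f32>],
    exact: &[Vec<f32>],
    query: &[f32],
    top_k: usize,
    rerank_factor: usize,
) -> Vec<usize> {
    // Stage 1: cheap ranking over approximate vectors, fetching a wider
    // candidate set of top_k * rerank_factor ids.
    let fetch = top_k.saturating_mul(rerank_factor);
    let mut ids: Vec<usize> = (0..approx.len()).collect();
    ids.sort_by(|&a, &b| {
        squared_l2(&approx[a], query).total_cmp(&squared_l2(&approx[b], query))
    });
    ids.truncate(fetch);

    // Stage 2: rescore only those candidates against the exact vectors.
    ids.sort_by(|&a, &b| {
        squared_l2(&exact[a], query).total_cmp(&squared_l2(&exact[b], query))
    });
    ids.truncate(top_k);
    ids
}
```

With `rerank_factor = 1` the candidate set is no wider than the final result, so Stage 2 can only reorder what Stage 1 already chose. Larger factors give the exact rescoring more room to correct quantization error, at the cost of reading more sidecar data.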
Complete Examples
Full-text search only
A simple blog post index with lexical search:
default_fields = ["title", "body"]
[fields.title.Text]
indexed = true
stored = true
term_vectors = false
[fields.body.Text]
indexed = true
stored = true
term_vectors = false
[fields.category.Text]
indexed = true
stored = true
term_vectors = false
[fields.published_at.DateTime]
indexed = true
stored = true
Vector search only
A vector-only index for semantic similarity:
[fields.embedding.Hnsw]
dimension = 768
distance = "Cosine"
m = 16
ef_construction = 200
Hybrid search (lexical + vector)
Combine lexical and vector search for best-of-both-worlds retrieval:
default_fields = ["title", "body"]
[fields.title.Text]
indexed = true
stored = true
term_vectors = false
[fields.body.Text]
indexed = true
stored = true
term_vectors = true
[fields.category.Text]
indexed = true
stored = true
term_vectors = false
[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
Tip: A single field cannot be both lexical and vector. Use separate fields (e.g., `body` for text, `body_vec` for embedding) and map them both to the same source content.
E-commerce product index
A more complex schema with mixed field types:
default_fields = ["name", "description"]
[fields.name.Text]
indexed = true
stored = true
term_vectors = false
[fields.description.Text]
indexed = true
stored = true
term_vectors = true
[fields.price.Float]
indexed = true
stored = true
[fields.in_stock.Boolean]
indexed = true
stored = true
[fields.created_at.DateTime]
indexed = true
stored = true
[fields.location.Geo]
indexed = true
stored = true
[fields.description_vec.Hnsw]
dimension = 384
distance = "Cosine"
Generating a Schema
You can generate a schema TOML file interactively using the CLI:
laurus create schema
laurus create schema --output my_schema.toml
See create schema for details.
Using a Schema
Once you have a schema file, create an index from it:
laurus create index --schema schema.toml
Or load it programmatically in Rust:
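use laurus::Schema;

// `main` returns a Result so the `?` operator can propagate I/O and parse errors.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the schema file from disk.
    let toml_str = std::fs::read_to_string("schema.toml")?;
    // Deserialize the TOML text into a Schema.
    let schema: Schema = toml::from_str(&toml_str)?;
    Ok(())
}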