Schema Format Reference

The schema file defines the structure of your index — what fields exist, their types, and how they are indexed. Laurus uses TOML format for schema files.

Overview

A schema consists of three top-level elements:

# Policy for fields not declared below. Optional — defaults to "dynamic".
dynamic_field_policy = "dynamic"

# Fields to search by default when a query does not specify a field.
default_fields = ["title", "body"]

# Field definitions. Each field has a name and a typed configuration.
[fields.<field_name>.<FieldType>]
# ... type-specific options
  • dynamic_field_policy — How the engine treats fields present in an ingested document but absent from this schema. Accepted values: "strict", "dynamic", "ignore". Defaults to "dynamic". See Dynamic Schema for the full semantics and the warning about silent truncation under "dynamic".
  • default_fields — A list of field names used as default search targets by the Query DSL. Only lexical fields (Text, Integer, Float, etc.) can be default fields. This key is optional and defaults to an empty list.
  • fields — A map of field names to their typed configuration. Each field must specify exactly one field type.
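The accepted values and the default of dynamic_field_policy can be modeled as a small enum. This is a minimal sketch, not Laurus's actual type: the names DynamicFieldPolicy and parse_policy are hypothetical, and case-sensitive matching is an assumption.

```rust
/// Hypothetical model of the three accepted `dynamic_field_policy` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum DynamicFieldPolicy {
    Strict,
    #[default]
    Dynamic, // used when the key is omitted
    Ignore,
}

/// Maps the TOML string to a policy. `None` means the key was omitted.
/// Assumption: values are matched case-sensitively.
fn parse_policy(value: Option<&str>) -> Result<DynamicFieldPolicy, String> {
    match value {
        None => Ok(DynamicFieldPolicy::default()),
        Some("strict") => Ok(DynamicFieldPolicy::Strict),
        Some("dynamic") => Ok(DynamicFieldPolicy::Dynamic),
        Some("ignore") => Ok(DynamicFieldPolicy::Ignore),
        Some(other) => Err(format!("unknown dynamic_field_policy: {other:?}")),
    }
}
```

Calling parse_policy(None) yields the Dynamic variant, mirroring the documented default.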

Field Naming

  • Field names are arbitrary strings (e.g., title, body_vec, created_at).
  • Field names starting with _ are reserved for the engine. The only allow-listed name is _id (managed automatically). Attempting to declare any other _-prefixed field results in an error.
  • Field names must be unique within a schema.
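The naming rules above can be expressed as a short check. The following Rust sketch is illustrative only: validate_field_names is a hypothetical helper, and Laurus performs its own validation when it loads a schema.

```rust
use std::collections::HashSet;

/// Checks a list of declared field names against the naming rules.
/// Hypothetical helper, not part of the Laurus API.
fn validate_field_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for &name in names {
        // `_`-prefixed names are reserved; `_id` is the only allow-listed one.
        if name.starts_with('_') && name != "_id" {
            return Err(format!("reserved field name: {name}"));
        }
        // Names must be unique within a schema.
        if !seen.insert(name) {
            return Err(format!("duplicate field name: {name}"));
        }
    }
    Ok(())
}
```

For example, ["title", "body_vec", "created_at"] passes, while ["_score"] and ["title", "title"] are rejected.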

Field Types

Fields fall into two categories: Lexical (for keyword/full-text search) and Vector (for similarity search). A single field cannot be both.

Lexical Fields

Text

Full-text searchable field. Text is processed by the analysis pipeline (tokenization, normalization, stemming, etc.).

[fields.title.Text]
indexed = true       # Whether to index this field for search
stored = true        # Whether to store the original value for retrieval
term_vectors = false # Whether to store term positions (for phrase queries, highlighting)

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables searching this field |
| stored | bool | true | Stores the original value so it can be returned in results |
| term_vectors | bool | true | Stores term positions for phrase queries, highlighting, and more-like-this |

Integer

64-bit signed integer field. Supports range queries and exact match.

[fields.year.Integer]
indexed = true
stored = true
multi_valued = false

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range and exact-match queries |
| stored | bool | true | Stores the original value |
| multi_valued | bool | false | Accept arrays of integers; range queries match if any value satisfies the predicate (Lucene-style “any match” with constant scoring) |
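The "any match" rule for multi-valued fields can be sketched as follows. The function name range_matches_any is hypothetical, and treating both bounds as inclusive is an assumption made for this example.

```rust
/// Returns true if at least one value lies in the range [min, max].
/// Assumption: both bounds are inclusive.
fn range_matches_any(values: &[i64], min: i64, max: i64) -> bool {
    values.iter().any(|&v| (min..=max).contains(&v))
}
```

A document with years [1999, 2005, 2021] matches the range 2000 to 2010 because one value (2005) satisfies it. Under constant scoring, it does not matter how many values match.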

Float

64-bit floating point field. Supports range queries.

[fields.rating.Float]
indexed = true
stored = true
multi_valued = false

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range queries |
| stored | bool | true | Stores the original value |
| multi_valued | bool | false | Accept arrays of floats; range queries match if any value satisfies the predicate (Lucene-style “any match” with constant scoring) |

Boolean

Boolean field (true / false).

[fields.published.Boolean]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables filtering by boolean value |
| stored | bool | true | Stores the original value |

DateTime

UTC timestamp field. Supports range queries.

[fields.created_at.DateTime]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range queries on date/time |
| stored | bool | true | Stores the original value |

Geo

Geographic point field (latitude/longitude). Supports radius and bounding box queries.

[fields.location.Geo]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables geo queries (radius, bounding box) |
| stored | bool | true | Stores the original value |

Geo3d

3D Earth-Centered Earth-Fixed (ECEF) Cartesian point field (x / y / z in meters). Supports the geo3d_distance (sphere), geo3d_bbox (3D AABB), and geo3d_nearest (k-NN) queries. See 3D Geographic Search (ECEF) for the coordinate system and the wgs84_to_ecef / ecef_to_wgs84 conversion utilities.

[fields.position.Geo3d]
indexed = true
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables 3D geo queries (geo3d_distance, geo3d_bbox, geo3d_nearest) |
| stored | bool | true | Stores the original (x, y, z) value |
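To make the ECEF coordinate system concrete, here is a standalone sketch of the standard WGS84 geodetic-to-ECEF formula. It is not Laurus's wgs84_to_ecef utility. The name geodetic_to_ecef and its signature are assumptions for this example.

```rust
/// WGS84 ellipsoid constants.
const WGS84_A: f64 = 6_378_137.0; // semi-major axis in meters
const WGS84_F: f64 = 1.0 / 298.257_223_563; // flattening

/// Converts latitude/longitude in degrees and ellipsoidal height in meters
/// to ECEF (x, y, z) in meters. Hypothetical helper for illustration.
fn geodetic_to_ecef(lat_deg: f64, lon_deg: f64, height_m: f64) -> (f64, f64, f64) {
    let e2 = WGS84_F * (2.0 - WGS84_F); // first eccentricity squared
    let (lat, lon) = (lat_deg.to_radians(), lon_deg.to_radians());
    // Prime-vertical radius of curvature at this latitude.
    let n = WGS84_A / (1.0 - e2 * lat.sin().powi(2)).sqrt();
    let x = (n + height_m) * lat.cos() * lon.cos();
    let y = (n + height_m) * lat.cos() * lon.sin();
    let z = (n * (1.0 - e2) + height_m) * lat.sin();
    (x, y, z)
}
```

A point on the equator at the prime meridian maps to roughly (6378137, 0, 0), and the north pole lies on the z axis at the semi-minor axis length.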

Bytes

Raw binary data field. Not indexed — stored only.

[fields.thumbnail.Bytes]
stored = true

| Option | Type | Default | Description |
|---|---|---|---|
| stored | bool | true | Stores the binary data |

Vector Fields

Vector fields are indexed for approximate nearest neighbor (ANN) search. They require a dimension (the length of each vector) and a distance metric.

Hnsw

Hierarchical Navigable Small World graph index. Best for most use cases — offers a good balance of speed and recall.

[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | 128 | Vector dimensionality (must match your embedding model) |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| m | integer | 16 | Max bi-directional connections per node. Higher = better recall, more memory |
| ef_construction | integer | 200 | Search width during index construction. Higher = better quality, slower build |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | "Scalar8Bit" | Quantization method (see Quantization). Mandatory; default keeps the int8 format introduced in Issue #481 Stage 1. |
| rerank_storage | string | (omit) | Optional Stage 2 rerank sidecar (see Rerank Storage). "F32" enables a per-field f32 sidecar so search can rescore int8 candidates against the original vectors. Omit to keep Stage 1 int8-only behavior. |

Tuning guidelines:

  • m: 12–48 is typical. Use higher values for higher-dimensional vectors.
  • ef_construction: 100–500. Higher values produce a better graph but increase build time.
  • dimension: Must exactly match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base, 1536 for text-embedding-3-small).

Flat

Brute-force linear scan index. Provides exact results with no approximation. Best for small datasets (< 10,000 vectors).

[fields.embedding.Flat]
dimension = 384
distance = "Cosine"
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | 128 | Vector dimensionality |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | "Scalar8Bit" | Quantization method (see Quantization). Mandatory; default keeps the int8 format introduced in Issue #481 Stage 1. |
| rerank_storage | string | (omit) | Reserved for Rerank Storage. Currently emitted only by the HNSW writer; Flat / IVF accept the field for schema symmetry but do not yet write or consume the sidecar. |
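A brute-force scan is simple enough to sketch in a few lines. The example below is an illustration of the technique, not Laurus's Flat implementation. It assumes Euclidean distance and in-memory f32 vectors, and flat_search is a hypothetical name.

```rust
/// Euclidean (L2) distance between two equal-length vectors.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Exact k-nearest-neighbor search that scores every stored vector.
/// Returns (index, distance) pairs sorted by ascending distance.
fn flat_search(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, l2(v, query)))
        .collect();
    // total_cmp defines a total order on f32, so NaN cannot break the sort.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Every query touches every vector, so cost grows linearly with the dataset. This is why the approach suits small collections.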

Ivf

Inverted File Index. Clusters vectors and searches only a subset of clusters. Suitable for very large datasets.

[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 100
n_probe = 1
base_weight = 1.0

| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | (required) | Vector dimensionality |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| n_clusters | integer | 100 | Number of clusters. More clusters = finer partitioning |
| n_probe | integer | 1 | Number of clusters to search at query time. Higher = better recall, slower |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | "Scalar8Bit" | Quantization method (see Quantization). Mandatory; default keeps the int8 format introduced in Issue #481 Stage 1. |
| rerank_storage | string | (omit) | Reserved for Rerank Storage. Currently emitted only by the HNSW writer; Flat / IVF accept the field for schema symmetry but do not yet write or consume the sidecar. |

Note: Unlike Hnsw and Flat, the dimension field in Ivf is required and has no default value.

Tuning guidelines:

  • n_clusters: A common heuristic is sqrt(N) where N is the total number of vectors.
  • n_probe: Start with 1 and increase until recall is acceptable. Typical range is 1–20.
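The sqrt(N) heuristic can be computed directly. This is a small sketch: suggest_n_clusters is a hypothetical helper, and the result is only a starting point to tune from.

```rust
/// Suggests an initial `n_clusters` as round(sqrt(N)), never less than 1.
fn suggest_n_clusters(total_vectors: u64) -> u64 {
    ((total_vectors as f64).sqrt().round() as u64).max(1)
}
```

For one million vectors this suggests n_clusters = 1000, and for 10,000 vectors it suggests 100, which is also the schema default.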

Distance Metrics

The distance option for vector fields accepts the following values:

| Value | Description | Use when |
|---|---|---|
| "Cosine" | Cosine distance (1 - cosine similarity). Default. | Normalized text/image embeddings |
| "Euclidean" | L2 (Euclidean) distance | Spatial data, non-normalized vectors |
| "Manhattan" | L1 (Manhattan) distance | Sparse feature vectors |
| "DotProduct" | Dot product (higher = more similar) | Pre-normalized vectors, or vectors where magnitude matters |
| "Angular" | Angular distance | Similar to cosine, but based on angle |

For most embedding models (BERT, Sentence Transformers, OpenAI, etc.), "Cosine" is the correct choice.
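For reference, the metrics can be written out as plain functions. This is a sketch of the textbook definitions, not Laurus's internal code. Defining angular distance as the angle in radians is an assumption here, since some libraries divide it by pi. All functions expect equal-length, non-zero vectors.

```rust
/// Dot product: a similarity, so higher means more similar.
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn norm(a: &[f32]) -> f32 {
    dot_product(a, a).sqrt()
}

/// Cosine distance: 1 - cosine similarity.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    1.0 - dot_product(a, b) / (norm(a) * norm(b))
}

/// L2 distance: square root of the summed squared differences.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// L1 distance: sum of absolute differences.
fn manhattan_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Angle between the vectors in radians (assumed definition).
fn angular_distance(a: &[f32], b: &[f32]) -> f32 {
    // Clamp to guard against rounding slightly outside [-1, 1].
    let cos = (dot_product(a, b) / (norm(a) * norm(b))).clamp(-1.0, 1.0);
    cos.acos()
}
```

Note that [1, 0] and [2, 0] have cosine distance 0 but a dot product of 2: cosine ignores magnitude, while the dot product does not.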

Quantization

Vector fields are stored on disk as 8-bit scalar-quantized integers (Issue #481 Stage 1). Quantization is mandatory; the previous “no quantization” mode no longer exists. The quantizer option defaults to Scalar8Bit and can be omitted from TOML.

Scalar 8-bit (default)

Per-segment global affine quantization to u8. Compresses each f32 component to a single byte (~4x memory reduction) with negligible recall loss in practice.

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
# quantizer = "Scalar8Bit"  # implicit default; can be omitted
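The idea behind affine 8-bit quantization can be sketched as follows. This is a simplified illustration under stated assumptions, not Laurus's on-disk format: it derives one (min, scale) pair from the global minimum and maximum of all components, and the names AffineParams, fit, quantize, and dequantize are hypothetical.

```rust
/// Affine parameters shared by every vector component in a segment.
struct AffineParams {
    min: f32,
    scale: f32, // width of one quantization step
}

/// Derives parameters from the global min/max. Assumes `values` is non-empty.
fn fit(values: &[f32]) -> AffineParams {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a zero range (all values equal) to avoid dividing by zero.
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    AffineParams { min, scale }
}

/// Maps an f32 component to one byte.
fn quantize(p: &AffineParams, x: f32) -> u8 {
    ((x - p.min) / p.scale).round().clamp(0.0, 255.0) as u8
}

/// Recovers an approximation of the original component.
fn dequantize(p: &AffineParams, q: u8) -> f32 {
    p.min + q as f32 * p.scale
}
```

Each component shrinks from 4 bytes to 1, and the round-trip error of any in-range value is at most half a quantization step.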

Product Quantization (reserved)

Reserved for Issue #481 Stage 3. The writer and searcher currently return NotImplemented if this variant is selected; it is kept here so that schemas can declare it in advance and need no further TOML changes once Stage 3 lands.

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"

[fields.embedding.Hnsw.quantizer.ProductQuantization]
subvector_count = 48

| Option | Type | Description |
|---|---|---|
| subvector_count | integer | Number of subvectors. Must evenly divide dimension. |
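The divisibility constraint is easy to check before writing a schema. This is a sketch only: subvector_dim is a hypothetical helper, not a Laurus function.

```rust
/// Returns the dimension of each subvector, or an error if
/// `subvector_count` does not evenly divide `dimension`.
fn subvector_dim(dimension: usize, subvector_count: usize) -> Result<usize, String> {
    if subvector_count == 0 || dimension % subvector_count != 0 {
        return Err(format!(
            "subvector_count {subvector_count} must evenly divide dimension {dimension}"
        ));
    }
    Ok(dimension / subvector_count)
}
```

With dimension = 384 and subvector_count = 48, each subvector covers 8 components. A subvector_count of 50 would be rejected.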

Breaking change (Issue #481 Stage 1): schemas that explicitly set quantizer to a “none” value are no longer valid. Existing vector indexes built with a pre-Stage-1 laurus build cannot be read; rebuild from source data after upgrading.

Rerank Storage

Optional Stage 2 sidecar (Issue #481) that keeps the original full-precision vectors alongside the int8 segment. The HNSW searcher first fetches a wide candidate set over int8 (cheap), then rescores the best top_k * rerank_factor candidates against the exact f32 values (accurate).

The sidecar is configured per field with rerank_storage:

[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
rerank_storage = "F32"  # opt-in; omit for Stage 1 int8-only behavior

| Value | On-disk overhead | Description |
|---|---|---|
| "F32" | +4 bytes/dim per vector | IEEE-754 single-precision sidecar (Lucene 99 / FAISS convention). |

When omitted, no sidecar is written and the field stays on the Stage 1 int8-only search path. Queries that pass rerank_factor against a field without rerank_storage silently fall back to Stage 1 ranking — the searcher cannot recover f32 information that was discarded at index time.

Scope: Stage 2 covers HNSW only. Flat / IVF accept the field for schema symmetry but currently neither emit nor consume the sidecar.
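The two-stage flow (wide approximate fetch, then exact rescoring) can be sketched as below. This is an illustration under simplifying assumptions, not the HNSW searcher itself: both distance arrays are precomputed per document, lower distance is better, and the function name rerank is hypothetical.

```rust
/// Two-stage ranking: keep `top_k * rerank_factor` candidates by approximate
/// distance, then reorder them by exact distance and return the best `top_k`.
/// `approx` and `exact` hold one distance per document id (lower is better).
fn rerank(approx: &[f32], exact: &[f32], top_k: usize, rerank_factor: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..approx.len()).collect();
    // Stage 1: rank by approximate (int8) distance and keep a wide candidate set.
    ids.sort_by(|&a, &b| approx[a].total_cmp(&approx[b]));
    ids.truncate(top_k.saturating_mul(rerank_factor));
    // Stage 2: rescore only those candidates against the exact (f32) distances.
    ids.sort_by(|&a, &b| exact[a].total_cmp(&exact[b]));
    ids.truncate(top_k);
    ids
}
```

A larger rerank_factor widens the candidate set, so a document that the approximate pass ranked poorly still has a chance to surface. The cost is more exact distance computations.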

Complete Examples

Full-text search only

A simple blog post index with lexical search:

default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = false

[fields.category.Text]
indexed = true
stored = true
term_vectors = false

[fields.published_at.DateTime]
indexed = true
stored = true

Vector search only

A vector-only index for semantic similarity:

[fields.embedding.Hnsw]
dimension = 768
distance = "Cosine"
m = 16
ef_construction = 200

Hybrid search (lexical + vector)

Combine lexical and vector search for best-of-both-worlds retrieval:

default_fields = ["title", "body"]

[fields.title.Text]
indexed = true
stored = true
term_vectors = false

[fields.body.Text]
indexed = true
stored = true
term_vectors = true

[fields.category.Text]
indexed = true
stored = true
term_vectors = false

[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200

Tip: A single field cannot be both lexical and vector. Use separate fields (e.g., body for text, body_vec for embedding) and map them both to the same source content.

E-commerce product index

A more complex schema with mixed field types:

default_fields = ["name", "description"]

[fields.name.Text]
indexed = true
stored = true
term_vectors = false

[fields.description.Text]
indexed = true
stored = true
term_vectors = true

[fields.price.Float]
indexed = true
stored = true

[fields.in_stock.Boolean]
indexed = true
stored = true

[fields.created_at.DateTime]
indexed = true
stored = true

[fields.location.Geo]
indexed = true
stored = true

[fields.description_vec.Hnsw]
dimension = 384
distance = "Cosine"

Generating a Schema

You can generate a schema TOML file interactively using the CLI:

laurus create schema
laurus create schema --output my_schema.toml

See create schema for details.

Using a Schema

Once you have a schema file, create an index from it:

laurus create index --schema schema.toml

Or load it programmatically in Rust:

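use laurus::Schema;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the schema file and deserialize it from TOML.
    let toml_str = std::fs::read_to_string("schema.toml")?;
    let schema: Schema = toml::from_str(&toml_str)?;
    // `schema` can now be passed to the code that creates the index.
    Ok(())
}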