Schema Format Reference
The schema file defines the structure of your index — what fields exist, their types, and how they are indexed. Laurus uses TOML format for schema files.
Overview
A schema consists of two top-level elements:
# Fields to search by default when a query does not specify a field.
default_fields = ["title", "body"]
# Field definitions. Each field has a name and a typed configuration.
[fields.<field_name>.<FieldType>]
# ... type-specific options
- default_fields: A list of field names used as default search targets by the Query DSL. Only lexical fields (Text, Integer, Float, etc.) can be default fields. This key is optional and defaults to an empty list.
- fields: A map of field names to their typed configuration. Each field must specify exactly one field type.
Field Naming
- Field names are arbitrary strings (e.g., title, body_vec, created_at).
- The _id field is reserved by Laurus for internal document ID management. Do not use it.
- Field names must be unique within a schema.
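The naming rules above can be checked before a schema is written out. The following is a minimal sketch using only the Rust standard library; `validate_field_names` is a hypothetical helper for illustration and is not part of the Laurus API.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of the Laurus API): checks a list of
/// field names against the naming rules described above.
fn validate_field_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for &name in names {
        // `_id` is reserved for internal document ID management.
        if name == "_id" {
            return Err("field name `_id` is reserved".to_string());
        }
        // Every field name must be unique within the schema.
        if !seen.insert(name) {
            return Err(format!("duplicate field name `{name}`"));
        }
    }
    Ok(())
}
```

A list such as title, body_vec, created_at passes, while any list containing _id or a repeated name is rejected.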
Field Types
Fields fall into two categories: Lexical (for keyword/full-text search) and Vector (for similarity search). A single field cannot be both.
Lexical Fields
Text
Full-text searchable field. Text is processed by the analysis pipeline (tokenization, normalization, stemming, etc.).
[fields.title.Text]
indexed = true # Whether to index this field for search
stored = true # Whether to store the original value for retrieval
term_vectors = false # Whether to store term positions (for phrase queries, highlighting)
| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables searching this field |
| stored | bool | true | Stores the original value so it can be returned in results |
| term_vectors | bool | true | Stores term positions for phrase queries, highlighting, and more-like-this |
Integer
64-bit signed integer field. Supports range queries and exact match.
[fields.year.Integer]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range and exact-match queries |
| stored | bool | true | Stores the original value |
Float
64-bit floating point field. Supports range queries.
[fields.rating.Float]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range queries |
| stored | bool | true | Stores the original value |
Boolean
Boolean field (true / false).
[fields.published.Boolean]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables filtering by boolean value |
| stored | bool | true | Stores the original value |
DateTime
UTC timestamp field. Supports range queries.
[fields.created_at.DateTime]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables range queries on date/time |
| stored | bool | true | Stores the original value |
Geo
Geographic point field (latitude/longitude). Supports radius and bounding box queries.
[fields.location.Geo]
indexed = true
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| indexed | bool | true | Enables geo queries (radius, bounding box) |
| stored | bool | true | Stores the original value |
Bytes
Raw binary data field. Not indexed — stored only.
[fields.thumbnail.Bytes]
stored = true
| Option | Type | Default | Description |
|---|---|---|---|
| stored | bool | true | Stores the binary data |
Vector Fields
Vector fields are indexed for approximate nearest neighbor (ANN) search. They require a dimension (the length of each vector) and a distance metric.
Hnsw
Hierarchical Navigable Small World graph index. Best for most use cases — offers a good balance of speed and recall.
[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | 128 | Vector dimensionality (must match your embedding model) |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| m | integer | 16 | Max bi-directional connections per node. Higher = better recall, more memory |
| ef_construction | integer | 200 | Search width during index construction. Higher = better quality, slower build |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | none | Optional quantization method (see Quantization) |
Tuning guidelines:
- m: 12–48 is typical. Use higher values for higher-dimensional vectors.
- ef_construction: 100–500. Higher values produce a better graph but increase build time.
- dimension: Must exactly match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base, 1536 for text-embedding-3-small).
Flat
Brute-force linear scan index. Provides exact results with no approximation. Best for small datasets (< 10,000 vectors).
[fields.embedding.Flat]
dimension = 384
distance = "Cosine"
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | 128 | Vector dimensionality |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | none | Optional quantization method (see Quantization) |
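To make "brute-force linear scan" concrete, here is a minimal sketch of what a flat search does conceptually: it scores every stored vector against the query and keeps the closest ones. The function names are hypothetical, squared Euclidean distance is used purely for illustration, and nothing here reflects how Laurus implements its Flat index internally.

```rust
/// Squared Euclidean (L2) distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical sketch of a flat (brute-force) search: score every stored
/// vector against the query and return the indices of the `k` closest.
fn flat_search(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, squared_l2(query, v)))
        .collect();
    // Sort ascending by distance; total_cmp gives a total order on floats.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

Because every vector is visited on every query, the cost grows linearly with the dataset, which is why this index type is recommended only for small collections.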
Ivf
Inverted File Index. Clusters vectors and searches only a subset of clusters. Suitable for very large datasets.
[fields.embedding.Ivf]
dimension = 384
distance = "Cosine"
n_clusters = 100
n_probe = 1
base_weight = 1.0
| Option | Type | Default | Description |
|---|---|---|---|
| dimension | integer | (required) | Vector dimensionality |
| distance | string | "Cosine" | Distance metric (see Distance Metrics) |
| n_clusters | integer | 100 | Number of clusters. More clusters = finer partitioning |
| n_probe | integer | 1 | Number of clusters to search at query time. Higher = better recall, slower |
| base_weight | float | 1.0 | Scoring weight in hybrid search fusion |
| quantizer | object | none | Optional quantization method (see Quantization) |
Note: Unlike Hnsw and Flat, the dimension field in Ivf is required and has no default value.
Tuning guidelines:
- n_clusters: A common heuristic is sqrt(N), where N is the total number of vectors.
- n_probe: Start with 1 and increase until recall is acceptable. Typical range is 1–20.
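The sqrt(N) heuristic for n_clusters is easy to compute ahead of time. The helper below is a hypothetical sketch, not part of Laurus; it rounds to the nearest integer and never returns less than 1.

```rust
/// Hypothetical helper: the sqrt(N) starting point for `n_clusters`,
/// rounded to the nearest integer and never below 1.
fn suggested_n_clusters(total_vectors: u64) -> u64 {
    ((total_vectors as f64).sqrt().round() as u64).max(1)
}
```

For an index of 1,000,000 vectors this suggests n_clusters = 1000, and for 10,000 vectors it suggests 100, which matches the default. Treat the result as a starting point to tune against measured recall.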
Distance Metrics
The distance option for vector fields accepts the following values:
| Value | Description | Use When |
|---|---|---|
| "Cosine" | Cosine distance (1 - cosine similarity). Default. | Normalized text/image embeddings |
| "Euclidean" | L2 (Euclidean) distance | Spatial data, non-normalized vectors |
| "Manhattan" | L1 (Manhattan) distance | Sparse feature vectors |
| "DotProduct" | Dot product (higher = more similar) | Pre-normalized vectors where magnitude matters |
| "Angular" | Angular distance | Similar to cosine, but based on angle |
For most embedding models (BERT, Sentence Transformers, OpenAI, etc.), "Cosine" is the correct choice.
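For reference, the textbook definitions behind four of these metrics can be sketched as follows. These are standard formulas written for illustration, not Laurus's internal implementation. Angular distance is omitted because its normalization varies between libraries and the convention Laurus uses is not stated here.

```rust
/// Dot product of two equal-length vectors (higher = more similar).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine distance: 1 - cosine similarity.
/// The result is NaN if either vector has zero length (norm of 0).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let norm = dot(a, a).sqrt() * dot(b, b).sqrt();
    1.0 - dot(a, b) / norm
}

/// Euclidean (L2) distance.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Manhattan (L1) distance.
fn manhattan_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}
```

Note that cosine distance ignores vector magnitude: two vectors pointing in the same direction have distance 0 regardless of their lengths, whereas Euclidean and Manhattan distances do not.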
Quantization
Vector fields optionally support quantization to reduce memory usage at the cost of some accuracy. Set the quantizer option to select a method. In the examples below, Scalar8Bit is given as a string and ProductQuantization as a TOML table.
None (default)
No quantization — full precision 32-bit floats.
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
# quantizer is omitted (no quantization)
Scalar 8-bit
Compresses each float32 component to uint8 (~4x memory reduction).
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
quantizer = "Scalar8Bit"
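To illustrate where the roughly 4x saving comes from, here is a minimal sketch of one common scalar quantization scheme: each component is mapped linearly from the vector's min/max range onto 0..=255. This is an assumed, generic scheme shown for illustration; the exact method Laurus uses for Scalar8Bit may differ, and both function names are hypothetical.

```rust
/// Hypothetical min/max scalar quantizer: maps each f32 component linearly
/// onto 0..=255. Returns the codes plus the (min, max) needed to decode.
fn quantize_u8(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().copied().fold(f32::INFINITY, f32::min);
    let max = v.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    let codes: Vec<u8> = v
        .iter()
        .map(|&x| {
            if range == 0.0 {
                0 // constant vector: every component maps to the same code
            } else {
                ((x - min) / range * 255.0).round() as u8
            }
        })
        .collect();
    (codes, min, max)
}

/// Approximate reconstruction of one component from its 8-bit code.
fn dequantize_u8(code: u8, min: f32, max: f32) -> f32 {
    min + (code as f32 / 255.0) * (max - min)
}
```

A 384-dimensional vector takes 384 × 4 = 1536 bytes as float32 and 384 bytes as uint8 codes, plus a small constant for the decoding parameters. The reconstruction is approximate, which is the accuracy cost mentioned above.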
Product Quantization
Splits the vector into subvectors and quantizes each independently.
[fields.embedding.Hnsw]
dimension = 384
distance = "Cosine"
[fields.embedding.Hnsw.quantizer.ProductQuantization]
subvector_count = 48
| Option | Type | Description |
|---|---|---|
| subvector_count | integer | Number of subvectors. Must evenly divide dimension. |
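The divisibility constraint can be checked with simple arithmetic before writing the schema. The helper below is a hypothetical sketch, not a Laurus function; it returns the length of each subvector when the split is valid.

```rust
/// Hypothetical check: returns the length of each subvector when
/// `subvector_count` evenly divides `dimension`, otherwise `None`.
fn subvector_len(dimension: usize, subvector_count: usize) -> Option<usize> {
    if subvector_count == 0 || dimension % subvector_count != 0 {
        return None;
    }
    Some(dimension / subvector_count)
}
```

With the example above, dimension = 384 and subvector_count = 48 give 48 subvectors of 8 components each. A value such as 50 would be invalid because 384 is not divisible by 50.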
Complete Examples
Full-text search only
A simple blog post index with lexical search:
default_fields = ["title", "body"]
[fields.title.Text]
indexed = true
stored = true
term_vectors = false
[fields.body.Text]
indexed = true
stored = true
term_vectors = false
[fields.category.Text]
indexed = true
stored = true
term_vectors = false
[fields.published_at.DateTime]
indexed = true
stored = true
Vector search only
A vector-only index for semantic similarity:
[fields.embedding.Hnsw]
dimension = 768
distance = "Cosine"
m = 16
ef_construction = 200
Hybrid search (lexical + vector)
Combine lexical and vector search for best-of-both-worlds retrieval:
default_fields = ["title", "body"]
[fields.title.Text]
indexed = true
stored = true
term_vectors = false
[fields.body.Text]
indexed = true
stored = true
term_vectors = true
[fields.category.Text]
indexed = true
stored = true
term_vectors = false
[fields.body_vec.Hnsw]
dimension = 384
distance = "Cosine"
m = 16
ef_construction = 200
Tip: A single field cannot be both lexical and vector. Use separate fields (e.g., body for text, body_vec for embedding) and map them both to the same source content.
E-commerce product index
A more complex schema with mixed field types:
default_fields = ["name", "description"]
[fields.name.Text]
indexed = true
stored = true
term_vectors = false
[fields.description.Text]
indexed = true
stored = true
term_vectors = true
[fields.price.Float]
indexed = true
stored = true
[fields.in_stock.Boolean]
indexed = true
stored = true
[fields.created_at.DateTime]
indexed = true
stored = true
[fields.location.Geo]
indexed = true
stored = true
[fields.description_vec.Hnsw]
dimension = 384
distance = "Cosine"
Generating a Schema
You can generate a schema TOML file interactively using the CLI:
laurus create schema
laurus create schema --output my_schema.toml
See create schema for details.
Using a Schema
Once you have a schema file, create an index from it:
laurus create index --schema schema.toml
Or load it programmatically in Rust:
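use laurus::Schema;

// `main` returns a Result so the `?` operator can propagate I/O and parse errors.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the schema file, then deserialize it from TOML.
    let toml_str = std::fs::read_to_string("schema.toml")?;
    let schema: Schema = toml::from_str(&toml_str)?;
    Ok(())
}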