Enable exclude_source_vectors by default for new indices #131907


Open: wants to merge 25 commits into base: main
8b03ce9
Enable `exclude_source_vectors` by default for new indices
jimczi Jul 25, 2025
a8f12fe
Update docs/changelog/131907.yaml
jimczi Jul 25, 2025
b0f932f
Update docs
jimczi Jul 25, 2025
9ababdd
Merge remote-tracking branch 'origin/exclude_vector_source' into excl…
jimczi Jul 25, 2025
a3d9dad
fix more tests
jimczi Jul 25, 2025
93d3ab3
remove non-applicable skip test
jimczi Jul 25, 2025
1af3c2c
remove non-applicable skip test (bis)
jimczi Jul 25, 2025
cd3cec2
skip tests
jimczi Jul 25, 2025
91f0103
Merge remote-tracking branch 'upstream/main' into exclude_vector_source
jimczi Jul 25, 2025
87c5d57
add capabilities
jimczi Jul 25, 2025
96e2d27
Fix the translog operation asserter to check for map equivalence rath…
jimczi Jul 25, 2025
bccdc3e
Merge remote-tracking branch 'upstream/main' into exclude_vector_source
jimczi Jul 25, 2025
12019cd
naming
jimczi Jul 25, 2025
ff7bc4d
Update docs/changelog/131907.yaml
jimczi Jul 28, 2025
7c3f34a
Update change log
jimczi Jul 28, 2025
3c530d0
Merge branch 'main' into exclude_vector_source
jimczi Jul 28, 2025
247f974
Update change log
jimczi Jul 28, 2025
916ecfd
changelog
jimczi Jul 28, 2025
8410849
Merge branch 'main' into exclude_vector_source
jimczi Jul 29, 2025
95efb82
Merge remote-tracking branch 'upstream/main' into exclude_vector_source
jimczi Aug 4, 2025
e769341
Merge branch 'main' into exclude_vector_source
jimczi Aug 4, 2025
69a2cf6
Merge branch 'main' into exclude_vector_source
jimczi Aug 4, 2025
762e7be
apply review comment
jimczi Aug 4, 2025
e81cd4d
Merge remote-tracking branch 'upstream/main' into exclude_vector_source
jimczi Aug 4, 2025
1828921
update docs
jimczi Aug 4, 2025
26 changes: 26 additions & 0 deletions docs/changelog/131907.yaml
@@ -0,0 +1,26 @@
pr: 131907
summary: Enable `exclude_source_vectors` by default for new indices
area: Vector Search
type: breaking
issues: []
breaking:
  title: Enable `exclude_source_vectors` by default for new indices
  area: Search
  details: |-
    The `exclude_source_vectors` setting is now enabled by default for newly created indices.
    This means that vector fields (e.g., `dense_vector`) are no longer stored in the `_source` field
    by default, although they remain fully accessible through search and retrieval operations.

    Instead of being persisted in `_source`, vectors are now rehydrated on demand from the underlying
    index structures when needed. This reduces index size and improves performance for typical vector
    search workloads where the original vector values do not need to be part of the `_source`.

    If your use case requires vector fields to be stored in `_source`, you can disable this behavior by
    setting `exclude_source_vectors: false` at index creation time.
  impact: |-
    Vector fields will no longer be stored in `_source` by default for new indices. Applications or tools
    that expect to see vector fields in `_source` (for raw document inspection)
    may need to be updated or configured to explicitly retain vectors using `exclude_source_vectors: false`.

    Retrieval of vector fields via search or the `_source` API remains fully supported.
  notable: true
91 changes: 83 additions & 8 deletions docs/reference/elasticsearch/mapping-reference/dense-vector.md
Expand Up @@ -102,6 +102,81 @@ PUT my-index-2

{{es}} uses the [HNSW algorithm](https://arxiv.org/abs/1603.09320) to support efficient kNN search. Like most kNN algorithms, HNSW is an approximate method that sacrifices result accuracy for improved speed.

## Accessing `dense_vector` fields in search responses
```{applies_to}
stack: ga 9.2
serverless: ga
```

By default, `dense_vector` fields are **not included in `_source`** in responses from the `_search`, `_msearch`, `_get`, and `_mget` APIs.
This helps reduce response size and improve performance, especially in scenarios where vectors are used solely for similarity scoring and not required in the output.

To retrieve vector values explicitly, you can use:

* The `fields` option to request specific vector fields directly:

```console
POST my-index-2/_search
{
  "fields": ["my_vector"]
}
```

* The `_source.exclude_vectors` flag to re-enable vector inclusion in `_source` responses:

```console
POST my-index-2/_search
{
  "_source": {
    "exclude_vectors": false
  }
}
```

### Storage behavior and `_source`

By default, `dense_vector` fields are **not stored in `_source`** on disk. This is also controlled by the index setting `index.mapping.exclude_source_vectors`.
This setting is enabled by default for newly created indices and can only be set at index creation time.

When enabled:

* `dense_vector` fields are removed from `_source` and the rest of the `_source` is stored as usual.
* If a request includes `_source` and vector values are needed (e.g., during recovery or reindex), the vectors are rehydrated from their internal format.

This setting is compatible with synthetic `_source`, where the entire `_source` document is reconstructed from columnar storage. In full synthetic mode, no `_source` is stored on disk, and all fields — including vectors — are rebuilt when needed.

### Rehydration and precision

When vector values are rehydrated (e.g., for reindex, recovery, or explicit `_source` requests), they are restored from their internal format. Internally, vectors are stored at float precision, so if they were originally indexed as higher-precision types (e.g., `double` or `long`), the rehydrated values will have reduced precision. This lossy representation is intended to save space while preserving search quality.
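
The effect of float precision on rehydrated values can be illustrated outside {{es}}. The following Python sketch is not part of this change. It assumes that storing a component at float precision behaves like rounding a 64-bit double to a 32-bit IEEE 754 float, and `to_float32` is a hypothetical helper name used only for this illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back.

    Assumption: this mimics what a double loses when it is stored at
    float precision and later read back.
    """
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical vector as a client might have sent it (64-bit doubles).
original = [0.5, 0.1, 1.23456789012345]

# What the same components look like after a float32 round trip.
rehydrated = [to_float32(component) for component in original]

for before, after in zip(original, rehydrated):
    status = "unchanged" if before == after else "changed"
    print(f"{before!r} -> {after!r} ({status})")
```

Components that are exactly representable as 32-bit floats, such as `0.5`, come back unchanged. Others, such as `0.1`, come back as the nearest 32-bit value, which is why exact round-tripping requires keeping the original vectors in `_source`.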

### Storing original vectors in `_source`

If you want to preserve the original vector values exactly as they were provided, you can re-enable vector storage in `_source`:

```console
PUT my-index-include-vectors
{
  "settings": {
    "index.mapping.exclude_source_vectors": false
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector"
      }
    }
  }
}
```

When this setting is disabled:

* `dense_vector` fields are stored as part of the `_source`, exactly as indexed.
* The index will store both the original `_source` value and the internal representation used for vector search, resulting in increased storage usage.
* Vectors are once again returned in `_source` by default in all relevant APIs, with no need to use `exclude_vectors` or `fields`.

This configuration is appropriate when full source fidelity is required, such as for auditing or round-tripping exact input values.

## Automatically quantize vectors for kNN search [dense-vector-quantization]

The `dense_vector` type supports quantization to reduce the memory footprint required when [searching](docs-content://solutions/search/vector/knn.md#approximate-knn) `float` vectors. The three following quantization strategies are supported:
Expand Down Expand Up @@ -266,16 +341,16 @@ $$$dense-vector-index-options$$$
`type`
: (Required, string) The type of kNN algorithm to use. Can be any of:
* `hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) for scalable approximate kNN search. This supports all `element_type` values.
* `int8_hnsw` - The default index type for some float vectors:
* {applies_to}`stack: ga 9.1` Default for float vectors with less than 384 dimensions.
* `int8_hnsw` - The default index type for some float vectors:

* {applies_to}`stack: ga 9.1` Default for float vectors with less than 384 dimensions.
* {applies_to}`stack: ga 9.0` Default for all float vectors.

This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatic scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 4x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
* `int4_hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatic scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 8x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
* `bbq_hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatic binary quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 32x at the cost of accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
{applies_to}`stack: ga 9.1` `bbq_hnsw` is the default index type for float vectors with greater than or equal to 384 dimensions.

{applies_to}`stack: ga 9.1` `bbq_hnsw` is the default index type for float vectors with greater than or equal to 384 dimensions.
* `flat` - This utilizes a brute-force search algorithm for exact kNN search. This supports all `element_type` values.
* `int8_flat` - This utilizes a brute-force search algorithm in addition to automatic scalar quantization. Only supports `element_type` of `float`.
* `int4_flat` - This utilizes a brute-force search algorithm in addition to automatic half-byte scalar quantization. Only supports `element_type` of `float`.
Expand All @@ -295,8 +370,8 @@ $$$dense-vector-index-options$$$
: (Optional, object) An optional section that configures automatic vector rescoring on knn queries for the given field. Only applicable to quantized index types.
:::::{dropdown} Properties of rescore_vector
`oversample`
: (required, float) The amount to oversample the search results by. This value should be one of the following:
* Greater than `1.0` and less than `10.0`
: (required, float) The amount to oversample the search results by. This value should be one of the following:
* Greater than `1.0` and less than `10.0`
* Exactly `0` to indicate no oversampling and rescoring should occur {applies_to}`stack: ga 9.1`
: The higher the value, the more vectors will be gathered and rescored with the raw values per shard.
: In case a knn query specifies a `rescore_vector` parameter, the query `rescore_vector` parameter will be used instead.
74 changes: 72 additions & 2 deletions docs/reference/elasticsearch/mapping-reference/rank-vectors.md
Expand Up @@ -108,11 +108,81 @@ $$$rank-vectors-element-type$$$
`dims`
: (Optional, integer) Number of vector dimensions. Can’t exceed `4096`. If `dims` is not specified, it will be set to the length of the first vector added to the field.

## Accessing `rank_vectors` fields in search responses
```{applies_to}
stack: ga 9.2
serverless: ga
```

By default, `rank_vectors` fields are **not included in `_source`** in responses from the `_search`, `_msearch`, `_get`, and `_mget` APIs.
This helps reduce response size and improve performance, especially in scenarios where vectors are used solely for similarity scoring and not required in the output.

To retrieve vector values explicitly, you can use:

* The `fields` option to request specific vector fields directly:

```console
POST my-index-2/_search
{
  "fields": ["my_vector"]
}
```

* The `_source.exclude_vectors` flag to re-enable vector inclusion in `_source` responses:

```console
POST my-index-2/_search
{
  "_source": {
    "exclude_vectors": false
  }
}
```

### Storage behavior and `_source`

By default, `rank_vectors` fields are not stored in `_source` on disk. This is also controlled by the index setting `index.mapping.exclude_source_vectors`.
This setting is enabled by default for newly created indices and can only be set at index creation time.

When enabled:

* `rank_vectors` fields are removed from `_source` and the rest of the `_source` is stored as usual.
* If a request includes `_source` and vector values are needed (e.g., during recovery or reindex), the vectors are rehydrated from their internal format.

This setting is compatible with synthetic `_source`, where the entire `_source` document is reconstructed from columnar storage. In full synthetic mode, no `_source` is stored on disk, and all fields — including vectors — are rebuilt when needed.

### Rehydration and precision

When vector values are rehydrated (e.g., for reindex, recovery, or explicit `_source` requests), they are restored from their internal format. Internally, vectors are stored at float precision, so if they were originally indexed as higher-precision types (e.g., `double` or `long`), the rehydrated values will have reduced precision. This lossy representation is intended to save space while preserving search quality.

### Storing original vectors in `_source`

If you want to preserve the original vector values exactly as they were provided, you can re-enable vector storage in `_source`:

```console
PUT my-index-include-vectors
{
  "settings": {
    "index.mapping.exclude_source_vectors": false
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "rank_vectors",
        "dims": 128
      }
    }
  }
}
```

## Synthetic `_source` [rank-vectors-synthetic-source]
When this setting is disabled:

`rank_vectors` fields support [synthetic `_source`](mapping-source-field.md#synthetic-source) .
* `rank_vectors` fields are stored as part of the `_source`, exactly as indexed.
* The index will store both the original `_source` value and the internal representation used for vector search, resulting in increased storage usage.
* Vectors are once again returned in `_source` by default in all relevant APIs, with no need to use `exclude_vectors` or `fields`.

This configuration is appropriate when full source fidelity is required, such as for auditing or round-tripping exact input values.

## Scoring with rank vectors [rank-vectors-scoring]

82 changes: 76 additions & 6 deletions docs/reference/elasticsearch/mapping-reference/sparse-vector.md
Expand Up @@ -57,12 +57,6 @@ See [semantic search with ELSER](docs-content://solutions/search/semantic-search

The following parameters are accepted by `sparse_vector` fields:

[store](/reference/elasticsearch/mapping-reference/mapping-store.md)
: Indicates whether the field value should be stored and retrievable independently of the [_source](/reference/elasticsearch/mapping-reference/mapping-source-field.md) field. Accepted values: true or false (default). The field’s data is stored using term vectors, a disk-efficient structure compared to the original JSON input. The input map can be retrieved during a search request via the [`fields` parameter](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#search-fields-param). To benefit from reduced disk usage, you must either:

* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering).
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source).

index_options {applies_to}`stack: ga 9.1`
: (Optional, object) You can set index options for your `sparse_vector` field to determine if you should prune tokens, and the parameter configurations for the token pruning. If pruning options are not set in your [`sparse_vector` query](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md), Elasticsearch will use the default options configured for the field, if any.

Expand Down Expand Up @@ -96,6 +90,82 @@ This ensures that:
* The tokens that are kept are frequent enough and have significant scoring.
* Very infrequent tokens that may not have as high of a score are removed.

## Accessing `sparse_vector` fields in search responses
```{applies_to}
stack: ga 9.2
serverless: ga
```

By default, `sparse_vector` fields are **not included in `_source`** in responses from the `_search`, `_msearch`, `_get`, and `_mget` APIs.
This helps reduce response size and improve performance, especially in scenarios where vectors are used solely for similarity scoring and not required in the output.

To retrieve vector values explicitly, you can use:

* The `fields` option to request specific vector fields directly:

```console
POST my-index-2/_search
{
  "fields": ["my_vector"]
}
```

* The `_source.exclude_vectors` flag to re-enable vector inclusion in `_source` responses:

```console
POST my-index-2/_search
{
  "_source": {
    "exclude_vectors": false
  }
}
```

### Storage behavior and `_source`

By default, `sparse_vector` fields are not stored in `_source` on disk. This is also controlled by the index setting `index.mapping.exclude_source_vectors`.
This setting is enabled by default for newly created indices and can only be set at index creation time.

When enabled:

* `sparse_vector` fields are removed from `_source` and the rest of the `_source` is stored as usual.
* If a request includes `_source` and vector values are needed (e.g., during recovery or reindex), the vectors are rehydrated from their internal format.

This setting is compatible with synthetic `_source`, where the entire `_source` document is reconstructed from columnar storage. In full synthetic mode, no `_source` is stored on disk, and all fields — including vectors — are rebuilt when needed.

### Rehydration and precision

When vector values are rehydrated (e.g., for reindex, recovery, or explicit `_source` requests), they are restored from their internal format.
Internally, vectors are stored as floats with 9 significant bits of precision, so the rehydrated values will have reduced precision.
This lossy representation is intended to save space while preserving search quality.
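
As a rough illustration only, the Python sketch below shows what keeping 9 significant bits can do to a positive weight. It keeps the exponent and the top 8 explicit mantissa bits of a 32-bit float (9 significant bits counting the implicit leading bit) and drops the rest. The truncation scheme and the helper name `truncate_to_9_bits` are assumptions made for this example, not a description of the exact internal encoding.

```python
import struct


def truncate_to_9_bits(value: float) -> float:
    """Keep 9 significant bits of a positive float (illustrative assumption).

    The value is first rounded to a 32-bit float, then the low 15 mantissa
    bits are cleared, leaving 8 explicit mantissa bits plus the implicit one.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    kept = (bits >> 15) << 15  # clear the 15 least significant bits
    return struct.unpack("<f", struct.pack("<I", kept))[0]


# Hypothetical token weights as a client might have indexed them.
weights = {"token_a": 1.0, "token_b": 0.1, "token_c": 2.71828}

for token, weight in weights.items():
    stored = truncate_to_9_bits(weight)
    relative_error = abs(weight - stored) / weight
    print(f"{token}: {weight!r} -> {stored!r} (relative error {relative_error:.5f})")
```

Under this assumption the relative error of each weight stays below `2**-8` (about 0.4%), which is small enough to leave ranking largely intact but large enough that the rehydrated values rarely match the input exactly.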

### Storing original vectors in `_source`

If you want to preserve the original vector values exactly as they were provided, you can re-enable vector storage in `_source`:

```console
PUT my-index-include-vectors
{
  "settings": {
    "index.mapping.exclude_source_vectors": false
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "sparse_vector"
      }
    }
  }
}
```

When this setting is disabled:

* `sparse_vector` fields are stored as part of the `_source`, exactly as indexed.
* The index will store both the original `_source` value and the internal representation used for vector search, resulting in increased storage usage.
* Vectors are once again returned in `_source` by default in all relevant APIs, with no need to use `exclude_vectors` or `fields`.

This configuration is appropriate when full source fidelity is required, such as for auditing or round-tripping exact input values.

## Multi-value sparse vectors [index-multi-value-sparse-vectors]

Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@
import java.util.Map;
import java.util.stream.Collectors;

import static org.elasticsearch.index.IndexSettings.SYNTHETIC_VECTORS;
import static org.elasticsearch.index.query.QueryBuilders.termQuery;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertHitCount;
Expand Down Expand Up @@ -182,14 +181,13 @@ public void testReindexFromComplexDateMathIndexName() throws Exception {
}

public void testReindexIncludeVectors() throws Exception {
assumeTrue("This test requires synthetic vectors to be enabled", SYNTHETIC_VECTORS);
var resp1 = prepareCreate("test").setSettings(
Settings.builder().put(IndexSettings.INDEX_MAPPING_SOURCE_SYNTHETIC_VECTORS_SETTING.getKey(), true).build()
Settings.builder().put(IndexSettings.INDEX_MAPPING_EXCLUDE_SOURCE_VECTORS_SETTING.getKey(), true).build()
).setMapping("foo", "type=dense_vector,similarity=l2_norm", "bar", "type=sparse_vector").get();
assertAcked(resp1);

var resp2 = prepareCreate("test_reindex").setSettings(
Settings.builder().put(IndexSettings.INDEX_MAPPING_SOURCE_SYNTHETIC_VECTORS_SETTING.getKey(), true).build()
Settings.builder().put(IndexSettings.INDEX_MAPPING_EXCLUDE_SOURCE_VECTORS_SETTING.getKey(), true).build()
).setMapping("foo", "type=dense_vector,similarity=l2_norm", "bar", "type=sparse_vector").get();
assertAcked(resp2);

Expand Down Expand Up @@ -237,5 +235,4 @@ public void testReindexIncludeVectors() throws Exception {
searchResponse.decRef();
}
}

}
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,6 @@
import java.util.Map;
import java.util.stream.Collectors;

import static org.elasticsearch.index.IndexSettings.SYNTHETIC_VECTORS;
import static org.elasticsearch.index.query.QueryBuilders.termQuery;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertHitCount;
Expand Down Expand Up @@ -158,9 +157,8 @@ public void testMissingSources() {
}

public void testUpdateByQueryIncludeVectors() throws Exception {
assumeTrue("This test requires synthetic vectors to be enabled", SYNTHETIC_VECTORS);
var resp1 = prepareCreate("test").setSettings(
Settings.builder().put(IndexSettings.INDEX_MAPPING_SOURCE_SYNTHETIC_VECTORS_SETTING.getKey(), true).build()
Settings.builder().put(IndexSettings.INDEX_MAPPING_EXCLUDE_SOURCE_VECTORS_SETTING.getKey(), true).build()
).setMapping("foo", "type=dense_vector,similarity=l2_norm", "bar", "type=sparse_vector").get();
assertAcked(resp1);
