Abstracted how Text fields use Keyword fields inside of Text fields #132430

Kubik42 · 2025-08-05T02:11:25Z

This is a small refactor plus a bug fix for #131282. In the final PR, I will separate the refactor from the bug fix for clarity.

The refactor changes how Text and MatchOnlyText fields use Keyword multi fields for synthetic source. Currently, this is done via the hasSyntheticSourceCompatibleKeywordField argument, where we set a boolean flag to indicate whether there is a keyword multi field that is either stored or has doc values. This is not a good approach for addressing 131282 because we want to disable the following logic for multi fields. With that disabled, the parent fields will no longer have a multi field to use for synthetic source.

We could designate one of the keyword fields as some kind of "synthetic source provider" for the parent. This way the field will always create a StoredField when ignore_above is tripped. However, this is a poor approach since it exposes how text fields are implemented to the keyword field. In my opinion, the parent field should be the one responsible for deciding what is and what isn't stored, as opposed to its multi fields.

To achieve the above, I've removed hasSyntheticSourceCompatibleKeywordField and instead relied on the syntheticSourceDelegate. With the addition of a new method isIgnored(), which is called during parsing on the supplied value, we can determine whether a particular keyword multi field is a valid supporter of synthetic source. If it isn't, then the parent field can explicitly create a StoredField for that.
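The parse-time decision described above can be sketched in isolation. This is only an illustration, not the PR's code: `KeywordDelegate`, `TextParent`, and `storedFallback` are hypothetical stand-ins for the real mapper classes, and the ignore_above check is assumed to be a plain length comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SyntheticSourceFallbackSketch {

    // Hypothetical stand-in for a keyword multi field with ignore_above set.
    static final class KeywordDelegate {
        private final int ignoreAbove;

        KeywordDelegate(int ignoreAbove) {
            this.ignoreAbove = ignoreAbove;
        }

        // Mirrors the idea of isIgnored(): values longer than ignore_above are skipped
        // by the keyword field, so it cannot supply them for synthetic source.
        boolean isIgnored(String value) {
            return value.length() > ignoreAbove;
        }
    }

    // Hypothetical stand-in for the parent text field.
    static final class TextParent {
        private final Optional<KeywordDelegate> syntheticSourceDelegate;
        private final List<String> storedFallback = new ArrayList<>();

        TextParent(Optional<KeywordDelegate> syntheticSourceDelegate) {
            this.syntheticSourceDelegate = syntheticSourceDelegate;
        }

        // The parent decides: if no delegate exists, or the delegate ignores this value,
        // the parent stores the value itself (the real code would add a StoredField).
        void parse(String value) {
            boolean delegateCoversValue = syntheticSourceDelegate
                .map(delegate -> delegate.isIgnored(value) == false)
                .orElse(false);
            if (delegateCoversValue == false) {
                storedFallback.add(value);
            }
        }

        List<String> storedFallback() {
            return List.copyOf(storedFallback);
        }
    }

    public static void main(String[] args) {
        TextParent parent = new TextParent(Optional.of(new KeywordDelegate(5)));
        parent.parse("short");          // covered by the keyword delegate
        parent.parse("much too long");  // trips ignore_above, stored by the parent
        System.out.println(parent.storedFallback());
    }
}
```

The point of the sketch is that the keyword field only answers a question about a value; it never learns that a text field depends on it.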

Next, I've changed how SyntheticSourceSupport works in the text family of fields. Since we no longer know whether a text field is truly stored or not, we now do something similar to what keyword fields do: create a SyntheticFieldLoader with multiple layers. Each layer points to a potential source for synthetic source. For example, the first layer may contain the field loader for the parent field, while the next layer may contain the field loader for the keyword multi field.
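To make the layering idea concrete, here is a rough model of a loader that asks several layers for values. It is a simplified sketch under assumptions: the `Layer` interface, the `fieldLayer` helper, and the field names `message._fallback` and `message.keyword` are invented for illustration, and the real CompositeSyntheticFieldLoader works against Lucene readers rather than a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LayeredLoaderSketch {

    // Hypothetical layer: given a document, return whatever values this source holds.
    interface Layer {
        List<String> load(Map<String, List<String>> doc);
    }

    // A layer that reads one named field from the (map-based) document.
    static Layer fieldLayer(String fieldName) {
        return doc -> doc.getOrDefault(fieldName, List.of());
    }

    // Collect values from every layer in order. Each value is assumed to live in
    // exactly one layer, so concatenating does not produce duplicates.
    static List<String> loadAll(List<Layer> layers, Map<String, List<String>> doc) {
        List<String> values = new ArrayList<>();
        for (Layer layer : layers) {
            values.addAll(layer.load(doc));
        }
        return values;
    }

    public static void main(String[] args) {
        List<Layer> layers = List.of(
            fieldLayer("message._fallback"), // values the parent stored itself
            fieldLayer("message.keyword")    // values held by the keyword multi field
        );
        Map<String, List<String>> doc = Map.of(
            "message.keyword", List.of("short"),
            "message._fallback", List.of("much too long")
        );
        System.out.println(loadAll(layers, doc));
    }
}
```

Because the parent no longer knows up front where a value ended up, consulting every layer avoids having to decide at mapping time which source is authoritative.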

The following additional changes were also made to clean things up a bit:

Introduced TextFamilyFieldMapper, which contains some common logic shared by Text, MatchOnlyText, and AnnotatedText when it comes to synthetic source
Added a Builder to MapperBuilderContext
Removed isSyntheticSource from Builders as that information comes from MapperBuilderContext anyway
Added some helper functions like isIgnoreAboveSet() for code readability
Moved originalFieldName up to StringFieldType and renamed it to syntheticSourceFallbackFieldName(). This removes a lot of duplicated code and clarifies what this "original" name even is
Wrapped syntheticSourceDelegate in an Optional to make it clear that it can be null
Extracted some field loader logic into separate functions inside of MatchOnlyTextFieldMapper

Kubik42 · 2025-08-05T02:12:59Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

        private final IndexMode indexMode;
        private final IndexSortConfig indexSortConfig;
        private final boolean hasDocValuesSkipper;
-        private final String originalName;


This was moved up to StringFieldType as it's commonly used.

Kubik42 · 2025-08-05T02:14:50Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

@@ -1349,7 +1377,7 @@ protected SyntheticSourceSupport syntheticSourceSupport() {
        return super.syntheticSourceSupport();
    }

-    public SourceLoader.SyntheticFieldLoader syntheticFieldLoader(String fullFieldName, String leafFieldName) {
+    public CompositeSyntheticFieldLoader syntheticFieldLoader(String fullFieldName, String leafFieldName) {


needed this to be more explicit in order to merge this CompositeSyntheticFieldLoader with the one in TextFieldMapper/MatchOnlyTextFieldMapper

Kubik42 · 2025-08-05T02:15:24Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -238,11 +240,9 @@ private static FielddataFrequencyFilter parseFrequencyFilter(String name, Mappin
    public static class Builder extends FieldMapper.Builder {

        private final IndexVersion indexCreatedVersion;
-        private final Parameter<Boolean> store;
-
-        private final boolean isSyntheticSourceEnabled;


no longer used

Kubik42 · 2025-08-05T02:16:39Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

            this.withinMultiField = withinMultiField;
-            this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> {


we no longer need to do this. TextFieldMapper now decides whether we need an extra StoredField during parsing.

This is probably a breaking change?

Yes, but I wonder if there are any customers that rely on store = true by default. We initially set it that way to support synthetic source. However, given that's done elsewhere now, we might be able to release this without too many concerns. I'll discuss with the team.

I think that depends on whether retrieving stored fields fails for requests like: GET my-index/_doc/1?stored_fields=message. I suspect it would no longer retrieve the fields?

Now technically that is breaking and we should open a breaking change issue for this.

Kubik42 · 2025-08-05T02:17:22Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

         * not running in synthetic _source or synthetic source doesn't need it.
         */
-        private final KeywordFieldMapper.KeywordFieldType syntheticSourceDelegate;
+        private final Optional<KeywordFieldMapper.KeywordFieldType> syntheticSourceDelegate;


With Optional here, it's now clearer that this delegate can be null, which enforces null checks.

lkts

I think this looks good overall. It is not a small change, but I think it makes the control flow clearer in this case.

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

server/src/main/java/org/elasticsearch/index/mapper/CompositeSyntheticFieldLoader.java

lkts · 2025-08-05T16:16:33Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

            this.withinMultiField = withinMultiField;
-            this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> {


This is probably a breaking change?

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

martijnvg

I did a first review pass. My two main points are:

I was hoping we could utilize ignored source, as I mentioned in the comment in TextFieldMapper. With #132142, read performance should improve, which was a bit of a concern before.
Looks like this PR also changes the parse logic to read UTF-8 bytes. Let's do that in a separate PR.

martijnvg · 2025-08-11T08:56:09Z

...ext/src/main/java/org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextFieldMapper.java

@@ -84,28 +88,26 @@ public static class Builder extends FieldMapper.Builder {
        final Parameter<String> indexOptions = TextParams.textIndexOptions(m -> builder(m).indexOptions.getValue());
        final Parameter<Boolean> norms = TextParams.norms(true, m -> builder(m).norms.getValue());
        final Parameter<String> termVectors = TextParams.termVectors(m -> builder(m).termVectors.getValue());
+        private final Parameter<Boolean> store = Parameter.storeParam(m -> builder(m).store.getValue(), false);


I don't think we need to add a new mapping parameter for this field type?

How come? We had one before - see line 104 that was removed.

...ext/src/main/java/org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextFieldMapper.java

martijnvg · 2025-08-11T09:08:09Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

            this.withinMultiField = withinMultiField;
-            this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> {


I think that depends on whether retrieving stored fields fails for requests like: GET my-index/_doc/1?stored_fields=message. I suspect it would no longer retrieve the fields?

Now technically that is breaking and we should open a breaking change issue for this.

martijnvg · 2025-08-11T09:10:52Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+            var utfBytes = value.bytes();
+            var bytesRef = new BytesRef(utfBytes.bytes(), utfBytes.offset(), utfBytes.length());
+            final String fieldName = fieldType().syntheticSourceFallbackFieldName(true);
+            context.doc().add(new StoredField(fieldName, bytesRef));


Maybe we just use ignored source here? I think falling back to ignored source is a simpler approach.

So instead something like:

context.addIgnoredField(new IgnoredSourceFieldMapper.NameValue(fullPath(), fullPath().lastIndexOf(leafName()), bytesRef, context.doc()));

This way we can reuse the fallback synthetic source block loader and synthetic source ignored source support.

Good suggestion. Now that I've seen how ignored source is implemented around synthetic source, it makes sense to reuse that code.

One thing we discussed is how "ignored source" is the wrong name here and may confuse others. I will add a simple wrapper function that makes this more clear. That should suffice.

Update: ignored source will not work because all fields with the same name, under the same document, will need to be added to ignored source. This is because we skip using field loaders for fields that are in ignored source. This is fine for the fields that are genuinely added to ignored source, but not for fields that are not added.

For example, suppose our mappings include a text field with a keyword multi field + ignore_above, and we're given a document that contains two such fields: one trips ignore_above and one doesn't. For the tripped one, we'll add the parent text field into ignored source. However, for the other one, we'll rely on the keyword multi field for synthetic source. Because the first field is already recorded in ignored source, we'll skip using the field loader for the second field and it will be missing from our result.

I will go back to the original code: creating StoredFields directly in parseCreateField().
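The failure mode described in this thread can be reproduced with a toy model. This is a deliberately simplified sketch, not Elasticsearch code: the `reconstruct` method and the two lists are hypothetical, and the only behavior modeled is the one stated above, namely that a field present in ignored source is not read through its field loader.

```java
import java.util.ArrayList;
import java.util.List;

final class IgnoredSourcePitfallSketch {

    // Index two or more values of the same text field, then rebuild them the way
    // the thread describes: ignored source wins and the field loader is skipped.
    static List<String> reconstruct(List<String> values, int ignoreAbove) {
        List<String> ignoredSource = new ArrayList<>();    // values that tripped ignore_above
        List<String> keywordDocValues = new ArrayList<>(); // values kept by the keyword multi field

        for (String value : values) {
            if (value.length() > ignoreAbove) {
                ignoredSource.add(value);
            } else {
                keywordDocValues.add(value);
            }
        }

        // Read side: once the field name appears in ignored source, the field loader
        // (here, the keyword doc values) is never consulted for that field.
        if (ignoredSource.isEmpty() == false) {
            return ignoredSource;
        }
        return keywordDocValues;
    }

    public static void main(String[] args) {
        // "short" is held only by the keyword field, so it goes missing from the result.
        System.out.println(reconstruct(List.of("short", "much too long"), 5));
    }
}
```

In this model, fixing the loss would require writing every value of the field to ignored source, which is the cost the comment above points out.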

server/src/main/java/org/elasticsearch/index/mapper/TextFamilyFieldMapper.java


Kubik42 added >bug Team:StorageEngine labels Aug 5, 2025

elasticsearchmachine added the v9.2.0 label Aug 5, 2025

Kubik42 commented Aug 5, 2025


Kubik42 force-pushed the 131282-2 branch 2 times, most recently from 4f03296 to 511735f Compare August 5, 2025 02:21

lkts reviewed Aug 5, 2025


Kubik42 force-pushed the 131282-2 branch 4 times, most recently from 9c95662 to a11f607 Compare August 11, 2025 14:09

martijnvg reviewed Aug 11, 2025


Kubik42 force-pushed the 131282-2 branch from 4c6f9a1 to c97ebe8 Compare August 11, 2025 22:35

Kubik42 added 6 commits August 13, 2025 16:35

Abstracted how Text fields use Keyword fields inside of Text fields

101b69f

Fixed incorrect synthetic source field name for text fields

3cd66fe

Fixed text field mapper index mode and match only text missing synthe…

eb994ee

…tic source delegate

Fixed match_only_text shallow field value fetcher

a70dc9a

Dont store text fields in binary format

ae49fd8

Fixed value string

accafb1

Kubik42 force-pushed the 131282-2 branch from 181af08 to accafb1 Compare August 13, 2025 23:42

Added TextFamilyFieldType

cfae927

Kubik42 force-pushed the 131282-2 branch from 1aca4f3 to cfae927 Compare August 14, 2025 20:24

Kubik42 added 3 commits August 14, 2025 13:39

Removed TextFamilyFieldType in favor of StringFieldType

e167e4d

Use ignored source for text fields

262738c

Renamed ignore source function for clarity

6524cce

Kubik42 closed this Aug 15, 2025

Kubik42 deleted the 131282-2 branch August 15, 2025 02:17

Kubik42 mentioned this pull request Aug 15, 2025

Abstracted how Text fields use Keyword fields inside of Text fields #132962

Draft
