Skip to content

Abstracted how Text fields use Keyword fields inside of Text fields #132430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

Kubik42
Copy link

@Kubik42 Kubik42 commented Aug 5, 2025

This is a small refactor + bug for fix 131282. In the final PR, I will separate the refactor from the bug fix for clarity.

The refactor changes how Text and MatchOnlyText fields use Keyword multi fields for synthetic source. Currently, this is done via the hasSyntheticSourceCompatibleKeywordField argument, where we set a boolean flag to indicate whether there is a keyword multi field that is either stored or has doc values. This is not a good approach for addressing 131282 because we want to disable the following logic for multi fields. With that disabled, the parent fields will no longer have a multi field to use for synthetic source.

We could designate one of the keyword fields as some kind of "synthetic source provider" for the parent. This way the field will always create a StoredField when ignore_above is tripped. However, this is poor approach since it exposes how text fields are implemented to the keyword field. In my opinion, the parent field should be the one responsible for deciding what is and what isn't stored, as opposed to its multi fields.

To achieve the above, I've removed hasSyntheticSourceCompatibleKeywordField and instead relied on the syntheticSourceDelegate. With the addition of a new method isIgnored(), which is called during parsing on the supplied value, we can determine whether a particular keyword multi field is a valid supporter of synthetic source. If it isn't, then the parent field can explicitly create a StoredField for that.

Next, I've changed how SyntheticSourceSupport works in the text family of fields. Since we no longer know whether a text field is truly stored or not, we now do something similar to what keyword fields do: create a SyntheticFieldLoader with multiple layers. Each layer points to a potential source for synthetic source. For example, the first layer may contain the field loader for the parent field, while the next layer may contain the field loader for the keyword multi field.

The following, additional, changes were also made to clean things up a bit:

  • Introduced TextFamilyFieldMapper, which contains some common logic shared by Text, MatchOnlyText, and AnnotatedText when it comes to synthetic source
  • Added a Builder to MapperBuilderContext
  • Removed isSyntheticSource from Builders as that information comes from MapperBuilderContext anyway
  • Added some helper functions like isIgnoreAboveSet() for code readability
  • Moved originalFieldName up to StringFieldType and renamed it to syntheticSourceFallbackFieldName(). This removes a lot of duplicated code and clarifies what this "original" name even is
  • Wrapped syntheticSourceDelegate in an Optional to make it clear that it can be null
  • Extracted some field loader logic into separate functions inside of MatchOnlyTextFieldMapper

private final IndexMode indexMode;
private final IndexSortConfig indexSortConfig;
private final boolean hasDocValuesSkipper;
private final String originalName;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was moved up to StringFieldType as its a commonly used function

@@ -1349,7 +1377,7 @@ protected SyntheticSourceSupport syntheticSourceSupport() {
return super.syntheticSourceSupport();
}

public SourceLoader.SyntheticFieldLoader syntheticFieldLoader(String fullFieldName, String leafFieldName) {
public CompositeSyntheticFieldLoader syntheticFieldLoader(String fullFieldName, String leafFieldName) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed this to be more explicit in order to merge this CompositeSyntheticFieldLoader with the one in TextFieldMapper/MatchOnlyTextFieldMapper

@@ -238,11 +240,9 @@ private static FielddataFrequencyFilter parseFrequencyFilter(String name, Mappin
public static class Builder extends FieldMapper.Builder {

private final IndexVersion indexCreatedVersion;
private final Parameter<Boolean> store;

private final boolean isSyntheticSourceEnabled;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer used

this.withinMultiField = withinMultiField;
this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we no longer need to do this. TextFieldMapper now decides whether we need an extra StoredField during parsing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a breaking change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I wonder if there are any customers that rely on store = true by default. We initially set it that way to support synthetic source. However, given thats done elsewhere now, we might be able to release this without too many concerns. I'll discuss with the team.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that depends on whether retrieving stored fields fail for requests like: GET my-index/_doc/1?stored_fields=message? I suspect it would no longer retrieve fields?

Now technically that is breaking and we should open a breaking change issue for this.

* not running in synthetic _source or synthetic source doesn't need it.
*/
private final KeywordFieldMapper.KeywordFieldType syntheticSourceDelegate;
private final Optional<KeywordFieldMapper.KeywordFieldType> syntheticSourceDelegate;
Copy link
Author

@Kubik42 Kubik42 Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with Optional here, its now more clear that this delegate can be null. Which enforces null checks.

@Kubik42 Kubik42 force-pushed the 131282-2 branch 2 times, most recently from 4f03296 to 511735f Compare August 5, 2025 02:21
Copy link
Contributor

@lkts lkts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good overall. It is not a small change but i think it makes control flow more clear in this case.

this.withinMultiField = withinMultiField;
this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a breaking change?

@Kubik42 Kubik42 force-pushed the 131282-2 branch 4 times, most recently from 9c95662 to a11f607 Compare August 11, 2025 14:09
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first review pass. My two main points are:

  1. I was hoping we could utilize ignored source as I mentioned in the comment in TextFieldmapper. With #132142 reading performance should improve, which was a bit of a concern before.
  2. Looks like this PR also changes to parse logic to read utf8 bytes. Let's do that in a separate PR.

@@ -84,28 +88,26 @@ public static class Builder extends FieldMapper.Builder {
final Parameter<String> indexOptions = TextParams.textIndexOptions(m -> builder(m).indexOptions.getValue());
final Parameter<Boolean> norms = TextParams.norms(true, m -> builder(m).norms.getValue());
final Parameter<String> termVectors = TextParams.termVectors(m -> builder(m).termVectors.getValue());
private final Parameter<Boolean> store = Parameter.storeParam(m -> builder(m).store.getValue(), false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to add a name mapping parameter for this field type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come? We had one before - see line 104 that was removed.

this.withinMultiField = withinMultiField;
this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that depends on whether retrieving stored fields fail for requests like: GET my-index/_doc/1?stored_fields=message? I suspect it would no longer retrieve fields?

Now technically that is breaking and we should open a breaking change issue for this.

var utfBytes = value.bytes();
var bytesRef = new BytesRef(utfBytes.bytes(), utfBytes.offset(), utfBytes.length());
final String fieldName = fieldType().syntheticSourceFallbackFieldName(true);
context.doc().add(new StoredField(fieldName, bytesRef));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we just use ignored source here? I think falling back to ignored source is a more simplistic approach.

So instead something like:

context.addIgnoredField(new IgnoredSourceFieldMapper.NameValue(fullPath(), fullPath().lastIndexOf(leafName()), bytesRef, context.doc()));

This way we can reuse the fallback synthetic source block loader and synthetic source ignored source support.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Now that I've seen how ignored source is implemented around synthetic source, it makes sense to reuse that code.

One thing we discussed is how "ignored source" is the wrong name here and may confuse others. I will add a simple wrapper function that makes this more clear. That should suffice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: ignored source will not work because all fields with the same name, under the same document, will need to be added to ignored source. This is because we skip using field loaders for fields that are in ignored source. This is fine for the fields that are genuinely added to ignored source, but not for fields that are not added.

For example, if our mappings include a text field with a keyword multi field + ignore_above, and we're given a document that contains two of such fields: one trips ignore_above and one doesn't. For the tripped one, we'll add the parent text field into ignored source. However, for the other one, we'll rely on the keyword multi field for synthetic source. Because the first field is already recorded in ignored source, we'll skip using the field loader for the second field and it will be missing from our result.

I will go back to the original code: creating StoredFields directly in parseCreateField().

@Kubik42 Kubik42 closed this Aug 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid storing ignored source for multi-fields
4 participants