Skip to content

Fix Semantic Query Rewrite Interception Drops Boosts #129282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

Samiul-TheSoccerFan
Copy link
Contributor

Closes ##128696

Match query: boost only on inference field

PUT my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "author": {
        "type": "text"
      },
      "summary": {
        "type": "text",
        "copy_to": "semantic_summary"
      },
      "semantic_summary": {
        "type": "semantic_text"
      },
      "moral_lesson": {
        "type": "text",
        "copy_to": "semantic_moral"
      },
      "semantic_moral": {
        "type": "semantic_text"
      }
    }
  }
}

POST my-index/_doc/1
{
  "title": "Cinderella",
  "author": "Charles Perrault",
  "summary": "Cinderella is a young girl mistreated by her stepmother and stepsisters until she meets a fairy godmother.",
  "moral_lesson": "Goodness and kindness will always be rewarded."
}

POST my-index/_doc/2
{
  "title": "Little Red Riding Hood",
  "author": "Brothers Grimm",
  "summary": "A young girl encounters a cunning wolf on her way to visit her grandmother.",
  "moral_lesson": "Beware of strangers."
}

GET my-index/_search
{
  "query": {
    "match": {
      "semantic_moral": {
        "query": "danger",
        "boost": 2.0
      }
    }
  }
}

Match query: boost on both semantic_text and text fields

PUT normal-text-index/
{
  "mappings": {
    "properties": {
      "semantic1": {
        "type": "text"
      },
      "semantic2": {
        "type": "text"
      }
    }
  }
}

POST normal-text-index/_doc/1
{
  "semantic1": "Cinderella is a young girl mistreated by her stepmother and stepsisters until she meets a fairy godmother.",
  "semantic2": "Goodness and kindness will always be rewarded."
}

PUT semantic-text-index/
{
  "mappings": {
    "properties": {
      "semantic1": {
        "type": "semantic_text"
      },
      "semantic2": {
        "type": "semantic_text"
      }
    }
  }
}

POST semantic-text-index/_doc/1
{
  "semantic1": "A young girl encounters a cunning wolf on her way to visit her grandmother.",
  "semantic2": "Beware of strangers."
}

GET normal-text-index,semantic-text-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "semantic1": {
              "query": "wolf",
              "boost": 2.0
            }
          }
        },
        {
          "match": {
            "semantic2": {
              "query": "strangers",
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}

KNN

PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}

PUT test-knn-semantic-boost
{
  "mappings": {
    "properties": {
      "sem_field": {
        "type": "semantic_text",
        "inference_id": "my-e5-model"
      }
    }
  }
}

POST test-knn-semantic-boost/_doc/1?refresh=true
{
  "sem_field": "star wars droids"
}

POST test-knn-semantic-boost/_doc/2?refresh=true
{
  "sem_field": "fairy tale godmother"
}

// no boost given
GET test-knn-semantic-boost/_search
{
  "query": {
    "knn": {
      "field": "sem_field",
      "query_vector": [0.1, 0.2, 0.3],
      "k": 2,
      "num_candidates": 10
    }
  }
}

// boost preserve
GET test-knn-semantic-boost/_search
{
  "query": {
    "knn": {
      "field": "sem_field",
      "query_vector": [0.1, 0.2, 0.3], // the query throws error which is expected but the boost is propagated properly
      "k": 2,
      "num_candidates": 10,
      "boost": 5.0
    }
  }
}

Sparse

PUT test-sparse-semantic-boost
{
  "mappings": {
    "properties": {
      "sem_field": {
        "type": "semantic_text"
      }
    }
  }
}

POST test-sparse-semantic-boost/_doc/1?refresh=true
{
  "sem_field": "star wars droids"
}

POST test-sparse-semantic-boost/_doc/2?refresh=true
{
  "sem_field": "fairy tale godmother"
}

// sparse without boost
GET test-sparse-semantic-boost/_search
{
  "query": {
    "sparse_vector": {
      "field": "sem_field",
      "query": "driods"
    }
  }
}

// boost preserved
GET test-sparse-semantic-boost/_search
{
  "query": {
    "sparse_vector": {
      "field": "sem_field",
      "query": "driods",
      "boost": 5.0
    }
  }
}

@Samiul-TheSoccerFan Samiul-TheSoccerFan added >bug auto-backport Automatically create backport pull requests when merged v9.0.0 v8.18.0 v8.19.0 :SearchOrg/Relevance Label for the Search (solution/org) Relevance team labels Jun 11, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @Samiul-TheSoccerFan ! Can we please add some tests?

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see a solution based on copy constructors. Delegating the responsibility of generating a complete copy to the caller is error-prone and hard to maintain, which is how this class of bugs got introduced in the first place.

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good! Just a couple of things to clean up and we're good to merge :)


- match: { hits.total.value: 1 }
- match: { hits.hits.0._id: "doc_1" }
- close_to: { hits.hits.0._score: { value: 0.9984111, error: 1e15 } }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is still wrong here

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice iterations @Samiul-TheSoccerFan - I've left some feedback, much of which is non blocking. I think it's very close to being ready to go though!

private MatchQueryBuilder createTestQueryBuilder() {
return new MatchQueryBuilder(FIELD_NAME, VALUE);
}

private MatchQueryBuilder createTestQueryBuilderWithBoostAndQueryName() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking nitpick: Since this is pretty simple and only used once, it doesn't need to be its own method. Maybe more readable to include the test query builder inside the test itself.

@@ -52,62 +52,68 @@ public void cleanup() {
}

public void testSparseVectorQueryOnInferenceFieldIsInterceptedAndRewritten() throws IOException {
float boost = randomFloatBetween(1, 10, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a randomBoolean() check to determine whether to set these boosts and names at all?

}

public void testSparseVectorQueryOnInferenceFieldWithoutInferenceIdIsInterceptedAndRewritten() throws IOException {
float boost = randomFloatBetween(1, 10, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note on randomBoolean()

assertEquals(QUERY, sparseVectorQueryBuilder.getQuery());
original.boost(boost);
original.queryName(queryName);
testRewrittenInferenceQuery(context, original);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice consolidation 🙌


- do:
indices.create:
index: test-sparse-index-random
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we name this random?

index: test-sparse-index-random
body:
settings:
number_of_shards: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get away with not setting shards/replicas here? I can understand why we need to do it when we're directly comparing scores, but if we don't absolutely need to do this, it's a more robust test without specifying it. For this particular test, since there's only one document, let's experiment with not needing it? (Suggestion applies to the other yaml tests as well).

If we do need it for score, perhaps we could add another test that doesn't compare scores, just names and successful searches.

return boolQueryBuilder;
}

@Override
public String getQueryName() {
return MatchQueryBuilder.NAME;
}

private MatchQueryBuilder copyMatchQueryBuilder(MatchQueryBuilder queryBuilder) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking feedback: This PR has been around for a while, and I'm OK with this as written if we just want to get it in. However, future maintainability would be a concern of mine (what if we add a new param to match, such as prefiltering?). I'd almost suggest creating a new constructor in MatchQueryBuilder instead of this private method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Search - Relevance The Search organization Search Relevance team Team:SearchOrg Meta label for the Search Org (Enterprise Search) v8.18.0 v8.19.0 v9.0.0 v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants