Feat: Search Eval Testing Overhaul (provide ground truth, categorize query, etc.) #4739
Conversation
PR Summary
This PR introduces a comprehensive overhaul of the search evaluation testing framework, adding structured test queries with ground truth validation and category-based analysis.
- The new `test_queries.json.template` has a syntax error (a missing comma) and an incomplete second test query that needs to be fixed
- `search_eval_config.yaml.template` reduces `NUM_RETURNED_HITS` from 200 to 50 and `EVAL_TOPK` from 20 to 5, which may impact evaluation quality
- The new metrics system in `util_eval.py` compares against both ground truth and reranked results but needs better edge-case handling
- The modular code structure with separate utility files improves maintainability, but some error-handling gaps exist
- Query keyword generation in `util_data.py` should persist generated keywords to prevent result inconsistency across runs
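The last point, persisting generated keywords so that repeated eval runs stay comparable, could be addressed with a small on-disk cache. This is only an illustrative sketch; `get_query_keywords`, the cache file layout, and the `generate_fn` callback are hypothetical names, not the actual `util_data.py` API.

```python
import json
from pathlib import Path
from typing import Callable


def get_query_keywords(
    query: str,
    cache_path: Path,
    generate_fn: Callable[[str], list[str]],
) -> list[str]:
    """Return keywords for a query, persisting generated results so that
    repeated eval runs reuse identical keywords instead of regenerating
    them. Hypothetical sketch, not the PR's actual implementation."""
    cache: dict[str, list[str]] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text())
    if query not in cache:
        # Only call the (possibly nondeterministic) generator on a cache miss,
        # then write the cache back so later runs see the same keywords.
        cache[query] = generate_fn(query)
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[query]
```

With a JSON cache like this, a nondeterministic keyword generator (e.g. an LLM call) only runs once per query, so metric changes between runs reflect search changes rather than keyword drift.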
9 file(s) reviewed, 17 comment(s) (Greptile)
backend/tests/regression/search_quality/test_queries.json.template
pass
warned = False
style: Using global state for the warning flag could cause issues in concurrent test execution
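One way to address this review comment is to scope the flag to an object and guard it with a lock, so concurrent tests each hold their own warner instead of sharing a module-level `warned = False`. A minimal sketch; `OneTimeWarner` and `warn_once` are illustrative names, not the PR's code.

```python
import threading


class OneTimeWarner:
    """Instance-scoped, thread-safe replacement for a module-level
    `warned = False` flag. Each test (or test worker) can create its
    own instance, so concurrent runs do not interfere."""

    def __init__(self) -> None:
        self._warned = False
        self._lock = threading.Lock()

    def warn_once(self, message: str) -> bool:
        """Emit the warning only on the first call; return True if emitted."""
        with self._lock:
            if self._warned:
                return False
            self._warned = True
        print(f"WARNING: {message}")
        return True
```

Alternatively, the standard library's `warnings` module with per-test filter state serves a similar purpose without any hand-rolled flag.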
some nits!
…query, etc.) (onyx-dot-app#4739)
* fix: autoflake & import order
* docs: readme
* fix: mypy
* feat: eval
* docs: readme
* fix: oops forgot to remove comment
* fix: typo
* fix: rename var
* updated default config
* fix: config issue
* oops
* fix: black
* fix: eval and config
* feat: non tool calling query mod
Description
Greatly overhauled the search eval testing script to accept an optional list of relevant search results (ground truth). It also generates the modified queries more seamlessly.
Also added question categories and category-based aggregation to identify which types of search queries are performing poorly.
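The category-based aggregation described above amounts to grouping per-query scores by category and averaging within each group. A minimal sketch, assuming a per-query result dict with `category` and `score` fields; these field names are illustrative, not the actual script's schema.

```python
from collections import defaultdict


def aggregate_by_category(results: list[dict]) -> dict[str, float]:
    """Average a per-query metric within each question category, so that
    poorly performing query types stand out. Illustrative sketch only."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        # Group each query's score under its category label.
        buckets[r["category"]].append(r["score"])
    # Mean score per category.
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}
```

A low average for one category (e.g. keyword-style versus natural-language questions) then points directly at which kind of search traffic needs attention.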
How Has This Been Tested?
Ran the script and manually computed metrics to verify correctness. Also inspected the soft truth set to validate it.
Backporting (check the box to trigger backport action)
Note: you must verify that the backport action passes; otherwise, resolve the conflicts manually and tag the patches.