-
Notifications
You must be signed in to change notification settings - Fork 6
203 add support for roman numerals within search #800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
203 add support for roman numerals within search #800
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for Roman numeral handling in course searches and refines instructor indexing and search analyzers.
- Introduces a
roman_numerals_synonym_filter
andcourse_analyzer
for courses - Adds a
min_length_filter
andinstructor_analyzer
to clean up instructor tokens - Updates search queries to use the new analyzers and adjusts boost settings
Comments suppressed due to low confidence (6)
search/es_util.py:365
- The
slop
parameter only applies to phrase-based queries. Formulti_match
withtype: "best_fields"
, either removeslop
or switch totype: "phrase"
/"phrase_prefix"
if phrase proximity is required.
"slop": 1
search/es_util.py:110
- [nitpick] Update this function docstring to mention the addition of the Roman numeral synonym filter and
course_analyzer
so that maintainers are aware of the new indexing behavior.
Index courses into Elasticsearch.
search/es_util.py:279
- [nitpick] Consider revising this comment for clarity and tone (e.g., remove "literally all it does it" and explain why single-character tokens are excluded).
# literally all it does it to remove tokens shorter than 2 characters after analyzer tokenizes the string
search/es_util.py:118
- Add or update tests that verify the Roman numeral synonyms are correctly applied during indexing and that searches for "I" vs "1" return equivalent results.
"roman_numerals_synonym_filter": {
search/es_util.py:166
- Deleting and recreating the index on each run will drop existing data. Consider using index templates or migrations to update mappings without data loss.
if es.indices.exists(index="courses"):
search/es_util.py:287
- A
min_length_filter
withmin: 2
will drop valid single-character tokens (e.g., initials or names like 'O'). Confirm this behavior is acceptable or adjust the filter.
"min": 2
if es.indices.exists(index="instructors"): | ||
es.indices.delete(index="instructors") | ||
es.indices.create(index="instructors", body=settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The index creation logic for courses and instructors is duplicated. Consider extracting a helper function to DRY this pattern and simplify future changes.
if es.indices.exists(index="instructors"): | |
es.indices.delete(index="instructors") | |
es.indices.create(index="instructors", body=settings) | |
create_index(es, "instructors", settings) |
Copilot uses AI. Check for mistakes.
I'll leave this PR open, there's a lot to review and I just want to make sure that this doesn't impact the current search scoring. I think we should begin looking into CICD checks for search, defining some test cases and see how well search can match queries (i.e. 75% of queries contain an expected result, 50% or lower should fail the CI job). That should help us clearly define what how/what makes the search better. |
I can try to look into CI/CD. I actually need to do CI/CD later for my job so I would really want to learn it |
@twangodev pls test on your end to make sure its all good. I ended up changing a lot of original parameters