203 add support for roman numerals within search #800

Open
wants to merge 3 commits into
base: main
from

Conversation

ProfessorAtomicManiac
Collaborator

@ProfessorAtomicManiac ProfessorAtomicManiac commented Jun 4, 2025

@twangodev please test on your end to make sure it's all good. I ended up changing a lot of the original parameters.

@Copilot Copilot AI review requested due to automatic review settings June 4, 2025 07:18
@ProfessorAtomicManiac ProfessorAtomicManiac linked an issue Jun 4, 2025 that may be closed by this pull request
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds support for Roman numeral handling in course searches and refines instructor indexing and search analyzers.

  • Introduces a roman_numerals_synonym_filter and course_analyzer for courses
  • Adds a min_length_filter and instructor_analyzer to clean up instructor tokens
  • Updates search queries to use the new analyzers and adjusts boost settings
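
As a rough illustration of the first bullet, the settings below sketch how a synonym filter and a custom analyzer could be wired together in Elasticsearch. Only the names roman_numerals_synonym_filter and course_analyzer come from this PR; the helpers to_roman and build_course_settings, the numeral range, and the filter order are assumptions for illustration, not the PR's actual configuration.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer (1..3999) to a lowercase Roman numeral."""
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def build_course_settings(max_numeral: int = 10) -> dict:
    """Build index settings with a Roman-numeral synonym filter (hypothetical helper)."""
    # Each rule treats the numeral and the digit as equivalent, e.g. "ii, 2".
    synonyms = [f"{to_roman(n)}, {n}" for n in range(1, max_numeral + 1)]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "roman_numerals_synonym_filter": {
                        "type": "synonym",
                        "synonyms": synonyms,
                    }
                },
                "analyzer": {
                    "course_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so "II" and "ii" hit the same rule.
                        "filter": ["lowercase", "roman_numerals_synonym_filter"],
                    }
                },
            }
        }
    }
```

Generating the rules keeps the list consistent if the range changes later; a hand-written list would work equally well for a small range.
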
Comments suppressed due to low confidence (6)

search/es_util.py:365

  • The slop parameter only applies to phrase-based queries. For multi_match with type: "best_fields", either remove slop or switch to type: "phrase"/"phrase_prefix" if phrase proximity is required.
"slop": 1

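If phrase proximity is what the query needs, one option is to switch the multi_match type so that slop takes effect. The sketch below builds such a query body; the helper name phrase_query and the field names in the usage line are hypothetical and do not reflect the fields used in search/es_util.py.

```python
def phrase_query(text: str, fields: list, slop: int = 1) -> dict:
    """Build a multi_match body where slop applies (hypothetical helper).

    slop is honoured for type "phrase" and "phrase_prefix"; with
    "best_fields" it has no effect on matching.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "type": "phrase",
                "slop": slop,
            }
        }
    }


# Field names here are placeholders, not the PR's mapping.
body = phrase_query("calculus ii", ["course_title", "course_code"])
```
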
search/es_util.py:110

  • [nitpick] Update this function docstring to mention the addition of the Roman numeral synonym filter and course_analyzer so that maintainers are aware of the new indexing behavior.
Index courses into Elasticsearch.

search/es_util.py:279

  • [nitpick] Consider revising this comment for clarity and tone (e.g., remove "literally all it does it" and explain why single-character tokens are excluded).
# literally all it does it to remove tokens shorter than 2 characters after analyzer tokenizes the string 

search/es_util.py:118

  • Add or update tests that verify the Roman numeral synonyms are correctly applied during indexing and that searches for "I" vs "1" return equivalent results.
"roman_numerals_synonym_filter": {

search/es_util.py:166

  • Deleting and recreating the index on each run will drop existing data. Consider using index templates or migrations to update mappings without data loss.
if es.indices.exists(index="courses"):

search/es_util.py:287

  • A min_length_filter with min: 2 will drop valid single-character tokens (e.g., initials or names like 'O'). Confirm this behavior is acceptable or adjust the filter.
"min": 2  

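To make the effect of min: 2 concrete, the sketch below pairs a length token filter definition with a small pure-Python stand-in that mimics it on an already-tokenized name. The function apply_min_length and the sample tokens are assumptions for illustration; the real filtering happens inside Elasticsearch at analysis time.

```python
# Elasticsearch "length" token filter: drops tokens shorter than `min`.
MIN_LENGTH_FILTER = {"type": "length", "min": 2}


def apply_min_length(tokens, min_length=MIN_LENGTH_FILTER["min"]):
    """Local stand-in for the length filter: keep tokens of at least min_length characters."""
    return [token for token in tokens if len(token) >= min_length]


# A middle initial and a one-letter name part are both dropped.
kept = apply_min_length(["john", "q", "o", "neil"])
```

With these sample tokens only "john" and "neil" survive, which is the trade-off the comment above asks the author to confirm.
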
Comment on lines +319 to +321
if es.indices.exists(index="instructors"):
    es.indices.delete(index="instructors")
es.indices.create(index="instructors", body=settings)

Copilot AI Jun 4, 2025

[nitpick] The index creation logic for courses and instructors is duplicated. Consider extracting a helper function to DRY this pattern and simplify future changes.

Suggested change
- if es.indices.exists(index="instructors"):
-     es.indices.delete(index="instructors")
- es.indices.create(index="instructors", body=settings)
+ create_index(es, "instructors", settings)
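
The suggestion names create_index but does not show its body. A minimal sketch, assuming it simply wraps the three lines it replaces and keeps the same delete-then-create behavior (so existing data is still dropped), might look like this:

```python
def create_index(es, name, settings):
    """Drop the index `name` if it exists, then recreate it with `settings`.

    Assumed shape of the helper; `es` is an Elasticsearch client exposing
    indices.exists, indices.delete and indices.create.
    """
    if es.indices.exists(index=name):
        es.indices.delete(index=name)
    es.indices.create(index=name, body=settings)
```

Both call sites would then reduce to create_index(es, "courses", settings) and create_index(es, "instructors", settings).
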

@twangodev
Owner

twangodev commented Jun 4, 2025

I'll leave this PR open for now. There's a lot to review, and I want to make sure this doesn't affect the current search scoring.

I think we should start looking into CI/CD checks for search: define some test cases and measure how well search matches queries (e.g., the job passes if 75% of queries return an expected result and fails at 50% or lower). That should help us clearly define what makes the search better.
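
A rough sketch of what such a check could compute, assuming each test case is a (query, expected result ID) pair and the search is exposed as a callable returning a list of result IDs. The names hit_rate and passes, the case format, and the single 0.75 pass threshold are all assumptions; how to treat rates between 50% and 75% is left open here.

```python
def hit_rate(cases, search):
    """Fraction of (query, expected_id) cases whose expected ID appears in the results."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one test case is required")
    hits = sum(1 for query, expected in cases if expected in search(query))
    return hits / len(cases)


def passes(rate, threshold=0.75):
    """Return True when the hit rate meets the (assumed) pass threshold."""
    return rate >= threshold
```

In a CI job, the script would exit non-zero when passes(...) is False, with search wired to the real Elasticsearch query function.
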

@ProfessorAtomicManiac
Collaborator Author

I can try to look into CI/CD. I actually need to do CI/CD for my job later, so I'd really like to learn it.


Successfully merging this pull request may close these issues.

add support for roman numerals within search
2 participants