Feature/improved-language-detection #525

Open
wants to merge 18 commits into
base: main
from

Conversation

suzinyou
Collaborator

Reviewer: @lickem22
Estimate: 1 hr


Ticket

Fixes:

Description

Detect both language AND script, and use them in generating the response.

Goal

Changes

  1. In /chat, we used to first paraphrase the query into an optimized search query, then run that modified query through the guardrails+search pipeline. In some cases, this initial modification rendered Hinglish sentences as English (e.g. "Portal kahan hai" -> "Portal location"; note that Hinglish speakers often use English words like "location"). So we no longer use the modified search query. (TODO: remove that logic)
  2. New IdentifiedScript enum and modified prompts
  3. Modified test cases
  4. Modified language detection guardrail test
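
As an illustration of item 2, here is a minimal sketch of what a script enum paired with a detected language could look like. The member names, the `DetectedLanguage` container, and the `parse_script` helper are assumptions made for this example; the actual `IdentifiedScript` enum in this PR may define different members and helpers.

```python
from dataclasses import dataclass
from enum import Enum


class IdentifiedScript(str, Enum):
    """Hypothetical script labels; the PR's real members may differ."""

    LATIN = "Latin"
    DEVANAGARI = "Devanagari"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class DetectedLanguage:
    """Pairs a language with the script it was written in (assumed shape)."""

    language: str
    script: IdentifiedScript


def parse_script(label: str) -> IdentifiedScript:
    """Map a raw label (e.g. from an LLM response) to the enum.

    Matching is case-insensitive; anything unrecognized becomes UNKNOWN.
    """
    normalized = label.strip().lower()
    for script in IdentifiedScript:
        if script.value.lower() == normalized:
            return script
    return IdentifiedScript.UNKNOWN


# Hinglish: Hindi written in Latin script.
hinglish = DetectedLanguage(language="Hindi", script=parse_script("latin"))
```

Keeping the script separate from the language lets a Hinglish query be represented as Hindi in Latin script rather than being collapsed into English.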

Future Tasks (optional)

How has this been tested?

For guardrails, at project root

make setup-llm-proxy
python -m pytest core_backend/tests/rails/test_language_identification.py

For the question-answering endpoints, test various scenarios in the dev environment.

To-do before merge (optional)

Checklist

Fill with x for completed.

  • My code follows the style guidelines of this project
  • I have reviewed my own code to ensure good quality
  • I have tested the functionality of my code to ensure it works as intended
  • I have resolved merge conflicts

(Delete any items below that are not relevant)

  • I have updated the automated tests


IMPORTANT NOTES ON THE "answer" FIELD:
- Keep in mind that the user is asking a {message_type} question.
- Answer in the language {original_language} in the script {original_script}.
Collaborator Author

This is the only line that actually changed. Everything else is exactly the same!
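
For readers unfamiliar with how such a prompt fragment gets filled in, here is a minimal sketch using plain `str.format`. The placeholder names come from the prompt lines quoted above; the constant name `ANSWER_NOTES_TEMPLATE`, the `render_answer_notes` helper, and the example values are hypothetical and may not match how the project actually renders its prompts.

```python
# Prompt fragment as quoted in this review; the constant name is hypothetical.
ANSWER_NOTES_TEMPLATE = (
    'IMPORTANT NOTES ON THE "answer" FIELD:\n'
    "- Keep in mind that the user is asking a {message_type} question.\n"
    "- Answer in the language {original_language} "
    "in the script {original_script}.\n"
)


def render_answer_notes(
    message_type: str, original_language: str, original_script: str
) -> str:
    """Fill the three placeholders of the prompt fragment."""
    return ANSWER_NOTES_TEMPLATE.format(
        message_type=message_type,
        original_language=original_language,
        original_script=original_script,
    )


# Example values are illustrative only.
notes = render_answer_notes("follow-up", "Hindi", "Latin")
```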

Comment on lines 177 to 221
"query": "The vector database query that you have constructed based on
the user's LATEST MESSAGE and the conversation history."
Collaborator Author

We now rely only on the paraphrase guardrail for search.

Comment on lines -120 to -121
ADDITIONAL RELEVANT INFORMATION BELOW
=====================================
Collaborator Author

This would render as a heading when displayed in Markdown, and I thought the HTML syntax was clearer!

Comment on lines 651 to 849
if user_query_refined.chat_query_params:
    user_query_refined.query_text = user_query_refined.chat_query_params.pop(
        "search_query"
    )
Collaborator Author

This line meant that we were passing the modified search query through the entire guardrail+RAG pipeline, instead of the original text.

Contributor

Then should the search query be modified only after the original query has gone through the guardrails?
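
To make the ordering under discussion concrete, here is a minimal sketch in which the guardrails see the original text and only the retrieval step uses a paraphrase. Every name below (`Query`, `check_guardrails`, `paraphrase_for_search`, `run_pipeline`) is a hypothetical stand-in, not the repository's actual API, and the stub bodies are placeholders for the real LLM-backed steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    original_text: str
    search_text: Optional[str] = None  # set only after guardrails pass


def check_guardrails(text: str) -> bool:
    """Hypothetical stand-in for the guardrails: accept non-empty text."""
    return bool(text.strip())


def paraphrase_for_search(text: str) -> str:
    """Hypothetical stand-in for the paraphrase step (here: trim only)."""
    return text.strip().rstrip("?")


def run_pipeline(text: str) -> Query:
    """Run guardrails on the ORIGINAL text, then derive the search text."""
    query = Query(original_text=text)
    if not check_guardrails(query.original_text):
        return query  # blocked: never paraphrased, never searched
    query.search_text = paraphrase_for_search(query.original_text)
    return query


result = run_pipeline("Portal kahan hai?")
```

With this ordering, language and script detection operate on the text the user actually typed, so a Hinglish query is not turned into English before it is classified.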

@suzinyou suzinyou force-pushed the feature/improved-language-detection branch from 3988358 to 50dc513 Compare April 10, 2025 13:19
@suzinyou suzinyou marked this pull request as ready for review April 10, 2025 13:19
@@ -460,10 +492,9 @@ async def wrapper(
The appropriate response object.
"""

if not query_refined.chat_query_params:
Contributor

@lickem22 lickem22 Apr 11, 2025

Why did you remove the check? I am guessing that paraphrasing keywords instead of a full sentence is not useful.

@suzinyou suzinyou force-pushed the feature/improved-language-detection branch from 9959e13 to 40187b8 Compare April 17, 2025 03:48
@suzinyou suzinyou requested a review from lickem22 April 17, 2025 03:52
2 participants