enable and run contextual rag (search untested) #2793
Conversation
PR Summary
This PR introduces a new Contextual RAG feature across the Danswer application, affecting multiple components:
- Added enable_contextual_rag flag to SearchSettings model and database migration
- Implemented context generation for chunks using LLM in chunker.py
- Modified embedding process to include document summary and chunk context
- Updated UI components to display and configure Contextual RAG settings
- Added max_tokens parameter to LLM interfaces for output control
- Introduced new fields (doc_summary, chunk_context) in various models and Vespa indexing
Key concerns:
- Feature is entirely untested, posing significant risks to system stability
- Potential for increased computational costs without verified benefits
- Lack of proper error handling and edge case considerations
- Incomplete implementation of environment variable controls
- Absence of performance impact assessment on existing functionalities
25 file(s) reviewed, 28 comment(s)
ENABLE_MULTIPASS_INDEXING = (
    os.environ.get("ENABLE_MULTIPASS_INDEXING", "").lower() == "true"
)
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"
logic: This new feature is currently untested. Implement and run tests before merging.
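One possible shape for such a test, sketched here for illustration only (the config module path and the use of importlib.reload to re-evaluate the module-level flag are assumptions, not code from this PR):

import importlib

import danswer.configs.app_configs as app_configs  # assumed location of the flag


def test_enable_contextual_rag_flag(monkeypatch):
    # "true" (any casing) should enable the flag
    monkeypatch.setenv("ENABLE_CONTEXTUAL_RAG", "True")
    importlib.reload(app_configs)
    assert app_configs.ENABLE_CONTEXTUAL_RAG is True

    # an unset variable should leave the flag disabled
    monkeypatch.delenv("ENABLE_CONTEXTUAL_RAG", raising=False)
    importlib.reload(app_configs)
    assert app_configs.ENABLE_CONTEXTUAL_RAG is False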
ENABLE_MULTIPASS_INDEXING = (
    os.environ.get("ENABLE_MULTIPASS_INDEXING", "").lower() == "true"
)
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"
style: Consider adding a comment explaining what Contextual RAG is and its implications.
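For illustration, a comment along these lines could work (the wording below is illustrative, not taken from the PR):

# Contextual RAG: when enabled, an LLM generates a short document summary and a
# per-chunk context that are prepended to each chunk before embedding/indexing.
# This can improve retrieval relevance at the cost of extra LLM calls at index time.
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"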
ENABLE_MULTIPASS_INDEXING = (
    os.environ.get("ENABLE_MULTIPASS_INDEXING", "").lower() == "true"
)
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"
logic: The PR description mentions that the environment variable code path might not work. Verify and fix this issue before merging.
def get_content(self) -> str:
    return " ".join([section.text for section in self.sections])
style: Consider adding a docstring to explain the purpose and usage of this method
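A minimal docstring in the spirit of the suggestion (wording is illustrative):

def get_content(self) -> str:
    """Return the full document text by joining the text of every section with spaces."""
    return " ".join([section.text for section in self.sections])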
content=fields[CONTENT],  # Includes extra title prefix and metadata suffix;
# also sometimes context for contextual rag
source_links=source_links_dict or {0: ""},
logic: The comment suggests that the content field now includes context for Contextual RAG. This change might affect existing functionality that relies on the content field. Verify that this doesn't break any existing features.
prompt: LanguageModelInput,
tools: list[dict] | None = None,
tool_choice: ToolChoiceOptions | None = None,
max_tokens: int | None = None,
style: Add documentation for the max_tokens parameter to explain its purpose and usage.
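One way the documentation could look, assuming these arguments belong to an invoke-style method on the LLM interface (the signature framing and docstring wording are illustrative, not from the PR):

def invoke(
    self,
    prompt: LanguageModelInput,
    tools: list[dict] | None = None,
    tool_choice: ToolChoiceOptions | None = None,
    max_tokens: int | None = None,
):
    """Run the model on the given prompt.

    max_tokens: optional hard cap on the number of output tokens for this call;
        None defers to the model/provider default. Useful for keeping generated
        chunk contexts short when building Contextual RAG chunks.
    """
    ...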
def _remove_contextual_rag(chunk: InferenceChunkUncleaned) -> str:
    # remove document summary
    if chunk.content.startswith(chunk.doc_summary):
        return chunk.content[len(chunk.doc_summary) :].lstrip()
    # remove chunk context
    if chunk.content.endswith(chunk.chunk_context):
        return chunk.content[: -len(chunk.chunk_context)].rstrip()
logic: This function modifies chunk content without any safeguards. Consider adding checks to ensure the content is not empty after removal.
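A sketch of the kind of safeguard the reviewer means (not the PR's code; same field names as above):

def _remove_contextual_rag(chunk: InferenceChunkUncleaned) -> str:
    # strip the document summary, but never return an empty string
    if chunk.doc_summary and chunk.content.startswith(chunk.doc_summary):
        stripped = chunk.content[len(chunk.doc_summary) :].lstrip()
        if stripped:
            return stripped
    # strip the chunk context, with the same guard
    if chunk.chunk_context and chunk.content.endswith(chunk.chunk_context):
        stripped = chunk.content[: -len(chunk.chunk_context)].rstrip()
        if stripped:
            return stripped
    return chunk.content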
for chunk in chunks:
    chunk.content = _remove_title(chunk)
    chunk.content = _remove_metadata_suffix(chunk)
    chunk.content = _remove_contextual_rag(chunk)
logic: The new _remove_contextual_rag function is called here, but its effects are untested. This could lead to unexpected behavior in the search pipeline.
def _remove_contextual_rag(chunk: InferenceChunkUncleaned) -> str:
    # remove document summary
    if chunk.content.startswith(chunk.doc_summary):
        return chunk.content[len(chunk.doc_summary) :].lstrip()
    # remove chunk context
    if chunk.content.endswith(chunk.chunk_context):
        return chunk.content[: -len(chunk.chunk_context)].rstrip()
logic: The function assumes that doc_summary and chunk_context are always at the start and end of the content respectively. This might not always be true and could lead to incorrect content removal.
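One way to drop the early returns and handle a chunk that carries both (or neither) of the fields, sketched under the same assumptions:

def _remove_contextual_rag(chunk: InferenceChunkUncleaned) -> str:
    content = chunk.content
    # strip the summary prefix and the context suffix independently; if either
    # string is empty or not in the expected position, the content is left alone
    if chunk.doc_summary and content.startswith(chunk.doc_summary):
        content = content[len(chunk.doc_summary) :].lstrip()
    if chunk.chunk_context and content.endswith(chunk.chunk_context):
        content = content[: -len(chunk.chunk_context)].rstrip()
    return content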
metadata_suffix_keyword="", | ||
mini_chunk_texts=None, | ||
large_chunk_reference_ids=[], | ||
chunk_context="Test chunk context", |
logic: New 'chunk_context' parameter added, but not used in assertions
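A hypothetical assertion that would exercise the new field (assuming the constructed chunk in the test is bound to chunk and the attribute mirrors the constructor argument):

# verify the new field survives chunk construction
assert chunk.chunk_context == "Test chunk context"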
Description
Implements Contextual RAG across chunk sizes, toggled by a UI flag or an environment variable (in theory; I think the env var code path currently doesn't work for either this or multipass indexing).
How Has This Been Tested?
Indexing has been manually tested; search and retrieval have not. Unit tests are in progress.
Accepted Risk
The main risk is that this option adds significant cost for currently unverified benefits.
It could be fine to call it a "beta test" feature and make it available to gather community feedback, but before that I need to verify that it doesn't break the backend.
Related Issue(s)
[If applicable, link to the issue(s) this PR addresses]
Checklist: