Skip to content

Conversation

evan-onyx
Copy link
Contributor

@evan-onyx evan-onyx commented Oct 14, 2024

Description

implements Contextual RAG across chunk sizes with a UI flag or environment variable (in theory, although I think the env var code path doesn't work for either this or multipass indexing)

How Has This Been Tested?

Indexing has been manually tested, search + retrieval has not. Also working on unit tests.

Accepted Risk

The main risk is that this option adds significant cost for currently untested benefits
It could be fine if we call it a "beta test" feature and just make it available to get community feedback, but before that I need to test that it doesn't break the backend

Related Issue(s)

[If applicable, link to the issue(s) this PR addresses]

Checklist:

  • All of the automated tests pass
  • All PR comments are addressed and marked resolved
  • If there are migrations, they have been rebased to latest main
  • If there are new dependencies, they are added to the requirements
  • If there are new environment variables, they are added to all of the deployment methods
  • If there are new APIs that don't require auth, they are added to PUBLIC_ENDPOINT_SPECS
  • Docker images build and basic functionalities work
  • Author has done a final read through of the PR right before merge

Copy link

vercel bot commented Oct 14, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 12, 2024 11:27pm

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR introduces a new Contextual RAG feature across the Danswer application, affecting multiple components:

  • Added enable_contextual_rag flag to SearchSettings model and database migration
  • Implemented context generation for chunks using LLM in chunker.py
  • Modified embedding process to include document summary and chunk context
  • Updated UI components to display and configure Contextual RAG settings
  • Added max_tokens parameter to LLM interfaces for output control
  • Introduced new fields (doc_summary, chunk_context) in various models and Vespa indexing

Key concerns:

  • Feature is entirely untested, posing significant risks to system stability
  • Potential for increased computational costs without verified benefits
  • Lack of proper error handling and edge case considerations
  • Incomplete implementation of environment variable controls
  • Absence of performance impact assessment on existing functionalities

25 file(s) reviewed, 28 comment(s)
Edit PR Review Bot Settings | Greptile

ENABLE_MULTIPASS_INDEXING = (
os.environ.get("ENABLE_MULTIPASS_INDEXING", "").lower() == "true"
)
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: This new feature is currently untested. Implement and run tests before merging.

ENABLE_MULTIPASS_INDEXING = (
os.environ.get("ENABLE_MULTIPASS_INDEXING", "").lower() == "true"
)
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Consider adding a comment explaining what Contextual RAG is and its implications.

ENABLE_MULTIPASS_INDEXING = (
os.environ.get("ENABLE_MULTIPASS_INDEXING", "").lower() == "true"
)
ENABLE_CONTEXTUAL_RAG = os.environ.get("ENABLE_CONTEXTUAL_RAG", "").lower() == "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The PR description mentions that the environment variable code path might not work. Verify and fix this issue before merging.

Comment on lines +145 to +146
def get_content(self) -> str:
return " ".join([section.text for section in self.sections])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Consider adding a docstring to explain the purpose and usage of this method

Comment on lines +129 to 132
content=fields[CONTENT], # Includes extra title prefix and metadata suffix;
# also sometimes context for contextual rag
source_links=source_links_dict or {0: ""},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The comment suggests that the content field now includes context for Contextual RAG. This change might affect existing functionality that relies on the content field. Verify that this doesn't break any existing features.

prompt: LanguageModelInput,
tools: list[dict] | None = None,
tool_choice: ToolChoiceOptions | None = None,
max_tokens: int | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Add documentation for the max_tokens parameter to explain its purpose and usage

Comment on lines 71 to 77
def _remove_contextual_rag(chunk: InferenceChunkUncleaned) -> str:
# remove document summary
if chunk.content.startswith(chunk.doc_summary):
return chunk.content[len(chunk.doc_summary) :].lstrip()
# remove chunk context
if chunk.content.endswith(chunk.chunk_context):
return chunk.content[: -len(chunk.chunk_context)].rstrip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: This function modifies chunk content without any safeguards. Consider adding checks to ensure the content is not empty after removal.

for chunk in chunks:
chunk.content = _remove_title(chunk)
chunk.content = _remove_metadata_suffix(chunk)
chunk.content = _remove_contextual_rag(chunk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The new _remove_contextual_rag function is called here, but its effects are untested. This could lead to unexpected behavior in the search pipeline.

Comment on lines 71 to 77
def _remove_contextual_rag(chunk: InferenceChunkUncleaned) -> str:
# remove document summary
if chunk.content.startswith(chunk.doc_summary):
return chunk.content[len(chunk.doc_summary) :].lstrip()
# remove chunk context
if chunk.content.endswith(chunk.chunk_context):
return chunk.content[: -len(chunk.chunk_context)].rstrip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The function assumes that doc_summary and chunk_context are always at the start and end of the content respectively. This might not always be true and could lead to incorrect content removal.

metadata_suffix_keyword="",
mini_chunk_texts=None,
large_chunk_reference_ids=[],
chunk_context="Test chunk context",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: New 'chunk_context' parameter added, but not used in assertions

hagen-danswer and others added 10 commits October 23, 2024 19:57
* Tiny confluence fix

* Update utils.py

---------

Co-authored-by: pablodanswer <pablo@danswer.ai>
* functional notifications

* typing

* minor

* ports

* nit

* verify functionality

* pretty
* fresh indexing feature branch

* cherry pick test

* Revert "cherry pick test"

This reverts commit 2a62422.

* set multitenant so that vespa fields match when indexing

* cleanup pass

* mypy

* pass through env var to control celery indexing concurrency

* comments on task kickoff and some logging improvements

* disentangle configuration for different workers and beats.

* use get_session_with_tenant

* comment out all of update.py

* rename to RedisConnectorIndexingFenceData

* first check num_indexing_workers

* refactor RedisConnectorIndexingFenceData

* comment out on_worker_process_init

* missed a file

* scope db sessions to short lengths

* update launch.json template

* fix types

* keep index button disabled until indexing is truly finished

* change priority order of tooltips

* should be using the logger from app_base

* if we run out of retries, just mark the doc as modified so it gets synced later

* tighten up the logging ... we know these are ID's

* add logging
* add default schema config

* resolve circular import

* k
* Temporary fix for empty Google App credentials

* added it to credential creation
* add multi tenancy to redis

* rename context var

* k

* args -> kwargs

* minor update to kv interface

* robustify
* clear up llm

* remove logs
* working chat feedback dump script (with api addition)

* mypy fix

* comment out pydantic models (but leave for reference)

* small code review tweaks

* bump to clear vercel issue?
pablonyx and others added 27 commits October 24, 2024 21:27
* add global assistants context

* nit

* minor cleanup

* minor clarity

* nit
* can't add to primary_worker_locks if it doesn't exist

* move init
* try hiding celery task spam

* mypy fix
* checkpoint

* k

* k

* k

* fixed slack api calls

* missed one

---------

Co-authored-by: hagen-danswer <hagen@danswer.ai>
* silence log

* silence
loopio connector: entry["id"] can apparently be a number, so convert to str
* first cut at deletion hardening

* clean up logging

* remove commented code
@evan-onyx
Copy link
Contributor Author

old version of contextual retrieval

@evan-onyx evan-onyx closed this Feb 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants