-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: tf/dr flow improvements #5380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR implements significant improvements to the TF/DR (Task Force/Deep Research) agent search system, introducing configurable timeout management and a new FAST research mode. The core changes include:
Timeout System Overhaul: Replaces hardcoded timeout values throughout the TF/DR pipeline with a configurable multiplier system via the TF_DR_TIMEOUT_MULTIPLIER
environment variable. Base timeouts are increased across the board (e.g., 40→60s, 30→50s, 15→30s) and then multiplied by the configurable factor, allowing deployment-specific tuning without code changes.
New FAST Research Mode: Introduces ResearchType.FAST
as a lightweight research option with a 0.5-second time budget, creating a three-tier system (FAST < THOUGHTFUL < DEEP). The system now intelligently routes requests based on complexity and configuration flags.
Enhanced Agent Logic: Updates the answer generation flow to support three research types instead of the previous binary choice. Non-agentic searches can now default to either THOUGHTFUL or FAST mode based on the TF_DR_DEFAULT_THOUGHTFUL
configuration flag.
Prompt Improvements: Refines DR prompts to eliminate Pydantic validation errors by making JSON formatting requirements more explicit, particularly for the 'next_step' field where parameter references were causing parsing failures.
Bug Fixes: Addresses an edge case in citation validation where empty citation lists could cause crashes, and includes general code cleanup by removing commented dead code.
The changes span the entire TF/DR pipeline from orchestration nodes to individual sub-agents, creating a more robust and flexible research system that can adapt to different performance requirements and deployment environments.
Confidence Score: 2/5
- This PR requires careful review due to critical configuration issues and undefined imports that will prevent deployment
- Score lowered due to undefined
TF_DR_TIMEOUT_MULTIPLIER
constant and severe code duplication issues in the configs file - Pay close attention to
backend/onyx/configs/agent_configs.py
and verify all timeout multiplier imports resolve correctly
12 files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
13 issues found across 13 files
React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai
to give feedback, ask questions, or re-run the review.
backend/onyx/agents/agent_search/dr/nodes/dr_a0_clarification.py
Outdated
Show resolved
Hide resolved
backend/onyx/agents/agent_search/dr/nodes/dr_a0_clarification.py
Outdated
Show resolved
Hide resolved
backend/onyx/agents/agent_search/dr/nodes/dr_a0_clarification.py
Outdated
Show resolved
Hide resolved
backend/onyx/agents/agent_search/dr/nodes/dr_a1_orchestrator.py
Outdated
Show resolved
Hide resolved
backend/onyx/agents/agent_search/dr/sub_agents/web_search/dr_ws_2_search.py
Outdated
Show resolved
Hide resolved
backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_2_act.py
Outdated
Show resolved
Hide resolved
backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_2_act.py
Outdated
Show resolved
Hide resolved
...yx/agents/agent_search/dr/sub_agents/generic_internal_tool/dr_generic_internal_tool_2_act.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm! Not sure if we need to expose TF_DR_TIMEOUT_MULTIPLIER as an env variable since it seems like a pretty advanced setting.
also curious to see what FAST mode looks like (will it still do some thinking? or just go and directly answer the question)
backend/onyx/agents/agent_search/dr/sub_agents/web_search/dr_ws_6_summarize.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This review covers only the changes made since the last review, not the entire PR.
The latest changes in this PR implement a complete refactoring of the timeout configuration system for the TF/DR (Thoughtful/Fast/Deep Research) agent flows. The primary change replaces the previous multiplier-based timeout approach (TF_DR_TIMEOUT_MULTIPLIER
) with explicit timeout constants (TF_DR_TIMEOUT_LONG
and TF_DR_TIMEOUT_SHORT
). This change affects multiple files across the TF/DR system including orchestrator nodes, sub-agents, and search components.
The refactoring addresses previous concerns about unbounded timeout multipliers that could cause negative or zero timeout values. The new approach provides two predefined timeout values (likely 120 and 60 seconds based on the config) that can be configured via environment variables. This change impacts approximately 10+ files across the TF/DR system, from clarification nodes to web search components to custom tool execution.
Additionally, the changes introduce support for a new ResearchType.FAST mode in several components, particularly in the closer and orchestrator nodes. The FAST mode appears to use the same prompt structure as THOUGHTFUL mode but with different timeout characteristics. The PR also includes improved error handling with safer logging practices and citation validation to prevent runtime errors when documents are not retrieved.
Confidence score: 2/5
- This PR addresses previous timeout validation concerns but introduces a significant configuration duplication issue that needs immediate attention
- Score lowered due to extensive variable duplication in
agent_configs.py
that creates maintenance risk and potential configuration inconsistencies - Pay close attention to
backend/onyx/configs/agent_configs.py
which contains numerous duplicate variable definitions throughout the file
11 files reviewed, 9 comments
backend/onyx/prompts/dr_prompts.py
Outdated
- Please cite your sources inline in format [[2]][[4]], etc! The numbers of the documents \ | ||
are provided above. | ||
- THIS IS VERY IMPORTANT: Please cite your sources inline in format [[2]][[4]], etc! The numbers of the documents \ | ||
are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syntax: Typo: 'yoyu' should be 'you'
are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \ | |
are provided above. Also, if you refer to sub-answers, the provided reference numbers \ |
), | ||
schema=ClarificationGenerationResponse, | ||
timeout_override=25, | ||
timeout_override=int(TF_DR_TIMEOUT_SHORT), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Using int(TF_DR_TIMEOUT_SHORT)
suggests the constant might be a float - consider making timeout constants integers if they're always used as integers
), | ||
schema=SearchAnswer, | ||
timeout_override=40, | ||
timeout_override=TF_DR_TIMEOUT_LONG, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Inconsistent type handling - line 99 uses int()
wrapper while this line doesn't, despite both using timeout constants
timeout_override=TF_DR_TIMEOUT_LONG, | |
timeout_override=int(TF_DR_TIMEOUT_LONG), |
answer_string = str( | ||
graph_config.tooling.primary_llm.invoke( | ||
tool_summary_prompt, timeout_override=40 | ||
tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: The int()
conversion is unnecessary since TF_DR_TIMEOUT_SHORT should already be an integer constant
tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT) | |
tool_summary_prompt, timeout_override=TF_DR_TIMEOUT_SHORT |
6 Jobs Failed: Run MIT Integration Tests v2 / integration-tests-mit (connector_job_tests/sharepoint, connector-sharepoint)Step "Start Docker containers" from job "integration-tests-mit (connector_job_tests/sharepoint, connector-sharepoint)" is failing. The last 20 log lines are:
Run MIT Integration Tests v2 / integration-tests-mit (tests/connector, tests-connector)Step "Start Docker containers" from job "integration-tests-mit (tests/connector, tests-connector)" is failing. The last 20 log lines are:
Run MIT Integration Tests v2 / integration-tests-mit (tests/kg, tests-kg)Step "Start Docker containers" from job "integration-tests-mit (tests/kg, tests-kg)" is failing. The last 20 log lines are:
Run MIT Integration Tests v2 / integration-tests-mit (tests/permissions, tests-permissions)Step "Start Docker containers" from job "integration-tests-mit (tests/permissions, tests-permissions)" is failing. The last 20 log lines are:
Run MIT Integration Tests v2 / integration-tests-mit (tests/personas, tests-personas)Step "Start Docker containers" from job "integration-tests-mit (tests/personas, tests-personas)" is failing. The last 20 log lines are:
Run MIT Integration Tests v2 / integration-tests-mit (tests/query_history, tests-query_history)Step "Start Docker containers" from job "integration-tests-mit (tests/query_history, tests-query_history)" is failing. The last 20 log lines are:
Summary: 9 successful workflows, 1 failed workflow
Last updated: 2025-09-11 01:47:43 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The current changes focus specifically on removing unnecessary int()
wrapper conversions around timeout constants in the TF/DR (Thought/Deep Research) agent system.
The changes standardize timeout handling by removing redundant int()
conversions from TF_DR_TIMEOUT_SHORT
constants across six files in the DR agent subsystem:
dr_basic_search_2_act.py
- Basic search node timeout cleanupdr_custom_tool_2_act.py
- Custom tool execution timeout cleanupdr_a1_orchestrator.py
- Orchestrator timeout cleanupdr_ws_6_summarize.py
- Web search summarization timeout cleanupdr_generic_internal_tool_2_act.py
- Generic internal tool timeout cleanupdr_ws_2_search.py
- Web search timeout cleanupdr_a0_clarification.py
- Clarification timeout cleanup
These changes align with the broader PR improvements that introduced the TF_DR_TIMEOUT_MULTIPLIER
environment variable for configurable timeout scaling. By removing the int()
wrappers, the code assumes that TF_DR_TIMEOUT_SHORT
is now properly typed as an integer constant, likely due to changes in how timeout constants are defined or calculated with the new multiplier system. This creates consistency across the codebase where some timeout usages already omitted the int()
conversion while others included it.
Confidence score: 4/5
- This PR contains safe cleanup changes that standardize timeout handling patterns across the TF/DR system
- Score reflects that these are simple consistency improvements with minimal risk, though the underlying timeout multiplier system could still have validation issues noted in previous reviews
- Pay close attention to ensuring the timeout constants are properly validated elsewhere in the system to prevent negative or zero timeout values
7 files reviewed, no comments
Description
Various improvements:
Tickets:
https://linear.app/danswer/issue/DAN-2449/pydantic-error-in-orchestrator-step-selection
https://linear.app/danswer/issue/DAN-2450/timeouts-are-too-tight-for-bedrockclaude
How Has This Been Tested?
locally
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.