fix: tf/dr flow improvements #5380

joachim-danswer · 2025-09-09T19:43:31Z

Description

Various improvements:

env variables for tf/dr timeouts
hidden FAST flow
fixed internal search issue if no docs retrieved
improved prompts, particularly for pydantic error in next step determination

Tickets:
https://linear.app/danswer/issue/DAN-2449/pydantic-error-in-orchestrator-step-selection

https://linear.app/danswer/issue/DAN-2450/timeouts-are-too-tight-for-bedrockclaude

How Has This Been Tested?

locally

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

This PR should be backported (make sure to check that the backport attempt succeeds)
[Optional] Override Linear Check

vercel · 2025-09-09T19:43:38Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
internal-search	Ready	Preview	Comment	Sep 10, 2025 9:15pm

greptile-apps

Greptile Summary

This PR implements significant improvements to the TF/DR (Task Force/Deep Research) agent search system, introducing configurable timeout management and a new FAST research mode. The core changes include:

Timeout System Overhaul: Replaces hardcoded timeout values throughout the TF/DR pipeline with a configurable multiplier system via the TF_DR_TIMEOUT_MULTIPLIER environment variable. Base timeouts are increased across the board (e.g., 40→60s, 30→50s, 15→30s) and then multiplied by the configurable factor, allowing deployment-specific tuning without code changes.

New FAST Research Mode: Introduces ResearchType.FAST as a lightweight research option with a 0.5-second time budget, creating a three-tier system (FAST < THOUGHTFUL < DEEP). The system now intelligently routes requests based on complexity and configuration flags.

Enhanced Agent Logic: Updates the answer generation flow to support three research types instead of the previous binary choice. Non-agentic searches can now default to either THOUGHTFUL or FAST mode based on the TF_DR_DEFAULT_THOUGHTFUL configuration flag.

Prompt Improvements: Refines DR prompts to eliminate Pydantic validation errors by making JSON formatting requirements more explicit, particularly for the 'next_step' field where parameter references were causing parsing failures.

Bug Fixes: Addresses an edge case in citation validation where empty citation lists could cause crashes, and includes general code cleanup by removing commented dead code.

The changes span the entire TF/DR pipeline from orchestration nodes to individual sub-agents, creating a more robust and flexible research system that can adapt to different performance requirements and deployment environments.

Confidence Score: 2/5

This PR requires careful review due to critical configuration issues and undefined imports that will prevent deployment
Score lowered due to undefined TF_DR_TIMEOUT_MULTIPLIER constant and severe code duplication issues in the configs file
Pay close attention to backend/onyx/configs/agent_configs.py and verify all timeout multiplier imports resolve correctly

_{12 files reviewed, no comments}

_{Edit Code Review Bot Settings | Greptile}

cubic-dev-ai

13 issues found across 13 files

_{React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.}

backend/onyx/agents/agent_search/dr/nodes/dr_a0_clarification.py

backend/onyx/agents/agent_search/dr/nodes/dr_a1_orchestrator.py

backend/onyx/agents/agent_search/dr/nodes/dr_a2_closer.py

backend/onyx/agents/agent_search/dr/sub_agents/web_search/dr_ws_2_search.py

backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_2_act.py

...yx/agents/agent_search/dr/sub_agents/generic_internal_tool/dr_generic_internal_tool_2_act.py

rguan72

overall lgtm! Not sure if we need to expose TF_DR_TIMEOUT_MULTIPLIER as an env variable since it seems like a pretty advanced setting.

also curious to see what FAST mode looks like (will it still do some thinking? or just go and directly answer the question)

backend/onyx/configs/agent_configs.py

backend/onyx/agents/agent_search/dr/sub_agents/web_search/dr_ws_6_summarize.py

backend/onyx/agents/agent_search/dr/nodes/dr_a1_orchestrator.py

joachim-danswer · 2025-09-10T20:52:49Z

@greptileai

greptile-apps

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

The latest changes in this PR implement a complete refactoring of the timeout configuration system for the TF/DR (Thoughtful/Fast/Deep Research) agent flows. The primary change replaces the previous multiplier-based timeout approach (TF_DR_TIMEOUT_MULTIPLIER) with explicit timeout constants (TF_DR_TIMEOUT_LONG and TF_DR_TIMEOUT_SHORT). This change affects multiple files across the TF/DR system including orchestrator nodes, sub-agents, and search components.

The refactoring addresses previous concerns about unbounded timeout multipliers that could cause negative or zero timeout values. The new approach provides two predefined timeout values (likely 120 and 60 seconds based on the config) that can be configured via environment variables. This change impacts approximately 10+ files across the TF/DR system, from clarification nodes to web search components to custom tool execution.

Additionally, the changes introduce support for a new ResearchType.FAST mode in several components, particularly in the closer and orchestrator nodes. The FAST mode appears to use the same prompt structure as THOUGHTFUL mode but with different timeout characteristics. The PR also includes improved error handling with safer logging practices and citation validation to prevent runtime errors when documents are not retrieved.

Confidence score: 2/5

This PR addresses previous timeout validation concerns but introduces a significant configuration duplication issue that needs immediate attention
Score lowered due to extensive variable duplication in agent_configs.py that creates maintenance risk and potential configuration inconsistencies
Pay close attention to backend/onyx/configs/agent_configs.py which contains numerous duplicate variable definitions throughout the file

_{11 files reviewed, 9 comments}

_{Edit Code Review Bot Settings | Greptile}

greptile-apps · 2025-09-10T20:54:33Z

backend/onyx/prompts/dr_prompts.py

- Please cite your sources inline in format [[2]][[4]], etc! The numbers of the documents \
-are provided above.
+- THIS IS VERY IMPORTANT: Please cite your sources inline in format [[2]][[4]], etc! The numbers of the documents \
+are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \


syntax: Typo: 'yoyu' should be 'you'

Suggested change

are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \

are provided above. Also, if you refer to sub-answers, the provided reference numbers \

greptile-apps · 2025-09-10T20:54:34Z

backend/onyx/agents/agent_search/dr/nodes/dr_a0_clarification.py

                    ),
                    schema=ClarificationGenerationResponse,
-                    timeout_override=25,
+                    timeout_override=int(TF_DR_TIMEOUT_SHORT),


style: Using int(TF_DR_TIMEOUT_SHORT) suggests the constant might be a float - consider making timeout constants integers if they're always used as integers

greptile-apps · 2025-09-10T20:54:34Z

backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_2_act.py

            ),
            schema=SearchAnswer,
-            timeout_override=40,
+            timeout_override=TF_DR_TIMEOUT_LONG,


style: Inconsistent type handling - line 99 uses int() wrapper while this line doesn't, despite both using timeout constants

Suggested change

timeout_override=TF_DR_TIMEOUT_LONG,

timeout_override=int(TF_DR_TIMEOUT_LONG),

greptile-apps · 2025-09-10T20:54:35Z

backend/onyx/agents/agent_search/dr/sub_agents/custom_tool/dr_custom_tool_2_act.py

    answer_string = str(
        graph_config.tooling.primary_llm.invoke(
-            tool_summary_prompt, timeout_override=40
+            tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT)


style: The int() conversion is unnecessary since TF_DR_TIMEOUT_SHORT should already be an integer constant

Suggested change

tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT)

tool_summary_prompt, timeout_override=TF_DR_TIMEOUT_SHORT

blacksmith-sh · 2025-09-10T21:17:07Z

6 Jobs Failed:

Run MIT Integration Tests v2 / integration-tests-mit (connector_job_tests/sharepoint, connector-sharepoint)

Step "Start Docker containers" from job "integration-tests-mit (connector_job_tests/sharepoint, connector-sharepoint)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 index Pulling 
 minio Pulling 
 cache Pulling 
 relational_db Pulling 
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Run MIT Integration Tests v2 / integration-tests-mit (tests/connector, tests-connector)

Step "Start Docker containers" from job "integration-tests-mit (tests/connector, tests-connector)" is failing. The last 20 log lines are:

[...]
  SLACK_BOT_TOKEN: ***
  CONFLUENCE_TEST_SPACE_URL: ***
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 index Pulling 
 cache Pulling 
 relational_db Pulling 
 minio Pulling 
toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Run MIT Integration Tests v2 / integration-tests-mit (tests/kg, tests-kg)

Step "Start Docker containers" from job "integration-tests-mit (tests/kg, tests-kg)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 minio Pulling 
 relational_db Pulling 
 index Pulling 
 cache Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Run MIT Integration Tests v2 / integration-tests-mit (tests/permissions, tests-permissions)

Step "Start Docker containers" from job "integration-tests-mit (tests/permissions, tests-permissions)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 cache Pulling 
 relational_db Pulling 
 minio Pulling 
 index Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Run MIT Integration Tests v2 / integration-tests-mit (tests/personas, tests-personas)

Step "Start Docker containers" from job "integration-tests-mit (tests/personas, tests-personas)" is failing. The last 20 log lines are:

[...]
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 index Pulling 
 cache Pulling 
 minio Pulling 
 relational_db Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Run MIT Integration Tests v2 / integration-tests-mit (tests/query_history, tests-query_history)

Step "Start Docker containers" from job "integration-tests-mit (tests/query_history, tests-query_history)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 relational_db Pulling 
 cache Pulling 
 index Pulling 
 minio Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Summary: 9 successful workflows, 1 failed workflow

✅ Backport on Merge (1 job succeeded)
✅ Ensure PR references Linear (1 job succeeded)
✅ Python Checks (1 job succeeded)
✅ External Dependency Unit Tests (5 jobs succeeded)
✅ Run Integration Tests v2 (1 job succeeded)
✅ Python Unit Tests (1 job succeeded)
❌ Run MIT Integration Tests v2 (28 jobs succeeded, 6 jobs failed)
✅ Quality Checks PR (1 job succeeded)
✅ Helm - Lint and Test Charts (1 job succeeded)
✅ Connector Tests (1 job succeeded)

Last updated: 2025-09-11 01:47:43 UTC

joachim-danswer · 2025-09-10T21:58:10Z

@greptileai

greptile-apps

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The current changes focus specifically on removing unnecessary int() wrapper conversions around timeout constants in the TF/DR (Thought/Deep Research) agent system.

The changes standardize timeout handling by removing redundant int() conversions from TF_DR_TIMEOUT_SHORT constants across six files in the DR agent subsystem:

dr_basic_search_2_act.py - Basic search node timeout cleanup
dr_custom_tool_2_act.py - Custom tool execution timeout cleanup
dr_a1_orchestrator.py - Orchestrator timeout cleanup
dr_ws_6_summarize.py - Web search summarization timeout cleanup
dr_generic_internal_tool_2_act.py - Generic internal tool timeout cleanup
dr_ws_2_search.py - Web search timeout cleanup
dr_a0_clarification.py - Clarification timeout cleanup

These changes align with the broader PR improvements that introduced the TF_DR_TIMEOUT_MULTIPLIER environment variable for configurable timeout scaling. By removing the int() wrappers, the code assumes that TF_DR_TIMEOUT_SHORT is now properly typed as an integer constant, likely due to changes in how timeout constants are defined or calculated with the new multiplier system. This creates consistency across the codebase where some timeout usages already omitted the int() conversion while others included it.

Confidence score: 4/5

This PR contains safe cleanup changes that standardize timeout handling patterns across the TF/DR system
Score reflects that these are simple consistency improvements with minimal risk, though the underlying timeout multiplier system could still have validation issues noted in previous reviews
Pay close attention to ensuring the timeout constants are properly validated elsewhere in the system to prevent negative or zero timeout values

_{7 files reviewed, no comments}

_{Edit Code Review Bot Settings | Greptile}

joachim-danswer added 5 commits September 9, 2025 09:31

timeout factor

fca4ad0

timeout default factor reset

13ede6d

Hidden FAST mode and prompt changes

ed0f6b8

default time-out adjustments + more factors

8a006c8

search fix

edabc08

joachim-danswer requested a review from a team as a code owner September 9, 2025 19:43

greptile-apps bot reviewed Sep 9, 2025

View reviewed changes

cubic-dev-ai bot reviewed Sep 9, 2025

View reviewed changes

rguan72 reviewed Sep 9, 2025

View reviewed changes

cubic comments

22241d6

vercel bot deployed to Preview September 9, 2025 21:24 View deployment

joachim-danswer changed the title ~~TF/DR Flow Improvements~~ fix: tf/dr flow improvements Sep 9, 2025

mypy fix

777931b

vercel bot deployed to Preview September 9, 2025 21:39 View deployment

rguan72 approved these changes Sep 9, 2025

View reviewed changes

Weves reviewed Sep 10, 2025

View reviewed changes

backend/onyx/configs/agent_configs.py Outdated Show resolved Hide resolved

backend/onyx/agents/agent_search/dr/sub_agents/web_search/dr_ws_6_summarize.py Outdated Show resolved Hide resolved

backend/onyx/agents/agent_search/dr/nodes/dr_a1_orchestrator.py Show resolved Hide resolved

CW comments + citation prompt adjustments

3b624f1

vercel bot deployed to Preview September 10, 2025 20:27 View deployment

greptile-apps bot reviewed Sep 10, 2025

View reviewed changes

joachim-danswer added 2 commits September 10, 2025 14:10

format nit

c71b633

nit

180b6e7

vercel bot deployed to Preview September 10, 2025 21:15 View deployment

greptile-apps bot reviewed Sep 10, 2025

View reviewed changes

joachim-danswer merged commit 3baae2d into main Sep 10, 2025
51 of 58 checks passed

joachim-danswer deleted the tf-dr-timeouts branch September 10, 2025 23:39

	are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \
	are provided above. Also, if you refer to sub-answers, the provided reference numbers \

	timeout_override=TF_DR_TIMEOUT_LONG,
	timeout_override=int(TF_DR_TIMEOUT_LONG),

	tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT)
	tool_summary_prompt, timeout_override=TF_DR_TIMEOUT_SHORT

fix: tf/dr flow improvements #5380

fix: tf/dr flow improvements #5380

Conversation

joachim-danswer commented Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How Has This Been Tested?

Backporting (check the box to trigger backport action)

Uh oh!

vercel bot commented Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Summary

Confidence Score: 2/5

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rguan72 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joachim-danswer commented Sep 10, 2025

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Summary

Confidence score: 2/5

Uh oh!

greptile-apps bot Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

blacksmith-sh bot commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joachim-danswer commented Sep 10, 2025

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Summary

Confidence score: 4/5

Uh oh!

Uh oh!

Uh oh!

joachim-danswer commented Sep 9, 2025 •

edited

Loading

vercel bot commented Sep 9, 2025 •

edited

Loading

blacksmith-sh bot commented Sep 10, 2025 •

edited

Loading