Skip to content

Conversation

joachim-danswer
Copy link
Contributor

@joachim-danswer joachim-danswer commented Sep 9, 2025

Description

Various improvements:

  • env variables for tf/dr timeouts
  • hidden FAST flow
  • fixed internal search issue if no docs retrieved
  • improved prompts, particularly for pydantic error in next step determination

Tickets:
https://linear.app/danswer/issue/DAN-2449/pydantic-error-in-orchestrator-step-selection

https://linear.app/danswer/issue/DAN-2450/timeouts-are-too-tight-for-bedrockclaude

How Has This Been Tested?

locally

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@joachim-danswer joachim-danswer requested a review from a team as a code owner September 9, 2025 19:43
Copy link

vercel bot commented Sep 9, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Sep 10, 2025 9:15pm

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR implements significant improvements to the TF/DR (Task Force/Deep Research) agent search system, introducing configurable timeout management and a new FAST research mode. The core changes include:

Timeout System Overhaul: Replaces hardcoded timeout values throughout the TF/DR pipeline with a configurable multiplier system via the TF_DR_TIMEOUT_MULTIPLIER environment variable. Base timeouts are increased across the board (e.g., 40→60s, 30→50s, 15→30s) and then multiplied by the configurable factor, allowing deployment-specific tuning without code changes.

New FAST Research Mode: Introduces ResearchType.FAST as a lightweight research option with a 0.5-second time budget, creating a three-tier system (FAST < THOUGHTFUL < DEEP). The system now intelligently routes requests based on complexity and configuration flags.

Enhanced Agent Logic: Updates the answer generation flow to support three research types instead of the previous binary choice. Non-agentic searches can now default to either THOUGHTFUL or FAST mode based on the TF_DR_DEFAULT_THOUGHTFUL configuration flag.

Prompt Improvements: Refines DR prompts to eliminate Pydantic validation errors by making JSON formatting requirements more explicit, particularly for the 'next_step' field where parameter references were causing parsing failures.

Bug Fixes: Addresses an edge case in citation validation where empty citation lists could cause crashes, and includes general code cleanup by removing commented dead code.

The changes span the entire TF/DR pipeline from orchestration nodes to individual sub-agents, creating a more robust and flexible research system that can adapt to different performance requirements and deployment environments.

Confidence Score: 2/5

  • This PR requires careful review due to critical configuration issues and undefined imports that will prevent deployment
  • Score lowered due to undefined TF_DR_TIMEOUT_MULTIPLIER constant and severe code duplication issues in the configs file
  • Pay close attention to backend/onyx/configs/agent_configs.py and verify all timeout multiplier imports resolve correctly

12 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 issues found across 13 files

React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.

Copy link
Contributor

@rguan72 rguan72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm! Not sure if we need to expose TF_DR_TIMEOUT_MULTIPLIER as an env variable since it seems like a pretty advanced setting.

also curious to see what FAST mode looks like (will it still do some thinking? or just go and directly answer the question)

@joachim-danswer joachim-danswer changed the title TF/DR Flow Improvements fix: tf/dr flow improvements Sep 9, 2025
@joachim-danswer
Copy link
Contributor Author

@greptileai

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

The latest changes in this PR implement a complete refactoring of the timeout configuration system for the TF/DR (Thoughtful/Fast/Deep Research) agent flows. The primary change replaces the previous multiplier-based timeout approach (TF_DR_TIMEOUT_MULTIPLIER) with explicit timeout constants (TF_DR_TIMEOUT_LONG and TF_DR_TIMEOUT_SHORT). This change affects multiple files across the TF/DR system including orchestrator nodes, sub-agents, and search components.

The refactoring addresses previous concerns about unbounded timeout multipliers that could cause negative or zero timeout values. The new approach provides two predefined timeout values (likely 120 and 60 seconds based on the config) that can be configured via environment variables. This change impacts approximately 10+ files across the TF/DR system, from clarification nodes to web search components to custom tool execution.

Additionally, the changes introduce support for a new ResearchType.FAST mode in several components, particularly in the closer and orchestrator nodes. The FAST mode appears to use the same prompt structure as THOUGHTFUL mode but with different timeout characteristics. The PR also includes improved error handling with safer logging practices and citation validation to prevent runtime errors when documents are not retrieved.

Confidence score: 2/5

  • This PR addresses previous timeout validation concerns but introduces a significant configuration duplication issue that needs immediate attention
  • Score lowered due to extensive variable duplication in agent_configs.py that creates maintenance risk and potential configuration inconsistencies
  • Pay close attention to backend/onyx/configs/agent_configs.py which contains numerous duplicate variable definitions throughout the file

11 files reviewed, 9 comments

Edit Code Review Bot Settings | Greptile

- Please cite your sources inline in format [[2]][[4]], etc! The numbers of the documents \
are provided above.
- THIS IS VERY IMPORTANT: Please cite your sources inline in format [[2]][[4]], etc! The numbers of the documents \
are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: Typo: 'yoyu' should be 'you'

Suggested change
are provided above. Also, if yoyu refer to sub-answers, the provided reference numbers \
are provided above. Also, if you refer to sub-answers, the provided reference numbers \

),
schema=ClarificationGenerationResponse,
timeout_override=25,
timeout_override=int(TF_DR_TIMEOUT_SHORT),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Using int(TF_DR_TIMEOUT_SHORT) suggests the constant might be a float - consider making timeout constants integers if they're always used as integers

),
schema=SearchAnswer,
timeout_override=40,
timeout_override=TF_DR_TIMEOUT_LONG,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Inconsistent type handling - line 99 uses int() wrapper while this line doesn't, despite both using timeout constants

Suggested change
timeout_override=TF_DR_TIMEOUT_LONG,
timeout_override=int(TF_DR_TIMEOUT_LONG),

answer_string = str(
graph_config.tooling.primary_llm.invoke(
tool_summary_prompt, timeout_override=40
tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: The int() conversion is unnecessary since TF_DR_TIMEOUT_SHORT should already be an integer constant

Suggested change
tool_summary_prompt, timeout_override=int(TF_DR_TIMEOUT_SHORT)
tool_summary_prompt, timeout_override=TF_DR_TIMEOUT_SHORT

Copy link

blacksmith-sh bot commented Sep 10, 2025

6 Jobs Failed:

Run MIT Integration Tests v2 / integration-tests-mit (connector_job_tests/sharepoint, connector-sharepoint)

Step "Start Docker containers" from job "integration-tests-mit (connector_job_tests/sharepoint, connector-sharepoint)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 index Pulling 
 minio Pulling 
 cache Pulling 
 relational_db Pulling 
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.
Run MIT Integration Tests v2 / integration-tests-mit (tests/connector, tests-connector)

Step "Start Docker containers" from job "integration-tests-mit (tests/connector, tests-connector)" is failing. The last 20 log lines are:

[...]
  SLACK_BOT_TOKEN: ***
  CONFLUENCE_TEST_SPACE_URL: ***
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 index Pulling 
 cache Pulling 
 relational_db Pulling 
 minio Pulling 
toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.
Run MIT Integration Tests v2 / integration-tests-mit (tests/kg, tests-kg)

Step "Start Docker containers" from job "integration-tests-mit (tests/kg, tests-kg)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 minio Pulling 
 relational_db Pulling 
 index Pulling 
 cache Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.
Run MIT Integration Tests v2 / integration-tests-mit (tests/permissions, tests-permissions)

Step "Start Docker containers" from job "integration-tests-mit (tests/permissions, tests-permissions)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 cache Pulling 
 relational_db Pulling 
 minio Pulling 
 index Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.
Run MIT Integration Tests v2 / integration-tests-mit (tests/personas, tests-personas)

Step "Start Docker containers" from job "integration-tests-mit (tests/personas, tests-personas)" is failing. The last 20 log lines are:

[...]
  CONFLUENCE_USER_NAME: ***
  CONFLUENCE_ACCESS_TOKEN: ***
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 index Pulling 
 cache Pulling 
 minio Pulling 
 relational_db Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.
Run MIT Integration Tests v2 / integration-tests-mit (tests/query_history, tests-query_history)

Step "Start Docker containers" from job "integration-tests-mit (tests/query_history, tests-query_history)" is failing. The last 20 log lines are:

[...]
  JIRA_BASE_URL: ***
  JIRA_USER_EMAIL: ***
  JIRA_API_TOKEN: ***
  PERM_SYNC_SHAREPOINT_CLIENT_ID: ***
  PERM_SYNC_SHAREPOINT_PRIVATE_KEY: ***
  PERM_SYNC_SHAREPOINT_CERTIFICATE_PASSWORD: ***
  PERM_SYNC_SHAREPOINT_DIRECTORY_ID: ***
  GITHUB_REPO_NAME: onyx-dot-app/onyx
  AWS_DEFAULT_REGION: us-west-2
 relational_db Pulling 
 cache Pulling 
 index Pulling 
 minio Pulling 
 cache Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 index Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 minio Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
 relational_db Error toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error: Process completed with exit code 1.

Summary: 9 successful workflows, 1 failed workflow

Last updated: 2025-09-11 01:47:43 UTC

@joachim-danswer
Copy link
Contributor Author

@greptileai

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The current changes focus specifically on removing unnecessary int() wrapper conversions around timeout constants in the TF/DR (Thought/Deep Research) agent system.

The changes standardize timeout handling by removing redundant int() conversions from TF_DR_TIMEOUT_SHORT constants across six files in the DR agent subsystem:

  • dr_basic_search_2_act.py - Basic search node timeout cleanup
  • dr_custom_tool_2_act.py - Custom tool execution timeout cleanup
  • dr_a1_orchestrator.py - Orchestrator timeout cleanup
  • dr_ws_6_summarize.py - Web search summarization timeout cleanup
  • dr_generic_internal_tool_2_act.py - Generic internal tool timeout cleanup
  • dr_ws_2_search.py - Web search timeout cleanup
  • dr_a0_clarification.py - Clarification timeout cleanup

These changes align with the broader PR improvements that introduced the TF_DR_TIMEOUT_MULTIPLIER environment variable for configurable timeout scaling. By removing the int() wrappers, the code assumes that TF_DR_TIMEOUT_SHORT is now properly typed as an integer constant, likely due to changes in how timeout constants are defined or calculated with the new multiplier system. This creates consistency across the codebase where some timeout usages already omitted the int() conversion while others included it.

Confidence score: 4/5

  • This PR contains safe cleanup changes that standardize timeout handling patterns across the TF/DR system
  • Score reflects that these are simple consistency improvements with minimal risk, though the underlying timeout multiplier system could still have validation issues noted in previous reviews
  • Pay close attention to ensuring the timeout constants are properly validated elsewhere in the system to prevent negative or zero timeout values

7 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@joachim-danswer joachim-danswer merged commit 3baae2d into main Sep 10, 2025
51 of 58 checks passed
@joachim-danswer joachim-danswer deleted the tf-dr-timeouts branch September 10, 2025 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants