Skip to content

Conversation

Weves
Copy link
Contributor

@Weves Weves commented Aug 20, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-2269/deep-research-ui
Fixes https://linear.app/danswer/issue/DAN-2270/deep-research-backend

How Has This Been Tested?

Tested locally

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Adds Deep Research v2 with a clarifier→orchestrator→closer workflow and packetized streaming, plus DB support for research iterations and answer purpose. Updates KB search, tools, and chat piping to the new streaming model.

  • New Features

    • Deep Research agent with nodes (Clarifier, Orchestrator, Closer) and sub-agents: internal search, internet search, knowledge graph, image generation, and custom tool.
    • Packet-based streaming model (MessageStart/Delta, SectionEnd, SearchTool*, ImageGeneration*, Reasoning*, Citation*, OverallStop) and unified streaming utils.
    • Database: adds research_type, research_plan, research_answer_purpose on chat messages; new research_agent_iteration tables; migration to move prior agent data.
    • Tools: new KnowledgeGraphTool; all tools expose a stable id; builder wires ids through constructors.
    • KB search refactor: clearer step descriptions, streaming helpers renamed, improved answer generation and state shape.
    • Improved citation processing with a graph-based processor; consistent CitationInfo usage.
    • Chat API/server: returns packets for messages, supports replay from DB, and routes streaming through new models.
    • Prompts: comprehensive DR prompts and a safe PromptTemplate helper.
    • Config: adds research_type to GraphSearchConfig; KG Beta persona description and constants.
  • Migration

    • Run Alembic migrations (adds research fields/tables and migrates legacy agent data).
    • Tool constructors now require tool_id; ensure custom/image/internet tools pass it.
    • grounded_source_name is nullable; KG extraction and configs updated accordingly.

Copy link

vercel bot commented Aug 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Aug 26, 2025 5:45am

Copy link
Contributor

@evan-onyx evan-onyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits

and generate_final_answer
and response_part.answer_piece
):
if chat_message_id is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be checked outside the for loop

)

if (
hasattr(response_part, "answer_piece")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should check isinstance(response_part, (OnyxAnswerPiece, AgentAnswerPiece, ...))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

writer,
)

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment explaining what this case is used for

write_custom_event(
ind,
MessageDelta(
content=response_part.answer_piece, type="message_delta"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message_delta should be a global constant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed all message_delta as they are defaults of the class anyway with only one allowed value

f"{tool_name}: {tool.cost}" for tool_name, tool in available_tools.items()
)

tool_differentiations: list[str] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turn this into a list comp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

focussing on providing the citations and providing some answer facts. But the \
main content should be in the cited documents for each sub-question.
- Pay close attention to whether the sub-answers mention whether the topic of interest \
was explicitly mentioned! If not you cannot reliably use that information to construct your answer, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not you -> delete "not"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

FINAL_ANSWER_PROMPT_WITHOUT_SUB_ANSWERS = PromptTemplate(
f"""
You are great at answering a user question based \
a list of documents that were retrieved in response to subh-questions, and possibly also \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subh -> sub

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

they MUST NOT be part of the rewritten search query... take it out in that case! \
Particularly look for expressions like '...in our Google docs...', '...in our \
Google calls', etc., in which case the source type is 'google_drive' or 'gong' \
should not be included in the rewritten query!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add "which" or "and" to the start of the line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed wording a bit, though differently!

You are great at 1) determining whether a question can be answered \
by you directly using your knowledge alone and the chat history (if any), and 2) actually \
answering the question/request, \
if the request DOES NOT require or would strongly benefit from ANY external tool \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or -> nor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""


"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be commented out

@Weves Weves requested a review from a team as a code owner August 25, 2025 22:39
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

40 issues found across 297 files

Note: This PR contains a large number of files. cubic only reviews up to 150 files per PR, so some files may not have been reviewed.

React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai to give feedback, ask questions, or re-run the review.

return (
"...\n"
if len(chat_history) > len(past_messages)
else ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing concatenation in ternary return causes syntax error or drops chat history content.

Prompt for AI agents
Address the following comment on backend/onyx/agents/agent_search/dr/utils.py at line 192:

<comment>Missing concatenation in ternary return causes syntax error or drops chat history content.</comment>

<file context>
@@ -0,0 +1,333 @@
+import re
+
+from langchain.schema.messages import BaseMessage
+from langchain.schema.messages import HumanMessage
+from sqlalchemy.orm import Session
+
+from onyx.agents.agent_search.dr.enums import ResearchAnswerPurpose
+from onyx.agents.agent_search.dr.enums import ResearchType
+from onyx.agents.agent_search.dr.models import AggregatedDRContext
</file context>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

claims,
) = extract_document_citations(answer_string, claims)
cited_documents = {
citation_number: retrieved_docs[citation_number - 1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out-of-bounds citation indexing may raise IndexError in DEEP path

Prompt for AI agents
Address the following comment on backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_2_act.py at line 223:

<comment>Out-of-bounds citation indexing may raise IndexError in DEEP path</comment>

<file context>
@@ -0,0 +1,258 @@
+import re
+from datetime import datetime
+from typing import cast
+
+from langchain_core.runnables import RunnableConfig
+from langgraph.types import StreamWriter
+
+from onyx.agents.agent_search.dr.enums import ResearchType
+from onyx.agents.agent_search.dr.models import BaseSearchProcessingResponse
</file context>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if not state.available_tools:
raise ValueError("available_tools is not set")

search_tool_info = state.available_tools[state.tools_used[-1]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError/KeyError accessing last used tool without guards

Prompt for AI agents
Address the following comment on backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_2_act.py at line 69:

<comment>Potential IndexError/KeyError accessing last used tool without guards</comment>

<file context>
@@ -0,0 +1,258 @@
+import re
+from datetime import datetime
+from typing import cast
+
+from langchain_core.runnables import RunnableConfig
+from langgraph.types import StreamWriter
+
+from onyx.agents.agent_search.dr.enums import ResearchType
+from onyx.agents.agent_search.dr.models import BaseSearchProcessingResponse
</file context>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

from onyx.db.persona import upsert_persona
from onyx.db.prompts import get_default_prompt
from onyx.tools.built_in_tools import get_search_tool
from onyx.tools.built_in_tools import get_builtin_tool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect SQLAlchemy filter uses is (Python identity) instead of SQL expression, making the default-exists check always false.

Prompt for AI agents
Address the following comment on backend/onyx/db/slack_channel_config.py at line 19:

<comment>Incorrect SQLAlchemy filter uses `is` (Python identity) instead of SQL expression, making the default-exists check always false.</comment>

<file context>
@@ -16,7 +16,8 @@
 from onyx.db.persona import mark_persona_as_deleted
 from onyx.db.persona import upsert_persona
 from onyx.db.prompts import get_default_prompt
-from onyx.tools.built_in_tools import get_search_tool
+from onyx.tools.built_in_tools import get_builtin_tool
+from onyx.tools.tool_implementations.search.search_tool import SearchTool
 from onyx.utils.errors import EERequiredError
</file context>

llm: LLM,
force_run: bool = False,
) -> dict[str, Any] | None:
raise ValueError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KnowledgeGraphTool is registered as built-in but all execution methods raise ValueError; if surfaced, selecting or evaluating it will crash both tool-calling and non-tool-calling flows.

Prompt for AI agents
Address the following comment on backend/onyx/tools/tool_implementations/knowledge_graph/knowledge_graph_tool.py at line 69:

<comment>KnowledgeGraphTool is registered as built-in but all execution methods raise ValueError; if surfaced, selecting or evaluating it will crash both tool-calling and non-tool-calling flows.</comment>

<file context>
@@ -0,0 +1,106 @@
+from collections.abc import Generator
+from typing import Any
+
+from onyx.chat.prompt_builder.answer_prompt_builder import AnswerPromptBuilder
+from onyx.llm.interfaces import LLM
+from onyx.llm.models import PreviousMessage
+from onyx.tools.message import ToolCallSummary
+from onyx.tools.models import ToolResponse
+from onyx.tools.tool import Tool
</file context>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave for now

# TODO: extra review regarding coding style
@property
def llm_name(self) -> str:
return self.display_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llm_name returns display_name, which is not guaranteed unique and can collide when used as a dict key; prefer a stable unique identifier (e.g., name) to avoid overwriting tools.

Prompt for AI agents
Address the following comment on backend/onyx/tools/tool.py at line 48:

<comment>llm_name returns display_name, which is not guaranteed unique and can collide when used as a dict key; prefer a stable unique identifier (e.g., name) to avoid overwriting tools.</comment>

<file context>
@@ -35,6 +40,13 @@ def description(self) -&gt; str:
     def display_name(self) -&gt; str:
         raise NotImplementedError
 
+    # Added to make tools work better with LLMs in prompts. Should be unique
+    # TODO: looks at ways how to best ensure uniqueness.
+    # TODO: extra review regarding coding style
+    @property
+    def llm_name(self) -&gt; str:
+        return self.display_name
</file context>
Suggested change
return self.display_name
return self.name

graph.add_node(DRPath.ORCHESTRATOR, orchestrator)

basic_search_graph = dr_basic_search_graph_builder().compile()
graph.add_node(DRPath.INTERNAL_SEARCH, basic_search_graph)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiled subgraphs with SubAgentMainState are added as nodes to a parent graph with MainState, causing state schema mismatch at runtime

Prompt for AI agents
Address the following comment on backend/onyx/agents/agent_search/dr/graph_builder.py at line 50:

<comment>Compiled subgraphs with SubAgentMainState are added as nodes to a parent graph with MainState, causing state schema mismatch at runtime</comment>

<file context>
@@ -0,0 +1,88 @@
+from langgraph.graph import END
+from langgraph.graph import START
+from langgraph.graph import StateGraph
+
+from onyx.agents.agent_search.dr.conditional_edges import completeness_router
+from onyx.agents.agent_search.dr.conditional_edges import decision_router
+from onyx.agents.agent_search.dr.enums import DRPath
+from onyx.agents.agent_search.dr.nodes.dr_a0_clarification import clarifier
+from onyx.agents.agent_search.dr.nodes.dr_a1_orchestrator import orchestrator
</file context>

sa.ForeignKey("chat_message.id", ondelete="CASCADE"),
nullable=False,
),
sa.Column("iteration_nr", sa.Integer(), nullable=False),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub-steps are not linked to a specific iteration via a foreign key, allowing inconsistent data. Add an iteration_id FK (or a composite FK using a unique constraint) to enforce referential integrity.

Prompt for AI agents
Address the following comment on backend/alembic/versions/5ae8240accb3_add_research_agent_database_tables_and_.py at line 42:

<comment>Sub-steps are not linked to a specific iteration via a foreign key, allowing inconsistent data. Add an iteration_id FK (or a composite FK using a unique constraint) to enforce referential integrity.</comment>

<file context>
@@ -0,0 +1,102 @@
+&quot;&quot;&quot;add research agent database tables and chat message research fields
+
+Revision ID: 5ae8240accb3
+Revises: b558f51620b4
+Create Date: 2025-08-06 14:29:24.691388
+
+&quot;&quot;&quot;
+
+from alembic import op
</file context>

@@ -395,19 +298,14 @@ export function ChatInputBar({

const handleKeyDown = (e: React.KeyboardEvent<HTMLTextAreaElement>) => {
if (
((showSuggestions && assistantTagOptions.length > 0) || showPrompts) &&
(showSuggestions || showPrompts) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enter/Tab handling intercepts keystrokes when showSuggestions is true but no assistant suggestions are rendered, blocking sends after typing a trailing @mention.

(Based on your team's feedback about keeping UX regressions out of refactors and ensuring removed features don't leave dead-end UI states.)

Prompt for AI agents
Address the following comment on web/src/app/chat/components/input/ChatInputBar.tsx at line 301:

<comment>Enter/Tab handling intercepts keystrokes when showSuggestions is true but no assistant suggestions are rendered, blocking sends after typing a trailing @mention.

(Based on your team&#39;s feedback about keeping UX regressions out of refactors and ensuring removed features don&#39;t leave dead-end UI states.)</comment>

<file context>
@@ -395,19 +298,14 @@ export function ChatInputBar({
 
   const handleKeyDown = (e: React.KeyboardEvent&lt;HTMLTextAreaElement&gt;) =&gt; {
     if (
-      ((showSuggestions &amp;&amp; assistantTagOptions.length &gt; 0) || showPrompts) &amp;&amp;
+      (showSuggestions || showPrompts) &amp;&amp;
       (e.key === &quot;Tab&quot; || e.key == &quot;Enter&quot;)
     ) {
</file context>
Suggested change
(showSuggestions || showPrompts) &&
showPrompts &&

GENERAL_DR_ANSWER_PROMPT = PromptTemplate(
f"""\
Below you see a user question and potentially an earlier chat history that can be referred to \
for context. Also, today is {datetime.now().strftime("%Y-%m-%d")}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current date is interpolated at import time; should be provided at render time to avoid stale values.

Prompt for AI agents
Address the following comment on backend/onyx/prompts/dr_prompts.py at line 1202:

<comment>Current date is interpolated at import time; should be provided at render time to avoid stale values.</comment>

<file context>
@@ -0,0 +1,1378 @@
+from datetime import datetime
+
+from onyx.agents.agent_search.dr.constants import MAX_DR_PARALLEL_SEARCH
+from onyx.agents.agent_search.dr.enums import DRPath
+from onyx.agents.agent_search.dr.enums import ResearchType
+from onyx.prompts.prompt_template import PromptTemplate
+
+
+# Standards
</file context>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Weves Weves merged commit 9d997e2 into main Aug 26, 2025
10 of 15 checks passed
@Weves Weves deleted the dr-merge_v2 branch August 26, 2025 07:26
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* squash: combine all DR commits into one

Co-authored-by: Joachim Rahmfeld <joachim@onyx.app>
Co-authored-by: Rei Meguro <rmeguro@umich.edu>

* Fixes

* show KG in Assistant only if available

* KG only usable for KG Beta (for now)

* base file upload

* raise error if uploaded context is too long

* improvements

* More improvements

* Fix citations

* better decision making

* improved decision-making in Orchestrator

* generic_internal tools

* Small tweak

* tool use improvements

* add on

* More image gen stuff

* fixes

* Small color improvements

* Markdown utils

* fixed end conditions (incl early exit for image generation)

* remove agent search + image fixes

* Okta tool support for reload

* Some cleanup

* Stream back search tool results as they come

* tool forcing

* fixed no-Tool-Assistant

* Support anthropic tool calling

* Support anthropic models better

* More stuff

* prompt fixes and search step numbers

* Fix hook ordering issue

* internal search fix

* Improve citation look

* Small UI improvements

* Improvements

* Improve dot

* Small chat fixes

* Small UI tweaks

* Small improvements

* Remove un-used code

* Fix

* Remove test_answer.py for now

* Fix

* improvements

* Add foreign keys

* early forcing

* Fix tests

* Fix tests

---------

Co-authored-by: Joachim Rahmfeld <joachim@onyx.app>
Co-authored-by: Rei Meguro <rmeguro@umich.edu>
Co-authored-by: joachim-danswer <joachim@danswer.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants