feat: Eval Pipeline 2 for ci/cd fix #5440
Conversation
Greptile Summary
This PR implements a comprehensive evaluation pipeline system for Onyx, enabling automated testing and quality assessment of chat and retrieval functionality. The changes introduce a Braintrust-based evaluation framework that supports both local development testing and remote CI/CD integration.
The core implementation includes:
- Evaluation Infrastructure: New models (`EvalConfiguration`, `EvalConfigurationOptions`) define evaluation settings including persona overrides, tool restrictions, and LLM configurations
- Provider Architecture: Abstract `EvalProvider` interface with a concrete `BraintrustEvalProvider` implementation for running evaluations against datasets
- Database Isolation: Ephemeral session factory ensures evaluations don't affect production data by rolling back all database changes
- CLI Integration: New `eval_cli.py` provides command-line tools for running evaluations locally or triggering remote execution via API
- Background Processing: Celery task integration allows evaluations to run asynchronously without blocking API responses
- Enterprise API: New `/evals/eval_run` endpoint restricted to cloud superusers for triggering evaluation jobs
The system integrates deeply with existing Onyx components by extending chat message request preparation to support LLM overrides and tool restrictions, updating the Celery autodiscovery to include evaluation tasks, and modifying built-in tool type definitions for better evaluation compatibility. Configuration management is enhanced with new Braintrust-specific environment variables for API keys, project names, and concurrency limits.
The evaluation pipeline supports different research types (DEEP/THOUGHTFUL) and can work with both local JSON datasets and remote Braintrust datasets, providing flexibility for various testing scenarios from development to production monitoring.
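The ephemeral-session idea above can be pictured with a small sketch (an assumed shape, not the PR's actual factory):

```python
from collections.abc import Iterator
from contextlib import contextmanager

from sqlalchemy.orm import Session, sessionmaker


@contextmanager
def ephemeral_session(session_factory: sessionmaker) -> Iterator[Session]:
    # Yield a session whose writes are always rolled back, so eval runs
    # leave no trace in the database.
    session = session_factory()
    try:
        yield session
    finally:
        session.rollback()
        session.close()
```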
Confidence score: 2/5
- This PR has several critical issues that make it unsafe to merge without significant fixes
- Score lowered due to incomplete implementations, missing configurations, potential type safety issues, and debugging code left in production files
- Pay close attention to `backend/onyx/evals/providers/braintrust.py`, `backend/onyx/tools/built_in_tools.py`, and `backend/onyx/evals/models.py` - these files contain the most problematic code that needs immediate attention
Context used:
Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
26 files reviewed, 17 comments
```python
dataset_size = len(list(dataset.fetch()))
print(f"Dataset size: {dataset_size}")
```
style: inefficient dataset size calculation - list(dataset.fetch()) loads entire dataset into memory just to count items
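A constant-memory alternative, assuming `dataset.fetch()` returns an iterable:

```python
# Count items without materializing the whole dataset in memory.
dataset_size = sum(1 for _ in dataset.fetch())
print(f"Dataset size: {dataset_size}")
```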
```python
args.remote_dataset_name,
search_permissions_email=args.search_permissions_email,
```
logic: args.remote_dataset_name could be None when using --local-data-path, causing TypeError
Suggested change:
```python
if not args.remote_dataset_name:
    print("Error: --remote-dataset-name is required for remote evaluation")
    return
try:
    result = run_remote(
        args.base_url,
        api_key,
        args.remote_dataset_name,
        search_permissions_email=args.search_permissions_email,
    )
```
```python
    else None
)
research_type = ResearchType(eval_input.get("research_type", "THOUGHTFUL"))
print(eval_input)
```
syntax: Remove debugging print statement before merging to production
Suggested change (drop the debug print):
```python
research_type = ResearchType(eval_input.get("research_type", "THOUGHTFUL"))
```
Context Used: Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
```python
persona_override_config: PersonaOverrideConfig | None = None
llm: LLMOverride = Field(default_factory=LLMOverride)
search_permissions_email: str | None = None
allowed_tool_ids: list[int]
```
logic: Missing default value for required field. This will cause Pydantic validation errors when creating EvalConfiguration instances without explicit allowed_tool_ids.
Suggested change:
```python
allowed_tool_ids: list[int] = Field(default_factory=list)
```
```python
)


class EvalationAck(BaseModel):
```
syntax: Class name has typo: 'EvalationAck' should be 'EvaluationAck'
Suggested change:
```python
class EvaluationAck(BaseModel):
```
```
make sure your AUTH_TYPE=disabled when running evals locally. Save the env var ONYX_EVAL_API_KEY in your .env file so you don't
have to specify it every time for triggering remote runs.
```
style: Missing punctuation and formatting issues in the sentence about AUTH_TYPE and environment variables
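One possible rewording: "Make sure `AUTH_TYPE=disabled` is set when running evals locally. Save `ONYX_EVAL_API_KEY` in your `.env` file so you don't have to specify it every time you trigger a remote run."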
```python
except Exception:
    logger.error("Failed to run eval task")
    raise
```
logic: Generic exception handling loses valuable error context. Consider logging the exception details with `logger.exception()` or including specific error information.
Suggested change:
```python
except Exception:
    logger.exception("Failed to run eval task")
    raise
```
```python
Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_data,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
logic: The `Eval()` call result is not captured or awaited. This means the evaluation runs but results are lost and there's no error handling.
Suggested change:
```python
eval_result = Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_data,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
```python
Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_cases,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
logic: Same issue - the `Eval()` result is not captured. Both evaluation paths have this critical problem.
Suggested change:
```python
eval_result = Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_cases,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
```python
if data is None:
    raise ValueError(
        "Must specify data when remote_dataset_name is not specified"
    )
```
logic: This validation check is redundant - if `remote_dataset_name` is None, we're already in the else branch, so `data` cannot also be None due to the earlier validation.
28 issues found across 27 files
Prompt for AI agents (all 28 issues)
Understand the root cause of the following 28 issues and fix them.
<file name="backend/onyx/evals/provider.py">
<violation number="1" location="backend/onyx/evals/provider.py:2">
Hard dependency on optional braintrust package causes import-time crash (ModuleNotFoundError)</violation>
</file>
<file name="backend/onyx/db/tools.py">
<violation number="1" location="backend/onyx/db/tools.py:12">
Top-level import introduces a tools-layer runtime dependency in the DB module, undermining the local-import pattern meant to avoid circular imports and tight coupling.</violation>
</file>
<file name="backend/onyx/evals/one_off/create_braintrust_dataset.py">
<violation number="1" location="backend/onyx/evals/one_off/create_braintrust_dataset.py:63">
Filtering does not enforce categories containing 'web-only' as stated; condition only checks should_use and question, leading to incorrect dataset contents.</violation>
<violation number="2" location="backend/onyx/evals/one_off/create_braintrust_dataset.py:69">
Typo in user-facing string: "quesiton" should be "question".</violation>
<violation number="3" location="backend/onyx/evals/one_off/create_braintrust_dataset.py:143">
Avoid user-specific absolute path as default for --csv-path; use a repo-relative path derived from __file__ for portability.</violation>
</file>
<file name="backend/onyx/evals/models.py">
<violation number="1" location="backend/onyx/evals/models.py:45">
Redundant DB lookups in list comprehensions; cache tool IDs to avoid calling get_builtin_tool twice per tool.</violation>
<violation number="2" location="backend/onyx/evals/models.py:69">
Typo in class name "EvalationAck"; consider renaming to "EvaluationAck" for clarity and consistency across the codebase.</violation>
<violation number="3" location="backend/onyx/evals/models.py:77">
Callable type for task may be inconsistent with data item structure; consider aligning the task signature with the data element type (e.g., Callable[[dict[str, dict[str, str]]], str]) or adjusting data typing.</violation>
</file>
<file name="backend/ee/onyx/server/evals/api.py">
<violation number="1" location="backend/ee/onyx/server/evals/api.py:5">
Use the versioned Celery client app import for consistency and correct EE/OSS wiring.</violation>
<violation number="2" location="backend/ee/onyx/server/evals/api.py:24">
Docstring claims API key auth, but the dependency enforces superuser access; update the description to reflect actual auth.</violation>
</file>
<file name="backend/onyx/evals/eval.py">
<violation number="1" location="backend/onyx/evals/eval.py:75">
Avoid printing potentially sensitive eval_input; use structured logging with redaction or remove debug output.</violation>
</file>
<file name="backend/onyx/evals/eval_cli.py">
<violation number="1" location="backend/onyx/evals/eval_cli.py:114">
Network request lacks a timeout; add a finite timeout to avoid indefinite hangs and improve resiliency.</violation>
<violation number="2" location="backend/onyx/evals/eval_cli.py:188">
Empty API key fallback results in sending an empty Bearer token; validate and fail fast when no API key is provided.</violation>
<violation number="3" location="backend/onyx/evals/eval_cli.py:199">
run_remote is invoked with possibly None dataset_name/search_permissions_email, resulting in JSON nulls and server-side failure.</violation>
<violation number="4" location="backend/onyx/evals/eval_cli.py:204">
Printing the raw exception may leak sensitive details; prefer logging only safe metadata (e.g., HTTP status code) or a generic message.
(Based on your team's feedback about avoiding logging raw exception strings that may contain URLs with temporary auth tokens.)</violation>
</file>
<file name="backend/onyx/evals/providers/braintrust.py">
<violation number="1" location="backend/onyx/evals/providers/braintrust.py:32">
Eval name uses BRAINTRUST_PROJECT (project) instead of configuration.dataset_name; this likely mislabels eval runs and leaves dataset_name unused.</violation>
</file>
<file name="backend/requirements/default.txt">
<violation number="1" location="backend/requirements/default.txt:108">
New dependency appears unused across the repository; consider avoiding adding unused runtime deps or move to a dev/CI-only requirements file to reduce image size and attack surface.</violation>
<violation number="2" location="backend/requirements/default.txt:109">
Integration package appears unused; adding unused deps increases maintenance burden. Consider deferring until referenced in code or include only in eval/CI requirements.</violation>
<violation number="3" location="backend/requirements/default.txt:110">
Added dependency seems unused; avoid introducing unused runtime packages or move to a dedicated eval/CI requirements file.</violation>
</file>
<file name="backend/onyx/configs/app_configs.py">
<violation number="1" location="backend/onyx/configs/app_configs.py:697">
Zero/negative concurrency allowed; clamp to a minimum of 1 to avoid disabling evaluations or runtime errors.</violation>
</file>
<file name="backend/onyx/evals/README.md">
<violation number="1" location="backend/onyx/evals/README.md:17">
Duplicate --remote flag in example command; remove the redundancy.</violation>
<violation number="2" location="backend/onyx/evals/README.md:23">
Incorrect flag name: use --local-data-path (not --local-dataset-path).</violation>
<violation number="3" location="backend/onyx/evals/README.md:39">
VS Code configuration "Eval CLI" does not exist in the repo templates; instruction is misleading.</violation>
<violation number="4" location="backend/onyx/evals/README.md:43">
Claims a default data file path that does not exist; remove or correct the path/default description.</violation>
<violation number="5" location="backend/onyx/evals/README.md:49">
Incorrect default documented for --local-data-path; the CLI sets no default.</violation>
<violation number="6" location="backend/onyx/evals/README.md:51">
Misleading: --braintrust-project does not override provider’s BRAINTRUST_PROJECT; it’s used only for dataset lookup.</violation>
<violation number="7" location="backend/onyx/evals/README.md:56">
Path `evals/data/data.json` does not exist in the repo; reference should not promise a fixed path.</violation>
<violation number="8" location="backend/onyx/evals/README.md:58">
Incorrect data shape: research_type belongs inside input, not at the top level.</violation>
</file>
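For the `eval_cli.py` timeout violation above, a minimal sketch of a bounded request, assuming the CLI posts with `requests` (payload shape and variable names are illustrative):

```python
import requests


def trigger_remote_eval(base_url: str, api_key: str, dataset_name: str) -> dict:
    # A finite (connect, read) timeout keeps the CLI from hanging
    # indefinitely while still allowing slow eval kickoffs.
    response = requests.post(
        f"{base_url}/evals/eval_run",  # endpoint name from the PR summary
        json={"dataset_name": dataset_name},  # payload shape is an assumption
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=(10, 300),
    )
    response.raise_for_status()
    return response.json()
```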
```diff
@@ -0,0 +1,6 @@
+from onyx.evals.models import EvalProvider
+from onyx.evals.providers.braintrust import BraintrustEvalProvider
```
Hard dependency on optional braintrust package causes import-time crash (ModuleNotFoundError)
Prompt for AI agents
Address the following comment on backend/onyx/evals/provider.py at line 2:
<comment>Hard dependency on optional braintrust package causes import-time crash (ModuleNotFoundError)</comment>
<file context>
@@ -0,0 +1,6 @@
+from onyx.evals.models import EvalProvider
+from onyx.evals.providers.braintrust import BraintrustEvalProvider
+
+
</file context>
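One way to defer the import, sketched with an illustrative factory name:

```python
from onyx.evals.models import EvalProvider


def get_eval_provider() -> EvalProvider:
    # Import braintrust lazily so this module stays importable even when
    # the optional package is not installed.
    try:
        from onyx.evals.providers.braintrust import BraintrustEvalProvider
    except ModuleNotFoundError as exc:
        raise RuntimeError("braintrust must be installed to run evals") from exc
    return BraintrustEvalProvider()
```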
```python
from onyx.db.models import Tool
from onyx.server.features.tool.models import Header
from onyx.tools.built_in_tools import BUILT_IN_TOOL_TYPES
```
Top-level import introduces a tools-layer runtime dependency in the DB module, undermining the local-import pattern meant to avoid circular imports and tight coupling.
Prompt for AI agents
Address the following comment on backend/onyx/db/tools.py at line 12:
<comment>Top-level import introduces a tools-layer runtime dependency in the DB module, undermining the local-import pattern meant to avoid circular imports and tight coupling.</comment>
<file context>
@@ -2,29 +2,19 @@
from onyx.db.models import Tool
from onyx.server.features.tool.models import Header
+from onyx.tools.built_in_tools import BUILT_IN_TOOL_TYPES
from onyx.utils.headers import HeaderItemDict
from onyx.utils.logger import setup_logger
</file context>
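A sketch of the local-import pattern the comment refers to (the helper name is illustrative):

```python
def _builtin_tool_types():
    # Importing inside the function keeps the DB layer decoupled from the
    # tools layer at module import time, avoiding circular imports.
    from onyx.tools.built_in_tools import BUILT_IN_TOOL_TYPES

    return BUILT_IN_TOOL_TYPES
```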
```python
)
parser.add_argument(
    "--csv-path",
    default="/Users/richardguan/onyx/backend/onyx/evals/data/DR Master Question & Metric Sheet - Sheet1.csv",
```
Avoid user-specific absolute path as default for `--csv-path`; use a repo-relative path derived from `__file__` for portability.
Prompt for AI agents
Address the following comment on backend/onyx/evals/one_off/create_braintrust_dataset.py at line 143:
<comment>Avoid user-specific absolute path as default for --csv-path; use a repo-relative path derived from __file__ for portability.</comment>
<file context>
@@ -0,0 +1,183 @@
+ )
+ parser.add_argument(
+ "--csv-path",
+ default="/Users/richardguan/onyx/backend/onyx/evals/data/DR Master Question & Metric Sheet - Sheet1.csv",
+ help="Path to the CSV file (default: %(default)s)",
+ )
</file context>
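A portable default along the lines the comment suggests, assuming the data directory sits one level above this script:

```python
from pathlib import Path

# Resolve the CSV relative to this file instead of a user-specific home dir.
DEFAULT_CSV_PATH = str(
    Path(__file__).resolve().parent.parent
    / "data"
    / "DR Master Question & Metric Sheet - Sheet1.csv"
)
```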
```python
[
    {
        "question": question
        + ". All info is contained in the quesiton. DO NOT ask any clarifying questions.",
```
Typo in user-facing string: "quesiton" should be "question".
Prompt for AI agents
Address the following comment on backend/onyx/evals/one_off/create_braintrust_dataset.py at line 69:
<comment>Typo in user-facing string: "quesiton" should be "question".</comment>
<file context>
@@ -0,0 +1,183 @@
+ [
+ {
+ "question": question
+ + ". All info is contained in the quesiton. DO NOT ask any clarifying questions.",
+ "research_type": "DEEP",
+ "categories": categories,
</file context>
```python
categories = row[12].strip() if len(row) > 12 else ""

# Filter records: should_use = TRUE and categories contains "web-only"
if should_use == "TRUE" and question:  # Ensure question is not empty
```
Filtering does not enforce categories containing 'web-only' as stated; condition only checks should_use and question, leading to incorrect dataset contents.
Prompt for AI agents
Address the following comment on backend/onyx/evals/one_off/create_braintrust_dataset.py at line 63:
<comment>Filtering does not enforce categories containing 'web-only' as stated; condition only checks should_use and question, leading to incorrect dataset contents.</comment>
<file context>
@@ -0,0 +1,183 @@
+ categories = row[12].strip() if len(row) > 12 else ""
+
+ # Filter records: should_use = TRUE and categories contains "web-only"
+ if should_use == "TRUE" and question: # Ensure question is not empty
+
+ records.extend(
</file context>
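A sketch of the stricter condition the comment describes (`build_record` is a hypothetical helper standing in for the original record construction):

```python
# Enforce all three documented criteria: the should_use flag, a
# non-empty question, and categories containing "web-only".
if should_use == "TRUE" and question and "web-only" in categories:
    records.append(build_record(question, categories))  # hypothetical helper
```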
```
4. Click the play button or press F5

This will run the evaluation with the following default settings:
- Uses the local data file at `evals/data/data.json`
```
Claims a default data file path that does not exist; remove or correct the path/default description.
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 43:
<comment>Claims a default data file path that does not exist; remove or correct the path/default description.</comment>
<file context>
@@ -0,0 +1,68 @@
+4. Click the play button or press F5
+
+This will run the evaluation with the following default settings:
+- Uses the local data file at `evals/data/data.json`
+- Enables verbose output
+- Sets up proper environment variables and Python path
</file context>
```
1. Open VS Code in the project root
2. Go to the "Run and Debug" panel (Ctrl/Cmd + Shift + D)
3. Select "Eval CLI" from the dropdown
```
VS Code configuration "Eval CLI" does not exist in the repo templates; instruction is misleading.
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 39:
<comment>VS Code configuration "Eval CLI" does not exist in the repo templates; instruction is misleading.</comment>
<file context>
@@ -0,0 +1,68 @@
+
+1. Open VS Code in the project root
+2. Go to the "Run and Debug" panel (Ctrl/Cmd + Shift + D)
+3. Select "Eval CLI" from the dropdown
+4. Click the play button or press F5
+
</file context>
You can also run the CLI directly from the command line:

```bash
onyx$ python -m dotenv -f .vscode/.env run -- python backend/onyx/evals/eval_cli.py --local-dataset-path backend/onyx/evals/data/eval.json --search-permissions-email richard@onyx.app
```
Incorrect flag name: use --local-data-path (not --local-dataset-path).
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 23:
<comment>Incorrect flag name: use --local-data-path (not --local-dataset-path).</comment>
<file context>
@@ -0,0 +1,68 @@
+You can also run the CLI directly from the command line:
+
+```bash
+onyx$ python -m dotenv -f .vscode/.env run -- python backend/onyx/evals/eval_cli.py --local-dataset-path backend/onyx/evals/data/eval.json --search-permissions-email richard@onyx.app
+```
+make sure your AUTH_TYPE=disabled when running evals locally. Save the env var ONYX_EVAL_API_KEY in your .env file so you don't
</file context>
Suggested change:
```bash
onyx$ python -m dotenv -f .vscode/.env run -- python backend/onyx/evals/eval_cli.py --local-data-path backend/onyx/evals/data/eval.json --search-permissions-email richard@onyx.app
```
Kick off a remote job

```bash
onyx/backend$ python onyx/evals/eval_cli.py --remote --api-key <SUPER_CLOUD_USER_API_KEY> --search-permissions-email <email account to reference> --remote --remote-dataset-name Simple
```
Duplicate --remote flag in example command; remove the redundancy.
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 17:
<comment>Duplicate --remote flag in example command; remove the redundancy.</comment>
<file context>
@@ -0,0 +1,68 @@
+
+Kick off a remote job
+```bash
+onyx/backend$ python onyx/evals/eval_cli.py --remote --api-key <SUPER_CLOUD_USER_API_KEY> --search-permissions-email <email account to reference> --remote --remote-dataset-name Simple
+```
+
</file context>
Suggested change:
```bash
onyx/backend$ python onyx/evals/eval_cli.py --remote --api-key <SUPER_CLOUD_USER_API_KEY> --search-permissions-email <email account to reference> --remote-dataset-name Simple
```
```python
result = run_remote(
    args.base_url,
    api_key,
    args.remote_dataset_name,
```
run_remote is invoked with possibly None dataset_name/search_permissions_email, resulting in JSON nulls and server-side failure.
Prompt for AI agents
Address the following comment on backend/onyx/evals/eval_cli.py at line 199:
<comment>run_remote is invoked with possibly None dataset_name/search_permissions_email, resulting in JSON nulls and server-side failure.</comment>
<file context>
@@ -0,0 +1,220 @@
+ result = run_remote(
+ args.base_url,
+ api_key,
+ args.remote_dataset_name,
+ search_permissions_email=args.search_permissions_email,
+ )
</file context>
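A fail-fast sketch before calling `run_remote`, assuming `parser` (the argparse parser) and `args` are in scope:

```python
# Validate remote-mode arguments up front so the server never
# receives JSON nulls.
if args.remote and not (args.remote_dataset_name and args.search_permissions_email):
    parser.error(
        "--remote-dataset-name and --search-permissions-email "
        "are required with --remote"
    )
```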
Description
[Provide a brief description of the changes in this PR]
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Adds a Braintrust-backed evaluation pipeline with a local CLI, a secure EE API endpoint, and a Celery task to run evals in CI/CD. Enables running evals on local data or Braintrust datasets without DB side effects.
New Features
Dependencies