feat: Eval Pipeline 2 for ci/cd fix #5440
Conversation
Greptile Summary
This PR implements a comprehensive evaluation pipeline system for Onyx, enabling automated testing and quality assessment of chat and retrieval functionality. The changes introduce a Braintrust-based evaluation framework that supports both local development testing and remote CI/CD integration.
The core implementation includes:
- Evaluation Infrastructure: New models (`EvalConfiguration`, `EvalConfigurationOptions`) define evaluation settings including persona overrides, tool restrictions, and LLM configurations
- Provider Architecture: Abstract `EvalProvider` interface with a concrete `BraintrustEvalProvider` implementation for running evaluations against datasets
- Database Isolation: Ephemeral session factory ensures evaluations don't affect production data by rolling back all database changes
- CLI Integration: New `eval_cli.py` provides command-line tools for running evaluations locally or triggering remote execution via API
- Background Processing: Celery task integration allows evaluations to run asynchronously without blocking API responses
- Enterprise API: New `/evals/eval_run` endpoint restricted to cloud superusers for triggering evaluation jobs
The system integrates deeply with existing Onyx components by extending chat message request preparation to support LLM overrides and tool restrictions, updating the Celery autodiscovery to include evaluation tasks, and modifying built-in tool type definitions for better evaluation compatibility. Configuration management is enhanced with new Braintrust-specific environment variables for API keys, project names, and concurrency limits.
The evaluation pipeline supports different research types (DEEP/THOUGHTFUL) and can work with both local JSON datasets and remote Braintrust datasets, providing flexibility for various testing scenarios from development to production monitoring.
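The ephemeral-session idea above can be pictured with a small sketch (an assumed shape, not the PR's actual factory):

```python
from collections.abc import Iterator
from contextlib import contextmanager

from sqlalchemy.orm import Session, sessionmaker


@contextmanager
def ephemeral_session(session_factory: sessionmaker) -> Iterator[Session]:
    # Yield a session whose writes are always rolled back, so eval runs
    # leave no trace in the database.
    session = session_factory()
    try:
        yield session
    finally:
        session.rollback()
        session.close()
```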
Confidence score: 2/5
- This PR has several critical issues that make it unsafe to merge without significant fixes
- Score lowered due to incomplete implementations, missing configurations, potential type safety issues, and debugging code left in production files
- Pay close attention to `backend/onyx/evals/providers/braintrust.py`, `backend/onyx/tools/built_in_tools.py`, and `backend/onyx/evals/models.py` - these files contain the most problematic code that needs immediate attention
Context used:
Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
26 files reviewed, 17 comments
```python
dataset_size = len(list(dataset.fetch()))
print(f"Dataset size: {dataset_size}")
```
style: inefficient dataset size calculation - list(dataset.fetch()) loads entire dataset into memory just to count items
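A constant-memory alternative, assuming `dataset.fetch()` returns an iterable:

```python
# Count items without materializing the whole dataset in memory.
dataset_size = sum(1 for _ in dataset.fetch())
print(f"Dataset size: {dataset_size}")
```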
```python
args.remote_dataset_name,
search_permissions_email=args.search_permissions_email,
```
logic: args.remote_dataset_name could be None when using --local-data-path, causing TypeError
Suggested change:
```python
if not args.remote_dataset_name:
    print("Error: --remote-dataset-name is required for remote evaluation")
    return
try:
    result = run_remote(
        args.base_url,
        api_key,
        args.remote_dataset_name,
        search_permissions_email=args.search_permissions_email,
    )
```
```python
    else None
)
research_type = ResearchType(eval_input.get("research_type", "THOUGHTFUL"))
print(eval_input)
```
syntax: Remove debugging print statement before merging to production
Suggested change (drop the debug print):
```python
research_type = ResearchType(eval_input.get("research_type", "THOUGHTFUL"))
```
Context Used: Rule - Remove temporary debugging code before merging to production, especially tenant-specific debugging logs. (link)
```python
persona_override_config: PersonaOverrideConfig | None = None
llm: LLMOverride = Field(default_factory=LLMOverride)
search_permissions_email: str | None = None
allowed_tool_ids: list[int]
```
logic: Missing default value for required field. This will cause Pydantic validation errors when creating EvalConfiguration instances without explicit allowed_tool_ids.
Suggested change:
```python
allowed_tool_ids: list[int] = Field(default_factory=list)
```
```python
)


class EvalationAck(BaseModel):
```
syntax: Class name has typo: 'EvalationAck' should be 'EvaluationAck'
Suggested change:
```python
class EvaluationAck(BaseModel):
```
```
make sure your AUTH_TYPE=disabled when running evals locally. Save the env var ONYX_EVAL_API_KEY in your .env file so you don't
have to specify it every time for triggering remote runs.
```
style: Missing punctuation and formatting issues in the sentence about AUTH_TYPE and environment variables
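One possible rewording: "Make sure `AUTH_TYPE=disabled` is set when running evals locally. Save `ONYX_EVAL_API_KEY` in your `.env` file so you don't have to specify it every time you trigger a remote run."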
```python
except Exception:
    logger.error("Failed to run eval task")
    raise
```
logic: Generic exception handling loses valuable error context. Consider logging the exception details with `logger.exception()` or including specific error information.
Suggested change:
```python
except Exception:
    logger.exception("Failed to run eval task")
    raise
```
```python
Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_data,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
logic: The `Eval()` call result is not captured or awaited. This means the evaluation runs but results are lost and there's no error handling.
Suggested change:
```python
eval_result = Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_data,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
```python
Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_cases,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
logic: Same issue - the `Eval()` result is not captured. Both evaluation paths have this critical problem.
Suggested change:
```python
eval_result = Eval(
    name=BRAINTRUST_PROJECT,
    data=eval_cases,
    task=task,
    scores=[],
    metadata={**configuration.model_dump()},
    max_concurrency=BRAINTRUST_MAX_CONCURRENCY,
)
```
```python
if data is None:
    raise ValueError(
        "Must specify data when remote_dataset_name is not specified"
    )
```
logic: This validation check is redundant - if `remote_dataset_name` is None, we're already in the else branch, so `data` cannot also be None due to the earlier validation.
28 issues found across 27 files
Prompt for AI agents (all 28 issues)
Understand the root cause of the following 28 issues and fix them.
<file name="backend/onyx/evals/provider.py">
<violation number="1" location="backend/onyx/evals/provider.py:2">
Hard dependency on optional braintrust package causes import-time crash (ModuleNotFoundError)</violation>
</file>
<file name="backend/onyx/db/tools.py">
<violation number="1" location="backend/onyx/db/tools.py:12">
Top-level import introduces a tools-layer runtime dependency in the DB module, undermining the local-import pattern meant to avoid circular imports and tight coupling.</violation>
</file>
<file name="backend/onyx/evals/one_off/create_braintrust_dataset.py">
<violation number="1" location="backend/onyx/evals/one_off/create_braintrust_dataset.py:63">
Filtering does not enforce categories containing 'web-only' as stated; condition only checks should_use and question, leading to incorrect dataset contents.</violation>
<violation number="2" location="backend/onyx/evals/one_off/create_braintrust_dataset.py:69">
Typo in user-facing string: "quesiton" should be "question".</violation>
<violation number="3" location="backend/onyx/evals/one_off/create_braintrust_dataset.py:143">
Avoid user-specific absolute path as default for --csv-path; use a repo-relative path derived from __file__ for portability.</violation>
</file>
<file name="backend/onyx/evals/models.py">
<violation number="1" location="backend/onyx/evals/models.py:45">
Redundant DB lookups in list comprehensions; cache tool IDs to avoid calling get_builtin_tool twice per tool.</violation>
<violation number="2" location="backend/onyx/evals/models.py:69">
Typo in class name "EvalationAck"; consider renaming to "EvaluationAck" for clarity and consistency across the codebase.</violation>
<violation number="3" location="backend/onyx/evals/models.py:77">
Callable type for task may be inconsistent with data item structure; consider aligning the task signature with the data element type (e.g., Callable[[dict[str, dict[str, str]]], str]) or adjusting data typing.</violation>
</file>
<file name="backend/ee/onyx/server/evals/api.py">
<violation number="1" location="backend/ee/onyx/server/evals/api.py:5">
Use the versioned Celery client app import for consistency and correct EE/OSS wiring.</violation>
<violation number="2" location="backend/ee/onyx/server/evals/api.py:24">
Docstring claims API key auth, but the dependency enforces superuser access; update the description to reflect actual auth.</violation>
</file>
<file name="backend/onyx/evals/eval.py">
<violation number="1" location="backend/onyx/evals/eval.py:75">
Avoid printing potentially sensitive eval_input; use structured logging with redaction or remove debug output.</violation>
</file>
<file name="backend/onyx/evals/eval_cli.py">
<violation number="1" location="backend/onyx/evals/eval_cli.py:114">
Network request lacks a timeout; add a finite timeout to avoid indefinite hangs and improve resiliency.</violation>
<violation number="2" location="backend/onyx/evals/eval_cli.py:188">
Empty API key fallback results in sending an empty Bearer token; validate and fail fast when no API key is provided.</violation>
<violation number="3" location="backend/onyx/evals/eval_cli.py:199">
run_remote is invoked with possibly None dataset_name/search_permissions_email, resulting in JSON nulls and server-side failure.</violation>
<violation number="4" location="backend/onyx/evals/eval_cli.py:204">
Printing the raw exception may leak sensitive details; prefer logging only safe metadata (e.g., HTTP status code) or a generic message.
(Based on your team's feedback about avoiding logging raw exception strings that may contain URLs with temporary auth tokens.)</violation>
</file>
<file name="backend/onyx/evals/providers/braintrust.py">
<violation number="1" location="backend/onyx/evals/providers/braintrust.py:32">
Eval name uses BRAINTRUST_PROJECT (project) instead of configuration.dataset_name; this likely mislabels eval runs and leaves dataset_name unused.</violation>
</file>
<file name="backend/requirements/default.txt">
<violation number="1" location="backend/requirements/default.txt:108">
New dependency appears unused across the repository; consider avoiding adding unused runtime deps or move to a dev/CI-only requirements file to reduce image size and attack surface.</violation>
<violation number="2" location="backend/requirements/default.txt:109">
Integration package appears unused; adding unused deps increases maintenance burden. Consider deferring until referenced in code or include only in eval/CI requirements.</violation>
<violation number="3" location="backend/requirements/default.txt:110">
Added dependency seems unused; avoid introducing unused runtime packages or move to a dedicated eval/CI requirements file.</violation>
</file>
<file name="backend/onyx/configs/app_configs.py">
<violation number="1" location="backend/onyx/configs/app_configs.py:697">
Zero/negative concurrency allowed; clamp to a minimum of 1 to avoid disabling evaluations or runtime errors.</violation>
</file>
<file name="backend/onyx/evals/README.md">
<violation number="1" location="backend/onyx/evals/README.md:17">
Duplicate --remote flag in example command; remove the redundancy.</violation>
<violation number="2" location="backend/onyx/evals/README.md:23">
Incorrect flag name: use --local-data-path (not --local-dataset-path).</violation>
<violation number="3" location="backend/onyx/evals/README.md:39">
VS Code configuration "Eval CLI" does not exist in the repo templates; instruction is misleading.</violation>
<violation number="4" location="backend/onyx/evals/README.md:43">
Claims a default data file path that does not exist; remove or correct the path/default description.</violation>
<violation number="5" location="backend/onyx/evals/README.md:49">
Incorrect default documented for --local-data-path; the CLI sets no default.</violation>
<violation number="6" location="backend/onyx/evals/README.md:51">
Misleading: --braintrust-project does not override provider’s BRAINTRUST_PROJECT; it’s used only for dataset lookup.</violation>
<violation number="7" location="backend/onyx/evals/README.md:56">
Path `evals/data/data.json` does not exist in the repo; reference should not promise a fixed path.</violation>
<violation number="8" location="backend/onyx/evals/README.md:58">
Incorrect data shape: research_type belongs inside input, not at the top level.</violation>
</file>
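For the `eval_cli.py` timeout violation above, a minimal sketch of a bounded request, assuming the CLI posts with `requests` (payload shape and variable names are illustrative):

```python
import requests


def trigger_remote_eval(base_url: str, api_key: str, dataset_name: str) -> dict:
    # A finite (connect, read) timeout keeps the CLI from hanging
    # indefinitely while still allowing slow eval kickoffs.
    response = requests.post(
        f"{base_url}/evals/eval_run",  # endpoint name from the PR summary
        json={"dataset_name": dataset_name},  # payload shape is an assumption
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=(10, 300),
    )
    response.raise_for_status()
    return response.json()
```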
```diff
@@ -0,0 +1,6 @@
+from onyx.evals.models import EvalProvider
+from onyx.evals.providers.braintrust import BraintrustEvalProvider
```
Hard dependency on optional braintrust package causes import-time crash (ModuleNotFoundError)
Prompt for AI agents
Address the following comment on backend/onyx/evals/provider.py at line 2:
<comment>Hard dependency on optional braintrust package causes import-time crash (ModuleNotFoundError)</comment>
<file context>
@@ -0,0 +1,6 @@
+from onyx.evals.models import EvalProvider
+from onyx.evals.providers.braintrust import BraintrustEvalProvider
+
+
</file context>
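One way to defer the import, sketched with an illustrative factory name:

```python
from onyx.evals.models import EvalProvider


def get_eval_provider() -> EvalProvider:
    # Import braintrust lazily so this module stays importable even when
    # the optional package is not installed.
    try:
        from onyx.evals.providers.braintrust import BraintrustEvalProvider
    except ModuleNotFoundError as exc:
        raise RuntimeError("braintrust must be installed to run evals") from exc
    return BraintrustEvalProvider()
```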
```python
from onyx.db.models import Tool
from onyx.server.features.tool.models import Header
from onyx.tools.built_in_tools import BUILT_IN_TOOL_TYPES
```
Top-level import introduces a tools-layer runtime dependency in the DB module, undermining the local-import pattern meant to avoid circular imports and tight coupling.
Prompt for AI agents
Address the following comment on backend/onyx/db/tools.py at line 12:
<comment>Top-level import introduces a tools-layer runtime dependency in the DB module, undermining the local-import pattern meant to avoid circular imports and tight coupling.</comment>
<file context>
@@ -2,29 +2,19 @@
from onyx.db.models import Tool
from onyx.server.features.tool.models import Header
+from onyx.tools.built_in_tools import BUILT_IN_TOOL_TYPES
from onyx.utils.headers import HeaderItemDict
from onyx.utils.logger import setup_logger
</file context>
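A sketch of the local-import pattern the comment refers to (the helper name is illustrative):

```python
def _builtin_tool_types():
    # Importing inside the function keeps the DB layer decoupled from the
    # tools layer at module import time, avoiding circular imports.
    from onyx.tools.built_in_tools import BUILT_IN_TOOL_TYPES

    return BUILT_IN_TOOL_TYPES
```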
```python
)
parser.add_argument(
    "--csv-path",
    default="/Users/richardguan/onyx/backend/onyx/evals/data/DR Master Question & Metric Sheet - Sheet1.csv",
```
Avoid user-specific absolute path as default for `--csv-path`; use a repo-relative path derived from `__file__` for portability.
Prompt for AI agents
Address the following comment on backend/onyx/evals/one_off/create_braintrust_dataset.py at line 143:
<comment>Avoid user-specific absolute path as default for --csv-path; use a repo-relative path derived from __file__ for portability.</comment>
<file context>
@@ -0,0 +1,183 @@
+ )
+ parser.add_argument(
+ "--csv-path",
+ default="/Users/richardguan/onyx/backend/onyx/evals/data/DR Master Question & Metric Sheet - Sheet1.csv",
+ help="Path to the CSV file (default: %(default)s)",
+ )
</file context>
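A portable default along the lines the comment suggests, assuming the data directory sits one level above this script:

```python
from pathlib import Path

# Resolve the CSV relative to this file instead of a user-specific home dir.
DEFAULT_CSV_PATH = str(
    Path(__file__).resolve().parent.parent
    / "data"
    / "DR Master Question & Metric Sheet - Sheet1.csv"
)
```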
```python
[
    {
        "question": question
        + ". All info is contained in the quesiton. DO NOT ask any clarifying questions.",
```
Typo in user-facing string: "quesiton" should be "question".
Prompt for AI agents
Address the following comment on backend/onyx/evals/one_off/create_braintrust_dataset.py at line 69:
<comment>Typo in user-facing string: "quesiton" should be "question".</comment>
<file context>
@@ -0,0 +1,183 @@
+ [
+ {
+ "question": question
+ + ". All info is contained in the quesiton. DO NOT ask any clarifying questions.",
+ "research_type": "DEEP",
+ "categories": categories,
</file context>
```python
categories = row[12].strip() if len(row) > 12 else ""

# Filter records: should_use = TRUE and categories contains "web-only"
if should_use == "TRUE" and question:  # Ensure question is not empty
```
Filtering does not enforce categories containing 'web-only' as stated; condition only checks should_use and question, leading to incorrect dataset contents.
Prompt for AI agents
Address the following comment on backend/onyx/evals/one_off/create_braintrust_dataset.py at line 63:
<comment>Filtering does not enforce categories containing 'web-only' as stated; condition only checks should_use and question, leading to incorrect dataset contents.</comment>
<file context>
@@ -0,0 +1,183 @@
+ categories = row[12].strip() if len(row) > 12 else ""
+
+ # Filter records: should_use = TRUE and categories contains "web-only"
+ if should_use == "TRUE" and question: # Ensure question is not empty
+
+ records.extend(
</file context>
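A sketch of the stricter condition the comment describes (`build_record` is a hypothetical helper standing in for the original record construction):

```python
# Enforce all three documented criteria: the should_use flag, a
# non-empty question, and categories containing "web-only".
if should_use == "TRUE" and question and "web-only" in categories:
    records.append(build_record(question, categories))  # hypothetical helper
```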
```
4. Click the play button or press F5

This will run the evaluation with the following default settings:
- Uses the local data file at `evals/data/data.json`
```
Claims a default data file path that does not exist; remove or correct the path/default description.
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 43:
<comment>Claims a default data file path that does not exist; remove or correct the path/default description.</comment>
<file context>
@@ -0,0 +1,68 @@
+4. Click the play button or press F5
+
+This will run the evaluation with the following default settings:
+- Uses the local data file at `evals/data/data.json`
+- Enables verbose output
+- Sets up proper environment variables and Python path
</file context>
```
1. Open VS Code in the project root
2. Go to the "Run and Debug" panel (Ctrl/Cmd + Shift + D)
3. Select "Eval CLI" from the dropdown
```
VS Code configuration "Eval CLI" does not exist in the repo templates; instruction is misleading.
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 39:
<comment>VS Code configuration "Eval CLI" does not exist in the repo templates; instruction is misleading.</comment>
<file context>
@@ -0,0 +1,68 @@
+
+1. Open VS Code in the project root
+2. Go to the "Run and Debug" panel (Ctrl/Cmd + Shift + D)
+3. Select "Eval CLI" from the dropdown
+4. Click the play button or press F5
+
</file context>
You can also run the CLI directly from the command line:

```bash
onyx$ python -m dotenv -f .vscode/.env run -- python backend/onyx/evals/eval_cli.py --local-dataset-path backend/onyx/evals/data/eval.json --search-permissions-email richard@onyx.app
```
Incorrect flag name: use --local-data-path (not --local-dataset-path).
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 23:
<comment>Incorrect flag name: use --local-data-path (not --local-dataset-path).</comment>
<file context>
@@ -0,0 +1,68 @@
+You can also run the CLI directly from the command line:
+
+```bash
+onyx$ python -m dotenv -f .vscode/.env run -- python backend/onyx/evals/eval_cli.py --local-dataset-path backend/onyx/evals/data/eval.json --search-permissions-email richard@onyx.app
+```
+make sure your AUTH_TYPE=disabled when running evals locally. Save the env var ONYX_EVAL_API_KEY in your .env file so you don't
</file context>
Suggested change:
```bash
onyx$ python -m dotenv -f .vscode/.env run -- python backend/onyx/evals/eval_cli.py --local-data-path backend/onyx/evals/data/eval.json --search-permissions-email richard@onyx.app
```
Kick off a remote job

```bash
onyx/backend$ python onyx/evals/eval_cli.py --remote --api-key <SUPER_CLOUD_USER_API_KEY> --search-permissions-email <email account to reference> --remote --remote-dataset-name Simple
```
Duplicate --remote flag in example command; remove the redundancy.
Prompt for AI agents
Address the following comment on backend/onyx/evals/README.md at line 17:
<comment>Duplicate --remote flag in example command; remove the redundancy.</comment>
<file context>
@@ -0,0 +1,68 @@
+
+Kick off a remote job
+```bash
+onyx/backend$ python onyx/evals/eval_cli.py --remote --api-key <SUPER_CLOUD_USER_API_KEY> --search-permissions-email <email account to reference> --remote --remote-dataset-name Simple
+```
+
</file context>
Suggested change:
```bash
onyx/backend$ python onyx/evals/eval_cli.py --remote --api-key <SUPER_CLOUD_USER_API_KEY> --search-permissions-email <email account to reference> --remote-dataset-name Simple
```
```python
result = run_remote(
    args.base_url,
    api_key,
    args.remote_dataset_name,
```
run_remote is invoked with possibly None dataset_name/search_permissions_email, resulting in JSON nulls and server-side failure.
Prompt for AI agents
Address the following comment on backend/onyx/evals/eval_cli.py at line 199:
<comment>run_remote is invoked with possibly None dataset_name/search_permissions_email, resulting in JSON nulls and server-side failure.</comment>
<file context>
@@ -0,0 +1,220 @@
+ result = run_remote(
+ args.base_url,
+ api_key,
+ args.remote_dataset_name,
+ search_permissions_email=args.search_permissions_email,
+ )
</file context>
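A fail-fast sketch before calling `run_remote`, assuming `parser` (the argparse parser) and `args` are in scope:

```python
# Validate remote-mode arguments up front so the server never
# receives JSON nulls.
if args.remote and not (args.remote_dataset_name and args.search_permissions_email):
    parser.error(
        "--remote-dataset-name and --search-permissions-email "
        "are required with --remote"
    )
```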
Description
[Provide a brief description of the changes in this PR]
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Adds a Braintrust-backed evaluation pipeline with a local CLI, a secure EE API endpoint, and a Celery task to run evals in CI/CD. Enables running evals on local data or Braintrust datasets without DB side effects.
New Features
Dependencies