
Conversation

@vincentkoc
Member

@vincentkoc vincentkoc commented Sep 27, 2025

Details

Improvements to our metrics/evals in the Python SDK:

  • Add additional evaluation metrics to the Python SDK based on classical NLP (ROUGE, etc.), embeddings (BERTScore), multi-turn conversation evaluation, and default presets for LLM-as-a-Judge (GEval); a minimal GEval sketch follows this list.
  • Add support for mapping GEval onto a ConversationMetric through a new GEvalConversationMetric.
  • Resolve errors when using GPT-5 with log_probs and temperature: we now use LiteLLM's model-support metadata to validate parameters and drop unsupported ones with a warning.
  • LLM judges were very slow; some memory-leak issues were found and patched, and the default model was updated to gpt-5-nano for both speed and improved judge quality.
  • For validation, an additional helper script to test and validate conversations can be found at https://gist.github.com/vincentkoc/50deb524b6869cd511b31d4a56c568ec; run it with python sdks/python/examples/multi_turn_conversation_evaluation.py --metrics all --single-metrics all --heuristic-metrics all
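
As a quick illustration of the GEval additions, here is a minimal sketch of scoring an output with the SDK's GEval metric using the new default judge model. The task/criteria strings are illustrative only, and passing model="gpt-5-nano" assumes the metric accepts a model override; the new preset classes may expose a different constructor.

```python
from opik.evaluation.metrics import GEval

# Minimal sketch: illustrative rubric, not a shipped preset.
metric = GEval(
    task_introduction="You are a judge evaluating whether an answer is relevant.",
    evaluation_criteria="The OUTPUT must directly address the question in the INPUT.",
    model="gpt-5-nano",  # new default judge model; shown explicitly for clarity
)

result = metric.score(
    output="INPUT: What is the capital of France?\nOUTPUT: Paris."
)
print(result.value, result.reason)
```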

Metrics Added

Conversational (Multi-Turn Evals)

  • Conversation Degeneration (Detect repetition/degeneration patterns across assistant turns)
  • Knowledge Retention (Checks whether final assistant replies retain earlier user-provided facts)
  • GEvalConversationMetric (Allows the use of LLM judges on threads; see the sketch below)
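
A minimal sketch of scoring a thread with the new conversation metrics, assuming they follow the same score(conversation=...) convention as the SDK's existing conversation metrics; constructor arguments are omitted and may differ in the actual release.

```python
from opik.evaluation.metrics import (
    ConversationDegenerationMetric,
    KnowledgeRetentionMetric,
)

conversation = [
    {"role": "user", "content": "Hi, my name is Ada and I deploy on Python 3.12."},
    {"role": "assistant", "content": "Nice to meet you, Ada! How can I help?"},
    {"role": "user", "content": "Which Python version did I say I deploy on?"},
    {"role": "assistant", "content": "You said you deploy on Python 3.12."},
]

# Both metrics score the whole thread rather than a single input/output pair.
for metric in (ConversationDegenerationMetric(), KnowledgeRetentionMetric()):
    result = metric.score(conversation=conversation)
    print(metric.name, result.value, result.reason)
```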

Heuristics (Non-LLM Based Scores)

  • BERTScore (Embedding-based similarity score, a good alternative to Levenshtein distance)
  • chrF/chrF++ (Machine translation evaluation)
  • Distribution metrics (histogram-based text distribution metrics for comparisons)
  • GLEU (Estimates fluency for grammatical error correction)
  • Language Adherence (Checks whether text adheres to an expected language code)
  • METEOR (Computes the METEOR score between output and reference text)
  • Prompt Injection (Simple heuristic detector for prompt-injection or leakage attempts)
  • Readability (Deterministic readability scores)
  • ROUGE (Improved existing ROUGE with rouge1, rouge2, rougeL, rougeLsum, and rougeW; see the sketch below)
  • Spearman (Spearman's rank correlation for two equal-length rankings)
  • Tone (Flags tone issues such as negativity, shouting, or forbidden phrases)
  • VADER Sentiment (Computes VADER sentiment scores for a piece of text)
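
A minimal sketch of two of the heuristic metrics in use. The rouge_type argument follows the existing ROUGE metric's interface; the BERTScore class name is assumed from the list above, and its constructor options may differ.

```python
from opik.evaluation.metrics import ROUGE, BERTScore  # BERTScore name assumed from the list above

reference = "The quick brown fox jumps over the lazy dog."
output = "A quick brown fox jumped over a lazy dog."

# ROUGE: rouge_type selects the variant (rouge1, rouge2, rougeL, rougeLsum, rougeW).
rouge = ROUGE(rouge_type="rougeL")
print("rougeL:", rouge.score(output=output, reference=reference).value)

# BERTScore: embedding-based similarity, no LLM call involved.
bertscore = BERTScore()
print("bertscore:", bertscore.score(output=output, reference=reference).value)
```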

LLM Judges (LLM Based Scores)

  • MMAD (LLM jury; aggregates multiple judge metrics (the 'juries') into a consensus score)
  • Agent Tool Correctness Judge (Evaluates whether an agent used tools correctly)
  • Agent Task Completion Judge (Scores whether an agent completed the assigned task)
  • Demographic Bias Judge (Scores demographic bias present in a model response)
  • Political Bias Judge (Scores political/ideological bias in a model response)
  • Compliance Risk Judge (Evaluates non-factual or non-compliant statements for regulated sectors)
  • Prompt Perplexity Judge (Rates how difficult a prompt is for an LLM to interpret)
  • RevisEval Judge (LLM judge that revises answers using grounded evidence)
  • QA Summarization Consistency Judge (Scores how faithful a summary is to its source content)
  • QA Summarization Coherence Judge (Evaluates coherence and structure of generated summaries)
  • QA Dialogue Helpfulness Judge (Judges how helpful an assistant reply is within a dialogue)
  • QA Relevance Judge (Checks whether an answer directly addresses the user question; see the sketch below)
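
A minimal sketch of invoking one of the new judges, assuming the class name from the summary note below (QARelevanceJudge) and the SDK's usual score(input=..., output=...) convention; constructor options beyond model are omitted and may differ.

```python
from opik.evaluation.metrics import QARelevanceJudge  # class name per the summary note below

# gpt-5-nano is now the default judge model, so passing it explicitly is optional.
judge = QARelevanceJudge(model="gpt-5-nano")

result = judge.score(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
)
print(result.value, result.reason)
```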

Change checklist

  • User facing
  • Documentation update

Issues

Testing

Done. Used the conversations generated above and validated the scores:

[Screenshots (2025-10-01 07:24:16 and 07:24:26): validated metric scores]

Documentation

Done


Note

Adds many new eval metrics (heuristics, LLM judges, conversation-level), GEval presets/conversation adapters, switches default judge model to gpt-5-nano, and hardens LiteLLM handling; comprehensive docs added.

  • SDK (Python):
    • Metrics Expansion:
      • Heuristics: BERTScore, GLEU, ChrF, METEOR, distribution metrics (JSDivergence, JSDistance, KLDivergence), SpearmanRanking, ReadabilityGuard, ToneGuard, PromptInjectionGuard, LanguageAdherence, VADERSentiment.
      • Conversation-level: RougeCMetric, BleuCMetric, MeteorCMetric, ConversationDegenerationMetric, KnowledgeRetentionMetric.
      • LLM Judges: ComplianceRiskJudge, PromptPerplexityJudge, PromptUncertaintyJudge, agent judges (AgentTaskCompletionJudge, AgentToolCorrectnessJudge), bias judges, QA suite (DialogueHelpfulnessJudge, QARelevanceJudge, SummarizationConsistencyJudge, SummarizationCoherenceJudge), RevisEvalJudge, LLMJuriesJudge, TrajectoryAccuracy.
    • GEval:
      • Add GEvalPreset (prebuilt rubrics) and conversation adapters (Conversation* metrics).
      • Cache CoT prompts; improve LiteLLM logprob parsing.
    • Core/Infra:
      • BaseMetric gains input preprocessor hook.
      • Default model -> gpt-5-nano; model factory caches instances.
      • LiteLLM model: drop unsupported params (e.g., temperature, logprobs on GPT-5) with a warning; add retries/backoff and robust content parsing (see the sketch below).
      • BLEU/ROUGE improvements (warning suppression; add rougeW).
      • New text preprocessing utilities.
    • Tests: add/adjust unit/integration tests for new metrics, presets, adapters, and model changes.
  • Docs:
    • Add/expand pages for new metrics (heuristics, LLM judges, conversation, prompt diagnostics, compliance, agent/tool judges, summarization metrics, RevisEval, LLM Juries, trajectory accuracy) and advanced configuration.
    • Update overviews; default evaluator model updated to gpt-5-nano; new GEval conversation metrics doc.
  • Misc:
    • Minor YAML/backup doc tweaks.

Written by Cursor Bugbot for commit 9a0d822. This will update automatically on new commits.
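
To illustrate the validate-and-drop behaviour described above, here is a minimal sketch built on LiteLLM's public get_supported_openai_params helper. filter_unsupported_params is a hypothetical name for illustration, not the SDK's actual internal function, and the exact parameters each model rejects are determined by LiteLLM's metadata at runtime.

```python
import warnings

import litellm
from litellm import get_supported_openai_params

def filter_unsupported_params(model: str, params: dict) -> dict:
    """Keep only parameters that LiteLLM reports as supported for `model`."""
    supported = set(get_supported_openai_params(model=model) or [])
    kept = {}
    for name, value in params.items():
        if name in supported:
            kept[name] = value
        else:
            # Warn-and-drop instead of letting the completion call fail
            # (e.g., temperature/logprobs on GPT-5-family models).
            warnings.warn(f"{model!r} does not support {name!r}; dropping it.")
    return kept

kwargs = filter_unsupported_params(
    "gpt-5-nano", {"temperature": 0.0, "logprobs": True, "max_tokens": 64}
)
response = litellm.completion(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Return a relevance score from 0 to 1."}],
    **kwargs,
)
print(response.choices[0].message.content)
```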

@vincentkoc vincentkoc marked this pull request as ready for review October 1, 2025 14:36
@vincentkoc vincentkoc requested a review from a team as a code owner October 1, 2025 14:36
Copilot AI review requested due to automatic review settings October 1, 2025 14:36
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
Member Author

@vincentkoc vincentkoc left a comment


Checks for myself (resolved)

vincentkoc and others added 3 commits November 3, 2025 12:11
…o feat/new-evals

* 'feat/new-evals' of https://github.com/comet-ml/opik: (48 commits)
  [NA] [SDK] Opik Optimizer Cursor Rules and AGENT.md (#3895)
  [NA] [SDK] Opik Optimizer Fix GEPA e2e Tests and test all Python Versions (#3899)
  Fix broken links (#3923)
  [OPIK-2816] [P SDK] alexkuzmik / update sdk prompt docs (#3918)
  Update base version to 1.8.99
  Update TypeScript SDK version to 1.8.98
  [OPIK-2870] [P SDK] Implement an utility function to workaround the langgraph execution context disconnection when running in asyncio context (#3914)
  [NA] - Updating dev-runner.ps1 script to support --quick-restart (#3915)
  [NA] [BE] Upgrade MySQL container from Testcontainers (#3909)
  [OPIK-2786] [P SDK] Implement recording context manager that will allow to get Trace- and Span- models for the submitted data (#3880)
  [OPIK-2909][FE] Playground fe improvements (#3890)
  [NA] [DOCS] Add Google ADK section to agent graph logging documentation (#3911)
  [NA] [Docs] Align playground supported providers with AI providers configuration (#3908)
  [NA] [Docs] Add Amazon Bedrock as AI Provider configuration guide (#3903)
  Bump com.mysql:mysql-connector-j in /apps/opik-backend (#3906)
  Bump software.amazon.awssdk:bom in /apps/opik-backend (#3905)
  test names and tags for suites (#3888)
  Bump com.jayway.jsonpath:json-path in /apps/opik-backend (#3907)
  [NA] Fix help text alignment for --quick-restart mode in dev-runner.sh (#3810)
  [OPIK-2863] [HELM] Add initContainer wait-for-mysql for opik-backend (#3902)
  ...
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 106 out of 116 changed files in this pull request and generated 4 comments.

@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@vincentkoc
Member Author

Tested all the judges and metrics (screenshots below). One small issue was found with the prompt-injection heuristic and patched. All tests are passing and review issues have been addressed.

[Screenshots (2025-11-03 12:49:53 and 12:50:00): judge and metric test results]

@comet-ml comet-ml deleted a comment from github-actions bot Nov 4, 2025
@vincentkoc vincentkoc merged commit c41fa7f into main Nov 4, 2025
227 of 254 checks passed
@vincentkoc vincentkoc deleted the feat/new-evals branch November 4, 2025 16:11
