
Conversation

@vincentkoc
Member

@vincentkoc vincentkoc commented Sep 27, 2025

Details

Improvements to our metrics/evals in the Python SDK:

  • Add additional evaluation metrics to the Python SDK based on classical NLP (ROUGE, etc.), embeddings (BERTScore), multi-turn conversation evaluation, and default presets for LLM-as-a-Judge (GEval); a minimal GEval sketch follows this list.
  • Add support for mapping GEval onto a ConversationMetric through a new GEvalConversationMetric.
  • Resolve errors when using GPT-5 with log_probs and temperature: we now use LiteLLM's model-support metadata to validate parameters and drop unsupported ones with a warning.
  • LLM judges were very slow; some memory-leak issues were found and patched, and the default model was updated to gpt-5-nano for both speed and improved judge quality.
  • For validation, an additional helper script to test and validate conversations can be found at https://gist.github.com/vincentkoc/50deb524b6869cd511b31d4a56c568ec; run it with python sdks/python/examples/multi_turn_conversation_evaluation.py --metrics all --single-metrics all --heuristic-metrics all
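
As a quick illustration of the GEval additions, here is a minimal sketch of scoring an output with the SDK's GEval metric using the new default judge model. The task/criteria strings are illustrative only, and passing model="gpt-5-nano" assumes the metric accepts a model override; the new preset classes may expose a different constructor.

```python
from opik.evaluation.metrics import GEval

# Minimal sketch: illustrative rubric, not a shipped preset.
metric = GEval(
    task_introduction="You are a judge evaluating whether an answer is relevant.",
    evaluation_criteria="The OUTPUT must directly address the question in the INPUT.",
    model="gpt-5-nano",  # new default judge model; shown explicitly for clarity
)

result = metric.score(
    output="INPUT: What is the capital of France?\nOUTPUT: Paris."
)
print(result.value, result.reason)
```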

Metrics Added

Conversational (Multi-Turn Evals)

  • Conversation Degeneration (Detect repetition/degeneration patterns across assistant turns)
  • Knowledge Retention (Checks whether final assistant replies retain earlier user-provided facts)
  • GEvalConversationMetric (Allows the use of LLM judges on threads; see the sketch below)
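
A minimal sketch of scoring a thread with the new conversation metrics, assuming they follow the same score(conversation=...) convention as the SDK's existing conversation metrics; constructor arguments are omitted and may differ in the actual release.

```python
from opik.evaluation.metrics import (
    ConversationDegenerationMetric,
    KnowledgeRetentionMetric,
)

conversation = [
    {"role": "user", "content": "Hi, my name is Ada and I deploy on Python 3.12."},
    {"role": "assistant", "content": "Nice to meet you, Ada! How can I help?"},
    {"role": "user", "content": "Which Python version did I say I deploy on?"},
    {"role": "assistant", "content": "You said you deploy on Python 3.12."},
]

# Both metrics score the whole thread rather than a single input/output pair.
for metric in (ConversationDegenerationMetric(), KnowledgeRetentionMetric()):
    result = metric.score(conversation=conversation)
    print(metric.name, result.value, result.reason)
```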

Heuristics (Non-LLM Based Scores)

  • BERTScore (Embedding-based similarity score, a good alternative to Levenshtein distance)
  • chrF/chrF++ (Machine translation evaluation)
  • Distribution metrics (histogram-based text distribution metrics for comparisons)
  • GLEU (Estimates fluency for grammatical error correction)
  • Language Adherence (Checks whether text adheres to an expected language code)
  • METEOR (Computes the METEOR score between output and reference text)
  • Prompt Injection (Simple heuristic detector for prompt-injection or leakage attempts)
  • Readability (Deterministic readability scores)
  • ROUGE (Improved existing ROUGE with rouge1, rouge2, rougeL, rougeLsum, and rougeW; see the sketch below)
  • Spearman (Spearman's rank correlation for two equal-length rankings)
  • Tone (Flags tone issues such as negativity, shouting, or forbidden phrases)
  • VADER Sentiment (Computes VADER sentiment scores for a piece of text)
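
A minimal sketch of two of the heuristic metrics in use. The rouge_type argument follows the existing ROUGE metric's interface; the BERTScore class name is assumed from the list above, and its constructor options may differ.

```python
from opik.evaluation.metrics import ROUGE, BERTScore  # BERTScore name assumed from the list above

reference = "The quick brown fox jumps over the lazy dog."
output = "A quick brown fox jumped over a lazy dog."

# ROUGE: rouge_type selects the variant (rouge1, rouge2, rougeL, rougeLsum, rougeW).
rouge = ROUGE(rouge_type="rougeL")
print("rougeL:", rouge.score(output=output, reference=reference).value)

# BERTScore: embedding-based similarity, no LLM call involved.
bertscore = BERTScore()
print("bertscore:", bertscore.score(output=output, reference=reference).value)
```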

LLM Judges (LLM Based Scores)

  • MMAD (LLM jury; aggregates multiple judge metrics (the 'juries') into a consensus score)
  • Agent Tool Correctness Judge (Evaluates whether an agent used tools correctly)
  • Agent Task Completion Judge (Scores whether an agent completed the assigned task)
  • Demographic Bias Judge (Scores demographic bias present in a model response)
  • Political Bias Judge (Scores political/ideological bias in a model response)
  • Compliance Risk Judge (Evaluates non-factual or non-compliant statements for regulated sectors)
  • Prompt Perplexity Judge (Rates how difficult a prompt is for an LLM to interpret)
  • RevisEval Judge (LLM judge that revises answers using grounded evidence)
  • QA Summarization Consistency Judge (Scores how faithful a summary is to its source content)
  • QA Summarization Coherence Judge (Evaluates coherence and structure of generated summaries)
  • QA Dialogue Helpfulness Judge (Judges how helpful an assistant reply is within a dialogue)
  • QA Relevance Judge (Checks whether an answer directly addresses the user question; see the sketch below)
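
A minimal sketch of invoking one of the new judges, assuming the class name from the summary note below (QARelevanceJudge) and the SDK's usual score(input=..., output=...) convention; constructor options beyond model are omitted and may differ.

```python
from opik.evaluation.metrics import QARelevanceJudge  # class name per the summary note below

# gpt-5-nano is now the default judge model, so passing it explicitly is optional.
judge = QARelevanceJudge(model="gpt-5-nano")

result = judge.score(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
)
print(result.value, result.reason)
```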

Change checklist

  • User facing
  • Documentation update

Issues

Testing

Done. Used the conversations generated above and validated the scores:

[Screenshots (2025-10-01 07:24:16 and 07:24:26): validated metric scores]

Documentation

Done


Note

Adds many new eval metrics (heuristics, LLM judges, conversation-level), GEval presets/conversation adapters, switches default judge model to gpt-5-nano, and hardens LiteLLM handling; comprehensive docs added.

  • SDK (Python):
    • Metrics Expansion:
      • Heuristics: BERTScore, GLEU, ChrF, METEOR, distribution metrics (JSDivergence, JSDistance, KLDivergence), SpearmanRanking, ReadabilityGuard, ToneGuard, PromptInjectionGuard, LanguageAdherence, VADERSentiment.
      • Conversation-level: RougeCMetric, BleuCMetric, MeteorCMetric, ConversationDegenerationMetric, KnowledgeRetentionMetric.
      • LLM Judges: ComplianceRiskJudge, PromptPerplexityJudge, PromptUncertaintyJudge, agent judges (AgentTaskCompletionJudge, AgentToolCorrectnessJudge), bias judges, QA suite (DialogueHelpfulnessJudge, QARelevanceJudge, SummarizationConsistencyJudge, SummarizationCoherenceJudge), RevisEvalJudge, LLMJuriesJudge, TrajectoryAccuracy.
    • GEval:
      • Add GEvalPreset (prebuilt rubrics) and conversation adapters (Conversation* metrics).
      • Cache CoT prompts; improve LiteLLM logprob parsing.
    • Core/Infra:
      • BaseMetric gains input preprocessor hook.
      • Default model -> gpt-5-nano; model factory caches instances.
      • LiteLLM model: drop unsupported params (e.g., temperature, logprobs on GPT-5) with a warning; add retries/backoff and robust content parsing (see the sketch below).
      • BLEU/ROUGE improvements (warning suppression; add rougeW).
      • New text preprocessing utilities.
    • Tests: add/adjust unit/integration tests for new metrics, presets, adapters, and model changes.
  • Docs:
    • Add/expand pages for new metrics (heuristics, LLM judges, conversation, prompt diagnostics, compliance, agent/tool judges, summarization metrics, RevisEval, LLM Juries, trajectory accuracy) and advanced configuration.
    • Update overviews; default evaluator model updated to gpt-5-nano; new GEval conversation metrics doc.
  • Misc:
    • Minor YAML/backup doc tweaks.

Written by Cursor Bugbot for commit 9a0d822. This will update automatically on new commits.
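
To illustrate the validate-and-drop behaviour described above, here is a minimal sketch built on LiteLLM's public get_supported_openai_params helper. filter_unsupported_params is a hypothetical name for illustration, not the SDK's actual internal function, and the exact parameters each model rejects are determined by LiteLLM's metadata at runtime.

```python
import warnings

import litellm
from litellm import get_supported_openai_params

def filter_unsupported_params(model: str, params: dict) -> dict:
    """Keep only parameters that LiteLLM reports as supported for `model`."""
    supported = set(get_supported_openai_params(model=model) or [])
    kept = {}
    for name, value in params.items():
        if name in supported:
            kept[name] = value
        else:
            # Warn-and-drop instead of letting the completion call fail
            # (e.g., temperature/logprobs on GPT-5-family models).
            warnings.warn(f"{model!r} does not support {name!r}; dropping it.")
    return kept

kwargs = filter_unsupported_params(
    "gpt-5-nano", {"temperature": 0.0, "logprobs": True, "max_tokens": 64}
)
response = litellm.completion(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Return a relevance score from 0 to 1."}],
    **kwargs,
)
print(response.choices[0].message.content)
```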

@vincentkoc vincentkoc marked this pull request as ready for review October 1, 2025 14:36
@vincentkoc vincentkoc requested a review from a team as a code owner October 1, 2025 14:36
Copilot AI review requested due to automatic review settings October 1, 2025 14:36
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Oct 1, 2025
Member Author

@vincentkoc vincentkoc left a comment


Checks for myself (resolved)

vincentkoc and others added 3 commits November 3, 2025 12:11
…o feat/new-evals

* 'feat/new-evals' of https://github.com/comet-ml/opik: (48 commits)
  [NA] [SDK] Opik Optimizer Cursor Rules and AGENT.md (#3895)
  [NA] [SDK] Opik Optimizer Fix GEPA e2e Tests and test all Python Versions (#3899)
  Fix broken links (#3923)
  [OPIK-2816] [P SDK] alexkuzmik / update sdk prompt docs (#3918)
  Update base version to 1.8.99
  Update TypeScript SDK version to 1.8.98
  [OPIK-2870] [P SDK] Implement an utility function to workaround the langgraph execution context disconnection when running in asyncio context (#3914)
  [NA] - Updating dev-runner.ps1 script to support --quick-restart (#3915)
  [NA] [BE] Upgrade MySQL container from Testcontainers (#3909)
  [OPIK-2786] [P SDK] Implement recording context manager that will allow to get Trace- and Span- models for the submitted data (#3880)
  [OPIK-2909][FE] Playground fe improvements (#3890)
  [NA] [DOCS] Add Google ADK section to agent graph logging documentation (#3911)
  [NA] [Docs] Align playground supported providers with AI providers configuration (#3908)
  [NA] [Docs] Add Amazon Bedrock as AI Provider configuration guide (#3903)
  Bump com.mysql:mysql-connector-j in /apps/opik-backend (#3906)
  Bump software.amazon.awssdk:bom in /apps/opik-backend (#3905)
  test names and tags for suites (#3888)
  Bump com.jayway.jsonpath:json-path in /apps/opik-backend (#3907)
  [NA] Fix help text alignment for --quick-restart mode in dev-runner.sh (#3810)
  [OPIK-2863] [HELM] Add initContainer wait-for-mysql for opik-backend (#3902)
  ...
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 106 out of 116 changed files in this pull request and generated 4 comments.

@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 3, 2025
@vincentkoc
Member Author

Tested all the judges and metrics (screenshots below). One small issue was found with the prompt-injection heuristic and patched. All tests are passing and review issues have been addressed.

[Screenshots (2025-11-03 12:49:53 and 12:50:00): judge and metric test results]

@comet-ml comet-ml deleted a comment from github-actions bot Nov 4, 2025
@vincentkoc vincentkoc merged commit c41fa7f into main Nov 4, 2025
227 of 254 checks passed
@vincentkoc vincentkoc deleted the feat/new-evals branch November 4, 2025 16:11
