docs: Model Performance Evaluation Guide #136
Merged
Commits (10)
- 2e78bc2 docs: fix markdownlint (JaredforReal)
- 658e294 docs: add Next in training-overview.md (JaredforReal)
- 468fb15 fix deploy error (JaredforReal)
- f4d7de1 docs: add pngs and examples in doc & add doc to sidebar (JaredforReal)
- 299cde3 Merge branch 'main' into eval_doc (JaredforReal)
- bf11d7e Fix: docs-build error (JaredforReal)
- 3dcfea4 Merge branch 'main' into eval_doc (rootfs)
- 316caa4 Merge branch 'main' into eval_doc (rootfs)
- 5416fbf Merge branch 'main' into eval_doc (JaredforReal)
- 32a78e0 fix: codespell in model-perf-eval (JaredforReal)
# Model Performance Evaluation

## Why evaluate?

Evaluation makes routing data-driven. By measuring per-category accuracy on MMLU-Pro (and doing a quick sanity check with ARC), you can:

- Select the right model for each category and rank them into categories.model_scores
- Pick a sensible default_model based on overall performance
- Decide when CoT prompting is worth the latency/cost tradeoff
- Catch regressions when models, prompts, or parameters change
- Keep changes reproducible and auditable for CI and releases

In short, evaluation converts anecdotes into measurable signals that improve the quality, cost efficiency, and reliability of the router.

---

This guide documents the automated workflow to evaluate models (MMLU-Pro and ARC Challenge) via a vLLM-compatible OpenAI endpoint, generate a performance-based routing config, and update categories.model_scores in the config.

See the code in [/src/training/model_eval](https://github.com/vllm-project/semantic-router/tree/main/src/training/model_eval)

### What you'll run end-to-end

#### 1) Evaluate models:

- MMLU-Pro: per-category accuracies
- ARC Challenge: overall accuracy

#### 2) Visualize results

- bar/heatmap plot of per-category accuracies

**TODO:** add an example plot image here

#### 3) Generate an updated config.yaml:

- Rank models per category into categories.model_scores
- Set default_model to the best average performer
- Keep or apply category-level reasoning settings

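Taken together, the three steps above chain into a small pipeline. The following is a minimal sketch rather than the only way to run it: it assumes you run from the repository root, that the endpoint URL and the model names (gemma3:27b, phi4) are placeholders for your own deployment, and that the default output locations are used. Each flag is explained in the per-script sections below.

```bash
# 1) Evaluate on MMLU-Pro (small sample per category for a quick pass)
python src/training/model_eval/mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 \
  --samples-per-category 10

# 2) Plot per-category accuracies from the results/ directory
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png

# 3) Turn the same results into a candidate routing config
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml
```
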
## 1. Prerequisites

- A running vLLM-compatible OpenAI endpoint serving your models (a quick connectivity check is sketched at the end of this section)
  - Endpoint URL like http://localhost:8000/v1
  - Optional API key if your endpoint requires one
- Python packages for evaluation scripts:
  - From the repo root: matplotlib
  - From `/src/training/model_eval`: [requirements.txt](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/requirements.txt)

```bash
cd src/training/model_eval
pip install -r requirements.txt
```

**Optional tip:**

- Ensure your `config/config.yaml` includes your deployed model names under `vllm_endpoints[].models` and any pricing/policy under `model_config` if you plan to use the generated config directly.

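Before kicking off an evaluation, it can be worth confirming the endpoint is reachable and seeing exactly which model names it serves (the eval scripts query the same /models route when --models is omitted). A minimal check, assuming the endpoint follows the OpenAI-style /v1/models response shape and that jq is available; add an Authorization header only if your server enforces an API key:

```bash
# List the model names served by the endpoint
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

# If the endpoint requires a key:
# curl -s http://localhost:8000/v1/models -H "Authorization: Bearer $API_KEY" | jq -r '.data[].id'
```
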
## 2. Evaluate on MMLU-Pro

See the script: [mmlu_pro_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/mmlu_pro_vllm_eval.py)

### Example usage patterns:

```bash
# Evaluate a few models, few samples per category, direct prompting
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 mistral-small3.1 \
  --samples-per-category 10

# Evaluate with CoT (results saved under *_cot)
python mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b phi4 mistral-small3.1 \
  --samples-per-category 10 \
  --use-cot
```

### Key flags:

- **--endpoint**: vLLM OpenAI URL (default http://localhost:8000/v1)
- **--models**: space-separated list OR a single comma-separated string; if omitted, the script queries /models from the endpoint
- **--categories**: restrict evaluation to specific categories; if omitted, uses all categories in the dataset
- **--samples-per-category**: limit questions per category (useful for quick runs)
- **--use-cot**: enables the Chain-of-Thought prompting variant; results are saved in a separate subfolder (suffix _cot vs _direct)
- **--concurrent-requests**: concurrency for throughput
- **--output-dir**: where results are saved (default results)
- **--max-tokens**, **--temperature**, **--seed**: generation and reproducibility knobs

### What it outputs per model:

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, category_accuracy map, avg_response_time, counts (see the inspection sketch below)
  - **summary.json**: condensed metrics
  - **mmlu_pro_vllm_eval.txt**: prompts and answers log (debug/inspection)

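The headline numbers can be skimmed without opening the CSV. A minimal sketch, assuming jq is installed and that a results/gemma3:27b_direct run exists; adjust the folder name to your own model and prompting mode:

```bash
# Overall accuracy plus the per-category breakdown for one run
jq '{overall: .overall_accuracy, by_category: .category_accuracy}' \
  results/gemma3:27b_direct/analysis.json
```
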
### Notes:

- Model naming: slashes in model names are replaced with underscores for folder names, and the prompting mode is appended; e.g., gemma3:27b results land in the gemma3:27b_direct directory.
- Category accuracy is computed on successful queries only; failed requests are excluded.

## 3. Evaluate on ARC Challenge (optional, overall sanity check)

See the script: [arc_challenge_vllm_eval.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/arc_challenge_vllm_eval.py)

### Example usage patterns:

```bash
python arc_challenge_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models gemma3:27b,phi4:latest
```

### Key flags:

- **--samples**: total questions to sample (default 20); ARC is not categorized in our script
- Other flags mirror the MMLU-Pro script

### What it outputs per model:

- **results/Model_Name_(direct|cot)/**
  - **detailed_results.csv**: one row per question with is_correct and category
  - **analysis.json**: overall_accuracy, avg_response_time
  - **summary.json**: condensed metrics
  - **arc_challenge_vllm_eval.txt**: prompts and answers log (debug/inspection)

### Note:

ARC results do not feed categories[].model_scores directly, but they can help spot regressions.

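For the regression-spotting use case, comparing overall_accuracy across runs is usually enough. A minimal sketch, assuming jq is available and that each run directory under results/ contains an analysis.json as described above; note that it scans every run in results/, so filter the glob if MMLU-Pro runs live in the same directory:

```bash
# Print overall accuracy per run directory, highest first
for f in results/*/analysis.json; do
  printf '%s %s\n' "$(jq -r '.overall_accuracy' "$f")" "$(dirname "$f")"
done | sort -rn
```
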
## 4. Visualize per-category performance

See the script: [plot_category_accuracies.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/plot_category_accuracies.py)

### Example usage patterns:

```bash
# Use results/ to generate a bar plot
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type bar \
  --output-file model_eval/category_accuracies.png

# Use results/ to generate a heatmap plot
python src/training/model_eval/plot_category_accuracies.py \
  --results-dir results \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png

# Use sample data to generate an example plot
python src/training/model_eval/plot_category_accuracies.py \
  --sample-data \
  --plot-type heatmap \
  --output-file model_eval/category_accuracies.png
```

### Key flags:

- **--results-dir**: where the analysis.json files live
- **--plot-type**: bar or heatmap
- **--output-file**: output image path (default model_eval/category_accuracies.png)
- **--sample-data**: if no results exist, generates fake data to preview the plot

### What it does:

- Finds all results/**/analysis.json and aggregates analysis["category_accuracy"] per model (see the quick check below)
- Adds an Overall column representing the average across categories
- Produces a figure to quickly compare model/category performance

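Since the figure is built from whatever result files are discovered, it can help to list them first and confirm which runs will show up. A plain-shell check, assuming the default results/ layout:

```bash
# Show every run the plotting script will pick up
find results -name analysis.json
```
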
### Note:

- It treats "direct" and "cot" runs as distinct model variants by appending :direct or :cot to the label; the legend hides ":direct" for brevity.

## 5. Generate performance-based routing config

See the script: [result_to_config.py](https://github.com/vllm-project/semantic-router/blob/main/src/training/model_eval/result_to_config.py)

### Example usage patterns:

```bash
# Use results/ to generate a new config file (does not overwrite config/config.yaml)
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

# Adjust the similarity threshold
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml \
  --similarity-threshold 0.85

# Generate from a specific results folder
python src/training/model_eval/result_to_config.py \
  --results-dir results/mmlu_run_2025_09_10 \
  --output-file config/config.eval.yaml
```

### Key flags:

- **--results-dir**: points to the folder where analysis.json files live
- **--output-file**: target config path (default config/config.yaml)
- **--similarity-threshold**: semantic cache threshold to set in the generated config

### What it does:

- Reads all analysis.json files, extracting analysis["category_accuracy"]
- Constructs a new config:
  - default_model: the best average performer across categories
  - categories: for each category present in results, ranks models by accuracy:
    - category.model_scores = `[{ model: "Model_Name", score: 0.87 }, ...]`, highest first (a shell preview of this ranking is sketched below)
    - category reasoning settings: auto-filled from a built-in mapping (math, physics, chemistry, CS, engineering -> high reasoning; others default to low/medium; you can adjust after generation)
- Leaves out any special "auto" placeholder models if present

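The per-category ranking that ends up in categories[].model_scores can be previewed from the shell before generating anything. A minimal sketch, assuming jq is installed; "math" is only an example, substitute any MMLU-Pro category present in your results:

```bash
# Rank runs by accuracy for one category, highest first
CATEGORY=math   # example; category names containing spaces also work
for f in results/*/analysis.json; do
  printf '%s %s\n' \
    "$(jq -r --arg c "$CATEGORY" '.category_accuracy[$c] // "n/a"' "$f")" \
    "$(dirname "$f")"
done | sort -rn
```
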
### Schema alignment:

- **categories[].name**: the MMLU-Pro category string
- **categories[].model_scores**: descending ranking by accuracy for that category
- **default_model**: a top performer across categories (approach suffix removed, e.g., gemma3:27b from gemma3:27b:direct)
- Keeps other config sections (semantic_cache, tools, classifier, prompt_guard) with reasonable defaults; you can edit them post-generation if your environment differs

### Note:

- The existing config.yaml can be overwritten. Consider writing to a temp file first and diffing (a sketch follows below):
  - --output-file config/config.eval.yaml
- If your production config.yaml carries environment-specific settings (endpoints, pricing, policies), port the evaluated categories[].model_scores and default_model back into your canonical config.

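A concrete way to apply the tip above, assuming the generated file is written to config/config.eval.yaml as in the examples and that your canonical config lives at config/config.yaml:

```bash
# Write the evaluated config to a separate file, then review the differences before adopting anything
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.eval.yaml

diff -u config/config.yaml config/config.eval.yaml
```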