# Semantic Router Quickstart

This quickstart walks through the minimal set of commands needed to prove that
the semantic router can classify incoming chat requests, route them through
Envoy, and receive OpenAI-compatible completions. The flow is optimized for
local laptops and uses a lightweight mock backend by default, so the entire
loop finishes in a few minutes.

## Prerequisites

- A Python environment with the project’s dependencies installed and the
  virtualenv activated.
- `make`, `curl`, `go`, `cargo`, `rustc`, and `python3` in `PATH`.
- All commands below are run from the repository root.

## Step-by-Step Runbook

0. **Download router support models**

   These assets (ModernBERT classifiers, LoRA adapters, embeddings, etc.) are
   required before the router can start.

   ```bash
   make download-models
   ```
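
   To confirm the download landed, you can list the artifacts. The `models/`
   directory is an assumption here; check the Makefile target if your checkout
   stores them elsewhere.

   ```bash
   # Assumes the default download location is ./models; adjust if the
   # Makefile target writes elsewhere.
   du -sh models/* 2>/dev/null | sort -rh | head
   ```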

1. **Start the OpenAI-compatible backend**

   The router expects at least one endpoint that serves `/v1/chat/completions`.
   You can point to a real vLLM deployment, but the fastest option is the
   bundled mock server:

   ```bash
   pip install -r tools/mock-vllm/requirements.txt
   python -m uvicorn tools.mock_vllm.app:app --host 0.0.0.0 --port 8000
   ```

   Leave this process running; it provides instant canned responses for
   `openai/gpt-oss-20b`.
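
   To sanity-check the mock before wiring in Envoy, send a chat completion
   straight to port 8000. The request body below simply follows the standard
   OpenAI chat schema, using the model name quoted above.

   ```bash
   curl -s http://127.0.0.1:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "openai/gpt-oss-20b",
       "messages": [{"role": "user", "content": "ping"}]
     }'
   ```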

2. **Launch Envoy**

   In a separate terminal, bring up the Envoy sidecar that listens on
   `http://127.0.0.1:8801/v1/*` and forwards traffic to the router’s gRPC
   ExtProc server.

   ```bash
   make run-envoy
   ```
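
   To confirm the listener is bound, probe it with curl. Any HTTP response at
   all (likely an error status, since the router's ExtProc server is not up
   yet) means Envoy is accepting connections on 8801.

   ```bash
   # Print only the HTTP status code; expect an error status until step 3.
   curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8801/v1/models
   ```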

3. **Start the router with the quickstart config**

   In another terminal, run the quickstart bootstrap. Point the health probe at
   the router’s local HTTP API (port 8080) so the script does not wait on the
   Envoy endpoint.

   ```bash
   QUICKSTART_HEALTH_URL=http://127.0.0.1:8080/health \
     ./examples/quickstart/quickstart.sh --skip-download --skip-build
   ```

   Keep this process alive; Ctrl+C will stop the router.
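
   With the mock backend, Envoy, and the router all running, one chat
   completion through Envoy verifies the whole loop. This mirrors the direct
   mock check above with only the port changed:

   ```bash
   curl -s http://127.0.0.1:8801/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "openai/gpt-oss-20b",
       "messages": [{"role": "user", "content": "What is 2 + 2?"}]
     }'
   ```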

4. **Run the quick evaluation**

   With Envoy, the router, and the mock backend running, execute the benchmark
   to send a small batch of MMLU questions through the routing pipeline.

   ```bash
   OPENAI_API_KEY="sk-test" \
   ./examples/quickstart/quick-eval.sh \
     --mode router \
     --samples 5 \
     --vllm-endpoint ""
   ```

   - `--mode router` restricts the run to router-transparent requests.
   - `--vllm-endpoint ""` disables direct vLLM comparisons.

5. **Inspect the results**

   The evaluator writes all artifacts under
   `examples/quickstart/results/<timestamp>/`:

   - `raw/` – individual JSON summaries per dataset/model combination.
   - `quickstart-summary.csv` – tabular metrics (accuracy, tokens, latency).
   - `quickstart-report.md` – Markdown report suitable for sharing.

   You can re-run the evaluator with different flags (e.g., `--samples 10`,
   `--dataset arc`), and the outputs will land in fresh timestamped folders.
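
   For a quick look at the metrics without opening a spreadsheet, you can
   align the CSV columns in the terminal. The snippet assumes the most
   recently modified results folder is the run you just finished.

   ```bash
   # Pick the newest timestamped results directory and pretty-print the CSV.
   latest=$(ls -td examples/quickstart/results/*/ | head -1)
   column -s, -t < "${latest}quickstart-summary.csv"
   ```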

## Switching to a Real vLLM Backend

If you prefer to exercise a real language model:

1. Replace step 1 with a real vLLM launch (or any OpenAI-compatible server);
   a sketch follows this list.
2. Update `examples/quickstart/config-quickstart.yaml` so the `vllm_endpoints`
   block points to that service (IP, port, and model name).
3. Re-run steps 2–4. No other changes to the quickstart scripts are needed.
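
A minimal launch might look like the following; the model name here is only an
illustration, so pick one your hardware can serve and make sure it matches the
name in `config-quickstart.yaml`.

```bash
# Serve an OpenAI-compatible endpoint on the same port the mock used.
# Requires a working vLLM install and a GPU sized for the chosen model.
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
```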

Keep the mock server in your toolkit for quick demos; swap in full vLLM when
you want latency and quality signals from the actual model.