Commit 3a16fc2

docs: add semantic cache doc to explain how to use in-memory and milvus in the config
Signed-off-by: Huamin Chen <hchen@redhat.com>
1 parent 14cb752 commit 3a16fc2

1 file changed: 191 additions, 0 deletions

# Semantic Cache

Semantic Router caches LLM responses by meaning rather than by exact text. It embeds each request, searches the cache for a semantically similar query, and serves the cached response on a match, which reduces latency and LLM inference costs.

## Architecture

```mermaid
graph TB
    A[Client Request] --> B[Semantic Router]
    B --> C{Cache Enabled?}
    C -->|No| G[Route to LLM]
    C -->|Yes| D[Generate Embedding]
    D --> E{Similar Query in Cache?}
    E -->|Hit| F[Return Cached Response]
    E -->|Miss| G[Route to LLM]
    G --> H[LLM Response]
    H --> I[Store in Cache]
    H --> J[Return Response]
    I --> K[Update Metrics]
    F --> K

    style F fill:#90EE90
    style I fill:#FFB6C1
```

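The lookup path in the diagram can be sketched as follows. This is a minimal illustration, not the router's implementation: the short vectors stand in for real embeddings, cosine similarity is assumed as the similarity measure, and `SemanticCache`, `store`, and `lookup` are hypothetical names.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache: linear scan over stored (embedding, response) pairs."""

    def __init__(self, similarity_threshold: float = 0.8) -> None:
        self.similarity_threshold = similarity_threshold
        self._entries: list[tuple[list[float], str]] = []

    def store(self, embedding: list[float], response: str) -> None:
        self._entries.append((embedding, response))

    def lookup(self, embedding: list[float]) -> Optional[str]:
        """Return the best-matching response, or None on a cache miss."""
        best_score = 0.0
        best_response: Optional[str] = None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.similarity_threshold:
            return best_response
        return None
```

A query whose embedding points in nearly the same direction as a stored one is a hit; an unrelated query falls below the threshold and would be routed to the LLM. A production backend such as Milvus replaces the linear scan with an indexed vector search.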
## Backend Options

### Memory Backend (Development)
- **Use case**: Development, testing, single-instance deployments
- **Pros**: Fast startup, no external dependencies
- **Cons**: Data lost on restart, limited to single instance

### Milvus Backend (Production/Persistent)
- **Use case**: Production, distributed deployments
- **Pros**: Persistent storage, horizontally scalable, high availability
- **Cons**: Requires Milvus cluster setup

## Configuration

### Memory Backend
```yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8
  max_entries: 1000
  ttl_seconds: 3600
```

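To make `max_entries` and `ttl_seconds` concrete, here is a small sketch of a bounded store with expiry. The eviction policy is an assumption (oldest entry first, expiry checked on read); the source does not say how the memory backend evicts, and `BoundedTTLStore` is a hypothetical name.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class BoundedTTLStore:
    """Hypothetical store: evicts the oldest entry when full, expires on read."""

    def __init__(
        self,
        max_entries: int = 1000,
        ttl_seconds: float = 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # Maps key -> (expiry time, value); insertion order is eviction order.
        self._items = OrderedDict()

    def put(self, key: str, value: str) -> None:
        # Re-inserting a key moves it to the newest position.
        self._items.pop(key, None)
        self._items[key] = (self._clock() + self.ttl_seconds, value)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # drop the oldest entry

    def get(self, key: str) -> Optional[str]:
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._items[key]  # expired: remove and report a miss
            return None
        return value

    def __len__(self) -> int:
        return len(self._items)
```

The injectable `clock` exists only so that expiry can be exercised in tests without sleeping.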
### Milvus Backend
```yaml
semantic_cache:
  enabled: true
  backend_type: "milvus"
  backend_config_path: "config/cache/milvus.yaml"
  similarity_threshold: 0.8
  ttl_seconds: 3600
```

## Testing Cache Functionality

### Test Memory Backend

Start the router with the memory cache:
```bash
# Run the router
make run-router
```

Test cache behavior:
```bash
# Send this request twice; the second call should be served from the cache
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Send a similar request (should hit the cache due to semantic similarity)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain machine learning"}]
  }'
```

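The same check can be scripted to compare latencies. The sketch below uses only the Python standard library and the endpoint and payload from the curl examples; `build_chat_request` and `timed_post` are hypothetical helper names, and nothing is sent until `timed_post` is called.

```python
import json
import time
import urllib.request

# Endpoint taken from the curl examples above.
ROUTER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(content: str, url: str = ROUTER_URL) -> urllib.request.Request:
    """Build the same POST request the curl examples send."""
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_post(content: str) -> float:
    """Send one request to the router and return the elapsed seconds."""
    request = build_chat_request(content)
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return time.perf_counter() - start
```

With the router running, calling `timed_post("What is machine learning?")` twice and then `timed_post("Explain machine learning")` should show lower latency on the later calls if they are served from the cache. Latency is only a hint; the cache metrics are the more reliable signal.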
### Test Milvus Backend

Start the Milvus container:
```bash
make start-milvus
```

Update the configuration to use Milvus:
```bash
# Edit config/config.yaml (GNU sed syntax; BSD/macOS sed needs `sed -i ''`)
sed -i 's/backend_type: "memory"/backend_type: "milvus"/' config/config.yaml
sed -i 's/# backend_config_path:/backend_config_path:/' config/config.yaml
```

Run the router with Milvus support:
```bash
# Run the router
make run-router
```

Stop Milvus when done:
```bash
make stop-milvus
```

## Monitoring Cache Performance

### Available Metrics

The router exposes Prometheus metrics for cache monitoring:

| Metric | Type | Description |
|--------|------|-------------|
| `llm_cache_hits_total` | Counter | Total cache hits |
| `llm_cache_misses_total` | Counter | Total cache misses |
| `llm_cache_operations_total` | Counter | Cache operations by backend, operation, and status |
| `llm_cache_operation_duration_seconds` | Histogram | Duration of cache operations |
| `llm_cache_entries_total` | Gauge | Current number of cache entries |

### Cache Metrics Dashboard

Access metrics via:
- **Metrics endpoint**: `http://localhost:9190/metrics`
- **Built-in stats**: Available via cache backend `GetStats()` method

Example Prometheus queries:
```promql
# Cache hit rate
rate(llm_cache_hits_total[5m]) / (rate(llm_cache_hits_total[5m]) + rate(llm_cache_misses_total[5m]))

# Average cache operation duration
rate(llm_cache_operation_duration_seconds_sum[5m]) / rate(llm_cache_operation_duration_seconds_count[5m])

# Cache operations by backend
sum by (backend) (rate(llm_cache_operations_total[5m]))
```

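Without a Prometheus server, a rough hit ratio can be read straight from the metrics endpoint. The sketch below parses the Prometheus text exposition format with a deliberately simple parser (it assumes no `}` inside label values). The metric names come from the table above; `sum_samples` and `hit_ratio` are hypothetical helpers.

```python
from typing import Optional


def sum_samples(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of one metric across all of its label sets."""
    total = 0.0
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if line.startswith(metric_name + "{"):
            # Drop the label set; assumes no "}" inside label values.
            remainder = line[line.rfind("}") + 1:]
        elif line.startswith(metric_name + " "):
            remainder = line[len(metric_name):]
        else:
            continue
        fields = remainder.split()
        if fields:
            total += float(fields[0])  # first field is the value
    return total


def hit_ratio(exposition_text: str) -> Optional[float]:
    """Hits / (hits + misses) since process start, or None with no lookups."""
    hits = sum_samples(exposition_text, "llm_cache_hits_total")
    misses = sum_samples(exposition_text, "llm_cache_misses_total")
    lookups = hits + misses
    return hits / lookups if lookups else None
```

The text can be fetched with `urllib.request.urlopen("http://localhost:9190/metrics").read().decode()`. Unlike the `rate(...[5m])` queries, this ratio covers the whole process lifetime, because counters only reset on restart.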
### Cache Performance Analysis

Monitor these key indicators:

1. **Hit Ratio**: Higher ratios indicate better cache effectiveness
2. **Operation Latency**: Cache lookups should be significantly faster than LLM calls
3. **Entry Count**: Monitor cache size for memory management
4. **Backend Performance**: Compare memory vs Milvus operation times

## Configuration Best Practices
156+
157+
### Development Environment
158+
```yaml
159+
semantic_cache:
160+
enabled: true
161+
backend_type: "memory"
162+
similarity_threshold: 0.85 # Higher threshold for more precise matching
163+
max_entries: 500 # Smaller cache for testing
164+
```
165+
166+
### Production Environment
```yaml
semantic_cache:
  enabled: true
  backend_type: "milvus"
  backend_config_path: "config/cache/milvus.yaml"
  similarity_threshold: 0.8  # Balanced threshold
```

### Milvus Production Configuration
176+
```yaml
177+
# config/cache/milvus.yaml
178+
connection:
179+
host: "milvus-cluster.prod.example.com" # Replace with your Milvus cluster endpoint
180+
port: 443
181+
auth:
182+
enabled: true
183+
username: "semantic-router" # Replace with your Milvus username
184+
password: "${MILVUS_PASSWORD}" # Replace with your Milvus password
185+
tls:
186+
enabled: true
187+
188+
development:
189+
drop_collection_on_startup: false # Preserve data
190+
auto_create_collection: false # Pre-create collections
191+
```
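
The password above is written as `${MILVUS_PASSWORD}`. Whether the router expands that placeholder itself is not stated here. If it does not, one option is to render the file from the environment before startup. A minimal sketch using only the Python standard library; `render_config` is a hypothetical helper, not part of the router.

```python
import os
from string import Template
from typing import Mapping, Optional


def render_config(template_text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} placeholders with values from env (default: os.environ).

    Raises KeyError if a referenced variable is missing, so a forgotten
    secret fails fast instead of reaching Milvus as a literal placeholder.
    """
    values = os.environ if env is None else env
    return Template(template_text).substitute(values)
```

`string.Template` treats every `$` as the start of a placeholder, so a literal dollar sign elsewhere in the file would need to be written as `$$`.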
