AI Proxy is an open-source project that provides a simple way to run a proxy server for LLMs.
Many existing solutions are pseudo-open-source, hiding features behind paywalls. AI Proxy aims to be fully open source and free to use.
- CO2 emission tracking (CodeCarbon API)
- Request and response monitoring
- API key management with per-model permissions
- Rate limiting
- OpenAI-compatible API endpoint
Requirements:
- Docker
Copy the example configuration file and edit it to your needs:

```bash
cp config.example.yaml config.yaml
```

Edit `config.yaml` to set your OpenAI API key and other settings:
```yaml
global:
  model_list:
    - model_name: devstral
      params:
        model: devstral:latest
        api_base: http://ollama-service.ollama.svc.cluster.local:11434/v1
        drop_params: true
        api_key: "no_token"
        max_input_tokens: 25000
  keys:
    - name: "user"
      token: "token"
      models:
        - "devstral"
```
Run the server:

```bash
docker-compose up -d
```
The server will be available at http://localhost:8000.
The API docs are available at http://localhost:8000/docs.
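
Once the proxy is running, any OpenAI-compatible client can talk to it using a token declared under `keys` in `config.yaml`. Below is a minimal sketch using the `openai` Python package; it assumes the proxy exposes the OpenAI-compatible routes under `/v1` and reuses the `user` key token and the `devstral` model from the example configuration above.

```python
from openai import OpenAI

# Point the client at AI Proxy instead of api.openai.com.
# Assumption: the OpenAI-compatible routes are exposed under /v1.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token",  # the token declared under `keys` in config.yaml
)

# "devstral" must appear in the key's `models` permission list.
response = client.chat.completions.create(
    model="devstral",
    messages=[{"role": "user", "content": "Say hello through AI Proxy."}],
)
print(response.choices[0].message.content)
```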
A Prometheus endpoint is available at http://localhost:8001/metrics.
Exposed metrics:
- request_count
- request_latency
- request_tokens
- response_tokens
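
The endpoint can be scraped by Prometheus or fetched by hand as a quick sanity check. The sketch below uses the `requests` package to pull the endpoint and print only the metric families listed above; the exact sample names may differ slightly in the real exposition (e.g. `_total` suffixes on counters).

```python
import requests

# Assumption: the proxy was started with docker-compose as shown above.
METRICS_URL = "http://localhost:8001/metrics"
WATCHED = ("request_count", "request_latency", "request_tokens", "response_tokens")

body = requests.get(METRICS_URL, timeout=5).text

# Print only the samples for the documented metrics; Prometheus comment
# lines (# HELP / # TYPE) do not start with these names and are skipped.
for line in body.splitlines():
    if line.startswith(WATCHED):
        print(line)
```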
- embedding routes
- transcript routes
- TTS routes
- streaming TTS route
- streaming transcript route
- basic stats (hours/days/weeks/months)
  - models
    - requests per model (line chart)
    - tokens per model (line chart)
    - model distribution (pie chart)
    - average response latency per model (line chart)
    - number of requests per minute per model (bar chart)
  - user
    - requests per user (bar chart)
    - tokens per user (line chart)
    - user distribution by tokens (pie chart)
    - user distribution by requests (pie chart)
    - average response latency per user (line chart)
    - number of requests per minute per user (bar chart)
    - max tokens per request (line chart)
    - min tokens per request (line chart)
  - totals
    - requests (bar chart)
    - tokens (line chart)
    - user distribution (pie chart)
    - models
- FIX:
  - support image upload
  - add Cline support
- GPU API agent