Benchmarks
Benchmarks for LiteLLM Gateway (Proxy Server) tested against a fake OpenAI endpoint.
Use this config for testing:
model_list:
  - model_name: "fake-openai-endpoint"
    litellm_params:
      model: openai/any
      api_base: https://your-fake-openai-endpoint.com/chat/completions
      api_key: "test"
2 Instance LiteLLM Proxy
In these tests, the baseline latency characteristics are measured against the fake-openai-endpoint defined in the config above.
Performance Metrics
| Type | Name | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Current RPS | 
|---|---|---|---|---|---|---|
| POST | /chat/completions | 200 | 630 | 1200 | 262.46 | 1035.7 | 
| Custom | LiteLLM Overhead Duration (ms) | 12 | 29 | 43 | 14.74 | 1035.7 | 
|  | Aggregated | 100 | 430 | 930 | 138.6 | 2071.4 | 
4 Instances
| Type | Name | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Current RPS | 
|---|---|---|---|---|---|---|
| POST | /chat/completions | 100 | 150 | 240 | 111.73 | 1170 | 
| Custom | LiteLLM Overhead Duration (ms) | 2 | 8 | 13 | 3.32 | 1170 | 
|  | Aggregated | 77 | 130 | 180 | 57.53 | 2340 | 
Key Findings
- Doubling from 2 to 4 LiteLLM instances halves median latency: 200 ms → 100 ms.
- High-percentile latencies drop significantly: P95 630 ms → 150 ms, P99 1,200 ms → 240 ms.
- Setting the number of workers equal to the CPU count gives optimal performance.
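As a quick back-of-the-envelope check on these findings, the averages in the tables above put LiteLLM's own overhead at roughly 5-6% of total request latency with 2 instances and about 3% with 4 instances. The sketch below simply reruns that arithmetic with the numbers copied from the tables:

```python
# Back-of-the-envelope check: what share of /chat/completions latency is
# LiteLLM overhead? Numbers are the averages from the tables above.
results = {
    "2 instances": {"avg_total_ms": 262.46, "avg_overhead_ms": 14.74},
    "4 instances": {"avg_total_ms": 111.73, "avg_overhead_ms": 3.32},
}

for setup, r in results.items():
    share = r["avg_overhead_ms"] / r["avg_total_ms"] * 100
    print(f"{setup}: overhead is {share:.1f}% of average request latency")
# 2 instances: overhead is 5.6% of average request latency
# 4 instances: overhead is 3.0% of average request latency
```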
Machine Spec used for testing
Each machine deploying LiteLLM had the following specs:
- 4 CPU
- 8GB RAM
Locust Settings
- 1000 users
- 500 user ramp-up
How to measure LiteLLM Overhead
All responses from LiteLLM include the x-litellm-overhead-duration-ms header, which reports the latency overhead in milliseconds added by LiteLLM Proxy.
If you want to measure this in Locust, you can use the following code:
Locust Code for measuring LiteLLM Overhead
import os
import uuid
from locust import HttpUser, task, between, events
# Custom metric to track LiteLLM overhead duration
overhead_durations = []
@events.request.add_listener
def on_request(request_type, name, response_time, response_length, response=None,
               context=None, exception=None, start_time=None, url=None, **kwargs):
    # Defaults above let this listener also receive the custom events it fires
    # below, which carry no response object.
    if response and hasattr(response, 'headers'):
        overhead_duration = response.headers.get('x-litellm-overhead-duration-ms')
        if overhead_duration:
            try:
                duration_ms = float(overhead_duration)
                overhead_durations.append(duration_ms)
                # Report as a custom metric; pass explicit response/context/
                # exception values so other request listeners (including
                # Locust's own stats handler) get the arguments they expect.
                events.request.fire(
                    request_type="Custom",
                    name="LiteLLM Overhead Duration (ms)",
                    response_time=duration_ms,
                    response_length=0,
                    response=None,
                    context={},
                    exception=None,
                )
            except (ValueError, TypeError):
                pass
class MyUser(HttpUser):
    wait_time = between(0.5, 1)  # Random wait time between requests
    def on_start(self):
        self.api_key = os.getenv('API_KEY', 'sk-1234567890')
        self.client.headers.update({'Authorization': f'Bearer {self.api_key}'})
    @task
    def litellm_completion(self):
        # no cache hits with this
        payload = {
            "model": "db-openai-endpoint",
            "messages": [{"role": "user", "content": f"{uuid.uuid4()} This is a test there will be no cache hits and we'll fill up the context" * 150}],
            "user": "my-new-end-user-1"
        }
        response = self.client.post("/chat/completions", json=payload)
        
        if response.status_code != 200:
            # log the errors in error.txt
            with open("error.txt", "a") as error_log:
                error_log.write(response.text + "\n")
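The locustfile above is normally run with locust -f locustfile.py, entering the user count and ramp-up from the Locust Settings section in the web UI. If you prefer a headless, scripted run, the sketch below uses Locust's library mode. It assumes the code above is saved as locustfile.py, the proxy is reachable at http://localhost:4000, and that the 500 user ramp-up corresponds to a spawn rate of 500 users per second; all three are assumptions to adjust for your setup.

```python
# Headless run of the locustfile above via Locust's "use as a library" mode.
# Assumptions: the code above lives in locustfile.py, the proxy is at
# http://localhost:4000, and ramp-up means a spawn rate of 500 users/sec.
import gevent
from locust import events
from locust.env import Environment
from locust.stats import stats_printer, stats_history

from locustfile import MyUser  # hypothetical module name for the code above

# Share the global events object so the overhead listener above still fires
env = Environment(user_classes=[MyUser], host="http://localhost:4000", events=events)
runner = env.create_local_runner()

gevent.spawn(stats_printer(env.stats))   # periodically print aggregated stats
gevent.spawn(stats_history, env.runner)  # record percentile history over time
runner.start(1000, spawn_rate=500)       # 1000 users, ramped up at 500 users/sec
gevent.spawn_later(300, runner.quit)     # stop the test after 5 minutes
runner.greenlet.join()                   # block until the run finishes
```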
Logging Callbacks
GCS Bucket Logging
Using GCS Bucket logging has no impact on latency or RPS compared to the basic LiteLLM Proxy setup.
| Metric | Basic LiteLLM Proxy | LiteLLM Proxy with GCS Bucket Logging | 
|---|---|---|
| RPS | 1133.2 | 1137.3 | 
| Median Latency (ms) | 140 | 138 | 
LangSmith Logging
Using LangSmith logging has no impact on latency or RPS compared to the basic LiteLLM Proxy setup.
| Metric | Basic LiteLLM Proxy | LiteLLM Proxy with LangSmith | 
|---|---|---|
| RPS | 1133.2 | 1135 | 
| Median Latency (ms) | 140 | 132 |