LLM Gateway API (Agent Router)
Unified API gateway for AI service orchestration across multiple LLM providers.
Overview
Service: Agent Router (LLM Gateway)
Port: 3001
Domain: gateway.local.bluefly.io
Protocol: REST + Server-Sent Events
Version: 1.0.0
OpenAPI Spec: /technical-guide/openapi/agent-router/llm-gateway.openapi.yml
What It Does
Intelligent routing and load balancing for AI services:
- Multi-provider routing: OpenAI, Anthropic, Google, Cohere, Ollama
- Cost optimization: Automatic routing to the cheapest available model
- Load balancing: Distribute requests across providers
- Failover: Automatic fallback to backup providers
- Response caching: Intelligent caching for improved performance
- Rate limiting: Per-key rate limits with provider-specific quotas
Supported Providers
| Provider | Models | Cost | Availability |
|---|---|---|---|
| Ollama | 26 local models | $0/month | Local (port 11434) |
| OpenAI | GPT-4, GPT-3.5, DALL-E | $10-30/1M tokens | API |
| Anthropic | Claude 3, Claude 2 | $15/1M tokens | API |
| Google | Gemini Pro, PaLM | $7/1M tokens | API |
| Cohere | Command, Embed | $5/1M tokens | API |
| Custom | GovRFP, BuildKit models | $0/month | Local |
Core Endpoints
1. Chat Completions
POST /api/v1/chat/completions
Request:
{
"messages": [
{
"role": "system",
"content": "You are a Drupal development expert."
},
{
"role": "user",
"content": "Create an authentication module with OAuth2."
}
],
"model": "auto-route",
"max_tokens": 2000,
"temperature": 0.7,
"stream": false
}
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1705320000,
"model": "qwen2.5-coder:32b",
"provider": "ollama",
"usage": {
"prompt_tokens": 45,
"completion_tokens": 523,
"total_tokens": 568,
"cost": 0
},
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a complete Drupal OAuth2 authentication module:\n\n..."
},
"finish_reason": "stop"
}
]
}
2. Streaming Completions
POST /api/v1/chat/completions
Content-Type: application/json
{
"messages": [...],
"model": "auto-route",
"stream": true
}
Response (Server-Sent Events):
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Here's"}}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" a"}}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" complete"}}]}
data: [DONE]
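The stream can be consumed with any SSE-capable HTTP client. Below is a minimal consumer sketch in TypeScript (assumes Node 18+ fetch and the endpoint and JWT used elsewhere on this page; it is not part of an official SDK). It reads the body line by line and stops at the [DONE] sentinel.
// Minimal SSE consumer sketch for streaming completions.
async function streamCompletion(prompt: string): Promise<void> {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'auto-route',
      stream: true,
      messages: [{ role: 'user', content: prompt }],
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events arrive as newline-delimited "data: <json>" lines.
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';
    for (const line of lines) {
      const data = line.replace(/^data:\s*/, '').trim();
      if (!data) continue;
      if (data === '[DONE]') return;
      const chunk = JSON.parse(data);
      process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
    }
  }
}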
3. Model Selection
Auto-Routing (Recommended)
{
"model": "auto-route",
"messages": [...]
}
The gateway automatically selects the best model based on:
- Task type (code, chat, summarization)
- Cost optimization
- Provider availability
- Current load
Specific Models
{
"model": "qwen2.5-coder:32b", // Ollama local model
"messages": [...]
}
{
"model": "gpt-4", // OpenAI
"messages": [...]
}
{
"model": "claude-3-opus", // Anthropic
"messages": [...]
}
4. Health Check
GET /api/health
Response:
{
"status": "healthy",
"version": "1.0.0",
"uptime": 86400,
"providers": {
"ollama": {
"status": "healthy",
"models": 26,
"latency": 45,
"availability": 1.0
},
"openai": {
"status": "healthy",
"latency": 850,
"availability": 0.998
},
"anthropic": {
"status": "degraded",
"latency": 1200,
"availability": 0.95,
"error": "Rate limit approaching"
}
},
"cache": {
"hitRate": 0.45,
"size": "2.3GB",
"entries": 15234
}
}
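Callers can use this endpoint to decide whether to route around a degraded provider before sending work. The sketch below is a small client-side check, assuming the response shape shown above; it is not part of an official SDK.
// Sketch: poll the health endpoint and report any provider that is not healthy.
interface ProviderHealth {
  status: string;
  latency: number;
  availability: number;
  error?: string;
}

async function checkGatewayHealth(baseUrl = 'https://gateway.local.bluefly.io'): Promise<void> {
  const res = await fetch(`${baseUrl}/api/health`);
  const health = await res.json() as {
    status: string;
    providers: Record<string, ProviderHealth>;
  };

  console.log(`gateway status: ${health.status}`);
  for (const [name, provider] of Object.entries(health.providers)) {
    if (provider.status !== 'healthy') {
      // e.g. "anthropic is degraded (availability 0.95): Rate limit approaching"
      console.warn(`${name} is ${provider.status} (availability ${provider.availability}): ${provider.error ?? 'no detail'}`);
    }
  }
}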
5. Provider Status
GET /api/v1/providers
Response:
{
"providers": [
{
"name": "ollama",
"status": "healthy",
"endpoint": "http://localhost:11434",
"models": [
{
"name": "qwen2.5-coder:32b",
"size": "19GB",
"capabilities": ["code-generation", "completion"],
"costPerMToken": 0
}
],
"requestsPerMinute": 145,
"averageLatency": 2350
},
{
"name": "openai",
"status": "healthy",
"models": [
{
"name": "gpt-4",
"capabilities": ["chat", "code", "reasoning"],
"costPerMToken": 30
}
],
"requestsPerMinute": 23,
"averageLatency": 850,
"rateLimit": {
"remaining": 3500,
"reset": "2025-01-15T10:15:00Z"
}
}
]
}
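One practical use of this endpoint is selecting a model yourself instead of relying on auto-route. The sketch below uses the field names from the example response above; the selection logic and the Authorization header are assumptions, not the gateway's internal routing algorithm.
// Sketch: choose the cheapest healthy model that advertises a given capability.
interface GatewayModel { name: string; capabilities: string[]; costPerMToken: number; }
interface GatewayProvider { name: string; status: string; models: GatewayModel[]; }

async function cheapestModelFor(capability: string): Promise<string | undefined> {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/providers', {
    headers: { Authorization: `Bearer ${process.env.API_KEY}` },
  });
  const { providers } = await res.json() as { providers: GatewayProvider[] };

  const candidates = providers
    .filter((p) => p.status === 'healthy')
    .flatMap((p) => p.models)
    .filter((m) => m.capabilities.includes(capability))
    .sort((a, b) => a.costPerMToken - b.costPerMToken);

  return candidates[0]?.name; // e.g. "qwen2.5-coder:32b" for "code-generation"
}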
Routing Strategies
1. Cost-Optimized (Default)
Routes requests to the cheapest available model that meets the task's requirements:
graph LR
A[Request] --> B{Task Type}
B -->|Code| C[Ollama qwen2.5-coder]
B -->|Chat| D[Ollama llama3.1]
B -->|Complex| E[OpenAI GPT-4]
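In code terms, the default strategy amounts to a task-type lookup with cost as the tie-breaker. The sketch below only illustrates the diagram; it is not the gateway's actual routing code, and the model names are taken from the provider table above.
// Illustrative only: the decision shown in the diagram expressed as a lookup table.
type TaskType = 'code' | 'chat' | 'complex';

const COST_OPTIMIZED_ROUTE: Record<TaskType, { provider: string; model: string }> = {
  code:    { provider: 'ollama', model: 'qwen2.5-coder:32b' }, // free, local
  chat:    { provider: 'ollama', model: 'llama3.1' },          // free, local
  complex: { provider: 'openai', model: 'gpt-4' },             // paid, highest capability
};

function routeCostOptimized(task: TaskType) {
  return COST_OPTIMIZED_ROUTE[task];
}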
2. Performance-Optimized
Routes requests to the fastest available model:
{
"model": "auto-route",
"routing_strategy": "performance",
"messages": [...]
}
3. Quality-Optimized
Routes requests to the highest-quality model, regardless of cost:
{
"model": "auto-route",
"routing_strategy": "quality",
"messages": [...]
}
Caching
The gateway caches responses for identical requests:
Cache Headers:
X-Cache: HIT
X-Cache-Age: 3600
X-Cache-Key: sha256:abc123...
Cache Control:
{
"model": "auto-route",
"messages": [...],
"cache": {
"enabled": true,
"ttl": 3600,
"key": "custom-cache-key"
}
}
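Callers can confirm whether a response was served from cache by reading the headers above. A minimal sketch, assuming the cache request body shown here and the completions endpoint used elsewhere on this page:
// Sketch: send a cacheable request and inspect the cache headers on the response.
async function cachedCompletion(prompt: string) {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'auto-route',
      messages: [{ role: 'user', content: prompt }],
      cache: { enabled: true, ttl: 3600 },
    }),
  });

  // "HIT" means the body came from the gateway cache rather than a provider call.
  console.log('X-Cache:', res.headers.get('X-Cache'));
  console.log('X-Cache-Age:', res.headers.get('X-Cache-Age'));
  return res.json();
}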
Rate Limiting
Default Limits:
- 1000 requests/minute per API key
- Provider-specific limits enforced
Rate Limit Headers:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 750
X-RateLimit-Reset: 1705320060
Exceeded Response:
{
"error": {
"type": "rate_limit_exceeded",
"message": "Rate limit exceeded. Retry after 60 seconds.",
"retry_after": 60
}
}
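Clients should back off and retry when they receive this response. The sketch below waits for the retry_after value from the error body before retrying; it assumes the gateway returns HTTP 429 for this error, which is a common convention but not stated above.
// Sketch: retry a gateway call when rate limited, honoring the retry_after hint.
async function withRateLimitRetry(body: unknown, maxRetries = 3): Promise<unknown> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch('https://gateway.local.bluefly.io/api/v1/chat/completions', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    });

    if (res.status !== 429) return res.json(); // not rate limited: return the result (or a different error)

    const err = await res.json();
    const waitSeconds = err?.error?.retry_after ?? 60;
    await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
  }
  throw new Error('rate limit retries exhausted');
}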
Authentication
See JWT Authentication for complete details.
Bearer Token:
curl -H "Authorization: Bearer <jwt-token>" \
-H "Content-Type: application/json" \
-d '{"model":"auto-route","messages":[...]}' \
https://gateway.local.bluefly.io/api/v1/chat/completions
Error Handling
Standard Error Response:
{
"error": {
"type": "invalid_request_error",
"message": "Invalid model specified",
"param": "model",
"code": "model_not_found"
}
}
Common Error Codes:
- invalid_request_error - Malformed request
- authentication_error - Invalid API key
- rate_limit_exceeded - Rate limit hit
- model_not_found - Model unavailable
- provider_error - Upstream provider error
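Because every failure uses the same envelope, callers can branch on error.type. A minimal sketch of that pattern follows; the decision taken for each type is an example policy, not one prescribed by the gateway.
// Sketch: branch on the gateway's standard error envelope.
interface GatewayError { type: string; message: string; param?: string; code?: string; retry_after?: number; }

function handleGatewayError(err: GatewayError): 'retry' | 'reauth' | 'fail' {
  switch (err.type) {
    case 'rate_limit_exceeded':
      return 'retry';   // back off first (see Rate Limiting above)
    case 'provider_error':
      return 'retry';   // transient upstream failure; a retry may route elsewhere
    case 'authentication_error':
      return 'reauth';  // refresh the JWT before retrying
    case 'invalid_request_error':
    case 'model_not_found':
    default:
      return 'fail';    // caller-side problem or unknown error: surface it
  }
}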
Integration Examples
cURL
curl -X POST https://gateway.local.bluefly.io/api/v1/chat/completions \
-H "Authorization: Bearer <jwt-token>" \
-H "Content-Type: application/json" \
-d '{
"model": "auto-route",
"messages": [
{"role": "user", "content": "Hello, world!"}
]
}'
TypeScript
import { LLMGateway } from '@bluefly/agent-router';
const gateway = new LLMGateway({
apiKey: process.env.API_KEY,
endpoint: 'https://gateway.local.bluefly.io'
});
const response = await gateway.chat.completions.create({
model: 'auto-route',
messages: [
{ role: 'user', content: 'Create a TypeScript service' }
]
});
console.log(response.choices[0].message.content);
Python
import os
from bluefly import LLMGateway
client = LLMGateway(
api_key=os.getenv("API_KEY"),
base_url="https://gateway.local.bluefly.io"
)
response = client.chat.completions.create(
model="auto-route",
messages=[
{"role": "user", "content": "Create a Python FastAPI service"}
]
)
print(response.choices[0].message.content)
Metrics
GET /api/v1/metrics
Response (Prometheus format):
# HELP gateway_requests_total Total requests processed
# TYPE gateway_requests_total counter
gateway_requests_total{provider="ollama",model="qwen2.5-coder:32b"} 15234
# HELP gateway_latency_seconds Request latency
# TYPE gateway_latency_seconds histogram
gateway_latency_seconds_bucket{provider="ollama",le="1.0"} 14500
gateway_latency_seconds_bucket{provider="ollama",le="2.0"} 15100
# HELP gateway_cost_dollars Total API costs
# TYPE gateway_cost_dollars counter
gateway_cost_dollars{provider="openai"} 45.23
gateway_cost_dollars{provider="ollama"} 0.00
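Prometheus can scrape this endpoint directly; for ad-hoc checks, the text format is also simple to parse. The sketch below extracts the per-provider request counters from the sample output above; whether this endpoint requires authentication is not specified here, so the unauthenticated request is an assumption.
// Sketch: read total request counts per provider/model from the Prometheus text output.
async function requestTotals(): Promise<Record<string, number>> {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/metrics');
  const text = await res.text();

  const totals: Record<string, number> = {};
  for (const line of text.split('\n')) {
    // Lines look like: gateway_requests_total{provider="ollama",model="..."} 15234
    const match = line.match(/^gateway_requests_total\{(.+)\}\s+(\d+(?:\.\d+)?)$/);
    if (match) totals[match[1]] = Number(match[2]);
  }
  return totals;
}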