LLM Gateway Architecture
Unified multi-provider AI routing and management
Overview
The LLM Gateway provides a unified interface for routing requests to multiple AI providers (Anthropic, OpenAI, Google, Cohere, local models) with intelligent failover, cost optimization, and comprehensive observability.
Port: 4000
Endpoint: http://localhost:4000/api/v1
Technology: Node.js, Express, TypeScript
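A minimal request against a locally running gateway can be made with plain fetch against the /api/v1/complete endpoint documented later on this page. This is a sketch only; it assumes the gateway is on port 4000 and that an API key is available in LLM_GATEWAY_API_KEY (as used by the client SDKs below).
// Minimal completion request against the gateway (sketch)
const res = await fetch('http://localhost:4000/api/v1/complete', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.LLM_GATEWAY_API_KEY}`
  },
  body: JSON.stringify({
    prompt: 'Say hello',
    model: 'claude-3-5-sonnet-20241022',
    maxTokens: 100
  })
})
const data = await res.json()
console.log(data.choices[0].text, data.cost)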
Architecture
graph TB
subgraph "Client Layer"
Drupal[Drupal Modules]
CLI[BuildKit CLI]
Agents[AI Agents]
end
subgraph "Gateway Core"
API[REST API]
Router[Provider Router]
Cache[Response Cache]
RateLimit[Rate Limiter]
end
subgraph "Provider Layer"
Anthropic[Anthropic<br/>Claude 3.5]
OpenAI[OpenAI<br/>GPT-4]
Google[Google<br/>Gemini]
Cohere[Cohere]
Ollama[Ollama<br/>Local Models]
end
subgraph "Support Services"
Redis[(Redis<br/>Cache)]
Metrics[Prometheus<br/>Metrics]
Tracer[Phoenix Arize<br/>Tracing]
end
Drupal --> API
CLI --> API
Agents --> API
API --> Router
Router --> Cache
Cache --> RateLimit
RateLimit --> Anthropic
RateLimit --> OpenAI
RateLimit --> Google
RateLimit --> Cohere
RateLimit --> Ollama
Cache --> Redis
Router --> Metrics
Router --> Tracer
Core Features
Multi-Provider Support
Supported Providers
| Provider | Models | Use Cases |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Code generation, reasoning, analysis |
| OpenAI | GPT-4 Turbo, GPT-4, GPT-3.5 Turbo | Text generation, embeddings, function calling |
| Google | Gemini Pro, Gemini Ultra | Multimodal, long context |
| Cohere | Command, Command Light | Classification, embeddings |
| Ollama | Llama 3, Mistral, CodeLlama | Local development, privacy-sensitive |
Provider Configuration
// config/providers.ts
export const providerConfig = {
anthropic: {
apiKey: process.env.ANTHROPIC_API_KEY,
baseUrl: 'https://api.anthropic.com/v1',
models: {
// Costs are USD per 1,000 tokens; maxTokens is the model's context window
'claude-3-5-sonnet-20241022': {
maxTokens: 200000,
costPerInputToken: 0.003,
costPerOutputToken: 0.015
},
'claude-3-opus-20240229': {
maxTokens: 200000,
costPerInputToken: 0.015,
costPerOutputToken: 0.075
}
}
},
openai: {
apiKey: process.env.OPENAI_API_KEY,
baseUrl: 'https://api.openai.com/v1',
models: {
'gpt-4-turbo': {
maxTokens: 128000,
costPerInputToken: 0.01,
costPerOutputToken: 0.03
}
}
},
ollama: {
baseUrl: 'http://localhost:11434',
models: {
'llama3': {
maxTokens: 8192,
costPerInputToken: 0,
costPerOutputToken: 0
}
}
}
}
Intelligent Routing
Routing Strategies
Cost-Optimized
router.setStrategy('cost-optimized', {
preferLocal: true,
fallbackToCloud: true,
maxCostPerRequest: 0.10
})
// Routes to the cheapest provider that meets the requirements, e.g.:
// Llama 3 (local, $0) → Claude 3 Haiku ($0.25/1M input tokens) → GPT-3.5 Turbo ($0.50/1M input tokens)
Performance-Optimized
router.setStrategy('performance', {
maxLatency: 500, // milliseconds
preferCached: true
})
// Routes to fastest provider
// Cached response → Ollama (local) → Cloud (closest region)
Quality-Optimized
router.setStrategy('quality', {
preferredProviders: ['anthropic', 'openai'],
preferredModels: ['claude-3-5-sonnet', 'gpt-4-turbo']
})
// Routes to highest quality models
Failover
router.setStrategy('failover', {
primaryProvider: 'anthropic',
fallbackProviders: ['openai', 'ollama'],
retryAttempts: 3
})
// Routes: Anthropic → OpenAI (if fail) → Ollama (if fail)
Smart Routing Example
import { LLMGateway } from '@bluefly/llm-gateway'
const gateway = new LLMGateway({
routing: {
strategy: 'adaptive',
factors: {
cost: 0.3,
latency: 0.4,
quality: 0.3
}
}
})
// Gateway automatically selects best provider
const response = await gateway.complete({
prompt: 'Explain quantum computing',
requirements: {
maxCost: 0.05,
maxLatency: 1000,
minQuality: 0.8
}
})
// Selected provider: Claude Haiku (cost: $0.02, latency: 450ms, quality: 0.85)
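The adaptive strategy can be understood as a weighted score over normalized cost, latency, and quality estimates for each candidate provider, using the factors configured above. The sketch below is illustrative only; the ProviderStats shape and normalization are assumptions, not the gateway's internal API.
// Illustrative scoring for the adaptive strategy (not the gateway's internal code)
interface ProviderStats {
  name: string
  estimatedCost: number      // USD for this request
  estimatedLatencyMs: number // expected round-trip time
  qualityScore: number       // 0..1, higher is better
}

const weights = { cost: 0.3, latency: 0.4, quality: 0.3 }

function score(p: ProviderStats, maxCost: number, maxLatencyMs: number): number {
  // Lower cost and latency are better, so invert them onto a 0..1 scale
  const costScore = 1 - Math.min(p.estimatedCost / maxCost, 1)
  const latencyScore = 1 - Math.min(p.estimatedLatencyMs / maxLatencyMs, 1)
  return weights.cost * costScore + weights.latency * latencyScore + weights.quality * p.qualityScore
}

function pickProvider(candidates: ProviderStats[], maxCost: number, maxLatencyMs: number): ProviderStats {
  return candidates.reduce((best, p) =>
    score(p, maxCost, maxLatencyMs) > score(best, maxCost, maxLatencyMs) ? p : best
  )
}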
Response Caching
import { CacheManager } from '@bluefly/llm-gateway/cache'
const cache = new CacheManager({
backend: 'redis',
ttl: 3600, // 1 hour
strategy: 'semantic'
})
// Semantic cache: Similar prompts return cached responses
await cache.set({
prompt: 'What is machine learning?',
response: '...',
embedding: embeddingVector
})
// Later request with similar prompt
const cached = await cache.get({
prompt: 'Explain ML to me',
similarityThreshold: 0.95
})
// Returns the cached response when semantic similarity ≥ 0.95
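Conceptually, a semantic lookup compares the embedding of the incoming prompt against stored embeddings and returns the entry whose similarity clears the threshold. A minimal cosine-similarity sketch follows; the helper names are illustrative, not CacheManager internals.
// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the cached response whose embedding is most similar to the query,
// provided it meets the similarity threshold
function semanticLookup(
  queryEmbedding: number[],
  entries: { embedding: number[]; response: string }[],
  threshold = 0.95
): string | null {
  let best: { similarity: number; response: string } | null = null
  for (const entry of entries) {
    const similarity = cosineSimilarity(queryEmbedding, entry.embedding)
    if (similarity >= threshold && (!best || similarity > best.similarity)) {
      best = { similarity, response: entry.response }
    }
  }
  return best ? best.response : null
}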
Cache Strategies
Exact Match
cache.setStrategy('exact', {
keyFields: ['prompt', 'model', 'temperature']
})
Semantic Match
cache.setStrategy('semantic', {
similarityThreshold: 0.90,
embeddingModel: 'text-embedding-3-small'
})
Prefix Match
cache.setStrategy('prefix', {
prefixLength: 100, // characters
ignoreCase: true
})
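For the exact strategy, the cache key is effectively a stable digest of the configured key fields, so two requests that differ in any of those fields miss the cache. A possible key derivation is sketched below; the field names come from the config above, while the SHA-256 hashing choice is an assumption.
import { createHash } from 'node:crypto'

// Build a deterministic cache key from the configured key fields.
// Field order is fixed so identical requests always hash the same way.
function exactCacheKey(req: { prompt: string; model: string; temperature: number }): string {
  const canonical = JSON.stringify({
    prompt: req.prompt,
    model: req.model,
    temperature: req.temperature
  })
  return createHash('sha256').update(canonical).digest('hex')
}

// Both keys below are identical, so the second request would hit the cache
const a = exactCacheKey({ prompt: 'What is ML?', model: 'gpt-4-turbo', temperature: 0.7 })
const b = exactCacheKey({ prompt: 'What is ML?', model: 'gpt-4-turbo', temperature: 0.7 })
console.log(a === b) // true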
Rate Limiting
import { RateLimiter } from '@bluefly/llm-gateway/rate-limiter'
const limiter = new RateLimiter({
strategy: 'token-bucket',
limits: {
anthropic: {
requestsPerMinute: 50,
tokensPerMinute: 100000
},
openai: {
requestsPerMinute: 60,
tokensPerMinute: 90000
}
}
})
// Check if request is allowed
const allowed = await limiter.check({
provider: 'anthropic',
model: 'claude-3-5-sonnet-20241022',
estimatedTokens: 1500
})
if (!allowed) {
// Fallback to different provider
return await gateway.complete({
...request,
provider: 'openai'
})
}
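The token-bucket strategy keeps a per-provider bucket that refills at a steady rate and is drained by each request's estimated token count. The self-contained sketch below illustrates the idea; it is not the RateLimiter implementation.
// Minimal token bucket: capacity refills continuously up to a maximum
class TokenBucket {
  private tokens: number
  private lastRefill: number

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  private refill(): void {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond)
    this.lastRefill = now
  }

  // Returns true and consumes the tokens if the request fits, false otherwise
  tryConsume(amount: number): boolean {
    this.refill()
    if (this.tokens >= amount) {
      this.tokens -= amount
      return true
    }
    return false
  }
}

// 100,000 tokens per minute for Anthropic, expressed as tokens per second
const anthropicBucket = new TokenBucket(100_000, 100_000 / 60)
if (!anthropicBucket.tryConsume(1500)) {
  // Over the limit: fall back to another provider, as shown above
}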
Cost Tracking
import { CostTracker } from '@bluefly/llm-gateway/cost-tracker'
const costTracker = new CostTracker({
storage: 'postgres',
alertThresholds: {
hourly: 10.00,
daily: 100.00,
monthly: 2000.00
}
})
// Track request cost
await costTracker.track({
provider: 'anthropic',
model: 'claude-3-5-sonnet-20241022',
inputTokens: 1000,
outputTokens: 500,
cost: 0.0105,
userId: 'user-123',
projectId: 'project-456'
})
// Get usage report
const report = await costTracker.report({
period: 'monthly',
groupBy: ['provider', 'model', 'userId']
})
// {
// total: 1250.50,
// breakdown: {
// anthropic: 850.00,
// openai: 400.50
// },
// topUsers: [
// { userId: 'user-123', cost: 450.00 },
// { userId: 'user-456', cost: 350.00 }
// ]
// }
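Request cost is derived from the per-1,000-token rates in the provider configuration. A small sketch of that arithmetic (the pricing table mirrors config/providers.ts above):
// USD per 1,000 tokens, mirroring config/providers.ts
const pricing: Record<string, { input: number; output: number }> = {
  'claude-3-5-sonnet-20241022': { input: 0.003, output: 0.015 },
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'llama3': { input: 0, output: 0 }
}

function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = pricing[model]
  if (!rates) throw new Error(`Unknown model: ${model}`)
  return (inputTokens / 1000) * rates.input + (outputTokens / 1000) * rates.output
}

// 1,000 input + 500 output tokens on Claude 3.5 Sonnet:
// (1000/1000) * 0.003 + (500/1000) * 0.015 = 0.0105 USD
console.log(requestCost('claude-3-5-sonnet-20241022', 1000, 500)) // 0.0105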
API Reference
Completion Endpoint
POST /api/v1/complete
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"prompt": "Explain quantum computing",
"provider": "anthropic",
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.7,
"maxTokens": 1000,
"stream": false
}
Response:
{
"id": "req_abc123",
"provider": "anthropic",
"model": "claude-3-5-sonnet-20241022",
"choices": [
{
"text": "Quantum computing is...",
"finishReason": "stop"
}
],
"usage": {
"inputTokens": 10,
"outputTokens": 150,
"totalTokens": 160
},
"cost": 0.00255,
"latency": 450,
"cached": false
}
Streaming Endpoint
POST /api/v1/complete
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"prompt": "Write a Python function",
"model": "claude-3-5-sonnet-20241022",
"stream": true
}
Response (Server-Sent Events):
data: {"type":"start","id":"req_abc123"}
data: {"type":"content","delta":"def "}
data: {"type":"content","delta":"calculate"}
data: {"type":"content","delta":"_sum"}
data: {"type":"end","usage":{"inputTokens":10,"outputTokens":50}}
Chat Endpoint
POST /api/v1/chat
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.7
}
Embeddings Endpoint
POST /api/v1/embeddings
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"input": ["text to embed"],
"model": "text-embedding-3-small",
"provider": "openai"
}
Response:
{
"embeddings": [
{
"object": "embedding",
"embedding": [0.123, -0.456, ...],
"index": 0
}
],
"model": "text-embedding-3-small",
"usage": {
"inputTokens": 5,
"totalTokens": 5
}
}
Provider Health
GET /api/v1/providers/health
Response:
{
"providers": {
"anthropic": {
"status": "healthy",
"latency": 250,
"successRate": 0.998
},
"openai": {
"status": "healthy",
"latency": 300,
"successRate": 0.995
},
"ollama": {
"status": "healthy",
"latency": 50,
"successRate": 1.0
}
}
}
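The health endpoint can be polled to steer traffic away from degraded providers before a request is made. A hedged sketch of such a check, with illustrative thresholds:
// Fetch provider health and return the names of providers considered usable
async function healthyProviders(minSuccessRate = 0.99, maxLatencyMs = 1000): Promise<string[]> {
  const res = await fetch('http://localhost:4000/api/v1/providers/health')
  const { providers } = await res.json() as {
    providers: Record<string, { status: string; latency: number; successRate: number }>
  }
  return Object.entries(providers)
    .filter(([, p]) => p.status === 'healthy' && p.successRate >= minSuccessRate && p.latency <= maxLatencyMs)
    .map(([name]) => name)
}

// e.g. ['anthropic', 'openai', 'ollama'] when all providers are within thresholds
console.log(await healthyProviders())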
Cost Report
GET /api/v1/cost/report?period=monthly&groupBy=provider
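A simple client-side call against this endpoint might look like the sketch below; the response presumably mirrors the CostTracker report structure shown earlier, and the API key requirement is an assumption.
// Fetch the monthly cost report grouped by provider
const report = await fetch(
  'http://localhost:4000/api/v1/cost/report?period=monthly&groupBy=provider',
  { headers: { Authorization: `Bearer ${process.env.LLM_GATEWAY_API_KEY}` } }
).then(r => r.json())

console.log(report.total, report.breakdown)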
Client SDKs
TypeScript/JavaScript
import { LLMGatewayClient } from '@bluefly/llm-gateway-client'
const client = new LLMGatewayClient({
apiKey: process.env.LLM_GATEWAY_API_KEY,
baseUrl: 'http://localhost:4000/api/v1'
})
// Completion
const response = await client.complete({
prompt: 'Explain TypeScript',
model: 'claude-3-5-sonnet-20241022'
})
// Streaming
const stream = await client.completeStream({
prompt: 'Write a function',
model: 'claude-3-5-sonnet-20241022'
})
for await (const chunk of stream) {
process.stdout.write(chunk.delta)
}
// Chat
const chatResponse = await client.chat({
messages: [
{ role: 'user', content: 'Hello!' }
]
})
PHP (Drupal)
<?php
use Drupal\llm\Client\LLMGatewayClient;
$client = new LLMGatewayClient([
'api_key' => \Drupal::config('llm.settings')->get('api_key'),
'base_url' => 'http://llm-gateway:4000/api/v1',
]);
// Completion
$response = $client->complete([
'prompt' => 'Explain Drupal',
'model' => 'claude-3-5-sonnet-20241022',
]);
$text = $response['choices'][0]['text'];
$cost = $response['cost'];
Python
import os

from llm_gateway import LLMGatewayClient
client = LLMGatewayClient(
api_key=os.getenv('LLM_GATEWAY_API_KEY'),
base_url='http://localhost:4000/api/v1'
)
# Completion
response = client.complete(
prompt='Explain machine learning',
model='claude-3-5-sonnet-20241022'
)
# Streaming
for chunk in client.complete_stream(
prompt='Write a Python function',
model='claude-3-5-sonnet-20241022'
):
print(chunk.delta, end='', flush=True)
Configuration
Environment Variables
# Server
LLM_GATEWAY_PORT=4000
LLM_GATEWAY_HOST=0.0.0.0
# Providers
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AIza...
COHERE_API_KEY=...
# Ollama (local)
OLLAMA_BASE_URL=http://localhost:11434
# Caching
REDIS_URL=redis://localhost:6379
CACHE_TTL=3600
# Rate limiting
RATE_LIMIT_STRATEGY=token-bucket
RATE_LIMIT_REQUESTS_PER_MINUTE=50
# Cost tracking
COST_ALERT_HOURLY=10
COST_ALERT_DAILY=100
COST_ALERT_MONTHLY=2000
# Observability
PHOENIX_ENDPOINT=http://localhost:6006
PROMETHEUS_PORT=9090
Gateway Configuration
# config/gateway.yaml
gateway:
  port: 4000
  cors:
    enabled: true
    origins: ['*']

routing:
  defaultStrategy: adaptive
  factors:
    cost: 0.3
    latency: 0.4
    quality: 0.3

providers:
  anthropic:
    enabled: true
    priority: 10
    timeout: 30000
  openai:
    enabled: true
    priority: 8
    timeout: 30000
  ollama:
    enabled: true
    priority: 5
    timeout: 10000

caching:
  enabled: true
  backend: redis
  strategy: semantic
  ttl: 3600

rateLimiting:
  enabled: true
  strategy: token-bucket
  global:
    requestsPerMinute: 100
    tokensPerMinute: 200000

costTracking:
  enabled: true
  storage: postgres
  alerts:
    enabled: true
    channels: ['email', 'slack']
Monitoring
Prometheus Metrics
# Request metrics
llm_gateway_requests_total{provider="anthropic",model="claude-3-5-sonnet",status="success"} 1250
llm_gateway_request_duration_seconds{provider="anthropic",model="claude-3-5-sonnet"} 0.45
# Token metrics
llm_gateway_tokens_total{provider="anthropic",model="claude-3-5-sonnet",type="input"} 125000
llm_gateway_tokens_total{provider="anthropic",model="claude-3-5-sonnet",type="output"} 75000
# Cost metrics
llm_gateway_cost_total{provider="anthropic",model="claude-3-5-sonnet"} 850.50
# Cache metrics
llm_gateway_cache_hits_total 450
llm_gateway_cache_misses_total 50
llm_gateway_cache_hit_rate 0.90
# Rate limit metrics
llm_gateway_rate_limit_exceeded_total{provider="anthropic"} 5
Grafana Dashboard
Pre-built dashboard includes:
- Request rate by provider
- Latency percentiles (p50, p95, p99)
- Cost breakdown by provider/model
- Cache hit rate
- Error rate by provider
- Token usage over time
Related Documentation
- System Overview
- Agent Tracer - Cost tracking and observability
- DDEV Development
- API Reference