LLM Gateway Architecture
Unified multi-provider AI routing and management
Overview
The LLM Gateway provides a unified interface for routing requests to multiple AI providers (Anthropic, OpenAI, Google, Cohere, local models) with intelligent failover, cost optimization, and comprehensive observability.
Port: 4000
Endpoint: http://localhost:4000/api/v1
Technology: Node.js, Express, TypeScript
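A minimal request against a locally running gateway can be made with plain fetch against the /api/v1/complete endpoint documented later on this page. This is a sketch only; it assumes the gateway is on port 4000 and that an API key is available in LLM_GATEWAY_API_KEY (as used by the client SDKs below).
// Minimal completion request against the gateway (sketch)
const res = await fetch('http://localhost:4000/api/v1/complete', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.LLM_GATEWAY_API_KEY}`
  },
  body: JSON.stringify({
    prompt: 'Say hello',
    model: 'claude-3-5-sonnet-20241022',
    maxTokens: 100
  })
})
const data = await res.json()
console.log(data.choices[0].text, data.cost)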
Architecture
graph TB
subgraph "Client Layer"
Drupal[Drupal Modules]
CLI[BuildKit CLI]
Agents[AI Agents]
end
subgraph "Gateway Core"
API[REST API]
Router[Provider Router]
Cache[Response Cache]
RateLimit[Rate Limiter]
end
subgraph "Provider Layer"
Anthropic[Anthropic<br/>Claude 3.5]
OpenAI[OpenAI<br/>GPT-4]
Google[Google<br/>Gemini]
Cohere[Cohere]
Ollama[Ollama<br/>Local Models]
end
subgraph "Support Services"
Redis[(Redis<br/>Cache)]
Metrics[Prometheus<br/>Metrics]
Tracer[Phoenix Arize<br/>Tracing]
end
Drupal --> API
CLI --> API
Agents --> API
API --> Router
Router --> Cache
Cache --> RateLimit
RateLimit --> Anthropic
RateLimit --> OpenAI
RateLimit --> Google
RateLimit --> Cohere
RateLimit --> Ollama
Cache --> Redis
Router --> Metrics
Router --> Tracer
Core Features
Multi-Provider Support
Supported Providers
| Provider | Models | Use Cases |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Code generation, reasoning, analysis |
| OpenAI | GPT-4 Turbo, GPT-4, GPT-3.5 Turbo | Text generation, embeddings, function calling |
| Google | Gemini Pro, Gemini Ultra | Multimodal, long context |
| Cohere | Command, Command Light | Classification, embeddings |
| Ollama | Llama 3, Mistral, CodeLlama | Local development, privacy-sensitive |
Provider Configuration
// config/providers.ts
export const providerConfig = {
anthropic: {
apiKey: process.env.ANTHROPIC_API_KEY,
baseUrl: 'https://api.anthropic.com/v1',
models: {
// Costs are USD per 1,000 tokens; maxTokens is the model's context window
'claude-3-5-sonnet-20241022': {
maxTokens: 200000,
costPerInputToken: 0.003,
costPerOutputToken: 0.015
},
'claude-3-opus-20240229': {
maxTokens: 200000,
costPerInputToken: 0.015,
costPerOutputToken: 0.075
}
}
},
openai: {
apiKey: process.env.OPENAI_API_KEY,
baseUrl: 'https://api.openai.com/v1',
models: {
'gpt-4-turbo': {
maxTokens: 128000,
costPerInputToken: 0.01,
costPerOutputToken: 0.03
}
}
},
ollama: {
baseUrl: 'http://localhost:11434',
models: {
'llama3': {
maxTokens: 8192,
costPerInputToken: 0,
costPerOutputToken: 0
}
}
}
}
Intelligent Routing
Routing Strategies
Cost-Optimized
router.setStrategy('cost-optimized', {
preferLocal: true,
fallbackToCloud: true,
maxCostPerRequest: 0.10
})
// Routes to the cheapest provider that meets the requirements, e.g.:
// Llama 3 (local, $0) → Claude 3 Haiku ($0.25/1M input tokens) → GPT-3.5 Turbo ($0.50/1M input tokens)
Performance-Optimized
router.setStrategy('performance', {
maxLatency: 500, // milliseconds
preferCached: true
})
// Routes to fastest provider
// Cached response → Ollama (local) → Cloud (closest region)
Quality-Optimized
router.setStrategy('quality', {
preferredProviders: ['anthropic', 'openai'],
preferredModels: ['claude-3-5-sonnet', 'gpt-4-turbo']
})
// Routes to highest quality models
Failover
router.setStrategy('failover', {
primaryProvider: 'anthropic',
fallbackProviders: ['openai', 'ollama'],
retryAttempts: 3
})
// Routes: Anthropic → OpenAI (if fail) → Ollama (if fail)
Smart Routing Example
import { LLMGateway } from '@bluefly/llm-gateway'
const gateway = new LLMGateway({
routing: {
strategy: 'adaptive',
factors: {
cost: 0.3,
latency: 0.4,
quality: 0.3
}
}
})
// Gateway automatically selects best provider
const response = await gateway.complete({
prompt: 'Explain quantum computing',
requirements: {
maxCost: 0.05,
maxLatency: 1000,
minQuality: 0.8
}
})
// Selected provider: Claude Haiku (cost: $0.02, latency: 450ms, quality: 0.85)
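The adaptive strategy can be understood as a weighted score over normalized cost, latency, and quality estimates for each candidate provider, using the factors configured above. The sketch below is illustrative only; the ProviderStats shape and normalization are assumptions, not the gateway's internal API.
// Illustrative scoring for the adaptive strategy (not the gateway's internal code)
interface ProviderStats {
  name: string
  estimatedCost: number      // USD for this request
  estimatedLatencyMs: number // expected round-trip time
  qualityScore: number       // 0..1, higher is better
}

const weights = { cost: 0.3, latency: 0.4, quality: 0.3 }

function score(p: ProviderStats, maxCost: number, maxLatencyMs: number): number {
  // Lower cost and latency are better, so invert them onto a 0..1 scale
  const costScore = 1 - Math.min(p.estimatedCost / maxCost, 1)
  const latencyScore = 1 - Math.min(p.estimatedLatencyMs / maxLatencyMs, 1)
  return weights.cost * costScore + weights.latency * latencyScore + weights.quality * p.qualityScore
}

function pickProvider(candidates: ProviderStats[], maxCost: number, maxLatencyMs: number): ProviderStats {
  return candidates.reduce((best, p) =>
    score(p, maxCost, maxLatencyMs) > score(best, maxCost, maxLatencyMs) ? p : best
  )
}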
Response Caching
import { CacheManager } from '@bluefly/llm-gateway/cache'
const cache = new CacheManager({
backend: 'redis',
ttl: 3600, // 1 hour
strategy: 'semantic'
})
// Semantic cache: Similar prompts return cached responses
await cache.set({
prompt: 'What is machine learning?',
response: '...',
embedding: embeddingVector
})
// Later request with similar prompt
const cached = await cache.get({
prompt: 'Explain ML to me',
similarityThreshold: 0.95
})
// Returns the cached response when semantic similarity ≥ 0.95
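Conceptually, a semantic lookup compares the embedding of the incoming prompt against stored embeddings and returns the entry whose similarity clears the threshold. A minimal cosine-similarity sketch follows; the helper names are illustrative, not CacheManager internals.
// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the cached response whose embedding is most similar to the query,
// provided it meets the similarity threshold
function semanticLookup(
  queryEmbedding: number[],
  entries: { embedding: number[]; response: string }[],
  threshold = 0.95
): string | null {
  let best: { similarity: number; response: string } | null = null
  for (const entry of entries) {
    const similarity = cosineSimilarity(queryEmbedding, entry.embedding)
    if (similarity >= threshold && (!best || similarity > best.similarity)) {
      best = { similarity, response: entry.response }
    }
  }
  return best ? best.response : null
}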
Cache Strategies
Exact Match
cache.setStrategy('exact', {
keyFields: ['prompt', 'model', 'temperature']
})
Semantic Match
cache.setStrategy('semantic', {
similarityThreshold: 0.90,
embeddingModel: 'text-embedding-3-small'
})
Prefix Match
cache.setStrategy('prefix', {
prefixLength: 100, // characters
ignoreCase: true
})
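For the exact strategy, the cache key is effectively a stable digest of the configured key fields, so two requests that differ in any of those fields miss the cache. A possible key derivation is sketched below; the field names come from the config above, while the SHA-256 hashing choice is an assumption.
import { createHash } from 'node:crypto'

// Build a deterministic cache key from the configured key fields.
// Field order is fixed so identical requests always hash the same way.
function exactCacheKey(req: { prompt: string; model: string; temperature: number }): string {
  const canonical = JSON.stringify({
    prompt: req.prompt,
    model: req.model,
    temperature: req.temperature
  })
  return createHash('sha256').update(canonical).digest('hex')
}

// Both keys below are identical, so the second request would hit the cache
const a = exactCacheKey({ prompt: 'What is ML?', model: 'gpt-4-turbo', temperature: 0.7 })
const b = exactCacheKey({ prompt: 'What is ML?', model: 'gpt-4-turbo', temperature: 0.7 })
console.log(a === b) // true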
Rate Limiting
import { RateLimiter } from '@bluefly/llm-gateway/rate-limiter'
const limiter = new RateLimiter({
strategy: 'token-bucket',
limits: {
anthropic: {
requestsPerMinute: 50,
tokensPerMinute: 100000
},
openai: {
requestsPerMinute: 60,
tokensPerMinute: 90000
}
}
})
// Check if request is allowed
const allowed = await limiter.check({
provider: 'anthropic',
model: 'claude-3-5-sonnet-20241022',
estimatedTokens: 1500
})
if (!allowed) {
// Fallback to different provider
return await gateway.complete({
...request,
provider: 'openai'
})
}
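The token-bucket strategy keeps a per-provider bucket that refills at a steady rate and is drained by each request's estimated token count. The self-contained sketch below illustrates the idea; it is not the RateLimiter implementation.
// Minimal token bucket: capacity refills continuously up to a maximum
class TokenBucket {
  private tokens: number
  private lastRefill: number

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  private refill(): void {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond)
    this.lastRefill = now
  }

  // Returns true and consumes the tokens if the request fits, false otherwise
  tryConsume(amount: number): boolean {
    this.refill()
    if (this.tokens >= amount) {
      this.tokens -= amount
      return true
    }
    return false
  }
}

// 100,000 tokens per minute for Anthropic, expressed as tokens per second
const anthropicBucket = new TokenBucket(100_000, 100_000 / 60)
if (!anthropicBucket.tryConsume(1500)) {
  // Over the limit: fall back to another provider, as shown above
}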
Cost Tracking
import { CostTracker } from '@bluefly/llm-gateway/cost-tracker'
const costTracker = new CostTracker({
storage: 'postgres',
alertThresholds: {
hourly: 10.00,
daily: 100.00,
monthly: 2000.00
}
})
// Track request cost
await costTracker.track({
provider: 'anthropic',
model: 'claude-3-5-sonnet-20241022',
inputTokens: 1000,
outputTokens: 500,
cost: 0.0105,
userId: 'user-123',
projectId: 'project-456'
})
// Get usage report
const report = await costTracker.report({
period: 'monthly',
groupBy: ['provider', 'model', 'userId']
})
// {
// total: 1250.50,
// breakdown: {
// anthropic: 850.00,
// openai: 400.50
// },
// topUsers: [
// { userId: 'user-123', cost: 450.00 },
// { userId: 'user-456', cost: 350.00 }
// ]
// }
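Request cost is derived from the per-1,000-token rates in the provider configuration. A small sketch of that arithmetic (the pricing table mirrors config/providers.ts above):
// USD per 1,000 tokens, mirroring config/providers.ts
const pricing: Record<string, { input: number; output: number }> = {
  'claude-3-5-sonnet-20241022': { input: 0.003, output: 0.015 },
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'llama3': { input: 0, output: 0 }
}

function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = pricing[model]
  if (!rates) throw new Error(`Unknown model: ${model}`)
  return (inputTokens / 1000) * rates.input + (outputTokens / 1000) * rates.output
}

// 1,000 input + 500 output tokens on Claude 3.5 Sonnet:
// (1000/1000) * 0.003 + (500/1000) * 0.015 = 0.0105 USD
console.log(requestCost('claude-3-5-sonnet-20241022', 1000, 500)) // 0.0105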
API Reference
Completion Endpoint
POST /api/v1/complete
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"prompt": "Explain quantum computing",
"provider": "anthropic",
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.7,
"maxTokens": 1000,
"stream": false
}
Response:
{
"id": "req_abc123",
"provider": "anthropic",
"model": "claude-3-5-sonnet-20241022",
"choices": [
{
"text": "Quantum computing is...",
"finishReason": "stop"
}
],
"usage": {
"inputTokens": 10,
"outputTokens": 150,
"totalTokens": 160
},
"cost": 0.00255,
"latency": 450,
"cached": false
}
Streaming Endpoint
POST /api/v1/complete
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"prompt": "Write a Python function",
"model": "claude-3-5-sonnet-20241022",
"stream": true
}
Response (Server-Sent Events):
data: {"type":"start","id":"req_abc123"}
data: {"type":"content","delta":"def "}
data: {"type":"content","delta":"calculate"}
data: {"type":"content","delta":"_sum"}
data: {"type":"end","usage":{"inputTokens":10,"outputTokens":50}}
Chat Endpoint
POST /api/v1/chat
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.7
}
Embeddings Endpoint
POST /api/v1/embeddings
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
"input": ["text to embed"],
"model": "text-embedding-3-small",
"provider": "openai"
}
Response:
{
"embeddings": [
{
"object": "embedding",
"embedding": [0.123, -0.456, ...],
"index": 0
}
],
"model": "text-embedding-3-small",
"usage": {
"inputTokens": 5,
"totalTokens": 5
}
}
Provider Health
GET /api/v1/providers/health
Response:
{
"providers": {
"anthropic": {
"status": "healthy",
"latency": 250,
"successRate": 0.998
},
"openai": {
"status": "healthy",
"latency": 300,
"successRate": 0.995
},
"ollama": {
"status": "healthy",
"latency": 50,
"successRate": 1.0
}
}
}
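The health endpoint can be polled to steer traffic away from degraded providers before a request is made. A hedged sketch of such a check, with illustrative thresholds:
// Fetch provider health and return the names of providers considered usable
async function healthyProviders(minSuccessRate = 0.99, maxLatencyMs = 1000): Promise<string[]> {
  const res = await fetch('http://localhost:4000/api/v1/providers/health')
  const { providers } = await res.json() as {
    providers: Record<string, { status: string; latency: number; successRate: number }>
  }
  return Object.entries(providers)
    .filter(([, p]) => p.status === 'healthy' && p.successRate >= minSuccessRate && p.latency <= maxLatencyMs)
    .map(([name]) => name)
}

// e.g. ['anthropic', 'openai', 'ollama'] when all providers are within thresholds
console.log(await healthyProviders())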
Cost Report
GET /api/v1/cost/report?period=monthly&groupBy=provider
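A simple client-side call against this endpoint might look like the sketch below; the response presumably mirrors the CostTracker report structure shown earlier, and the API key requirement is an assumption.
// Fetch the monthly cost report grouped by provider
const report = await fetch(
  'http://localhost:4000/api/v1/cost/report?period=monthly&groupBy=provider',
  { headers: { Authorization: `Bearer ${process.env.LLM_GATEWAY_API_KEY}` } }
).then(r => r.json())

console.log(report.total, report.breakdown)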
Client SDKs
TypeScript/JavaScript
import { LLMGatewayClient } from '@bluefly/llm-gateway-client'
const client = new LLMGatewayClient({
apiKey: process.env.LLM_GATEWAY_API_KEY,
baseUrl: 'http://localhost:4000/api/v1'
})
// Completion
const response = await client.complete({
prompt: 'Explain TypeScript',
model: 'claude-3-5-sonnet-20241022'
})
// Streaming
const stream = await client.completeStream({
prompt: 'Write a function',
model: 'claude-3-5-sonnet-20241022'
})
for await (const chunk of stream) {
process.stdout.write(chunk.delta)
}
// Chat
const chatResponse = await client.chat({
messages: [
{ role: 'user', content: 'Hello!' }
]
})
PHP (Drupal)
<?php
use Drupal\llm\Client\LLMGatewayClient;
$client = new LLMGatewayClient([
'api_key' => \Drupal::config('llm.settings')->get('api_key'),
'base_url' => 'http://llm-gateway:4000/api/v1',
]);
// Completion
$response = $client->complete([
'prompt' => 'Explain Drupal',
'model' => 'claude-3-5-sonnet-20241022',
]);
$text = $response['choices'][0]['text'];
$cost = $response['cost'];
Python
import os

from llm_gateway import LLMGatewayClient
client = LLMGatewayClient(
api_key=os.getenv('LLM_GATEWAY_API_KEY'),
base_url='http://localhost:4000/api/v1'
)
# Completion
response = client.complete(
prompt='Explain machine learning',
model='claude-3-5-sonnet-20241022'
)
# Streaming
for chunk in client.complete_stream(
prompt='Write a Python function',
model='claude-3-5-sonnet-20241022'
):
print(chunk.delta, end='', flush=True)
Configuration
Environment Variables
# Server
LLM_GATEWAY_PORT=4000
LLM_GATEWAY_HOST=0.0.0.0
# Providers
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AIza...
COHERE_API_KEY=...
# Ollama (local)
OLLAMA_BASE_URL=http://localhost:11434
# Caching
REDIS_URL=redis://localhost:6379
CACHE_TTL=3600
# Rate limiting
RATE_LIMIT_STRATEGY=token-bucket
RATE_LIMIT_REQUESTS_PER_MINUTE=50
# Cost tracking
COST_ALERT_HOURLY=10
COST_ALERT_DAILY=100
COST_ALERT_MONTHLY=2000
# Observability
PHOENIX_ENDPOINT=http://localhost:6006
PROMETHEUS_PORT=9090
Gateway Configuration
# config/gateway.yaml
gateway:
  port: 4000
  cors:
    enabled: true
    origins: ['*']

routing:
  defaultStrategy: adaptive
  factors:
    cost: 0.3
    latency: 0.4
    quality: 0.3

providers:
  anthropic:
    enabled: true
    priority: 10
    timeout: 30000
  openai:
    enabled: true
    priority: 8
    timeout: 30000
  ollama:
    enabled: true
    priority: 5
    timeout: 10000

caching:
  enabled: true
  backend: redis
  strategy: semantic
  ttl: 3600

rateLimiting:
  enabled: true
  strategy: token-bucket
  global:
    requestsPerMinute: 100
    tokensPerMinute: 200000

costTracking:
  enabled: true
  storage: postgres
  alerts:
    enabled: true
    channels: ['email', 'slack']
Monitoring
Prometheus Metrics
# Request metrics
llm_gateway_requests_total{provider="anthropic",model="claude-3-5-sonnet",status="success"} 1250
llm_gateway_request_duration_seconds{provider="anthropic",model="claude-3-5-sonnet"} 0.45
# Token metrics
llm_gateway_tokens_total{provider="anthropic",model="claude-3-5-sonnet",type="input"} 125000
llm_gateway_tokens_total{provider="anthropic",model="claude-3-5-sonnet",type="output"} 75000
# Cost metrics
llm_gateway_cost_total{provider="anthropic",model="claude-3-5-sonnet"} 850.50
# Cache metrics
llm_gateway_cache_hits_total 450
llm_gateway_cache_misses_total 50
llm_gateway_cache_hit_rate 0.90
# Rate limit metrics
llm_gateway_rate_limit_exceeded_total{provider="anthropic"} 5
Grafana Dashboard
Pre-built dashboard includes:
- Request rate by provider
- Latency percentiles (p50, p95, p99)
- Cost breakdown by provider/model
- Cache hit rate
- Error rate by provider
- Token usage over time
Related Documentation
- System Overview
- Agent Tracer - Cost tracking and observability
- DDEV Development
- API Reference