
LLM Gateway Architecture

Unified multi-provider AI routing and management

Overview

The LLM Gateway provides a unified interface for routing requests to multiple AI providers (Anthropic, OpenAI, Google, Cohere, local models) with intelligent failover, cost optimization, and comprehensive observability.

Port: 4000
Endpoint: http://localhost:4000/api/v1
Technology: Node.js, Express, TypeScript

Architecture

graph TB
    subgraph "Client Layer"
        Drupal[Drupal Modules]
        CLI[BuildKit CLI]
        Agents[AI Agents]
    end

    subgraph "Gateway Core"
        API[REST API]
        Router[Provider Router]
        Cache[Response Cache]
        RateLimit[Rate Limiter]
    end

    subgraph "Provider Layer"
        Anthropic[Anthropic<br/>Claude 3.5]
        OpenAI[OpenAI<br/>GPT-4]
        Google[Google<br/>Gemini]
        Cohere[Cohere]
        Ollama[Ollama<br/>Local Models]
    end

    subgraph "Support Services"
        Redis[(Redis<br/>Cache)]
        Metrics[Prometheus<br/>Metrics]
        Tracer[Arize Phoenix<br/>Tracing]
    end

    Drupal --> API
    CLI --> API
    Agents --> API

    API --> Router
    Router --> Cache
    Cache --> RateLimit

    RateLimit --> Anthropic
    RateLimit --> OpenAI
    RateLimit --> Google
    RateLimit --> Cohere
    RateLimit --> Ollama

    Cache --> Redis
    Router --> Metrics
    Router --> Tracer

Core Features

Multi-Provider Support

Supported Providers

| Provider | Models | Use Cases |
|----------|--------|-----------|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Code generation, reasoning, analysis |
| OpenAI | GPT-4 Turbo, GPT-4, GPT-3.5 Turbo | Text generation, embeddings, function calling |
| Google | Gemini Pro, Gemini Ultra | Multimodal, long context |
| Cohere | Command, Command Light | Classification, embeddings |
| Ollama | Llama 3, Mistral, CodeLlama | Local development, privacy-sensitive workloads |

Provider Configuration

// config/providers.ts
export const providerConfig = {
  anthropic: {
    apiKey: process.env.ANTHROPIC_API_KEY,
    baseUrl: 'https://api.anthropic.com/v1',
    models: {
      'claude-3-5-sonnet-20241022': {
        maxTokens: 200000,
        costPerInputToken: 0.003,
        costPerOutputToken: 0.015
      },
      'claude-3-opus-20240229': {
        maxTokens: 200000,
        costPerInputToken: 0.015,
        costPerOutputToken: 0.075
      }
    }
  },
  openai: {
    apiKey: process.env.OPENAI_API_KEY,
    baseUrl: 'https://api.openai.com/v1',
    models: {
      'gpt-4-turbo': {
        maxTokens: 128000,
        costPerInputToken: 0.01,
        costPerOutputToken: 0.03
      }
    }
  },
  ollama: {
    baseUrl: 'http://localhost:11434',
    models: {
      'llama3': {
        maxTokens: 8192,
        costPerInputToken: 0,
        costPerOutputToken: 0
      }
    }
  }
}
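
The cost fields above are USD per 1,000 tokens (so 0.003 for Claude 3.5 Sonnet input corresponds to $3 per million tokens), which is consistent with the tracked cost example later on this page. A minimal sketch of deriving a per-request cost from this config; estimateCost is an illustrative helper, not part of the gateway API:

// Illustrative helper; assumes cost fields are USD per 1,000 tokens.
import { providerConfig } from './config/providers'

interface ModelPricing {
  costPerInputToken: number
  costPerOutputToken: number
}

function estimateCost(
  provider: keyof typeof providerConfig,
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing = (providerConfig[provider].models as Record<string, ModelPricing>)[model]
  return (
    (inputTokens / 1000) * pricing.costPerInputToken +
    (outputTokens / 1000) * pricing.costPerOutputToken
  )
}

// (1000 / 1000) * 0.003 + (500 / 1000) * 0.015 = 0.0105 USD
estimateCost('anthropic', 'claude-3-5-sonnet-20241022', 1000, 500)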

Intelligent Routing

Routing Strategies

Cost-Optimized

router.setStrategy('cost-optimized', {
  preferLocal: true,
  fallbackToCloud: true,
  maxCostPerRequest: 0.10
})

// Routes to the cheapest provider that meets requirements
// Llama 3 (local, $0) → Claude 3 Haiku (~$0.25/MTok input) → GPT-3.5 Turbo (~$0.50/MTok input)

Performance-Optimized

router.setStrategy('performance', {
  maxLatency: 500,  // milliseconds
  preferCached: true
})

// Routes to fastest provider
// Cached response → Ollama (local) → Cloud (closest region)

Quality-Optimized

router.setStrategy('quality', {
  preferredProviders: ['anthropic', 'openai'],
  preferredModels: ['claude-3-5-sonnet', 'gpt-4-turbo']
})

// Routes to the highest-quality models

Failover

router.setStrategy('failover', {
  primaryProvider: 'anthropic',
  fallbackProviders: ['openai', 'ollama'],
  retryAttempts: 3
})

// Routes: Anthropic → OpenAI (on failure) → Ollama (on failure)
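
A minimal sketch of the failover loop this strategy implies: try the primary provider, then each fallback in order, retrying up to retryAttempts times per provider. completeWith() is a hypothetical per-provider call, not part of the published gateway API:

// Illustrative failover loop; completeWith() is a hypothetical
// per-provider call, not part of the published gateway API.
interface CompletionRequest { prompt: string; model?: string }
interface CompletionResponse { text: string; provider: string }

declare function completeWith(
  provider: string,
  request: CompletionRequest
): Promise<CompletionResponse>

async function completeWithFailover(
  request: CompletionRequest,
  providers = ['anthropic', 'openai', 'ollama'],
  retryAttempts = 3
): Promise<CompletionResponse> {
  let lastError: unknown
  for (const provider of providers) {
    for (let attempt = 0; attempt < retryAttempts; attempt++) {
      try {
        return await completeWith(provider, request)
      } catch (error) {
        lastError = error  // retry this provider, then move to the next
      }
    }
  }
  throw lastError
}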

Smart Routing Example

import { LLMGateway } from '@bluefly/llm-gateway'

const gateway = new LLMGateway({
  routing: {
    strategy: 'adaptive',
    factors: {
      cost: 0.3,
      latency: 0.4,
      quality: 0.3
    }
  }
})

// Gateway automatically selects best provider
const response = await gateway.complete({
  prompt: 'Explain quantum computing',
  requirements: {
    maxCost: 0.05,
    maxLatency: 1000,
    minQuality: 0.8
  }
})

// Selected provider: Claude 3 Haiku (cost: $0.02, latency: 450ms, quality: 0.85)
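
A sketch of how the adaptive strategy's weighted factors could combine into a single score per candidate provider (higher is better). The normalization below is an assumption; the actual scoring function is internal to the gateway:

// Hypothetical adaptive scoring: each factor is normalized to [0, 1]
// (higher is better) and combined using the configured weights.
interface Candidate {
  provider: string
  estimatedCost: number   // USD for this request
  expectedLatency: number // milliseconds
  quality: number         // already in [0, 1]
}

const weights = { cost: 0.3, latency: 0.4, quality: 0.3 }

function score(c: Candidate, maxCost: number, maxLatency: number): number {
  const costScore = 1 - Math.min(c.estimatedCost / maxCost, 1)
  const latencyScore = 1 - Math.min(c.expectedLatency / maxLatency, 1)
  return weights.cost * costScore
       + weights.latency * latencyScore
       + weights.quality * c.quality
}

// Candidates violating hard requirements (maxCost, maxLatency,
// minQuality) would be filtered out before scoring.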

Response Caching

import { CacheManager } from '@bluefly/llm-gateway/cache'

const cache = new CacheManager({
  backend: 'redis',
  ttl: 3600,  // 1 hour
  strategy: 'semantic'
})

// Semantic cache: Similar prompts return cached responses
await cache.set({
  prompt: 'What is machine learning?',
  response: '...',
  embedding: embeddingVector  // precomputed with the configured embedding model
})

// Later request with similar prompt
const cached = await cache.get({
  prompt: 'Explain ML to me',
  similarityThreshold: 0.95
})

// Returns the cached response when semantic similarity ≥ 0.95

Cache Strategies

Exact Match

cache.setStrategy('exact', {
  keyFields: ['prompt', 'model', 'temperature']
})

Semantic Match

cache.setStrategy('semantic', {
  similarityThreshold: 0.90,
  embeddingModel: 'text-embedding-3-small'
})

Prefix Match

cache.setStrategy('prefix', {
  prefixLength: 100,  // characters
  ignoreCase: true
})
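
Under the semantic strategy, a lookup embeds the incoming prompt and compares it against stored embeddings; cosine similarity is the usual measure, though the gateway's internals are not specified here. A minimal sketch, with embed() as a hypothetical helper:

// Illustrative semantic lookup: cosine similarity between the new
// prompt's embedding and cached embeddings. embed() is hypothetical.
declare function embed(text: string): Promise<number[]> // e.g. text-embedding-3-small

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

interface CacheEntry { embedding: number[]; response: string }

async function semanticLookup(
  prompt: string,
  entries: CacheEntry[],
  threshold = 0.90
): Promise<string | null> {
  const queryEmbedding = await embed(prompt)
  let best: CacheEntry | null = null
  let bestScore = threshold
  for (const entry of entries) {
    const similarity = cosineSimilarity(queryEmbedding, entry.embedding)
    if (similarity >= bestScore) {
      best = entry
      bestScore = similarity
    }
  }
  return best ? best.response : null
}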

Rate Limiting

import { RateLimiter } from '@bluefly/llm-gateway/rate-limiter'

const limiter = new RateLimiter({
  strategy: 'token-bucket',
  limits: {
    anthropic: {
      requestsPerMinute: 50,
      tokensPerMinute: 100000
    },
    openai: {
      requestsPerMinute: 60,
      tokensPerMinute: 90000
    }
  }
})

// Check if request is allowed
const allowed = await limiter.check({
  provider: 'anthropic',
  model: 'claude-3-5-sonnet-20241022',
  estimatedTokens: 1500
})

if (!allowed) {
  // Fall back to a different provider
  return await gateway.complete({
    ...request,
    provider: 'openai'
  })
}
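
For reference, a compact token-bucket implementation of the kind the 'token-bucket' strategy names: the bucket refills at a fixed rate and a request is admitted only while enough tokens remain. A generic sketch, not the gateway's internal code:

// Generic token bucket: holds up to `capacity` tokens, refilled
// continuously at refillPerSecond. Not the gateway's internal code.
class TokenBucket {
  private tokens: number
  private lastRefill = Date.now()

  constructor(
    private capacity: number,
    private refillPerSecond: number
  ) {
    this.tokens = capacity
  }

  tryConsume(amount: number): boolean {
    const elapsedSeconds = (Date.now() - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefill = Date.now()
    if (this.tokens < amount) return false
    this.tokens -= amount
    return true
  }
}

// 100,000 tokens/minute for Anthropic → ~1,667 tokens/second refill
const anthropicTokens = new TokenBucket(100_000, 100_000 / 60)
anthropicTokens.tryConsume(1500) // true while budget remains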

Cost Tracking

import { CostTracker } from '@bluefly/llm-gateway/cost-tracker'

const costTracker = new CostTracker({
  storage: 'postgres',
  alertThresholds: {
    hourly: 10.00,
    daily: 100.00,
    monthly: 2000.00
  }
})

// Track request cost
await costTracker.track({
  provider: 'anthropic',
  model: 'claude-3-5-sonnet-20241022',
  inputTokens: 1000,
  outputTokens: 500,
  cost: 0.0105,
  userId: 'user-123',
  projectId: 'project-456'
})

// Get usage report
const report = await costTracker.report({
  period: 'monthly',
  groupBy: ['provider', 'model', 'userId']
})

// {
//   total: 1250.50,
//   breakdown: {
//     anthropic: 850.00,
//     openai: 400.50
//   },
//   topUsers: [
//     { userId: 'user-123', cost: 450.00 },
//     { userId: 'user-456', cost: 350.00 }
//   ]
// }

API Reference

Completion Endpoint

POST /api/v1/complete
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "prompt": "Explain quantum computing",
  "provider": "anthropic",
  "model": "claude-3-5-sonnet-20241022",
  "temperature": 0.7,
  "maxTokens": 1000,
  "stream": false
}

Response:

{
  "id": "req_abc123",
  "provider": "anthropic",
  "model": "claude-3-5-sonnet-20241022",
  "choices": [
    {
      "text": "Quantum computing is...",
      "finishReason": "stop"
    }
  ],
  "usage": {
    "inputTokens": 10,
    "outputTokens": 150,
    "totalTokens": 160
  },
  "cost": 0.00255,
  "latency": 450,
  "cached": false
}
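
The same call without an SDK, using the request and response shapes documented above (a plain fetch sketch; error handling omitted):

// Plain-fetch sketch of the documented /api/v1/complete call.
const res = await fetch('http://localhost:4000/api/v1/complete', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.LLM_GATEWAY_API_KEY}`
  },
  body: JSON.stringify({
    prompt: 'Explain quantum computing',
    provider: 'anthropic',
    model: 'claude-3-5-sonnet-20241022',
    temperature: 0.7,
    maxTokens: 1000,
    stream: false
  })
})

const completion = await res.json()
console.log(completion.choices[0].text, completion.cost)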

Streaming Endpoint

POST /api/v1/complete
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "prompt": "Write a Python function",
  "model": "claude-3-5-sonnet-20241022",
  "stream": true
}

Response (Server-Sent Events):

data: {"type":"start","id":"req_abc123"}

data: {"type":"content","delta":"def "}

data: {"type":"content","delta":"calculate"}

data: {"type":"content","delta":"_sum"}

data: {"type":"end","usage":{"inputTokens":10,"outputTokens":50}}

Chat Endpoint

POST /api/v1/chat
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "model": "claude-3-5-sonnet-20241022",
  "temperature": 0.7
}

Embeddings Endpoint

POST /api/v1/embeddings
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "input": ["text to embed"],
  "model": "text-embedding-3-small",
  "provider": "openai"
}

Response:

{
  "embeddings": [
    {
      "object": "embedding",
      "embedding": [0.123, -0.456, ...],
      "index": 0
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "inputTokens": 5,
    "totalTokens": 5
  }
}

Provider Health

GET /api/v1/providers/health

Response:

{
  "providers": {
    "anthropic": {
      "status": "healthy",
      "latency": 250,
      "successRate": 0.998
    },
    "openai": {
      "status": "healthy",
      "latency": 300,
      "successRate": 0.995
    },
    "ollama": {
      "status": "healthy",
      "latency": 50,
      "successRate": 1.0
    }
  }
}

Cost Report

GET /api/v1/cost/report?period=monthly&groupBy=provider
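
The response body is not documented here; one would expect it to mirror the CostTracker report shape shown earlier, e.g. (values illustrative):

{
  "total": 1250.50,
  "breakdown": {
    "anthropic": 850.00,
    "openai": 400.50
  }
}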

Client SDKs

TypeScript/JavaScript

import { LLMGatewayClient } from '@bluefly/llm-gateway-client'

const client = new LLMGatewayClient({
  apiKey: process.env.LLM_GATEWAY_API_KEY,
  baseUrl: 'http://localhost:4000/api/v1'
})

// Completion
const response = await client.complete({
  prompt: 'Explain TypeScript',
  model: 'claude-3-5-sonnet-20241022'
})

// Streaming
const stream = await client.completeStream({
  prompt: 'Write a function',
  model: 'claude-3-5-sonnet-20241022'
})

for await (const chunk of stream) {
  process.stdout.write(chunk.delta)
}

// Chat
const chatResponse = await client.chat({
  messages: [
    { role: 'user', content: 'Hello!' }
  ]
})

PHP (Drupal)

<?php

use Drupal\llm\Client\LLMGatewayClient;

$client = new LLMGatewayClient([
  'api_key' => \Drupal::config('llm.settings')->get('api_key'),
  'base_url' => 'http://llm-gateway:4000/api/v1',
]);

// Completion
$response = $client->complete([
  'prompt' => 'Explain Drupal',
  'model' => 'claude-3-5-sonnet-20241022',
]);

$text = $response['choices'][0]['text'];
$cost = $response['cost'];

Python

import os

from llm_gateway import LLMGatewayClient

client = LLMGatewayClient(
    api_key=os.getenv('LLM_GATEWAY_API_KEY'),
    base_url='http://localhost:4000/api/v1'
)

# Completion
response = client.complete(
    prompt='Explain machine learning',
    model='claude-3-5-sonnet-20241022'
)

# Streaming
for chunk in client.complete_stream(
    prompt='Write a Python function',
    model='claude-3-5-sonnet-20241022'
):
    print(chunk.delta, end='', flush=True)

Configuration

Environment Variables

# Server
LLM_GATEWAY_PORT=4000
LLM_GATEWAY_HOST=0.0.0.0

# Providers
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AIza...
COHERE_API_KEY=...

# Ollama (local)
OLLAMA_BASE_URL=http://localhost:11434

# Caching
REDIS_URL=redis://localhost:6379
CACHE_TTL=3600

# Rate limiting
RATE_LIMIT_STRATEGY=token-bucket
RATE_LIMIT_REQUESTS_PER_MINUTE=50

# Cost tracking
COST_ALERT_HOURLY=10
COST_ALERT_DAILY=100
COST_ALERT_MONTHLY=2000

# Observability
PHOENIX_ENDPOINT=http://localhost:6006
PROMETHEUS_PORT=9090

Gateway Configuration

# config/gateway.yaml
gateway:
  port: 4000
  cors:
    enabled: true
    origins: ['*']

routing:
  defaultStrategy: adaptive
  factors:
    cost: 0.3
    latency: 0.4
    quality: 0.3

providers:
  anthropic:
    enabled: true
    priority: 10
    timeout: 30000

  openai:
    enabled: true
    priority: 8
    timeout: 30000

  ollama:
    enabled: true
    priority: 5
    timeout: 10000

caching:
  enabled: true
  backend: redis
  strategy: semantic
  ttl: 3600

rateLimiting:
  enabled: true
  strategy: token-bucket
  global:
    requestsPerMinute: 100
    tokensPerMinute: 200000

costTracking:
  enabled: true
  storage: postgres
  alerts:
    enabled: true
    channels: ['email', 'slack']

Monitoring

Prometheus Metrics

# Request metrics
llm_gateway_requests_total{provider="anthropic",model="claude-3-5-sonnet",status="success"} 1250
llm_gateway_request_duration_seconds{provider="anthropic",model="claude-3-5-sonnet"} 0.45

# Token metrics
llm_gateway_tokens_total{provider="anthropic",model="claude-3-5-sonnet",type="input"} 125000
llm_gateway_tokens_total{provider="anthropic",model="claude-3-5-sonnet",type="output"} 75000

# Cost metrics
llm_gateway_cost_total{provider="anthropic",model="claude-3-5-sonnet"} 850.50

# Cache metrics
llm_gateway_cache_hits_total 450
llm_gateway_cache_misses_total 50
llm_gateway_cache_hit_rate 0.90

# Rate limit metrics
llm_gateway_rate_limit_exceeded_total{provider="anthropic"} 5
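
These metric names follow standard Prometheus conventions. A sketch of how such counters could be registered and updated with the prom-client library (illustrative; not necessarily the gateway's actual instrumentation):

// Illustrative prom-client setup for the metrics above.
import { Counter, Histogram } from 'prom-client'

const requestsTotal = new Counter({
  name: 'llm_gateway_requests_total',
  help: 'Total LLM requests by provider, model, and status',
  labelNames: ['provider', 'model', 'status']
})

const requestDuration = new Histogram({
  name: 'llm_gateway_request_duration_seconds',
  help: 'Request latency in seconds',
  labelNames: ['provider', 'model']
})

// After each completed request:
requestsTotal.inc({ provider: 'anthropic', model: 'claude-3-5-sonnet', status: 'success' })
requestDuration.observe({ provider: 'anthropic', model: 'claude-3-5-sonnet' }, 0.45)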

Grafana Dashboard

The pre-built dashboard includes:

- Request rate by provider
- Latency percentiles (p50, p95, p99)
- Cost breakdown by provider/model
- Cache hit rate
- Error rate by provider
- Token usage over time