LLM Gateway API (Agent Router)
Unified API gateway for AI service orchestration across multiple LLM providers.
Overview
Service: Agent Router (LLM Gateway)
Port: 3001
Domain: gateway.local.bluefly.io
Protocol: REST + Server-Sent Events
Version: 1.0.0
OpenAPI Spec: /technical-guide/openapi/agent-router/llm-gateway.openapi.yml
What It Does
Intelligent routing and load balancing for AI services:
- Multi-provider routing: OpenAI, Anthropic, Google, Cohere, Ollama
- Cost optimization: Automatic routing to the cheapest available model
- Load balancing: Distribute requests across providers
- Failover: Automatic fallback to backup providers
- Response caching: Intelligent caching for improved performance
- Rate limiting: Per-key rate limits with provider-specific quotas
Supported Providers
| Provider | Models | Cost | Availability |
|---|---|---|---|
| Ollama | 26 local models | $0/month | Local (port 11434) |
| OpenAI | GPT-4, GPT-3.5, DALL-E | $10-30/1M tokens | API |
| Anthropic | Claude 3, Claude 2 | $15/1M tokens | API |
| Google | Gemini Pro, PaLM | $7/1M tokens | API |
| Cohere | Command, Embed | $5/1M tokens | API |
| Custom | GovRFP, BuildKit models | $0/month | Local |
Core Endpoints
1. Chat Completions
POST /api/v1/chat/completions
Request:
{
"messages": [
{
"role": "system",
"content": "You are a Drupal development expert."
},
{
"role": "user",
"content": "Create an authentication module with OAuth2."
}
],
"model": "auto-route",
"max_tokens": 2000,
"temperature": 0.7,
"stream": false
}
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1705320000,
"model": "qwen2.5-coder:32b",
"provider": "ollama",
"usage": {
"prompt_tokens": 45,
"completion_tokens": 523,
"total_tokens": 568,
"cost": 0
},
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a complete Drupal OAuth2 authentication module:\n\n..."
},
"finish_reason": "stop"
}
]
}
2. Streaming Completions
POST /api/v1/chat/completions
Content-Type: application/json
{
"messages": [...],
"model": "auto-route",
"stream": true
}
Response (Server-Sent Events):
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Here's"}}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" a"}}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" complete"}}]}
data: [DONE]
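The stream can be consumed with any SSE-capable HTTP client. Below is a minimal consumer sketch in TypeScript (assumes Node 18+ fetch and the endpoint and JWT used elsewhere on this page; it is not part of an official SDK). It reads the body line by line and stops at the [DONE] sentinel.
// Minimal SSE consumer sketch for streaming completions.
async function streamCompletion(prompt: string): Promise<void> {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'auto-route',
      stream: true,
      messages: [{ role: 'user', content: prompt }],
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events arrive as newline-delimited "data: <json>" lines.
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';
    for (const line of lines) {
      const data = line.replace(/^data:\s*/, '').trim();
      if (!data) continue;
      if (data === '[DONE]') return;
      const chunk = JSON.parse(data);
      process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
    }
  }
}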
3. Model Selection
Auto-Routing (Recommended)
{
"model": "auto-route",
"messages": [...]
}
The gateway automatically selects the best model based on:
- Task type (code, chat, summarization)
- Cost optimization
- Provider availability
- Current load
Specific Models
{
"model": "qwen2.5-coder:32b", // Ollama local model
"messages": [...]
}
{
"model": "gpt-4", // OpenAI
"messages": [...]
}
{
"model": "claude-3-opus", // Anthropic
"messages": [...]
}
4. Health Check
GET /api/health
Response:
{
"status": "healthy",
"version": "1.0.0",
"uptime": 86400,
"providers": {
"ollama": {
"status": "healthy",
"models": 26,
"latency": 45,
"availability": 1.0
},
"openai": {
"status": "healthy",
"latency": 850,
"availability": 0.998
},
"anthropic": {
"status": "degraded",
"latency": 1200,
"availability": 0.95,
"error": "Rate limit approaching"
}
},
"cache": {
"hitRate": 0.45,
"size": "2.3GB",
"entries": 15234
}
}
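Callers can use this endpoint to decide whether to route around a degraded provider before sending work. The sketch below is a small client-side check, assuming the response shape shown above; it is not part of an official SDK.
// Sketch: poll the health endpoint and report any provider that is not healthy.
interface ProviderHealth {
  status: string;
  latency: number;
  availability: number;
  error?: string;
}

async function checkGatewayHealth(baseUrl = 'https://gateway.local.bluefly.io'): Promise<void> {
  const res = await fetch(`${baseUrl}/api/health`);
  const health = await res.json() as {
    status: string;
    providers: Record<string, ProviderHealth>;
  };

  console.log(`gateway status: ${health.status}`);
  for (const [name, provider] of Object.entries(health.providers)) {
    if (provider.status !== 'healthy') {
      // e.g. "anthropic is degraded (availability 0.95): Rate limit approaching"
      console.warn(`${name} is ${provider.status} (availability ${provider.availability}): ${provider.error ?? 'no detail'}`);
    }
  }
}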
5. Provider Status
GET /api/v1/providers
Response:
{
"providers": [
{
"name": "ollama",
"status": "healthy",
"endpoint": "http://localhost:11434",
"models": [
{
"name": "qwen2.5-coder:32b",
"size": "19GB",
"capabilities": ["code-generation", "completion"],
"costPerMToken": 0
}
],
"requestsPerMinute": 145,
"averageLatency": 2350
},
{
"name": "openai",
"status": "healthy",
"models": [
{
"name": "gpt-4",
"capabilities": ["chat", "code", "reasoning"],
"costPerMToken": 30
}
],
"requestsPerMinute": 23,
"averageLatency": 850,
"rateLimit": {
"remaining": 3500,
"reset": "2025-01-15T10:15:00Z"
}
}
]
}
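One practical use of this endpoint is selecting a model yourself instead of relying on auto-route. The sketch below uses the field names from the example response above; the selection logic and the Authorization header are assumptions, not the gateway's internal routing algorithm.
// Sketch: choose the cheapest healthy model that advertises a given capability.
interface GatewayModel { name: string; capabilities: string[]; costPerMToken: number; }
interface GatewayProvider { name: string; status: string; models: GatewayModel[]; }

async function cheapestModelFor(capability: string): Promise<string | undefined> {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/providers', {
    headers: { Authorization: `Bearer ${process.env.API_KEY}` },
  });
  const { providers } = await res.json() as { providers: GatewayProvider[] };

  const candidates = providers
    .filter((p) => p.status === 'healthy')
    .flatMap((p) => p.models)
    .filter((m) => m.capabilities.includes(capability))
    .sort((a, b) => a.costPerMToken - b.costPerMToken);

  return candidates[0]?.name; // e.g. "qwen2.5-coder:32b" for "code-generation"
}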
Routing Strategies
1. Cost-Optimized (Default)
Routes requests to the cheapest available model that meets the task's requirements:
graph LR
A[Request] --> B{Task Type}
B -->|Code| C[Ollama qwen2.5-coder]
B -->|Chat| D[Ollama llama3.1]
B -->|Complex| E[OpenAI GPT-4]
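In code terms, the default strategy amounts to a task-type lookup with cost as the tie-breaker. The sketch below only illustrates the diagram; it is not the gateway's actual routing code, and the model names are taken from the provider table above.
// Illustrative only: the decision shown in the diagram expressed as a lookup table.
type TaskType = 'code' | 'chat' | 'complex';

const COST_OPTIMIZED_ROUTE: Record<TaskType, { provider: string; model: string }> = {
  code:    { provider: 'ollama', model: 'qwen2.5-coder:32b' }, // free, local
  chat:    { provider: 'ollama', model: 'llama3.1' },          // free, local
  complex: { provider: 'openai', model: 'gpt-4' },             // paid, highest capability
};

function routeCostOptimized(task: TaskType) {
  return COST_OPTIMIZED_ROUTE[task];
}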
2. Performance-Optimized
Routes requests to the fastest available model:
{
"model": "auto-route",
"routing_strategy": "performance",
"messages": [...]
}
3. Quality-Optimized
Routes requests to the highest-quality model, regardless of cost:
{
"model": "auto-route",
"routing_strategy": "quality",
"messages": [...]
}
Caching
The gateway caches responses for identical requests:
Cache Headers:
X-Cache: HIT
X-Cache-Age: 3600
X-Cache-Key: sha256:abc123...
Cache Control:
{
"model": "auto-route",
"messages": [...],
"cache": {
"enabled": true,
"ttl": 3600,
"key": "custom-cache-key"
}
}
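Callers can confirm whether a response was served from cache by reading the headers above. A minimal sketch, assuming the cache request body shown here and the completions endpoint used elsewhere on this page:
// Sketch: send a cacheable request and inspect the cache headers on the response.
async function cachedCompletion(prompt: string) {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'auto-route',
      messages: [{ role: 'user', content: prompt }],
      cache: { enabled: true, ttl: 3600 },
    }),
  });

  // "HIT" means the body came from the gateway cache rather than a provider call.
  console.log('X-Cache:', res.headers.get('X-Cache'));
  console.log('X-Cache-Age:', res.headers.get('X-Cache-Age'));
  return res.json();
}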
Rate Limiting
Default Limits:
- 1000 requests/minute per API key
- Provider-specific limits enforced
Rate Limit Headers:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 750
X-RateLimit-Reset: 1705320060
Exceeded Response:
{
"error": {
"type": "rate_limit_exceeded",
"message": "Rate limit exceeded. Retry after 60 seconds.",
"retry_after": 60
}
}
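Clients should back off and retry when they receive this response. The sketch below waits for the retry_after value from the error body before retrying; it assumes the gateway returns HTTP 429 for this error, which is a common convention but not stated above.
// Sketch: retry a gateway call when rate limited, honoring the retry_after hint.
async function withRateLimitRetry(body: unknown, maxRetries = 3): Promise<unknown> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch('https://gateway.local.bluefly.io/api/v1/chat/completions', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    });

    if (res.status !== 429) return res.json(); // not rate limited: return the result (or a different error)

    const err = await res.json();
    const waitSeconds = err?.error?.retry_after ?? 60;
    await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
  }
  throw new Error('rate limit retries exhausted');
}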
Authentication
See JWT Authentication for complete details.
Bearer Token:
curl -H "Authorization: Bearer <jwt-token>" \
-H "Content-Type: application/json" \
-d '{"model":"auto-route","messages":[...]}' \
https://gateway.local.bluefly.io/api/v1/chat/completions
Error Handling
Standard Error Response:
{
"error": {
"type": "invalid_request_error",
"message": "Invalid model specified",
"param": "model",
"code": "model_not_found"
}
}
Common Error Codes:
- invalid_request_error - Malformed request
- authentication_error - Invalid API key
- rate_limit_exceeded - Rate limit hit
- model_not_found - Model unavailable
- provider_error - Upstream provider error
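Because every failure uses the same envelope, callers can branch on error.type. A minimal sketch of that pattern follows; the decision taken for each type is an example policy, not one prescribed by the gateway.
// Sketch: branch on the gateway's standard error envelope.
interface GatewayError { type: string; message: string; param?: string; code?: string; retry_after?: number; }

function handleGatewayError(err: GatewayError): 'retry' | 'reauth' | 'fail' {
  switch (err.type) {
    case 'rate_limit_exceeded':
      return 'retry';   // back off first (see Rate Limiting above)
    case 'provider_error':
      return 'retry';   // transient upstream failure; a retry may route elsewhere
    case 'authentication_error':
      return 'reauth';  // refresh the JWT before retrying
    case 'invalid_request_error':
    case 'model_not_found':
    default:
      return 'fail';    // caller-side problem or unknown error: surface it
  }
}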
Integration Examples
cURL
curl -X POST https://gateway.local.bluefly.io/api/v1/chat/completions \
-H "Authorization: Bearer <jwt-token>" \
-H "Content-Type: application/json" \
-d '{
"model": "auto-route",
"messages": [
{"role": "user", "content": "Hello, world!"}
]
}'
TypeScript
import { LLMGateway } from '@bluefly/agent-router';
const gateway = new LLMGateway({
apiKey: process.env.API_KEY,
endpoint: 'https://gateway.local.bluefly.io'
});
const response = await gateway.chat.completions.create({
model: 'auto-route',
messages: [
{ role: 'user', content: 'Create a TypeScript service' }
]
});
console.log(response.choices[0].message.content);
Python
import os
from bluefly import LLMGateway
client = LLMGateway(
api_key=os.getenv("API_KEY"),
base_url="https://gateway.local.bluefly.io"
)
response = client.chat.completions.create(
model="auto-route",
messages=[
{"role": "user", "content": "Create a Python FastAPI service"}
]
)
print(response.choices[0].message.content)
Metrics
GET /api/v1/metrics
Response (Prometheus format):
# HELP gateway_requests_total Total requests processed
# TYPE gateway_requests_total counter
gateway_requests_total{provider="ollama",model="qwen2.5-coder:32b"} 15234
# HELP gateway_latency_seconds Request latency
# TYPE gateway_latency_seconds histogram
gateway_latency_seconds_bucket{provider="ollama",le="1.0"} 14500
gateway_latency_seconds_bucket{provider="ollama",le="2.0"} 15100
# HELP gateway_cost_dollars Total API costs
# TYPE gateway_cost_dollars counter
gateway_cost_dollars{provider="openai"} 45.23
gateway_cost_dollars{provider="ollama"} 0.00
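Prometheus can scrape this endpoint directly; for ad-hoc checks, the text format is also simple to parse. The sketch below extracts the per-provider request counters from the sample output above; whether this endpoint requires authentication is not specified here, so the unauthenticated request is an assumption.
// Sketch: read total request counts per provider/model from the Prometheus text output.
async function requestTotals(): Promise<Record<string, number>> {
  const res = await fetch('https://gateway.local.bluefly.io/api/v1/metrics');
  const text = await res.text();

  const totals: Record<string, number> = {};
  for (const line of text.split('\n')) {
    // Lines look like: gateway_requests_total{provider="ollama",model="..."} 15234
    const match = line.match(/^gateway_requests_total\{(.+)\}\s+(\d+(?:\.\d+)?)$/);
    if (match) totals[match[1]] = Number(match[2]);
  }
  return totals;
}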