
Agent Tracer Architecture

AI Operations Intelligence - Unified observability for AI agents and LLM workflows

Overview

Agent Tracer provides end-to-end observability for AI agent workflows and LLM operations. It combines LLM-specific tracing (Phoenix Arize), distributed tracing (Jaeger/Tempo), correlation analysis (Neo4j), and performance benchmarking and learning analytics (ACE and ATLAS) in a single platform.

OSSA Compliance: 1.0 | License: MIT | Ports: 3007 (API), 3008 (ACE), 3009 (ATLAS)

Architecture

graph TB
    subgraph "Agent Layer"
        Workers[Worker Agents]
        Governors[Governor Agents]
        Critics[Critic Agents]
    end

    subgraph "Tracer Core"
        API[Agent Tracer API<br/>Port 3007]
        ACE[ACE Engine<br/>Port 3008]
        ATLAS[ATLAS Analytics<br/>Port 3009]
    end

    subgraph "Tracing Backends"
        Phoenix[Phoenix Arize<br/>LLM Tracing<br/>Port 6006]
        OTLP[OpenTelemetry<br/>OTLP Exporter<br/>Port 4317]
        Jaeger[Jaeger<br/>Distributed Tracing<br/>Port 16686]
    end

    subgraph "Correlation Layer"
        Neo4j[(Neo4j<br/>Knowledge Graph<br/>Port 7687)]
        Engine[Correlation Engine]
    end

    subgraph "Metrics & Analytics"
        Prometheus[Prometheus<br/>Metrics<br/>Port 9090]
        Grafana[Grafana<br/>Dashboards<br/>Port 3000]
    end

    subgraph "Storage"
        Qdrant[(Qdrant<br/>Trace Embeddings)]
        Loki[(Loki<br/>Log Aggregation)]
    end

    Workers --> API
    Governors --> API
    Critics --> API

    API --> ACE
    API --> ATLAS
    API --> Phoenix
    API --> OTLP

    OTLP --> Jaeger
    Phoenix --> Prometheus

    API --> Engine
    Engine --> Neo4j

    ACE --> Prometheus
    ATLAS --> Qdrant

    Prometheus --> Grafana
    API --> Loki

Core Components

ACE (AI Capabilities Engine)

Purpose: Performance scoring and capability benchmarking

import { ACE } from '@bluefly/agent-tracer/ace'

const ace = new ACE({
  scoring: {
    qualityWeight: 0.4,
    efficiencyWeight: 0.3,
    reliabilityWeight: 0.3
  }
})

// Score agent performance
const score = await ace.scoreAgent({
  agentId: 'tdd-enforcer-001',
  period: '24h',
  metrics: {
    tasksCompleted: 150,
    tasksSuccessful: 145,
    avgDuration: 2500,  // milliseconds
    coverageAchieved: 85,
    testsGenerated: 450
  }
})

// {
//   overall: 0.87,
//   quality: 0.90,    // Coverage, test quality
//   efficiency: 0.85,  // Speed, resource usage
//   reliability: 0.87  // Success rate, stability
// }
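
For reference, the overall score appears to be the weighted sum of the three dimensions using the weights passed to the ACE constructor. A minimal sketch of that calculation (the helper below is illustrative, not part of the package API):

// Illustrative only: recompute the overall score from the configured weights
const weights = { qualityWeight: 0.4, efficiencyWeight: 0.3, reliabilityWeight: 0.3 }

function overallScore(quality: number, efficiency: number, reliability: number): number {
  return (
    weights.qualityWeight * quality +
    weights.efficiencyWeight * efficiency +
    weights.reliabilityWeight * reliability
  )
}

// 0.4 * 0.90 + 0.3 * 0.85 + 0.3 * 0.87 = 0.876 ≈ 0.87
overallScore(0.90, 0.85, 0.87)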

Capability Benchmarking

// Benchmark agent against standard tasks
const benchmark = await ace.benchmark({
  agentId: 'tdd-enforcer-001',
  tasks: [
    { type: 'test-generation', count: 10 },
    { type: 'coverage-analysis', count: 5 },
    { type: 'mutation-testing', count: 3 }
  ],
  compareAgainst: ['tdd-enforcer-002', 'tdd-enforcer-003']
})

// {
//   agentId: 'tdd-enforcer-001',
//   scores: {
//     'test-generation': { score: 0.92, rank: 1 },
//     'coverage-analysis': { score: 0.88, rank: 2 },
//     'mutation-testing': { score: 0.95, rank: 1 }
//   },
//   overallRank: 1
// }
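
Ranks are computed per task by comparing the agent's score against the agents listed in compareAgainst. A rough sketch of that ranking step, assuming per-task scores have already been collected (the scores for the other agents below are hypothetical):

// Illustrative only: rank agents by score for a single benchmark task (1 = best)
function rankAgents(scores: Record<string, number>): Record<string, number> {
  const ordered = Object.entries(scores).sort((a, b) => b[1] - a[1])
  const ranks: Record<string, number> = {}
  ordered.forEach(([agentId], index) => {
    ranks[agentId] = index + 1
  })
  return ranks
}

// { 'tdd-enforcer-001': 1, 'tdd-enforcer-003': 2, 'tdd-enforcer-002': 3 }
rankAgents({
  'tdd-enforcer-001': 0.92,   // score from the benchmark above
  'tdd-enforcer-002': 0.81,   // hypothetical
  'tdd-enforcer-003': 0.86    // hypothetical
})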

ATLAS (Agent Tracing & Learning Analytics System)

Purpose: Learning analytics and workflow optimization

import { ATLAS } from '@bluefly/agent-tracer/atlas'

const atlas = new ATLAS({
  storage: 'qdrant',
  analytics: {
    learningCurve: true,
    workflowOptimization: true,
    resourceUtilization: true
  }
})

// Analyze agent learning progress
const learning = await atlas.analyzeLearning({
  agentId: 'api-builder-001',
  period: '30d'
})

// {
//   improvementRate: 0.15,  // 15% improvement
//   tasks: {
//     initial: { avgDuration: 5000, successRate: 0.75 },
//     current: { avgDuration: 4250, successRate: 0.90 }
//   },
//   insights: [
//     'Agent shows consistent improvement over time',
//     'Success rate increased by 20%',
//     'Duration decreased by 15%'
//   ]
// }
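
The reported improvement and insights follow directly from the initial and current task statistics. A worked check against the numbers above (illustrative only):

// Illustrative only: the relative changes behind the insights above
const initial = { avgDuration: 5000, successRate: 0.75 }
const current = { avgDuration: 4250, successRate: 0.90 }

// Duration decreased by 15%: (5000 - 4250) / 5000 = 0.15
const durationImprovement = (initial.avgDuration - current.avgDuration) / initial.avgDuration

// Success rate increased by 20% (relative): (0.90 - 0.75) / 0.75 = 0.20
const successRateImprovement = (current.successRate - initial.successRate) / initial.successRate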

Workflow Optimization

// Analyze and optimize workflow
const optimization = await atlas.optimizeWorkflow({
  workflowId: 'feature-development',
  period: '7d'
})

// {
//   bottlenecks: [
//     {
//       stage: 'test-generation',
//       avgDuration: 8000,
//       recommendation: 'Parallelize test generation across 3 agents'
//     }
//   ],
//   estimatedImprovement: 0.35,  // 35% faster
//   suggestedChanges: [
//     'Increase test-generator agents from 1 to 3',
//     'Enable caching for OpenAPI spec parsing'
//   ]
// }

Correlation Engine

Purpose: Correlate traces, metrics, logs, and events using Neo4j

import { CorrelationEngine } from '@bluefly/agent-tracer/correlation-engine'

const engine = new CorrelationEngine({
  neo4jUri: 'bolt://localhost:7687',
  neo4jUser: 'neo4j',
  neo4jPassword: process.env.NEO4J_PASSWORD
})

// Find correlations for a failed trace
const correlations = await engine.findCorrelations({
  traceId: 'trace-abc123',
  includeMetrics: true,
  includeLogs: true,
  includeEvents: true,
  timeWindow: '5m'
})

// {
//   trace: { id: 'trace-abc123', status: 'error' },
//   relatedTraces: [
//     { id: 'trace-def456', correlation: 0.95 }
//   ],
//   metrics: [
//     { name: 'cpu_usage', value: 95, correlation: 0.88 }
//   ],
//   logs: [
//     { level: 'error', message: 'Connection timeout', correlation: 1.0 }
//   ],
//   rootCause: {
//     type: 'resource_exhaustion',
//     confidence: 0.92
//   }
// }
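
Internally, the engine resolves these relationships with graph queries against Neo4j. A minimal sketch of such a query using the neo4j-driver package and an assumed schema (Trace, Log, and Metric nodes linked by a CORRELATED_WITH relationship carrying a score property); the actual labels and relationship types may differ:

import neo4j from 'neo4j-driver'

// Assumed schema: (:Trace)-[:CORRELATED_WITH { score }]->(:Trace|:Log|:Metric)
const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', process.env.NEO4J_PASSWORD ?? '')
)

const session = driver.session()
try {
  const result = await session.run(
    `MATCH (t:Trace { id: $traceId })-[r:CORRELATED_WITH]->(related)
     WHERE r.score >= $minScore
     RETURN related, r.score AS correlation
     ORDER BY correlation DESC`,
    { traceId: 'trace-abc123', minScore: 0.8 }
  )
  for (const record of result.records) {
    console.log(record.get('related').properties, record.get('correlation'))
  }
} finally {
  await session.close()
  await driver.close()
}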

Root Cause Analysis

// Perform root cause analysis
const rootCause = await engine.analyzeRootCause({
  incidentId: 'incident-789',
  timeWindow: '1h',
  depth: 3  // How many levels deep to search
})

// {
//   rootCause: {
//     component: 'postgresql',
//     issue: 'connection_pool_exhausted',
//     confidence: 0.95
//   },
//   causalChain: [
//     'High request volume → Slow queries → Connection pool exhaustion → Service timeout'
//   ],
//   affectedServices: [
//     'api-builder',
//     'doc-sync',
//     'tdd-enforcer'
//   ],
//   recommendations: [
//     'Increase PostgreSQL connection pool size to 50',
//     'Add index on users.created_at column',
//     'Enable query result caching'
//   ]
// }
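
The depth parameter bounds how far the causal chain is followed. A sketch of a depth-limited traversal, again assuming a hypothetical CAUSED_BY relationship between incident and component nodes:

import neo4j from 'neo4j-driver'

// Illustrative only: walk up to `depth` hops of an assumed CAUSED_BY relationship
const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', process.env.NEO4J_PASSWORD ?? '')
)
const session = driver.session()

const depth = 3
const result = await session.run(
  `MATCH path = (i:Incident { id: $incidentId })-[:CAUSED_BY*1..${depth}]->(cause)
   RETURN cause, length(path) AS hops
   ORDER BY hops DESC`,
  { incidentId: 'incident-789' }
)

await session.close()
await driver.close()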

Tracing Integration

Phoenix Arize Integration

import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'

const phoenix = new PhoenixTracer({
  endpoint: 'http://localhost:6006',
  project: 'llm-agents'
})

// Trace LLM call
const result = await phoenix.traceLLMCall({
  model: 'claude-3-5-sonnet-20241022',
  prompt: 'Generate unit tests for this function',
  provider: 'anthropic',
  metadata: {
    agentId: 'test-generator-001',
    taskId: 'task-123'
  }
})

// {
//   traceId: 'trace-abc123',
//   usage: {
//     inputTokens: 1500,
//     outputTokens: 2500,
//     totalTokens: 4000
//   },
//   cost: 0.0645,  // $0.0645
//   latency: 2450,  // milliseconds
//   quality: {
//     coherence: 0.92,
//     relevance: 0.95
//   }
// }

OpenTelemetry Export

import { OTLPExporter } from '@bluefly/agent-tracer/integrations/otlp'
import { SpanStatusCode } from '@opentelemetry/api'

const otlp = new OTLPExporter({
  endpoint: 'http://localhost:4317',
  serviceName: 'agent-mesh',
  attributes: {
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  }
})

// Start a span around task execution; wrapping it in a function keeps the
// return/throw handling valid and ensures the span is always ended
async function runTracedTask() {
  const span = otlp.startSpan('agent.execute_task', {
    attributes: {
      'agent.id': 'tdd-enforcer-001',
      'agent.type': 'governor',
      'task.id': 'task-123',
      'task.type': 'test-validation'
    }
  })

  try {
    // Execute the task
    const result = await executeTask()
    span.setStatus({ code: SpanStatusCode.OK })
    return result
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw error
  } finally {
    span.end()
  }
}

Jaeger Visualization

Access the Jaeger UI at http://localhost:16686 to:

  - View distributed traces
  - Analyze service dependencies
  - Identify performance bottlenecks
  - Track request flows

Metrics Collection

Prometheus Metrics

import { PrometheusExporter } from '@bluefly/agent-tracer/metrics'

const metrics = new PrometheusExporter({
  port: 9090,
  prefix: 'agent_tracer_'
})

// Register metrics
metrics.counter('llm_requests_total', {
  help: 'Total LLM requests',
  labelNames: ['provider', 'model', 'status']
})

metrics.histogram('llm_duration_seconds', {
  help: 'LLM request duration',
  labelNames: ['provider', 'model'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
})

metrics.gauge('llm_cost_dollars', {
  help: 'LLM cost in dollars',
  labelNames: ['provider', 'model']
})

// Increment counter
metrics.inc('llm_requests_total', {
  provider: 'anthropic',
  model: 'claude-3-5-sonnet',
  status: 'success'
})

// Observe histogram
metrics.observe('llm_duration_seconds', 2.45, {
  provider: 'anthropic',
  model: 'claude-3-5-sonnet'
})

// Set gauge
metrics.set('llm_cost_dollars', 125.50, {
  provider: 'anthropic',
  model: 'claude-3-5-sonnet'
})
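
A short sketch tying these calls together around a single LLM call, using the metrics exporter defined above (the callLLM function passed in is hypothetical):

// Hypothetical wrapper: record outcome and duration for one LLM call
async function tracedLLMCall<T>(callLLM: () => Promise<T>): Promise<T> {
  const labels = { provider: 'anthropic', model: 'claude-3-5-sonnet' }
  const startedAt = Date.now()
  try {
    const response = await callLLM()
    metrics.inc('llm_requests_total', { ...labels, status: 'success' })
    return response
  } catch (error) {
    metrics.inc('llm_requests_total', { ...labels, status: 'error' })
    throw error
  } finally {
    // Histogram expects seconds, per the buckets registered above
    metrics.observe('llm_duration_seconds', (Date.now() - startedAt) / 1000, labels)
  }
}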

Example Metrics

# LLM request metrics
agent_tracer_llm_requests_total{provider="anthropic",model="claude-3-5-sonnet",status="success"} 1250
agent_tracer_llm_duration_seconds_sum{provider="anthropic",model="claude-3-5-sonnet"} 3062.5
agent_tracer_llm_duration_seconds_count{provider="anthropic",model="claude-3-5-sonnet"} 1250
agent_tracer_llm_cost_dollars{provider="anthropic",model="claude-3-5-sonnet"} 125.50

# Agent metrics
agent_tracer_agent_tasks_total{agent_type="governor",status="success"} 450
agent_tracer_agent_duration_seconds{agent_type="worker",quantile="0.95"} 1.2

# ACE metrics
agent_tracer_ace_score{agent_id="tdd-enforcer-001",dimension="quality"} 0.90
agent_tracer_ace_score{agent_id="tdd-enforcer-001",dimension="efficiency"} 0.85

API Reference

Tracing Endpoints

POST /api/v1/traces
Content-Type: application/json

{
  "traceId": "trace-abc123",
  "agentId": "tdd-enforcer-001",
  "operation": "test-validation",
  "duration": 2500,
  "status": "success",
  "metadata": {
    "coverage": 85,
    "tests_run": 150
  }
}

GET /api/v1/traces/:traceId
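
For example, a trace can be submitted directly over HTTP (a minimal sketch against the endpoint and payload shown above, assuming a local deployment on port 3007):

// Submit a trace to the Agent Tracer API (local deployment assumed)
const response = await fetch('http://localhost:3007/api/v1/traces', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    traceId: 'trace-abc123',
    agentId: 'tdd-enforcer-001',
    operation: 'test-validation',
    duration: 2500,
    status: 'success',
    metadata: { coverage: 85, tests_run: 150 }
  })
})

if (!response.ok) {
  throw new Error(`Trace submission failed: ${response.status}`)
}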

ACE Endpoints

POST /api/v1/ace/score
Content-Type: application/json

{
  "agentId": "tdd-enforcer-001",
  "period": "24h"
}

GET /api/v1/ace/benchmarks

ATLAS Endpoints

GET /api/v1/atlas/analytics/:agentId?period=30d

POST /api/v1/atlas/optimize
Content-Type: application/json

{
  "workflowId": "feature-development",
  "period": "7d"
}

Metrics Endpoint

GET /metrics
Accept: text/plain

CLI Commands

# Start Agent Tracer
agent-tracer start

# View traces
agent-tracer traces list --limit 10
agent-tracer traces get --trace-id abc123

# ACE operations
agent-tracer ace start
agent-tracer ace score --agent-id agent-123
agent-tracer ace benchmark --agents agent-1,agent-2,agent-3

# ATLAS operations
agent-tracer atlas start
agent-tracer atlas analyze --agent-id agent-123
agent-tracer atlas optimize --workflow-id workflow-456

# Correlation analysis
agent-tracer correlate --trace-id abc123
agent-tracer rca --incident-id incident-789

Configuration

# config/tracer.yaml
tracer:
  port: 3007
  ace_port: 3008
  atlas_port: 3009

exporters:
  phoenix:
    enabled: true
    endpoint: http://localhost:6006
    project: llm-agents

  jaeger:
    enabled: true
    endpoint: http://localhost:14268/api/traces

  tempo:
    enabled: true
    endpoint: http://localhost:4317

metrics:
  prometheus:
    enabled: true
    port: 9090

correlation:
  neo4j:
    uri: bolt://localhost:7687
    user: neo4j
    password: ${NEO4J_PASSWORD}

analytics:
  qdrant:
    url: http://localhost:6333
    collection: trace-embeddings
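
A sketch of how this file might be loaded with environment-variable substitution for values such as ${NEO4J_PASSWORD} (assuming the js-yaml package; the loader actually used by Agent Tracer may differ):

import { readFileSync } from 'node:fs'
import yaml from 'js-yaml'

// Replace ${VAR} placeholders with values from the environment before parsing
const raw = readFileSync('config/tracer.yaml', 'utf8')
const substituted = raw.replace(/\$\{(\w+)\}/g, (_, name) => process.env[name] ?? '')
const config = yaml.load(substituted) as Record<string, unknown>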

Grafana Dashboards

Pre-built dashboards include:

  1. Agent Overview: High-level agent metrics
  2. LLM Performance: Model usage, costs, latency
  3. Trace Analysis: Distributed trace visualization
  4. ACE Scores: Agent capability scores
  5. ATLAS Analytics: Learning curves, optimization
  6. Infrastructure: System health, resources

Integration Examples

With Agent Mesh

// Agent Mesh sends all traces to Agent Tracer
import { AgentMesh } from '@bluefly/agent-mesh'
import { AgentTracer } from '@bluefly/agent-tracer'

const tracer = new AgentTracer({
  endpoint: 'http://localhost:3007'
})

const mesh = new AgentMesh({
  tracer: tracer
})

// All agent communication is automatically traced

With BuildKit

// BuildKit agents report to Agent Tracer
import { BuildKit } from '@bluefly/agent-buildkit'

const buildkit = new BuildKit({
  observability: {
    tracer: 'http://localhost:3007',
    phoenix: 'http://localhost:6006'
  }
})