Agent Tracer Architecture
AI Operations Intelligence - Unified observability for AI agents and LLM workflows
Overview
Agent Tracer provides observability for AI agent workflows and LLM operations. It combines AI-specific tracing (Phoenix Arize), distributed tracing (Jaeger/Tempo), correlation analysis (Neo4j), and performance benchmarking in a single platform.
OSSA Compliance: 1.0
License: MIT
Ports: 3007 (API), 3008 (ACE), 3009 (ATLAS)
Architecture
graph TB
    subgraph "Agent Layer"
        Workers[Worker Agents]
        Governors[Governor Agents]
        Critics[Critic Agents]
    end

    subgraph "Tracer Core"
        API[Agent Tracer API<br/>Port 3007]
        ACE[ACE Engine<br/>Port 3008]
        ATLAS[ATLAS Analytics<br/>Port 3009]
    end

    subgraph "Tracing Backends"
        Phoenix[Phoenix Arize<br/>LLM Tracing<br/>Port 6006]
        OTLP[OpenTelemetry<br/>OTLP Exporter<br/>Port 4317]
        Jaeger[Jaeger<br/>Distributed Tracing<br/>Port 16686]
    end

    subgraph "Correlation Layer"
        Neo4j[(Neo4j<br/>Knowledge Graph<br/>Port 7687)]
        Engine[Correlation Engine]
    end

    subgraph "Metrics & Analytics"
        Prometheus[Prometheus<br/>Metrics<br/>Port 9090]
        Grafana[Grafana<br/>Dashboards<br/>Port 3000]
    end

    subgraph "Storage"
        Qdrant[(Qdrant<br/>Trace Embeddings)]
        Loki[(Loki<br/>Log Aggregation)]
    end

    Workers --> API
    Governors --> API
    Critics --> API
    API --> ACE
    API --> ATLAS
    API --> Phoenix
    API --> OTLP
    OTLP --> Jaeger
    Phoenix --> Prometheus
    API --> Engine
    Engine --> Neo4j
    ACE --> Prometheus
    ATLAS --> Qdrant
    Prometheus --> Grafana
    API --> Loki
Core Components
ACE (AI Capabilities Engine)
Purpose: Performance scoring and capability benchmarking
import { ACE } from '@bluefly/agent-tracer/ace'
const ace = new ACE({
scoring: {
qualityWeight: 0.4,
efficiencyWeight: 0.3,
reliabilityWeight: 0.3
}
})
// Score agent performance
const score = await ace.scoreAgent({
agentId: 'tdd-enforcer-001',
period: '24h',
metrics: {
tasksCompleted: 150,
tasksSuccessful: 145,
avgDuration: 2500, // milliseconds
coverageAchieved: 85,
testsGenerated: 450
}
})
// {
// overall: 0.87,
// quality: 0.90, // Coverage, test quality
// efficiency: 0.85, // Speed, resource usage
// reliability: 0.87 // Success rate, stability
// }
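The example above does not say how `overall` is derived from the three dimension scores. One plausible reading is a weighted average using the configured weights. The sketch below assumes exactly that; `weightedOverall` and both interfaces are hypothetical names for illustration, not part of the ACE API.

```typescript
// Hypothetical helper: combines dimension scores using the configured weights.
// Assumes `overall` is a plain weighted average; ACE may compute it differently.
interface DimensionScores {
  quality: number
  efficiency: number
  reliability: number
}

interface ScoringWeights {
  qualityWeight: number
  efficiencyWeight: number
  reliabilityWeight: number
}

function weightedOverall(scores: DimensionScores, weights: ScoringWeights): number {
  const totalWeight =
    weights.qualityWeight + weights.efficiencyWeight + weights.reliabilityWeight
  if (totalWeight <= 0) {
    throw new Error('weights must sum to a positive number')
  }
  const weightedSum =
    scores.quality * weights.qualityWeight +
    scores.efficiency * weights.efficiencyWeight +
    scores.reliability * weights.reliabilityWeight
  // Normalize so the weights do not have to sum to exactly 1.
  return weightedSum / totalWeight
}

const overallScore = weightedOverall(
  { quality: 0.9, efficiency: 0.85, reliability: 0.87 },
  { qualityWeight: 0.4, efficiencyWeight: 0.3, reliabilityWeight: 0.3 }
)
console.log(overallScore.toFixed(3))
```

With the weights and dimension scores from the example, this comes out at roughly 0.876, which is close to the `overall` value shown above.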
Capability Benchmarking
// Benchmark agent against standard tasks
const benchmark = await ace.benchmark({
agentId: 'tdd-enforcer-001',
tasks: [
{ type: 'test-generation', count: 10 },
{ type: 'coverage-analysis', count: 5 },
{ type: 'mutation-testing', count: 3 }
],
compareAgainst: ['tdd-enforcer-002', 'tdd-enforcer-003']
})
// {
// agentId: 'tdd-enforcer-001',
// scores: {
// 'test-generation': { score: 0.92, rank: 1 },
// 'coverage-analysis': { score: 0.88, rank: 2 },
// 'mutation-testing': { score: 0.95, rank: 1 }
// },
// overallRank: 1
// }
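The `rank` fields in the benchmark output imply an ordering of agents by score within each task type. A minimal sketch of such a ranking follows. The tie handling (tied agents share a rank) is an assumption, and `rankAgents` is a hypothetical helper rather than an ACE function.

```typescript
// Hypothetical sketch: assign ranks within one task type, highest score first.
// Ties share a rank (competition ranking); ACE's actual tie rule is not documented.
interface BenchmarkScore {
  agentId: string
  score: number
}

function rankAgents(entries: BenchmarkScore[]): Map<string, number> {
  const sorted = [...entries].sort((a, b) => b.score - a.score)
  const ranks = new Map<string, number>()
  sorted.forEach((entry, index) => {
    const previous = index > 0 ? sorted[index - 1] : undefined
    if (previous !== undefined && previous.score === entry.score) {
      // Tie: reuse the rank already given to the previous agent.
      ranks.set(entry.agentId, ranks.get(previous.agentId) ?? index + 1)
    } else {
      ranks.set(entry.agentId, index + 1)
    }
  })
  return ranks
}

// Illustrative scores; only the first entry is taken from the example output.
const coverageRanks = rankAgents([
  { agentId: 'tdd-enforcer-001', score: 0.88 },
  { agentId: 'tdd-enforcer-002', score: 0.91 },
  { agentId: 'tdd-enforcer-003', score: 0.8 }
])
console.log(coverageRanks.get('tdd-enforcer-001'))
```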
ATLAS (Agent Tracing & Learning Analytics System)
Purpose: Learning analytics and workflow optimization
import { ATLAS } from '@bluefly/agent-tracer/atlas'
const atlas = new ATLAS({
storage: 'qdrant',
analytics: {
learningCurve: true,
workflowOptimization: true,
resourceUtilization: true
}
})
// Analyze agent learning progress
const learning = await atlas.analyzeLearning({
agentId: 'api-builder-001',
period: '30d'
})
// {
// improvementRate: 0.15, // 15% improvement
// tasks: {
// initial: { avgDuration: 5000, successRate: 0.75 },
// current: { avgDuration: 4250, successRate: 0.90 }
// },
// insights: [
// 'Agent shows consistent improvement over time',
// 'Success rate increased by 20%',
// 'Duration decreased by 15%'
// ]
// }
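The percentages in the insights are relative to the initial values: success rate going from 0.75 to 0.90 is a 20% relative gain (15 percentage points), and duration going from 5000 ms to 4250 ms is a 15% reduction. The sketch below reproduces that arithmetic. `relativeChange` and `summarizeLearning` are hypothetical helpers, and how ATLAS itself derives `improvementRate` is not documented here.

```typescript
interface PeriodSnapshot {
  avgDuration: number // milliseconds
  successRate: number // fraction in [0, 1]
}

// Relative change from `initial` to `current`, as a fraction of `initial`.
function relativeChange(initial: number, current: number): number {
  if (initial === 0) {
    throw new Error('initial value must be non-zero')
  }
  return (current - initial) / initial
}

// Hypothetical summary: positive numbers mean the agent got better.
function summarizeLearning(initial: PeriodSnapshot, current: PeriodSnapshot) {
  return {
    durationReduction: -relativeChange(initial.avgDuration, current.avgDuration),
    successRateGain: relativeChange(initial.successRate, current.successRate)
  }
}

const learningSummary = summarizeLearning(
  { avgDuration: 5000, successRate: 0.75 },
  { avgDuration: 4250, successRate: 0.9 }
)
console.log(learningSummary)
```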
Workflow Optimization
// Analyze and optimize workflow
const optimization = await atlas.optimizeWorkflow({
workflowId: 'feature-development',
period: '7d'
})
// {
// bottlenecks: [
// {
// stage: 'test-generation',
// avgDuration: 8000,
// recommendation: 'Parallelize test generation across 3 agents'
// }
// ],
// estimatedImprovement: 0.35, // 35% faster
// suggestedChanges: [
// 'Increase test-generator agents from 1 to 3',
// 'Enable caching for OpenAPI spec parsing'
// ]
// }
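To make the bottleneck and improvement figures more concrete, here is a rough sketch of how a stage might be flagged and how a best-case gain from parallelizing it could be estimated. It assumes stages run sequentially and that work splits perfectly across agents, which real workflows rarely achieve. The threshold, function names, and all timings except the 8000 ms for `test-generation` are illustrative assumptions.

```typescript
interface StageTiming {
  stage: string
  avgDuration: number // milliseconds
}

// Flags stages whose share of total workflow time exceeds `shareThreshold`.
function findBottlenecks(stages: StageTiming[], shareThreshold = 0.4): StageTiming[] {
  const total = stages.reduce((sum, s) => sum + s.avgDuration, 0)
  if (total === 0) return []
  return stages.filter((s) => s.avgDuration / total > shareThreshold)
}

// Best-case fraction of total time saved if one stage is split evenly across `workers`.
function estimateParallelGain(stages: StageTiming[], stageName: string, workers: number): number {
  const total = stages.reduce((sum, s) => sum + s.avgDuration, 0)
  const target = stages.find((s) => s.stage === stageName)
  if (target === undefined || total === 0 || workers < 1) return 0
  const saved = target.avgDuration * (1 - 1 / workers)
  return saved / total
}

// Illustrative workflow; only the test-generation duration comes from the example.
const featureStages: StageTiming[] = [
  { stage: 'spec-parsing', avgDuration: 2000 },
  { stage: 'test-generation', avgDuration: 8000 },
  { stage: 'code-review', avgDuration: 5000 }
]
console.log(findBottlenecks(featureStages).map((s) => s.stage))
console.log(estimateParallelGain(featureStages, 'test-generation', 3).toFixed(2))
```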
Correlation Engine
Purpose: Correlate traces, metrics, logs, and events using Neo4j
import { CorrelationEngine } from '@bluefly/agent-tracer/correlation-engine'
const engine = new CorrelationEngine({
neo4jUri: 'bolt://localhost:7687',
neo4jUser: 'neo4j',
neo4jPassword: process.env.NEO4J_PASSWORD
})
// Find correlations for a failed trace
const correlations = await engine.findCorrelations({
traceId: 'trace-abc123',
includeMetrics: true,
includeLogs: true,
includeEvents: true,
timeWindow: '5m'
})
// {
// trace: { id: 'trace-abc123', status: 'error' },
// relatedTraces: [
// { id: 'trace-def456', correlation: 0.95 }
// ],
// metrics: [
// { name: 'cpu_usage', value: 95, correlation: 0.88 }
// ],
// logs: [
// { level: 'error', message: 'Connection timeout', correlation: 1.0 }
// ],
// rootCause: {
// type: 'resource_exhaustion',
// confidence: 0.92
// }
// }
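The `correlation` values in the output are not defined in this document. For numeric signals such as a metric and an error count sampled over the same time window, one common choice is the Pearson coefficient, sketched below. Treat this as an illustration of the idea only; the Correlation Engine may use a different measure, and the sample series are made up.

```typescript
// Pearson correlation coefficient between two equally long numeric series.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('series must have the same length, at least 2')
  }
  const n = xs.length
  const mean = (values: number[]) => values.reduce((a, b) => a + b, 0) / n
  const meanX = mean(xs)
  const meanY = mean(ys)
  let covariance = 0
  let varianceX = 0
  let varianceY = 0
  xs.forEach((x, i) => {
    const dx = x - meanX
    const dy = (ys[i] as number) - meanY
    covariance += dx * dy
    varianceX += dx * dx
    varianceY += dy * dy
  })
  // A constant series has no defined correlation; report 0 instead of NaN.
  if (varianceX === 0 || varianceY === 0) return 0
  return covariance / Math.sqrt(varianceX * varianceY)
}

// Made-up samples over the same time window.
const cpuUsageSamples = [40, 55, 70, 95]
const timeoutCounts = [0, 1, 3, 8]
console.log(pearson(cpuUsageSamples, timeoutCounts).toFixed(2))
```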
Root Cause Analysis
// Perform root cause analysis
const rootCause = await engine.analyzeRootCause({
incidentId: 'incident-789',
timeWindow: '1h',
depth: 3 // How many levels deep to search
})
// {
// rootCause: {
// component: 'postgresql',
// issue: 'connection_pool_exhausted',
// confidence: 0.95
// },
// causalChain: [
// 'High request volume → Slow queries → Connection pool exhaustion → Service timeout'
// ],
// affectedServices: [
// 'api-builder',
// 'doc-sync',
// 'tdd-enforcer'
// ],
// recommendations: [
// 'Increase PostgreSQL connection pool size to 50',
// 'Add index on users.created_at column',
// 'Enable query result caching'
// ]
// }
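The `depth` option suggests a bounded walk over a dependency graph. As a sketch of what a depth-limited search could look like, the code below runs a breadth-first traversal over an in-memory adjacency list. The graph shape, the `connection-pool` and `disk` nodes, and the function name are assumptions for illustration; the real engine queries Neo4j, and its traversal strategy is not described here.

```typescript
// Hypothetical dependency edges: component -> components it depends on.
type DependencyGraph = { [component: string]: string[] | undefined }

// Breadth-first search from `start`, following at most `maxDepth` hops.
// Returns each reachable component with the hop count at which it was found.
function dependenciesWithinDepth(
  graph: DependencyGraph,
  start: string,
  maxDepth: number
): Map<string, number> {
  const depths = new Map<string, number>()
  depths.set(start, 0)
  let frontier = [start]
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = []
    for (const component of frontier) {
      for (const dependency of graph[component] ?? []) {
        if (!depths.has(dependency)) {
          depths.set(dependency, depth)
          next.push(dependency)
        }
      }
    }
    frontier = next
  }
  depths.delete(start)
  return depths
}

// Illustrative graph; node names beyond those in the example output are made up.
const sampleGraph: DependencyGraph = {
  'tdd-enforcer': ['api-builder'],
  'api-builder': ['connection-pool'],
  'connection-pool': ['postgresql'],
  postgresql: ['disk']
}
console.log(dependenciesWithinDepth(sampleGraph, 'tdd-enforcer', 3))
```

With `maxDepth` set to 3, the walk reaches `postgresql` but stops before `disk`, which shows how a shallow depth can hide a deeper cause.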
Tracing Integration
Phoenix Arize Integration
import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'
const phoenix = new PhoenixTracer({
endpoint: 'http://localhost:6006',
project: 'llm-agents'
})
// Trace LLM call
const result = await phoenix.traceLLMCall({
model: 'claude-3-5-sonnet-20241022',
prompt: 'Generate unit tests for this function',
provider: 'anthropic',
metadata: {
agentId: 'test-generator-001',
taskId: 'task-123'
}
})
// {
// traceId: 'trace-abc123',
// usage: {
// inputTokens: 1500,
// outputTokens: 2500,
// totalTokens: 4000
// },
// cost: 0.042, // USD, assuming $3 per 1M input tokens and $15 per 1M output tokens
// latency: 2450, // milliseconds
// quality: {
// coherence: 0.92,
// relevance: 0.95
// }
// }
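The `cost` field depends on per-token pricing, which this document does not specify. Providers typically price input and output tokens separately, so the output rate applies to `outputTokens`, not `totalTokens`. The sketch below shows the arithmetic. The rates are placeholder assumptions (check the provider's current price list), and `estimateCostUsd` is a hypothetical helper.

```typescript
interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

interface ModelRates {
  inputPerMillion: number // USD per 1M input tokens (placeholder value below)
  outputPerMillion: number // USD per 1M output tokens (placeholder value below)
}

// Estimates cost in USD, rounded to four decimal places.
function estimateCostUsd(usage: TokenUsage, rates: ModelRates): number {
  const cost =
    (usage.inputTokens * rates.inputPerMillion +
      usage.outputTokens * rates.outputPerMillion) /
    1e6
  return Math.round(cost * 10000) / 10000
}

// Assumed rates for illustration only.
const assumedRates: ModelRates = { inputPerMillion: 3, outputPerMillion: 15 }
const estimatedCost = estimateCostUsd({ inputTokens: 1500, outputTokens: 2500 }, assumedRates)
console.log(estimatedCost)
```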
OpenTelemetry Export
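// SpanStatusCode is assumed to come from the OpenTelemetry API package
import { SpanStatusCode } from '@opentelemetry/api'
import { OTLPExporter } from '@bluefly/agent-tracer/integrations/otlp'

const otlp = new OTLPExporter({
  endpoint: 'http://localhost:4317',
  serviceName: 'agent-mesh',
  attributes: {
    'service.version': '1.0.0',
    'deployment.environment': 'production'
  }
})

// Wrapped in a function so that `return` is valid and the span always ends.
// executeTask() is assumed to be defined by the caller.
async function runTracedTask() {
  // Start span
  const span = otlp.startSpan('agent.execute_task', {
    attributes: {
      'agent.id': 'tdd-enforcer-001',
      'agent.type': 'governor',
      'task.id': 'task-123',
      'task.type': 'test-validation'
    }
  })
  try {
    // Execute task
    const result = await executeTask()
    span.setStatus({ code: SpanStatusCode.OK })
    return result
  } catch (error) {
    // Record the failure on the span, then let the caller handle it
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw error
  } finally {
    span.end()
  }
}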
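The try/catch/finally pattern above is easy to get subtly wrong when repeated at every call site. One option is to factor it into a small wrapper. The sketch below uses a minimal local `SpanLike` interface instead of the real OpenTelemetry types so it stays self-contained. The numeric status codes mirror OpenTelemetry's `SpanStatusCode` (OK is 1, ERROR is 2); `withSpan` itself is a hypothetical helper, not part of Agent Tracer.

```typescript
// Minimal span shape for this sketch; a real OpenTelemetry span has more members.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): void
  recordException(error: Error): void
  end(): void
}

// Values mirror OpenTelemetry's SpanStatusCode enum (UNSET = 0, OK = 1, ERROR = 2).
const STATUS_OK = 1
const STATUS_ERROR = 2

// Runs `task`, records the outcome on the span, and always ends the span.
async function withSpan<T>(span: SpanLike, task: () => Promise<T>): Promise<T> {
  try {
    const value = await task()
    span.setStatus({ code: STATUS_OK })
    return value
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error))
    span.recordException(err)
    span.setStatus({ code: STATUS_ERROR, message: err.message })
    throw error
  } finally {
    span.end()
  }
}

// Usage with a stand-in span that only logs; a real span would come from the tracer.
const loggingSpan: SpanLike = {
  setStatus: (status) => console.log('status', status.code),
  recordException: (error) => console.log('exception', error.message),
  end: () => console.log('span ended')
}
withSpan(loggingSpan, async () => 'done').then((value) => console.log(value))
```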
Jaeger Visualization
Access Jaeger UI at http://localhost:16686 to:
- View distributed traces
- Analyze service dependencies
- Identify performance bottlenecks
- Track request flows
Metrics Collection
Prometheus Metrics
import { PrometheusExporter } from '@bluefly/agent-tracer/metrics'
const metrics = new PrometheusExporter({
port: 9090,
prefix: 'agent_tracer_'
})
// Register metrics
metrics.counter('llm_requests_total', {
help: 'Total LLM requests',
labelNames: ['provider', 'model', 'status']
})
metrics.histogram('llm_duration_seconds', {
help: 'LLM request duration',
labelNames: ['provider', 'model'],
buckets: [0.1, 0.5, 1, 2, 5, 10]
})
metrics.gauge('llm_cost_dollars', {
help: 'LLM cost in dollars',
labelNames: ['provider', 'model']
})
// Increment counter
metrics.inc('llm_requests_total', {
provider: 'anthropic',
model: 'claude-3-5-sonnet',
status: 'success'
})
// Observe histogram
metrics.observe('llm_duration_seconds', 2.45, {
provider: 'anthropic',
model: 'claude-3-5-sonnet'
})
// Set gauge
metrics.set('llm_cost_dollars', 125.50, {
provider: 'anthropic',
model: 'claude-3-5-sonnet'
})
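Prometheus histogram buckets are cumulative: an observation is counted in every bucket whose upper bound is greater than or equal to the value, plus the implicit `+Inf` bucket. Percentiles are then estimated at query time (for example with `histogram_quantile`) rather than stored directly. The sketch below illustrates the bucket counting with the boundaries registered above; `cumulativeBuckets` is a hypothetical helper and not how the exporter is necessarily implemented.

```typescript
// Counts observations per cumulative bucket, keyed by the `le` label value.
function cumulativeBuckets(
  upperBounds: number[],
  observations: number[]
): Record<string, number> {
  const series: Record<string, number> = {}
  for (const bound of upperBounds) {
    // Every observation at or below the bound belongs to this bucket.
    series[String(bound)] = observations.filter((value) => value <= bound).length
  }
  // The +Inf bucket always equals the total observation count.
  series['+Inf'] = observations.length
  return series
}

// A single 2.45 s observation with the bucket boundaries used above.
const durationBuckets = cumulativeBuckets([0.1, 0.5, 1, 2, 5, 10], [2.45])
console.log(durationBuckets)
```

Here the 2.45 s observation lands in the `5`, `10`, and `+Inf` buckets and in none of the smaller ones.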
Example Metrics
# LLM request metrics
agent_tracer_llm_requests_total{provider="anthropic",model="claude-3-5-sonnet",status="success"} 1250
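# Note: quantile labels are summary-style; a histogram such as llm_duration_seconds exposes _bucket, _sum, and _count series
agent_tracer_llm_duration_seconds{provider="anthropic",model="claude-3-5-sonnet",quantile="0.99"} 2.5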
agent_tracer_llm_cost_dollars{provider="anthropic",model="claude-3-5-sonnet"} 125.50
# Agent metrics
agent_tracer_agent_tasks_total{agent_type="governor",status="success"} 450
agent_tracer_agent_duration_seconds{agent_type="worker",quantile="0.95"} 1.2
# ACE metrics
agent_tracer_ace_score{agent_id="tdd-enforcer-001",dimension="quality"} 0.90
agent_tracer_ace_score{agent_id="tdd-enforcer-001",dimension="efficiency"} 0.85
API Reference
Tracing Endpoints
POST /api/v1/traces
Content-Type: application/json
{
"traceId": "trace-abc123",
"agentId": "tdd-enforcer-001",
"operation": "test-validation",
"duration": 2500,
"status": "success",
"metadata": {
"coverage": 85,
"tests_run": 150
}
}
GET /api/v1/traces/:traceId
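Before posting to `/api/v1/traces`, a client may want to check the payload shape locally. The sketch below validates the fields that appear in the request example. Which fields the API actually requires is not stated in this document, so the rules here are assumptions, and `validateTracePayload` is a hypothetical helper.

```typescript
// Field list taken from the request example; required-ness is an assumption.
interface TracePayload {
  traceId: string
  agentId: string
  operation: string
  duration: number // milliseconds
  status: string
  metadata?: Record<string, unknown>
}

// Returns a list of problems; an empty list means the payload looks well-formed.
function validateTracePayload(input: unknown): string[] {
  if (typeof input !== 'object' || input === null) {
    return ['payload must be a JSON object']
  }
  const body = input as Record<string, unknown>
  const problems: string[] = []
  for (const field of ['traceId', 'agentId', 'operation', 'status']) {
    const value = body[field]
    if (typeof value !== 'string' || value === '') {
      problems.push(field + ' must be a non-empty string')
    }
  }
  const duration = body['duration']
  if (typeof duration !== 'number' || !isFinite(duration) || duration < 0) {
    problems.push('duration must be a non-negative number of milliseconds')
  }
  return problems
}

const samplePayload: TracePayload = {
  traceId: 'trace-abc123',
  agentId: 'tdd-enforcer-001',
  operation: 'test-validation',
  duration: 2500,
  status: 'success',
  metadata: { coverage: 85, tests_run: 150 }
}
console.log(validateTracePayload(samplePayload))
```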
ACE Endpoints
POST /api/v1/ace/score
Content-Type: application/json
{
"agentId": "tdd-enforcer-001",
"period": "24h"
}
GET /api/v1/ace/benchmarks
ATLAS Endpoints
GET /api/v1/atlas/analytics/:agentId?period=30d
POST /api/v1/atlas/optimize
Content-Type: application/json
{
"workflowId": "feature-development",
"period": "7d"
}
Metrics Endpoint
GET /metrics
Accept: text/plain
CLI Commands
# Start Agent Tracer
agent-tracer start
# View traces
agent-tracer traces list --limit 10
agent-tracer traces get --trace-id abc123
# ACE operations
ace start
ace score --agent-id agent-123
ace benchmark --agents agent-1,agent-2,agent-3
# ATLAS operations
agent-tracer atlas start
agent-tracer atlas analyze --agent-id agent-123
agent-tracer atlas optimize --workflow-id workflow-456
# Correlation analysis
agent-tracer correlate --trace-id abc123
agent-tracer rca --incident-id incident-789
Configuration
# config/tracer.yaml
tracer:
  port: 3007
  ace_port: 3008
  atlas_port: 3009

exporters:
  phoenix:
    enabled: true
    endpoint: http://localhost:6006
    project: llm-agents
  jaeger:
    enabled: true
    endpoint: http://localhost:14268/api/traces
  tempo:
    enabled: true
    endpoint: http://localhost:4317

metrics:
  prometheus:
    enabled: true
    port: 9090

correlation:
  neo4j:
    uri: bolt://localhost:7687
    user: neo4j
    password: ${NEO4J_PASSWORD}

analytics:
  qdrant:
    url: http://localhost:6333
    collection: trace-embeddings
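The `${NEO4J_PASSWORD}` placeholder implies that environment variables are substituted when the configuration is loaded. Whether Agent Tracer does this itself or relies on a wrapper is not stated here. As a sketch of the idea, the hypothetical `interpolateEnv` helper below expands `${VAR}` placeholders and fails loudly when a variable is missing, so a blank password never reaches the database client unnoticed.

```typescript
// Replaces ${VAR} placeholders with values from `env`.
// Throws if a referenced variable is not set.
function interpolateEnv(
  text: string,
  env: Record<string, string | undefined>
): string {
  return text.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, varName: string) => {
      const value = env[varName]
      if (value === undefined) {
        throw new Error(`missing environment variable: ${varName}`)
      }
      return value
    }
  )
}

// In Node.js the second argument would typically be process.env.
const expandedLine = interpolateEnv('password: ${NEO4J_PASSWORD}', {
  NEO4J_PASSWORD: 'example-secret'
})
console.log(expandedLine)
```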
Grafana Dashboards
Pre-built dashboards include:
- Agent Overview: High-level agent metrics
- LLM Performance: Model usage, costs, latency
- Trace Analysis: Distributed trace visualization
- ACE Scores: Agent capability scores
- ATLAS Analytics: Learning curves, optimization
- Infrastructure: System health, resources
Integration Examples
With Agent Mesh
// Agent Mesh sends all traces to Agent Tracer
import { AgentMesh } from '@bluefly/agent-mesh'
import { AgentTracer } from '@bluefly/agent-tracer'
const tracer = new AgentTracer({
endpoint: 'http://localhost:3007'
})
const mesh = new AgentMesh({
tracer: tracer
})
// All agent communication is automatically traced
With BuildKit
// BuildKit agents report to Agent Tracer
import { BuildKit } from '@bluefly/agent-buildkit'
const buildkit = new BuildKit({
observability: {
tracer: 'http://localhost:3007',
phoenix: 'http://localhost:6006'
}
})