Agent Tracer API
Operations intelligence and distributed tracing for agent communication.
Overview
Service: Phoenix Arise Tracer
Port: 3100
Domain: tracer.local.bluefly.io
Protocol: REST + WebSocket
Version: 1.0.0
OpenAPI Spec: /technical-guide/openapi/agent-tracer/tracer.openapi.yaml
What It Does
Agent Tracer provides distributed tracing and performance monitoring for the agent mesh:
- Distributed tracing: Track requests across the agent mesh
- Performance metrics: Latency, throughput, error rates
- Bottleneck analysis: Identify spans that run slower than expected
- Critical path analysis: Find the slowest execution path through a trace
- Real-time monitoring: Live metrics and alerts
Architecture
graph LR
A[Agent Mesh] -->|Traces| T[Agent Tracer]
B[Workflow Engine] -->|Traces| T
C[LLM Gateway] -->|Traces| T
T --> D[(TimescaleDB)]
T --> E[Real-time Analytics]
T --> F[Alerting]
Core Features
1. Trace Collection
List Traces
GET /api/v1/traces?limit=100&agentId=agent-brain-001&status=success
Query Parameters:
- limit (integer, default: 100) - Maximum traces to return
- agentId (string) - Filter by agent ID
- status (enum: success, error, timeout) - Filter by status
- startTime (ISO 8601) - Start time filter
- endTime (ISO 8601) - End time filter
Response:
{
"traces": [
{
"traceId": "trace-abc123",
"name": "code-generation-workflow",
"startTime": "2025-01-15T10:00:00Z",
"endTime": "2025-01-15T10:00:02.5Z",
"duration": 2500,
"status": "success",
"spans": 12,
"agents": ["agent-brain-001", "agent-mesh-001"],
"tags": {
"workflow": "code-generation",
"priority": "normal"
}
}
],
"total": 523,
"page": 1,
"pageSize": 100
}
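The query parameters above map directly onto a URL query string. The following is a minimal sketch of a helper that builds such a request URL. The names `ListTracesParams` and `buildListTracesUrl` are illustrative and not part of any published client; the base URL used in the usage note is the one from the Integration example in this document.

```typescript
// Filters accepted by GET /api/v1/traces, as documented above.
// Type and function names are illustrative, not from a published client.
type TraceStatus = "success" | "error" | "timeout";

interface ListTracesParams {
  limit?: number;
  agentId?: string;
  status?: TraceStatus;
  startTime?: Date; // serialized as ISO 8601
  endTime?: Date; // serialized as ISO 8601
}

function buildListTracesUrl(baseUrl: string, params: ListTracesParams = {}): string {
  const pairs: Array<[string, string]> = [];
  if (params.limit !== undefined) pairs.push(["limit", String(params.limit)]);
  if (params.agentId !== undefined) pairs.push(["agentId", params.agentId]);
  if (params.status !== undefined) pairs.push(["status", params.status]);
  if (params.startTime !== undefined) pairs.push(["startTime", params.startTime.toISOString()]);
  if (params.endTime !== undefined) pairs.push(["endTime", params.endTime.toISOString()]);

  // Percent-encode every key and value so timestamps and IDs are URL-safe.
  const query = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  const path = `${baseUrl.replace(/\/+$/, "")}/traces`;
  return query.length > 0 ? `${path}?${query}` : path;
}
```

Calling `buildListTracesUrl("http://tracer.local.bluefly.io/api/v1", { limit: 100, agentId: "agent-brain-001", status: "success" })` yields the request line shown at the top of this section.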
Get Trace Details
GET /api/v1/traces/{traceId}
Response:
{
"traceId": "trace-abc123",
"name": "code-generation-workflow",
"startTime": "2025-01-15T10:00:00Z",
"endTime": "2025-01-15T10:00:02.5Z",
"duration": 2500,
"status": "success",
"spans": [
{
"spanId": "span-001",
"name": "task-routing",
"startTime": "2025-01-15T10:00:00Z",
"duration": 50,
"agent": "agent-mesh-001",
"tags": {
"operation": "discover-agent"
}
},
{
"spanId": "span-002",
"name": "code-generation",
"startTime": "2025-01-15T10:00:00.05Z",
"duration": 2400,
"agent": "agent-brain-001",
"tags": {
"model": "qwen2.5-coder:32b",
"tokens": 1523
}
},
{
"spanId": "span-003",
"name": "response-formatting",
"startTime": "2025-01-15T10:00:02.45Z",
"duration": 50,
"agent": "agent-mesh-001",
"tags": {
"format": "json"
}
}
],
"metrics": {
"totalDuration": 2500,
"networkTime": 100,
"processingTime": 2400,
"overhead": 4
}
}
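A trace detail response can be summarized on the client. The sketch below sums span durations per agent and derives a span's end time from its start time and duration. It assumes durations are in milliseconds, which is consistent with the sample (a 2.5 s trace reports `duration: 2500`); the `Span` type and function names are illustrative.

```typescript
// Subset of the span fields shown in the sample response above.
interface Span {
  spanId: string;
  name: string;
  startTime: string; // ISO 8601
  duration: number; // milliseconds (assumed from the sample)
  agent: string;
}

// Sum span durations per agent to see where a trace spent its time.
function durationByAgent(spans: Span[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const span of spans) {
    totals.set(span.agent, (totals.get(span.agent) ?? 0) + span.duration);
  }
  return totals;
}

// End time of a span, derived from its start time plus its duration.
function spanEndTime(span: Span): string {
  return new Date(Date.parse(span.startTime) + span.duration).toISOString();
}
```

For the sample trace this attributes 100 ms to `agent-mesh-001` and 2400 ms to `agent-brain-001`.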
2. Performance Analysis
Get Critical Path
Find the slowest execution path through a trace.
GET /api/v1/traces/{traceId}/critical-path
Response:
{
"traceId": "trace-abc123",
"criticalPath": [
{
"spanId": "span-002",
"name": "code-generation",
"duration": 2400,
"percentOfTotal": 96
},
{
"spanId": "span-001",
"name": "task-routing",
"duration": 50,
"percentOfTotal": 2
},
{
"spanId": "span-003",
"name": "response-formatting",
"duration": 50,
"percentOfTotal": 2
}
],
"totalDuration": 2500,
"criticalPathDuration": 2500,
"percentOfTotal": 100
}
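The `percentOfTotal` values in this response can be reproduced from span durations. The sketch below ranks spans by duration and expresses each as a share of the trace total. It assumes the spans ran one after another, as in the sample trace; a full critical path analysis over parallel spans would need parent/child relationships that this response does not show. The function name `rankByDuration` is illustrative.

```typescript
interface SpanTiming {
  spanId: string;
  name: string;
  duration: number; // milliseconds
}

interface CriticalPathEntry extends SpanTiming {
  percentOfTotal: number;
}

// Rank spans from slowest to fastest and compute each span's share of the total.
// Assumes sequential spans; this is not the service's documented algorithm.
function rankByDuration(spans: SpanTiming[], totalDuration: number): CriticalPathEntry[] {
  if (totalDuration <= 0) {
    throw new RangeError("totalDuration must be positive");
  }
  return [...spans]
    .sort((a, b) => b.duration - a.duration)
    .map((span) => ({
      ...span,
      percentOfTotal: Math.round((span.duration / totalDuration) * 100),
    }));
}
```

With the three sample spans and a total of 2500 ms, this gives 96, 2, and 2 percent, matching the response above.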
Identify Bottlenecks
GET /api/v1/traces/{traceId}/bottlenecks
Response:
{
"traceId": "trace-abc123",
"bottlenecks": [
{
"spanId": "span-002",
"name": "code-generation",
"duration": 2400,
"expectedDuration": 1500,
"slowdownFactor": 1.6,
"severity": "medium",
"recommendations": [
"Consider using smaller model (deepseek-coder-v2:16b)",
"Enable response streaming",
"Add caching layer"
]
}
],
"totalBottlenecks": 1,
"overallPerformance": "acceptable"
}
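In the sample, `slowdownFactor` equals `duration / expectedDuration` (2400 / 1500 = 1.6). The sketch below computes that ratio. The severity bands in `classifySeverity` are hypothetical: the document only shows a "medium" result for a factor of 1.6, so the thresholds and the "low" and "high" labels are assumptions, exposed as parameters.

```typescript
// Ratio of observed to expected duration, rounded to one decimal place.
function slowdownFactor(duration: number, expectedDuration: number): number {
  if (expectedDuration <= 0) {
    throw new RangeError("expectedDuration must be positive");
  }
  return Math.round((duration / expectedDuration) * 10) / 10;
}

// Hypothetical severity bands; the service's real thresholds are not documented here.
type Severity = "low" | "medium" | "high";

function classifySeverity(factor: number, mediumAt = 1.5, highAt = 3): Severity {
  if (factor >= highAt) return "high";
  if (factor >= mediumAt) return "medium";
  return "low";
}
```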
Get Aggregated Metrics
GET /api/v1/metrics/aggregate?timeRange=1h&groupBy=agent
Query Parameters:
- timeRange (string) - Time range (1h, 24h, 7d, 30d)
- groupBy (enum: agent, operation, status) - Aggregation key
- metric (enum: duration, throughput, errors) - Metric to aggregate
Response:
{
"timeRange": "1h",
"startTime": "2025-01-15T09:00:00Z",
"endTime": "2025-01-15T10:00:00Z",
"aggregations": [
{
"key": "agent-brain-001",
"metrics": {
"totalRequests": 523,
"successRate": 0.987,
"averageDuration": 2350,
"p50Duration": 2200,
"p95Duration": 3500,
"p99Duration": 4200,
"errorRate": 0.013,
"throughput": 145.3
}
},
{
"key": "agent-mesh-001",
"metrics": {
"totalRequests": 1046,
"successRate": 0.999,
"averageDuration": 45,
"p50Duration": 40,
"p95Duration": 80,
"p99Duration": 120,
"errorRate": 0.001,
"throughput": 290.5
}
}
]
}
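The `p50Duration`, `p95Duration`, and `p99Duration` fields are latency percentiles. How the service computes them is not documented here; the sketch below uses the nearest-rank method as one common definition, to show what the numbers mean.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(durations: number[], p: number): number {
  if (durations.length === 0) {
    throw new RangeError("at least one sample is required");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```

For example, for the durations 10, 20, 30, 40, and 50 ms, the 50th percentile is 30 ms and the 95th percentile is 50 ms.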
3. Real-time Monitoring
WebSocket Stream
Subscribe to real-time trace events:
WebSocket: ws://tracer.local.bluefly.io/api/v1/traces/stream
Subscribe Message:
{
"action": "subscribe",
"filters": {
"agents": ["agent-brain-001"],
"minDuration": 1000,
"status": ["error", "timeout"]
}
}
Stream Events:
{
"type": "trace-completed",
"traceId": "trace-xyz789",
"timestamp": "2025-01-15T10:05:00Z",
"data": {
"name": "code-generation-workflow",
"duration": 3200,
"status": "success",
"spans": 8,
"bottlenecks": 1
}
}
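A client needs to serialize the subscribe message and validate incoming frames. The sketch below covers both as pure functions so they can be used with any WebSocket client. The type and function names are illustrative, and only the `trace-completed` event shown above is handled; other event types, if any exist, are not documented here.

```typescript
interface StreamFilters {
  agents?: string[];
  minDuration?: number; // milliseconds
  status?: Array<"success" | "error" | "timeout">;
}

interface TraceCompletedEvent {
  type: "trace-completed";
  traceId: string;
  timestamp: string;
  data: { name: string; duration: number; status: string; spans: number; bottlenecks: number };
}

// Serialize the subscribe message in the shape documented above.
function buildSubscribeMessage(filters: StreamFilters): string {
  return JSON.stringify({ action: "subscribe", filters });
}

// Parse a raw frame; returns null for anything that is not a
// well-formed trace-completed event.
function parseTraceCompleted(raw: string): TraceCompletedEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const event = parsed as Partial<TraceCompletedEvent>;
  if (event.type !== "trace-completed" || typeof event.traceId !== "string") return null;
  if (typeof event.data !== "object" || event.data === null) return null;
  if (typeof event.data.duration !== "number") return null;
  return event as TraceCompletedEvent;
}

// Usage with a WebSocket client (not executed here):
//   const socket = new WebSocket("ws://tracer.local.bluefly.io/api/v1/traces/stream");
//   socket.onopen = () => socket.send(buildSubscribeMessage({ status: ["error", "timeout"] }));
//   socket.onmessage = (msg) => {
//     const event = parseTraceCompleted(String(msg.data));
//     if (event !== null) { /* handle event.data */ }
//   };
```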
Get Live Metrics
GET /api/v1/metrics/live
Response:
{
"timestamp": "2025-01-15T10:05:00Z",
"metrics": {
"requestsPerSecond": 145.3,
"averageLatency": 2350,
"errorRate": 0.013,
"activeTraces": 23,
"queuedRequests": 5
},
"topAgents": [
{
"agentId": "agent-brain-001",
"requestsPerSecond": 87.2,
"averageLatency": 2400
},
{
"agentId": "agent-mesh-001",
"requestsPerSecond": 174.4,
"averageLatency": 45
}
]
}
4. Error Analysis
Get Error Traces
GET /api/v1/traces?status=error&limit=50
Response:
{
"traces": [
{
"traceId": "trace-err123",
"name": "code-generation-workflow",
"startTime": "2025-01-15T10:00:00Z",
"endTime": "2025-01-15T10:00:30Z",
"duration": 30000,
"status": "error",
"error": {
"code": "MODEL_TIMEOUT",
"message": "Model inference timed out after 30000ms",
"spanId": "span-002",
"agent": "agent-brain-001"
}
}
],
"total": 7,
"errorRate": 0.013
}
Error Distribution
GET /api/v1/metrics/errors?timeRange=24h
Response:
{
"timeRange": "24h",
"totalErrors": 152,
"errorRate": 0.012,
"errorsByType": [
{
"errorCode": "MODEL_TIMEOUT",
"count": 87,
"percentage": 57.2,
"affectedAgents": ["agent-brain-001", "agent-brain-002"]
},
{
"errorCode": "AGENT_UNAVAILABLE",
"count": 42,
"percentage": 27.6,
"affectedAgents": ["agent-mesh-001"]
},
{
"errorCode": "VALIDATION_ERROR",
"count": 23,
"percentage": 15.1,
"affectedAgents": ["agent-router-001"]
}
],
"recommendations": [
"Increase MODEL_TIMEOUT threshold to 60000ms",
"Add health check retries for agent-mesh-001",
"Improve input validation schemas"
]
}
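The `percentage` values in `errorsByType` are each error code's share of `totalErrors`, rounded to one decimal place (for example 87 / 152 = 57.2%). A minimal sketch of that calculation, with illustrative names:

```typescript
interface ErrorCount {
  errorCode: string;
  count: number;
}

interface ErrorShare extends ErrorCount {
  percentage: number; // share of all errors, one decimal place
}

// Sort error codes by frequency and compute each code's share of the total.
function errorShares(counts: ErrorCount[]): ErrorShare[] {
  const total = counts.reduce((sum, entry) => sum + entry.count, 0);
  return [...counts]
    .sort((a, b) => b.count - a.count)
    .map((entry) => ({
      ...entry,
      percentage: total === 0 ? 0 : Math.round((entry.count / total) * 1000) / 10,
    }));
}
```

Because each share is rounded independently, the percentages may not sum to exactly 100, as in the sample response (57.2 + 27.6 + 15.1 = 99.9).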
5. Service Dependencies
Get Dependency Graph
GET /api/v1/dependencies?timeRange=1h
Response:
{
"timeRange": "1h",
"nodes": [
{
"serviceId": "agent-mesh-001",
"serviceName": "Agent Mesh",
"requestsHandled": 1046,
"errorRate": 0.001
},
{
"serviceId": "agent-brain-001",
"serviceName": "Agent Brain",
"requestsHandled": 523,
"errorRate": 0.013
},
{
"serviceId": "ollama-mcp",
"serviceName": "Ollama MCP",
"requestsHandled": 523,
"errorRate": 0.015
}
],
"edges": [
{
"from": "agent-mesh-001",
"to": "agent-brain-001",
"requests": 523,
"averageLatency": 2400,
"errorRate": 0.013
},
{
"from": "agent-brain-001",
"to": "ollama-mcp",
"requests": 523,
"averageLatency": 2350,
"errorRate": 0.015
}
]
}
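The `nodes` and `edges` arrays describe a directed call graph. One common use is finding every service that a given service depends on, directly or transitively. The sketch below does this with a breadth-first traversal over the `edges` array; the function name is illustrative.

```typescript
interface DependencyEdge {
  from: string;
  to: string;
}

// All services reachable from `start` by following edges in call direction.
// Breadth-first, so nearer dependencies are listed before farther ones.
function downstreamServices(edges: DependencyEdge[], start: string): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const result: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const edge of edges) {
      if (edge.from === current && !visited.has(edge.to)) {
        visited.add(edge.to); // guard against cycles in the graph
        queue.push(edge.to);
        result.push(edge.to);
      }
    }
  }
  return result;
}
```

For the sample graph, the services downstream of `agent-mesh-001` are `agent-brain-001` and `ollama-mcp`.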
Storage Backend
Agent Tracer uses TimescaleDB for efficient time-series storage:
Database: PostgreSQL with TimescaleDB extension
Retention:
- Raw traces: 7 days
- Aggregated metrics: 90 days
- Error logs: 30 days
Automatic Downsampling:
- 1-minute aggregations: 7 days
- 1-hour aggregations: 30 days
- 1-day aggregations: 90 days
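The downsampling tiers determine the finest resolution a client can expect for data of a given age. The sketch below encodes the tiers listed above as a lookup. Treating each retention limit as inclusive is an assumption, and the names are illustrative.

```typescript
type Resolution = "1-minute" | "1-hour" | "1-day";

// Retention per aggregation level, in days, ordered from finest to coarsest.
const RETENTION_DAYS: Array<{ resolution: Resolution; days: number }> = [
  { resolution: "1-minute", days: 7 },
  { resolution: "1-hour", days: 30 },
  { resolution: "1-day", days: 90 },
];

// Finest aggregation level still retained for data of the given age,
// or null if every tier has expired. Boundaries are assumed inclusive.
function finestResolution(ageInDays: number): Resolution | null {
  for (const tier of RETENTION_DAYS) {
    if (ageInDays <= tier.days) return tier.resolution;
  }
  return null;
}
```

For example, data from 10 days ago is available at 1-hour resolution, and data older than 90 days is no longer available.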
Alerting
Configure Alerts
POST /api/v1/alerts
Request:
{
"name": "High Error Rate",
"condition": {
"metric": "errorRate",
"threshold": 0.05,
"operator": "greaterThan",
"timeWindow": "5m"
},
"notifications": [
{
"type": "slack",
"webhook": "https://hooks.slack.com/...",
"channel": "#alerts"
},
{
"type": "email",
"recipients": ["ops@bluefly.io"]
}
],
"severity": "critical"
}
Response:
{
"alertId": "alert-123",
"name": "High Error Rate",
"status": "active",
"createdAt": "2025-01-15T10:10:00Z"
}
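An alert condition can be checked locally before it is submitted, for example to preview whether current metrics would trigger it. The sketch below parses the `timeWindow` string and evaluates a condition against an observed value. Only the `greaterThan` operator appears in this document; `lessThan` is an assumed counterpart, and the supported units (`m`, `h`, `d`) are inferred from the values used elsewhere in this reference.

```typescript
interface AlertCondition {
  metric: string;
  threshold: number;
  operator: "greaterThan" | "lessThan"; // "lessThan" is an assumption
  timeWindow: string; // e.g. "5m"
}

// Convert a window such as "5m", "1h", or "7d" to milliseconds.
function parseWindowMs(window: string): number {
  const match = /^(\d+)([mhd])$/.exec(window);
  if (match === null) {
    throw new Error(`Unsupported time window: ${window}`);
  }
  const unitMs: Record<string, number> = { m: 60000, h: 3600000, d: 86400000 };
  return Number(match[1]) * unitMs[match[2]];
}

// True when the observed metric value satisfies the alert condition.
function conditionMet(condition: AlertCondition, observed: number): boolean {
  return condition.operator === "greaterThan"
    ? observed > condition.threshold
    : observed < condition.threshold;
}
```

With the request above, an observed error rate of 0.06 meets the condition, while the 0.013 rate from the live metrics sample does not.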
Integration
Instrument Your Agent
Add tracing to your agent code:
TypeScript:
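import { AgentTracer } from '@bluefly/agent-tracer';
// Create one tracer for the service; spans are reported to the Agent Tracer API.
const tracer = new AgentTracer({
  serviceName: 'my-custom-agent',
  endpoint: 'http://tracer.local.bluefly.io/api/v1'
});
// Task and processTask stand in for your agent's own types and logic.
async function executeTask(task: Task) {
  const span = tracer.startSpan('execute-task', {
    tags: { taskType: task.type }
  });
  try {
    const result = await processTask(task);
    span.setTag('status', 'success');
    return result;
  } catch (error) {
    span.setTag('error', true);
    // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`.
    span.log({ errorMessage: error instanceof Error ? error.message : String(error) });
    throw error;
  } finally {
    // Always close the span, whether the task succeeded or failed.
    span.finish();
  }
}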
Health & Metrics
GET /api/v1/health
Response:
{
"status": "healthy",
"version": "1.0.0",
"uptime": 86400,
"database": {
"status": "healthy",
"latency": 3,
"connections": 12
},
"storage": {
"tracesStored": 1523422,
"spansStored": 15234220,
"diskUsage": "42.3GB"
}
}
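A deployment script or readiness probe can reduce this response to a single boolean. The sketch below treats the service as ready only when both the top-level and database statuses are "healthy". That rule, the type name, and the function names are assumptions for illustration; other status values are not documented here.

```typescript
// Shape of the health response shown above.
interface HealthResponse {
  status: string;
  version: string;
  uptime: number;
  database: { status: string; latency: number; connections: number };
  storage: { tracesStored: number; spansStored: number; diskUsage: string };
}

// Ready only if the service and its database both report "healthy".
function isReady(health: HealthResponse): boolean {
  return health.status === "healthy" && health.database.status === "healthy";
}

// Average number of spans per stored trace; 0 when no traces are stored.
function averageSpansPerTrace(health: HealthResponse): number {
  const { tracesStored, spansStored } = health.storage;
  return tracesStored === 0 ? 0 : spansStored / tracesStored;
}
```

For the sample response, `isReady` returns true and the storage counters work out to 10 spans per trace.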