Agent Tracer API

Operations intelligence and distributed tracing for agent communication.

Overview

- Service: Phoenix Arise Tracer
- Port: 3100
- Domain: tracer.local.bluefly.io
- Protocol: REST + WebSocket
- Version: 1.0.0
- OpenAPI Spec: /technical-guide/openapi/agent-tracer/tracer.openapi.yaml

What It Does

Agent Tracer provides distributed tracing and performance monitoring:

- Distributed tracing: Track requests across the agent mesh
- Performance metrics: Latency, throughput, error rates
- Bottleneck analysis: Flag spans that run slower than their expected duration
- Critical path analysis: Find the slowest execution path through a trace
- Real-time monitoring: Live metrics and alerts

Architecture

graph LR
    A[Agent Mesh] -->|Traces| T[Agent Tracer]
    B[Workflow Engine] -->|Traces| T
    C[LLM Gateway] -->|Traces| T
    T --> D[(TimescaleDB)]
    T --> E[Real-time Analytics]
    T --> F[Alerting]

Core Features

1. Trace Collection

List Traces

GET /api/v1/traces?limit=100&agentId=agent-brain-001&status=success

Query Parameters:

- limit (integer, default: 100) - Maximum traces to return
- agentId (string) - Filter by agent ID
- status (enum: success, error, timeout) - Filter by status
- startTime (ISO 8601) - Start time filter
- endTime (ISO 8601) - End time filter
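As a minimal sketch of how these query parameters combine into a request URL, the helper below serializes only the filters that are set. `ListTracesQuery` and `buildListTracesUrl` are illustrative names, not part of a published client, and the base URL is passed in by the caller rather than assumed.

```typescript
// Hypothetical helper; the type and function names are illustrative only.
type TraceStatus = "success" | "error" | "timeout";

interface ListTracesQuery {
  limit?: number;
  agentId?: string;
  status?: TraceStatus;
  startTime?: string; // ISO 8601
  endTime?: string; // ISO 8601
}

function buildListTracesUrl(baseUrl: string, query: ListTracesQuery = {}): string {
  // Fixed ordering keeps the generated URL stable and easy to compare in logs.
  const entries: Array<[string, string | number | undefined]> = [
    ["limit", query.limit],
    ["agentId", query.agentId],
    ["status", query.status],
    ["startTime", query.startTime],
    ["endTime", query.endTime],
  ];

  const params = new URLSearchParams();
  for (const [key, value] of entries) {
    // Skip unset filters so the server can apply its own defaults.
    if (value !== undefined) {
      params.set(key, String(value));
    }
  }

  const qs = params.toString();
  return qs.length > 0 ? `${baseUrl}/api/v1/traces?${qs}` : `${baseUrl}/api/v1/traces`;
}

const exampleListUrl = buildListTracesUrl("http://tracer.local.bluefly.io", {
  limit: 100,
  agentId: "agent-brain-001",
  status: "success",
});
console.log(exampleListUrl);
```

Timestamps passed as `startTime` or `endTime` are percent-encoded by `URLSearchParams`, so colons in ISO 8601 values do not need manual escaping.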

Response:

{
  "traces": [
    {
      "traceId": "trace-abc123",
      "name": "code-generation-workflow",
      "startTime": "2025-01-15T10:00:00Z",
      "endTime": "2025-01-15T10:00:02.5Z",
      "duration": 2500,
      "status": "success",
      "spans": 12,
      "agents": ["agent-brain-001", "agent-mesh-001"],
      "tags": {
        "workflow": "code-generation",
        "priority": "normal"
      }
    }
  ],
  "total": 523,
  "page": 1,
  "pageSize": 100
}
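The response carries `total`, `page`, and `pageSize`, which is enough to work out how many pages a result set spans. The sketch below types the response as shown above and derives the page count; the interface names are illustrative, and it assumes `page` is 1-indexed as in the sample. How a specific page is requested is not documented here, so the sketch stops at the arithmetic.

```typescript
// Shapes inferred from the sample response above; names are illustrative.
interface TraceSummary {
  traceId: string;
  name: string;
  startTime: string;
  endTime: string;
  duration: number; // milliseconds, consistent with the sample timestamps
  status: "success" | "error" | "timeout";
  spans: number;
  agents: string[];
  tags: Record<string, string>;
}

interface TraceListResponse {
  traces: TraceSummary[];
  total: number;
  page: number;
  pageSize: number;
}

type PageInfo = Pick<TraceListResponse, "total" | "page" | "pageSize">;

// Number of pages needed to hold `total` items at `pageSize` items per page.
function totalPages(info: Pick<PageInfo, "total" | "pageSize">): number {
  if (info.pageSize <= 0) {
    return 0;
  }
  return Math.ceil(info.total / info.pageSize);
}

// True while there is at least one more page after the current one.
function hasNextPage(info: PageInfo): boolean {
  return info.page < totalPages(info);
}

const samplePage: PageInfo = { total: 523, page: 1, pageSize: 100 };
console.log(totalPages(samplePage), hasNextPage(samplePage)); // 6 true
```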

Get Trace Details

GET /api/v1/traces/{traceId}

Response:

{
  "traceId": "trace-abc123",
  "name": "code-generation-workflow",
  "startTime": "2025-01-15T10:00:00Z",
  "endTime": "2025-01-15T10:00:02.5Z",
  "duration": 2500,
  "status": "success",
  "spans": [
    {
      "spanId": "span-001",
      "name": "task-routing",
      "startTime": "2025-01-15T10:00:00Z",
      "duration": 50,
      "agent": "agent-mesh-001",
      "tags": {
        "operation": "discover-agent"
      }
    },
    {
      "spanId": "span-002",
      "name": "code-generation",
      "startTime": "2025-01-15T10:00:00.05Z",
      "duration": 2400,
      "agent": "agent-brain-001",
      "tags": {
        "model": "qwen2.5-coder:32b",
        "tokens": 1523
      }
    },
    {
      "spanId": "span-003",
      "name": "response-formatting",
      "startTime": "2025-01-15T10:00:02.45Z",
      "duration": 50,
      "agent": "agent-mesh-001",
      "tags": {
        "format": "json"
      }
    }
  ],
  "metrics": {
    "totalDuration": 2500,
    "networkTime": 100,
    "processingTime": 2400,
    "overhead": 4
  }
}
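One common use of the span list is to see where a trace spent its time. The sketch below sums span durations per agent; `SpanLike` and `durationByAgent` are illustrative names. It assumes the spans do not overlap, which holds for the sample above but is not guaranteed by the response format.

```typescript
// Minimal span shape needed for this calculation; illustrative, not an SDK type.
interface SpanLike {
  spanId: string;
  agent: string;
  duration: number; // milliseconds
}

// Sum span durations for each agent that appears in the trace.
function durationByAgent(spans: SpanLike[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const span of spans) {
    totals.set(span.agent, (totals.get(span.agent) ?? 0) + span.duration);
  }
  return totals;
}

const sampleSpans: SpanLike[] = [
  { spanId: "span-001", agent: "agent-mesh-001", duration: 50 },
  { spanId: "span-002", agent: "agent-brain-001", duration: 2400 },
  { spanId: "span-003", agent: "agent-mesh-001", duration: 50 },
];

const perAgent = durationByAgent(sampleSpans);
console.log(perAgent.get("agent-mesh-001"), perAgent.get("agent-brain-001")); // 100 2400
```

For the sample trace these totals happen to equal `networkTime` and `processingTime`, though the response does not state how those fields are derived.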

2. Performance Analysis

Get Critical Path

Find the slowest execution path through a trace.

GET /api/v1/traces/{traceId}/critical-path

Response:

{
  "traceId": "trace-abc123",
  "criticalPath": [
    {
      "spanId": "span-002",
      "name": "code-generation",
      "duration": 2400,
      "percentOfTotal": 96
    },
    {
      "spanId": "span-001",
      "name": "task-routing",
      "duration": 50,
      "percentOfTotal": 2
    },
    {
      "spanId": "span-003",
      "name": "response-formatting",
      "duration": 50,
      "percentOfTotal": 2
    }
  ],
  "totalDuration": 2500,
  "criticalPathDuration": 2500,
  "percentOfTotal": 100
}
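The `percentOfTotal` values can be reproduced from span durations. The sketch below ranks spans that are already known to lie on the critical path and computes each one's share of the trace; it does not discover the path itself, and `rankByDuration` is an illustrative name rather than the service's algorithm.

```typescript
interface PathSpan {
  spanId: string;
  name: string;
  duration: number; // milliseconds
}

interface RankedSpan extends PathSpan {
  percentOfTotal: number;
}

// Sort spans from slowest to fastest and attach their rounded share of the trace.
function rankByDuration(spans: PathSpan[], totalDuration: number): RankedSpan[] {
  return [...spans]
    .sort((a, b) => b.duration - a.duration)
    .map((span) => ({
      ...span,
      // Guard against division by zero for empty or zero-length traces.
      percentOfTotal: totalDuration > 0 ? Math.round((span.duration * 100) / totalDuration) : 0,
    }));
}

const ranked = rankByDuration(
  [
    { spanId: "span-001", name: "task-routing", duration: 50 },
    { spanId: "span-002", name: "code-generation", duration: 2400 },
    { spanId: "span-003", name: "response-formatting", duration: 50 },
  ],
  2500,
);
console.log(ranked.map((s) => `${s.spanId}:${s.percentOfTotal}`).join(" "));
// span-002:96 span-001:2 span-003:2
```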

Identify Bottlenecks

GET /api/v1/traces/{traceId}/bottlenecks

Response:

{
  "traceId": "trace-abc123",
  "bottlenecks": [
    {
      "spanId": "span-002",
      "name": "code-generation",
      "duration": 2400,
      "expectedDuration": 1500,
      "slowdownFactor": 1.6,
      "severity": "medium",
      "recommendations": [
        "Consider using smaller model (deepseek-coder-v2:16b)",
        "Enable response streaming",
        "Add caching layer"
      ]
    }
  ],
  "totalBottlenecks": 1,
  "overallPerformance": "acceptable"
}
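In the sample, `slowdownFactor` equals `duration / expectedDuration` (2400 / 1500 = 1.6). The sketch below applies that ratio to filter slow spans. The default cutoff of 1.5 and the function names are assumptions chosen for illustration; the thresholds the service uses for `severity` are not documented here.

```typescript
interface TimedSpan {
  spanId: string;
  duration: number; // observed, milliseconds
  expectedDuration: number; // baseline, milliseconds
}

// Ratio of observed to expected duration; values above 1 mean slower than expected.
function slowdownFactor(span: TimedSpan): number {
  return span.duration / span.expectedDuration;
}

// Keep spans whose slowdown meets or exceeds `minFactor` (1.5 is an assumed default).
function findSlowSpans(spans: TimedSpan[], minFactor: number = 1.5): TimedSpan[] {
  return spans.filter(
    (span) => span.expectedDuration > 0 && slowdownFactor(span) >= minFactor,
  );
}

const slowSpans = findSlowSpans([
  { spanId: "span-001", duration: 50, expectedDuration: 50 },
  { spanId: "span-002", duration: 2400, expectedDuration: 1500 },
]);
console.log(slowSpans.map((s) => s.spanId)); // [ 'span-002' ]
```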

Get Aggregated Metrics

GET /api/v1/metrics/aggregate?timeRange=1h&groupBy=agent

Query Parameters:

- timeRange (string) - Time range (1h, 24h, 7d, 30d)
- groupBy (enum: agent, operation, status) - Aggregation key
- metric (enum: duration, throughput, errors) - Metric to aggregate

Response:

{
  "timeRange": "1h",
  "startTime": "2025-01-15T09:00:00Z",
  "endTime": "2025-01-15T10:00:00Z",
  "aggregations": [
    {
      "key": "agent-brain-001",
      "metrics": {
        "totalRequests": 523,
        "successRate": 0.987,
        "averageDuration": 2350,
        "p50Duration": 2200,
        "p95Duration": 3500,
        "p99Duration": 4200,
        "errorRate": 0.013,
        "throughput": 145.3
      }
    },
    {
      "key": "agent-mesh-001",
      "metrics": {
        "totalRequests": 1046,
        "successRate": 0.999,
        "averageDuration": 45,
        "p50Duration": 40,
        "p95Duration": 80,
        "p99Duration": 120,
        "errorRate": 0.001,
        "throughput": 290.5
      }
    }
  ]
}
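To make the `p50Duration`, `p95Duration`, and `p99Duration` fields concrete, here is a sketch of the nearest-rank percentile method applied to raw durations. This is one standard definition; the source does not say which method Agent Tracer uses, so treat it as an aid to reading the numbers rather than a description of the service.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(durations: number[], p: number): number {
  if (durations.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...durations].sort((a, b) => a - b);
  // rank is 1-based and always between 1 and sorted.length for valid p.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

const sampleDurations = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
console.log(percentile(sampleDurations, 50)); // 50
console.log(percentile(sampleDurations, 95)); // 100
```

With few samples the upper percentiles collapse onto the maximum, which is why p95 and p99 are most informative over large request counts.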

3. Real-time Monitoring

WebSocket Stream

Subscribe to real-time trace events:

WebSocket: ws://tracer.local.bluefly.io/api/v1/traces/stream

Subscribe Message:

{
  "action": "subscribe",
  "filters": {
    "agents": ["agent-brain-001"],
    "minDuration": 1000,
    "status": ["error", "timeout"]
  }
}

Stream Events:

{
  "type": "trace-completed",
  "traceId": "trace-xyz789",
  "timestamp": "2025-01-15T10:05:00Z",
  "data": {
    "name": "code-generation-workflow",
    "duration": 3200,
    "status": "success",
    "spans": 8,
    "bottlenecks": 1
  }
}
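A client for the stream needs to send the subscribe message and recognize incoming events. The sketch below does both against a minimal `SocketLike` interface, so it works with any object exposing `send` (a browser `WebSocket` qualifies). The type names and the validation rules in `parseStreamEvent` are illustrative assumptions based on the samples above.

```typescript
interface StreamFilters {
  agents?: string[];
  minDuration?: number; // milliseconds
  status?: Array<"success" | "error" | "timeout">;
}

// Anything that can send a text frame.
interface SocketLike {
  send(data: string): void;
}

interface TraceCompletedEvent {
  type: "trace-completed";
  traceId: string;
  timestamp: string;
  data: { name: string; duration: number; status: string; spans: number; bottlenecks: number };
}

// Send the subscribe message in the shape shown above.
function subscribe(socket: SocketLike, filters: StreamFilters): void {
  socket.send(JSON.stringify({ action: "subscribe", filters }));
}

// Return the event if it looks like a trace-completed message, otherwise null.
function parseStreamEvent(raw: string): TraceCompletedEvent | null {
  let message: unknown;
  try {
    message = JSON.parse(raw);
  } catch {
    return null; // not JSON; ignore the frame
  }
  if (typeof message !== "object" || message === null) {
    return null;
  }
  const candidate = message as { type?: unknown; traceId?: unknown };
  if (candidate.type !== "trace-completed" || typeof candidate.traceId !== "string") {
    return null;
  }
  return message as TraceCompletedEvent;
}

// Usage with a stand-in socket that records what was sent.
const sentFrames: string[] = [];
subscribe({ send: (data) => sentFrames.push(data) }, {
  agents: ["agent-brain-001"],
  minDuration: 1000,
  status: ["error", "timeout"],
});
console.log(sentFrames[0]);
```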

Get Live Metrics

GET /api/v1/metrics/live

Response:

{
  "timestamp": "2025-01-15T10:05:00Z",
  "metrics": {
    "requestsPerSecond": 145.3,
    "averageLatency": 2350,
    "errorRate": 0.013,
    "activeTraces": 23,
    "queuedRequests": 5
  },
  "topAgents": [
    {
      "agentId": "agent-brain-001",
      "requestsPerSecond": 87.2,
      "averageLatency": 2400
    },
    {
      "agentId": "agent-mesh-001",
      "requestsPerSecond": 174.4,
      "averageLatency": 45
    }
  ]
}

4. Error Analysis

Get Error Traces

GET /api/v1/traces?status=error&limit=50

Response:

{
  "traces": [
    {
      "traceId": "trace-err123",
      "name": "code-generation-workflow",
      "startTime": "2025-01-15T10:00:00Z",
      "endTime": "2025-01-15T10:00:30Z",
      "duration": 30000,
      "status": "error",
      "error": {
        "code": "MODEL_TIMEOUT",
        "message": "Model inference timed out after 30000ms",
        "spanId": "span-002",
        "agent": "agent-brain-001"
      }
    }
  ],
  "total": 7,
  "errorRate": 0.013
}

Error Distribution

GET /api/v1/metrics/errors?timeRange=24h

Response:

{
  "timeRange": "24h",
  "totalErrors": 152,
  "errorRate": 0.012,
  "errorsByType": [
    {
      "errorCode": "MODEL_TIMEOUT",
      "count": 87,
      "percentage": 57.2,
      "affectedAgents": ["agent-brain-001", "agent-brain-002"]
    },
    {
      "errorCode": "AGENT_UNAVAILABLE",
      "count": 42,
      "percentage": 27.6,
      "affectedAgents": ["agent-mesh-001"]
    },
    {
      "errorCode": "VALIDATION_ERROR",
      "count": 23,
      "percentage": 15.1,
      "affectedAgents": ["agent-router-001"]
    }
  ],
  "recommendations": [
    "Increase MODEL_TIMEOUT threshold to 60000ms",
    "Add health check retries for agent-mesh-001",
    "Improve input validation schemas"
  ]
}
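Each `percentage` in `errorsByType` is that error code's share of `totalErrors`, rounded to one decimal place. The sketch below reproduces the figures from the counts; `withPercentages` is an illustrative name.

```typescript
interface ErrorCount {
  errorCode: string;
  count: number;
}

// Attach each error code's share of the total, rounded to one decimal place.
function withPercentages(errors: ErrorCount[]): Array<ErrorCount & { percentage: number }> {
  const total = errors.reduce((sum, e) => sum + e.count, 0);
  return errors.map((e) => ({
    ...e,
    percentage: total === 0 ? 0 : Math.round((e.count * 1000) / total) / 10,
  }));
}

const distribution = withPercentages([
  { errorCode: "MODEL_TIMEOUT", count: 87 },
  { errorCode: "AGENT_UNAVAILABLE", count: 42 },
  { errorCode: "VALIDATION_ERROR", count: 23 },
]);
console.log(distribution.map((d) => d.percentage)); // [ 57.2, 27.6, 15.1 ]
```

Because each value is rounded independently, the percentages can sum to slightly less or more than 100 (99.9 in this sample).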

5. Service Dependencies

Get Dependency Graph

GET /api/v1/dependencies?timeRange=1h

Response:

{
  "timeRange": "1h",
  "nodes": [
    {
      "serviceId": "agent-mesh-001",
      "serviceName": "Agent Mesh",
      "requestsHandled": 1046,
      "errorRate": 0.001
    },
    {
      "serviceId": "agent-brain-001",
      "serviceName": "Agent Brain",
      "requestsHandled": 523,
      "errorRate": 0.013
    },
    {
      "serviceId": "ollama-mcp",
      "serviceName": "Ollama MCP",
      "requestsHandled": 523,
      "errorRate": 0.015
    }
  ],
  "edges": [
    {
      "from": "agent-mesh-001",
      "to": "agent-brain-001",
      "requests": 523,
      "averageLatency": 2400,
      "errorRate": 0.013
    },
    {
      "from": "agent-brain-001",
      "to": "ollama-mcp",
      "requests": 523,
      "averageLatency": 2350,
      "errorRate": 0.015
    }
  ]
}
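The `edges` array is a directed graph, so questions such as "what does this service ultimately depend on?" reduce to a graph traversal. The sketch below runs a breadth-first search over the edges; `downstreamOf` is an illustrative helper, not an API endpoint.

```typescript
interface DependencyEdge {
  from: string;
  to: string;
}

// Return every service reachable from `serviceId` by following edges, in BFS order.
function downstreamOf(serviceId: string, edges: DependencyEdge[]): string[] {
  // Build an adjacency list: service -> services it calls directly.
  const adjacency = new Map<string, string[]>();
  for (const edge of edges) {
    const targets = adjacency.get(edge.from) ?? [];
    targets.push(edge.to);
    adjacency.set(edge.from, targets);
  }

  const seen = new Set<string>([serviceId]);
  const queue: string[] = [serviceId];
  const reachable: string[] = [];

  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of adjacency.get(current) ?? []) {
      // The seen set prevents infinite loops if the graph contains cycles.
      if (!seen.has(next)) {
        seen.add(next);
        reachable.push(next);
        queue.push(next);
      }
    }
  }
  return reachable;
}

const sampleEdges: DependencyEdge[] = [
  { from: "agent-mesh-001", to: "agent-brain-001" },
  { from: "agent-brain-001", to: "ollama-mcp" },
];
console.log(downstreamOf("agent-mesh-001", sampleEdges)); // [ 'agent-brain-001', 'ollama-mcp' ]
```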

Storage Backend

Agent Tracer uses TimescaleDB for efficient time-series storage:

Database: PostgreSQL with TimescaleDB extension

Retention:

- Raw traces: 7 days
- Aggregated metrics: 90 days
- Error logs: 30 days

Automatic Downsampling:

- 1-minute aggregations: 7 days
- 1-hour aggregations: 30 days
- 1-day aggregations: 90 days
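One practical consequence of the downsampling tiers is that the finest resolution available depends on how old the data is. The sketch below encodes the tiers listed above and picks the finest one that still covers a given age. It assumes the day counts are retention periods for each aggregation level, which is one reading of the list; `finestResolution` is an illustrative name.

```typescript
type Resolution = "1m" | "1h" | "1d";

// Tiers from the list above, ordered from finest to coarsest.
const AGGREGATION_TIERS: Array<{ resolution: Resolution; retentionDays: number }> = [
  { resolution: "1m", retentionDays: 7 },
  { resolution: "1h", retentionDays: 30 },
  { resolution: "1d", retentionDays: 90 },
];

// Finest aggregation still retained for data that is `ageDays` old, or null if expired.
function finestResolution(ageDays: number): Resolution | null {
  for (const tier of AGGREGATION_TIERS) {
    if (ageDays <= tier.retentionDays) {
      return tier.resolution;
    }
  }
  return null;
}

console.log(finestResolution(3), finestResolution(10), finestResolution(45), finestResolution(120));
// 1m 1h 1d null
```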


Alerting

Configure Alerts

POST /api/v1/alerts

Request:

{
  "name": "High Error Rate",
  "condition": {
    "metric": "errorRate",
    "threshold": 0.05,
    "operator": "greaterThan",
    "timeWindow": "5m"
  },
  "notifications": [
    {
      "type": "slack",
      "webhook": "https://hooks.slack.com/...",
      "channel": "#alerts"
    },
    {
      "type": "email",
      "recipients": ["ops@bluefly.io"]
    }
  ],
  "severity": "critical"
}

Response:

{
  "alertId": "alert-123",
  "name": "High Error Rate",
  "status": "active",
  "createdAt": "2025-01-15T10:10:00Z"
}
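To illustrate what the `condition` object expresses, the sketch below evaluates a condition against an observed metric value. Only `greaterThan` appears in the request above; `lessThan` is a hypothetical second operator added for illustration, and `conditionMet` is not part of the API.

```typescript
interface AlertCondition {
  metric: string;
  threshold: number;
  operator: "greaterThan" | "lessThan"; // "lessThan" is assumed, not documented
  timeWindow: string; // for example "5m"
}

// True when the observed value violates the condition and the alert should fire.
function conditionMet(condition: AlertCondition, observed: number): boolean {
  if (condition.operator === "greaterThan") {
    return observed > condition.threshold;
  }
  return observed < condition.threshold;
}

const highErrorRate: AlertCondition = {
  metric: "errorRate",
  threshold: 0.05,
  operator: "greaterThan",
  timeWindow: "5m",
};

console.log(conditionMet(highErrorRate, 0.013)); // false
console.log(conditionMet(highErrorRate, 0.06)); // true
```

With the sample error rate of 0.013 the alert stays quiet; it would fire only once the rate over the window exceeds 0.05.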

Integration

Instrument Your Agent

Add tracing to your agent code:

TypeScript:

import { AgentTracer } from '@bluefly/agent-tracer';

const tracer = new AgentTracer({
  serviceName: 'my-custom-agent',
  endpoint: 'http://tracer.local.bluefly.io/api/v1'
});

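// Task and processTask stand in for your agent's own types and logic.
async function executeTask(task: Task) {
  // Open a span that covers the whole task so its duration is recorded.
  const span = tracer.startSpan('execute-task', {
    tags: { taskType: task.type }
  });

  try {
    const result = await processTask(task);
    span.setTag('status', 'success');
    return result;
  } catch (error) {
    span.setTag('error', true);
    // Caught values are `unknown` under strict TypeScript, so narrow before reading the message.
    span.log({ errorMessage: error instanceof Error ? error.message : String(error) });
    throw error;
  } finally {
    // Always close the span, whether the task succeeded or failed.
    span.finish();
  }
}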

Health & Metrics

GET /api/v1/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "database": {
    "status": "healthy",
    "latency": 3,
    "connections": 12
  },
  "storage": {
    "tracesStored": 1523422,
    "spansStored": 15234220,
    "diskUsage": "42.3GB"
  }
}
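A deployment probe or dashboard might reduce the health response to a single readiness flag. The sketch below treats the service as ready only when both the top-level and database statuses are `healthy`, and derives the average spans per trace from the storage counters. The interface and function names are illustrative, and the readiness rule is an assumption rather than documented behavior.

```typescript
// Shape inferred from the sample response above.
interface HealthResponse {
  status: string;
  version: string;
  uptime: number;
  database: { status: string; latency: number; connections: number };
  storage: { tracesStored: number; spansStored: number; diskUsage: string };
}

// Assumed rule: ready only if the service and its database both report healthy.
function isReady(health: HealthResponse): boolean {
  return health.status === "healthy" && health.database.status === "healthy";
}

// Average number of spans per stored trace; 0 when nothing is stored yet.
function averageSpansPerTrace(health: HealthResponse): number {
  const { tracesStored, spansStored } = health.storage;
  return tracesStored === 0 ? 0 : spansStored / tracesStored;
}

const sampleHealth: HealthResponse = {
  status: "healthy",
  version: "1.0.0",
  uptime: 86400,
  database: { status: "healthy", latency: 3, connections: 12 },
  storage: { tracesStored: 1523422, spansStored: 15234220, diskUsage: "42.3GB" },
};

console.log(isReady(sampleHealth), averageSpansPerTrace(sampleHealth)); // true 10
```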

Next Steps