Agent Tracer API

Operations intelligence and distributed tracing for agent communication.

Overview

Service: Phoenix Arise Tracer
Port: 3100
Domain: tracer.local.bluefly.io
Protocol: REST + WebSocket
Version: 1.0.0
OpenAPI Spec: /technical-guide/openapi/agent-tracer/tracer.openapi.yaml

What It Does

Agent Tracer provides distributed tracing and performance monitoring:
- Distributed tracing: Track requests across the agent mesh
- Performance metrics: Latency, throughput, error rates
- Bottleneck analysis: Identify spans that run slower than expected
- Critical path analysis: Find the slowest execution path through a trace
- Real-time monitoring: Live metrics and alerts

Architecture

graph LR
    A[Agent Mesh] -->|Traces| T[Agent Tracer]
    B[Workflow Engine] -->|Traces| T
    C[LLM Gateway] -->|Traces| T
    T --> D[(TimescaleDB)]
    T --> E[Real-time Analytics]
    T --> F[Alerting]

Core Features

1. Trace Collection

List Traces

GET /api/v1/traces?limit=100&agentId=agent-brain-001&status=success

Query Parameters:
- limit (integer, default: 100) - Maximum traces to return
- agentId (string) - Filter by agent ID
- status (enum: success, error, timeout) - Filter by status
- startTime (ISO 8601) - Start time filter
- endTime (ISO 8601) - End time filter

Response:

{
  "traces": [
    {
      "traceId": "trace-abc123",
      "name": "code-generation-workflow",
      "startTime": "2025-01-15T10:00:00Z",
      "endTime": "2025-01-15T10:00:02.5Z",
      "duration": 2500,
      "status": "success",
      "spans": 12,
      "agents": ["agent-brain-001", "agent-mesh-001"],
      "tags": {
        "workflow": "code-generation",
        "priority": "normal"
      }
    }
  ],
  "total": 523,
  "page": 1,
  "pageSize": 100
}
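A minimal sketch of how a client might assemble the List Traces request from the documented query parameters. The base URL is taken from the endpoint used in the TypeScript integration example; `buildListTracesUrl` and `listTraces` are hypothetical helper names, not part of a published client.

```typescript
// Query parameters documented for GET /api/v1/traces.
interface ListTracesQuery {
  limit?: number;
  agentId?: string;
  status?: "success" | "error" | "timeout";
  startTime?: string; // ISO 8601
  endTime?: string; // ISO 8601
}

// Assumption: same host as the endpoint in the integration example.
const TRACER_BASE_URL = "http://tracer.local.bluefly.io";

// Build the request URL, skipping parameters that were not provided.
function buildListTracesUrl(
  query: ListTracesQuery,
  baseUrl: string = TRACER_BASE_URL
): string {
  const requestUrl = new URL("/api/v1/traces", baseUrl);
  for (const [key, value] of Object.entries(query)) {
    if (value !== undefined) {
      requestUrl.searchParams.set(key, String(value));
    }
  }
  return requestUrl.toString();
}

// Hypothetical helper: fetch one page of traces and return the parsed body.
async function listTraces(query: ListTracesQuery): Promise<unknown> {
  const response = await fetch(buildListTracesUrl(query));
  if (!response.ok) {
    throw new Error(`List traces failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Calling `buildListTracesUrl({ limit: 100, agentId: "agent-brain-001", status: "success" })` yields the same path and query string as the request shown above.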

Get Trace Details

GET /api/v1/traces/{traceId}

Response:

{
  "traceId": "trace-abc123",
  "name": "code-generation-workflow",
  "startTime": "2025-01-15T10:00:00Z",
  "endTime": "2025-01-15T10:00:02.5Z",
  "duration": 2500,
  "status": "success",
  "spans": [
    {
      "spanId": "span-001",
      "name": "task-routing",
      "startTime": "2025-01-15T10:00:00Z",
      "duration": 50,
      "agent": "agent-mesh-001",
      "tags": {
        "operation": "discover-agent"
      }
    },
    {
      "spanId": "span-002",
      "name": "code-generation",
      "startTime": "2025-01-15T10:00:00.05Z",
      "duration": 2400,
      "agent": "agent-brain-001",
      "tags": {
        "model": "qwen2.5-coder:32b",
        "tokens": 1523
      }
    },
    {
      "spanId": "span-003",
      "name": "response-formatting",
      "startTime": "2025-01-15T10:00:02.45Z",
      "duration": 50,
      "agent": "agent-mesh-001",
      "tags": {
        "format": "json"
      }
    }
  ],
  "metrics": {
    "totalDuration": 2500,
    "networkTime": 100,
    "processingTime": 2400,
    "overhead": 4
  }
}
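To relate the span fields to the trace as a whole, the sketch below derives each span's offset from the trace start and its share of the total duration. It assumes `duration` values are milliseconds, which matches the example (a 2.5 s trace reports 2500). `summarizeSpans` is a hypothetical helper, not an API of the service.

```typescript
interface TraceSpan {
  spanId: string;
  name: string;
  startTime: string; // ISO 8601
  duration: number; // milliseconds, as in the example response
}

interface SpanTiming {
  spanId: string;
  offsetMs: number; // start of the span relative to the trace start
  endOffsetMs: number; // end of the span relative to the trace start
  shareOfTotal: number; // fraction of the trace duration, 0..1
}

function summarizeSpans(
  traceStart: string,
  totalDuration: number,
  spans: TraceSpan[]
): SpanTiming[] {
  const traceStartMs = Date.parse(traceStart);
  return spans.map((span) => {
    const offsetMs = Date.parse(span.startTime) - traceStartMs;
    return {
      spanId: span.spanId,
      offsetMs,
      endOffsetMs: offsetMs + span.duration,
      shareOfTotal: totalDuration > 0 ? span.duration / totalDuration : 0,
    };
  });
}
```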

2. Performance Analysis

Get Critical Path

Find the slowest execution path through a trace.

GET /api/v1/traces/{traceId}/critical-path

Response:

{
  "traceId": "trace-abc123",
  "criticalPath": [
    {
      "spanId": "span-002",
      "name": "code-generation",
      "duration": 2400,
      "percentOfTotal": 96
    },
    {
      "spanId": "span-001",
      "name": "task-routing",
      "duration": 50,
      "percentOfTotal": 2
    },
    {
      "spanId": "span-003",
      "name": "response-formatting",
      "duration": 50,
      "percentOfTotal": 2
    }
  ],
  "totalDuration": 2500,
  "criticalPathDuration": 2500,
  "percentOfTotal": 100
}
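The `percentOfTotal` values in this response are consistent with each span's duration divided by the trace's total duration, rounded to a whole percent. The sketch below reproduces only that ranking and arithmetic for spans already known to lie on the path. It does not determine the path itself, which would need parent/child relationships between spans that the responses here do not show. `rankPathSpans` is a hypothetical name.

```typescript
interface TimedSpan {
  spanId: string;
  name: string;
  duration: number; // milliseconds
}

interface PathEntry extends TimedSpan {
  percentOfTotal: number;
}

// Sort spans from slowest to fastest and attach their share of the total.
function rankPathSpans(spans: TimedSpan[], totalDuration: number): PathEntry[] {
  if (totalDuration <= 0) {
    throw new RangeError("totalDuration must be positive");
  }
  return [...spans]
    .sort((a, b) => b.duration - a.duration)
    .map((span) => ({
      spanId: span.spanId,
      name: span.name,
      duration: span.duration,
      percentOfTotal: Math.round((span.duration * 100) / totalDuration),
    }));
}
```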

Identify Bottlenecks

GET /api/v1/traces/{traceId}/bottlenecks

Response:

{
  "traceId": "trace-abc123",
  "bottlenecks": [
    {
      "spanId": "span-002",
      "name": "code-generation",
      "duration": 2400,
      "expectedDuration": 1500,
      "slowdownFactor": 1.6,
      "severity": "medium",
      "recommendations": [
        "Consider using smaller model (deepseek-coder-v2:16b)",
        "Enable response streaming",
        "Add caching layer"
      ]
    }
  ],
  "totalBottlenecks": 1,
  "overallPerformance": "acceptable"
}
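In the example, `slowdownFactor` equals `duration / expectedDuration` (2400 / 1500 = 1.6). A sketch of that calculation follows. The `minSlowdown` cutoff and the function name are assumptions made for illustration; the service's actual thresholds and severity rules are not documented here.

```typescript
interface SpanBaseline {
  spanId: string;
  name: string;
  duration: number; // observed, milliseconds
  expectedDuration: number; // baseline, milliseconds
}

interface BottleneckCandidate extends SpanBaseline {
  slowdownFactor: number;
}

// Flag spans whose observed duration exceeds the baseline by at least minSlowdown.
// The default of 1.5 is a hypothetical cutoff, not a documented value.
function findBottleneckCandidates(
  spans: SpanBaseline[],
  minSlowdown: number = 1.5
): BottleneckCandidate[] {
  return spans
    .filter((span) => span.expectedDuration > 0)
    .map((span) => ({
      ...span,
      slowdownFactor: Math.round((span.duration / span.expectedDuration) * 100) / 100,
    }))
    .filter((candidate) => candidate.slowdownFactor >= minSlowdown);
}
```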

Get Aggregated Metrics

GET /api/v1/metrics/aggregate?timeRange=1h&groupBy=agent

Query Parameters:
- timeRange (string) - Time range (1h, 24h, 7d, 30d)
- groupBy (enum: agent, operation, status) - Aggregation key
- metric (enum: duration, throughput, errors) - Metric to aggregate

Response:

{
  "timeRange": "1h",
  "startTime": "2025-01-15T09:00:00Z",
  "endTime": "2025-01-15T10:00:00Z",
  "aggregations": [
    {
      "key": "agent-brain-001",
      "metrics": {
        "totalRequests": 523,
        "successRate": 0.987,
        "averageDuration": 2350,
        "p50Duration": 2200,
        "p95Duration": 3500,
        "p99Duration": 4200,
        "errorRate": 0.013,
        "throughput": 145.3
      }
    },
    {
      "key": "agent-mesh-001",
      "metrics": {
        "totalRequests": 1046,
        "successRate": 0.999,
        "averageDuration": 45,
        "p50Duration": 40,
        "p95Duration": 80,
        "p99Duration": 120,
        "errorRate": 0.001,
        "throughput": 290.5
      }
    }
  ]
}
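For readers unfamiliar with the percentile fields, the sketch below computes an average and p50/p95/p99 from raw durations using the nearest-rank method. This is one common definition, chosen here as an assumption; the service may use a different estimator. `percentile` and `summarizeDurations` are hypothetical names.

```typescript
// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sortedAscending: number[], p: number): number {
  if (sortedAscending.length === 0) {
    throw new RangeError("at least one sample is required");
  }
  if (p < 0 || p > 100) {
    throw new RangeError("p must be between 0 and 100");
  }
  const rank = Math.max(1, Math.ceil((p / 100) * sortedAscending.length));
  return sortedAscending[rank - 1];
}

interface DurationSummary {
  totalRequests: number;
  averageDuration: number;
  p50Duration: number;
  p95Duration: number;
  p99Duration: number;
}

function summarizeDurations(durations: number[]): DurationSummary {
  const sorted = [...durations].sort((a, b) => a - b);
  const sum = sorted.reduce((total, value) => total + value, 0);
  return {
    totalRequests: sorted.length,
    averageDuration: sum / sorted.length,
    p50Duration: percentile(sorted, 50),
    p95Duration: percentile(sorted, 95),
    p99Duration: percentile(sorted, 99),
  };
}
```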

3. Real-time Monitoring

WebSocket Stream

Subscribe to real-time trace events:

WebSocket: ws://tracer.local.bluefly.io/api/v1/traces/stream

Subscribe Message:

{
  "action": "subscribe",
  "filters": {
    "agents": ["agent-brain-001"],
    "minDuration": 1000,
    "status": ["error", "timeout"]
  }
}

Stream Events:

{
  "type": "trace-completed",
  "traceId": "trace-xyz789",
  "timestamp": "2025-01-15T10:05:00Z",
  "data": {
    "name": "code-generation-workflow",
    "duration": 3200,
    "status": "timeout",
    "spans": 8,
    "bottlenecks": 1
  }
}
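A minimal sketch of a stream client: it sends the subscribe message once the socket opens and hands parsed events to a callback. `SocketLike` is a deliberately small stand-in interface so the logic can be exercised without a live connection; `attachSubscription`, `buildSubscribeMessage`, and `parseStreamEvent` are hypothetical helper names.

```typescript
type TraceStatus = "success" | "error" | "timeout";

interface StreamFilters {
  agents?: string[];
  minDuration?: number; // milliseconds
  status?: TraceStatus[];
}

interface StreamEvent {
  type: string;
  traceId: string;
  timestamp: string;
  data: Record<string, unknown>;
}

// Minimal surface of a WebSocket that this sketch relies on.
interface SocketLike {
  send(data: string): void;
  addEventListener(
    type: "open" | "message",
    listener: (event: { data?: unknown }) => void
  ): void;
}

function buildSubscribeMessage(filters: StreamFilters): string {
  return JSON.stringify({ action: "subscribe", filters });
}

// Return the event, or null when the payload is not a recognizable event.
function parseStreamEvent(raw: string): StreamEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) {
    return null;
  }
  const candidate = parsed as Partial<StreamEvent>;
  if (typeof candidate.type !== "string" || typeof candidate.traceId !== "string") {
    return null;
  }
  return candidate as StreamEvent;
}

function attachSubscription(
  socket: SocketLike,
  filters: StreamFilters,
  onEvent: (event: StreamEvent) => void
): void {
  socket.addEventListener("open", () => socket.send(buildSubscribeMessage(filters)));
  socket.addEventListener("message", (message) => {
    const parsedEvent = parseStreamEvent(String(message.data));
    if (parsedEvent !== null) {
      onEvent(parsedEvent);
    }
  });
}

// Usage, assuming a runtime that provides a global WebSocket:
// attachSubscription(
//   new WebSocket("ws://tracer.local.bluefly.io/api/v1/traces/stream"),
//   { agents: ["agent-brain-001"], minDuration: 1000, status: ["error", "timeout"] },
//   (streamEvent) => console.log(streamEvent.traceId, streamEvent.data)
// );
```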

Get Live Metrics

GET /api/v1/metrics/live

Response:

{
  "timestamp": "2025-01-15T10:05:00Z",
  "metrics": {
    "requestsPerSecond": 145.3,
    "averageLatency": 2350,
    "errorRate": 0.013,
    "activeTraces": 23,
    "queuedRequests": 5
  },
  "topAgents": [
    {
      "agentId": "agent-brain-001",
      "requestsPerSecond": 87.2,
      "averageLatency": 2400
    },
    {
      "agentId": "agent-mesh-001",
      "requestsPerSecond": 174.4,
      "averageLatency": 45
    }
  ]
}

4. Error Analysis

Get Error Traces

GET /api/v1/traces?status=error&limit=50

Response:

{
  "traces": [
    {
      "traceId": "trace-err123",
      "name": "code-generation-workflow",
      "startTime": "2025-01-15T10:00:00Z",
      "endTime": "2025-01-15T10:00:01Z",
      "duration": 1000,
      "status": "error",
      "error": {
        "code": "MODEL_TIMEOUT",
        "message": "Model inference timed out after 30000ms",
        "spanId": "span-002",
        "agent": "agent-brain-001"
      }
    }
  ],
  "total": 7,
  "errorRate": 0.013
}

Error Distribution

GET /api/v1/metrics/errors?timeRange=24h

Response:

{
  "timeRange": "24h",
  "totalErrors": 152,
  "errorRate": 0.012,
  "errorsByType": [
    {
      "errorCode": "MODEL_TIMEOUT",
      "count": 87,
      "percentage": 57.2,
      "affectedAgents": ["agent-brain-001", "agent-brain-002"]
    },
    {
      "errorCode": "AGENT_UNAVAILABLE",
      "count": 42,
      "percentage": 27.6,
      "affectedAgents": ["agent-mesh-001"]
    },
    {
      "errorCode": "VALIDATION_ERROR",
      "count": 23,
      "percentage": 15.1,
      "affectedAgents": ["agent-router-001"]
    }
  ],
  "recommendations": [
    "Increase MODEL_TIMEOUT threshold to 60000ms",
    "Add health check retries for agent-mesh-001",
    "Improve input validation schemas"
  ]
}
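The `percentage` fields above are each error code's share of `totalErrors`, rounded to one decimal place (for example 87 / 152 ≈ 57.2). A sketch of that arithmetic, with `toErrorDistribution` as a hypothetical helper name:

```typescript
interface ErrorCount {
  errorCode: string;
  count: number;
}

interface ErrorShare extends ErrorCount {
  percentage: number; // share of all errors, one decimal place
}

// Convert raw counts into shares of the total, most frequent first.
function toErrorDistribution(counts: ErrorCount[]): ErrorShare[] {
  const totalErrors = counts.reduce((sum, entry) => sum + entry.count, 0);
  if (totalErrors === 0) {
    return [];
  }
  return [...counts]
    .sort((a, b) => b.count - a.count)
    .map((entry) => ({
      errorCode: entry.errorCode,
      count: entry.count,
      percentage: Math.round((entry.count / totalErrors) * 1000) / 10,
    }));
}
```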

5. Service Dependencies

Get Dependency Graph

GET /api/v1/dependencies?timeRange=1h

Response:

{
  "timeRange": "1h",
  "nodes": [
    {
      "serviceId": "agent-mesh-001",
      "serviceName": "Agent Mesh",
      "requestsHandled": 1046,
      "errorRate": 0.001
    },
    {
      "serviceId": "agent-brain-001",
      "serviceName": "Agent Brain",
      "requestsHandled": 523,
      "errorRate": 0.013
    },
    {
      "serviceId": "ollama-mcp",
      "serviceName": "Ollama MCP",
      "requestsHandled": 523,
      "errorRate": 0.015
    }
  ],
  "edges": [
    {
      "from": "agent-mesh-001",
      "to": "agent-brain-001",
      "requests": 523,
      "averageLatency": 2400,
      "errorRate": 0.013
    },
    {
      "from": "agent-brain-001",
      "to": "ollama-mcp",
      "requests": 523,
      "averageLatency": 2350,
      "errorRate": 0.015
    }
  ]
}
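One way to use the `edges` array is to walk it and list every service that sits downstream of a given node, for example to see what is affected when that node degrades. The breadth-first traversal below is a sketch; `downstreamServices` is a hypothetical helper, not an endpoint of the service.

```typescript
interface DependencyEdge {
  from: string;
  to: string;
  requests: number;
  averageLatency: number; // milliseconds
  errorRate: number;
}

// Breadth-first walk over the edges, returning services reachable from start.
function downstreamServices(edges: DependencyEdge[], start: string): string[] {
  const adjacency = new Map<string, string[]>();
  for (const edge of edges) {
    const targets = adjacency.get(edge.from) ?? [];
    targets.push(edge.to);
    adjacency.set(edge.from, targets);
  }

  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const reachable: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of adjacency.get(current) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        reachable.push(next);
        queue.push(next);
      }
    }
  }
  return reachable;
}
```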

Storage Backend

Agent Tracer uses TimescaleDB for efficient time-series storage:

Database: PostgreSQL with TimescaleDB extension

Retention:
- Raw traces: 7 days
- Aggregated metrics: 90 days
- Error logs: 30 days

Automatic Downsampling:
- 1-minute aggregations: 7 days
- 1-hour aggregations: 30 days
- 1-day aggregations: 90 days
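A sketch of how a dashboard might pick the finest aggregation still available for data of a given age. It assumes the periods listed under Automatic Downsampling are how long each aggregation level is kept, and that the limits are inclusive; both are interpretations, not documented behavior. `finestResolutionFor` is a hypothetical name.

```typescript
type Resolution = "1m" | "1h" | "1d";

// Tiers ordered from finest to coarsest, using the periods listed above.
const DOWNSAMPLING_TIERS: ReadonlyArray<{ resolution: Resolution; retentionDays: number }> = [
  { resolution: "1m", retentionDays: 7 },
  { resolution: "1h", retentionDays: 30 },
  { resolution: "1d", retentionDays: 90 },
];

// Return the finest resolution that still covers data of the given age,
// or null when the data is older than every tier.
function finestResolutionFor(ageDays: number): Resolution | null {
  if (ageDays < 0) {
    throw new RangeError("ageDays must not be negative");
  }
  for (const tier of DOWNSAMPLING_TIERS) {
    if (ageDays <= tier.retentionDays) {
      return tier.resolution;
    }
  }
  return null;
}
```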

Alerting

Configure Alerts

POST /api/v1/alerts

Request:

{
  "name": "High Error Rate",
  "condition": {
    "metric": "errorRate",
    "threshold": 0.05,
    "operator": "greaterThan",
    "timeWindow": "5m"
  },
  "notifications": [
    {
      "type": "slack",
      "webhook": "https://hooks.slack.com/...",
      "channel": "#alerts"
    },
    {
      "type": "email",
      "recipients": ["ops@bluefly.io"]
    }
  ],
  "severity": "critical"
}

Response:

{
  "alertId": "alert-123",
  "name": "High Error Rate",
  "status": "active",
  "createdAt": "2025-01-15T10:10:00Z"
}
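To illustrate how a condition such as the one above could be evaluated, the sketch below parses the `timeWindow` string and compares an observed value against the threshold. Only `greaterThan` and the `5m` window appear in this document; the `lessThan` operator and the `h`/`d` window suffixes are assumptions added for illustration, as are the function names.

```typescript
// Only "greaterThan" is shown in the request above; "lessThan" is hypothetical.
type AlertOperator = "greaterThan" | "lessThan";

interface AlertCondition {
  metric: string;
  threshold: number;
  operator: AlertOperator;
  timeWindow: string; // e.g. "5m"
}

// Convert a window such as "5m" into milliseconds.
// Assumption: windows are an integer followed by m, h, or d.
function parseTimeWindow(value: string): number {
  const match = /^(\d+)([mhd])$/.exec(value);
  if (match === null) {
    throw new Error(`Unsupported time window: ${value}`);
  }
  const unitMs: Record<string, number> = {
    m: 60 * 1000,
    h: 60 * 60 * 1000,
    d: 24 * 60 * 60 * 1000,
  };
  return Number(match[1]) * unitMs[match[2]];
}

// Decide whether an observed metric value violates the condition.
function isAlertTriggered(condition: AlertCondition, observed: number): boolean {
  return condition.operator === "greaterThan"
    ? observed > condition.threshold
    : observed < condition.threshold;
}
```

With the example condition (threshold 0.05, `greaterThan`), the live `errorRate` of 0.013 shown earlier would not trigger the alert.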

Integration

Instrument Your Agent

Add tracing to your agent code:

TypeScript:

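import { AgentTracer } from '@bluefly/agent-tracer';

// Task and processTask stand in for your agent's own task type and handler.
const tracer = new AgentTracer({
  serviceName: 'my-custom-agent',
  endpoint: 'http://tracer.local.bluefly.io/api/v1'
});

async function executeTask(task: Task) {
  // Open a span that covers the whole task execution.
  const span = tracer.startSpan('execute-task', {
    tags: { taskType: task.type }
  });

  try {
    const result = await processTask(task);
    span.setTag('status', 'success');
    return result;
  } catch (error) {
    // Caught values are typed unknown under strict TypeScript, so narrow before reading .message.
    const errorMessage = error instanceof Error ? error.message : String(error);
    span.setTag('error', true);
    span.log({ errorMessage });
    throw error;
  } finally {
    // Always finish the span, on success or failure.
    span.finish();
  }
}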

Health & Metrics

GET /api/v1/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "database": {
    "status": "healthy",
    "latency": 3,
    "connections": 12
  },
  "storage": {
    "tracesStored": 1523422,
    "spansStored": 15234220,
    "diskUsage": "42.3GB"
  }
}
