Agent Mesh - Architecture

Mesh Networking, Service Discovery, and Distributed Coordination Architecture

Overview

Agent Mesh implements a distributed, peer-to-peer architecture for coordinating autonomous agents across networks. Built on Tailscale's encrypted mesh networking with MagicDNS for zero-configuration service discovery, the architecture enables secure, scalable, and fault-tolerant agent coordination.

System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Agent Mesh Control Plane                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Coordinator Service                            │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │ Task Queue   │  │ Load Balancer│  │ Fault Handler│            │  │
│  │  │ Management   │  │  - RoundRobin│  │ - Retry Logic│            │  │
│  │  │              │  │  - LeastLoad │  │ - Failover   │            │  │
│  │  │              │  │  - CapMatch  │  │ - Timeout    │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Discovery Service                              │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │  Agent       │  │  Health      │  │  Namespace   │            │  │
│  │  │  Registry    │  │  Monitor     │  │  Isolation   │            │  │
│  │  │              │  │  - Heartbeat │  │  - Groups    │            │  │
│  │  │              │  │  - Status    │  │  - ACLs      │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Transport Service                              │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │  Message     │  │  RPC         │  │  Streaming   │            │  │
│  │  │  Delivery    │  │  Handler     │  │  Support     │            │  │
│  │  │              │  │              │  │              │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Auth Service                                   │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │  JWT         │  │  ACL         │  │  Audit       │            │  │
│  │  │  Management  │  │  Policies    │  │  Logging     │            │  │
│  │  │              │  │              │  │              │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
└───────────────────────────────┬───────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    Tailscale Network Layer                              │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  MagicDNS: agent-mesh-production-worker-1.tailnet.ts.net         │  │
│  │  WireGuard Encryption: End-to-end encrypted mesh network         │  │
│  │  Automatic NAT Traversal: Direct peer-to-peer connections        │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Agent Data Plane                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐       │
│  │Orchestrator│  │  Worker 1  │  │  Worker 2  │  │  Monitor   │       │
│  │   Agent    │  │   Agent    │  │   Agent    │  │   Agent    │       │
│  │            │  │            │  │            │  │            │       │
│  │ - Planning │  │ - Execution│  │ - Execution│  │ - Metrics  │       │
│  │ - Workflow │  │ - Results  │  │ - Results  │  │ - Health   │       │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘       │
│                                                                           │
└─────────────────────────────────────────────────────────────────────────┘

Core Components

1. Discovery Service

Purpose: Automatic agent registration and discovery without manual configuration.

Key Features:
  - Tailscale MagicDNS integration for service discovery
  - Agent identity management with capabilities
  - Namespace-based isolation
  - Health checking and heartbeat monitoring
  - Agent lifecycle management

Architecture:

interface MeshAgent {
  identity: {
    agentId: string;
    agentName: string;
    agentType: 'orchestrator' | 'worker' | 'monitor' | 'integrator' | 'governor' | 'critic';
    namespace: string;
    capabilities: string[];
    version: string;
  };
  tailscaleHostname: string;
  tailscaleIP: string;
  port: number;
  endpoint: string;
  healthStatus: 'healthy' | 'degraded' | 'unreachable';
  lastSeen: Date;
  registeredAt: Date;
}

Discovery Flow:
  1. Agent registers with Discovery Service
  2. Service registers agent with Tailscale MagicDNS
  3. Agent receives unique hostname (e.g., agent-mesh-prod-worker-1.tailnet.ts.net)
  4. Heartbeat monitoring starts automatically
  5. Other agents discover via capability-based queries
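The capability-based query in step 5 could look roughly like the sketch below. It is illustrative only: `AgentRecord` is a trimmed-down stand-in for `MeshAgent`, `findCapableAgents` is a hypothetical helper name, and the choice to include degraded agents while excluding unreachable ones is an assumption rather than documented behavior.

```typescript
// Trimmed-down view of MeshAgent: only the fields this query needs.
interface AgentRecord {
  agentId: string;
  namespace: string;
  capabilities: string[];
  healthStatus: 'healthy' | 'degraded' | 'unreachable';
}

// Hypothetical helper: return agents in one namespace that advertise every
// required capability and are not currently marked unreachable (assumption).
function findCapableAgents(
  registry: AgentRecord[],
  namespace: string,
  required: string[],
): AgentRecord[] {
  return registry.filter(
    (agent) =>
      agent.namespace === namespace &&
      agent.healthStatus !== 'unreachable' &&
      required.every((cap) => agent.capabilities.indexOf(cap) !== -1),
  );
}
```

Filtering on namespace first keeps the query consistent with the namespace-based isolation listed under Key Features.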

Health Monitoring:
  - Healthy: Last seen < 1 minute ago
  - Degraded: Last seen 1-5 minutes ago
  - Unreachable: Last seen > 5 minutes ago
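The thresholds above map directly onto a small classification function. This is a minimal sketch; `classifyHealth` is a hypothetical name, and how the exact 1-minute and 5-minute boundaries are treated is an assumption, since the thresholds leave that open.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unreachable';

const ONE_MINUTE_MS = 60 * 1000;
const FIVE_MINUTES_MS = 5 * ONE_MINUTE_MS;

// Classify an agent from the age of its last heartbeat.
// Assumption: exactly 1 minute counts as degraded, exactly 5 minutes still counts as degraded.
function classifyHealth(lastSeen: Date, now: Date = new Date()): HealthStatus {
  const ageMs = now.getTime() - lastSeen.getTime();
  if (ageMs < ONE_MINUTE_MS) return 'healthy';
  if (ageMs <= FIVE_MINUTES_MS) return 'degraded';
  return 'unreachable';
}
```

Passing `now` explicitly keeps the function deterministic, which makes the thresholds easy to unit test.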

2. Coordinator Service

Purpose: Intelligent task distribution and load balancing across agent mesh.

Key Features:
  - Task queue management with priority levels
  - Agent capability matching
  - Multiple load balancing strategies
  - Fault tolerance with automatic retry
  - Task status tracking and result retrieval
  - Workload analytics per agent

Architecture:

interface Task {
  taskId: string;
  taskType: string;
  payload: any;
  requiredCapabilities: string[];
  priority: 'low' | 'medium' | 'high' | 'critical';
  timeout: number;
  retries: number;
  metadata?: Record<string, any>;
}

interface TaskAssignment {
  taskId: string;
  agentId: string;
  assignedAt: Date;
  status: 'pending' | 'running' | 'completed' | 'failed' | 'timeout';
  result?: any;
  error?: string;
}

Task Routing Flow:
  1. Task submitted to Coordinator
  2. Coordinator analyzes required capabilities
  3. Discovery Service queried for capable agents
  4. Load balancing strategy applied
  5. Task assigned to selected agent
  6. Transport Service delivers task
  7. Agent executes and returns result
  8. Coordinator updates task status
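Before routing begins, submitted tasks wait in a queue ordered by the four priority levels of `Task`. One possible ordering is sketched below; the `submittedAt` field, the FIFO tie-break within a level, and the `dequeue` helper are assumptions added for illustration and do not appear in the interfaces above.

```typescript
type Priority = 'low' | 'medium' | 'high' | 'critical';

// Minimal queue entry; submittedAt (ms since epoch) is a hypothetical field.
interface QueuedTask {
  taskId: string;
  priority: Priority;
  submittedAt: number;
}

// Lower rank is served first.
const PRIORITY_RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Order by priority, then by submission time (FIFO within one level).
function compareTasks(a: QueuedTask, b: QueuedTask): number {
  const byPriority = PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority];
  return byPriority !== 0 ? byPriority : a.submittedAt - b.submittedAt;
}

// Remove and return the next task to route, or undefined if the queue is empty.
function dequeue(queue: QueuedTask[]): QueuedTask | undefined {
  queue.sort(compareTasks);
  return queue.shift();
}
```

Sorting on every dequeue is fine for a sketch; a binary heap would be the usual choice once the queue grows large.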

Load Balancing Strategies:

Round-Robin

Least-Loaded

Capability-Match

Random
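The four strategies named above are not spelled out here, so the sketch below shows one plausible reading of each. Everything in it is an assumption: the `Candidate` shape, the `activeTasks` load counter, and the interpretation of Capability-Match as "most required capabilities covered". All functions expect a non-empty candidate list.

```typescript
// Hypothetical candidate shape; activeTasks is an assumed per-agent load counter.
interface Candidate {
  agentId: string;
  capabilities: string[];
  activeTasks: number;
}

// Round-Robin: cycle through candidates using a counter kept between calls.
function createRoundRobin(): (agents: Candidate[]) => Candidate {
  let next = 0;
  return (agents) => agents[next++ % agents.length];
}

// Least-Loaded: pick the agent with the fewest active tasks.
function leastLoaded(agents: Candidate[]): Candidate {
  return agents.reduce((best, a) => (a.activeTasks < best.activeTasks ? a : best));
}

// Capability-Match: pick the agent covering the most required capabilities.
function capabilityMatch(agents: Candidate[], required: string[]): Candidate {
  const score = (a: Candidate): number =>
    required.filter((cap) => a.capabilities.indexOf(cap) !== -1).length;
  return agents.reduce((best, a) => (score(a) > score(best) ? a : best));
}

// Random: uniform pick; the random source is injectable so tests can be deterministic.
function randomPick(agents: Candidate[], rng: () => number = Math.random): Candidate {
  return agents[Math.floor(rng() * agents.length)];
}
```

Ties in `leastLoaded` and `capabilityMatch` resolve to the earliest candidate in the list, which keeps selection stable for a given input order.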

3. Transport Service

Purpose: Secure, reliable agent-to-agent communication over Tailscale network.

Key Features:
  - HTTP/HTTPS over Tailscale encrypted mesh
  - Request/response messaging patterns
  - Streaming support for large payloads
  - Broadcast messaging to multiple agents
  - Connection pooling and reuse
  - Automatic retry with exponential backoff
  - Connection statistics and monitoring

Architecture:

interface Message {
  messageId: string;
  from: string;        // Source agent ID
  to: string;          // Target agent ID
  type: 'request' | 'response' | 'event' | 'stream';
  payload: any;
  timestamp: Date;
  correlationId?: string;
  metadata?: Record<string, any>;
}

interface TransportConfig {
  timeout: number;              // Default: 30000ms
  retries: number;              // Default: 3
  retryDelay: number;           // Default: 1000ms
  maxConcurrentConnections: number;  // Default: 10
  keepAlive: boolean;           // Default: true
  useTLS: boolean;              // Default: false (Tailscale provides encryption)
}
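To make the retry settings concrete, the sketch below computes an exponential backoff schedule from `retries` and `retryDelay`. The doubling factor and the 30-second cap are assumptions; the configuration above only states that retries use exponential backoff.

```typescript
// Subset of TransportConfig relevant to retries.
interface RetryPolicy {
  retries: number;    // maximum number of retry attempts
  retryDelay: number; // base delay in milliseconds
}

// Delay before retry number `attempt` (1-based): retryDelay * 2^(attempt - 1),
// capped at maxDelayMs. Both the factor of 2 and the cap are assumptions.
function backoffDelay(attempt: number, policy: RetryPolicy, maxDelayMs: number = 30000): number {
  return Math.min(policy.retryDelay * Math.pow(2, attempt - 1), maxDelayMs);
}

// Full list of delays a request would wait through before giving up.
function backoffSchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= policy.retries; attempt++) {
    delays.push(backoffDelay(attempt, policy));
  }
  return delays;
}
```

With the documented defaults (3 retries, 1000 ms base delay), this schedule is 1000 ms, 2000 ms, then 4000 ms. Production implementations often add random jitter so that many agents do not retry in lockstep.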

Communication Patterns:

Request/Response

Agent A → [REQUEST] → Agent B
Agent A ← [RESPONSE] ← Agent B
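A request and its response can be paired through `correlationId`. The sketch below assumes the response carries the request's `messageId` as its `correlationId`, which the `Message` interface permits but does not mandate; `MeshMessage`, `nextId`, `createRequest`, and `createResponse` are hypothetical names.

```typescript
// Reduced form of the Message interface, limited to request/response traffic.
interface MeshMessage {
  messageId: string;
  from: string;
  to: string;
  type: 'request' | 'response';
  payload: unknown;
  timestamp: Date;
  correlationId?: string;
}

// Hypothetical ID generator; a real system would more likely use UUIDs.
let messageCounter = 0;
function nextId(): string {
  messageCounter += 1;
  return `msg-${messageCounter}`;
}

function createRequest(from: string, to: string, payload: unknown): MeshMessage {
  return { messageId: nextId(), from, to, type: 'request', payload, timestamp: new Date() };
}

// The response swaps sender and receiver and points back at the request.
function createResponse(request: MeshMessage, payload: unknown): MeshMessage {
  return {
    messageId: nextId(),
    from: request.to,
    to: request.from,
    type: 'response',
    payload,
    timestamp: new Date(),
    correlationId: request.messageId,
  };
}
```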

Event Broadcasting

Coordinator → [EVENT] → Agent 1
              [EVENT] → Agent 2
              [EVENT] → Agent 3

Streaming

Agent A → [CHUNK 1] → Agent B
        → [CHUNK 2] →
        → [CHUNK 3] →
        → [COMPLETE] →

Transport Flow:
  1. Coordinator creates message with metadata
  2. Transport Service resolves target agent endpoint
  3. HTTP client created/reused from connection pool
  4. Message sent over Tailscale encrypted network
  5. Target agent receives and processes
  6. Response returned via same channel
  7. Connection statistics updated

4. Auth Service

Purpose: Zero-trust authentication and authorization for agent-to-agent communication.

Key Features:
  - JWT-based agent authentication
  - Role-based access control (RBAC)
  - ACL policies per agent type
  - Permission checking at API level
  - Security audit logging
  - Token revocation support

Architecture:

interface AuthToken {
  agentId: string;
  agentType: string;
  namespace: string;
  capabilities: string[];
  issuedAt: number;
  expiresAt: number;
}

interface Permission {
  resource: string;
  action: 'read' | 'write' | 'execute' | 'admin';
  namespace?: string;
}

interface ACLPolicy {
  agentType: string;
  allowedActions: string[];
  allowedResources: string[];
  deniedActions?: string[];
  deniedResources?: string[];
}

Default ACL Policies:

| Agent Type   | Allowed Actions    | Allowed Resources          | Denied Actions        |
|--------------|--------------------|----------------------------|-----------------------|
| orchestrator | * (all)            | * (all)                    | None                  |
| worker       | read, execute      | tasks, results             | admin                 |
| monitor      | read               | * (all)                    | write, execute, admin |
| integrator   | read, write        | integrations, data         | admin                 |
| governor     | read, write, admin | policies, acl, audit       | None                  |
| critic       | read, write        | reviews, feedback, metrics | admin                 |

Authentication Flow:
  1. Agent requests auth token with identity
  2. Auth Service generates JWT with claims
  3. Token includes agent type and capabilities
  4. Token signed with secret key
  5. Agent includes token in mesh operations
  6. Auth Service verifies token on each request
  7. ACL policy checked for permission
  8. Audit log entry created

Permission Check Flow:

Request → Extract Token → Verify Token → Check ACL Policy
          ↓               ↓               ↓
          Token Valid?    Not Expired?    Permission Granted?
          ↓               ↓               ↓
          Success         Success         Success / Deny
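The expiry and ACL stages of this flow can be sketched as a single pure function. Signature verification is deliberately left out, and three details are assumptions: `expiresAt` is treated as milliseconds since the epoch, `*` is treated as a wildcard, and a deny rule overrides any allow rule. `TokenClaims`, `Policy`, `matches`, and `checkPermission` are hypothetical names modeled on `AuthToken` and `ACLPolicy`.

```typescript
// Subset of AuthToken; expiresAt is assumed to be milliseconds since the epoch.
interface TokenClaims {
  agentId: string;
  agentType: string;
  expiresAt: number;
}

// Mirrors ACLPolicy.
interface Policy {
  agentType: string;
  allowedActions: string[];
  allowedResources: string[];
  deniedActions?: string[];
  deniedResources?: string[];
}

// True if the list contains the value or the '*' wildcard.
function matches(list: string[] | undefined, value: string): boolean {
  return list !== undefined && (list.indexOf('*') !== -1 || list.indexOf(value) !== -1);
}

// Expiry check followed by ACL evaluation; deny rules win over allow rules (assumption).
function checkPermission(
  token: TokenClaims,
  policy: Policy,
  resource: string,
  action: string,
  nowMs: number = Date.now(),
): boolean {
  if (nowMs >= token.expiresAt) return false;             // token expired
  if (token.agentType !== policy.agentType) return false; // wrong policy for this agent
  if (matches(policy.deniedActions, action) || matches(policy.deniedResources, resource)) {
    return false;
  }
  return matches(policy.allowedActions, action) && matches(policy.allowedResources, resource);
}
```

Using the default worker policy, a worker may `execute` on `tasks` but is refused `admin` on anything and any action on `policies`.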

Tailscale Integration

MagicDNS Service Discovery

How It Works:
  1. Agent Mesh registers service with Tailscale
  2. Tailscale creates DNS entry: agent-mesh-{namespace}-{agentId}.tailnet.ts.net
  3. DNS automatically resolves to agent's Tailscale IP
  4. No configuration files or service registry needed
  5. Agents discover each other via DNS queries

Example DNS Entries:

agent-mesh-production-worker-1.tailnet.ts.net → 100.64.0.1
agent-mesh-production-worker-2.tailnet.ts.net → 100.64.0.2
agent-mesh-staging-worker-1.tailnet.ts.net    → 100.64.0.3
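The naming pattern above is simple enough to express as a helper. In this sketch `meshHostname` and `meshEndpoint` are hypothetical names, `tailnet.ts.net` is the placeholder domain used in this document (a real tailnet has its own name), and the plain-HTTP endpoint format is an assumption based on the `useTLS: false` default.

```typescript
// Build the MagicDNS hostname for an agent from its namespace and ID.
function meshHostname(
  namespace: string,
  agentId: string,
  tailnetDomain: string = 'tailnet.ts.net', // placeholder domain from this document
): string {
  return `agent-mesh-${namespace}-${agentId}.${tailnetDomain}`;
}

// Assumed endpoint format: plain HTTP, since Tailscale already encrypts the link.
function meshEndpoint(namespace: string, agentId: string, port: number): string {
  return `http://${meshHostname(namespace, agentId)}:${port}`;
}
```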

WireGuard Encryption

Security Features:
  - End-to-end encryption for all agent communication
  - Perfect forward secrecy
  - Automatic key rotation
  - No certificate management required
  - Zero-trust network model

NAT Traversal

Automatic Direct Connections:
  - Tailscale handles NAT traversal automatically
  - Direct peer-to-peer connections when possible
  - DERP relay fallback for complex NAT scenarios
  - No port forwarding or firewall configuration needed

Data Flow Patterns

Task Execution Flow

┌──────────────┐
│ Orchestrator │
│   Agent      │
└──────┬───────┘
       │ 1. Submit Task
       ▼
┌──────────────┐
│ Coordinator  │
│   Service    │
└──────┬───────┘
       │ 2. Query Capable Agents
       ▼
┌──────────────┐
│  Discovery   │
│   Service    │
└──────┬───────┘
       │ 3. Return Agent List
       ▼
┌──────────────┐
│ Coordinator  │ 4. Apply Load Balancing
│   Service    │ 5. Select Agent
└──────┬───────┘
       │ 6. Send Task
       ▼
┌──────────────┐
│  Transport   │
│   Service    │
└──────┬───────┘
       │ 7. Deliver via HTTP
       ▼
┌──────────────┐
│   Worker     │
│   Agent      │ 8. Execute Task
└──────┬───────┘
       │ 9. Return Result
       ▼
┌──────────────┐
│  Transport   │
│   Service    │
└──────┬───────┘
       │ 10. Store Result
       ▼
┌──────────────┐
│ Coordinator  │
│   Service    │ 11. Update Status
└──────┬───────┘
       │ 12. Notify Orchestrator
       ▼
┌──────────────┐
│ Orchestrator │
│   Agent      │
└──────────────┘

Agent Registration Flow

┌──────────────┐
│  New Agent   │
└──────┬───────┘
       │ 1. Register Request
       ▼
┌──────────────┐
│  Discovery   │
│   Service    │
└──────┬───────┘
       │ 2. Validate Identity
       │ 3. Check Tailscale Status
       ▼
┌──────────────┐
│  Tailscale   │
│   Service    │
└──────┬───────┘
       │ 4. Register DNS Entry
       │ 5. Get Tailscale IP
       ▼
┌──────────────┐
│  Discovery   │
│   Service    │ 6. Create Agent Entry
└──────┬───────┘ 7. Start Heartbeat
       │ 8. Return Endpoint
       ▼
┌──────────────┐
│  New Agent   │
│   (Active)   │
└──────────────┘

Health Monitoring Flow

┌──────────────┐
│  Discovery   │
│   Service    │ ◄─── Heartbeat Timer (30s)
└──────┬───────┘
       │
       │ For Each Registered Agent:
       │
       ├─ Check Last Seen Timestamp
       │  │
       │  ├─ < 1 min  → Healthy
       │  ├─ 1-5 min  → Degraded
       │  └─ > 5 min  → Unreachable
       │
       ├─ Update Agent Status
       │
       └─ Log Status Changes

Scalability Considerations

Horizontal Scaling

Coordinator Service:
  - Stateless design allows multiple instances
  - Task queue can be shared via Redis
  - Load balancer distributes requests
  - Capacity: 10,000+ tasks/second per instance

Discovery Service:
  - Agent registry can be distributed
  - Health checks can be partitioned
  - Capacity: 1,000+ agents per instance

Transport Service:
  - Connection pooling per agent
  - Async I/O for concurrent requests
  - Capacity: 10,000+ concurrent connections

Vertical Scaling

Memory Requirements:
  - Discovery Service: ~100MB + (10KB per agent)
  - Coordinator Service: ~200MB + (5KB per active task)
  - Transport Service: ~150MB + (2KB per connection)
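These figures are linear estimates, so sizing an instance is simple arithmetic. The sketch below encodes them directly; it assumes 1 MB = 1024 KB and 1 KB = 1024 bytes, and the function names are hypothetical.

```typescript
const KIB = 1024;
const MIB = 1024 * KIB;

// base (MB) + per-item cost (KB) * item count, returned in bytes.
function estimateMemoryBytes(baseMb: number, perItemKb: number, itemCount: number): number {
  return baseMb * MIB + perItemKb * KIB * itemCount;
}

// One wrapper per service, using the approximate figures listed above.
function discoveryMemoryBytes(agents: number): number {
  return estimateMemoryBytes(100, 10, agents);
}

function coordinatorMemoryBytes(activeTasks: number): number {
  return estimateMemoryBytes(200, 5, activeTasks);
}

function transportMemoryBytes(connections: number): number {
  return estimateMemoryBytes(150, 2, connections);
}

// Convenience conversion for reporting.
function toMb(bytes: number): number {
  return bytes / MIB;
}
```

For example, a Discovery Service tracking 1,000 agents comes to roughly 100 MB plus 10,000 KB, or a little under 110 MB.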

CPU Requirements:
  - Task routing: ~0.1ms per task
  - Health checking: ~1ms per agent per check
  - Transport overhead: ~0.5ms per message

Fault Tolerance

Agent Failure Handling

Detection:
  - Heartbeat timeout (30 seconds)
  - Failed health checks
  - Transport errors

Recovery:
  1. Mark agent as unreachable
  2. Reassign active tasks to healthy agents
  3. Update agent discovery cache
  4. Notify monitors
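Step 2 of the recovery sequence might be implemented along these lines. The sketch assumes that only `pending` and `running` assignments count as active, that moved tasks are reset to `pending`, and that they are spread across the healthy agents in turn; `reassignTasks` is a hypothetical helper.

```typescript
// Subset of TaskAssignment used for reassignment.
interface ActiveAssignment {
  taskId: string;
  agentId: string;
  status: 'pending' | 'running' | 'completed' | 'failed' | 'timeout';
}

// Move in-flight tasks off a failed agent, distributing them across healthy
// agents in turn. Returns a new array; finished tasks are left untouched.
function reassignTasks(
  assignments: ActiveAssignment[],
  failedAgentId: string,
  healthyAgentIds: string[],
): ActiveAssignment[] {
  if (healthyAgentIds.length === 0) return assignments; // nowhere to move tasks yet
  let turn = 0;
  return assignments.map((a): ActiveAssignment => {
    const inFlight = a.status === 'pending' || a.status === 'running';
    if (a.agentId !== failedAgentId || !inFlight) return a;
    const target = healthyAgentIds[turn++ % healthyAgentIds.length];
    return { taskId: a.taskId, agentId: target, status: 'pending' };
  });
}
```

A fuller version would pass the moved tasks back through the normal load-balancing strategy instead of a simple rotation.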

Task Failure Handling

Retry Strategy:
  - Automatic retry with exponential backoff
  - Configurable retry count (default: 3)
  - Task timeout enforcement
  - Alternative agent selection on retry

Timeout Handling:
  - Default timeout: 5 minutes
  - Configurable per task
  - Automatic cleanup of timed-out tasks
  - Result storage for forensics
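Timeout enforcement can be modeled as a periodic sweep over running tasks. The sketch below is one way to do it; the `startedAt` field, the in-place status update, and the `sweepTimeouts` name are assumptions, while the 5-minute default comes from the list above.

```typescript
// Minimal running-task record; startedAt (ms since epoch) is a hypothetical field.
interface RunningTask {
  taskId: string;
  startedAt: number;
  timeoutMs?: number; // per-task override
  status: 'running' | 'timeout';
}

const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000; // documented default: 5 minutes

// Mark every overdue running task as timed out and return the affected IDs,
// so the caller can clean up and keep the results for forensics.
function sweepTimeouts(tasks: RunningTask[], nowMs: number): string[] {
  const timedOut: string[] = [];
  for (const task of tasks) {
    const limit = task.timeoutMs !== undefined ? task.timeoutMs : DEFAULT_TIMEOUT_MS;
    if (task.status === 'running' && nowMs - task.startedAt > limit) {
      task.status = 'timeout';
      timedOut.push(task.taskId);
    }
  }
  return timedOut;
}
```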

Network Partition Handling

Tailscale Resilience:
  - Automatic reconnection
  - Multiple relay servers
  - Direct connection fallback
  - Health status propagation

Performance Optimization

Connection Pooling

Request Batching

Caching

Monitoring and Observability

Metrics Collected

Coordinator Metrics:
  - Pending tasks count
  - Running tasks count
  - Completed tasks count
  - Failed tasks count
  - Average execution time

Discovery Metrics:
  - Total agents count
  - Healthy agents count
  - Degraded agents count
  - Unreachable agents count

Transport Metrics:
  - Active connections count
  - Total requests count
  - Failed requests count
  - Average latency

Auth Metrics:
  - Tokens generated count
  - Tokens verified count
  - Permission grants count
  - Permission denials count

Logging

Structured Logging:
  - JSON format
  - Log levels: DEBUG, INFO, WARN, ERROR
  - Contextual metadata per log entry
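A structured entry in this style could be produced as follows. The field names (`timestamp`, `level`, `message`, `context`) and the `formatLogEntry` helper are assumptions; only the JSON format, the four levels, and the presence of contextual metadata come from the list above.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

// Serialize one log entry as a single JSON line. Field names are illustrative.
function formatLogEntry(
  level: LogLevel,
  message: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    context,
  });
}
```

One JSON object per line keeps the output easy to ship to log aggregators that parse newline-delimited JSON.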

Audit Logging:
  - All authentication events
  - All permission checks
  - Agent registration/deregistration
  - Task assignments and completions

Security Architecture

Threat Model

Network Layer:
  - Threat: Man-in-the-middle attacks
  - Mitigation: Tailscale WireGuard encryption

Application Layer:
  - Threat: Unauthorized agent access
  - Mitigation: JWT authentication + ACL policies

Data Layer:
  - Threat: Data leakage via compromised agent
  - Mitigation: Namespace isolation + least privilege

Security Best Practices

  1. Rotate JWT secrets regularly (recommended: monthly)
  2. Review ACL policies for each agent type
  3. Monitor audit logs for suspicious activity
  4. Use namespace isolation for different environments
  5. Limit agent capabilities to minimum required
  6. Enable Tailscale ACLs for network-level filtering

Extension Points

Custom Agent Types

// Define new agent type with custom ACL
await authService.setACLPolicy('custom-processor', {
  agentType: 'custom-processor',
  allowedActions: ['read', 'execute'],
  allowedResources: ['custom-data', 'results'],
  deniedActions: ['admin']
});

Custom Load Balancing

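// Implement a custom strategy (placeholder logic: always picks the first candidate)
class CustomLoadBalancer implements LoadBalancingStrategy {
  selectAgent(agents: MeshAgent[], task: Task): MeshAgent {
    if (agents.length === 0) {
      throw new Error(`No capable agents available for task ${task.taskId}`);
    }
    // Custom selection logic goes here
    return agents[0];
  }
}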

Custom Transport Protocols

// Add WebSocket support
class WebSocketTransport implements IMeshTransportService {
  // Implementation
}

Future Enhancements


Related Documentation:
  - Home: Overview and quick start
  - Deployment: Production deployment guide
  - Development: Development and contribution guide