Agent Mesh - Architecture

Mesh Networking, Service Discovery, and Distributed Coordination Architecture

Overview

Agent Mesh implements a distributed, peer-to-peer architecture for coordinating autonomous agents across networks. Built on Tailscale's encrypted mesh networking with MagicDNS for zero-configuration service discovery, the architecture enables secure, scalable, and fault-tolerant agent coordination.

System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Agent Mesh Control Plane                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Coordinator Service                            │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │ Task Queue   │  │ Load Balancer│  │ Fault Handler│            │  │
│  │  │ Management   │  │  - RoundRobin│  │ - Retry Logic│            │  │
│  │  │              │  │  - LeastLoad │  │ - Failover   │            │  │
│  │  │              │  │  - CapMatch  │  │ - Timeout    │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Discovery Service                              │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │  Agent       │  │  Health      │  │  Namespace   │            │  │
│  │  │  Registry    │  │  Monitor     │  │  Isolation   │            │  │
│  │  │              │  │  - Heartbeat │  │  - Groups    │            │  │
│  │  │              │  │  - Status    │  │  - ACLs      │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Transport Service                              │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │  Message     │  │  RPC         │  │  Streaming   │            │  │
│  │  │  Delivery    │  │  Handler     │  │  Support     │            │  │
│  │  │              │  │              │  │              │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Auth Service                                   │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │  │
│  │  │  JWT         │  │  ACL         │  │  Audit       │            │  │
│  │  │  Management  │  │  Policies    │  │  Logging     │            │  │
│  │  │              │  │              │  │              │            │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                           │
└───────────────────────────────┬───────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    Tailscale Network Layer                              │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  MagicDNS: agent-mesh-production-worker-1.tailnet.ts.net         │  │
│  │  WireGuard Encryption: End-to-end encrypted mesh network         │  │
│  │  Automatic NAT Traversal: Direct peer-to-peer connections        │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Agent Data Plane                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐       │
│  │Orchestrator│  │  Worker 1  │  │  Worker 2  │  │  Monitor   │       │
│  │   Agent    │  │   Agent    │  │   Agent    │  │   Agent    │       │
│  │            │  │            │  │            │  │            │       │
│  │ - Planning │  │ - Execution│  │ - Execution│  │ - Metrics  │       │
│  │ - Workflow │  │ - Results  │  │ - Results  │  │ - Health   │       │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘       │
│                                                                           │
└─────────────────────────────────────────────────────────────────────────┘

Core Components

1. Discovery Service

Purpose: Automatic agent registration and discovery without manual configuration.

Key Features:
- Tailscale MagicDNS integration for service discovery
- Agent identity management with capabilities
- Namespace-based isolation
- Health checking and heartbeat monitoring
- Agent lifecycle management

Architecture:

interface MeshAgent {
  identity: {
    agentId: string;
    agentName: string;
    agentType: 'orchestrator' | 'worker' | 'monitor' | 'integrator' | 'governor' | 'critic';
    namespace: string;
    capabilities: string[];
    version: string;
  };
  tailscaleHostname: string;
  tailscaleIP: string;
  port: number;
  endpoint: string;
  healthStatus: 'healthy' | 'degraded' | 'unreachable';
  lastSeen: Date;
  registeredAt: Date;
}

Discovery Flow:
1. Agent registers with the Discovery Service
2. Discovery Service registers the agent with Tailscale MagicDNS
3. Agent receives a unique hostname (e.g., agent-mesh-production-worker-1.tailnet.ts.net)
4. Heartbeat monitoring starts automatically
5. Other agents discover it via capability-based queries
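The capability-based query in the last step can be pictured as a filter over the agent registry. The following is a minimal sketch, not the Discovery Service API: `AgentRecord` and `findAgents` are hypothetical names, and the rule that unreachable agents are excluded from results is an assumption.

```typescript
// Trimmed-down registry entry (hypothetical; mirrors a subset of MeshAgent).
interface AgentRecord {
  agentId: string;
  namespace: string;
  capabilities: string[];
  healthStatus: 'healthy' | 'degraded' | 'unreachable';
}

// Return the agents in `namespace` that advertise every required capability
// and are not currently marked unreachable (assumed filtering rule).
function findAgents(
  registry: AgentRecord[],
  namespace: string,
  requiredCapabilities: string[],
): AgentRecord[] {
  return registry.filter(
    (agent) =>
      agent.namespace === namespace &&
      agent.healthStatus !== 'unreachable' &&
      requiredCapabilities.every((cap) => agent.capabilities.includes(cap)),
  );
}
```

Filtering by namespace first keeps staging and production agents from seeing each other, which matches the namespace isolation described above.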

Health Monitoring:
- Healthy: Last seen < 1 minute ago
- Degraded: Last seen 1-5 minutes ago
- Unreachable: Last seen > 5 minutes ago
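These thresholds translate directly into a small classification function. The sketch below is illustrative: `classifyHealth` is a hypothetical name, and treating exactly 1 minute and exactly 5 minutes as "degraded" is an assumption, since the thresholds above do not say which side the boundaries fall on.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unreachable';

const ONE_MINUTE_MS = 60_000;
const FIVE_MINUTES_MS = 5 * ONE_MINUTE_MS;

// Classify an agent from the age of its last heartbeat.
// Boundary values (exactly 1 or 5 minutes) are assumed to count as "degraded".
function classifyHealth(lastSeen: Date, now: Date = new Date()): HealthStatus {
  const ageMs = now.getTime() - lastSeen.getTime();
  if (ageMs < ONE_MINUTE_MS) return 'healthy';
  if (ageMs <= FIVE_MINUTES_MS) return 'degraded';
  return 'unreachable';
}
```

Passing `now` explicitly keeps the function deterministic, which makes the thresholds easy to unit test.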

2. Coordinator Service

Purpose: Intelligent task distribution and load balancing across agent mesh.

Key Features:
- Task queue management with priority levels
- Agent capability matching
- Multiple load balancing strategies
- Fault tolerance with automatic retry
- Task status tracking and result retrieval
- Workload analytics per agent

Architecture:

interface Task {
  taskId: string;
  taskType: string;
  payload: any;
  requiredCapabilities: string[];
  priority: 'low' | 'medium' | 'high' | 'critical';
  timeout: number;
  retries: number;
  metadata?: Record<string, any>;
}

interface TaskAssignment {
  taskId: string;
  agentId: string;
  assignedAt: Date;
  status: 'pending' | 'running' | 'completed' | 'failed' | 'timeout';
  result?: any;
  error?: string;
}
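To make the four priority levels concrete, here is a minimal sketch of a priority-ordered queue. It is not the Coordinator's actual queue: `PriorityTaskQueue`, `QueuedTask`, and the numeric weights are hypothetical, and first-in-first-out ordering within a priority level is an assumption.

```typescript
type Priority = 'low' | 'medium' | 'high' | 'critical';

interface QueuedTask {
  taskId: string;
  priority: Priority;
}

// Higher weight is served earlier. The numeric values are illustrative.
const PRIORITY_WEIGHT: Record<Priority, number> = { low: 0, medium: 1, high: 2, critical: 3 };

class PriorityTaskQueue {
  private tasks: QueuedTask[] = [];

  enqueue(task: QueuedTask): void {
    // Insert before the first queued task with strictly lower priority,
    // which preserves FIFO order among tasks of equal priority.
    const index = this.tasks.findIndex(
      (queued) => PRIORITY_WEIGHT[queued.priority] < PRIORITY_WEIGHT[task.priority],
    );
    if (index === -1) this.tasks.push(task);
    else this.tasks.splice(index, 0, task);
  }

  // Remove and return the highest-priority task, or undefined when empty.
  dequeue(): QueuedTask | undefined {
    return this.tasks.shift();
  }

  get size(): number {
    return this.tasks.length;
  }
}
```

A sorted array is adequate for a sketch; a binary heap would be the usual choice once the queue holds many thousands of tasks.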

Task Routing Flow:
1. Task submitted to Coordinator
2. Coordinator analyzes required capabilities
3. Discovery Service queried for capable agents
4. Load balancing strategy applied
5. Task assigned to selected agent
6. Transport Service delivers task
7. Agent executes and returns result
8. Coordinator updates task status

Load Balancing Strategies:

Round-Robin

Algorithm: Sequential distribution across agents
Use Case: Homogeneous agent pools
Characteristics: Simple, predictable, minimal state (only a rotating index)
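A minimal sketch of round-robin selection, assuming the candidate list is passed in on every call. `RoundRobinSelector` is a hypothetical name. Note that the selector keeps a cursor between calls, so it is only stateless in the sense that it ignores agent load.

```typescript
class RoundRobinSelector<T> {
  private nextIndex = 0;

  // Return the next agent in rotation, or undefined if there are no candidates.
  select(agents: T[]): T | undefined {
    if (agents.length === 0) return undefined;
    // The modulo keeps the cursor valid even if the pool shrank since the last call.
    const agent = agents[this.nextIndex % agents.length];
    this.nextIndex = (this.nextIndex + 1) % agents.length;
    return agent;
  }
}
```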

Least-Loaded

Algorithm: Select agent with minimum active tasks
Use Case: Heterogeneous performance characteristics
Characteristics: Balances workload dynamically
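As a sketch, least-loaded selection is a single pass that keeps the agent with the fewest active tasks. `AgentLoad` and `selectLeastLoaded` are hypothetical names, and breaking ties in favor of the first agent seen is an assumption.

```typescript
interface AgentLoad {
  agentId: string;
  activeTasks: number;
}

// Pick the agent with the fewest active tasks; ties go to the earliest candidate.
function selectLeastLoaded(agents: AgentLoad[]): AgentLoad | undefined {
  let best: AgentLoad | undefined;
  for (const agent of agents) {
    if (best === undefined || agent.activeTasks < best.activeTasks) {
      best = agent;
    }
  }
  return best;
}
```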

Capability-Match

Algorithm: Prefer agents with exact capability matches
Use Case: Specialized task requirements
Characteristics: Optimizes for task-agent affinity
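One way to read "prefer exact matches" is to rank capable agents by how many capabilities they have beyond what the task requires. The sketch below takes that interpretation, which is an assumption rather than the documented scoring rule; `CapableAgent` and `selectByCapabilityMatch` are hypothetical names.

```typescript
interface CapableAgent {
  agentId: string;
  capabilities: string[];
}

// Among agents that cover every required capability, prefer the one with the
// fewest extra capabilities. An exact match has zero extras and always wins.
function selectByCapabilityMatch(
  agents: CapableAgent[],
  required: string[],
): CapableAgent | undefined {
  const requiredCount = new Set(required).size;
  let best: CapableAgent | undefined;
  let bestExtras = Infinity;
  for (const agent of agents) {
    const coversAll = required.every((cap) => agent.capabilities.includes(cap));
    if (!coversAll) continue;
    const extras = new Set(agent.capabilities).size - requiredCount;
    if (extras < bestExtras) {
      best = agent;
      bestExtras = extras;
    }
  }
  return best;
}
```

Preferring specialists leaves generalist agents free for tasks that only they can handle.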

Random

Algorithm: Random selection from capable agents
Use Case: Testing and development
Characteristics: Stateless, no coordination overhead

3. Transport Service

Purpose: Secure, reliable agent-to-agent communication over Tailscale network.

Key Features:
- HTTP/HTTPS over Tailscale encrypted mesh
- Request/response messaging patterns
- Streaming support for large payloads
- Broadcast messaging to multiple agents
- Connection pooling and reuse
- Automatic retry with exponential backoff
- Connection statistics and monitoring

Architecture:

interface Message {
  messageId: string;
  from: string;        // Source agent ID
  to: string;          // Target agent ID
  type: 'request' | 'response' | 'event' | 'stream';
  payload: any;
  timestamp: Date;
  correlationId?: string;
  metadata?: Record<string, any>;
}

interface TransportConfig {
  timeout: number;              // Default: 30000ms
  retries: number;              // Default: 3
  retryDelay: number;           // Default: 1000ms
  maxConcurrentConnections: number;  // Default: 10
  keepAlive: boolean;           // Default: true
  useTLS: boolean;              // Default: false (Tailscale provides encryption)
}
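The `retries` and `retryDelay` settings combine with exponential backoff roughly as sketched below. This is illustrative only: `backoffDelays` and `sendWithRetry` are hypothetical helpers, and doubling the delay on each attempt is an assumption, since the backoff factor is not specified here.

```typescript
interface RetryConfig {
  retries: number;     // number of retries after the first attempt
  retryDelay: number;  // base delay in milliseconds
}

const DEFAULT_RETRY: RetryConfig = { retries: 3, retryDelay: 1000 };

// Delay before retry n (0-based) is retryDelay * 2^n. The factor 2 is assumed.
function backoffDelays(config: RetryConfig = DEFAULT_RETRY): number[] {
  return Array.from({ length: config.retries }, (_, attempt) => config.retryDelay * 2 ** attempt);
}

// Call `send` until it succeeds or the retry budget is exhausted.
// `sleep` is injectable so tests can avoid real waiting.
async function sendWithRetry<T>(
  send: () => Promise<T>,
  config: RetryConfig = DEFAULT_RETRY,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  const delays = backoffDelays(config);
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (error) {
      const delay = delays[attempt];
      if (delay === undefined) throw error; // retries exhausted: surface the last error
      await sleep(delay);
    }
  }
}
```

With the defaults listed in `TransportConfig`, this schedule waits 1000 ms, 2000 ms, and 4000 ms before the three retries. Production implementations often add random jitter so that many agents do not retry in lockstep.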

Communication Patterns:

Request/Response

Agent A → [REQUEST] → Agent B
Agent A ← [RESPONSE] ← Agent B
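The sketch below shows how a response might be tied back to its request through `correlationId`. It assumes the responder copies the request's `messageId` into `correlationId`, which is a common convention rather than something this page specifies. `MeshMessage`, `createRequest`, `createResponse`, and the counter-based ID generator are hypothetical.

```typescript
interface MeshMessage {
  messageId: string;
  from: string;
  to: string;
  type: 'request' | 'response' | 'event' | 'stream';
  payload: unknown;
  timestamp: Date;
  correlationId?: string;
}

let messageCounter = 0;

// Hypothetical ID generator; a real deployment would more likely use UUIDs.
function nextMessageId(agentId: string): string {
  messageCounter += 1;
  return `${agentId}-${messageCounter}`;
}

function createRequest(from: string, to: string, payload: unknown): MeshMessage {
  return { messageId: nextMessageId(from), from, to, type: 'request', payload, timestamp: new Date() };
}

// Build the reply: swap sender and receiver and point correlationId at the request.
function createResponse(request: MeshMessage, payload: unknown): MeshMessage {
  return {
    messageId: nextMessageId(request.to),
    from: request.to,
    to: request.from,
    type: 'response',
    payload,
    timestamp: new Date(),
    correlationId: request.messageId,
  };
}
```

The caller can then keep a map of pending `messageId` values and resolve the matching entry when a response with that `correlationId` arrives.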

Event Broadcasting

Coordinator → [EVENT] → Agent 1
              [EVENT] → Agent 2
              [EVENT] → Agent 3

Streaming

Agent A → [CHUNK 1] → Agent B
        → [CHUNK 2] →
        → [CHUNK 3] →
        → [COMPLETE] →
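A minimal sketch of the chunk-then-complete framing shown above. The frame shape, the `toStreamFrames` name, and the use of string payloads are assumptions made for illustration; a real transport would more likely stream bytes.

```typescript
type StreamFrame =
  | { kind: 'chunk'; sequence: number; data: string }
  | { kind: 'complete'; totalChunks: number };

// Split a payload into fixed-size chunk frames followed by one completion frame.
function toStreamFrames(payload: string, chunkSize: number): StreamFrame[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const frames: StreamFrame[] = [];
  let sequence = 0;
  for (let offset = 0; offset < payload.length; offset += chunkSize) {
    sequence += 1;
    frames.push({ kind: 'chunk', sequence, data: payload.slice(offset, offset + chunkSize) });
  }
  // The completion frame lets the receiver verify that no chunk was lost.
  frames.push({ kind: 'complete', totalChunks: sequence });
  return frames;
}
```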

Transport Flow:
1. Coordinator creates message with metadata
2. Transport Service resolves target agent endpoint
3. HTTP client created/reused from connection pool
4. Message sent over Tailscale encrypted network
5. Target agent receives and processes
6. Response returned via same channel
7. Connection statistics updated

4. Auth Service

Purpose: Zero-trust authentication and authorization for agent-to-agent communication.

Key Features:
- JWT-based agent authentication
- Role-based access control (RBAC)
- ACL policies per agent type
- Permission checking at API level
- Security audit logging
- Token revocation support

Architecture:

interface AuthToken {
  agentId: string;
  agentType: string;
  namespace: string;
  capabilities: string[];
  issuedAt: number;
  expiresAt: number;
}

interface Permission {
  resource: string;
  action: 'read' | 'write' | 'execute' | 'admin';
  namespace?: string;
}

interface ACLPolicy {
  agentType: string;
  allowedActions: string[];
  allowedResources: string[];
  deniedActions?: string[];
  deniedResources?: string[];
}

Default ACL Policies:

Agent Type	Allowed Actions	Allowed Resources	Denied Actions
orchestrator	* (all)	* (all)	None
worker	read, execute	tasks, results	admin
monitor	read	* (all)	write, execute, admin
integrator	read, write	integrations, data	admin
governor	read, write, admin	policies, acl, audit	None
critic	read, write	reviews, feedback, metrics	admin
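The default policies can be evaluated with a small matcher. The sketch below is one possible reading, not the Auth Service implementation: `isAllowed` and `matches` are hypothetical helpers, `*` is treated as a wildcard, and deny rules are assumed to take precedence over allow rules.

```typescript
interface ACLPolicy {
  agentType: string;
  allowedActions: string[];
  allowedResources: string[];
  deniedActions?: string[];
  deniedResources?: string[];
}

// True when the list exists and contains either the wildcard or the exact value.
const matches = (patterns: string[] | undefined, value: string): boolean =>
  patterns !== undefined && (patterns.includes('*') || patterns.includes(value));

// Deny rules are checked first and win over allow rules (assumed precedence).
function isAllowed(policy: ACLPolicy, action: string, resource: string): boolean {
  if (matches(policy.deniedActions, action) || matches(policy.deniedResources, resource)) {
    return false;
  }
  return matches(policy.allowedActions, action) && matches(policy.allowedResources, resource);
}

// The worker and monitor rows from the table above.
const workerPolicy: ACLPolicy = {
  agentType: 'worker',
  allowedActions: ['read', 'execute'],
  allowedResources: ['tasks', 'results'],
  deniedActions: ['admin'],
};

const monitorPolicy: ACLPolicy = {
  agentType: 'monitor',
  allowedActions: ['read'],
  allowedResources: ['*'],
  deniedActions: ['write', 'execute', 'admin'],
};
```

Under these rules a worker may execute tasks but cannot read policies, while a monitor may read any resource but cannot write to it.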

Authentication Flow:
1. Agent requests auth token with identity
2. Auth Service generates JWT with claims
3. Token includes agent type and capabilities
4. Token signed with secret key
5. Agent includes token in mesh operations
6. Auth Service verifies token on each request
7. ACL policy checked for permission
8. Audit log entry created

Permission Check Flow:

Request → Extract Token → Verify Token → Check ACL Policy
          ↓               ↓               ↓
          Token Valid?    Not Expired?    Permission Granted?
          ↓               ↓               ↓
          Success         Success         Success / Deny
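The three stages of this flow can be sketched as a function that stops at the first failing check and reports why. Signature verification is deliberately left out and would sit in front of this function. `checkPermission`, `TokenClaims`, and `CheckResult` are hypothetical, and `issuedAt`/`expiresAt` are assumed to be seconds since the Unix epoch, as is conventional for JWT `iat` and `exp` claims.

```typescript
interface TokenClaims {
  agentId: string;
  agentType: string;
  issuedAt: number;   // assumed: seconds since the Unix epoch
  expiresAt: number;  // assumed: seconds since the Unix epoch
}

type CheckResult =
  | { granted: true }
  | { granted: false; reason: 'missing-token' | 'expired' | 'acl-denied' };

// Run the checks in order: token present, token not expired, ACL permits the request.
// `isPermitted` stands in for the ACL policy lookup.
function checkPermission(
  token: TokenClaims | undefined,
  isPermitted: (claims: TokenClaims) => boolean,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): CheckResult {
  if (token === undefined) return { granted: false, reason: 'missing-token' };
  if (token.expiresAt <= nowSeconds) return { granted: false, reason: 'expired' };
  if (!isPermitted(token)) return { granted: false, reason: 'acl-denied' };
  return { granted: true };
}
```

Returning a reason rather than a bare boolean gives the audit log something specific to record for each denial.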

Tailscale Integration

MagicDNS Service Discovery

How It Works:
1. Agent Mesh registers the service with Tailscale
2. Tailscale creates a DNS entry: agent-mesh-{namespace}-{agentId}.tailnet.ts.net
3. The DNS name automatically resolves to the agent's Tailscale IP
4. No configuration files or external service registry are needed
5. Agents locate each other via DNS queries

Example DNS Entries:

agent-mesh-production-worker-1.tailnet.ts.net → 100.64.0.1
agent-mesh-production-worker-2.tailnet.ts.net → 100.64.0.2
agent-mesh-staging-worker-1.tailnet.ts.net    → 100.64.0.3
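The naming pattern can be captured in a small helper. This is a sketch under stated assumptions: `meshHostname` is a hypothetical function, `tailnet.ts.net` is the placeholder domain used on this page (a real tailnet has its own name), and the DNS-label validation is an added safeguard rather than documented behavior.

```typescript
// Build the MagicDNS-style hostname for an agent from its namespace and ID.
function meshHostname(
  namespace: string,
  agentId: string,
  tailnetDomain: string = 'tailnet.ts.net', // placeholder domain from this page
): string {
  const label = `agent-mesh-${namespace}-${agentId}`.toLowerCase();
  // A DNS label may contain letters, digits, and hyphens, must start and end
  // with a letter or digit, and is limited to 63 characters.
  if (!/^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/.test(label)) {
    throw new Error(`"${label}" is not a valid DNS label`);
  }
  return `${label}.${tailnetDomain}`;
}
```

For example, `meshHostname('production', 'worker-1')` yields `agent-mesh-production-worker-1.tailnet.ts.net`, matching the first entry above.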

WireGuard Encryption

Security Features:
- End-to-end encryption for all agent communication
- Perfect forward secrecy
- Automatic key rotation
- No certificate management required
- Zero-trust network model

NAT Traversal

Automatic Direct Connections:
- Tailscale handles NAT traversal automatically
- Direct peer-to-peer connections when possible
- DERP relay fallback for complex NAT scenarios
- No port forwarding or firewall configuration needed

Data Flow Patterns

Task Execution Flow

┌──────────────┐
│ Orchestrator │
│   Agent      │
└──────┬───────┘
       │ 1. Submit Task
       ▼
┌──────────────┐
│ Coordinator  │
│   Service    │
└──────┬───────┘
       │ 2. Query Capable Agents
       ▼
┌──────────────┐
│  Discovery   │
│   Service    │
└──────┬───────┘
       │ 3. Return Agent List
       ▼
┌──────────────┐
│ Coordinator  │ 4. Apply Load Balancing
│   Service    │ 5. Select Agent
└──────┬───────┘
       │ 6. Send Task
       ▼
┌──────────────┐
│  Transport   │
│   Service    │
└──────┬───────┘
       │ 7. Deliver via HTTP
       ▼
┌──────────────┐
│   Worker     │
│   Agent      │ 8. Execute Task
└──────┬───────┘
       │ 9. Return Result
       ▼
┌──────────────┐
│  Transport   │
│   Service    │
└──────┬───────┘
       │ 10. Store Result
       ▼
┌──────────────┐
│ Coordinator  │
│   Service    │ 11. Update Status
└──────┬───────┘
       │ 12. Notify Orchestrator
       ▼
┌──────────────┐
│ Orchestrator │
│   Agent      │
└──────────────┘

Agent Registration Flow

┌──────────────┐
│  New Agent   │
└──────┬───────┘
       │ 1. Register Request
       ▼
┌──────────────┐
│  Discovery   │
│   Service    │
└──────┬───────┘
       │ 2. Validate Identity
       │ 3. Check Tailscale Status
       ▼
┌──────────────┐
│  Tailscale   │
│   Service    │
└──────┬───────┘
       │ 4. Register DNS Entry
       │ 5. Get Tailscale IP
       ▼
┌──────────────┐
│  Discovery   │
│   Service    │ 6. Create Agent Entry
└──────┬───────┘ 7. Start Heartbeat
       │ 8. Return Endpoint
       ▼
┌──────────────┐
│  New Agent   │
│   (Active)   │
└──────────────┘

Health Monitoring Flow

┌──────────────┐
│  Discovery   │
│   Service    │ ◄─── Heartbeat Timer (30s)
└──────┬───────┘
       │
       │ For Each Registered Agent:
       │
       ├─ Check Last Seen Timestamp
       │  │
       │  ├─ < 1 min  → Healthy
       │  ├─ 1-5 min  → Degraded
       │  └─ > 5 min  → Unreachable
       │
       ├─ Update Agent Status
       │
       └─ Log Status Changes

Scalability Considerations

Horizontal Scaling

Coordinator Service:
- Stateless design allows multiple instances
- Task queue can be shared via Redis
- Load balancer distributes requests
- Capacity: 10,000+ tasks/second per instance

Discovery Service:
- Agent registry can be distributed
- Health checks can be partitioned
- Capacity: 1,000+ agents per instance

Transport Service:
- Connection pooling per agent
- Async I/O for concurrent requests
- Capacity: 10,000+ concurrent connections

Vertical Scaling

Memory Requirements:
- Discovery Service: ~100MB + (10KB per agent)
- Coordinator Service: ~200MB + (5KB per active task)
- Transport Service: ~150MB + (2KB per connection)
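These figures follow a simple "base plus per-item" formula, which the sketch below turns into a sizing helper. The names `MEMORY_PROFILES` and `estimateMemoryMb` are hypothetical, and interpreting KB and MB as binary units (1024-based) is an assumption.

```typescript
const KIB = 1024;
const MIB = 1024 * KIB;

interface MemoryProfile {
  baseMb: number;     // fixed overhead
  perItemKb: number;  // incremental cost per tracked item
}

// Figures taken from the memory requirements listed above.
const MEMORY_PROFILES = {
  discovery: { baseMb: 100, perItemKb: 10 },   // per registered agent
  coordinator: { baseMb: 200, perItemKb: 5 },  // per active task
  transport: { baseMb: 150, perItemKb: 2 },    // per connection
};

// Estimated memory in MB: base + perItem * count.
function estimateMemoryMb(profile: MemoryProfile, itemCount: number): number {
  const bytes = profile.baseMb * MIB + profile.perItemKb * KIB * itemCount;
  return bytes / MIB;
}
```

Under these assumptions, a Discovery Service tracking 1,000 agents comes to roughly 110 MB (100 MB base plus about 10 MB of agent records).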

CPU Requirements:
- Task routing: ~0.1ms per task
- Health checking: ~1ms per agent per check
- Transport overhead: ~0.5ms per message

Fault Tolerance

Agent Failure Handling

Detection:
- Heartbeat timeout (30 seconds)
- Failed health checks
- Transport errors

Recovery:
1. Mark agent as unreachable
2. Reassign active tasks to healthy agents
3. Update agent discovery cache
4. Notify monitors
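The reassignment step might look like the sketch below. It is illustrative only: `reassignTasks` and `ActiveAssignment` are hypothetical, spreading orphaned tasks across healthy agents in rotation is an assumed policy, and resetting the status to `pending` assumes the task restarts from scratch on its new agent.

```typescript
interface ActiveAssignment {
  taskId: string;
  agentId: string;
  status: 'pending' | 'running' | 'completed' | 'failed' | 'timeout';
}

// Move pending and running tasks off a failed agent, spreading them across
// the healthy agents in rotation. Finished tasks are left untouched.
function reassignTasks(
  assignments: ActiveAssignment[],
  failedAgentId: string,
  healthyAgentIds: string[],
): ActiveAssignment[] {
  if (healthyAgentIds.length === 0) return assignments; // nowhere to move the work
  let next = 0;
  return assignments.map((assignment): ActiveAssignment => {
    const isActive = assignment.status === 'pending' || assignment.status === 'running';
    if (assignment.agentId !== failedAgentId || !isActive) return assignment;
    const target = healthyAgentIds[next % healthyAgentIds.length] ?? assignment.agentId;
    next += 1;
    return { ...assignment, agentId: target, status: 'pending' };
  });
}
```

The function returns a new array instead of mutating its input, so the caller can log the before and after states when notifying monitors.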

Task Failure Handling

Retry Strategy:
- Automatic retry with exponential backoff
- Configurable retry count (default: 3)
- Task timeout enforcement
- Alternative agent selection on retry

Timeout Handling:
- Default timeout: 5 minutes
- Configurable per task
- Automatic cleanup of timed-out tasks
- Result storage for forensics

Network Partition Handling

Tailscale Resilience:
- Automatic reconnection
- Multiple relay servers
- Relay fallback when a direct connection cannot be established
- Health status propagation

Performance Optimization

Connection Pooling

HTTP clients cached per agent
Keep-alive connections
Automatic cleanup of stale connections
Max pool size: 10 connections per agent

Request Batching

Broadcast messages sent in parallel
Task submissions queued and batched
Status checks aggregated

Caching

Agent discovery results cached (60s TTL)
ACL policies cached in memory
Tailscale status cached (30s TTL)
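A time-to-live cache of the kind described here can be sketched in a few lines. `TtlCache` is a hypothetical class, not the project's cache; entries are evicted lazily on read, which is an assumed design choice, and the clock is injectable so expiry can be tested without waiting.

```typescript
class TtlCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  set(key: K, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Return the cached value, or undefined if it is missing or has expired.
  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazy eviction on read
      return undefined;
    }
    return entry.value;
  }
}
```

A discovery cache with the 60-second TTL listed above would be created as `new TtlCache<string, string[]>(60_000)`, keyed for instance by a capability query string.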

Monitoring and Observability

Metrics Collected

Coordinator Metrics:
- Pending tasks count
- Running tasks count
- Completed tasks count
- Failed tasks count
- Average execution time

Discovery Metrics:
- Total agents count
- Healthy agents count
- Degraded agents count
- Unreachable agents count

Transport Metrics:
- Active connections count
- Total requests count
- Failed requests count
- Average latency

Auth Metrics:
- Tokens generated count
- Tokens verified count
- Permission grants count
- Permission denials count

Logging

Structured Logging:
- JSON format
- Log levels: DEBUG, INFO, WARN, ERROR
- Contextual metadata per log entry

Audit Logging:
- All authentication events
- All permission checks
- Agent registration/deregistration
- Task assignments and completions

Security Architecture

Threat Model

Network Layer:
- Threat: Man-in-the-middle attacks
- Mitigation: Tailscale WireGuard encryption

Application Layer:
- Threat: Unauthorized agent access
- Mitigation: JWT authentication + ACL policies

Data Layer:
- Threat: Data leakage via compromised agent
- Mitigation: Namespace isolation + least privilege

Security Best Practices

Rotate JWT secrets regularly (recommended: monthly)
Review ACL policies for each agent type
Monitor audit logs for suspicious activity
Use namespace isolation for different environments
Limit agent capabilities to minimum required
Enable Tailscale ACLs for network-level filtering

Extension Points

Custom Agent Types

// Define new agent type with custom ACL
await authService.setACLPolicy('custom-processor', {
  agentType: 'custom-processor',
  allowedActions: ['read', 'execute'],
  allowedResources: ['custom-data', 'results'],
  deniedActions: ['admin']
});

Custom Load Balancing

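// Implement a custom strategy
class CustomLoadBalancer implements LoadBalancingStrategy {
  selectAgent(agents: MeshAgent[], task: Task): MeshAgent {
    // Fail loudly instead of returning undefined when no agent is available
    if (agents.length === 0) {
      throw new Error(`No capable agents for task ${task.taskId}`);
    }
    // Custom selection logic goes here; this placeholder picks the first candidate
    return agents[0];
  }
}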

Custom Transport Protocols

// Add WebSocket support
class WebSocketTransport implements IMeshTransportService {
  // Implementation
}

Future Enhancements

GraphQL subscription support for real-time updates
gRPC transport option for high-performance RPC
Multi-region mesh federation
Kubernetes operator for agent lifecycle
Distributed tracing integration (OpenTelemetry)
Vector database integration for semantic agent discovery

Related Documentation:
- Home: Overview and quick start
- Deployment: Production deployment guide
- Development: Development and contribution guide
