Agent Mesh - Architecture
Mesh Networking, Service Discovery, and Distributed Coordination Architecture
Overview
Agent Mesh implements a distributed, peer-to-peer architecture for coordinating autonomous agents across networks. Built on Tailscale's encrypted mesh networking with MagicDNS for zero-configuration service discovery, the architecture enables secure, scalable, and fault-tolerant agent coordination.
System Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Agent Mesh Control Plane │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Coordinator Service │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Task Queue │ │ Load Balancer│ │ Fault Handler│ │ │
│ │ │ Management │ │ - RoundRobin│ │ - Retry Logic│ │ │
│ │ │ │ │ - LeastLoad │ │ - Failover │ │ │
│ │ │ │ │ - CapMatch │ │ - Timeout │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Discovery Service │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Agent │ │ Health │ │ Namespace │ │ │
│ │ │ Registry │ │ Monitor │ │ Isolation │ │ │
│ │ │ │ │ - Heartbeat │ │ - Groups │ │ │
│ │ │ │ │ - Status │ │ - ACLs │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Transport Service │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Message │ │ RPC │ │ Streaming │ │ │
│ │ │ Delivery │ │ Handler │ │ Support │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Auth Service │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ JWT │ │ ACL │ │ Audit │ │ │
│ │ │ Management │ │ Policies │ │ Logging │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Tailscale Network Layer │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ MagicDNS: agent-mesh-production-worker-1.tailnet.ts.net │ │
│ │ WireGuard Encryption: End-to-end encrypted mesh network │ │
│ │ Automatic NAT Traversal: Direct peer-to-peer connections │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Agent Data Plane │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │Orchestrator│ │ Worker 1 │ │ Worker 2 │ │ Monitor │ │
│ │ Agent │ │ Agent │ │ Agent │ │ Agent │ │
│ │ │ │ │ │ │ │ │ │
│ │ - Planning │ │ - Execution│ │ - Execution│ │ - Metrics │ │
│ │ - Workflow │ │ - Results │ │ - Results │ │ - Health │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Core Components
1. Discovery Service
Purpose: Automatic agent registration and discovery without manual configuration.
Key Features:
- Tailscale MagicDNS integration for service discovery
- Agent identity management with capabilities
- Namespace-based isolation
- Health checking and heartbeat monitoring
- Agent lifecycle management
Architecture:
interface MeshAgent {
  identity: {
    agentId: string;
    agentName: string;
    agentType: 'orchestrator' | 'worker' | 'monitor' | 'integrator' | 'governor' | 'critic';
    namespace: string;
    capabilities: string[];
    version: string;
  };
  tailscaleHostname: string;
  tailscaleIP: string;
  port: number;
  endpoint: string;
  healthStatus: 'healthy' | 'degraded' | 'unreachable';
  lastSeen: Date;
  registeredAt: Date;
}
Discovery Flow:
1. Agent registers with Discovery Service
2. Service registers agent with Tailscale MagicDNS
3. Agent receives unique hostname (e.g., agent-mesh-prod-worker-1.tailnet.ts.net)
4. Heartbeat monitoring starts automatically
5. Other agents discover via capability-based queries
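Step 5 can be illustrated with a small capability query. This is a minimal sketch, assuming an in-memory registry: `RegisteredAgent` is a trimmed-down stand-in for `MeshAgent`, `findCapableAgents` is a hypothetical helper rather than part of the Agent Mesh API, and excluding only unreachable agents is an assumption.

```typescript
// Trimmed-down stand-in for MeshAgent: only the fields the query needs.
interface RegisteredAgent {
  agentId: string;
  namespace: string;
  capabilities: string[];
  healthStatus: "healthy" | "degraded" | "unreachable";
}

// Hypothetical helper: returns agents in the namespace that are not
// unreachable and that advertise every required capability.
function findCapableAgents(
  registry: RegisteredAgent[],
  namespace: string,
  required: string[],
): RegisteredAgent[] {
  return registry.filter(
    (agent) =>
      agent.namespace === namespace &&
      agent.healthStatus !== "unreachable" &&
      required.every((cap) => agent.capabilities.indexOf(cap) !== -1),
  );
}

const sampleRegistry: RegisteredAgent[] = [
  { agentId: "worker-1", namespace: "production", capabilities: ["execute", "results"], healthStatus: "healthy" },
  { agentId: "worker-2", namespace: "production", capabilities: ["execute"], healthStatus: "unreachable" },
  { agentId: "worker-3", namespace: "staging", capabilities: ["execute"], healthStatus: "healthy" },
];

// Only worker-1 is in "production", reachable, and able to "execute".
const capableAgents = findCapableAgents(sampleRegistry, "production", ["execute"]);
```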
Health Monitoring:
- Healthy: Last seen < 1 minute ago
- Degraded: Last seen 1-5 minutes ago
- Unreachable: Last seen > 5 minutes ago
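The thresholds above map directly onto a pure function of the `lastSeen` timestamp. The following is a sketch only: `classifyHealth` is a hypothetical name, and treating exactly 1 minute and exactly 5 minutes as degraded is an assumption, since the boundaries are not specified.

```typescript
type HealthStatus = "healthy" | "degraded" | "unreachable";

const ONE_MINUTE_MS = 60 * 1000;
const FIVE_MINUTES_MS = 5 * ONE_MINUTE_MS;

// Classifies an agent by how long ago it was last seen.
// Boundary handling (exactly 1 or 5 minutes counts as degraded) is assumed.
function classifyHealth(lastSeen: Date, now: Date = new Date()): HealthStatus {
  const ageMs = now.getTime() - lastSeen.getTime();
  if (ageMs < ONE_MINUTE_MS) return "healthy";
  if (ageMs <= FIVE_MINUTES_MS) return "degraded";
  return "unreachable";
}
```

Passing `now` explicitly keeps the function deterministic, which makes the thresholds easy to unit test.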
2. Coordinator Service
Purpose: Intelligent task distribution and load balancing across agent mesh.
Key Features:
- Task queue management with priority levels
- Agent capability matching
- Multiple load balancing strategies
- Fault tolerance with automatic retry
- Task status tracking and result retrieval
- Workload analytics per agent
Architecture:
interface Task {
  taskId: string;
  taskType: string;
  payload: any;
  requiredCapabilities: string[];
  priority: 'low' | 'medium' | 'high' | 'critical';
  timeout: number;
  retries: number;
  metadata?: Record<string, any>;
}
interface TaskAssignment {
  taskId: string;
  agentId: string;
  assignedAt: Date;
  status: 'pending' | 'running' | 'completed' | 'failed' | 'timeout';
  result?: any;
  error?: string;
}
Task Routing Flow:
1. Task submitted to Coordinator
2. Coordinator analyzes required capabilities
3. Discovery Service queried for capable agents
4. Load balancing strategy applied
5. Task assigned to selected agent
6. Transport Service delivers task
7. Agent executes and returns result
8. Coordinator updates task status
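Before routing begins, the Coordinator has to decide which queued task to pick up next. The sketch below shows one way to order by the four priority levels; `QueuedTask`, its `enqueuedAt` field, and the first-in-first-out tie-break are assumptions made for illustration, not documented behavior.

```typescript
type Priority = "low" | "medium" | "high" | "critical";

// Simplified queue entry; enqueuedAt (ms timestamp) is a hypothetical field
// used only to break ties between tasks of equal priority.
interface QueuedTask {
  taskId: string;
  priority: Priority;
  enqueuedAt: number;
}

// Lower rank is served first.
const PRIORITY_RANK: Record<Priority, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// Returns the task to route next without mutating the queue.
function nextTask(queue: QueuedTask[]): QueuedTask | undefined {
  const ordered = queue.slice().sort(
    (a, b) =>
      PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] ||
      a.enqueuedAt - b.enqueuedAt,
  );
  return ordered[0];
}
```

Sorting a copy on every call is fine for small queues; a heap would be the usual choice once the queue grows.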
Load Balancing Strategies:
Round-Robin
- Algorithm: Sequential distribution across agents
- Use Case: Homogeneous agent pools
- Characteristics: Simple and predictable; the only state is a rotating index
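A round-robin selector needs little more than a cursor that advances on every pick. The class below is a minimal sketch with a hypothetical name (`RoundRobinSelector`); it is not the Coordinator's actual implementation.

```typescript
class RoundRobinSelector<T> {
  private cursor = 0;

  // Returns the next candidate in rotation, or undefined for an empty pool.
  select(candidates: T[]): T | undefined {
    if (candidates.length === 0) return undefined;
    const chosen = candidates[this.cursor % candidates.length];
    this.cursor = (this.cursor + 1) % candidates.length;
    return chosen;
  }
}

const rrSelector = new RoundRobinSelector<string>();
const rrPool = ["worker-1", "worker-2", "worker-3"];
const rrPicks = [1, 2, 3, 4].map(() => rrSelector.select(rrPool));
// rrPicks is ["worker-1", "worker-2", "worker-3", "worker-1"]
```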
Least-Loaded
- Algorithm: Select agent with minimum active tasks
- Use Case: Heterogeneous performance characteristics
- Characteristics: Balances workload dynamically
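As a sketch of the least-loaded rule, assuming the Coordinator tracks a per-agent count of active tasks (the `activeTasks` field and the `selectLeastLoaded` name are hypothetical):

```typescript
// Hypothetical view of an agent's current workload.
interface LoadedAgent {
  agentId: string;
  activeTasks: number;
}

// Picks the agent with the fewest active tasks; ties go to the first seen.
function selectLeastLoaded(agents: LoadedAgent[]): LoadedAgent | undefined {
  let best: LoadedAgent | undefined;
  for (const agent of agents) {
    if (best === undefined || agent.activeTasks < best.activeTasks) {
      best = agent;
    }
  }
  return best;
}
```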
Capability-Match
- Algorithm: Prefer agents with exact capability matches
- Use Case: Specialized task requirements
- Characteristics: Optimizes for task-agent affinity
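One possible reading of "prefer exact matches" is to pick, among agents that cover every required capability, the one with the fewest capabilities beyond those required. That interpretation, and the names `CapableAgent` and `selectByCapabilityMatch`, are assumptions for this sketch.

```typescript
interface CapableAgent {
  agentId: string;
  capabilities: string[];
}

// Among agents that satisfy all requirements, prefer the one with the fewest
// extra capabilities (zero extras is an exact match).
function selectByCapabilityMatch(
  agents: CapableAgent[],
  required: string[],
): CapableAgent | undefined {
  let best: CapableAgent | undefined;
  let bestExtras = Infinity;
  for (const agent of agents) {
    const hasAll = required.every((cap) => agent.capabilities.indexOf(cap) !== -1);
    if (!hasAll) continue;
    const extras = agent.capabilities.filter((cap) => required.indexOf(cap) === -1).length;
    if (extras < bestExtras) {
      best = agent;
      bestExtras = extras;
    }
  }
  return best;
}
```

Keeping generalist agents free for tasks that only they can serve is the main motivation for this kind of affinity rule.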
Random
- Algorithm: Random selection from capable agents
- Use Case: Testing and development
- Characteristics: Stateless, no coordination overhead
3. Transport Service
Purpose: Secure, reliable agent-to-agent communication over Tailscale network.
Key Features:
- HTTP/HTTPS over Tailscale encrypted mesh
- Request/response messaging patterns
- Streaming support for large payloads
- Broadcast messaging to multiple agents
- Connection pooling and reuse
- Automatic retry with exponential backoff
- Connection statistics and monitoring
Architecture:
interface Message {
  messageId: string;
  from: string; // Source agent ID
  to: string; // Target agent ID
  type: 'request' | 'response' | 'event' | 'stream';
  payload: any;
  timestamp: Date;
  correlationId?: string;
  metadata?: Record<string, any>;
}
interface TransportConfig {
  timeout: number; // Default: 30000ms
  retries: number; // Default: 3
  retryDelay: number; // Default: 1000ms
  maxConcurrentConnections: number; // Default: 10
  keepAlive: boolean; // Default: true
  useTLS: boolean; // Default: false (Tailscale provides encryption)
}
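To make the retry settings concrete, the sketch below derives a backoff schedule from `retries` and `retryDelay`. The doubling factor and the 30-second cap are assumptions; the source states only that backoff is exponential.

```typescript
// Computes the wait before each retry attempt.
// Assumptions: delay doubles per attempt and is capped at maxDelayMs.
function backoffDelays(
  retries: number,
  retryDelayMs: number,
  maxDelayMs: number = 30000,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(retryDelayMs * Math.pow(2, attempt), maxDelayMs));
  }
  return delays;
}

// With the documented defaults (retries: 3, retryDelay: 1000ms):
const defaultSchedule = backoffDelays(3, 1000);
// defaultSchedule is [1000, 2000, 4000]
```

In practice a small random jitter is often added to each delay so that many agents retrying at once do not synchronize.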
Communication Patterns:
Request/Response
Agent A → [REQUEST] → Agent B
Agent A ← [RESPONSE] ← Agent B
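A response can be tied back to its request through `correlationId`. The sketch below assumes the response carries the request's `messageId` in that field, which is a common convention but is not confirmed by the source; `MeshMessage` and `buildResponse` are illustrative names.

```typescript
// Local copy of the message shape so the example is self-contained.
interface MeshMessage {
  messageId: string;
  from: string;
  to: string;
  type: "request" | "response" | "event" | "stream";
  payload: unknown;
  timestamp: Date;
  correlationId?: string;
}

// Builds the reply to a request: sender and receiver are swapped and the
// request's messageId becomes the correlationId (assumed convention).
function buildResponse(
  request: MeshMessage,
  payload: unknown,
  messageId: string,
): MeshMessage {
  return {
    messageId,
    from: request.to,
    to: request.from,
    type: "response",
    payload,
    timestamp: new Date(),
    correlationId: request.messageId,
  };
}
```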
Event Broadcasting
Coordinator → [EVENT] → Agent 1
[EVENT] → Agent 2
[EVENT] → Agent 3
Streaming
Agent A → [CHUNK 1] → Agent B
→ [CHUNK 2] →
→ [CHUNK 3] →
→ [COMPLETE] →
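The chunk-then-complete pattern can be sketched as splitting a payload into numbered frames followed by a terminator. The `StreamFrame` shape and `toStreamFrames` helper are hypothetical, and string payloads are assumed for simplicity.

```typescript
// Hypothetical frame format: numbered chunks followed by one "complete" marker.
interface StreamFrame {
  streamId: string;
  sequence: number;
  kind: "chunk" | "complete";
  data: string;
}

function toStreamFrames(
  streamId: string,
  payload: string,
  chunkSize: number,
): StreamFrame[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  const frames: StreamFrame[] = [];
  for (let offset = 0; offset < payload.length; offset += chunkSize) {
    frames.push({
      streamId,
      sequence: frames.length + 1,
      kind: "chunk",
      data: payload.slice(offset, offset + chunkSize),
    });
  }
  // The terminator lets the receiver know no further chunks will arrive.
  frames.push({ streamId, sequence: frames.length + 1, kind: "complete", data: "" });
  return frames;
}

// "abcdefgh" in chunks of 3 yields "abc", "def", "gh", then the terminator.
const exampleFrames = toStreamFrames("stream-1", "abcdefgh", 3);
```

Sequence numbers let the receiver detect missing or reordered chunks before reassembling the payload.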
Transport Flow:
1. Coordinator creates message with metadata
2. Transport Service resolves target agent endpoint
3. HTTP client created/reused from connection pool
4. Message sent over Tailscale encrypted network
5. Target agent receives and processes
6. Response returned via same channel
7. Connection statistics updated
4. Auth Service
Purpose: Zero-trust authentication and authorization for agent-to-agent communication.
Key Features:
- JWT-based agent authentication
- Role-based access control (RBAC)
- ACL policies per agent type
- Permission checking at API level
- Security audit logging
- Token revocation support
Architecture:
interface AuthToken {
  agentId: string;
  agentType: string;
  namespace: string;
  capabilities: string[];
  issuedAt: number;
  expiresAt: number;
}
interface Permission {
  resource: string;
  action: 'read' | 'write' | 'execute' | 'admin';
  namespace?: string;
}
interface ACLPolicy {
  agentType: string;
  allowedActions: string[];
  allowedResources: string[];
  deniedActions?: string[];
  deniedResources?: string[];
}
Default ACL Policies:
| Agent Type | Allowed Actions | Allowed Resources | Denied Actions |
|---|---|---|---|
| orchestrator | * (all) | * (all) | None |
| worker | read, execute | tasks, results | admin |
| monitor | read | * (all) | write, execute, admin |
| integrator | read, write | integrations, data | admin |
| governor | read, write, admin | policies, acl, audit | None |
| critic | read, write | reviews, feedback, metrics | admin |
Authentication Flow:
1. Agent requests auth token with identity
2. Auth Service generates JWT with claims
3. Token includes agent type and capabilities
4. Token signed with secret key
5. Agent includes token in mesh operations
6. Auth Service verifies token on each request
7. ACL policy checked for permission
8. Audit log entry created
Permission Check Flow:
Request → Extract Token → Verify Token → Check ACL Policy
↓ ↓ ↓
Token Valid? Not Expired? Permission Granted?
↓ ↓ ↓
Success Success Success / Deny
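The expiry and ACL stages of this flow can be sketched as a single check. Signature verification is deliberately left out, and several details are assumptions: `*` is treated as a wildcard, denials override allowances, and `expiresAt` uses the same time unit as the `now` argument. The two policies mirror the worker and monitor rows of the default table.

```typescript
interface AclPolicy {
  agentType: string;
  allowedActions: string[];
  allowedResources: string[];
  deniedActions?: string[];
  deniedResources?: string[];
}

// Subset of token claims needed for the check.
interface TokenClaims {
  agentType: string;
  expiresAt: number;
}

// True when the list contains the value or the "*" wildcard.
function listMatches(list: string[] | undefined, value: string): boolean {
  return list !== undefined && (list.indexOf("*") !== -1 || list.indexOf(value) !== -1);
}

function isAllowed(
  claims: TokenClaims,
  policies: Record<string, AclPolicy>,
  resource: string,
  action: string,
  now: number,
): boolean {
  if (claims.expiresAt <= now) return false; // expired token
  const policy: AclPolicy | undefined = policies[claims.agentType];
  if (policy === undefined) return false; // unknown agent type: deny by default
  if (listMatches(policy.deniedActions, action)) return false;
  if (listMatches(policy.deniedResources, resource)) return false;
  return listMatches(policy.allowedActions, action) && listMatches(policy.allowedResources, resource);
}

const samplePolicies: Record<string, AclPolicy> = {
  worker: {
    agentType: "worker",
    allowedActions: ["read", "execute"],
    allowedResources: ["tasks", "results"],
    deniedActions: ["admin"],
  },
  monitor: {
    agentType: "monitor",
    allowedActions: ["read"],
    allowedResources: ["*"],
    deniedActions: ["write", "execute", "admin"],
  },
};
```

Denying by default when no policy exists keeps an unrecognized agent type from gaining access by accident.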
Tailscale Integration
MagicDNS Service Discovery
How It Works:
1. Agent Mesh registers service with Tailscale
2. Tailscale creates DNS entry: agent-mesh-{namespace}-{agentId}.tailnet.ts.net
3. DNS automatically resolves to agent's Tailscale IP
4. No configuration files or service registry needed
5. Agents discover each other via DNS queries
Example DNS Entries:
agent-mesh-production-worker-1.tailnet.ts.net → 100.64.0.1
agent-mesh-production-worker-2.tailnet.ts.net → 100.64.0.2
agent-mesh-staging-worker-1.tailnet.ts.net → 100.64.0.3
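The naming pattern can be captured in a pair of small helpers. This is a sketch: `tailnet.ts.net` is the placeholder suffix used in the examples above (a real tailnet has its own name), the `http` scheme follows the `useTLS: false` default, and both function names are hypothetical.

```typescript
// Placeholder suffix taken from the examples; substitute the real tailnet name.
const TAILNET_SUFFIX = "tailnet.ts.net";

// Builds the MagicDNS hostname following agent-mesh-{namespace}-{agentId}.
function meshHostname(namespace: string, agentId: string): string {
  return `agent-mesh-${namespace}-${agentId}.${TAILNET_SUFFIX}`;
}

// Builds an HTTP endpoint for the agent on the given port.
function meshEndpoint(namespace: string, agentId: string, port: number): string {
  return `http://${meshHostname(namespace, agentId)}:${port}`;
}
```

For example, `meshHostname("production", "worker-1")` yields `agent-mesh-production-worker-1.tailnet.ts.net`.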
WireGuard Encryption
Security Features:
- End-to-end encryption for all agent communication
- Perfect forward secrecy
- Automatic key rotation
- No certificate management required
- Zero-trust network model
NAT Traversal
Automatic Direct Connections:
- Tailscale handles NAT traversal automatically
- Direct peer-to-peer connections when possible
- DERP relay fallback for complex NAT scenarios
- No port forwarding or firewall configuration needed
Data Flow Patterns
Task Execution Flow
┌──────────────┐
│ Orchestrator │
│ Agent │
└──────┬───────┘
│ 1. Submit Task
▼
┌──────────────┐
│ Coordinator │
│ Service │
└──────┬───────┘
│ 2. Query Capable Agents
▼
┌──────────────┐
│ Discovery │
│ Service │
└──────┬───────┘
│ 3. Return Agent List
▼
┌──────────────┐
│ Coordinator │ 4. Apply Load Balancing
│ Service │ 5. Select Agent
└──────┬───────┘
│ 6. Send Task
▼
┌──────────────┐
│ Transport │
│ Service │
└──────┬───────┘
│ 7. Deliver via HTTP
▼
┌──────────────┐
│ Worker │
│ Agent │ 8. Execute Task
└──────┬───────┘
│ 9. Return Result
▼
┌──────────────┐
│ Transport │
│ Service │
└──────┬───────┘
│ 10. Store Result
▼
┌──────────────┐
│ Coordinator │
│ Service │ 11. Update Status
└──────┬───────┘
│ 12. Notify Orchestrator
▼
┌──────────────┐
│ Orchestrator │
│ Agent │
└──────────────┘
Agent Registration Flow
┌──────────────┐
│ New Agent │
└──────┬───────┘
│ 1. Register Request
▼
┌──────────────┐
│ Discovery │
│ Service │
└──────┬───────┘
│ 2. Validate Identity
│ 3. Check Tailscale Status
▼
┌──────────────┐
│ Tailscale │
│ Service │
└──────┬───────┘
│ 4. Register DNS Entry
│ 5. Get Tailscale IP
▼
┌──────────────┐
│ Discovery │
│ Service │ 6. Create Agent Entry
└──────┬───────┘ 7. Start Heartbeat
│ 8. Return Endpoint
▼
┌──────────────┐
│ New Agent │
│ (Active) │
└──────────────┘
Health Monitoring Flow
┌──────────────┐
│ Discovery │
│ Service │ ◄─── Heartbeat Timer (30s)
└──────┬───────┘
│
│ For Each Registered Agent:
│
├─ Check Last Seen Timestamp
│ │
│ ├─ < 1 min → Healthy
│ ├─ 1-5 min → Degraded
│ └─ > 5 min → Unreachable
│
├─ Update Agent Status
│
└─ Log Status Changes
Scalability Considerations
Horizontal Scaling
Coordinator Service:
- Stateless design allows multiple instances
- Task queue can be shared via Redis
- Load balancer distributes requests
- Capacity: 10,000+ tasks/second per instance
Discovery Service:
- Agent registry can be distributed
- Health checks can be partitioned
- Capacity: 1,000+ agents per instance
Transport Service:
- Connection pooling per agent
- Async I/O for concurrent requests
- Capacity: 10,000+ concurrent connections
Vertical Scaling
Memory Requirements:
- Discovery Service: ~100MB + (10KB per agent)
- Coordinator Service: ~200MB + (5KB per active task)
- Transport Service: ~150MB + (2KB per connection)
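For capacity planning, the memory figures above can be turned into a rough estimator. The sketch treats 1 MB as 1000 KB to keep the arithmetic readable; the function and type names are hypothetical, and the result is only as accurate as the approximate figures it is built from.

```typescript
interface MeshLoad {
  agents: number;
  activeTasks: number;
  connections: number;
}

interface MemoryEstimateMb {
  discovery: number;
  coordinator: number;
  transport: number;
  total: number;
}

// Base footprint plus per-item cost, all expressed in MB (1 MB = 1000 KB).
function estimateMemoryMb(load: MeshLoad): MemoryEstimateMb {
  const discovery = 100 + (10 * load.agents) / 1000;
  const coordinator = 200 + (5 * load.activeTasks) / 1000;
  const transport = 150 + (2 * load.connections) / 1000;
  return { discovery, coordinator, transport, total: discovery + coordinator + transport };
}

const memoryEstimate = estimateMemoryMb({ agents: 1000, activeTasks: 5000, connections: 10000 });
// discovery: 110, coordinator: 225, transport: 170, total: 505 (MB)
```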
CPU Requirements:
- Task routing: ~0.1ms per task
- Health checking: ~1ms per agent per check
- Transport overhead: ~0.5ms per message
Fault Tolerance
Agent Failure Handling
Detection:
- Missed heartbeats, evaluated on the 30-second heartbeat timer
- Failed health checks
- Transport errors
Recovery:
1. Mark agent as unreachable
2. Reassign active tasks to healthy agents
3. Update agent discovery cache
4. Notify monitors
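Step 2 of the recovery sequence can be sketched as a pure function over the current assignments. Spreading the orphaned tasks across healthy agents in rotation and resetting them to `pending` are assumptions; marking the agent unreachable, updating the cache, and notifying monitors are outside this example.

```typescript
interface ActiveAssignment {
  taskId: string;
  agentId: string;
  status: "pending" | "running" | "completed" | "failed" | "timeout";
}

// Moves pending/running tasks off the failed agent, rotating through the
// healthy agents. Finished tasks and other agents' tasks are left untouched.
function reassignFromFailedAgent(
  assignments: ActiveAssignment[],
  failedAgentId: string,
  healthyAgentIds: string[],
): ActiveAssignment[] {
  let next = 0;
  return assignments.map((assignment): ActiveAssignment => {
    const isActive = assignment.status === "pending" || assignment.status === "running";
    if (assignment.agentId !== failedAgentId || !isActive) return assignment;
    if (healthyAgentIds.length === 0) return assignment; // nowhere to move it
    const target = healthyAgentIds[next % healthyAgentIds.length];
    if (target === undefined) return assignment;
    next++;
    return { taskId: assignment.taskId, agentId: target, status: "pending" };
  });
}
```

Returning a new array instead of mutating the input makes it straightforward to log the before and after state for auditing.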
Task Failure Handling
Retry Strategy:
- Automatic retry with exponential backoff
- Configurable retry count (default: 3)
- Task timeout enforcement
- Alternative agent selection on retry
Timeout Handling:
- Default timeout: 5 minutes
- Configurable per task
- Automatic cleanup of timed-out tasks
- Result storage for forensics
Network Partition Handling
Tailscale Resilience:
- Automatic reconnection
- Multiple relay servers
- Relay (DERP) fallback when a direct connection cannot be established
- Health status propagation
Performance Optimization
Connection Pooling
- HTTP clients cached per agent
- Keep-alive connections
- Automatic cleanup of stale connections
- Max pool size: 10 connections per agent
Request Batching
- Broadcast messages sent in parallel
- Task submissions queued and batched
- Status checks aggregated
Caching
- Agent discovery results cached (60s TTL)
- ACL policies cached in memory
- Tailscale status cached (30s TTL)
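The TTL-based entries above could be backed by a small expiring cache. The class below is a minimal sketch with a hypothetical name (`TtlCache`); the injectable clock exists only to make expiry testable and is not something the source describes.

```typescript
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Stores a value that expires ttlMs after insertion.
  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns the cached value, or undefined when missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict stale entries
      return undefined;
    }
    return entry.value;
  }
}

// Discovery results would use a 60s TTL, Tailscale status a 30s TTL.
const discoveryCache = new TtlCache<string[]>(60 * 1000);
const tailscaleStatusCache = new TtlCache<string>(30 * 1000);
```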
Monitoring and Observability
Metrics Collected
Coordinator Metrics:
- Pending tasks count
- Running tasks count
- Completed tasks count
- Failed tasks count
- Average execution time
Discovery Metrics:
- Total agents count
- Healthy agents count
- Degraded agents count
- Unreachable agents count
Transport Metrics:
- Active connections count
- Total requests count
- Failed requests count
- Average latency
Auth Metrics:
- Tokens generated count
- Tokens verified count
- Permission grants count
- Permission denials count
Logging
Structured Logging:
- JSON format
- Log levels: DEBUG, INFO, WARN, ERROR
- Contextual metadata per log entry
Audit Logging:
- All authentication events
- All permission checks
- Agent registration/deregistration
- Task assignments and completions
Security Architecture
Threat Model
Network Layer:
- Threat: Man-in-the-middle attacks
- Mitigation: Tailscale WireGuard encryption
Application Layer:
- Threat: Unauthorized agent access
- Mitigation: JWT authentication + ACL policies
Data Layer:
- Threat: Data leakage via compromised agent
- Mitigation: Namespace isolation + least privilege
Security Best Practices
- Rotate JWT secrets regularly (recommended: monthly)
- Review ACL policies for each agent type
- Monitor audit logs for suspicious activity
- Use namespace isolation for different environments
- Limit agent capabilities to minimum required
- Enable Tailscale ACLs for network-level filtering
Extension Points
Custom Agent Types
// Define new agent type with custom ACL
await authService.setACLPolicy('custom-processor', {
  agentType: 'custom-processor',
  allowedActions: ['read', 'execute'],
  allowedResources: ['custom-data', 'results'],
  deniedActions: ['admin']
});
Custom Load Balancing
// Implement custom strategy
class CustomLoadBalancer implements LoadBalancingStrategy {
  selectAgent(agents: MeshAgent[], task: Task): MeshAgent {
    // Custom logic
    return agents[0];
  }
}
Custom Transport Protocols
// Add WebSocket support
class WebSocketTransport implements IMeshTransportService {
  // Implementation
}
Future Enhancements
- GraphQL subscription support for real-time updates
- gRPC transport option for high-performance RPC
- Multi-region mesh federation
- Kubernetes operator for agent lifecycle
- Distributed tracing integration (OpenTelemetry)
- Vector database integration for semantic agent discovery
Related Documentation:
- Home: Overview and quick start
- Deployment: Production deployment guide
- Development: Development and contribution guide