Agent Mesh - Deployment Guide
Production Deployment Strategies for Distributed Agent Mesh Networks
Overview
This guide covers deploying Agent Mesh in production environments, including Tailscale configuration, agent deployment, security hardening, monitoring setup, and operational best practices.
Prerequisites
Infrastructure Requirements
Minimum Requirements (small deployment, <100 agents):
- 2 vCPUs, 4GB RAM per coordinator instance
- 1 vCPU, 2GB RAM per agent instance
- Network: Tailscale connectivity
- Storage: 10GB per instance
Recommended Requirements (large deployment, 100-1000 agents):
- 8 vCPUs, 16GB RAM per coordinator instance
- 2 vCPUs, 4GB RAM per agent instance
- Network: Tailscale with a dedicated subnet
- Storage: 50GB per instance
- Load balancer for coordinator instances
Software Dependencies
- Node.js: 20.x or later
- npm: 10.x or later
- Tailscale: Latest stable version
- Docker (optional): For containerized deployments
- Kubernetes (optional): For orchestrated deployments
Network Requirements
- Outbound HTTPS (443) for Tailscale coordination
- Inbound UDP (41641) for WireGuard
- Agent-to-agent communication over Tailscale network
- DNS resolution for Tailscale MagicDNS
Tailscale Setup
1. Install Tailscale
Ubuntu/Debian
curl -fsSL https://tailscale.com/install.sh | sh
macOS
brew install tailscale
Docker
docker pull tailscale/tailscale:latest
2. Authenticate with Tailscale
# Interactive authentication
sudo tailscale up
# With auth key (for automation)
sudo tailscale up --authkey=tskey-auth-XXXXX
3. Enable MagicDNS
# Via Tailscale admin console
# Settings → DNS → Enable MagicDNS
4. Configure Tailscale ACLs (Optional)
Create ACLs to restrict agent-to-agent communication:
{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:agent-mesh"],
      "dst": ["tag:agent-mesh:*"]
    },
    {
      "action": "accept",
      "src": ["tag:coordinator"],
      "dst": ["tag:agent-mesh:*"]
    }
  ],
  "tagOwners": {
    "tag:agent-mesh": ["autogroup:admin"],
    "tag:coordinator": ["autogroup:admin"]
  }
}
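These rules only apply to nodes that actually carry the tags. Apply them at join time with the standard --advertise-tags flag (the auth key must be permitted to assume the tag):
# Tag the coordinator and agent hosts when they join the tailnet
sudo tailscale up --authkey=tskey-auth-XXXXX --advertise-tags=tag:coordinator
sudo tailscale up --authkey=tskey-auth-XXXXX --advertise-tags=tag:agent-mesh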
5. Verify Tailscale Status
tailscale status
# This node and its peers should be listed with their Tailscale IPs;
# verify MagicDNS is enabled in the admin console
Deployment Architectures
Architecture 1: Single Coordinator
Use Case: Development, small production (<50 agents)
┌─────────────────────────────────────────────────┐
│ Tailscale Network │
│ │
│ ┌──────────────┐ │
│ │ Coordinator │ │
│ │ Instance │ │
│ └──────┬───────┘ │
│ │ │
│ ┌────┼────┬────┬────┬────┐ │
│ │ │ │ │ │ │ │
│ ┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐ │
│ │Ag1││Ag2││Ag3││Ag4││Ag5││...│ │
│ └───┘└───┘└───┘└───┘└───┘└───┘ │
│ │
└─────────────────────────────────────────────────┘
Deployment Steps:
- Deploy coordinator:
# On coordinator host
cd agent-buildkit
npm install
npm run build
# Set environment variables
export MESH_JWT_SECRET="your-secret-key"
export TAILSCALE_ENABLED=true
# Start coordinator
node dist/cli/index.js agent:mesh status &
- Deploy agents:
# On each agent host
buildkit agent:mesh deploy \
--agent-id worker-1 \
--agent-name "Worker 1" \
--agent-type worker \
--namespace production \
--capabilities "task-execution,data-processing" \
--port 3000
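Once the agents are up, verify that they registered with the coordinator, using the same mesh CLI commands shown throughout this guide:
# From any node on the tailnet
buildkit agent:mesh status
buildkit agent:mesh discover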
Architecture 2: High Availability Coordinators
Use Case: Production (50-500 agents), high availability required
┌─────────────────────────────────────────────────────────────┐
│ Tailscale Network │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Coordinator 1│ │Coordinator 2│ │Coordinator 3│ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ │ Load Balancer │ │
│ │ (Round-robin DNS or HAProxy) │ │
│ └───────────────────┬───────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ │ │ │
│ ┌────┼────┬────┬────┬────┬────┬────┬────┐ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ ┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐ │
│ │Ag1││Ag2││Ag3││Ag4││Ag5││...││...││...││...│ │
│ └───┘└───┘└───┘└───┘└───┘└───┘└───┘└───┘└───┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Requirements:
- Shared task queue (Redis or PostgreSQL)
- Shared agent registry (Redis or PostgreSQL)
- Load balancer (DNS round-robin or HAProxy)
Deployment Steps:
- Deploy Redis for shared state:
docker run -d --name agent-mesh-redis \
-p 6379:6379 \
redis:7-alpine
- Configure coordinators for HA:
# On each coordinator
export REDIS_URL="redis://redis-host:6379"
export MESH_JWT_SECRET="shared-secret-key"
export COORDINATOR_ID="coordinator-1"
- Deploy load balancer:
# HAProxy config
frontend agent_mesh_lb
    bind *:8080
    default_backend agent_mesh_coordinators

backend agent_mesh_coordinators
    balance roundrobin
    server coord1 coord1.tailnet.ts.net:3000 check
    server coord2 coord2.tailnet.ts.net:3000 check
    server coord3 coord3.tailnet.ts.net:3000 check
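By default, check performs a plain TCP connect probe. If your coordinators expose the /health endpoint described under Monitoring, an HTTP probe catches more failure modes; a hedged addition inside the backend section:
# Inside backend agent_mesh_coordinators: probe over HTTP instead of bare TCP
# (assumes /health returns 200 when the coordinator is healthy)
option httpchk GET /health
http-check expect status 200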
Architecture 3: Multi-Region Mesh
Use Case: Global deployment (500+ agents), multi-region latency optimization
┌────────────────────────────────────────────────────────────┐
│ Global Tailscale Network │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ US-East Region │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │Coord US-E│◄───────►│Agents │ │ │
│ │ └─────┬────┘ └──────────┘ │ │
│ └────────┼──────────────────────────────────────────────┘ │
│ │ │
│ │ ┌────────────┐ │
│ ├─────────►│ Redis │◄──────────┐ │
│ │ │ Cluster │ │ │
│ │ └────────────┘ │ │
│ │ │ │
│ ┌────────┼──────────────────────────────────┼──────────┐ │
│ │ │ EU-West Region │ │ │
│ │ ┌─────▼────┐ ┌──────────┐ │ │ │
│ │ │Coord EU-W│◄───────►│Agents │ │ │ │
│ │ └──────────┘ └──────────┘ │ │ │
│ └───────────────────────────────────────────┼──────────┘ │
│ │ │
│ ┌──────────────────────────────────────────┼──────────┐ │
│ │ APAC Region │ │ │
│ │ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │Coord APAC│◄───────►│Agents │ │ │ │
│ │ └─────┬────┘ └──────────┘ │ │ │
│ └────────┴──────────────────────────────────┘ │ │
│ │
└────────────────────────────────────────────────────────────┘
Regional Deployment:
- Deploy regional coordinators
- Configure region-aware routing
- Sync state via distributed Redis
- Use Tailscale subnet routers for region isolation (a per-region configuration sketch follows)
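How region awareness is wired up depends on your build. As a sketch, each regional coordinator gets a distinct COORDINATOR_ID and points at the shared Redis cluster; the REGION variable below is hypothetical, shown only to illustrate the shape of a region-aware configuration:
# On each US-East coordinator (REGION is illustrative, not a documented variable)
export COORDINATOR_ID="coordinator-us-east-1"
export REGION="us-east"
export REDIS_URL="redis://redis-global.tailnet.ts.net:6379"
export MESH_JWT_SECRET="shared-secret-key"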
Docker Deployment
Coordinator Container
Dockerfile:
FROM node:20-alpine
WORKDIR /app
# Install dependencies
COPY package*.json ./
RUN npm ci --omit=dev
# Copy application
COPY dist ./dist
COPY config ./config
# Install Tailscale (the static tarball ships both the CLI and the tailscaled daemon)
RUN apk add --no-cache \
    ca-certificates \
    iptables \
    ip6tables \
    && wget -O /tmp/tailscale.tgz https://pkgs.tailscale.com/stable/tailscale_latest_amd64.tgz \
    && tar -xzf /tmp/tailscale.tgz -C /tmp --strip-components=1 \
    && mv /tmp/tailscale /tmp/tailscaled /usr/local/bin/ \
    && rm /tmp/tailscale.tgz
# Expose ports
EXPOSE 3000
# Start script
COPY scripts/docker-entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["coordinator"]
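The Dockerfile copies scripts/docker-entrypoint.sh without showing it. A minimal sketch, assuming the image carries both tailscaled and the tailscale CLI (installed above) and that the start commands mirror the ones used earlier in this guide:
#!/bin/sh
# docker-entrypoint.sh (sketch): join the tailnet, then start the requested role
set -e

# Start the Tailscale daemon with state on the mounted volume
tailscaled --state=/var/lib/tailscale/tailscaled.state \
  --socket=/var/run/tailscale/tailscaled.sock &

# Join the tailnet non-interactively with the injected auth key
tailscale up --authkey="${TAILSCALE_AUTHKEY}" --hostname="${AGENT_ID:-coordinator}"

# Hand off to the requested role (passed as CMD)
case "$1" in
  coordinator)
    exec node dist/cli/index.js agent:mesh status
    ;;
  worker)
    exec node dist/cli/index.js agent:mesh deploy \
      --agent-id "${AGENT_ID}" \
      --agent-type "${AGENT_TYPE}" \
      --namespace "${AGENT_NAMESPACE}" \
      --capabilities "${AGENT_CAPABILITIES}"
    ;;
  *)
    exec "$@"
    ;;
esac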
docker-compose.yml:
version: '3.8'

services:
  coordinator:
    build: .
    image: agent-mesh-coordinator:latest
    environment:
      - MESH_JWT_SECRET=${MESH_JWT_SECRET}
      - TAILSCALE_AUTHKEY=${TAILSCALE_AUTHKEY}
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    cap_add:
      - NET_ADMIN
    devices:
      - /dev/net/tun
    volumes:
      - ./config:/app/config
      - tailscale-data:/var/lib/tailscale

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes

  agent-worker-1:
    build: .
    image: agent-mesh-worker:latest
    environment:
      - AGENT_ID=worker-1
      - AGENT_TYPE=worker
      - AGENT_NAMESPACE=production
      - AGENT_CAPABILITIES=task-execution,data-processing
      - TAILSCALE_AUTHKEY=${TAILSCALE_AUTHKEY}
    cap_add:
      - NET_ADMIN
    devices:
      - /dev/net/tun
    volumes:
      - tailscale-worker-1:/var/lib/tailscale

volumes:
  redis-data:
  tailscale-data:
  tailscale-worker-1:
Deploy:
# Build images
docker-compose build
# Start services
docker-compose up -d
# Check status
docker-compose ps
docker-compose logs coordinator
Kubernetes Deployment
Namespace Setup
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: agent-mesh
  labels:
    name: agent-mesh
Coordinator Deployment
# coordinator-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-mesh-coordinator
  namespace: agent-mesh
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-mesh-coordinator
  template:
    metadata:
      labels:
        app: agent-mesh-coordinator
    spec:
      containers:
        - name: coordinator
          image: agent-mesh-coordinator:0.1.2
          env:
            - name: MESH_JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: agent-mesh-secrets
                  key: jwt-secret
            - name: TAILSCALE_AUTHKEY
              valueFrom:
                secretKeyRef:
                  name: agent-mesh-secrets
                  key: tailscale-authkey
            - name: REDIS_URL
              value: redis://agent-mesh-redis:6379
          ports:
            - containerPort: 3000
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
          volumeMounts:
            - name: tailscale-state
              mountPath: /var/lib/tailscale
      volumes:
        - name: tailscale-state
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: agent-mesh-coordinator
  namespace: agent-mesh
spec:
  selector:
    app: agent-mesh-coordinator
  ports:
    - port: 3000
      targetPort: 3000
  type: ClusterIP
Worker Deployment
# worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-mesh-worker
  namespace: agent-mesh
spec:
  replicas: 10
  selector:
    matchLabels:
      app: agent-mesh-worker
  template:
    metadata:
      labels:
        app: agent-mesh-worker
    spec:
      containers:
        - name: worker
          image: agent-mesh-worker:0.1.2
          env:
            - name: AGENT_TYPE
              value: worker
            - name: AGENT_NAMESPACE
              value: production
            - name: AGENT_CAPABILITIES
              value: task-execution,data-processing
            - name: TAILSCALE_AUTHKEY
              valueFrom:
                secretKeyRef:
                  name: agent-mesh-secrets
                  key: tailscale-authkey
            - name: COORDINATOR_URL
              value: http://agent-mesh-coordinator:3000
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
          volumeMounts:
            - name: tailscale-state
              mountPath: /var/lib/tailscale
      volumes:
        - name: tailscale-state
          emptyDir: {}
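Workers are stateless, so they are a natural target for autoscaling. A hedged HorizontalPodAutoscaler sketch using standard CPU-utilization scaling; note that this requires CPU resource requests on the worker containers, which the Deployment above does not yet set:
# worker-hpa.yaml (optional autoscaling sketch)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-mesh-worker
  namespace: agent-mesh
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-mesh-worker
  minReplicas: 10
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70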
Redis StatefulSet
# redis-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: agent-mesh-redis
  namespace: agent-mesh
spec:
  serviceName: agent-mesh-redis
  replicas: 1
  selector:
    matchLabels:
      app: agent-mesh-redis
  template:
    metadata:
      labels:
        app: agent-mesh-redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: agent-mesh-redis
  namespace: agent-mesh
spec:
  selector:
    app: agent-mesh-redis
  ports:
    - port: 6379
      targetPort: 6379
  type: ClusterIP
Secrets
# secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: agent-mesh-secrets
  namespace: agent-mesh
type: Opaque
stringData:
  jwt-secret: "your-jwt-secret-change-me"
  tailscale-authkey: "tskey-auth-XXXXX"
Deploy to Kubernetes
# Create namespace
kubectl apply -f namespace.yaml
# Create secrets
kubectl apply -f secrets.yaml
# Deploy Redis
kubectl apply -f redis-statefulset.yaml
# Deploy coordinator
kubectl apply -f coordinator-deployment.yaml
# Deploy workers
kubectl apply -f worker-deployment.yaml
# Verify deployments
kubectl get pods -n agent-mesh
kubectl logs -n agent-mesh deployment/agent-mesh-coordinator
Environment Configuration
Environment Variables
Coordinator:
# Required
MESH_JWT_SECRET=your-secret-key-here
TAILSCALE_ENABLED=true
# Optional
REDIS_URL=redis://localhost:6379
NODE_ENV=production
LOG_LEVEL=info
COORDINATOR_ID=coordinator-1
MAX_TASK_RETRIES=3
TASK_TIMEOUT_MS=300000
HEARTBEAT_INTERVAL_MS=30000
HEALTH_CHECK_INTERVAL_MS=60000
Agent:
# Required
AGENT_ID=worker-1
AGENT_TYPE=worker
AGENT_NAMESPACE=production
AGENT_CAPABILITIES=task-execution,data-processing
# Optional
COORDINATOR_URL=http://coordinator:3000
AGENT_PORT=3000
LOG_LEVEL=info
Configuration File
config/production.json:
{
  "mesh": {
    "coordinatorUrl": "http://coordinator.tailnet.ts.net:3000",
    "jwt": {
      "secret": "${MESH_JWT_SECRET}",
      "expiresIn": 3600
    },
    "discovery": {
      "heartbeatInterval": 30000,
      "healthCheckInterval": 60000,
      "unhealthyThreshold": 300000
    },
    "coordinator": {
      "maxTaskRetries": 3,
      "taskTimeout": 300000,
      "loadBalancingStrategy": "capability-match"
    },
    "transport": {
      "timeout": 30000,
      "retries": 3,
      "retryDelay": 1000,
      "maxConcurrentConnections": 10,
      "keepAlive": true
    }
  },
  "tailscale": {
    "enabled": true,
    "servicePrefix": "agent-mesh"
  },
  "logging": {
    "level": "info",
    "format": "json"
  }
}
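The "${MESH_JWT_SECRET}" placeholder implies the config loader expands environment variables. If your build doesn't already do this, a minimal TypeScript sketch of such a loader (the expansion regex and fail-fast behavior are assumptions, not the documented loader):
// Hypothetical loader sketch: read JSON config and expand ${VAR} placeholders
import { readFileSync } from 'node:fs';

function loadConfig(path: string): unknown {
  const raw = readFileSync(path, 'utf8');
  // Replace each ${NAME} with process.env.NAME; fail fast on missing values
  const expanded = raw.replace(/\$\{(\w+)\}/g, (_match, name: string) => {
    const value = process.env[name];
    if (value === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
  return JSON.parse(expanded);
}

const config = loadConfig('config/production.json');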
Security Hardening
1. JWT Secret Management
Generate Strong Secret:
# Generate 64-byte random secret
openssl rand -base64 64
Secret Rotation:
# 1. Generate new secret
NEW_SECRET=$(openssl rand -base64 64)
# 2. Deploy with both old and new (grace period)
export MESH_JWT_SECRET="$OLD_SECRET"
export MESH_JWT_SECRET_NEW="$NEW_SECRET"
# 3. After grace period, switch to new only
export MESH_JWT_SECRET="$NEW_SECRET"
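For the grace period to work, token verification has to accept both secrets. A sketch using the jsonwebtoken package; how your coordinator actually reads MESH_JWT_SECRET_NEW is an assumption based on the variables above:
// Sketch: during rotation, accept tokens signed with either secret
import jwt from 'jsonwebtoken';

function verifyMeshToken(token: string): string | jwt.JwtPayload {
  const secrets = [process.env.MESH_JWT_SECRET, process.env.MESH_JWT_SECRET_NEW]
    .filter((s): s is string => Boolean(s));
  for (const secret of secrets) {
    try {
      return jwt.verify(token, secret);
    } catch {
      // Signature mismatch: fall through and try the next secret
    }
  }
  throw new Error('Token did not verify against any configured mesh secret');
}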
2. Tailscale ACLs
Restrict Agent Communication:
{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:coordinator"],
      "dst": ["tag:worker:3000"]
    },
    {
      "action": "accept",
      "src": ["tag:worker"],
      "dst": ["tag:coordinator:3000"]
    }
  ]
}
3. Network Policies (Kubernetes)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-mesh-netpol
  namespace: agent-mesh
spec:
  podSelector:
    matchLabels:
      app: agent-mesh-worker
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: agent-mesh-coordinator
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: agent-mesh-coordinator
    - to:
        - podSelector:
            matchLabels:
              app: agent-mesh-redis
4. RBAC (Kubernetes)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-mesh-role
  namespace: agent-mesh
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-mesh-rolebinding
  namespace: agent-mesh
subjects:
  - kind: ServiceAccount
    name: agent-mesh-sa
    namespace: agent-mesh
roleRef:
  kind: Role
  name: agent-mesh-role
  apiGroup: rbac.authorization.k8s.io
Monitoring and Observability
Prometheus Metrics
Expose Metrics Endpoint:
// Add to the coordinator (assumes it serves HTTP via Express)
import express from 'express';
import { register, Counter, Gauge, Histogram } from 'prom-client';

const app = express();

const taskCounter = new Counter({
  name: 'agent_mesh_tasks_total',
  help: 'Total tasks processed',
  labelNames: ['status']
});

const agentGauge = new Gauge({
  name: 'agent_mesh_agents',
  help: 'Number of agents by status',
  labelNames: ['status']
});

// Feeds the "average task execution time" dashboards below
const taskDuration = new Histogram({
  name: 'agent_mesh_task_duration_seconds',
  help: 'Task execution time in seconds'
});

// Expose endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
Prometheus Config:
scrape_configs:
  - job_name: 'agent-mesh'
    static_configs:
      - targets: ['coordinator:3000']
    metrics_path: '/metrics'
    scrape_interval: 15s
Grafana Dashboard
Key Metrics to Track:
- Task throughput (tasks/second)
- Task success rate (%)
- Agent health status
- Average task execution time
- Active connections
- Failed request rate
Health Check Endpoints
# Coordinator health
curl http://coordinator:3000/health
# Agent health
curl http://worker-1.tailnet.ts.net:3000/health
# Mesh status
buildkit agent:mesh status
Backup and Disaster Recovery
State Backup
Redis Backup:
# Manual backup
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
# Automated backup (cron)
0 2 * * * /usr/local/bin/backup-redis.sh
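The cron entry references /usr/local/bin/backup-redis.sh, which the guide doesn't include. A minimal sketch, following the paths in the manual example above (the 14-day retention window is an assumption):
#!/bin/sh
# backup-redis.sh (sketch): snapshot Redis and keep dated copies
set -e
redis-cli SAVE
cp /var/lib/redis/dump.rdb "/backup/redis-$(date +%Y%m%d).rdb"
# Prune snapshots older than 14 days (adjust retention to taste)
find /backup -name 'redis-*.rdb' -mtime +14 -delete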
Agent Registry Backup:
# Export agent registry
buildkit agent:mesh discover --format json > agents-backup.json
# Restore from backup
buildkit agent:mesh restore --file agents-backup.json
Disaster Recovery
Coordinator Failure:
1. Automatic failover to a standby coordinator
2. Redis state is preserved
3. Agents reconnect automatically
4. Task state persisted in Redis is not lost

Complete Mesh Failure:
1. Restore Redis from backup
2. Restart the coordinators
3. Agents re-register automatically
4. Incomplete tasks are marked as failed
Operational Best Practices
1. Capacity Planning
Agents per Coordinator:
- Small: 1-50 agents per coordinator
- Medium: 50-200 agents per coordinator
- Large: 200-500 agents per coordinator
Scale-Out Rules:
- Add a coordinator when task queue depth exceeds 1,000
- Add agents when average task execution time exceeds 5 minutes
- Scale workers horizontally for throughput; add coordinators for availability
2. Monitoring Alerts
Critical Alerts:
- Coordinator unavailable
- Redis connection lost
- No healthy agents available
- Task failure rate > 10% (a Prometheus rule sketch follows the warning list)
Warning Alerts:
- Agent health degraded
- Task queue depth > 500
- Average task time above threshold
- Failed request rate > 5%
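As one way to codify the task-failure alert above, a hedged Prometheus alerting-rule sketch built on the agent_mesh_tasks_total counter from the Monitoring section (assumes failed tasks are counted under status="failed"):
groups:
  - name: agent-mesh
    rules:
      - alert: HighTaskFailureRate
        # Failure ratio over the last 5 minutes, from the counter defined earlier
        expr: |
          sum(rate(agent_mesh_tasks_total{status="failed"}[5m]))
            / sum(rate(agent_mesh_tasks_total[5m])) > 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Agent Mesh task failure rate above 10%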
3. Maintenance Windows
Rolling Updates:
# Update coordinators one at a time
kubectl rollout status deployment/agent-mesh-coordinator
kubectl set image deployment/agent-mesh-coordinator coordinator=new-image:tag
kubectl rollout status deployment/agent-mesh-coordinator
# Update workers (can be done in parallel)
kubectl set image deployment/agent-mesh-worker worker=new-image:tag
Zero-Downtime Deployment:
1. Deploy the new version alongside the old
2. Gradually shift traffic to the new version
3. Monitor for errors
4. Complete the rollout, or roll back
Troubleshooting
Common Issues
Agents not discovering each other:
# Check Tailscale status
tailscale status
# Check MagicDNS
dig agent-mesh-production-worker-1.tailnet.ts.net
# Check agent registration
buildkit agent:mesh discover
Tasks not being assigned:
# Check coordinator logs
kubectl logs -n agent-mesh deployment/agent-mesh-coordinator
# Check task queue
buildkit agent:mesh status
# Verify agent capabilities
buildkit agent:mesh discover --format json | jq '.[] | .capabilities'
Authentication failures:
# Verify JWT secret
echo $MESH_JWT_SECRET
# Generate new token
buildkit agent:mesh auth --agent-id test --agent-type worker
# Check audit log
kubectl logs -n agent-mesh deployment/agent-mesh-coordinator | grep auth
Performance Tuning
Coordinator Tuning
# Increase Node.js memory
NODE_OPTIONS="--max-old-space-size=4096"
# Increase connection pool
MAX_CONCURRENT_CONNECTIONS=50
# Adjust timeouts
TASK_TIMEOUT_MS=600000
TRANSPORT_TIMEOUT_MS=60000
Redis Tuning
# Increase max connections
maxclients 10000
# Enable persistence
appendonly yes
appendfsync everysec
Network Tuning
# Increase TCP buffer sizes
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
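These sysctl settings do not survive a reboot; to persist them, write them to /etc/sysctl.d/:
# Persist the buffer sizes across reboots
cat <<'EOF' | sudo tee /etc/sysctl.d/99-agent-mesh.conf
net.core.rmem_max=134217728
net.core.wmem_max=134217728
EOF
sudo sysctl --system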
Related Documentation:
- Home - Overview and quick start
- Architecture - System architecture details
- Development - Development and contribution guide