Agent Mesh - Deployment Guide

Production Deployment Strategies for Distributed Agent Mesh Networks

Overview

This guide covers deploying Agent Mesh in production environments, including Tailscale configuration, agent deployment, security hardening, monitoring setup, and operational best practices.

Prerequisites

Infrastructure Requirements

Minimum Requirements (Small deployment, <100 agents):

- 2 vCPUs, 4GB RAM per coordinator instance
- 1 vCPU, 2GB RAM per agent instance
- Network: Tailscale connectivity
- Storage: 10GB per instance

Recommended Requirements (Large deployment, 100-1000 agents):

- 8 vCPUs, 16GB RAM per coordinator instance
- 2 vCPUs, 4GB RAM per agent instance
- Network: Tailscale with dedicated subnet
- Storage: 50GB per instance
- Load balancer for coordinator instances

Software Dependencies

- Node.js 20+ and npm (the coordinator and agents are Node.js applications)
- Tailscale client on every host (installation covered below)
- Redis 7 for HA and multi-region deployments
- Docker or Kubernetes for containerized deployments (optional)

Network Requirements

- All hosts joined to the same Tailscale network (tailnet)
- MagicDNS enabled for hostname-based discovery
- Coordinator and agent ports (default 3000) reachable within the tailnet
- Redis port 6379 reachable from coordinators (HA deployments)

Tailscale Setup

1. Install Tailscale

Ubuntu/Debian

curl -fsSL https://tailscale.com/install.sh | sh

macOS

brew install tailscale

Docker

docker pull tailscale/tailscale:latest

2. Authenticate with Tailscale

# Interactive authentication
sudo tailscale up

# With auth key (for automation)
sudo tailscale up --authkey=tskey-auth-XXXXX

3. Enable MagicDNS

# Via Tailscale admin console
# Settings → DNS → Enable MagicDNS

4. Configure Tailscale ACLs (Optional)

Create ACLs to restrict agent-to-agent communication:

{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:agent-mesh"],
      "dst": ["tag:agent-mesh:*"]
    },
    {
      "action": "accept",
      "src": ["tag:coordinator"],
      "dst": ["tag:agent-mesh:*"]
    }
  ],
  "tagOwners": {
    "tag:agent-mesh": ["autogroup:admin"],
    "tag:coordinator": ["autogroup:admin"]
  }
}
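
These ACLs only take effect for nodes that join the tailnet with the matching tags. When authenticating (step 2), the tags from the example above can be requested with --advertise-tags:

# Join as a tagged mesh agent (the tag must be listed under tagOwners)
sudo tailscale up --authkey=tskey-auth-XXXXX --advertise-tags=tag:agent-mesh

# Coordinators join with the coordinator tag instead
sudo tailscale up --authkey=tskey-auth-XXXXX --advertise-tags=tag:coordinator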

5. Verify Tailscale Status

tailscale status
# Should show: Connected, MagicDNS enabled
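
To confirm that two mesh hosts can actually reach each other, and whether the path is direct or relayed, tailscale ping is a quick check (worker-1 stands in for any peer's hostname):

# Reports the route taken (direct peer-to-peer or via a DERP relay)
tailscale ping worker-1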

Deployment Architectures

Architecture 1: Single Coordinator

Use Case: Development, small production (<50 agents)

┌─────────────────────────────────────────────────┐
│              Tailscale Network                  │
│                                                 │
│  ┌──────────────┐                               │
│  │  Coordinator │                               │
│  │   Instance   │                               │
│  └──────┬───────┘                               │
│         │                                        │
│    ┌────┼────┬────┬────┬────┐                  │
│    │    │    │    │    │    │                  │
│  ┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐               │
│  │Ag1││Ag2││Ag3││Ag4││Ag5││...│               │
│  └───┘└───┘└───┘└───┘└───┘└───┘               │
│                                                 │
└─────────────────────────────────────────────────┘

Deployment Steps:

  1. Deploy coordinator:
# On coordinator host
cd agent-buildkit
npm install
npm run build

# Set environment variables
export MESH_JWT_SECRET="your-secret-key"
export TAILSCALE_ENABLED=true

# Start coordinator
node dist/cli/index.js agent:mesh status &

  2. Deploy agents:
# On each agent host
buildkit agent:mesh deploy \
  --agent-id worker-1 \
  --agent-name "Worker 1" \
  --agent-type worker \
  --namespace production \
  --capabilities "task-execution,data-processing" \
  --port 3000
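
Deploying a fleet host by host is tedious; a minimal loop sketch, assuming SSH access to each agent host and the same flags as above:

# Hypothetical helper: deploy five workers from an operator machine
for i in 1 2 3 4 5; do
  ssh "worker-$i.tailnet.ts.net" buildkit agent:mesh deploy \
    --agent-id "worker-$i" \
    --agent-name "Worker $i" \
    --agent-type worker \
    --namespace production \
    --capabilities "task-execution,data-processing" \
    --port 3000
done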

Architecture 2: High Availability Coordinators

Use Case: Production (50-500 agents), high availability required

┌─────────────────────────────────────────────────────────────┐
│                  Tailscale Network                          │
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │Coordinator 1│    │Coordinator 2│    │Coordinator 3│   │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘   │
│         │                   │                   │          │
│         └───────────────────┼───────────────────┘          │
│                             │                              │
│         ┌───────────────────┴───────────────────┐          │
│         │          Load Balancer                │          │
│         │    (Round-robin DNS or HAProxy)       │          │
│         └───────────────────┬───────────────────┘          │
│                             │                              │
│         ┌───────────────────┴───────────────────┐          │
│         │                                        │          │
│    ┌────┼────┬────┬────┬────┬────┬────┬────┐   │          │
│    │    │    │    │    │    │    │    │    │   │          │
│  ┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐┌─▼─┐           │
│  │Ag1││Ag2││Ag3││Ag4││Ag5││...││...││...││...│           │
│  └───┘└───┘└───┘└───┘└───┘└───┘└───┘└───┘└───┘           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Requirements:

- Shared task queue (Redis or PostgreSQL)
- Shared agent registry (Redis or PostgreSQL)
- Load balancer (DNS round-robin or HAProxy)

Deployment Steps:

  1. Deploy Redis for shared state:
docker run -d --name agent-mesh-redis \
  -p 6379:6379 \
  redis:7-alpine

  2. Configure coordinators for HA:
# On each coordinator
export REDIS_URL="redis://redis-host:6379"
export MESH_JWT_SECRET="shared-secret-key"
export COORDINATOR_ID="coordinator-1"

  3. Deploy load balancer:
# HAProxy config
frontend agent_mesh_lb
    bind *:8080
    default_backend agent_mesh_coordinators

backend agent_mesh_coordinators
    balance roundrobin
    server coord1 coord1.tailnet.ts.net:3000 check
    server coord2 coord2.tailnet.ts.net:3000 check
    server coord3 coord3.tailnet.ts.net:3000 check
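
Once the three coordinators and the load balancer are up, each hop can be verified through the /health endpoint (see Health Check Endpoints below):

# Each coordinator should answer individually...
for host in coord1 coord2 coord3; do
  curl -fsS "http://${host}.tailnet.ts.net:3000/health" && echo " ${host} OK"
done

# ...and the load balancer should round-robin across them
curl -fsS http://localhost:8080/health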

Architecture 3: Multi-Region Mesh

Use Case: Global deployment (500+ agents), multi-region latency optimization

┌────────────────────────────────────────────────────────────┐
│                   Global Tailscale Network                 │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐ │
│  │              US-East Region                          │ │
│  │  ┌──────────┐         ┌──────────┐                  │ │
│  │  │Coord US-E│◄───────►│Agents    │                  │ │
│  │  └─────┬────┘         └──────────┘                  │ │
│  └────────┼──────────────────────────────────────────────┘ │
│           │                                                │
│           │          ┌────────────┐                        │
│           ├─────────►│   Redis    │◄──────────┐            │
│           │          │  Cluster   │           │            │
│           │          └────────────┘           │            │
│           │                                   │            │
│  ┌────────┼──────────────────────────────────┼──────────┐ │
│  │        │        EU-West Region            │          │ │
│  │  ┌─────▼────┐         ┌──────────┐       │          │ │
│  │  │Coord EU-W│◄───────►│Agents    │       │          │ │
│  │  └──────────┘         └──────────┘       │          │ │
│  └───────────────────────────────────────────┼──────────┘ │
│                                              │            │
│  ┌──────────────────────────────────────────┼──────────┐ │
│  │              APAC Region                 │          │ │
│  │  ┌──────────┐         ┌──────────┐      │          │ │
│  │  │Coord APAC│◄───────►│Agents    │      │          │ │
│  │  └─────┬────┘         └──────────┘      │          │ │
│  └────────┴──────────────────────────────────┘          │ │
│                                                          │
└────────────────────────────────────────────────────────────┘

Regional Deployment:

  1. Deploy regional coordinators
  2. Configure region-aware routing
  3. Sync state via distributed Redis
  4. Use Tailscale subnet routers for region isolation
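
For step 4, a subnet router in each region advertises that region's CIDR into the tailnet. 10.1.0.0/16 below is a placeholder for the actual regional subnet, and advertised routes must still be approved in the Tailscale admin console:

# On the region's subnet router host
sudo tailscale up --advertise-routes=10.1.0.0/16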

Docker Deployment

Coordinator Container

Dockerfile:

FROM node:20-alpine

WORKDIR /app

# Install dependencies
COPY package*.json ./
RUN npm ci --only=production

# Copy application
COPY dist ./dist
COPY config ./config

# Install Tailscale (packaged in the Alpine community repository; the original
# wget of the .tgz would place a tarball, not a binary, at the target path)
RUN apk add --no-cache \
    ca-certificates \
    iptables \
    ip6tables \
    tailscale

# Expose ports
EXPOSE 3000

# Start script
COPY scripts/docker-entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]
CMD ["coordinator"]
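
The image references scripts/docker-entrypoint.sh without showing it. A minimal sketch, assuming tailscaled runs inside the container (userspace networking is a fallback when /dev/net/tun is unavailable) and that the coordinator is started the same way as in Architecture 1:

#!/bin/sh
# docker-entrypoint.sh (sketch): bring up Tailscale, then start the requested role
set -e

tailscaled --state=/var/lib/tailscale/tailscaled.state \
  --tun=userspace-networking &
sleep 2
tailscale up --authkey="${TAILSCALE_AUTHKEY}"

case "${1:-coordinator}" in
  coordinator) exec node dist/cli/index.js agent:mesh status ;;
  *) exec "$@" ;;
esac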

docker-compose.yml:

version: '3.8'

services:
  coordinator:
    build: .
    image: agent-mesh-coordinator:latest
    environment:
      - MESH_JWT_SECRET=${MESH_JWT_SECRET}
      - TAILSCALE_AUTHKEY=${TAILSCALE_AUTHKEY}
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    cap_add:
      - NET_ADMIN
    devices:
      - /dev/net/tun
    volumes:
      - ./config:/app/config
      - tailscale-data:/var/lib/tailscale

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes

  agent-worker-1:
    build: .
    image: agent-mesh-worker:latest
    environment:
      - AGENT_ID=worker-1
      - AGENT_TYPE=worker
      - AGENT_NAMESPACE=production
      - AGENT_CAPABILITIES=task-execution,data-processing
      - TAILSCALE_AUTHKEY=${TAILSCALE_AUTHKEY}
    cap_add:
      - NET_ADMIN
    devices:
      - /dev/net/tun
    volumes:
      - tailscale-worker-1:/var/lib/tailscale

volumes:
  redis-data:
  tailscale-data:
  tailscale-worker-1:

Deploy:

# Build images
docker-compose build

# Start services
docker-compose up -d

# Check status
docker-compose ps
docker-compose logs coordinator

Kubernetes Deployment

Namespace Setup

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: agent-mesh
  labels:
    name: agent-mesh

Coordinator Deployment

# coordinator-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-mesh-coordinator
  namespace: agent-mesh
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-mesh-coordinator
  template:
    metadata:
      labels:
        app: agent-mesh-coordinator
    spec:
      containers:
      - name: coordinator
        image: agent-mesh-coordinator:0.1.2
        env:
        - name: MESH_JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: agent-mesh-secrets
              key: jwt-secret
        - name: TAILSCALE_AUTHKEY
          valueFrom:
            secretKeyRef:
              name: agent-mesh-secrets
              key: tailscale-authkey
        - name: REDIS_URL
          value: redis://agent-mesh-redis:6379
        ports:
        - containerPort: 3000
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
        volumeMounts:
        - name: tailscale-state
          mountPath: /var/lib/tailscale
      volumes:
      - name: tailscale-state
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: agent-mesh-coordinator
  namespace: agent-mesh
spec:
  selector:
    app: agent-mesh-coordinator
  ports:
  - port: 3000
    targetPort: 3000
  type: ClusterIP

Worker Deployment

# worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-mesh-worker
  namespace: agent-mesh
spec:
  replicas: 10
  selector:
    matchLabels:
      app: agent-mesh-worker
  template:
    metadata:
      labels:
        app: agent-mesh-worker
    spec:
      containers:
      - name: worker
        image: agent-mesh-worker:0.1.2
        env:
        # Give each replica a unique AGENT_ID via the downward API;
        # a fixed value would collide across the 10 replicas
        - name: AGENT_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: AGENT_TYPE
          value: worker
        - name: AGENT_NAMESPACE
          value: production
        - name: AGENT_CAPABILITIES
          value: task-execution,data-processing
        - name: TAILSCALE_AUTHKEY
          valueFrom:
            secretKeyRef:
              name: agent-mesh-secrets
              key: tailscale-authkey
        - name: COORDINATOR_URL
          value: http://agent-mesh-coordinator:3000
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
        volumeMounts:
        - name: tailscale-state
          mountPath: /var/lib/tailscale
      volumes:
      - name: tailscale-state
        emptyDir: {}

Redis StatefulSet

# redis-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: agent-mesh-redis
  namespace: agent-mesh
spec:
  serviceName: agent-mesh-redis
  replicas: 1
  selector:
    matchLabels:
      app: agent-mesh-redis
  template:
    metadata:
      labels:
        app: agent-mesh-redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: redis-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: agent-mesh-redis
  namespace: agent-mesh
spec:
  selector:
    app: agent-mesh-redis
  ports:
  - port: 6379
    targetPort: 6379
  type: ClusterIP

Secrets

# secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: agent-mesh-secrets
  namespace: agent-mesh
type: Opaque
stringData:
  jwt-secret: "your-jwt-secret-change-me"
  tailscale-authkey: "tskey-auth-XXXXX"
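
Committing real values in secrets.yaml is risky; the same Secret can instead be created imperatively so the values never land in version control:

kubectl create secret generic agent-mesh-secrets \
  --namespace agent-mesh \
  --from-literal=jwt-secret="$(openssl rand -base64 64)" \
  --from-literal=tailscale-authkey="$TAILSCALE_AUTHKEY"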

Deploy to Kubernetes

# Create namespace
kubectl apply -f namespace.yaml

# Create secrets
kubectl apply -f secrets.yaml

# Deploy Redis
kubectl apply -f redis-statefulset.yaml

# Deploy coordinator
kubectl apply -f coordinator-deployment.yaml

# Deploy workers
kubectl apply -f worker-deployment.yaml

# Verify deployments
kubectl get pods -n agent-mesh
kubectl logs -n agent-mesh deployment/agent-mesh-coordinator

Environment Configuration

Environment Variables

Coordinator:

# Required
MESH_JWT_SECRET=your-secret-key-here
TAILSCALE_ENABLED=true

# Optional
REDIS_URL=redis://localhost:6379
NODE_ENV=production
LOG_LEVEL=info
COORDINATOR_ID=coordinator-1
MAX_TASK_RETRIES=3
TASK_TIMEOUT_MS=300000
HEARTBEAT_INTERVAL_MS=30000
HEALTH_CHECK_INTERVAL_MS=60000

Agent:

# Required
AGENT_ID=worker-1
AGENT_TYPE=worker
AGENT_NAMESPACE=production
AGENT_CAPABILITIES=task-execution,data-processing

# Optional
COORDINATOR_URL=http://coordinator:3000
AGENT_PORT=3000
LOG_LEVEL=info

Configuration File

config/production.json:

{
  "mesh": {
    "coordinatorUrl": "http://coordinator.tailnet.ts.net:3000",
    "jwt": {
      "secret": "${MESH_JWT_SECRET}",
      "expiresIn": 3600
    },
    "discovery": {
      "heartbeatInterval": 30000,
      "healthCheckInterval": 60000,
      "unhealthyThreshold": 300000
    },
    "coordinator": {
      "maxTaskRetries": 3,
      "taskTimeout": 300000,
      "loadBalancingStrategy": "capability-match"
    },
    "transport": {
      "timeout": 30000,
      "retries": 3,
      "retryDelay": 1000,
      "maxConcurrentConnections": 10,
      "keepAlive": true
    }
  },
  "tailscale": {
    "enabled": true,
    "servicePrefix": "agent-mesh"
  },
  "logging": {
    "level": "info",
    "format": "json"
  }
}
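
The "${MESH_JWT_SECRET}" placeholder implies environment substitution at deploy time. If the application does not expand it itself, envsubst (from GNU gettext) is one way to render the file; the .tmpl name here is an assumed convention:

# Render the config from a template, expanding ${VAR} references
envsubst < config/production.json.tmpl > config/production.json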

Security Hardening

1. JWT Secret Management

Generate Strong Secret:

# Generate 64-byte random secret
openssl rand -base64 64

Secret Rotation:

# 1. Generate new secret
NEW_SECRET=$(openssl rand -base64 64)

# 2. Deploy with both old and new (grace period)
export MESH_JWT_SECRET="$OLD_SECRET"
export MESH_JWT_SECRET_NEW="$NEW_SECRET"

# 3. After grace period, switch to new only
export MESH_JWT_SECRET="$NEW_SECRET"

2. Tailscale ACLs

Restrict Agent Communication:

{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:coordinator"],
      "dst": ["tag:worker:3000"]
    },
    {
      "action": "accept",
      "src": ["tag:worker"],
      "dst": ["tag:coordinator:3000"]
    }
  ]
}

3. Network Policies (Kubernetes)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-mesh-netpol
  namespace: agent-mesh
spec:
  podSelector:
    matchLabels:
      app: agent-mesh-worker
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: agent-mesh-coordinator
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: agent-mesh-coordinator
  - to:
    - podSelector:
        matchLabels:
          app: agent-mesh-redis
  # Allow DNS lookups, which the egress allowlist would otherwise block
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53

4. RBAC (Kubernetes)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-mesh-role
  namespace: agent-mesh
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-mesh-rolebinding
  namespace: agent-mesh
subjects:
- kind: ServiceAccount
  name: agent-mesh-sa
  namespace: agent-mesh
roleRef:
  kind: Role
  name: agent-mesh-role
  apiGroup: rbac.authorization.k8s.io
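
The RoleBinding references a ServiceAccount named agent-mesh-sa, which must exist before pods can use it:

kubectl create serviceaccount agent-mesh-sa -n agent-mesh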

Monitoring and Observability

Prometheus Metrics

Expose Metrics Endpoint:

// Add to the coordinator (assumes its existing Express `app` is in scope)
import { register, Counter, Gauge } from 'prom-client';

const taskCounter = new Counter({
  name: 'agent_mesh_tasks_total',
  help: 'Total tasks processed',
  labelNames: ['status']
});

const agentGauge = new Gauge({
  name: 'agent_mesh_agents',
  help: 'Number of agents by status',
  labelNames: ['status']
});
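
// Update the metrics from the task lifecycle, for example:
//   taskCounter.inc({ status: 'completed' });
//   agentGauge.set({ status: 'healthy' }, healthyAgentCount);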

// Expose endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Prometheus Config:

scrape_configs:
  - job_name: 'agent-mesh'
    static_configs:
      - targets: ['coordinator:3000']
    metrics_path: '/metrics'
    scrape_interval: 15s

Grafana Dashboard

Key Metrics to Track:

- Task throughput (tasks/second)
- Task success rate (%)
- Agent health status
- Average task execution time
- Active connections
- Failed requests rate
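
Assuming the metric names from the snippet above (the status label values are illustrative), these PromQL expressions are starting points for the first three panels:

# Task throughput (tasks/second, 5-minute window)
rate(agent_mesh_tasks_total[5m])

# Task success rate (%)
100 * rate(agent_mesh_tasks_total{status="completed"}[5m])
    / rate(agent_mesh_tasks_total[5m])

# Agents by health status
agent_mesh_agents{status="healthy"}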

Health Check Endpoints

# Coordinator health
curl http://coordinator:3000/health

# Agent health
curl http://worker-1.tailnet.ts.net:3000/health

# Mesh status
buildkit agent:mesh status

Backup and Disaster Recovery

State Backup

Redis Backup:

# Manual backup
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb

# Automated backup (cron)
0 2 * * * /usr/local/bin/backup-redis.sh
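
The referenced backup-redis.sh is not part of this guide; a minimal sketch, assuming redis-cli can reach the instance and /backup exists:

#!/bin/sh
# backup-redis.sh (sketch): pull an RDB snapshot over the wire, prune old copies
set -eu
redis-cli --rdb "/backup/redis-$(date +%Y%m%d).rdb"
find /backup -name 'redis-*.rdb' -mtime +14 -delete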

Agent Registry Backup:

# Export agent registry
buildkit agent:mesh discover --format json > agents-backup.json

# Restore from backup
buildkit agent:mesh restore --file agents-backup.json

Disaster Recovery

Coordinator Failure:

1. Automatic failover to standby coordinator
2. Redis state preserved
3. Agents reconnect automatically
4. No task data loss

Complete Mesh Failure:

1. Restore Redis from backup
2. Restart coordinators
3. Agents re-register automatically
4. Incomplete tasks marked as failed

Operational Best Practices

1. Capacity Planning

Agents per Coordinator:

- Small: 1-50 agents per coordinator
- Medium: 50-200 agents per coordinator
- Large: 200-500 agents per coordinator

Scale Out Rules:

- Add a coordinator when the task queue exceeds 1000 pending tasks
- Add agents when average execution time exceeds 5 minutes (see the autoscaling sketch below)
- Scale workers horizontally for throughput; add coordinators for availability
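
On Kubernetes, CPU-based autoscaling is a rough proxy for the worker rule above (scaling on queue depth requires custom metrics):

kubectl autoscale deployment/agent-mesh-worker -n agent-mesh \
  --min=10 --max=50 --cpu-percent=70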

2. Monitoring Alerts

Critical Alerts:

- Coordinator unavailable
- Redis connection lost
- No healthy agents available
- Task failure rate > 10%

Warning Alerts:

- Agent health degraded
- Task queue > 500
- Average task time > threshold
- Failed requests > 5%

3. Maintenance Windows

Rolling Updates:

# Update coordinators one at a time
kubectl -n agent-mesh set image deployment/agent-mesh-coordinator coordinator=new-image:tag
kubectl -n agent-mesh rollout status deployment/agent-mesh-coordinator

# Update workers (can be done in parallel)
kubectl -n agent-mesh set image deployment/agent-mesh-worker worker=new-image:tag
kubectl -n agent-mesh rollout status deployment/agent-mesh-worker

Zero-Downtime Deployment:

1. Deploy new version alongside old
2. Gradually shift traffic to new version
3. Monitor for errors
4. Complete rollout or rollback
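
If monitoring turns up errors mid-rollout, kubectl can revert to the previous ReplicaSet:

# Roll back to the previously deployed version
kubectl rollout undo deployment/agent-mesh-coordinator -n agent-mesh
kubectl rollout undo deployment/agent-mesh-worker -n agent-mesh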

Troubleshooting

Common Issues

Agents not discovering each other:

# Check Tailscale status
tailscale status

# Check MagicDNS
dig agent-mesh-production-worker-1.tailnet.ts.net

# Check agent registration
buildkit agent:mesh discover

Tasks not being assigned:

# Check coordinator logs
kubectl logs -n agent-mesh deployment/agent-mesh-coordinator

# Check task queue
buildkit agent:mesh status

# Verify agent capabilities
buildkit agent:mesh discover --format json | jq '.[] | .capabilities'

Authentication failures:

# Verify the JWT secret is set (without printing its value)
test -n "$MESH_JWT_SECRET" && echo "MESH_JWT_SECRET is set"

# Generate new token
buildkit agent:mesh auth --agent-id test --agent-type worker

# Check audit log
kubectl logs -n agent-mesh deployment/agent-mesh-coordinator | grep auth

Performance Tuning

Coordinator Tuning

# Increase Node.js memory
NODE_OPTIONS="--max-old-space-size=4096"

# Increase connection pool
MAX_CONCURRENT_CONNECTIONS=50

# Adjust timeouts
TASK_TIMEOUT_MS=600000
TRANSPORT_TIMEOUT_MS=60000

Redis Tuning

# Increase max connections
maxclients 10000

# Enable persistence
appendonly yes
appendfsync everysec

Network Tuning

# Increase TCP buffer sizes
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
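
sysctl -w changes do not survive a reboot; persist them in a drop-in file:

printf 'net.core.rmem_max=134217728\nnet.core.wmem_max=134217728\n' | \
  sudo tee /etc/sysctl.d/99-agent-mesh.conf
sudo sysctl --system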

Related Documentation:

- Home - Overview and quick start
- Architecture - System architecture details
- Development - Development and contribution guide