# High Availability Deployment
Deploy NetStacks Controller in high availability mode with Valkey for session coordination, connection brokering, and automated failover across multiple instances.
## Overview
NetStacks Controller supports multi-instance deployment for high availability and horizontal scaling. HA mode uses Valkey — a Redis-compatible open-source in-memory data store — to coordinate session state, route connections across instances, and provide failover capabilities.
Key HA components include:
- Instance Registration & Heartbeat — Each Controller instance registers itself in Valkey with a unique ID and heartbeat TTL. Stale instances are automatically cleaned up.
- Session Registry — Active sessions are tracked in Valkey with the hosting instance ID, enabling cross-instance session state coordination.
- Connection Broker — Incoming viewer and terminal connections are routed to the correct Controller instance via WebSocket proxy.
- Sentinel Support — Valkey Sentinel monitors the primary Valkey node and promotes a replica on failure. NetStacks auto-reconnects to the new primary.
- Circuit Breaker — The LLM orchestrator tracks provider health and trips the circuit breaker after consecutive failures, preventing cascading timeouts across the cluster.
- HA Status Dashboard — The Admin UI provides a real-time view of cluster health, instance status, session distribution, and Valkey connectivity.
High availability deployment requires the Enterprise license tier. Single-instance deployments on Teams or Starter tiers do not require Valkey.
## How It Works
### Instance Registration
When a Controller instance starts, it generates a unique instance ID and registers itself in Valkey with a heartbeat TTL. The instance periodically refreshes its heartbeat to signal liveness. If an instance stops sending heartbeats (e.g., due to a crash or network partition), its registration expires and it is automatically removed from the cluster. Other instances detect the removal and can take over orphaned sessions.
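The registration-and-expiry behaviour can be sketched with an in-memory TTL store standing in for Valkey. The class and method names below are illustrative, not the actual NetStacks API; in the real system the TTL is enforced by Valkey key expiry rather than application code.

```python
import time

class InstanceRegistry:
    """Toy stand-in for Valkey: entries expire after a TTL, like SET ... EX."""

    def __init__(self, ttl_secs=30):
        self.ttl_secs = ttl_secs
        self._expiry = {}  # instance_id -> absolute expiry timestamp

    def heartbeat(self, instance_id, now=None):
        # Registering and refreshing are the same operation: reset the TTL.
        now = time.monotonic() if now is None else now
        self._expiry[instance_id] = now + self.ttl_secs

    def live_instances(self, now=None):
        # Entries whose TTL lapsed are treated as stale and dropped,
        # mirroring Valkey's automatic key expiry.
        now = time.monotonic() if now is None else now
        self._expiry = {i: t for i, t in self._expiry.items() if t > now}
        return sorted(self._expiry)

registry = InstanceRegistry(ttl_secs=30)
registry.heartbeat("ctrl-a1b2c3d4", now=0)
registry.heartbeat("ctrl-e5f6a7b8", now=0)
print(registry.live_instances(now=10))       # both instances still live
registry.heartbeat("ctrl-a1b2c3d4", now=25)  # only one keeps heartbeating
print(registry.live_instances(now=40))       # the silent instance has expired
```

The key point the sketch captures is that liveness is inferred purely from heartbeat recency: a crashed or partitioned instance needs no explicit deregistration step.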
### Session Registry
Active terminal sessions are stored in Valkey as a session registry. Each entry maps a session ID to the instance that owns it, along with metadata such as the connected device, user, and session start time. When a client connects, the connection broker queries the registry to determine which instance holds the session.
### Connection Broker
The connection broker receives incoming WebSocket connections and routes them to the correct Controller instance. The routing flow is:
- Client connects to any Controller instance behind the load balancer.
- The receiving instance queries the session registry in Valkey to find the owning instance.
- If the session is owned locally, the connection is handled directly.
- If the session is owned by another instance, the connection is proxied via WebSocket to the owning instance.
- If the owning instance is down, the session can be migrated to the receiving instance.
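The routing flow above can be sketched as a single decision function. Names and return values are illustrative (the real broker proxies WebSocket traffic rather than returning tuples), and the registry is modelled as a plain dict instead of Valkey:

```python
def route_connection(session_id, local_instance, session_registry, live_instances):
    """Decide how an incoming connection for session_id should be handled.

    session_registry: dict mapping session_id -> owning instance id
    live_instances:   set of instance ids with a fresh heartbeat
    """
    owner = session_registry.get(session_id)
    if owner is None:
        session_registry[session_id] = local_instance  # new session: claim it
        return ("handle_locally", local_instance)
    if owner == local_instance:
        return ("handle_locally", local_instance)      # we already own it
    if owner in live_instances:
        return ("proxy_websocket", owner)              # forward to the owner
    # Owner's heartbeat expired: migrate the orphaned session here.
    session_registry[session_id] = local_instance
    return ("migrate", local_instance)

registry = {"sess-1": "ctrl-a", "sess-2": "ctrl-b"}
live = {"ctrl-a", "ctrl-b"}
print(route_connection("sess-1", "ctrl-a", registry, live))        # handled locally
print(route_connection("sess-2", "ctrl-a", registry, live))        # proxied to ctrl-b
print(route_connection("sess-2", "ctrl-a", registry, {"ctrl-a"}))  # owner gone: migrated
```

Because every instance runs the same decision logic against shared state, the load balancer can send any connection to any instance without sticky sessions.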
### Sentinel Support
For production HA deployments, Valkey Sentinel provides automated failover of the Valkey data store itself. Sentinel monitors the primary Valkey node and, if it becomes unavailable, promotes a replica to primary. NetStacks Controller connects to Sentinel rather than directly to Valkey, allowing it to automatically discover and reconnect to the new primary after a failover event.
### Circuit Breaker
The LLM orchestrator includes a circuit breaker pattern for managing AI provider health. When a provider experiences consecutive failures (timeouts, rate limits, or errors), the circuit breaker trips and temporarily stops routing requests to that provider. This prevents cascading failures across the cluster and allows degraded AI functionality rather than complete outage. The circuit breaker automatically resets after a configurable cooldown period.
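A minimal sketch of the pattern, with illustrative thresholds (the actual NetStacks defaults are not documented here):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips after N consecutive failures,
    rejects requests during a cooldown, then closes again."""

    def __init__(self, failure_threshold=3, cooldown_secs=60):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_secs:
            self.opened_at = None  # cooldown elapsed: reset and try again
            self.failures = 0
            return True
        return False  # open: fail fast instead of waiting on a sick provider

    def record_success(self):
        self.failures = 0  # any success resets the consecutive-failure count

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # trip the breaker

breaker = CircuitBreaker(failure_threshold=3, cooldown_secs=60)
for _ in range(3):
    breaker.record_failure(now=0)
print(breaker.allow_request(now=10))  # False: circuit is open
print(breaker.allow_request(now=70))  # True: cooldown elapsed, closed again
```

Production implementations usually add a half-open state that sends a single probe request after the cooldown; this sketch simply closes fully, which is enough to show the fail-fast behaviour described above.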
### Health Endpoints
Each Controller instance exposes a health endpoint at `/api/v1/health` that returns detailed status information including:
- Instance ID and uptime
- Valkey connectivity status
- Number of registered instances in the cluster
- Local and total session counts
- Database connectivity
Load balancers should use this endpoint for health checks to automatically remove unhealthy instances from the pool.
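A monitoring or health-check script can fold the response into a single healthy/unhealthy decision. The field names below follow the sample response later on this page; the exact values returned for a failed dependency (e.g. `"disconnected"`) are an assumption, so treat anything other than `"connected"` as unhealthy:

```python
def is_healthy(health: dict) -> bool:
    """Evaluate a /api/v1/health response body (already parsed from JSON)."""
    if health.get("status") != "healthy":
        return False
    if health.get("valkey") != "connected" or health.get("database") != "connected":
        return False
    # Require every registered instance in the cluster to be reporting healthy.
    cluster = health.get("cluster", {})
    return cluster.get("healthy_instances", 0) >= cluster.get("registered_instances", 1)

sample = {
    "status": "healthy",
    "instance_id": "ctrl-a1b2c3d4",
    "valkey": "connected",
    "database": "connected",
    "cluster": {"registered_instances": 3, "healthy_instances": 3},
}
print(is_healthy(sample))  # True
sample["valkey"] = "disconnected"
print(is_healthy(sample))  # False
```

For load-balancer health checks the HTTP status code alone is usually sufficient; a check like this is more useful for alerting, where partial cluster degradation should page before requests start failing.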
## Step-by-Step Guide
### Workflow 1: Add Valkey to Your Deployment
- Add a Valkey container to your Docker Compose file or provision a standalone Valkey instance.
- Configure the `VALKEY_URL` environment variable on your Controller instance, pointing to the Valkey server (e.g., `redis://valkey:6379`).
- Restart the Controller. On startup, it will detect the Valkey connection and enable HA features automatically.
- Verify connectivity by checking the health endpoint: `curl http://localhost:8080/api/v1/health`. The response should show `"valkey": "connected"`.
Enable Valkey's append-only file (AOF) persistence with `--appendonly yes` to preserve session state across Valkey restarts. Without persistence, all session registrations are lost on restart and instances must re-register.
### Workflow 2: Multi-Instance Deployment
- Deploy two or more Controller API instances, each configured with the same `DATABASE_URL`, `VAULT_MASTER_KEY`, `JWT_SECRET`, and `VALKEY_URL`.
- Place a load balancer (e.g., Nginx, HAProxy, or a cloud ALB) in front of the Controller instances. The load balancer must support WebSocket upgrades.
- Configure health checks on the load balancer to poll `/api/v1/health` on each instance.
- Deploy one or more Controller Admin UI instances behind the same load balancer (or a separate one), configured to point at the API load balancer URL.
- Verify the cluster by navigating to Admin → HA Status in the Admin UI. All instances should appear with a Healthy status.
All Controller instances must share the same `VAULT_MASTER_KEY` and `JWT_SECRET`. If these differ between instances, sessions and encrypted credentials will not be portable across the cluster.
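A minimal `nginx.conf` for the load balancer described above might look like the following. The upstream names match the multi-instance Compose example later on this page, but the backend port (`8080`) is an assumption; verify it against the port your Controller API actually listens on.

```nginx
events {}

http {
    upstream controller_api {
        server controller-api-1:8080;
        server controller-api-2:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://controller_api;
            # Required for WebSocket upgrades:
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            # Long-running terminal sessions need a generous idle timeout:
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
        }
    }
}
```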
### Workflow 3: Enable Sentinel for Valkey Failover
- Deploy at least three Valkey Sentinel instances alongside your Valkey primary and replica(s).
- Configure Sentinel to monitor your Valkey primary with a master name (e.g., `mymaster`).
- Set the `VALKEY_SENTINEL_URL` environment variable on each Controller instance, pointing to the Sentinel addresses (e.g., `redis://sentinel1:26379,sentinel2:26379,sentinel3:26379`).
- Set `VALKEY_SENTINEL_MASTER_NAME` to your Sentinel master name (e.g., `mymaster`).
- Remove or leave unset the `VALKEY_URL` variable — when Sentinel is configured, the Controller discovers the primary automatically.
- Restart the Controller instances. They will connect to Sentinel and auto-discover the current Valkey primary.
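In Docker Compose terms, the steps above reduce to an environment fragment like this sketch. The service and host names (`sentinel1` etc.) are illustrative and must match your Sentinel deployment; the other variables mirror the Compose examples elsewhere on this page.

```yaml
controller-api:
  image: ghcr.io/netstacks/controller-api:latest
  environment:
    DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
    VAULT_MASTER_KEY: "your-64-hex-char-key"
    JWT_SECRET: "your-jwt-secret"
    # Sentinel discovery instead of a direct VALKEY_URL:
    VALKEY_SENTINEL_URL: "redis://sentinel1:26379,sentinel2:26379,sentinel3:26379"
    VALKEY_SENTINEL_MASTER_NAME: "mymaster"
```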
### Workflow 4: Monitor HA Status
- Navigate to Admin → HA Status in the Admin UI.
- View the Instance Table showing each registered instance, its ID, last heartbeat, session count, and health status.
- Check the Valkey Status panel for connection state, memory usage, and replication info.
- Review Session Distribution to see how sessions are balanced across instances.
- Use the health endpoint programmatically for alerting: `curl http://controller:8080/api/v1/health`
The `/api/v1/health` endpoint returns structured JSON suitable for integration with monitoring tools such as Prometheus (via a JSON exporter), Datadog, or custom health-check scripts.
## Code Examples
### Docker Compose with Valkey
```yaml
services:
  valkey:
    image: valkey/valkey:8-alpine
    ports:
      - "6379:6379"
    volumes:
      - valkey-data:/data
    command: valkey-server --appendonly yes

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: netstacks
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: netstacks
    volumes:
      - postgres-data:/var/lib/postgresql/data

  controller-api:
    image: ghcr.io/netstacks/controller-api:latest
    environment:
      DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
      VAULT_MASTER_KEY: "your-64-hex-char-key"
      JWT_SECRET: "your-jwt-secret"
      VALKEY_URL: "redis://valkey:6379"
    depends_on:
      - postgres
      - valkey

  controller-admin:
    image: ghcr.io/netstacks/controller-admin:latest
    ports:
      - "3000:80"

volumes:
  valkey-data:
  postgres-data:
```

### Health Endpoint Response
```json
{
  "status": "healthy",
  "instance_id": "ctrl-a1b2c3d4",
  "uptime_secs": 86400,
  "valkey": "connected",
  "database": "connected",
  "cluster": {
    "registered_instances": 3,
    "healthy_instances": 3,
    "total_sessions": 42,
    "local_sessions": 14
  }
}
```

### Sentinel Configuration
```
# sentinel.conf
port 26379
sentinel monitor mymaster valkey-primary 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
```

### Multi-Instance Docker Compose
```yaml
services:
  controller-api-1:
    image: ghcr.io/netstacks/controller-api:latest
    environment:
      DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
      VAULT_MASTER_KEY: "your-64-hex-char-key"
      JWT_SECRET: "your-jwt-secret"
      VALKEY_URL: "redis://valkey:6379"
    depends_on:
      - postgres
      - valkey

  controller-api-2:
    image: ghcr.io/netstacks/controller-api:latest
    environment:
      DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
      VAULT_MASTER_KEY: "your-64-hex-char-key"
      JWT_SECRET: "your-jwt-secret"
      VALKEY_URL: "redis://valkey:6379"
    depends_on:
      - postgres
      - valkey

  nginx:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - controller-api-1
      - controller-api-2
```

### Environment Variables
| Variable | Description | Default |
|---|---|---|
| `VALKEY_URL` | Connection URL for the Valkey instance (e.g., `redis://valkey:6379`) | None (HA disabled) |
| `VALKEY_SENTINEL_URL` | Comma-separated Sentinel addresses for automated failover | None |
| `VALKEY_SENTINEL_MASTER_NAME` | Sentinel master name for primary discovery | `mymaster` |
| `HA_HEARTBEAT_INTERVAL_SECS` | How often each instance sends a heartbeat to Valkey, in seconds | `5` |
| `HA_STALE_INSTANCE_TTL_SECS` | Time after which a non-heartbeating instance is considered stale and removed, in seconds | `30` |
## Questions & Answers
- What is Valkey?
- Valkey is a Redis-compatible open-source in-memory data store, maintained by the Linux Foundation. It emerged after the Redis license change to provide a truly open-source alternative. NetStacks uses Valkey for session state coordination, instance registration, and connection brokering in HA deployments. Any Redis-compatible server should work, but Valkey is the recommended and tested option.
- How does session affinity work?
- Sessions are pinned to the Controller instance that created them. The session registry in Valkey maps each session ID to its owning instance. When a subsequent connection arrives (e.g., a viewer joining a shared session), the connection broker queries the registry and routes the connection to the owning instance via WebSocket proxy. This avoids the need for sticky sessions at the load balancer level.
- What happens during a failover?
- When a Controller instance goes down, its heartbeat in Valkey expires and it is removed from the cluster. Active sessions on the failed instance become temporarily unavailable. Clients automatically reconnect to the load balancer and are routed to a healthy instance, where new sessions are established. Session recordings and audit logs are preserved in PostgreSQL and are not affected by instance failures.
- Is Valkey required for all deployments?
- No. Valkey is only required for multi-instance HA deployments. Single-instance deployments work without Valkey — session state is held in-process and no cross-instance coordination is needed. Simply omit the `VALKEY_URL` environment variable to run in single-instance mode.
- What about the separate API and Admin UI images?
- Starting with v0.0.9, the NetStacks Controller ships as two Docker images: `controller-api` (the Rust API server) and `controller-admin` (Nginx serving the React admin dashboard). This separation enables independent scaling — you can run multiple API instances behind a load balancer for HA while serving the static Admin UI from a single container or CDN. The Admin UI communicates with the API exclusively through HTTP/WebSocket, so it works seamlessly with any number of API instances.
- Can I use Redis instead of Valkey?
- Yes. Valkey is wire-compatible with Redis, so any Redis 7+ instance will work. However, NetStacks officially tests and recommends Valkey due to its open-source license (BSD-3-Clause). Simply point `VALKEY_URL` at your Redis instance using the same `redis://` connection string.
## Troubleshooting
### Instance Not Registering in Cluster
- Verify `VALKEY_URL` is set correctly and the Valkey server is reachable from the Controller container: `redis-cli -h valkey -p 6379 ping`
- Check Controller logs for Valkey connection errors (e.g., connection refused, authentication failures, timeout).
- If using Docker networking, ensure the Controller and Valkey containers are on the same Docker network.
- Verify that Valkey is not running in protected mode, which blocks connections from non-loopback addresses by default.
### Stale Sessions in Valkey
- Stale session cleanup runs automatically based on the heartbeat TTL. If sessions persist after an instance goes down, wait for the `HA_STALE_INSTANCE_TTL_SECS` period to elapse (default: 30 seconds).
- Check that `HA_HEARTBEAT_INTERVAL_SECS` is shorter than the stale TTL. If the heartbeat interval is too close to or exceeds the TTL, instances may be prematurely marked stale.
- Manually inspect Valkey state using `redis-cli` to view registered instances and active sessions.
### Sentinel Failover Issues
- Verify Sentinel can reach all Valkey nodes (primary and replicas) on the configured ports.
- Confirm that `VALKEY_SENTINEL_MASTER_NAME` matches the master name in your Sentinel configuration exactly (case-sensitive).
- Ensure you have at least three Sentinel instances to achieve quorum for failover decisions.
- Check Sentinel logs for `+odown` and `+failover` events to verify failover is triggering correctly.
- After a failover, verify the Controller reconnected to the new primary by checking the health endpoint or Controller logs.
### Load Balancer Configuration
- The load balancer must support WebSocket upgrades. Ensure the `Upgrade` and `Connection` headers are passed through to backend instances.
- Sticky sessions at the load balancer level are not required — the connection broker handles session routing. However, enabling sticky sessions can reduce proxy hops for established connections.
- Configure health checks to poll `/api/v1/health` and remove instances that return non-200 responses.
- Set appropriate timeouts for WebSocket connections (idle timeout should be at least 300 seconds to avoid dropping long-running terminal sessions).
### Session Migration Not Working
- Verify all instances share the same `VAULT_MASTER_KEY` and `JWT_SECRET`. Mismatched secrets prevent session portability.
- Check that the failed instance has been removed from the registry (its heartbeat TTL has expired).
- Clients must reconnect through the load balancer for session migration to occur — direct connections to a specific instance bypass the connection broker.
## Related Features
- Session Sharing — Share terminal sessions across users. In HA mode, session sharing works cross-instance via the connection broker.
- System Settings — Configure platform-wide settings including feature toggles and license management.
- Audit Logs — Track administrative actions and session activity. Audit logs are stored in PostgreSQL and are preserved across instance failures.