High Availability Deployment
Enterprise

Deploy NetStacks Controller in a production-grade, highly available configuration with multiple controller instances, Valkey Sentinel, external PostgreSQL, Nginx load balancing, TLS, backups, and geo-HA considerations.
Overview
NetStacks Controller supports active-active high availability deployment, meaning every controller instance actively serves requests simultaneously. There is no primary/standby distinction among controller nodes — any instance can handle any API call, WebSocket connection, or SSH proxy session.
The HA architecture eliminates single points of failure across every layer of the stack:
- Controller instances — Two or more instances run behind a load balancer. If one goes down, the remaining instances continue serving all traffic. Sessions on the failed instance are automatically migrated when clients reconnect.
- Valkey (session coordination) — A Redis-compatible in-memory store that tracks which controller instance owns each session, enabling cross-instance session routing and connection brokering. Valkey Sentinel provides automated failover for the Valkey tier itself.
- PostgreSQL — The database stores all persistent state: users, devices, credentials, audit logs, certificates, and configuration. In HA mode, use an external PostgreSQL cluster with replication for database-level redundancy.
- Load balancer — Nginx (or any WebSocket-aware load balancer) distributes traffic across controller instances using least-connections routing, with health-check-based removal of unhealthy nodes.
Key capabilities enabled by HA mode:
- Instance registration and heartbeat — Each controller registers in Valkey with a unique ID and periodic heartbeat. Stale instances are automatically cleaned up after the TTL expires.
- Session registry — Active sessions are tracked in Valkey with the owning instance ID, enabling cross-instance session lookup and migration.
- Connection broker — Incoming WebSocket connections are routed to the correct controller instance via internal proxy. Clients connect to any instance; the broker handles the rest.
- Circuit breaker — The LLM orchestrator tracks AI provider health per-instance and trips a circuit breaker after consecutive failures, preventing cascading timeouts.
- HA status dashboard — The Admin UI provides real-time cluster health, instance status, session distribution, and Valkey connectivity at Admin → HA Status.
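Several of these mechanisms can be observed directly from the cluster health endpoint (documented in full under Health Checks & Monitoring below). For example, to inspect instance registration and session distribution from any node:

# Query the cluster section of the health endpoint (requires jq;
# -k accepts the self-signed certificate used in default setups)
curl -sk https://<controller-host>/api/v1/health | jq '.cluster'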
High availability requires an Enterprise license with multi-instance Controller support. Single-instance deployments do not require Valkey or any of the other HA components described on this page.
Architecture
The following diagram shows a production HA deployment with all components. The load balancer receives all external traffic and distributes it across the controller instances. Both controllers share the same PostgreSQL database and Valkey cluster for state coordination.
┌──────────────────────┐
│ Terminal Clients │
│ (Tauri desktop app) │
└──────────┬───────────┘
│
HTTPS / WSS
│
┌──────────▼───────────┐
│ Nginx Load Balancer │
│ (least_conn, TLS │
│ termination) │
└────┬────────────┬─────┘
│ │
┌────────▼──┐ ┌────▼────────┐
│Controller │ │ Controller │
│Instance 1 │ │ Instance 2 │
│ :3000 │ │ :3000 │
└──┬────┬───┘ └──┬────┬──────┘
│ │ │ │
┌────────┘ └─────┬─────┘ └────────┐
│ │ │
┌─────────▼─────────┐ ┌─────▼──────────┐ ┌─────▼──────────┐
│ PostgreSQL 16 │ │ Valkey Primary │ │ Network Devices│
│ (pgvector ext) │ │ + Replicas │ │ (SSH/Telnet) │
│ │ │ │ │ │
│ Users, devices, │ │ Session state, │ └────────────────┘
│ credentials, │ │ instance reg, │
│ audit logs, │ │ heartbeats │
│ certificates │ │ │
└────────────────────┘ └────────┬────────┘
│
┌────────▼────────┐
│ Valkey Sentinel │
│ (3 instances) │
│ │
│ Monitors primary,│
│ auto-promotes │
│ replica on fail │
└─────────────────┘

In this architecture:
- Nginx terminates TLS and distributes traffic to controller instances using least_conn. It passes WebSocket upgrades through to the backend.
- Controller instances are stateless application servers. All persistent state lives in PostgreSQL; all ephemeral session state lives in Valkey. You can scale horizontally by adding more instances.
- Valkey Primary + Replicas handle session coordination. The primary handles writes; replicas provide read scaling and failover candidates.
- Valkey Sentinel monitors the primary and automatically promotes a replica if the primary fails. Three sentinels are required for quorum.
- PostgreSQL is the single source of truth for all persistent data. Use an external PostgreSQL cluster with streaming replication for database-level HA.
Prerequisites
Before deploying NetStacks Controller in any configuration, ensure the following requirements are met:
| Requirement | Minimum | Recommended (HA) |
|---|---|---|
| Docker Engine | 24.0+ | 25.0+ |
| Docker Compose | v2.20+ | v2.24+ |
| RAM per controller instance | 4 GB | 8 GB |
| CPU per controller instance | 2 vCPU | 4 vCPU |
| PostgreSQL | 16+ with pgvector | 16+ with pgvector, streaming replication |
| Valkey / Redis | Not required (single instance) | Valkey 8+ with Sentinel |
| Disk (database) | 20 GB | 100 GB+ SSD |
| Disk (Valkey) | 1 GB | 4 GB SSD (AOF persistence) |
| Network | Controller ↔ devices reachable on SSH/Telnet ports | Low-latency interconnect between controller instances |
| License | Enterprise (single instance) | Enterprise (multi-instance HA) |
On AWS, use RDS for PostgreSQL with the pgvector extension (supported on RDS 16+) and ElastiCache for Valkey. On GCP, use Cloud SQL for PostgreSQL and Memorystore for Redis/Valkey. On Azure, use Azure Database for PostgreSQL Flexible Server and Azure Cache for Redis.
Single Instance Setup
Start with a single-instance deployment to verify your configuration before scaling to HA. This setup runs PostgreSQL, the controller API, and optionally Valkey in a single Docker Compose stack.
Generate Secrets
Before creating the Docker Compose file, generate the required secrets. These values must be kept secure and consistent across all controller instances if you later scale to HA.
# Generate the vault master key (64 hex characters = 32 bytes)
# This encrypts all credentials, SSH CA keys, and sensitive data in the database
export VAULT_MASTER_KEY=$(openssl rand -hex 32)
echo "VAULT_MASTER_KEY=$VAULT_MASTER_KEY"
# Generate the JWT signing secret (64 hex characters = 32 bytes)
# This signs all authentication tokens — if changed, all users are logged out
export JWT_SECRET=$(openssl rand -hex 32)
echo "JWT_SECRET=$JWT_SECRET"
# Generate the database password
export DB_PASSWORD=$(openssl rand -base64 24)
echo "DB_PASSWORD=$DB_PASSWORD"
# Save these values securely — you will need them for the docker-compose file
# and for any additional controller instances in HA mode

The VAULT_MASTER_KEY is the root of trust for all encrypted data in NetStacks. If lost, encrypted credentials and SSH CA private keys cannot be recovered. Store it in a secrets manager (HashiCorp Vault, AWS Secrets Manager, etc.) or at minimum in a secure, backed-up location outside the Docker host.
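As one example of the secrets-manager approach, the AWS CLI can store and retrieve the key; the secret name below is illustrative, and the same pattern applies to HashiCorp Vault or any other manager:

# Store the vault master key (illustrative secret name)
aws secretsmanager create-secret \
  --name netstacks/vault-master-key \
  --secret-string "$VAULT_MASTER_KEY"

# Retrieve it later, e.g. when provisioning an additional controller instance
export VAULT_MASTER_KEY=$(aws secretsmanager get-secret-value \
  --secret-id netstacks/vault-master-key \
  --query SecretString --output text)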
Docker Compose (Single Instance)
version: "3.8"
services:
postgres:
image: pgvector/pgvector:pg16
restart: unless-stopped
environment:
POSTGRES_USER: netstacks
POSTGRES_PASSWORD: ${DB_PASSWORD}
POSTGRES_DB: netstacks
volumes:
- postgres-data:/var/lib/postgresql/data
ports:
- "127.0.0.1:5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U netstacks"]
interval: 5s
timeout: 5s
retries: 5
controller:
image: ghcr.io/netstacks/controller:latest
restart: unless-stopped
depends_on:
postgres:
condition: service_healthy
ports:
- "3000:3000"
environment:
# Database
DATABASE_URL: postgres://netstacks:${DB_PASSWORD}@postgres:5432/netstacks
# Security — generate with: openssl rand -hex 32
VAULT_MASTER_KEY: ${VAULT_MASTER_KEY}
JWT_SECRET: ${JWT_SECRET}
# TLS — auto-generates a self-signed CA on first boot
# Replace with your own certs for production (see TLS section)
TLS_CERT_PATH: /data/tls/cert.pem
TLS_KEY_PATH: /data/tls/key.pem
TLS_AUTO_GENERATE: "true"
# Instance naming (optional, shown in Admin UI)
NETSTACKS_INSTANCE_NAME: "controller-1"
# Logging
RUST_LOG: "info,netstacks=debug"
volumes:
- controller-data:/data
volumes:
postgres-data:
  controller-data:

Start the Stack
# Create a .env file with your generated secrets
cat > .env << 'EOF'
VAULT_MASTER_KEY=<your-64-hex-char-key>
JWT_SECRET=<your-64-hex-char-jwt-secret>
DB_PASSWORD=<your-database-password>
EOF
# Start the stack
docker compose up -d
# Watch the logs for the initial admin password
docker compose logs -f controller

First-Run Admin Password
On first boot, the controller creates a default admin user and prints a randomly generated password to the logs. Look for a log line like:
[INFO] Created default admin user
[INFO] Admin password: Xk9mP2vL8nQwR4tY
[INFO] Change this password immediately after first login

The auto-generated admin password is printed to stdout in plaintext. Log in to the Admin UI at https://your-host:3000, navigate to your profile, and change the password. If you miss the password in the logs, you can recreate the admin user by removing the database volume and restarting: docker compose down -v && docker compose up -d. Note that this wipes the entire database, so it is only appropriate on a fresh install.
Verify the Deployment
# Check the health endpoint (use -k for self-signed TLS)
curl -k https://localhost:3000/api/v1/health
# Expected response:
# {
# "status": "healthy",
# "instance_id": "ctrl-a1b2c3d4",
# "uptime_secs": 42,
# "valkey": "not_configured",
# "database": "connected",
# "cluster": {
# "registered_instances": 1,
# "healthy_instances": 1,
# "total_sessions": 0,
# "local_sessions": 0
# }
# }
# Log in and get a token
curl -k -X POST https://localhost:3000/api/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "admin", "password": "<password-from-logs>"}'High Availability Setup
The full HA deployment adds Valkey with Sentinel for session coordination, multiple controller instances, and an Nginx load balancer. All controller instances must share the same VAULT_MASTER_KEY, JWT_SECRET, and DATABASE_URL — these are the glue that makes the cluster function as a single logical system.
Every controller instance in the cluster must use identical values for VAULT_MASTER_KEY and JWT_SECRET. If these values differ, credentials encrypted on one instance cannot be decrypted on another, and JWT tokens issued by one instance will be rejected by the others. Generate these secrets once and distribute them to all instances via your secrets management system.
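One way to confirm the values match without printing them is to compare hashes on each instance. A sketch using the compose service names from this guide (assumes the controller image ships a POSIX shell and sha256sum):

# Print SHA-256 fingerprints of the secrets on both instances;
# the fingerprints must be identical, the secrets are never echoed
for c in controller-1 controller-2; do
  echo "$c:"
  docker compose -f docker-compose-ha.yml exec "$c" sh -c \
    'printf "%s" "$VAULT_MASTER_KEY" | sha256sum; printf "%s" "$JWT_SECRET" | sha256sum'
done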
Full HA Docker Compose
version: "3.8"
services:
# ─────────────────────────────────────────────
# PostgreSQL (use external DB for production)
# ─────────────────────────────────────────────
postgres:
image: pgvector/pgvector:pg16
restart: unless-stopped
environment:
POSTGRES_USER: netstacks
POSTGRES_PASSWORD: ${DB_PASSWORD}
POSTGRES_DB: netstacks
volumes:
- postgres-data:/var/lib/postgresql/data
ports:
- "127.0.0.1:5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U netstacks"]
interval: 5s
timeout: 5s
retries: 5
# ─────────────────────────────────────────────
# Valkey Primary
# ─────────────────────────────────────────────
valkey-primary:
image: valkey/valkey:8-alpine
restart: unless-stopped
command: >
valkey-server
--appendonly yes
--appendfsync everysec
--maxmemory 512mb
--maxmemory-policy noeviction
volumes:
- valkey-primary-data:/data
ports:
- "127.0.0.1:6379:6379"
healthcheck:
test: ["CMD", "valkey-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
# ─────────────────────────────────────────────
# Valkey Replicas
# ─────────────────────────────────────────────
valkey-replica-1:
image: valkey/valkey:8-alpine
restart: unless-stopped
command: >
valkey-server
--appendonly yes
--appendfsync everysec
--replicaof valkey-primary 6379
volumes:
- valkey-replica1-data:/data
depends_on:
valkey-primary:
condition: service_healthy
valkey-replica-2:
image: valkey/valkey:8-alpine
restart: unless-stopped
command: >
valkey-server
--appendonly yes
--appendfsync everysec
--replicaof valkey-primary 6379
volumes:
- valkey-replica2-data:/data
depends_on:
valkey-primary:
condition: service_healthy
# ─────────────────────────────────────────────
# Valkey Sentinels (3 for quorum)
# ─────────────────────────────────────────────
valkey-sentinel-1:
image: valkey/valkey:8-alpine
restart: unless-stopped
command: >
valkey-sentinel /etc/valkey/sentinel.conf
volumes:
- ./sentinel.conf:/etc/valkey/sentinel.conf:ro
depends_on:
valkey-primary:
condition: service_healthy
valkey-sentinel-2:
image: valkey/valkey:8-alpine
restart: unless-stopped
command: >
valkey-sentinel /etc/valkey/sentinel.conf
volumes:
- ./sentinel.conf:/etc/valkey/sentinel.conf:ro
depends_on:
valkey-primary:
condition: service_healthy
valkey-sentinel-3:
image: valkey/valkey:8-alpine
restart: unless-stopped
command: >
valkey-sentinel /etc/valkey/sentinel.conf
volumes:
- ./sentinel.conf:/etc/valkey/sentinel.conf:ro
depends_on:
valkey-primary:
condition: service_healthy
# ─────────────────────────────────────────────
# Controller Instance 1
# ─────────────────────────────────────────────
controller-1:
image: ghcr.io/netstacks/controller:latest
restart: unless-stopped
depends_on:
postgres:
condition: service_healthy
valkey-primary:
condition: service_healthy
environment:
DATABASE_URL: postgres://netstacks:${DB_PASSWORD}@postgres:5432/netstacks
VAULT_MASTER_KEY: ${VAULT_MASTER_KEY}
JWT_SECRET: ${JWT_SECRET}
NETSTACKS_INSTANCE_NAME: "controller-1"
RUST_LOG: "info,netstacks=debug"
# Valkey Sentinel — controllers discover the primary automatically
VALKEY_SENTINEL_URL: "redis://valkey-sentinel-1:26379,valkey-sentinel-2:26379,valkey-sentinel-3:26379"
VALKEY_SENTINEL_MASTER_NAME: "mymaster"
# HA tuning
HA_HEARTBEAT_INTERVAL_SECS: "5"
HA_STALE_INSTANCE_TTL_SECS: "30"
# TLS
TLS_CERT_PATH: /data/tls/cert.pem
TLS_KEY_PATH: /data/tls/key.pem
TLS_AUTO_GENERATE: "true"
volumes:
- controller1-data:/data
# ─────────────────────────────────────────────
# Controller Instance 2
# ─────────────────────────────────────────────
controller-2:
image: ghcr.io/netstacks/controller:latest
restart: unless-stopped
depends_on:
postgres:
condition: service_healthy
valkey-primary:
condition: service_healthy
environment:
DATABASE_URL: postgres://netstacks:${DB_PASSWORD}@postgres:5432/netstacks
VAULT_MASTER_KEY: ${VAULT_MASTER_KEY}
JWT_SECRET: ${JWT_SECRET}
NETSTACKS_INSTANCE_NAME: "controller-2"
RUST_LOG: "info,netstacks=debug"
# Same Sentinel config — both instances discover the same Valkey primary
VALKEY_SENTINEL_URL: "redis://valkey-sentinel-1:26379,valkey-sentinel-2:26379,valkey-sentinel-3:26379"
VALKEY_SENTINEL_MASTER_NAME: "mymaster"
HA_HEARTBEAT_INTERVAL_SECS: "5"
HA_STALE_INSTANCE_TTL_SECS: "30"
TLS_CERT_PATH: /data/tls/cert.pem
TLS_KEY_PATH: /data/tls/key.pem
TLS_AUTO_GENERATE: "true"
volumes:
- controller2-data:/data
# ─────────────────────────────────────────────
# Nginx Load Balancer
# ─────────────────────────────────────────────
nginx:
image: nginx:1.27-alpine
restart: unless-stopped
ports:
- "443:443"
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- ./certs:/etc/nginx/certs:ro
depends_on:
- controller-1
- controller-2
volumes:
postgres-data:
valkey-primary-data:
valkey-replica1-data:
valkey-replica2-data:
controller1-data:
  controller2-data:

Sentinel Configuration
Create a sentinel.conf file in the same directory as your Docker Compose file. The quorum value of 2 means two out of three sentinels must agree that the primary is down before triggering failover.
# sentinel.conf
port 26379
# Monitor the Valkey primary
# Format: sentinel monitor <master-name> <host> <port> <quorum>
sentinel monitor mymaster valkey-primary 6379 2
# Time in ms before a non-responding primary is considered down
sentinel down-after-milliseconds mymaster 5000
# Failover timeout in ms
sentinel failover-timeout mymaster 10000
# How many replicas can sync from the new primary simultaneously
sentinel parallel-syncs mymaster 1
# Deny external writes to sentinel config (security)
sentinel deny-scripts-reconfig yes
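Before relying on automated failover, confirm that the sentinels agree on the current primary. The standard SENTINEL get-master-addr-by-name command does this:

# Ask a sentinel which address it currently considers the primary
docker compose -f docker-compose-ha.yml exec valkey-sentinel-1 \
  valkey-cli -p 26379 sentinel get-master-addr-by-name mymaster
# Expected output is the valkey-primary address and port, e.g.:
# 1) "172.18.0.4"
# 2) "6379"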
Nginx Configuration

The Nginx config must handle three things: HTTPS termination, least_conn load balancing, and WebSocket upgrade passthrough. The proxy_read_timeout is set to 3600s (1 hour) to support long-running terminal sessions.
worker_processes auto;
events {
worker_connections 2048;
}
http {
# Upstream — controller instances
upstream netstacks_api {
least_conn;
server controller-1:3000;
server controller-2:3000;
}
# Connection upgrade map for WebSocket support
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
# Redirect HTTP to HTTPS
server {
listen 80;
server_name _;
return 301 https://$host$request_uri;
}
# Main HTTPS server
server {
listen 443 ssl;
server_name netstacks.example.net;
# TLS certificates
ssl_certificate /etc/nginx/certs/fullchain.pem;
ssl_certificate_key /etc/nginx/certs/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
# Security headers
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "DENY" always;
# Proxy settings
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# WebSocket support — critical for terminal sessions
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
# Long timeout for terminal sessions (1 hour)
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
# All traffic to the controller API
location / {
proxy_pass https://netstacks_api;
# TLS to backend (controller uses self-signed certs)
proxy_ssl_verify off;
}
# Health check endpoint for monitoring (no auth required)
location /api/v1/health {
proxy_pass https://netstacks_api;
proxy_ssl_verify off;
access_log off;
}
}
}
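Once the stack is running, edits to nginx.conf can be validated and applied without dropping connections, using standard Nginx commands inside the container:

# Validate the configuration syntax before applying it
docker compose -f docker-compose-ha.yml exec nginx nginx -t

# Reload gracefully; existing connections are not dropped
docker compose -f docker-compose-ha.yml exec nginx nginx -s reload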
Start the HA Stack

# Ensure your .env file has all required secrets
cat .env
# VAULT_MASTER_KEY=<64-hex-chars>
# JWT_SECRET=<64-hex-chars>
# DB_PASSWORD=<database-password>
# Start the full stack
docker compose -f docker-compose-ha.yml up -d
# Verify all containers are running
docker compose -f docker-compose-ha.yml ps
# Check cluster health from the load balancer
curl -k https://localhost/api/v1/health
# The response should show 2 registered instances:
# {
# "status": "healthy",
# "cluster": {
# "registered_instances": 2,
# "healthy_instances": 2,
# ...
# }
# }
# Verify both instances individually
docker compose -f docker-compose-ha.yml exec controller-1 \
curl -sk https://localhost:3000/api/v1/health | jq .instance_id
docker compose -f docker-compose-ha.yml exec controller-2 \
  curl -sk https://localhost:3000/api/v1/health | jq .instance_id

To add a third (or more) controller instance, duplicate the controller-2 service block with a new name and volume, and add the new container to the upstream block in Nginx. No Valkey or database changes are needed — additional instances register automatically via Valkey on startup.
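After bringing up an additional instance, confirm that it registered with the cluster using the same health endpoint:

# With a third instance added, the count should report 3
curl -sk https://localhost/api/v1/health | jq '.cluster.registered_instances'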
External Database
For production HA deployments, use an external PostgreSQL instance (or managed service) instead of the Docker Compose PostgreSQL container. This provides database-level redundancy, automated backups, point-in-time recovery, and independent scaling.
PostgreSQL Setup
# Connect to PostgreSQL as superuser
psql -h db.example.net -U postgres
-- Create the netstacks database and user
CREATE USER netstacks WITH PASSWORD 'your-secure-password';
CREATE DATABASE netstacks OWNER netstacks;

-- Connect to the netstacks database
\c netstacks

-- Install the pgvector extension (required for AI knowledge base)
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify the extension is installed
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
--  extname | extversion
-- ---------+------------
--  vector  | 0.7.4

-- Grant schema permissions
GRANT ALL PRIVILEGES ON DATABASE netstacks TO netstacks;
GRANT ALL PRIVILEGES ON SCHEMA public TO netstacks;

The pgvector extension is required for the AI knowledge base and semantic search features. If using a managed PostgreSQL service, verify that pgvector is available. AWS RDS (PostgreSQL 16+), Google Cloud SQL, and Azure Database for PostgreSQL Flexible Server all support pgvector.
Connection Pooling with PgBouncer
For deployments with many concurrent sessions, add PgBouncer between the controller instances and PostgreSQL to pool database connections. Each controller instance maintains its own connection pool, so without PgBouncer, total connection count is instances x pool_size.
# pgbouncer.ini
[databases]
netstacks = host=db.example.net port=5432 dbname=netstacks
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Transaction pooling is recommended for NetStacks
pool_mode = transaction
max_client_conn = 200
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
server_connect_timeout = 15

# userlist.txt — PgBouncer auth file
# Format: "username" "password"
"netstacks" "your-secure-password"# Add PgBouncer to your docker-compose-ha.yml
pgbouncer:
image: edoburu/pgbouncer:1.23.1-p2
restart: unless-stopped
volumes:
- ./pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini:ro
- ./userlist.txt:/etc/pgbouncer/userlist.txt:ro
ports:
- "127.0.0.1:6432:6432"
depends_on:
- postgres
# Then update DATABASE_URL on each controller instance:
# DATABASE_URL: postgres://netstacks:<password>@pgbouncer:6432/netstacks

PostgreSQL's default max_connections is 100. With two controller instances each using a pool of 25 connections, you need at least 50 connections plus overhead for maintenance and migrations. Set max_connections = 200 in postgresql.conf, or use PgBouncer to multiplex many application connections over a smaller pool of database connections.
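To see how close you are to the limit, check the configured maximum and the current connection count with standard PostgreSQL queries:

# Show the configured connection limit
psql -h db.example.net -U netstacks -d netstacks -c "SHOW max_connections;"

# Count the connections currently open against the netstacks database
psql -h db.example.net -U netstacks -d netstacks \
  -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'netstacks';"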
TLS Configuration
The NetStacks Controller requires TLS for all connections. There are three approaches to TLS configuration, depending on your environment and security requirements.
Option 1: Auto-Generated Self-Signed CA (Development)
By default, the controller generates a self-signed CA and certificate on first boot when TLS_AUTO_GENERATE=true. This is suitable for development and testing but not recommended for production. The Terminal app will show a certificate warning on first connection.
# Controller environment (already in the compose examples above)
environment:
TLS_CERT_PATH: /data/tls/cert.pem
TLS_KEY_PATH: /data/tls/key.pem
TLS_AUTO_GENERATE: "true"Option 2: Bring Your Own Certificate (Production)
For production, provide your own TLS certificate and private key. This can be a certificate from an internal CA, a commercially signed certificate, or a certificate from a public CA. Mount the files into the controller container.
# Generate a CSR and private key (if using an internal CA)
openssl req -new -newkey rsa:4096 -nodes \
-keyout netstacks.key \
-out netstacks.csr \
-subj "/CN=netstacks.example.net/O=Example Corp"
# After your CA signs the CSR, you'll have:
# - netstacks.key (private key)
# - netstacks.crt (signed certificate)
# - ca-chain.crt (CA certificate chain)
# Create a full chain file
cat netstacks.crt ca-chain.crt > fullchain.pem
# Verify the certificate
openssl x509 -in fullchain.pem -text -noout | head -20

# Mount your certificates into the controller container
controller-1:
image: ghcr.io/netstacks/controller:latest
environment:
TLS_CERT_PATH: /certs/fullchain.pem
TLS_KEY_PATH: /certs/netstacks.key
TLS_AUTO_GENERATE: "false"
volumes:
- ./certs/fullchain.pem:/certs/fullchain.pem:ro
    - ./certs/netstacks.key:/certs/netstacks.key:ro

Option 3: Let's Encrypt with Certbot (Public-Facing)
If the controller is publicly accessible, use Let's Encrypt for free, automatically renewed TLS certificates. Run certbot on the Nginx host and configure Nginx to serve the certificates.
# Install certbot
sudo apt-get update && sudo apt-get install -y certbot
# Obtain a certificate (standalone mode — stop Nginx first)
sudo certbot certonly --standalone \
-d netstacks.example.net \
--agree-tos \
--email admin@example.net \
--non-interactive
# Certificate files are at:
# /etc/letsencrypt/live/netstacks.example.net/fullchain.pem
# /etc/letsencrypt/live/netstacks.example.net/privkey.pem
# Set up automatic renewal (runs twice daily by default)
sudo systemctl enable --now certbot.timer
# Test renewal
sudo certbot renew --dry-run
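Because Nginx reads certificates only at startup or reload, it should be reloaded whenever a renewal succeeds. One way to wire this up is certbot's --deploy-hook (a sketch; the compose file path is an assumption, adjust it to your host):

# Reload Nginx inside the container after each successful renewal
# (assumes the compose file lives in /opt/netstacks)
sudo certbot renew --deploy-hook \
  'docker compose -f /opt/netstacks/docker-compose-ha.yml exec nginx nginx -s reload'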
# Update nginx volumes to mount Let's Encrypt certificates
nginx:
image: nginx:1.27-alpine
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- /etc/letsencrypt/live/netstacks.example.net:/etc/nginx/certs:ro
- /etc/letsencrypt/archive/netstacks.example.net:/etc/nginx/certs-archive:ro
# In nginx.conf, reference the certificates:
# ssl_certificate /etc/nginx/certs/fullchain.pem;
# ssl_certificate_key /etc/nginx/certs/privkey.pem;

In most production deployments, TLS is terminated at the Nginx load balancer. The controller instances still use self-signed TLS internally (Nginx proxies to the backends with proxy_ssl_verify off). This means external clients see a trusted certificate from Nginx, while internal traffic between Nginx and the controllers uses the auto-generated certs.
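Whichever option you choose, you can verify which certificate external clients actually receive with a standard OpenSSL check against the load balancer:

# Show the subject, issuer, and validity window of the certificate Nginx serves
openssl s_client -connect netstacks.example.net:443 \
  -servername netstacks.example.net </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates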
Geo-HA Considerations
For organizations that need controller availability across geographic regions, NetStacks can be deployed in a multi-region configuration. However, there are important trade-offs to understand compared to a single-region HA deployment.
Managed PostgreSQL Across Regions
Use a managed PostgreSQL service with cross-region replication for database-level geo-redundancy:
- AWS RDS — Create a Multi-AZ deployment with a cross-region read replica. In a failover scenario, promote the read replica to primary.
- Google Cloud SQL — Use cross-region replicas with automatic failover enabled.
- Azure Database for PostgreSQL — Use geo-redundant backup or read replicas in a secondary region.
Cross-region PostgreSQL replication is asynchronous by default, meaning there is a replication lag window during which the secondary region may serve stale data. Synchronous replication across regions eliminates this but adds significant write latency (typically 50–200ms per write). Most deployments should use asynchronous replication and accept the small consistency window.
Regional Valkey Clusters
Valkey does not natively support cross-region replication. For geo-HA, deploy an independent Valkey cluster (primary + replicas + sentinels) in each region. Each regional controller cluster uses its own Valkey cluster.
- Session state is regional — sessions created in Region A are tracked in Region A's Valkey cluster and served by Region A's controller instances.
- If Region A fails, sessions on Region A controllers are lost. Users reconnect to Region B via DNS failover and establish new sessions. Persistent data (users, devices, credentials) is available in Region B via the database replica.
DNS-Based Routing
Use DNS routing to direct users to the nearest or healthiest region:
# Example: AWS Route 53 health-checked failover
# Primary record (us-east-1)
netstacks.example.net A 203.0.113.10 (failover: PRIMARY, health-check: /api/v1/health)
# Secondary record (eu-west-1)
netstacks.example.net A 198.51.100.20 (failover: SECONDARY, health-check: /api/v1/health)
# Alternative: latency-based routing (active-active geo)
netstacks.example.net A 203.0.113.10 (region: us-east-1, health-check: enabled)
netstacks.example.net A 198.51.100.20 (region: eu-west-1, health-check: enabled)

Session Locality Limitations
Be aware of the following limitations with geo-distributed deployments:
- Sessions are not portable across regions. A terminal session opened in Region A cannot be seamlessly migrated to Region B. If Region A fails, the user must establish a new session in Region B.
- Session sharing is regional. Shared terminal sessions (e.g., a senior engineer watching a junior engineer's session) only work when both users connect to the same regional cluster.
- Device proximity matters. For best performance, route users to the controller region closest to their target network devices. An SSH session proxied through a controller in a distant region adds round-trip latency to every keystroke.
For most organizations, an active-passive geo configuration is simpler and more reliable than active-active. Run the full HA stack in a primary region, with a warm standby in a secondary region (database replica + pre-deployed but stopped controller instances). Failover is triggered by DNS change and starting the standby controllers.
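A failover runbook for that active-passive pattern might look like the following sketch. It assumes AWS, and every resource identifier is illustrative:

# 1. Promote the cross-region read replica to a standalone primary
aws rds promote-read-replica --db-instance-identifier netstacks-replica-euw1

# 2. Start the pre-deployed standby controllers in the secondary region
docker compose -f docker-compose-ha.yml up -d

# 3. Point DNS at the secondary region's load balancer
#    (failover-to-eu-west-1.json is a prepared Route 53 change batch)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000 \
  --change-batch file://failover-to-eu-west-1.json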
Environment Variable Reference
Complete reference of all environment variables accepted by the NetStacks Controller. Variables marked Required must be set for the controller to start. Variables marked HA are only relevant for multi-instance deployments.
Core Configuration
| Variable | Description | Default | Required |
|---|---|---|---|
| DATABASE_URL | PostgreSQL connection string (e.g., postgres://user:pass@host:5432/netstacks) | — | Yes |
| VAULT_MASTER_KEY | 64 hex character (32 byte) key for AES-256-GCM encryption of credentials and secrets | — | Yes |
| JWT_SECRET | Secret key for signing JWT authentication tokens. Must be identical across all controller instances | — | Yes |
| NETSTACKS_INSTANCE_NAME | Human-readable name for this controller instance, displayed in the Admin UI HA status page | hostname | No |
| RUST_LOG | Log level filter (e.g., info,netstacks=debug) | info | No |
| API_PORT | Port the controller API listens on | 3000 | No |
| API_BIND_ADDR | Bind address for the API server | 0.0.0.0 | No |
TLS Configuration
| Variable | Description | Default |
|---|---|---|
| TLS_CERT_PATH | Path to the TLS certificate (PEM format) | /data/tls/cert.pem |
| TLS_KEY_PATH | Path to the TLS private key (PEM format) | /data/tls/key.pem |
| TLS_AUTO_GENERATE | Auto-generate a self-signed cert on first boot if no cert exists at TLS_CERT_PATH | true |
Valkey / HA Configuration
| Variable | Description | Default |
|---|---|---|
| VALKEY_URL | Direct Valkey connection URL (e.g., redis://valkey:6379). Use this for single-Valkey-instance setups without Sentinel | None (HA disabled) |
| VALKEY_SENTINEL_URL | Comma-separated Sentinel addresses for automated Valkey failover (e.g., redis://s1:26379,s2:26379,s3:26379). Overrides VALKEY_URL when set | None |
| VALKEY_SENTINEL_MASTER_NAME | Sentinel master name for primary discovery. Must match the name in sentinel.conf | mymaster |
| VALKEY_PASSWORD | Password for Valkey authentication (if Valkey requires auth) | None |
| HA_HEARTBEAT_INTERVAL_SECS | How often each instance sends a heartbeat to Valkey | 5 |
| HA_STALE_INSTANCE_TTL_SECS | Time after which a non-heartbeating instance is considered stale and removed from the cluster | 30 |
SSH Proxy Configuration
| Variable | Description | Default |
|---|---|---|
| SSH_PROXY_PORT | Port for the SSH proxy listener | 2222 |
| SSH_CONNECT_TIMEOUT_SECS | Timeout for establishing SSH connections to network devices | 30 |
| SESSION_ORPHAN_TIMEOUT_SECS | How long an orphaned session (no active viewer) remains alive before cleanup | 300 |
License Configuration
| Variable | Description | Default |
|---|---|---|
| LICENSE_KEY | License key for the controller. Can also be set via the Admin UI | None |
| LICENSE_SERVER_URL | URL of the license validation server | https://license.netstacks.io |
Health Checks & Monitoring
The controller exposes several health and status endpoints for load balancer health checks, monitoring systems, and operational dashboards.
Health Endpoints
| Endpoint | Auth Required | Purpose |
|---|---|---|
| GET /api/v1/health | No | Full health check including database, Valkey, and cluster status. Use for load balancer health checks. |
| GET /api/v1/ready | No | Readiness probe. Returns 200 when the instance is fully initialized and ready to accept traffic. |
| GET /api/admin/health | Yes (admin) | Extended health information with database pool stats, Valkey memory usage, and detailed diagnostics. |
Health Check Response
# Basic health check (no auth required)
curl -k https://netstacks.example.net/api/v1/health{
"status": "healthy",
"instance_id": "ctrl-a1b2c3d4",
"instance_name": "controller-1",
"uptime_secs": 86400,
"version": "0.1.0",
"valkey": "connected",
"database": "connected",
"cluster": {
"registered_instances": 2,
"healthy_instances": 2,
"total_sessions": 42,
"local_sessions": 21
}
}

Monitoring with Prometheus
While the controller does not expose a native Prometheus metrics endpoint, you can use a JSON exporter to scrape the health endpoint and convert it to Prometheus metrics.
# prometheus.yml — scrape NetStacks health via json_exporter
scrape_configs:
- job_name: 'netstacks-health'
metrics_path: /probe
params:
module: [netstacks]
static_configs:
- targets:
- https://netstacks.example.net/api/v1/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: json-exporter:7979
---
# json_exporter config (config.yml)
modules:
netstacks:
metrics:
- name: netstacks_up
path: '{ .status }'
help: "Controller health status"
values:
healthy: 1
degraded: 0
- name: netstacks_uptime_seconds
path: '{ .uptime_secs }'
help: "Controller uptime in seconds"
- name: netstacks_cluster_instances
path: '{ .cluster.registered_instances }'
help: "Number of registered controller instances"
- name: netstacks_cluster_healthy_instances
path: '{ .cluster.healthy_instances }'
help: "Number of healthy controller instances"
- name: netstacks_sessions_total
path: '{ .cluster.total_sessions }'
help: "Total active sessions across the cluster"
- name: netstacks_sessions_local
path: '{ .cluster.local_sessions }'
help: "Active sessions on this instance"Alerting Examples
#!/bin/bash
# Check cluster health from a monitoring script
# (the shebang must be the first line of the script)
RESPONSE=$(curl -sk https://netstacks.example.net/api/v1/health)
STATUS=$(echo "$RESPONSE" | jq -r '.status')
HEALTHY=$(echo "$RESPONSE" | jq -r '.cluster.healthy_instances')
TOTAL=$(echo "$RESPONSE" | jq -r '.cluster.registered_instances')
if [ "$STATUS" != "healthy" ]; then
echo "CRITICAL: NetStacks controller is $STATUS"
exit 2
fi
if [ "$HEALTHY" -lt "$TOTAL" ]; then
echo "WARNING: $HEALTHY/$TOTAL instances healthy"
exit 1
fi
echo "OK: $HEALTHY/$TOTAL instances healthy, $STATUS"
exit 0

Configure your load balancer to poll /api/v1/health every 10 seconds with a 5-second timeout. Remove instances from the pool after 3 consecutive failures. With open-source Nginx, health checking is passive: the max_fails and fail_timeout parameters on each upstream server entry approximate this policy. The health endpoint is lightweight and does not require authentication, making it safe for frequent polling.
Backup & Restore
PostgreSQL is the single source of truth for all persistent data in NetStacks. Regular database backups are essential for disaster recovery. Valkey data is ephemeral (session state, heartbeats) and does not need to be backed up — it rebuilds automatically when instances restart.
Manual Database Backup
# Backup the full database (compressed)
pg_dump -h db.example.net -U netstacks -d netstacks \
--format=custom \
--compress=9 \
--file=netstacks-backup-$(date +%Y%m%d-%H%M%S).dump
# Backup only the schema (for documentation or migration reference)
pg_dump -h db.example.net -U netstacks -d netstacks \
--schema-only \
--file=netstacks-schema-$(date +%Y%m%d).sql
# Backup specific tables (e.g., just users and devices)
pg_dump -h db.example.net -U netstacks -d netstacks \
--format=custom \
--table=users --table=devices --table=credentials \
  --file=netstacks-core-$(date +%Y%m%d-%H%M%S).dump
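Custom-format dumps can be sanity-checked without restoring them; listing the archive's table of contents fails immediately on a corrupt file:

# Verify a backup archive by listing its table of contents
pg_restore --list netstacks-backup-20260410-020000.dump | head -20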
Automated Backups with postgres-backup-local

Add an automated backup container to your Docker Compose stack that runs daily backups with configurable retention.
# Add to your docker-compose-ha.yml
db-backup:
image: prodrigestivill/postgres-backup-local:16
restart: unless-stopped
depends_on:
postgres:
condition: service_healthy
environment:
POSTGRES_HOST: postgres
POSTGRES_DB: netstacks
POSTGRES_USER: netstacks
POSTGRES_PASSWORD: ${DB_PASSWORD}
POSTGRES_EXTRA_OPTS: "--format=custom --compress=9"
# Schedule: daily at 2:00 AM
SCHEDULE: "0 2 * * *"
# Retention policy
BACKUP_KEEP_DAYS: 7 # Keep daily backups for 7 days
BACKUP_KEEP_WEEKS: 4 # Keep weekly backups for 4 weeks
BACKUP_KEEP_MONTHS: 6 # Keep monthly backups for 6 months
# Healthcheck URL (optional, for monitoring)
HEALTHCHECK_PORT: 8080
volumes:
- db-backups:/backups
volumes:
db-backups:
driver: local
driver_opts:
type: none
o: bind
device: /opt/netstacks/backupsRestore from Backup
# Stop the controller instances first
docker compose -f docker-compose-ha.yml stop controller-1 controller-2
# Restore from a custom-format backup
pg_restore -h db.example.net -U netstacks -d netstacks \
--clean \
--if-exists \
--no-owner \
--no-privileges \
netstacks-backup-20260410-020000.dump
# Verify the restore
psql -h db.example.net -U netstacks -d netstacks \
-c "SELECT count(*) FROM users; SELECT count(*) FROM devices;"
# Restart the controller instances
docker compose -f docker-compose-ha.yml start controller-1 controller-2

Controller Data Volume Backup
Besides the database, the controller stores some data on its local volume (auto-generated TLS certs, SSH CA keys if not using the vault, and temporary files). Back up the Docker volumes if you use auto-generated TLS certificates.
# Backup controller data volumes
docker run --rm \
-v netstacks_controller1-data:/source:ro \
-v /opt/netstacks/backups:/backup \
alpine tar czf /backup/controller1-data-$(date +%Y%m%d).tar.gz -C /source .
docker run --rm \
-v netstacks_controller2-data:/source:ro \
-v /opt/netstacks/backups:/backup \
  alpine tar czf /backup/controller2-data-$(date +%Y%m%d).tar.gz -C /source .
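Restoring a data volume is the reverse operation (a sketch; stop the affected controller first so files are not overwritten while in use):

# Restore a controller data volume from a tar archive
# (the archive name matches the date the backup was taken)
docker compose -f docker-compose-ha.yml stop controller-1
docker run --rm \
  -v netstacks_controller1-data:/target \
  -v /opt/netstacks/backups:/backup:ro \
  alpine tar xzf /backup/controller1-data-20260410.tar.gz -C /target
docker compose -f docker-compose-ha.yml start controller-1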
Database backups contain encrypted credential data. To restore credentials, you must have the original VAULT_MASTER_KEY that was used when the data was encrypted. A backup restored with a different vault key will start successfully, but all encrypted credentials will be unreadable. Always back up the vault master key separately from the database.

Upgrading
NetStacks Controller supports zero-downtime upgrades in HA mode by performing rolling restarts. The controller automatically applies database migrations on startup, so no manual migration step is needed.
Standard Upgrade (Single Instance)
# Pull the latest image
docker compose pull controller
# Restart with the new image
docker compose up -d controller
# Watch the logs for migration output and successful startup
docker compose logs -f controller
# Verify health
curl -k https://localhost:3000/api/v1/health

Rolling Upgrade (HA Mode)
In HA mode, upgrade one controller instance at a time to maintain availability. The load balancer health check automatically removes the restarting instance and adds it back once healthy.
# Pull the latest image on all hosts
docker compose -f docker-compose-ha.yml pull
# Step 1: Upgrade controller-1
# Stop controller-1 — nginx health check removes it from the pool
docker compose -f docker-compose-ha.yml stop controller-1
# Start controller-1 with the new image
docker compose -f docker-compose-ha.yml up -d controller-1
# Wait for controller-1 to become healthy
until curl -sk https://localhost/api/v1/health | grep -q '"healthy_instances": 2'; do
echo "Waiting for controller-1 to rejoin cluster..."
sleep 5
done
echo "controller-1 is healthy"
# Step 2: Upgrade controller-2
docker compose -f docker-compose-ha.yml stop controller-2
docker compose -f docker-compose-ha.yml up -d controller-2
# Wait for controller-2 to become healthy
until curl -sk https://localhost/api/v1/health | grep -q '"healthy_instances": 2'; do
echo "Waiting for controller-2 to rejoin cluster..."
sleep 5
done
echo "Upgrade complete — both instances healthy"
# Verify the new version
curl -sk https://localhost/api/v1/health | jq '.version'

Database migrations are applied by the first controller instance that starts with the new version. The migration acquires a database-level advisory lock, so if two instances start simultaneously, only one runs the migration while the other waits. However, for safety, always upgrade one instance at a time in a rolling fashion.
Rollback Procedure
If an upgrade introduces issues, roll back by pinning the controller image to the previous version.
# Check the currently running version
docker compose -f docker-compose-ha.yml exec controller-1 \
curl -sk https://localhost:3000/api/v1/health | jq '.version'
# Roll back to a specific version
# Edit docker-compose-ha.yml (or set the image tag):
# controller-1:
# image: ghcr.io/netstacks/controller:0.0.8 # previous version
# Or use an environment variable override (this requires the compose file to
# reference the variable, e.g. image: ${CONTROLLER_IMAGE:-ghcr.io/netstacks/controller:latest})
export CONTROLLER_IMAGE=ghcr.io/netstacks/controller:0.0.8
docker compose -f docker-compose-ha.yml stop controller-1 controller-2
# Start with the previous version
docker compose -f docker-compose-ha.yml up -d controller-1
sleep 10
docker compose -f docker-compose-ha.yml up -d controller-2
# Verify the rollback
curl -sk https://localhost/api/v1/health | jq '.version'

If the new version applied database migrations that alter existing tables (e.g., dropping columns or changing types), rolling back to the previous version may cause errors because the old code expects the old schema. Always check the release notes for migration details before upgrading. For major upgrades that include breaking migrations, take a database backup before starting the upgrade.
For zero-risk upgrades, consider a blue-green approach: deploy the new version as a separate set of controller instances pointing to the same database. Verify health and functionality on the “green” stack, then switch the load balancer to point at the green instances. If issues arise, switch back to “blue” instantly.