# High Availability Deployment
Deploy NetStacks Controller in high availability mode with Valkey for session coordination, connection brokering, and automated failover across multiple instances.
## Overview
NetStacks Controller supports multi-instance deployment for high availability and horizontal scaling. HA mode uses Valkey — a Redis-compatible open-source in-memory data store — to coordinate session state, route connections across instances, and provide failover capabilities.
Key HA components include:
- Instance Registration & Heartbeat — Each Controller instance registers itself in Valkey with a unique ID and heartbeat TTL. Stale instances are automatically cleaned up.
- Session Registry — Active sessions are tracked in Valkey with the hosting instance ID, enabling cross-instance session state coordination.
- Connection Broker — Incoming viewer and terminal connections are routed to the correct Controller instance via WebSocket proxy.
- Sentinel Support — Valkey Sentinel monitors the primary Valkey node and promotes a replica on failure. NetStacks auto-reconnects to the new primary.
- Circuit Breaker — The LLM orchestrator tracks provider health and trips the circuit breaker after consecutive failures, preventing cascading timeouts across the cluster.
- HA Status Dashboard — The Admin UI provides a real-time view of cluster health, instance status, session distribution, and Valkey connectivity.
High availability deployment requires the Enterprise license tier. Single-instance deployments on Teams or Starter tiers do not require Valkey.
## How It Works
### Instance Registration
When a Controller instance starts, it generates a unique instance ID and registers itself in Valkey with a heartbeat TTL. The instance periodically refreshes its heartbeat to signal liveness. If an instance stops sending heartbeats (e.g., due to a crash or network partition), its registration expires and it is automatically removed from the cluster. Other instances detect the removal and can take over orphaned sessions.
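The registration-and-expiry behaviour can be sketched with an in-memory TTL store standing in for Valkey. The class and method names below are illustrative, not the actual NetStacks API; in the real system the TTL is enforced by Valkey key expiry rather than application code.

```python
import time

class InstanceRegistry:
    """Toy stand-in for Valkey: entries expire after a TTL, like SET ... EX."""

    def __init__(self, ttl_secs=30):
        self.ttl_secs = ttl_secs
        self._expiry = {}  # instance_id -> absolute expiry timestamp

    def heartbeat(self, instance_id, now=None):
        # Registering and refreshing are the same operation: reset the TTL.
        now = time.monotonic() if now is None else now
        self._expiry[instance_id] = now + self.ttl_secs

    def live_instances(self, now=None):
        # Entries whose TTL lapsed are treated as stale and dropped,
        # mirroring Valkey's automatic key expiry.
        now = time.monotonic() if now is None else now
        self._expiry = {i: t for i, t in self._expiry.items() if t > now}
        return sorted(self._expiry)

registry = InstanceRegistry(ttl_secs=30)
registry.heartbeat("ctrl-a1b2c3d4", now=0)
registry.heartbeat("ctrl-e5f6a7b8", now=0)
print(registry.live_instances(now=10))       # both instances still live
registry.heartbeat("ctrl-a1b2c3d4", now=25)  # only one keeps heartbeating
print(registry.live_instances(now=40))       # the silent instance has expired
```

The key point the sketch captures is that liveness is inferred purely from heartbeat recency: a crashed or partitioned instance needs no explicit deregistration step.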
### Session Registry
Active terminal sessions are stored in Valkey as a session registry. Each entry maps a session ID to the instance that owns it, along with metadata such as the connected device, user, and session start time. When a client connects, the connection broker queries the registry to determine which instance holds the session.
### Connection Broker
The connection broker receives incoming WebSocket connections and routes them to the correct Controller instance. The routing flow is:
- Client connects to any Controller instance behind the load balancer.
- The receiving instance queries the session registry in Valkey to find the owning instance.
- If the session is owned locally, the connection is handled directly.
- If the session is owned by another instance, the connection is proxied via WebSocket to the owning instance.
- If the owning instance is down, the session can be migrated to the receiving instance.
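The routing flow above can be sketched as a single decision function. Names and return values are illustrative (the real broker proxies WebSocket traffic rather than returning tuples), and the registry is modelled as a plain dict instead of Valkey:

```python
def route_connection(session_id, local_instance, session_registry, live_instances):
    """Decide how an incoming connection for session_id should be handled.

    session_registry: dict mapping session_id -> owning instance id
    live_instances:   set of instance ids with a fresh heartbeat
    """
    owner = session_registry.get(session_id)
    if owner is None:
        session_registry[session_id] = local_instance  # new session: claim it
        return ("handle_locally", local_instance)
    if owner == local_instance:
        return ("handle_locally", local_instance)      # we already own it
    if owner in live_instances:
        return ("proxy_websocket", owner)              # forward to the owner
    # Owner's heartbeat expired: migrate the orphaned session here.
    session_registry[session_id] = local_instance
    return ("migrate", local_instance)

registry = {"sess-1": "ctrl-a", "sess-2": "ctrl-b"}
live = {"ctrl-a", "ctrl-b"}
print(route_connection("sess-1", "ctrl-a", registry, live))        # handled locally
print(route_connection("sess-2", "ctrl-a", registry, live))        # proxied to ctrl-b
print(route_connection("sess-2", "ctrl-a", registry, {"ctrl-a"}))  # owner gone: migrated
```

Because every instance runs the same decision logic against shared state, the load balancer can send any connection to any instance without sticky sessions.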
### Sentinel Support
For production HA deployments, Valkey Sentinel provides automated failover of the Valkey data store itself. Sentinel monitors the primary Valkey node and, if it becomes unavailable, promotes a replica to primary. NetStacks Controller connects to Sentinel rather than directly to Valkey, allowing it to automatically discover and reconnect to the new primary after a failover event.
### Circuit Breaker
The LLM orchestrator includes a circuit breaker pattern for managing AI provider health. When a provider experiences consecutive failures (timeouts, rate limits, or errors), the circuit breaker trips and temporarily stops routing requests to that provider. This prevents cascading failures across the cluster and allows degraded AI functionality rather than complete outage. The circuit breaker automatically resets after a configurable cooldown period.
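A minimal sketch of the pattern, with illustrative thresholds (the actual NetStacks defaults are not documented here):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips after N consecutive failures,
    rejects requests during a cooldown, then closes again."""

    def __init__(self, failure_threshold=3, cooldown_secs=60):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_secs:
            self.opened_at = None  # cooldown elapsed: reset and try again
            self.failures = 0
            return True
        return False  # open: fail fast instead of waiting on a sick provider

    def record_success(self):
        self.failures = 0  # any success resets the consecutive-failure count

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # trip the breaker

breaker = CircuitBreaker(failure_threshold=3, cooldown_secs=60)
for _ in range(3):
    breaker.record_failure(now=0)
print(breaker.allow_request(now=10))  # False: circuit is open
print(breaker.allow_request(now=70))  # True: cooldown elapsed, closed again
```

Production implementations usually add a half-open state that sends a single probe request after the cooldown; this sketch simply closes fully, which is enough to show the fail-fast behaviour described above.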
### Health Endpoints
Each Controller instance exposes a health endpoint at `/api/v1/health` that returns detailed status information including:
- Instance ID and uptime
- Valkey connectivity status
- Number of registered instances in the cluster
- Local and total session counts
- Database connectivity
Load balancers should use this endpoint for health checks to automatically remove unhealthy instances from the pool.
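A monitoring or health-check script can fold the response into a single healthy/unhealthy decision. The field names below follow the sample response later on this page; the exact values returned for a failed dependency (e.g. `"disconnected"`) are an assumption, so treat anything other than `"connected"` as unhealthy:

```python
def is_healthy(health: dict) -> bool:
    """Evaluate a /api/v1/health response body (already parsed from JSON)."""
    if health.get("status") != "healthy":
        return False
    if health.get("valkey") != "connected" or health.get("database") != "connected":
        return False
    # Require every registered instance in the cluster to be reporting healthy.
    cluster = health.get("cluster", {})
    return cluster.get("healthy_instances", 0) >= cluster.get("registered_instances", 1)

sample = {
    "status": "healthy",
    "instance_id": "ctrl-a1b2c3d4",
    "valkey": "connected",
    "database": "connected",
    "cluster": {"registered_instances": 3, "healthy_instances": 3},
}
print(is_healthy(sample))  # True
sample["valkey"] = "disconnected"
print(is_healthy(sample))  # False
```

For load-balancer health checks the HTTP status code alone is usually sufficient; a check like this is more useful for alerting, where partial cluster degradation should page before requests start failing.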
## Step-by-Step Guide
### Workflow 1: Add Valkey to Your Deployment
- Add a Valkey container to your Docker Compose file or provision a standalone Valkey instance.
- Configure the `VALKEY_URL` environment variable on your Controller instance, pointing to the Valkey server (e.g., `redis://valkey:6379`).
- Restart the Controller. On startup, it will detect the Valkey connection and enable HA features automatically.
- Verify connectivity by checking the health endpoint: `curl http://localhost:8080/api/v1/health`. The response should show `"valkey": "connected"`.
Enable Valkey's append-only file (AOF) persistence with `--appendonly yes` to preserve session state across Valkey restarts. Without persistence, all session registrations are lost on restart and instances must re-register.
### Workflow 2: Multi-Instance Deployment
- Deploy two or more Controller API instances, each configured with the same `DATABASE_URL`, `VAULT_MASTER_KEY`, `JWT_SECRET`, and `VALKEY_URL`.
- Place a load balancer (e.g., Nginx, HAProxy, or a cloud ALB) in front of the Controller instances. The load balancer must support WebSocket upgrades.
- Configure health checks on the load balancer to poll `/api/v1/health` on each instance.
- Deploy one or more Controller Admin UI instances behind the same load balancer (or a separate one), configured to point at the API load balancer URL.
- Verify the cluster by navigating to Admin → HA Status in the Admin UI. All instances should appear with a Healthy status.
All Controller instances must share the same `VAULT_MASTER_KEY` and `JWT_SECRET`. If these differ between instances, sessions and encrypted credentials will not be portable across the cluster.
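A minimal `nginx.conf` for the load balancer described above might look like the following. The upstream names match the multi-instance Compose example later on this page, but the backend port (`8080`) is an assumption; verify it against the port your Controller API actually listens on.

```nginx
events {}

http {
    upstream controller_api {
        server controller-api-1:8080;
        server controller-api-2:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://controller_api;
            # Required for WebSocket upgrades:
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            # Long-running terminal sessions need a generous idle timeout:
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
        }
    }
}
```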
### Workflow 3: Enable Sentinel for Valkey Failover
- Deploy at least three Valkey Sentinel instances alongside your Valkey primary and replica(s).
- Configure Sentinel to monitor your Valkey primary with a master name (e.g., `mymaster`).
- Set the `VALKEY_SENTINEL_URL` environment variable on each Controller instance, pointing to the Sentinel addresses (e.g., `redis://sentinel1:26379,sentinel2:26379,sentinel3:26379`).
- Set `VALKEY_SENTINEL_MASTER_NAME` to your Sentinel master name (e.g., `mymaster`).
- Remove or leave unset the `VALKEY_URL` variable — when Sentinel is configured, the Controller discovers the primary automatically.
- Restart the Controller instances. They will connect to Sentinel and auto-discover the current Valkey primary.
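In Docker Compose terms, the steps above reduce to an environment fragment like this sketch. The service and host names (`sentinel1` etc.) are illustrative and must match your Sentinel deployment; the other variables mirror the Compose examples elsewhere on this page.

```yaml
controller-api:
  image: ghcr.io/netstacks/controller-api:latest
  environment:
    DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
    VAULT_MASTER_KEY: "your-64-hex-char-key"
    JWT_SECRET: "your-jwt-secret"
    # Sentinel discovery instead of a direct VALKEY_URL:
    VALKEY_SENTINEL_URL: "redis://sentinel1:26379,sentinel2:26379,sentinel3:26379"
    VALKEY_SENTINEL_MASTER_NAME: "mymaster"
```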
### Workflow 4: Monitor HA Status
- Navigate to Admin → HA Status in the Admin UI.
- View the Instance Table showing each registered instance, its ID, last heartbeat, session count, and health status.
- Check the Valkey Status panel for connection state, memory usage, and replication info.
- Review Session Distribution to see how sessions are balanced across instances.
- Use the health endpoint programmatically for alerting: `curl http://controller:8080/api/v1/health`
The `/api/v1/health` endpoint returns structured JSON suitable for integration with monitoring tools such as Prometheus (via a JSON exporter), Datadog, or custom health-check scripts.
## Code Examples
### Docker Compose with Valkey
```yaml
services:
  valkey:
    image: valkey/valkey:8-alpine
    ports:
      - "6379:6379"
    volumes:
      - valkey-data:/data
    command: valkey-server --appendonly yes

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: netstacks
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: netstacks
    volumes:
      - postgres-data:/var/lib/postgresql/data

  controller-api:
    image: ghcr.io/netstacks/controller-api:latest
    environment:
      DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
      VAULT_MASTER_KEY: "your-64-hex-char-key"
      JWT_SECRET: "your-jwt-secret"
      VALKEY_URL: "redis://valkey:6379"
    depends_on:
      - postgres
      - valkey

  controller-admin:
    image: ghcr.io/netstacks/controller-admin:latest
    ports:
      - "3000:80"

volumes:
  valkey-data:
  postgres-data:
```

### Health Endpoint Response
```json
{
  "status": "healthy",
  "instance_id": "ctrl-a1b2c3d4",
  "uptime_secs": 86400,
  "valkey": "connected",
  "database": "connected",
  "cluster": {
    "registered_instances": 3,
    "healthy_instances": 3,
    "total_sessions": 42,
    "local_sessions": 14
  }
}
```

### Sentinel Configuration
```
# sentinel.conf
port 26379
sentinel monitor mymaster valkey-primary 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
```

### Multi-Instance Docker Compose
```yaml
services:
  controller-api-1:
    image: ghcr.io/netstacks/controller-api:latest
    environment:
      DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
      VAULT_MASTER_KEY: "your-64-hex-char-key"
      JWT_SECRET: "your-jwt-secret"
      VALKEY_URL: "redis://valkey:6379"
    depends_on:
      - postgres
      - valkey

  controller-api-2:
    image: ghcr.io/netstacks/controller-api:latest
    environment:
      DATABASE_URL: postgres://netstacks:secret@postgres:5432/netstacks
      VAULT_MASTER_KEY: "your-64-hex-char-key"
      JWT_SECRET: "your-jwt-secret"
      VALKEY_URL: "redis://valkey:6379"
    depends_on:
      - postgres
      - valkey

  nginx:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - controller-api-1
      - controller-api-2
```

### Environment Variables
| Variable | Description | Default |
|---|---|---|
| `VALKEY_URL` | Connection URL for the Valkey instance (e.g., `redis://valkey:6379`) | None (HA disabled) |
| `VALKEY_SENTINEL_URL` | Comma-separated Sentinel addresses for automated failover | None |
| `VALKEY_SENTINEL_MASTER_NAME` | Sentinel master name for primary discovery | `mymaster` |
| `HA_HEARTBEAT_INTERVAL_SECS` | How often each instance sends a heartbeat to Valkey, in seconds | `5` |
| `HA_STALE_INSTANCE_TTL_SECS` | Time after which a non-heartbeating instance is considered stale and removed, in seconds | `30` |
## Questions & Answers
- What is Valkey?
- Valkey is a Redis-compatible open-source in-memory data store, maintained by the Linux Foundation. It emerged after the Redis license change to provide a truly open-source alternative. NetStacks uses Valkey for session state coordination, instance registration, and connection brokering in HA deployments. Any Redis-compatible server should work, but Valkey is the recommended and tested option.
- How does session affinity work?
- Sessions are pinned to the Controller instance that created them. The session registry in Valkey maps each session ID to its owning instance. When a subsequent connection arrives (e.g., a viewer joining a shared session), the connection broker queries the registry and routes the connection to the owning instance via WebSocket proxy. This avoids the need for sticky sessions at the load balancer level.
- What happens during a failover?
- When a Controller instance goes down, its heartbeat in Valkey expires and it is removed from the cluster. Active sessions on the failed instance become temporarily unavailable. Clients automatically reconnect to the load balancer and are routed to a healthy instance, where new sessions are established. Session recordings and audit logs are preserved in PostgreSQL and are not affected by instance failures.
- Is Valkey required for all deployments?
- No. Valkey is only required for multi-instance HA deployments. Single-instance deployments work without Valkey — session state is held in-process and no cross-instance coordination is needed. Simply omit the `VALKEY_URL` environment variable to run in single-instance mode.
- What about the separate API and Admin UI images?
- Starting with v0.0.9, the NetStacks Controller ships as two Docker images: `controller-api` (the Rust API server) and `controller-admin` (Nginx serving the React admin dashboard). This separation enables independent scaling — you can run multiple API instances behind a load balancer for HA while serving the static Admin UI from a single container or CDN. The Admin UI communicates with the API exclusively through HTTP/WebSocket, so it works seamlessly with any number of API instances.
- Can I use Redis instead of Valkey?
- Yes. Valkey is wire-compatible with Redis, so any Redis 7+ instance will work. However, NetStacks officially tests and recommends Valkey due to its open-source license (BSD-3-Clause). Simply point `VALKEY_URL` at your Redis instance using the same `redis://` connection string.
## Troubleshooting
### Instance Not Registering in Cluster
- Verify `VALKEY_URL` is set correctly and the Valkey server is reachable from the Controller container: `redis-cli -h valkey -p 6379 ping`
- Check Controller logs for Valkey connection errors (e.g., connection refused, authentication failures, timeout).
- If using Docker networking, ensure the Controller and Valkey containers are on the same Docker network.
- Verify that Valkey is not running in protected mode, which blocks connections from non-loopback addresses by default.
### Stale Sessions in Valkey
- Stale session cleanup runs automatically based on the heartbeat TTL. If sessions persist after an instance goes down, wait for the `HA_STALE_INSTANCE_TTL_SECS` period to elapse (default: 30 seconds).
- Check that `HA_HEARTBEAT_INTERVAL_SECS` is shorter than the stale TTL. If the heartbeat interval is too close to or exceeds the TTL, instances may be prematurely marked stale.
- Manually inspect Valkey state using `redis-cli` to view registered instances and active sessions.
### Sentinel Failover Issues
- Verify Sentinel can reach all Valkey nodes (primary and replicas) on the configured ports.
- Confirm that `VALKEY_SENTINEL_MASTER_NAME` matches the master name in your Sentinel configuration exactly (case-sensitive).
- Ensure you have at least three Sentinel instances to achieve quorum for failover decisions.
- Check Sentinel logs for `+odown` and `+failover` events to verify failover is triggering correctly.
- After a failover, verify the Controller reconnected to the new primary by checking the health endpoint or Controller logs.
### Load Balancer Configuration
- The load balancer must support WebSocket upgrades. Ensure the `Upgrade` and `Connection` headers are passed through to backend instances.
- Sticky sessions at the load balancer level are not required — the connection broker handles session routing. However, enabling sticky sessions can reduce proxy hops for established connections.
- Configure health checks to poll `/api/v1/health` and remove instances that return non-200 responses.
- Set appropriate timeouts for WebSocket connections (idle timeout should be at least 300 seconds to avoid dropping long-running terminal sessions).
### Session Migration Not Working
- Verify all instances share the same `VAULT_MASTER_KEY` and `JWT_SECRET`. Mismatched secrets prevent session portability.
- Check that the failed instance has been removed from the registry (its heartbeat TTL has expired).
- Clients must reconnect through the load balancer for session migration to occur — direct connections to a specific instance bypass the connection broker.
## Related Features
- Session Sharing — Share terminal sessions across users. In HA mode, session sharing works cross-instance via the connection broker.
- System Settings — Configure platform-wide settings including feature toggles and license management.
- Audit Logs — Track administrative actions and session activity. Audit logs are stored in PostgreSQL and are preserved across instance failures.