NetStacks

Performance

Diagnose and resolve performance bottlenecks including slow database queries, connection pooling issues, high memory usage, UI responsiveness, and large device inventory scaling.

Overview

Performance issues in NetStacks can manifest as slow page loads, delayed API responses, sluggish terminal sessions, or high resource consumption on the Controller host. These issues generally fall into distinct categories with different diagnostic approaches.

Performance Issue Categories

  • Database Queries -- Slow or unoptimized queries against the PostgreSQL database, missing indexes, or connection pool exhaustion.
  • Connection Pooling -- Too many concurrent device connections overwhelming the Controller, or pool settings too conservative for the workload.
  • Memory Usage -- High memory consumption from large device inventories, accumulated session data, or background task queues.
  • UI Responsiveness -- Slow dashboard rendering, laggy terminal input, or delayed table loading in the web interface.
  • Large Device Inventories -- Scaling challenges when managing hundreds or thousands of devices with bulk operations.

Note

NetStacks is designed to manage up to several thousand devices on a single Controller instance with appropriate resource allocation. Performance tuning becomes important beyond approximately 500 managed devices.

How It Works

Understanding the NetStacks architecture helps identify where performance bottlenecks occur in the request path.

Request Flow and Bottleneck Points

  1. Client (Browser / Terminal App) -- Renders the UI, manages terminal sessions, sends API requests. Bottlenecks here affect only the local user (slow rendering, high local memory).
  2. Reverse Proxy (nginx / Caddy) -- Terminates TLS, routes requests, handles WebSocket upgrades for terminal sessions. Misconfigured proxy buffering or connection limits cause latency.
  3. Controller API -- Processes requests, manages business logic, coordinates device connections. CPU-bound operations (template rendering, encryption) and concurrent request handling are bottleneck risks.
  4. PostgreSQL Database -- Stores device inventory, credentials, audit logs, task history. Missing indexes, large table scans, and connection pool exhaustion are the most common performance issues.
  5. Device Connections -- SSH/Telnet sessions to managed devices. Slow device responses, high connection counts, and network latency affect bulk operations.

Tip

Most performance issues originate at the database layer (step 4). Start your diagnosis there unless symptoms clearly point elsewhere.
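The bottleneck points above can be narrowed down from the client side with curl's timing variables, which separate DNS, TLS handshake, and proxy time from Controller and database processing time. This is a diagnostic sketch: the URL and `/api/health` path are placeholders for your own NetStacks endpoint.

```shell
# Break down request latency by phase. A large gap between tls and ttfb
# points at the Controller/database; a large total with small ttfb points
# at response size or proxy buffering.
curl -sk -o /dev/null \
  -w 'dns=%{time_namelookup} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  https://netstacks.example.com/api/health
```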

Step-by-Step Guide

Follow this workflow to systematically identify and resolve performance issues.

Step 1: Check System Resources

Verify the Controller host has adequate CPU, memory, and disk I/O capacity.

System Resources (bash)
# Check CPU and memory usage
top -bn1 | head -20

# Check disk I/O
iostat -x 1 3

# Check available disk space (PostgreSQL needs free space for WAL)
df -h /var/lib/postgresql

Step 2: Check Database Metrics

Review PostgreSQL connection count, active queries, and table sizes.

Database Metrics (sql)
-- Check active connections vs max_connections
SELECT count(*) AS active,
  (SELECT setting FROM pg_settings WHERE name = 'max_connections') AS max
FROM pg_stat_activity;

-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;

Step 3: Check Connection Pool Stats

Review how many device connections are active and whether pools are exhausted.
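OS-level counts give a rough picture of device-connection load while you investigate. This is a sketch that assumes SSH-managed devices and a Linux Controller host; the `netstacks` database name is an assumption, so substitute your own.

```shell
# Count established outbound SSH sessions from the Controller to devices
ss -tn state established '( dport = :22 )' | tail -n +2 | wc -l

# Compare against database-side pool usage, broken down by backend state
psql -d netstacks -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
```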

Step 4: Identify Slow Queries

Use PostgreSQL's query statistics to find the slowest or most frequently executed queries.

Slow Query Identification (sql)
-- Top 10 slowest queries (requires pg_stat_statements extension)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Step 5: Apply Tuning

Based on findings, apply the appropriate tuning from the code examples section below.

Code Examples

Database Optimization

Database Optimization (sql)
-- Check for missing indexes on frequently queried columns
SELECT schemaname, relname, seq_scan, seq_tup_read,
  idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Analyze query performance with EXPLAIN
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM devices WHERE organization_id = 'org-uuid' AND status = 'active';

-- Update table statistics for the query planner
ANALYZE devices;
ANALYZE credentials;
ANALYZE audit_logs;

PostgreSQL Tuning for NetStacks

PostgreSQL Tuning (ini)
# postgresql.conf -- Recommended settings for NetStacks
# Adjust based on available system RAM

# Memory (for 8GB RAM system)
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 64MB
maintenance_work_mem = 512MB

# Connection handling
max_connections = 200
idle_in_transaction_session_timeout = 30000  # 30s

# Write performance
wal_buffers = 64MB
checkpoint_completion_target = 0.9

# Query planning
random_page_cost = 1.1  # For SSD storage
effective_io_concurrency = 200  # For SSD storage
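As an alternative to editing postgresql.conf by hand, these parameters can be applied with ALTER SYSTEM, which writes them to postgresql.auto.conf. This is a sketch to be run as a PostgreSQL superuser; reloadable parameters take effect on pg_reload_conf(), while shared_buffers and max_connections still require a restart.

```shell
# Persist settings to postgresql.auto.conf instead of editing postgresql.conf
psql -U postgres -c "ALTER SYSTEM SET work_mem = '64MB';"
psql -U postgres -c "ALTER SYSTEM SET random_page_cost = 1.1;"
psql -U postgres -c "SELECT pg_reload_conf();"  # applies reloadable parameters

# shared_buffers and max_connections need a full restart:
sudo systemctl restart postgresql
```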

Resource Monitoring

Resource Monitoring (bash)
# Monitor Controller process resource usage
# Find the Controller process
ps aux | grep netstacks

# Watch memory and CPU in real-time
top -p $(pgrep -f netstacks-controller)

# Check network connections to the Controller
ss -tnp | grep :443 | wc -l   # Active HTTPS connections
ss -tnp | grep :22 | wc -l    # Active SSH connections (if proxied)

Reverse Proxy Tuning

nginx Performance (nginx)
# nginx.conf -- Performance tuning for NetStacks
worker_processes auto;

events {
    worker_connections 2048;
}

http {
    # Increase timeouts for long-lived WebSocket connections
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;

    # Buffer settings for API responses
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;

    # Enable gzip for API responses
    gzip on;
    gzip_types application/json text/plain;
    gzip_min_length 1000;
}
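After changing the proxy configuration, validate the syntax before reloading so a typo does not drop active terminal sessions. These are standard nginx commands; reload keeps existing connections alive.

```shell
# Validate configuration, then reload without dropping connections
sudo nginx -t && sudo nginx -s reload
```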

Q&A

Q: Why is the dashboard loading slowly?
A: Slow dashboard loads are usually caused by expensive database queries that aggregate device status, recent activity, and alert counts. Check PostgreSQL for long-running queries during dashboard load. Ensure indexes exist on commonly filtered columns like organization_id, status, and updated_at. For large inventories (500+ devices), the dashboard may benefit from caching -- check that the Controller's built-in response caching is enabled.
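For the columns mentioned above, index creation might look like the following sketch. The table and column names come from the answer but should be verified against the live schema; CREATE INDEX CONCURRENTLY avoids locking writes on a busy table, at the cost of a slower build.

```shell
# Composite index for the common dashboard filter (org + status)
psql -d netstacks -c \
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_devices_org_status
   ON devices (organization_id, status);"

# Index for recent-activity sorting
psql -d netstacks -c \
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_devices_updated_at
   ON devices (updated_at);"
```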
Q: How do I optimize database performance?
A: Start by running ANALYZE on all tables to update query planner statistics. Check for missing indexes using the pg_stat_user_tables view -- tables with high seq_scan counts and low idx_scan counts likely need indexes. Tune PostgreSQL memory settings based on your system RAM (shared_buffers should be roughly 25% of total RAM). Enable the pg_stat_statements extension to identify the slowest queries over time.
Q: What are the recommended connection pool settings?
A: The default connection pool settings work well for deployments up to about 200 devices. For larger inventories, increase the maximum pool size proportionally. A good rule of thumb is max_connections in PostgreSQL should be at least 2x the expected concurrent API users plus a buffer for background tasks. If using a connection pooler like PgBouncer, set it to transaction mode with a pool size of 50-100 for most deployments.
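The rule of thumb above can be expressed as a quick calculation. The variable values are illustrative, not defaults; plug in your own expected load.

```shell
# max_connections >= 2 x expected concurrent API users + background-task buffer
concurrent_users=60      # illustrative: peak concurrent API users
background_buffer=40     # headroom for scheduled jobs and workers
recommended=$(( 2 * concurrent_users + background_buffer ))
echo "recommended max_connections: ${recommended}"   # prints 160
```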
Q: How do I monitor NetStacks resource usage?
A: Monitor the Controller process with standard system tools: top or htop for CPU and memory, iostat for disk I/O, and ss for network connections. For PostgreSQL, monitor active connections, query durations, and table sizes through pg_stat_activity and pg_stat_user_tables. Set up alerts for when CPU exceeds 80% sustained, memory exceeds 90%, or database connections approach the maximum.
Q: Why do bulk operations take a long time?
A: Bulk operations (device discovery, mass configuration pushes, inventory syncs) are inherently time-consuming because they involve connecting to many devices sequentially or with limited concurrency. The Controller limits concurrent device connections to avoid overwhelming the network. You can adjust the concurrency limit in admin settings, but setting it too high can cause connection failures on devices with limited VTY lines. For bulk config pushes, consider using stack deployments which optimize the execution order.
Q: How do I tune performance for large device inventories (1000+)?
A: For 1000+ devices: increase PostgreSQL shared_buffers to 4GB+ and max_connections to 300+, ensure the Controller host has at least 8GB RAM and 4 CPU cores, use SSD storage for the database, enable pagination on all device list views, schedule bulk operations during off-peak hours, and consider partitioning the audit_logs table by date if it grows very large. Review the step-by-step guide above for systematic tuning.
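Partitioning the audit table by date could look like the following sketch. The schema is illustrative only -- it assumes a created_at timestamp column and the `netstacks` database name -- so adapt it to the actual NetStacks schema (and plan a data migration) before use.

```shell
psql -d netstacks <<'SQL'
-- Illustrative only: range-partition an audit table by month
CREATE TABLE audit_logs_partitioned (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  device_id  uuid,
  action     text,
  created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_logs_2025_01 PARTITION OF audit_logs_partitioned
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
SQL
```

Old partitions can then be detached and dropped wholesale, which is far cheaper than DELETE-based retention on one large table.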
Q: Why does my terminal session feel laggy?
A: Terminal lag can be caused by network latency to the device, high CPU usage on the Controller (especially with AI features analyzing output in real-time), or browser rendering overhead with very large scrollback buffers. Check the network round-trip time to the device first. If using AI features, they run asynchronously and should not block input -- but if the Controller CPU is saturated, all operations slow down. Reduce the scrollback buffer size in terminal settings if rendering is slow.

Troubleshooting

Use this decision tree to match symptoms to diagnostic commands and fixes.

| Symptom | Diagnostic Command | Likely Cause | Fix |
| --- | --- | --- | --- |
| Dashboard loads in 5+ seconds | SELECT * FROM pg_stat_activity WHERE state = 'active' | Slow aggregation queries, missing indexes | Add indexes, run ANALYZE, enable caching |
| API returns 504 Gateway Timeout | tail -f /var/log/nginx/error.log | Controller not responding in time | Increase proxy_read_timeout, check Controller health |
| Controller memory above 80% | top -p $(pgrep -f netstacks) | Too many concurrent sessions or large data sets in memory | Increase host RAM, reduce concurrent connection limit |
| Database connections exhausted | SELECT count(*) FROM pg_stat_activity | Connection pool leak or too many concurrent users | Increase max_connections, check for idle transactions |
| Bulk operations stall partway | Check Controller logs for connection errors | Device connection limits reached, network congestion | Reduce concurrency, stagger operations by device group |
| Terminal input lag (200ms+) | ping target-device | Network latency or Controller CPU saturation | Check RTT, reduce AI processing load, check CPU |
| Web UI freezes on large tables | Browser DevTools Performance tab | Rendering too many DOM elements | Enable pagination, reduce items per page |
| Disk space filling up | du -sh /var/lib/postgresql/data/* | Audit logs or WAL files growing unbounded | Configure log retention, check WAL archiving |
Warning

Always take a database backup before applying PostgreSQL configuration changes. Some settings require a database restart to take effect.