NetStacks

Profiling Agents

Enterprise

Persistent AI agents that own devices via gNMI telemetry: behavioral baselines, adaptive anomaly detection, the awareness cycle, agent mesh, and per-agent chat.

Overview

Profiling Agents is a first-party NetStacks plugin (a Rust service) that runs persistent AI agents which own network devices. Each agent subscribes to a device's streaming gNMI telemetry, distills the raw stream into a compact health picture, learns a multi-day behavioral baseline, then continuously watches for anomalies on its own awareness cycle. Agents form a mesh, elect a coordinator, and exchange messages. Each agent also exposes a chat endpoint so an engineer can ask "what is going on with this device?" in plain language.

The plugin combines five capabilities into one persistent operator per group of devices:

  • Telemetry ownership — a long-lived gNMI subscription per assigned device, debounced and persisted as live state.
  • Behavioral baselines — a multi-day learning phase that compiles a structured profile of "normal" for each device.
  • Adaptive anomaly detection — an LLM-driven awareness cycle plus bootstrap watch rules that flag deviations from the baseline.
  • Agent mesh — coordinator election and rate-limited inter-agent messaging so agents can correlate across the fleet.
  • Per-agent chat — a conversational interface grounded in the agent's live device states, anomalies, and watch rules.
Enterprise tier

Profiling Agents is a Controller-side plugin and is part of the Enterprise tier. It runs as a Docker container managed by the Controller, declares the capabilities read_devices, read_credentials, invoke_llm, and execute_gnmi, and stores its data in the isolated plugin_profiling_agents Postgres schema. See Plugin System for how plugins install, enable, and proxy.

gNMI-native devices required

Profiling Agents owns devices over gNMI (gRPC Network Management Interface), not SSH or SNMP. Assigned devices must expose a gNMI target — the plugin connects to gNMI port 6030 by default (Arista EOS-style). Devices without a reachable gNMI endpoint will fail to subscribe and will not enter baseline learning.

How It Works

The service is built on Axum (Rust). On startup it connects to Postgres, builds an SDK client that talks to the Controller over its internal API, creates the gNMI collector, and then resumes every agent marked active — reopening their gNMI subscriptions. It runs a coordinator election, and finally spins up five always-on background engines. The plugin listens on port 8080 and exposes a /health endpoint the Controller polls every 30 seconds.

An agent moves a device through a clear lifecycle. When you assign a device and activate the agent, the collector opens a gNMI subscription, fetches the device config once to build a config digest, and starts the baseline. From there the device flows pending → learning → ready → profiled as the engines do their work.

agent-lifecycle.txt
Assign device  ->  Activate agent
       |                |
       v                v
  gNMI subscribe   fetch config (digest)  start_baseline()
       |                                       |
       v                                       v
  telemetry stream  --(5s flush)-->  device_states.structured_state
       |
       v
  Distiller (5s)   ->  state_summary + health_score
       |
       v
  Learner (hourly, 7 days)  ->  baseline_status: learning -> ready
       |
       v
  Compiler (hourly)         ->  baseline_status: ready -> profiled
       |
       v
  Awareness cycle (per agent interval)  ->  anomalies + notifications
       |
       v
  Refiner (weekly)          ->  keep the profile current

Data Model

All tables live in the plugin_profiling_agents schema. The core entities:

agents
An agent definition: name, system_prompt, provider, the chat model and a separate cycle_model for background work, temperature (default 0.7), max_iterations (default 15), active, proactive_enabled/proactive_channels, and awareness_interval_mins (default 15).
device_assignments
Maps a device to an agent, optionally pinning a subscription_config_id. Unique per (agent_id, device_id).
subscription_configs
Named sets of gNMI subscription paths. One is flagged is_default and is used when an assignment does not pin a specific config.
device_states
The live, one-row-per-device picture: structured_state (the accumulated telemetry that the distiller parses), state_summary, health_score, config_digest, behavioral_profile, and baseline_status.
state_history
Time-series snapshots used by the learner and weekly refiner.
watch_rules, anomaly_events, conversations
Detection rules, detected anomalies, and persistent per-user chat history, respectively.
agent_mesh, agent_messages, notification_log
The mesh hierarchy, inter-agent message queue, and notification delivery audit.

The Background Engines

Five background tasks run for the lifetime of the service. Their startup delays and cadences are fixed in the service:

| Engine | Starts after | Cadence | What it does |
| --- | --- | --- | --- |
| Distillation | immediately | every 5s | Parses raw telemetry into a summary + health score per device. |
| Snapshotter | immediately | periodic | Writes state_history rows for trend analysis. |
| Learner | 60s | hourly | Analyzes 24h of history per learning-phase device, builds daily observations over a 7-day window. |
| Compiler / Refiner | 120s | hourly | Compiles ready devices into a structured profile; refines profiled devices weekly. |
| Awareness cycle | 180s | checks every 60s | Runs a per-agent LLM review on each agent's awareness_interval_mins schedule. |

gNMI Telemetry & Distillation

When an agent activates, the collector resolves each assigned device's host and default credential through the Controller SDK, then opens a gNMI subscription on port 6030 with the paths from the device's subscription config. Streamed updates are accumulated in memory and flushed every 5 seconds into device_states.structured_state — this debouncing keeps a chatty on_change stream from hammering the database.
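
The debouncing described above can be sketched in a few lines. This is a minimal illustration, not the plugin's Rust collector: the class name TelemetryBuffer, the flush callback, and the injectable clock are all hypothetical, and only the 5-second interval comes from the text.

```python
import time
from typing import Callable, Dict


class TelemetryBuffer:
    """Accumulates streamed updates in memory and writes them out on an interval."""

    def __init__(self,
                 flush: Callable[[str, Dict[str, object]], None],
                 interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush              # hypothetical sink, e.g. a database upsert
        self._interval_s = interval_s
        self._clock = clock              # injectable so the sketch is testable
        self._pending: Dict[str, Dict[str, object]] = {}
        self._last_flush = clock()

    def on_update(self, device_id: str, path: str, value: object) -> None:
        # A later value for the same path overwrites the earlier one, so a
        # burst of on_change updates collapses into one write per device.
        self._pending.setdefault(device_id, {})[path] = value
        self.maybe_flush()

    def maybe_flush(self) -> int:
        """Flush if the interval has elapsed; return how many devices were written."""
        now = self._clock()
        if now - self._last_flush < self._interval_s or not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        self._last_flush = now
        for device_id, state in batch.items():
            self._flush(device_id, state)
        return len(batch)
```

However many updates arrive inside one interval, the sink sees at most one write per device per flush, which is the property the paragraph above relies on.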

The distillation engine then runs every 5 seconds: for every active agent it reads each device's structured state, parses it into interfaces, BGP peers, and system metrics, compiles a human/AI-readable state_summary, computes a health_score, and writes both back. The summary looks like:

state_summary.txt
Device: edge-rtr-01 (eos, router)
System: CPU 34% | Memory 38% (6.1/16.0 GB) | Temp 38°C

Interfaces (12 up, 1 down, 2 admin-down)
  DOWN: Ethernet7
  HIGH-UTIL: Ethernet1 — 78% in, 12% out

BGP (4/5 peers established)
  DOWN: 10.0.0.5 (AS65010)
  Total prefixes: 847,291

Subscription Configs

A subscription config is a named list of OpenConfig paths, each with a mode (on_change or sample) and, for sampled paths, an interval_ms. The plugin ships one default config named "Standard Router", flagged is_default:

standard-router.paths.json
[
  { "path": "/interfaces/interface/state", "mode": "on_change" },
  { "path": "/interfaces/interface/state/counters", "mode": "sample", "interval_ms": 30000 },
  { "path": "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state", "mode": "on_change" },
  { "path": "/system/state", "mode": "sample", "interval_ms": 60000 },
  { "path": "/system/memory/state", "mode": "sample", "interval_ms": 60000 }
]

When you assign a device you can pin a specific subscription_config_id; otherwise the default config is used. If a pinned config is missing at subscribe time the collector falls back to the default and logs a warning.
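
The pin-or-default rule, including the fallback, might look like the sketch below. The function name and the dictionary shape (id, name, is_default) are assumptions chosen for illustration; the source only states the behavior.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("profiling_agents.example")

Config = Dict[str, object]


def resolve_subscription_config(configs: List[Config],
                                pinned_id: Optional[str]) -> Config:
    """Return the pinned config when it exists, otherwise the default one."""
    by_id = {c["id"]: c for c in configs}
    if pinned_id is not None:
        if pinned_id in by_id:
            return by_id[pinned_id]
        # Pinned config is gone: warn and fall through to the default.
        logger.warning("pinned config %s is missing, using default", pinned_id)
    for c in configs:
        if c.get("is_default"):
            return c
    # What happens when no default exists is not covered by the text;
    # raising here is an assumption of this sketch.
    raise LookupError("no default subscription config defined")
```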

Tailor paths to the platform

The default paths are OpenConfig models that map cleanly to Arista EOS and other gNMI-native NOSes. Build a second config (for example a switch-oriented set without BGP, or one that adds /components for optics/temperature) and pin it on assignment for those device types.

Health Score

The distiller computes a health_score in the range 0.0–1.0, starting at 1.0 and subtracting penalties. This is a deterministic, baseline-independent signal — the chat context treats devices with health ≥ 0.8 as healthy and gives full detail for anything below 0.8.

| Condition | Penalty |
| --- | --- |
| CPU > 90% / > 80% / > 70% | −0.30 / −0.15 / −0.05 |
| Memory > 95% / > 90% / > 85% | −0.30 / −0.15 / −0.05 |
| Each interface down (not admin-down) | −0.10 |
| Each BGP peer not Established | −0.15 |

The result is floored at 0.0 and capped at 1.0. For example, a router at 84% CPU with one non-admin interface down scores 1.0 − 0.15 − 0.10 = 0.75 — below the 0.8 threshold, so it is surfaced with full detail to the agent.
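
The penalty table translates into a short function. This is a sketch under the assumption that only the most severe matching CPU or memory tier applies (which is what the 84% CPU example implies). The function and constant names are illustrative, and rounding to two decimals is added here only to keep floating-point noise out of the result.

```python
from typing import Sequence, Tuple

# (threshold, penalty) pairs, checked from the most severe tier down.
CPU_TIERS: Sequence[Tuple[float, float]] = ((90.0, 0.30), (80.0, 0.15), (70.0, 0.05))
MEM_TIERS: Sequence[Tuple[float, float]] = ((95.0, 0.30), (90.0, 0.15), (85.0, 0.05))


def _tier_penalty(value: float, tiers: Sequence[Tuple[float, float]]) -> float:
    """Return the penalty of the first (most severe) tier the value exceeds."""
    for threshold, penalty in tiers:
        if value > threshold:
            return penalty
    return 0.0


def health_score(cpu_pct: float, mem_pct: float,
                 interfaces_down: int, bgp_peers_down: int) -> float:
    """Start at 1.0, subtract penalties, clamp to the range 0.0 to 1.0."""
    score = 1.0
    score -= _tier_penalty(cpu_pct, CPU_TIERS)
    score -= _tier_penalty(mem_pct, MEM_TIERS)
    score -= 0.10 * interfaces_down   # operationally down, not admin-down
    score -= 0.15 * bgp_peers_down    # peers not in Established
    return round(min(1.0, max(0.0, score)), 2)


if __name__ == "__main__":
    # The worked example from the text: 84% CPU, one interface down.
    print(health_score(84.0, 38.0, 1, 0))  # 0.75
```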

Behavioral Baselines

A health score tells you a device is unwell right now; a behavioral baseline tells you what normal looks like for this device. Building it is a multi-day process in two stages, learning and compilation, driven by the learner and compiler; a third stage, weekly refinement, then keeps the profile current.

Stage 1 — Learning (default 7 days)

When a device first subscribes, baseline_status moves to learning and baseline_started_at is stamped. Each hour the learner pulls the last 24h of state_history, samples ~20 snapshots evenly across the window, and asks the agent's cycle_model (default claude-haiku-4-5-20251001) to summarize the day's patterns — normal operating ranges, time-of-day patterns, recurring cycles, and stability. Each day's 2–4 sentence summary is appended to behavioral_profile:

behavioral_profile (learning)
--- Day 1 ---
CPU held 25-40% with a brief 65% spike at 02:10 (likely a backup job). Ethernet1
ran 60-80% inbound during business hours, near-idle overnight. All 4 BGP peers
stayed Established. Temperatures stable around 38C.

--- Day 2 ---
Pattern repeated: 02:00-02:30 CPU spike, daytime utilization on Ethernet1/Ethernet2.
No interface flaps. Memory crept from 36% to 39% over the day.

Once baseline_days have elapsed the learner flips the device to ready.
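
The "~20 snapshots evenly across the window" step in the learning stage can be illustrated as follows. The source does not say how the learner picks its samples, so this is one plausible approach: the helper name sample_evenly and the index-based spacing are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_evenly(snapshots: Sequence[T], target: int = 20) -> List[T]:
    """Pick up to `target` items spread evenly across a time-ordered sequence."""
    n = len(snapshots)
    if n <= target:
        return list(snapshots)            # nothing to thin out
    if target < 2:
        return list(snapshots[:target])   # degenerate request
    # Spread `target` indices from the first to the last snapshot inclusive.
    step = (n - 1) / (target - 1)
    return [snapshots[round(i * step)] for i in range(target)]
```

Including both the first and the last snapshot keeps the start and end of the 24-hour window represented, which matters for time-of-day patterns.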

Stage 2 — Compilation

The compiler picks up ready devices hourly and synthesizes the accumulated daily observations into one structured profile with fixed sections: Traffic Patterns, System Resources, BGP Stability, Neighbor Relationships, Notable Patterns, and a critical Never Seen In Baseline section listing conditions that would be anomalous (for example "CPU > 80%", "BGP peer down > 5min"). The device becomes profiled and baseline_completed_at is set.

Stage 3 — Refinement (weekly)

For profiled devices whose last_profile_refinement is null or older than 7 days, the refiner feeds the last 168 hours of history plus the current profile back to the model and returns the complete updated profile — keeping still-valid information, adding new patterns, and updating thresholds. This is how the baseline stays adaptive as the network evolves.
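
Taken together, the three stages behave like a small state machine keyed on baseline_status. The sketch below shows the decision an hourly pass might make per device. The function name next_step and the action labels it returns are hypothetical; the 7-day values come from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

REFINE_AFTER = timedelta(days=7)  # refinement cadence stated in the text


def next_step(status: str,
              baseline_started_at: Optional[datetime],
              last_profile_refinement: Optional[datetime],
              now: datetime,
              baseline_days: int = 7) -> str:
    """Return the action an hourly engine pass would take for one device."""
    if status == "learning" and baseline_started_at is not None:
        elapsed = now - baseline_started_at
        return "mark_ready" if elapsed >= timedelta(days=baseline_days) else "keep_learning"
    if status == "ready":
        return "compile"          # compiler turns observations into a profile
    if status == "profiled":
        due = (last_profile_refinement is None
               or now - last_profile_refinement > REFINE_AFTER)
        return "refine" if due else "idle"
    return "idle"                 # e.g. pending, before the first subscription
```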

Watch Rules

Alongside the learned profile, every agent is seeded with five bootstrap watch rules the first time it activates (if it has none). These are deterministic guardrails the awareness cycle considers each run. Three rule types are supported: threshold, rate_change, and state_transition.

bootstrap-watch-rules
CPU Over 80%          threshold        {"metric":"cpu_percent","threshold":80.0,"operator":"gt"}
Memory Over 90%       threshold        {"metric":"memory_percent","threshold":90.0,"operator":"gt"}
Interface Down        state_transition {"metric":"interface_status","from":"up","to":"down"}
BGP Peer Down         state_transition {"metric":"bgp_peer_status","from":"established","to":"down"}
Error Rate 5x Increase rate_change     {"metric":"error_rate","multiplier":5.0,"window_minutes":5}

Rules carry a created_by tag (bootstrap for seeded rules) and can be temporarily suppressed (suppressed_until with a suppression_reason) — useful during a known maintenance window so the agent does not raise noise. Rule names are unique per agent.
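
To make the three rule types concrete, here is a hedged sketch of how such rules could be evaluated against a previous and a current sample. The dictionary keys rule_type and config, the "lt" operator, and the way window_minutes is handled (the two samples are assumed to be one window apart) are illustrative assumptions; only the config fields shown in the bootstrap list come from the source.

```python
from datetime import datetime
from typing import Any, Dict


def rule_is_suppressed(rule: Dict[str, Any], now: datetime) -> bool:
    """A rule is suppressed while suppressed_until lies in the future."""
    until = rule.get("suppressed_until")
    return until is not None and now < until


def rule_fires(rule: Dict[str, Any], previous: Any, current: Any) -> bool:
    """Evaluate one rule against the previous and current metric values."""
    kind, cfg = rule["rule_type"], rule["config"]
    if kind == "threshold":
        # Only "gt" appears in the bootstrap rules; "lt" is an assumption.
        ops = {"gt": lambda a, b: a > b, "lt": lambda a, b: a < b}
        return ops[cfg["operator"]](current, cfg["threshold"])
    if kind == "state_transition":
        return (str(previous).lower() == cfg["from"]
                and str(current).lower() == cfg["to"])
    if kind == "rate_change":
        # previous/current are rates assumed to be window_minutes apart.
        return previous > 0 and current >= previous * cfg["multiplier"]
    raise ValueError(f"unknown rule type: {kind}")
```

For example, the bootstrap "CPU Over 80%" rule fires for a current value of 84 and stays quiet at exactly 80, because the operator is a strict greater-than.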

The Awareness Cycle

The awareness cycle is the agent's heartbeat. The engine wakes every 60 seconds, but an individual agent only runs a full cycle once its awareness_interval_mins (default 15) has elapsed since its last awareness_cycle event. When it runs, the agent assembles a focused prompt from its own state:

  • The current device states (health, baseline status, summary) for up to 20 devices.
  • Its watch rules and their status (active / disabled / suppressed).
  • Anomalies from the last 24 hours (excluding prior cycle meta-events).
  • Any pending inter-agent messages (which are then marked processed).

It asks the cycle_model to respond with strict JSON: an overall status (nominal, watching, or concerned), a list of observations, a list of detected anomalies, and a rule_changes list (empty in this example):

awareness-response.json
{
  "status": "concerned",
  "observations": [
    "edge-rtr-01 health dropped to 0.70 after Ethernet7 went down",
    "BGP peer 10.0.0.5 has been flapping for the last hour"
  ],
  "anomalies": [
    {
      "device_id": "8f1c...e2",
      "severity": "warning",
      "description": "Ethernet7 down — not seen in baseline; expected always-up"
    }
  ],
  "rule_changes": []
}

Each detected anomaly is written to anomaly_events (with source = awareness_cycle) and triggers a proactive notification. The cycle itself records an info meta-event of type awareness_cycle so the next run can compute the interval. Anomaly severities are constrained to critical, warning, or info.
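
Because the model is asked for strict JSON, a consumer has to validate the reply before writing anything to anomaly_events. The following is a minimal sketch of such a check. The function name is hypothetical, and coercing an unknown severity to info is an assumption of this example, not documented behavior.

```python
import json
from typing import Any, Dict, List

VALID_STATUS = {"nominal", "watching", "concerned"}
VALID_SEVERITY = {"critical", "warning", "info"}


def parse_awareness_response(raw: str) -> Dict[str, Any]:
    """Parse and validate the JSON reply of one awareness cycle."""
    data = json.loads(raw)
    if not isinstance(data, dict) or data.get("status") not in VALID_STATUS:
        raise ValueError("awareness response has a missing or unknown status")
    anomalies: List[Dict[str, Any]] = []
    for item in data.get("anomalies", []):
        severity = item.get("severity")
        if severity not in VALID_SEVERITY:
            severity = "info"   # assumption: downgrade rather than drop
        anomalies.append({
            "device_id": item.get("device_id"),
            "severity": severity,
            "description": item.get("description", ""),
            "source": "awareness_cycle",
        })
    return {
        "status": data["status"],
        "observations": [str(o) for o in data.get("observations", [])],
        "anomalies": anomalies,
    }
```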

Two detection paths, one record

Anomalies can arrive from the LLM awareness cycle or from deterministic watch rules; both land in anomaly_events with a severity, anomaly_type, and description. Unacknowledged anomalies are what the chat context and UI surface first. Acknowledge one to clear it from the active list.

Proactive Notifications

When the awareness cycle detects an anomaly and the agent has proactive_enabled = true, the plugin delivers a notification through the channels in proactive_channels. Delivery is gated by a per-device, per-source/severity cooldown (60 minutes for awareness-cycle anomalies) so a persistent condition does not spam you. Three channels are supported:

  • in_app — always logged to notification_log when enabled.
  • webhook — a JSON POST to your webhook_url (10s timeout, best-effort).
  • email — sent via the Controller's internal SMTP endpoint to the listed addresses.
proactive_channels
{
  "in_app": true,
  "email": ["noc@example.com"],
  "webhook_url": "https://hooks.example.com/netstacks-agents"
}

Every delivery attempt — success or failure — is recorded in notification_log, queryable per agent.
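
The cooldown gate is easiest to see in code. This sketch keeps the last send time per (device, source, severity) key in memory. The class name and the in-memory dictionary are assumptions made for illustration; the source does not say where the plugin tracks this state.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (device_id, source, severity)


class NotificationCooldown:
    """Allows one notification per key per cooldown window."""

    def __init__(self, window: timedelta = timedelta(minutes=60)) -> None:
        self._window = window
        self._last_sent: Dict[Key, datetime] = {}

    def should_notify(self, device_id: str, source: str,
                      severity: str, now: datetime) -> bool:
        key = (device_id, source, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self._window:
            return False              # still cooling down for this key
        self._last_sent[key] = now    # record the send and allow it
        return True
```

A different severity for the same device uses a different key, so an escalation from warning to critical is not held back by the warning's cooldown.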

Agent Mesh & Coordinator Election

Agents do not operate in isolation. They form a mesh with a single elected coordinator. Election runs automatically on startup and after any agent is activated or deactivated, and can also be triggered on demand. The current algorithm scores each active agent by its device count (a topology-aware score is planned), clears existing coordinators, elects the top scorer as coordinator (no parent), and attaches every other active agent to it as a child in agent_mesh.
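
The device-count election reduces to picking a maximum and pointing everyone else at it. The sketch below assumes a tie-break on the lowest agent id, which the source does not specify. The function name and return shape are illustrative.

```python
from typing import Dict, Optional, Tuple


def elect_coordinator(
    device_counts: Dict[str, int],
) -> Tuple[Optional[str], Dict[str, Optional[str]]]:
    """Return (coordinator_id, parent_by_agent) for the active agents."""
    if not device_counts:
        return None, {}               # no active agents, no coordinator
    # Highest device count wins. Ties go to the lowest agent id, an
    # assumption made here so the result is deterministic.
    coordinator = max(sorted(device_counts), key=lambda agent: device_counts[agent])
    parents = {
        agent: (None if agent == coordinator else coordinator)
        for agent in device_counts
    }
    return coordinator, parents
```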

Agents exchange messages through the agent_messages queue. A message has a message_type of query, alert, correlation, or directive; a priority of normal or urgent; and may expect a reply. Direct messages target one agent; broadcasts set no recipient. Both direct and broadcast sends are rate-limited to 10 messages per minute per agent, and coordinator-restricted broadcasts are rejected unless the sender is the current coordinator.

Pending messages for an agent are injected into its next awareness-cycle prompt under an INCOMING MESSAGES heading and then marked processed — this is how one agent's observation ("peer 10.0.0.5 flapping") can inform another agent that owns the device on the far side of that link.
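
A limit of 10 messages per minute per agent can be enforced in several ways. The sketch below uses a sliding window over send timestamps, which is an assumption: the source states the limit but not the algorithm. The class name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class MessageRateLimiter:
    """Sliding-window limiter: at most `limit` sends per `window_s` per agent."""

    def __init__(self, limit: int = 10, window_s: float = 60.0) -> None:
        self._limit = limit
        self._window_s = window_s
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, agent_id: str, now: float) -> bool:
        sent = self._sent[agent_id]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self._window_s:
            sent.popleft()
        if len(sent) >= self._limit:
            return False
        sent.append(now)
        return True
```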

mesh-api.http
# Inspect the current mesh and coordinator
GET  /admin/mesh

# Force a re-election (e.g. after a topology change)
POST /admin/mesh/elect
# -> {"coordinator_id":"a1b2...c3"}   (or null if no active agents)

# Recent inter-agent messages (optionally filtered)
GET  /admin/messages?agent_id=<uuid>&hours=24

Per-Agent Chat

Each agent exposes a chat endpoint so you can talk to the operator that owns a set of devices. Chat is grounded: before calling the LLM, the handler assembles a system prompt from the agent's identity (system_prompt or a default), its current device states (healthy devices condensed to one line, unhealthy devices in full detail), unacknowledged anomalies, and a watch-rule summary. The chat model is the agent's model (distinct from the cheaper cycle_model used for background work).

Conversations are persistent and scoped per (agent, user) in the conversations table; history is trimmed to the last 50 messages to stay within context. The agent must be active to chat (an inactive agent returns 403).
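
The grounding and trimming steps can be illustrated with two small helpers. This is a sketch only: the field names name, health_score, and state_summary mirror the data model, while the function names and the exact one-line format are assumptions.

```python
from typing import Any, Dict, List

HEALTHY_AT = 0.8    # threshold from the Health Score section
HISTORY_LIMIT = 50  # messages kept per (agent, user) conversation


def device_context_lines(devices: List[Dict[str, Any]]) -> List[str]:
    """Condense healthy devices to one line, keep full detail for the rest."""
    lines = []
    for d in devices:
        if d["health_score"] >= HEALTHY_AT:
            lines.append(f"{d['name']}: healthy ({d['health_score']:.2f})")
        else:
            lines.append(
                f"{d['name']}: health {d['health_score']:.2f}\n{d['state_summary']}"
            )
    return lines


def trim_history(messages: List[Dict[str, str]],
                 keep: int = HISTORY_LIMIT) -> List[Dict[str, str]]:
    """Keep only the most recent `keep` messages (keep must be positive)."""
    return messages[-keep:]
```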

agent-chat.http
# Load context for the chat UI (agent, device states, anomalies, rules, history)
GET  /admin/agents/<agent_id>/chat?user_id=<uuid>

# Send a message
POST /admin/agents/<agent_id>/chat
Content-Type: application/json

{
  "message": "Why did edge-rtr-01 health drop this afternoon?",
  "user_id": "11111111-1111-1111-1111-111111111111"
}

# Response
{ "response": "Ethernet7 went oper-down at 14:12 UTC ...", "tool_calls_made": 0 }

Tooling in the system prompt

The chat system prompt advertises agent tools — query_device_state, query_state_history, create_anomaly, create_watch_rule, suppress_watch_rule, and send_proactive_message — so the model reasons in those terms. The chat response returns tool_calls_made for observability.

Plugin Settings

The plugin's manifest.json exposes a settings schema. The Controller renders these in the admin UI and the defaults are what ship out of the box:

| Setting | Default | Range | Purpose |
| --- | --- | --- | --- |
| default_gnmi_port | 6030 | | Default gNMI port for device connections. |
| telemetry_flush_interval_seconds | 5 | 1–60 | How often accumulated telemetry is written to state. |
| awareness_default_model | claude-haiku-4-5-20251001 | | Default model for background awareness/baseline work. |
| baseline_learning_days | 7 | 1–30 | Length of the learning phase before compilation. |

Two models per agent

Background engines (learner, compiler, awareness) default to claude-haiku-4-5-20251001 — cheap and fast because they run constantly. Interactive chat uses the agent's richer model. You can override the background model per agent via cycle_model.

API Reference

The plugin's endpoints are reached through the Controller reverse proxy at /api/plugins/profiling-agents/* (see Plugin System). Internally the service routes them under /admin/*. The full surface:

profiling-agents-api.http
# Agents
GET    /admin/agents                         # list
POST   /admin/agents                         # create
GET    /admin/agents/:id                      # get
DELETE /admin/agents/:id                      # delete (stops subscriptions)
POST   /admin/agents/:id/activate             # activate + start gNMI + re-elect
POST   /admin/agents/:id/deactivate           # deactivate + stop gNMI + re-elect

# Device assignment
GET    /admin/agents/:id/devices              # list assigned devices
POST   /admin/agents/:id/devices              # assign a device
DELETE /admin/agents/:id/devices/:device_id   # unassign

# Chat
GET    /admin/agents/:id/chat                 # chat context
POST   /admin/agents/:id/chat                 # send a message

# Anomalies & notifications
GET    /admin/agents/:id/anomalies            # ?hours=24
POST   /admin/agents/:id/anomalies/:anomaly_id/acknowledge
GET    /admin/agents/:id/notifications        # delivery log

# Mesh & messaging
GET    /admin/mesh
POST   /admin/mesh/elect
GET    /admin/messages                        # ?agent_id=<uuid>&hours=24

# Health (polled by the Controller)
GET    /health

Acknowledge vs. delete

Acknowledging an anomaly only flips its acknowledged flag — it stays in history for audit. Deleting an agent cascades: its subscriptions are stopped and all its assignments, device states, anomalies, conversations, watch rules, and mesh entries are removed.

Worked Examples

The examples below call the plugin through the Controller proxy. Replace netstacks.example.com and the bearer token with your own. Requests are relayed to the running plugin container.

1. Create an agent

curl -X POST https://netstacks.example.com/api/plugins/profiling-agents/admin/agents \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Edge Fabric Operator",
        "description": "Owns the edge routers in DC1",
        "system_prompt": "You are a senior NOC engineer responsible for the DC1 edge routers.",
        "provider": "anthropic",
        "model": "claude-sonnet-4-5",
        "cycle_model": "claude-haiku-4-5-20251001",
        "awareness_interval_mins": 15,
        "proactive_enabled": true,
        "proactive_channels": { "in_app": true, "email": ["noc@example.com"] }
      }'

2. Assign a device, then activate

# Assign (omit subscription_config_id to use the default "Standard Router" config)
curl -X POST https://netstacks.example.com/api/plugins/profiling-agents/admin/agents/$AGENT/devices \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{ "device_id": "8f1c2d3e-...-e2" }'

# Activate: opens the gNMI subscription, fetches config, starts the baseline
curl -X POST https://netstacks.example.com/api/plugins/profiling-agents/admin/agents/$AGENT/activate \
  -H "Authorization: Bearer $TOKEN"

3. Pin a custom subscription config on assignment

curl -X POST https://netstacks.example.com/api/plugins/profiling-agents/admin/agents/$AGENT/devices \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{
        "device_id": "8f1c2d3e-...-e2",
        "subscription_config_id": "c0ffee00-...-01"
      }'

4. Review anomalies and acknowledge one

# Last 6 hours of anomalies for the agent
curl -s "https://netstacks.example.com/api/plugins/profiling-agents/admin/agents/$AGENT/anomalies?hours=6" \
  -H "Authorization: Bearer $TOKEN" | jq

# Acknowledge a specific anomaly
curl -X POST \
  "https://netstacks.example.com/api/plugins/profiling-agents/admin/agents/$AGENT/anomalies/$ANOMALY/acknowledge" \
  -H "Authorization: Bearer $TOKEN"

5. Chat with the agent

curl -X POST https://netstacks.example.com/api/plugins/profiling-agents/admin/agents/$AGENT/chat \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{ "message": "Summarize the health of all your devices.",
        "user_id": "11111111-1111-1111-1111-111111111111" }'

Q&A

What protocol do agents use to read devices?
gNMI (gRPC Network Management Interface). Each assigned device gets a long-lived gNMI subscription on port 6030 by default. Devices must expose a gNMI target; SSH/SNMP-only devices are not supported by this plugin.
How long until an agent has a baseline?
The learning phase defaults to baseline_learning_days = 7. After that the compiler synthesizes a structured profile and the device becomes profiled. You can shorten or lengthen the window (1–30 days) in plugin settings.
What is the difference between the chat model and the cycle model?
model drives interactive chat and should be capable. cycle_model drives the constantly running learner, compiler, and awareness cycle, and defaults to claude-haiku-4-5-20251001 to keep background cost and latency low.
How often does the awareness cycle actually run for an agent?
The engine checks every 60 seconds, but each agent only runs a full cycle once its awareness_interval_mins (default 15) has elapsed since its last cycle. The very first cycle starts after a 180-second warm-up on service startup.
Will I get paged for the same problem repeatedly?
No. Proactive notifications use a per-device, per-source/severity cooldown (60 minutes for awareness-cycle anomalies), so a persistent condition notifies once per window, not every cycle.
How is the coordinator chosen?
Election runs on startup and after any activate/deactivate (or via POST /admin/mesh/elect). Active agents are scored by device count; the top scorer becomes coordinator and the rest attach as children. A topology-aware score is planned.
What happens to subscriptions when the plugin restarts?
On startup the service queries every agent with active = true and resumes its gNMI subscriptions automatically, then re-runs the election. No manual re-activation is needed.
Can I suppress detection during maintenance?
Yes. Suppress the relevant watch rules by setting suppressed_until (with a suppression_reason). Suppressed rules are shown to the agent as suppressed in its awareness prompt so it knows not to alert on them.

  • Plugin System: how plugins install, enable, proxy, and declare capabilities and settings.
  • Alert Pipeline: the alerts plugin for ingesting and routing alerts.
  • Incidents & ITSM: turn anomalies and alerts into tracked incidents.
  • NOC Agents: the Terminal-side AI agents and how they differ from device-owning Profiling Agents.
  • LLM Configuration: configuring providers and models that agents invoke.
  • Network Topology: the topology data a future topology-aware coordinator election will use.