NOC Agents

Enterprise

Autonomous AI agents that monitor, triage, and respond to network events using the ReAct pattern with tool approval workflows.

Overview

NOC Agents are autonomous AI agents that monitor, triage, and respond to network events without constant human intervention. They combine the reasoning capabilities of large language models with the ability to execute network commands, query knowledge bases, and create incidents, all governed by a tool approval system that prevents unauthorized changes.

Agent Types

Type	Purpose	Example Use Case
Triage	Initial event assessment and categorization	Receive a BGP neighbor down alert, check neighbor status, classify severity, and create an incident with initial findings
Protocol-Specific	Deep expertise in a specific protocol (BGP, OSPF, STP, MPLS)	Investigate OSPF adjacency failures by checking interface status, MTU mismatches, authentication, and area configuration
Custom	User-defined for specific workflows	Validate post-maintenance state by running a predefined set of checks and comparing to a baseline

The ReAct Pattern

NOC Agents use the ReAct (Reasoning + Acting) pattern, a structured loop where the agent observes the current state, reasons about what to do next, takes an action, and then observes the result. This loop continues until the agent has enough information to produce a final answer or reaches its iteration limit.

Tool Approvals

Agents can be configured to require human approval before executing destructive commands (configuration changes, interface shutdowns, device reloads). Read-only commands like show can be auto-approved while write operations require explicit confirmation.

How It Works

Agent Architecture

Each agent is defined with a system prompt, a set of available tools, an LLM provider and model, and execution constraints (max iterations, tool timeout, total execution timeout). When triggered, the agent enters the ReAct loop:

1. Observe: The agent receives the trigger event (alert, scheduled task, or manual invocation) along with relevant context from the knowledge base.
2. Think: The LLM analyzes the observation and decides what action to take next. It explains its reasoning in the conversation log.
3. Act: The agent requests execution of a tool (SSH command, API call, knowledge base query, incident creation). If the tool requires approval, execution pauses until approved.
4. Observe: The result of the action is fed back to the agent. The loop continues from step 2 until the agent produces a final answer or reaches its iteration limit.
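
The loop described above can be sketched in a few lines of Python. This is an illustration only, not the NetStacks implementation: `run_react_loop`, `ExecutionResult`, the `think` callback (a stand-in for the LLM), and the `"final"` action name are all hypothetical, and the sketch assumes that producing the final answer counts as one iteration. The approval gate is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical shape of one LLM decision: (reasoning, action name, argument).
# The action name "final" means the argument is the final answer.
Decision = Tuple[str, str, str]


@dataclass
class ExecutionResult:
    final_answer: Optional[str]  # None if the iteration limit was reached first
    iterations_used: int
    log: List[str] = field(default_factory=list)


def run_react_loop(
    task: str,
    think: Callable[[List[str]], Decision],
    tools: Dict[str, Callable[[str], str]],
    max_iterations: int = 10,
) -> ExecutionResult:
    log = [f"Observe: {task}"]  # step 1: the trigger event
    for iteration in range(1, max_iterations + 1):
        thought, action, argument = think(log)  # step 2: Think
        log.append(f"Think: {thought}")
        if action == "final":
            return ExecutionResult(argument, iteration, log)
        result = tools[action](argument)  # step 3: Act
        log.append(f"Act: {action} {argument}")
        log.append(f"Observe: {result}")  # step 4: Observe, then back to step 2
    return ExecutionResult(None, max_iterations, log)


def scripted_think(log: List[str]) -> Decision:
    """Stand-in for the LLM: run one command, then answer."""
    if not any(line.startswith("Act:") for line in log):
        return ("Need the neighbor state", "ssh_read", "show ip bgp neighbor 10.0.2.2")
    return ("Enough evidence", "final", "Neighbor is in Active state")


demo = run_react_loop(
    "BGP neighbor 10.0.2.2 state changed to Active",
    scripted_think,
    {"ssh_read": lambda command: "BGP state = Active"},  # fake tool, no real SSH
)
print(demo.iterations_used, demo.final_answer)
```

Returning the full log alongside the answer mirrors the conversation log that the product records for each execution.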

Trigger Types

Agents can be activated through three trigger mechanisms:

Alert-Triggered — When an alert matches a configured pattern, the agent is automatically invoked. Agents with auto_triage_enabled are evaluated against incoming alerts using their auto_triage_alert_patterns.
Scheduled — Agents can run on a schedule for periodic health checks, compliance audits, or baseline comparisons.
Manual — Users can invoke agents directly from the NOC Agents dashboard, providing a specific task or question as input.
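
Alert-triggered matching can be approximated as follows. The sketch assumes that each entry in auto_triage_alert_patterns is a regular expression searched anywhere in the alert message, case-sensitively. The actual matching semantics may differ. The `enabled` field name and both function names are hypothetical.

```python
import re
from typing import Iterable, List


def matching_patterns(alert_message: str, patterns: Iterable[str]) -> List[str]:
    """Return the patterns that match anywhere in the alert message."""
    return [p for p in patterns if re.search(p, alert_message)]


def should_auto_triage(agent: dict, alert_message: str) -> bool:
    """True when the agent is enabled, auto-triage is on, and a pattern matches."""
    if not agent.get("enabled", False):  # "enabled" is an assumed field name
        return False
    if not agent.get("auto_triage_enabled", False):
        return False
    patterns = agent.get("auto_triage_alert_patterns", [])
    return bool(matching_patterns(alert_message, patterns))


agent = {
    "enabled": True,
    "auto_triage_enabled": True,
    "auto_triage_alert_patterns": [
        "BGP.*neighbor.*down",
        "BGP.*state.*Active",
        "NOTIFICATION.*BGP",
    ],
}
alert = "BGP neighbor 10.0.2.2 (AS 65004) state changed to Active"
print(matching_patterns(alert, agent["auto_triage_alert_patterns"]))
print(should_auto_triage(agent, alert))
```

Running sample alert messages through a helper like this before enabling an agent is a quick way to catch patterns that never match.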

Tool Approval System

The tool approval system controls what agents can do on your network. Each agent has a require_approval_for_destructive setting and a list of allowed_tool_categories that determine its permissions:

Auto-Approved — Read-only commands (show, display, get) execute immediately without human intervention.
Requires Approval — Configuration changes, interface modifications, and device reloads pause execution and notify an approver. The agent resumes after approval or terminates on denial.
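
The split between the two categories can be illustrated with a small classifier. This is a deliberately conservative sketch, not the product's classification logic: it looks only at the first word of the command and treats anything outside the read-only verbs named above as needing approval. The function name and the first-word rule are assumptions.

```python
READ_ONLY_VERBS = ("show", "display", "get")


def requires_approval(command: str, require_approval_for_destructive: bool = True) -> bool:
    """Decide whether a command should pause for human approval.

    Assumption: a command is read-only when its first word is one of the
    read-only verbs. Anything else is treated as potentially destructive.
    """
    words = command.strip().split()
    first_word = words[0].lower() if words else ""
    is_read_only = first_word in READ_ONLY_VERBS
    return require_approval_for_destructive and not is_read_only


for command in (
    "show ip bgp neighbor 10.0.2.2",
    "clear counters GigabitEthernet0/0/2",
    "reload",
):
    print(command, "->", "approval" if requires_approval(command) else "auto")
```

A first-word rule this strict would also hold back harmless diagnostics such as ping, so a real policy needs a richer notion of what counts as read-only.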

Execution History and Auditing

Every agent execution is recorded with full details: trigger source, input task, conversation log (the complete ReAct loop), tool calls made, iterations used, final answer, execution state (running, completed, failed, timeout), and timestamps. Execution history supports filtering by agent, state, trigger type, and date range.
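
To make the record and its filters concrete, here is a minimal in-memory sketch. It assumes a record shape loosely based on the fields listed above. `Execution`, `filter_executions`, and the `started_at` field are hypothetical names, and the real history is queried through the product rather than filtered client-side.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class Execution:
    agent_id: str
    state: str           # running, completed, failed, or timeout
    triggered_by: str    # alert, scheduled, or manual
    started_at: datetime  # assumed timestamp field
    iterations_used: int = 0


def filter_executions(
    executions: Iterable[Execution],
    agent_id: Optional[str] = None,
    state: Optional[str] = None,
    triggered_by: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> List[Execution]:
    """Keep executions that satisfy every filter that was supplied."""
    selected = []
    for execution in executions:
        if agent_id is not None and execution.agent_id != agent_id:
            continue
        if state is not None and execution.state != state:
            continue
        if triggered_by is not None and execution.triggered_by != triggered_by:
            continue
        if since is not None and execution.started_at < since:
            continue
        if until is not None and execution.started_at > until:
            continue
        selected.append(execution)
    return selected


history = [
    Execution("bgp-triage", "completed", "alert", datetime(2026, 1, 10, 14, 0), 6),
    Execution("bgp-triage", "timeout", "alert", datetime(2026, 1, 11, 9, 30), 10),
    Execution("ospf-check", "completed", "scheduled", datetime(2026, 1, 11, 12, 0), 3),
]
recent_completed = filter_executions(
    history, state="completed", since=datetime(2026, 1, 11)
)
print([e.agent_id for e in recent_completed])
```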

Creating Agents

Follow these steps to create and deploy a NOC Agent for automated network event response.

1. Navigate to AI > NOC Agents: Open the NOC Agents dashboard from the main navigation menu.
2. Create a new agent: Click “New Agent” and provide a name, description, and agent type (Triage, Protocol-Specific, or Custom). For protocol-specific agents, select the target protocol (BGP, OSPF, STP, MPLS).
3. Configure the LLM: Select the LLM provider and model for this agent. Agents performing complex analysis benefit from more capable models (e.g., GPT-4o, Claude Sonnet), while simple triage agents can use faster models.
4. Write the system prompt: Define the agent's personality, expertise area, and instructions. The system prompt guides how the agent reasons about problems and what approach it takes.
5. Configure triggers: Set up alert patterns that activate this agent, enable auto-triage for automatic activation, or configure a schedule for periodic execution.
6. Define available tools: Specify which tool categories the agent can use: SSH commands, API calls, knowledge base queries, incident management.
7. Set tool approval policy: Enable require_approval_for_destructive so that configuration changes require human approval. Set execution limits: max iterations, tool timeout (seconds per tool call), and total execution timeout.
8. Test the agent: Run the agent manually with a simulated event. Review the conversation log to verify that the agent's reasoning and actions are correct.
9. Enable for production: Toggle the agent to the enabled state. It will now respond to configured triggers automatically.

Start Read-Only

When deploying a new agent, start with read-only tool permissions and destructive approval required. Monitor the agent's execution history for several cycles before expanding its capabilities.
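
A pre-deployment check along these lines can confirm that an agent definition follows the read-only starting point. The sketch assumes that ssh_write is the only tool category that can change device state. The helper name `is_read_only_start` is hypothetical.

```python
# Assumption: ssh_write is the only category that can modify devices.
WRITE_CATEGORIES = {"ssh_write"}


def is_read_only_start(agent: dict) -> bool:
    """True when destructive approval is required and no write category is allowed."""
    categories = set(agent.get("allowed_tool_categories", []))
    approval_required = agent.get("require_approval_for_destructive") is True
    return approval_required and not (categories & WRITE_CATEGORIES)


new_agent = {
    "require_approval_for_destructive": True,
    "allowed_tool_categories": ["ssh_read", "knowledge_query", "incident_create"],
}
print(is_read_only_start(new_agent))
```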

Agent Task Limits

To prevent any single user from consuming excessive system resources, NetStacks enforces per-user limits on the number of concurrent AI agent tasks running in the background. When a user reaches their limit, they must wait for one or more running tasks to complete before launching new ones.

Default Limit

The system-wide default is 3 concurrent agent tasks per user. This applies to all users unless an administrator has configured a custom limit for a specific user.

Custom Per-User Limits

Administrators can override the default limit on a per-user basis. Navigate to Admin > AI > Agent Task Limits to manage custom limits:

The limits table displays usernames (not UUIDs) for easy identification.
Use the user dropdown to select a user and assign a custom limit between 1 and 20 concurrent tasks.
Users without a custom limit fall back to the system default of 3.
Set a limit of 1 to restrict a user to sequential agent execution, or increase to 20 for power users who run many parallel investigations.

Exceeding the Limit

When a user attempts to launch an agent task and has already reached their concurrent limit, the request is rejected with a message indicating how many tasks are currently running and what their limit is. The user must wait for at least one running task to complete (or cancel a running task) before starting a new one.

Alert-Triggered Tasks

Agent tasks triggered by incoming alerts count toward the user limit of the alert pipeline's configured owner. If the owner has reached their limit, the alert-triggered task is queued and executed once a slot becomes available.
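
The limit rules in this section can be modeled with a small in-memory limiter. This is a sketch under stated assumptions, not the product's scheduler: `TaskLimiter`, `LimitExceeded`, and the method names are hypothetical, and queued alert tasks are assumed to start in first-in, first-out order.

```python
from collections import deque
from typing import Deque, Dict, Optional

DEFAULT_LIMIT = 3                 # system-wide default per user
MIN_CUSTOM, MAX_CUSTOM = 1, 20    # allowed range for custom limits


class LimitExceeded(Exception):
    """Raised when a user-launched task would exceed the concurrent limit."""


class TaskLimiter:
    def __init__(self) -> None:
        self.custom_limits: Dict[str, int] = {}
        self.running: Dict[str, int] = {}
        self.queued: Dict[str, Deque[str]] = {}

    def set_custom_limit(self, user: str, limit: int) -> None:
        if not MIN_CUSTOM <= limit <= MAX_CUSTOM:
            raise ValueError(f"limit must be between {MIN_CUSTOM} and {MAX_CUSTOM}")
        self.custom_limits[user] = limit

    def limit_for(self, user: str) -> int:
        return self.custom_limits.get(user, DEFAULT_LIMIT)

    def launch(self, user: str, task: str, alert_triggered: bool = False) -> str:
        """Start the task, queue it (alert-triggered), or reject it (user-launched)."""
        running = self.running.get(user, 0)
        limit = self.limit_for(user)
        if running < limit:
            self.running[user] = running + 1
            return "started"
        if alert_triggered:
            # For alert tasks, "user" is the alert pipeline's configured owner.
            self.queued.setdefault(user, deque()).append(task)
            return "queued"
        raise LimitExceeded(f"{running} tasks running; limit is {limit}")

    def finish(self, user: str) -> Optional[str]:
        """Release a slot and return the queued task that takes it, if any."""
        self.running[user] = max(0, self.running.get(user, 0) - 1)
        queue = self.queued.get(user)
        if queue:
            self.running[user] += 1
            return queue.popleft()
        return None


limiter = TaskLimiter()
for name in ("task-1", "task-2", "task-3"):
    limiter.launch("alice", name)
print(limiter.launch("alice", "alert-task", alert_triggered=True))
print(limiter.finish("alice"))
```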

Code Examples

Agent Definition: BGP Triage Agent

Here is a complete agent configuration for a BGP triage agent that automatically investigates BGP neighbor state changes:

bgp-triage-agent.json

{
  "name": "BGP Triage Agent",
  "description": "Automatically triages BGP neighbor state change alerts",
  "agent_type": "protocol",
  "protocol": "bgp",
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "system_prompt": "You are a BGP expert NOC agent. When triggered by a BGP alert, investigate the neighbor state, check for common issues (authentication, reachability, AS mismatch), and create an incident with your findings. Always check both sides of the peering if accessible.",
  "max_iterations": 10,
  "tool_timeout_seconds": 30,
  "max_total_seconds": 300,
  "require_approval_for_destructive": true,
  "allowed_tool_categories": ["ssh_read", "knowledge_query", "incident_create"],
  "auto_triage_enabled": true,
  "auto_triage_alert_patterns": ["BGP.*neighbor.*down", "BGP.*state.*Active", "NOTIFICATION.*BGP"]
}
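
When choosing execution limits, it can help to compare the worst-case time spent in tools against the total budget. The sketch below assumes one tool call per iteration and that tool time counts against max_total_seconds. Both are assumptions about the product's behavior, and the function name is hypothetical.

```python
def timeout_headroom_seconds(config: dict) -> int:
    """Seconds left for LLM reasoning if every tool call runs to its timeout."""
    worst_case_tools = config["max_iterations"] * config["tool_timeout_seconds"]
    return config["max_total_seconds"] - worst_case_tools


# Limits copied from the BGP Triage Agent definition above.
bgp_limits = {
    "max_iterations": 10,
    "tool_timeout_seconds": 30,
    "max_total_seconds": 300,
}
headroom = timeout_headroom_seconds(bgp_limits)
print(headroom)  # 10 * 30 = 300, so the headroom is 0
```

Under these assumptions, zero headroom means a run of slow tool calls could reach the total execution timeout before the iteration limit. If that happens in practice, raising max_total_seconds or lowering tool_timeout_seconds are the two adjustments to consider.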

ReAct Loop in Action

This example shows the agent's reasoning and actions while triaging a BGP neighbor down event:

react-loop-example.txt

[Trigger] Alert: BGP neighbor 10.0.2.2 (AS 65004) state changed to Active
         on device core-rtr-01

[Iteration 1]
  Think: BGP neighbor 10.0.2.2 changed to Active state. I need to check the
         detailed neighbor status to understand why the session dropped.
  Act:   ssh core-rtr-01 "show ip bgp neighbor 10.0.2.2"
  Result: BGP neighbor is 10.0.2.2, remote AS 65004
          BGP state = Active
          Last read 00:05:23, Last write 00:05:23
          Last reset 00:05:23, due to BGP Notification sent
          Notification sent: hold time expired

[Iteration 2]
  Think: The session was reset due to hold time expiration. This typically
         indicates a connectivity issue or the remote router stopped sending
         keepalives. I should verify IP reachability.
  Act:   ssh core-rtr-01 "ping 10.0.2.2 source Loopback0 repeat 5"
  Result: Success rate is 0 percent (0/5)

[Iteration 3]
  Think: No IP reachability to 10.0.2.2. Let me check the route to this
         neighbor to see if there is a routing issue.
  Act:   ssh core-rtr-01 "show ip route 10.0.2.2"
  Result: 10.0.2.0/30 is directly connected, GigabitEthernet0/0/2

[Iteration 4]
  Think: The route exists as directly connected via Gi0/0/2. Let me check
         the interface status.
  Act:   ssh core-rtr-01 "show interface GigabitEthernet0/0/2 | include line protocol"
  Result: GigabitEthernet0/0/2 is up, line protocol is up

[Iteration 5]
  Think: Interface is up/up but pings fail. This could be an issue on the
         remote side. Let me check for any error counters.
  Act:   ssh core-rtr-01 "show interface GigabitEthernet0/0/2 | include errors|CRC|drops"
  Result: 0 input errors, 0 CRC, 0 frame
          847 output errors, 0 collisions, 0 interface resets
          Input queue: 0/75/0/0, Total output drops: 312

[Final Answer]
  BGP neighbor 10.0.2.2 (AS 65004) is down due to hold time expiration.
  Root cause: IP connectivity failure despite interface up/up status.
  Evidence: Ping fails, 847 output errors and 312 output drops on Gi0/0/2.
  Severity: HIGH - Possible physical layer or remote-side issue.
  Recommendation: Check physical connectivity and remote device status.
  Incident created: INC-2026-0342

Tool Approval Flow

When an agent requests a destructive operation, execution pauses until approved:

tool-approval-flow.txt

[Iteration 6]
  Think: The interface has high output errors. I should clear the counters
         and monitor for new errors to determine if the issue is ongoing.
  Act:   ssh core-rtr-01 "clear counters GigabitEthernet0/0/2"

  [APPROVAL REQUIRED]
  Agent "BGP Triage Agent" is requesting to execute a write command:
    Device: core-rtr-01
    Command: clear counters GigabitEthernet0/0/2
    Reason: Clear interface counters to monitor for ongoing errors

  Status: Waiting for approval...
  Approved by: admin@netstacks.io at 14:23:05 UTC

  Result: Clear "show interface" counters on interface GigabitEthernet0/0/2 [OK]

Execution History Query

Execution history can be filtered by agent, state, trigger type, and date range:

query-execution-history.sh

# List recent executions for the BGP Triage Agent
curl -s https://controller.example.com/api/v1/agents/executions \
  -H "Authorization: Bearer $TOKEN" \
  -G -d "agent_id=a1b2c3d4-..." \
  -d "state=completed" \
  -d "triggered_by=alert" \
  -d "limit=10" | jq '.[] | {id, state, input_task, iterations_used, tool_calls_made}'

# Example output
{
  "id": "exec-9f8e7d6c-...",
  "state": "completed",
  "input_task": "BGP neighbor 10.0.2.2 state changed to Active",
  "iterations_used": 6,
  "tool_calls_made": 5
}

Questions & Answers

What types of NOC agents are available?
NetStacks provides three agent types. Triage agents perform initial event assessment and categorization, routing alerts to the right team with initial findings. Protocol-specific agents (BGP, OSPF, STP, MPLS) have deep expertise in a single protocol and can perform detailed investigation of protocol-related issues. Custom agents are user-defined and can be tailored to any workflow, such as post-maintenance validation, compliance checks, or capacity planning.

What is the ReAct pattern?
ReAct (Reasoning + Acting) is a structured execution pattern where the agent alternates between thinking and acting. In each iteration, the agent observes the current state, reasons about what to do next (visible in the conversation log), executes a tool action, and then observes the result. This loop continues until the agent has enough information to produce a final answer or reaches its configured iteration limit (max_iterations). The pattern ensures transparency because the agent's reasoning is logged at every step.

How do tool approvals work?
Each agent has a require_approval_for_destructive setting. When enabled, the agent can execute read-only commands (show, display, get) automatically, but any command that modifies configuration, shuts down interfaces, or reloads devices requires explicit human approval. When the agent requests a destructive command, execution pauses and a notification is sent to configured approvers. The agent resumes after approval or terminates if the request is denied.

Can agents execute commands on devices?
Yes. Agents can execute SSH commands on managed devices as part of their investigation workflow. The allowed_tool_categories setting controls which types of tools the agent can use, including ssh_read (show commands), ssh_write (configuration changes), knowledge_query (knowledge base search), and incident_create (incident management). Combined with the tool approval system, this provides controlled automation.

How do I view agent execution history?
Navigate to AI > NOC Agents and select an agent to view its execution history. Each execution shows the trigger source, input task, full conversation log (the ReAct loop with reasoning and actions), tool calls made, iterations used, final answer, and execution state. You can filter by state (running, completed, failed, timeout), trigger type (alert, scheduled, manual), and date range.

Can I create custom agent types?
Yes. Select “Custom” as the agent type when creating a new agent. Custom agents accept a free-form system prompt that defines their behavior, expertise area, and approach. You can create agents for any workflow: post-maintenance validation, security compliance audits, capacity utilization checks, or multi-vendor configuration comparison. Custom agents have access to the same tools and approval system as built-in agent types.

Troubleshooting

Issue	Possible Cause	Solution
Agent not triggering on alerts	Trigger patterns not matching or agent disabled	Verify the agent is enabled and `auto_triage_enabled` is set to true. Check that `auto_triage_alert_patterns` contain regex patterns that match your alert messages. Test patterns against sample alerts before deploying.
Agent stuck waiting for approval	Approval notification not received or no approver available	Check that approval notifications are configured correctly (email, webhook, or in-app). Verify approver users have the required permissions. Consider adjusting the tool timeout if approvals take longer than expected.
Agent making incorrect decisions	System prompt needs refinement or insufficient knowledge base	Review the agent's conversation log to understand its reasoning. Refine the system prompt with more specific instructions. Add relevant documentation to the Knowledge Base so the agent has organizational context for its decisions.
Agent timing out	Too many iterations or slow tool responses	Increase `max_total_seconds` if the investigation legitimately requires more time. Increase `tool_timeout_seconds` if individual commands are slow. Check network connectivity to managed devices. Reduce `max_iterations` if the agent is looping without progress.
Agent execution failed	LLM provider error or tool execution failure	Check the execution error message in the history view. Verify the LLM provider is reachable and the API key is valid. Check that the agent's configured model is available on the provider. Review device connectivity for SSH tool failures.

AI Chat — Interactive AI assistance for manual network operations with credential sanitization and copilot panel.
Knowledge Base — Build a searchable document repository that NOC Agents use for organizational context during investigations.
LLM Configuration — Configure the AI providers and models that power NOC Agents, including model selection and token budgets.
Methods of Procedure — Define structured maintenance procedures that agents can reference or execute as part of automated workflows.
Task Monitoring — Monitor running agent tasks, view execution progress, and manage scheduled agent invocations.
MCP Servers — Model Context Protocol servers that extend agent capabilities with additional tools and data sources.