NetStacks

Task Monitoring

Available on Teams and Enterprise plans

Monitor scheduled tasks, MOP executions, and agent tasks in real time with execution logs, retry tracking, and status filtering.

Overview

The Task Monitoring dashboard provides real-time visibility into every automated operation running in your NetStacks environment. From scheduled backups and health checks to multi-step MOP executions and AI agent tasks, the monitoring system tracks execution state, captures logs, and surfaces failures so your team can respond quickly.

Monitoring covers three categories of automation:

  • Scheduled Tasks — Recurring cron-based operations (backups, health checks, deployments, custom scripts, agent tasks)
  • MOP Executions — Multi-step procedures running against target devices with phase-by-phase progress tracking
  • Agent Tasks — AI-driven operations triggered by scheduled tasks or manually initiated through the NOC agent system

Centralized operations view

The monitoring dashboard is designed for NOC operators and senior engineers who need to see the health of all automated operations at a glance. Pin it to a NOC display or use it as a daily check-in dashboard.

How It Works

Execution States

Every task execution moves through a series of states:

  • Pending — Task is queued and waiting for the executor to pick it up
  • Running — Task is actively executing against target devices
  • Completed — Task finished successfully with all steps passing
  • Failed — Task failed after exhausting all retry attempts
  • Cancelled — Task was manually cancelled by an operator before completion
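
These states can be modeled as a small enum. The allowed transitions below are inferred from the state descriptions (for example, a completed or failed task never re-enters running), not taken from the NetStacks source, so treat them as an illustrative assumption:

```python
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

# Assumed transition map: which states an execution may move to next.
# Terminal states (completed/failed/cancelled) have no outgoing edges.
VALID_TRANSITIONS = {
    TaskStatus.PENDING: {TaskStatus.RUNNING, TaskStatus.CANCELLED},
    TaskStatus.RUNNING: {TaskStatus.COMPLETED, TaskStatus.FAILED,
                         TaskStatus.CANCELLED},
    TaskStatus.COMPLETED: set(),
    TaskStatus.FAILED: set(),
    TaskStatus.CANCELLED: set(),
}

def can_transition(current: TaskStatus, new: TaskStatus) -> bool:
    """Return True if an execution may legally move from current to new."""
    return new in VALID_TRANSITIONS[current]
```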

Retry Mechanism

When a task fails, the retry mechanism kicks in based on the task's configuration:

  • max_retries — How many times to retry before marking the task as failed (default: 3)
  • retry_delay_seconds — How long to wait between retry attempts. The delay is fixed (not exponential) to keep behavior predictable for network operations.
  • timeout_seconds — Maximum time a single attempt can run. If the timeout is exceeded, the attempt is killed and counted as a failure.

Retry vs re-schedule

Retries happen within the same scheduled run. If all retries are exhausted, the task is marked as failed for that run, but it remains enabled and will attempt to execute again at its next scheduled time.
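
A fixed-delay retry loop under these settings can be sketched as follows. This is illustrative Python, not the Controller's actual executor; per the retry timeline shown later in this document, max_retries is treated as the total attempt count, and since Python threads cannot be killed, this sketch merely stops waiting for a timed-out attempt:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as AttemptTimeout

def run_with_retries(attempt_fn, max_retries=3,
                     retry_delay_seconds=60, timeout_seconds=300):
    """Run attempt_fn with a fixed (non-exponential) delay between attempts.

    Sketch only: the real executor lives in the Controller and actually
    kills timed-out attempts; here a timed-out thread is abandoned.
    """
    with ThreadPoolExecutor(max_workers=max_retries) as pool:
        for attempt in range(1, max_retries + 1):
            future = pool.submit(attempt_fn)
            try:
                return future.result(timeout=timeout_seconds)  # success
            except AttemptTimeout:
                pass          # timed-out attempt counts as a failure
            except Exception:
                pass          # any other error also consumes an attempt
            if attempt < max_retries:
                time.sleep(retry_delay_seconds)  # fixed delay, predictable
    raise RuntimeError("all retries exhausted; run marked as failed")
```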

MOP Execution Tracking

MOP executions provide deeper tracking than simple scheduled tasks. The monitoring system tracks:

  • Overall execution status and current phase (pre-check, change, post-check)
  • Per-device status, current step, and error messages
  • Per-step status, output, duration, and AI feedback (when using AI Supervised mode)
  • Execution timestamps for started, completed, and checkpoint events
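
The tracked fields above can be sketched as plain data structures. The field names here are assumptions chosen for illustration and do not reproduce the Controller's actual PostgreSQL schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StepResult:
    name: str
    status: str = "pending"            # pending | running | passed | failed
    output: str = ""
    duration_seconds: Optional[float] = None
    ai_feedback: Optional[str] = None  # populated only in AI Supervised mode

@dataclass
class DeviceProgress:
    hostname: str
    status: str = "pending"
    current_step: Optional[str] = None
    error: Optional[str] = None

@dataclass
class MopExecution:
    name: str
    phase: str = "pre-check"           # pre-check | change | post-check
    status: str = "running"
    devices: list = field(default_factory=list)
    steps: list = field(default_factory=list)

    def phase_summary(self) -> str:
        """Summarize step progress for the current phase."""
        passed = sum(1 for s in self.steps if s.status == "passed")
        return f"{self.phase}: {passed} of {len(self.steps)} steps passed"
```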

Execution History

All execution records are stored in PostgreSQL with full detail. The monitoring dashboard queries this data to show recent history, and you can filter by date range, status, task type, or device to find specific executions.

Using the Monitoring Dashboard

Step 1: Navigate to the Dashboard

Open Automation → Monitoring in the main navigation. The dashboard loads with a summary of current and recent task activity.

Step 2: View Active Tasks

The top section shows all currently running tasks with real-time status updates. Each row displays the task name, type, target devices, start time, and current progress. MOP executions show the current phase and step.

Step 3: Filter by Status, Type, or Device

Use the filter controls to narrow the view:

  • Status — Show only pending, running, completed, failed, or cancelled tasks
  • Type — Filter by task type (backup, health check, deployment, etc.) or show only MOP executions
  • Device — Search for tasks targeting a specific device
  • Date range — View executions from a specific time period
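
The same filter controls can be expressed as a small in-memory helper. This is an illustrative sketch with assumed record fields; the real dashboard runs the equivalent query against PostgreSQL:

```python
from datetime import datetime

def filter_executions(executions, status=None, task_type=None,
                      device=None, start=None, end=None):
    """Filter execution records (dicts) by status, type, device, and date.

    Field names ("status", "type", "devices", "started_at") are assumed
    for illustration only.
    """
    results = []
    for ex in executions:
        if status and ex["status"] != status:
            continue
        if task_type and ex["type"] != task_type:
            continue
        if device and device not in ex.get("devices", []):
            continue
        if start and ex["started_at"] < start:
            continue
        if end and ex["started_at"] > end:
            continue
        results.append(ex)
    return results
```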

Step 4: Drill into Task Details

Click any task to open the detail view. For scheduled tasks, you see the full execution log with timestamps, device outputs, and error messages. For MOP executions, you see phase-by-phase progress with per-device and per-step breakdowns.

Step 5: Configure Retry Settings

From the task detail view, you can adjust retry settings (max retries, retry delay, timeout) for future runs. Changes take effect on the next scheduled execution.

Step 6: Cancel or Retry a Task

Use the action buttons in the task detail view to:

  • Cancel a running task (the current execution stops and the task is marked as cancelled)
  • Retry a failed task immediately without waiting for the next scheduled run

Cancelling MOP executions

Cancelling a MOP execution stops the procedure at the current step. Any commands already sent to devices are not rolled back automatically. You may need to execute the rollback phase manually or create a new MOP execution to clean up.

Code Examples

Scheduled Task Execution Log

A backup task execution showing device-by-device progress:

backup-execution-log.txt
Task: Nightly Core Router Backup
Type: backup
Status: completed
Started: 2026-03-10 02:00:01 UTC
Completed: 2026-03-10 02:03:47 UTC
Duration: 3m 46s
Retry Attempts: 0

Device Results:
  core-rtr-01.dc1 (10.1.0.1)
    Status: success
    Duration: 45s
    Config size: 24,832 bytes
    Snapshot: snap-20260310-020001-core-rtr-01

  core-rtr-02.dc1 (10.1.0.2)
    Status: success
    Duration: 52s
    Config size: 26,104 bytes
    Snapshot: snap-20260310-020046-core-rtr-02

  dist-sw-01.dc1 (10.1.1.1)
    Status: success
    Duration: 38s
    Config size: 18,456 bytes
    Snapshot: snap-20260310-020138-dist-sw-01

  dist-sw-02.dc1 (10.1.1.2)
    Status: success
    Duration: 41s
    Config size: 19,220 bytes
    Snapshot: snap-20260310-020216-dist-sw-02

MOP Execution Progress

A MOP execution showing phase-by-phase tracking with per-step detail:

mop-execution-progress.txt
Execution: Deploy VLAN 100 - DC1 Access Switches
Strategy: sequential
Control Mode: ai_supervised (autonomy level 2)
Status: running
Current Phase: change
Current Device: acc-sw-01.dc1 (10.1.2.1)

Phase Progress:
  [x] Pre-Check (completed - 2 of 2 steps passed)
      Step 1: show vlan brief                    [passed] 1.2s
      Step 2: show interfaces trunk              [passed] 1.8s

  [>] Change (in progress - 1 of 2 steps completed)
      Step 3: conf t / vlan 100 / name ENG       [passed] 2.1s
      Step 4: conf t / interface range Gi1/0/1-24 [running]

  [ ] Post-Check (pending - 0 of 2 steps)
      Step 5: show vlan brief | include 100       [pending]
      Step 6: show interfaces Gi1/0/1 switchport  [pending]

  [ ] Rollback (standby)
      Step 7: no vlan 100 / revert ports          [standby]

Retry Configuration

Configure retry behavior for a scheduled task:

retry-config.json
{
  "max_retries": 3,
  "retry_delay_seconds": 60,
  "timeout_seconds": 300
}

Retry timeline for a failing task:
  Attempt 1: 02:00:00 - 02:05:00 (timeout after 300s) -> FAILED
  Wait 60 seconds
  Attempt 2: 02:06:00 - 02:11:00 (timeout after 300s) -> FAILED
  Wait 60 seconds
  Attempt 3: 02:12:00 - 02:17:00 (timeout after 300s) -> FAILED
  All retries exhausted -> task marked as FAILED
  Next scheduled run proceeds at the normal cron time
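
The timeline above is pure arithmetic on the three settings, which a short script can reproduce (illustrative only; attempt windows assume every attempt runs to its full timeout):

```python
from datetime import datetime, timedelta

def retry_timeline(start, max_retries=3,
                   retry_delay_seconds=60, timeout_seconds=300):
    """Compute attempt windows for a task that times out on every attempt."""
    windows = []
    t = start
    for _ in range(max_retries):
        end = t + timedelta(seconds=timeout_seconds)   # attempt runs to timeout
        windows.append((t, end))
        t = end + timedelta(seconds=retry_delay_seconds)  # fixed wait
    return windows

for i, (a, b) in enumerate(retry_timeline(datetime(2026, 3, 10, 2, 0, 0)), 1):
    print(f"Attempt {i}: {a:%H:%M:%S} - {b:%H:%M:%S}")
```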

Health Check Failure Output

Example of a health check task that failed on one device:

health-check-failure.txt
Task: Edge Switch Health Monitor
Type: health_check
Status: failed (1 of 4 devices failed)
Retry Attempts: 2 of 2

Failing Device:
  edge-sw-03.branch2 (10.5.3.1)
    Check: ssh
    Error: "Connection timed out after 30s - no route to host"

    Attempt 1: 14:15:00 - timed out
    Attempt 2: 14:15:30 - timed out (retry_delay: 30s)

    Recommendation: Verify physical connectivity and check
    upstream switch port status for edge-sw-03.branch2

Passing Devices:
  edge-sw-01.branch1 (10.5.1.1) - all checks passed (1.1s)
  edge-sw-02.branch1 (10.5.1.2) - all checks passed (0.9s)
  edge-sw-04.branch2 (10.5.3.2) - all checks passed (1.3s)

Questions & Answers

Q: How do I see which tasks are currently running?
A: Open Automation → Monitoring. The top section of the dashboard shows all currently active tasks with real-time status. You can filter by the "Running" status to see only tasks that are currently executing.

Q: What do the different task statuses mean?
A: Pending means the task is queued and waiting to execute. Running means it is currently in progress. Completed means it finished successfully. Failed means it failed after exhausting all retry attempts. Cancelled means an operator manually stopped the execution.

Q: How does retry logic work?
A: When a task attempt fails or times out, the scheduler waits for the configured retry_delay_seconds and tries again, up to max_retries times. Each attempt runs with the same timeout_seconds limit. If all retries are exhausted, the task is marked as failed. The task remains enabled and will try again at its next scheduled time.

Q: Can I cancel a running task?
A: Yes. Click the task in the monitoring dashboard and use the Cancel button. The current execution stops and the task is marked as cancelled. For scheduled tasks, the task remains enabled and will run at its next scheduled time. For MOP executions, cancellation stops at the current step — you may need to run rollback manually.

Q: How long is execution history retained?
A: Execution history is stored in PostgreSQL and retained indefinitely by default. Each execution record includes start time, completion time, status, device outputs, error messages, and retry attempts. You can configure a data retention policy on the Controller to automatically purge old records if storage is a concern.

Q: Can I get notified when a task fails?
A: The monitoring dashboard highlights failed tasks prominently. For proactive alerting, configure notifications in Admin → Notification Settings to receive alerts via email or webhook when tasks fail. This is especially useful for critical tasks like backup jobs where silent failures could go unnoticed.

Troubleshooting

Dashboard not updating in real time

The monitoring dashboard uses polling to refresh task status. If the dashboard appears stale, try refreshing the page. If the issue persists, check that the Controller API is responding by visiting the health endpoint. Network connectivity issues between your browser and the Controller can also prevent updates.

Task showing wrong status

If a task appears stuck in "Running" but is no longer executing (e.g., the Controller was restarted), the status may be stale. The Controller's cleanup process detects abandoned executions and marks them as failed after the timeout period. If you need to resolve it immediately, cancel the task from the dashboard.
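
A cleanup pass of this kind can be sketched as follows. The record fields and heartbeat mechanism are assumptions for illustration; the document only states that the Controller marks abandoned executions as failed after the timeout period:

```python
from datetime import datetime, timedelta

def mark_abandoned(executions, now, timeout_seconds=300):
    """Mark still-"running" records with no recent update as failed.

    Hypothetical sketch: assumes each record carries a "last_update"
    timestamp refreshed while the executor is alive.
    """
    for ex in executions:
        stale_for = now - ex["last_update"]
        if ex["status"] == "running" and stale_for > timedelta(seconds=timeout_seconds):
            ex["status"] = "failed"
            ex["error"] = "abandoned: no update within timeout period"
    return executions
```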

Excessive retries consuming resources

If a task with a short retry delay is failing repeatedly, it can create load on target devices and the Controller. Reduce the max_retries count or increase the retry_delay_seconds to space out attempts. For tasks that target unreachable devices, consider disabling the task until the connectivity issue is resolved.

Execution logs missing device output

If the execution log shows the task completed but device outputs are empty, the SSH session may have disconnected before the output was captured. Check the Controller logs for SSH connection errors. Verify that the timeout_seconds is long enough for the command to complete and return output.

Tip

Export execution logs from the task detail view for offline analysis or to share with your team when investigating failures. The export includes all timestamps, device outputs, and error messages.

Related Features

Monitoring works alongside these NetStacks features:

  • Scheduled Tasks — Create and configure the tasks that appear in the monitoring dashboard
  • Method of Procedures (MOPs) — Multi-step procedures with phase-by-phase execution tracking
  • Audit Logs — Complete audit trail of all automated operations and manual actions
  • NOC Agents — AI agent tasks that appear in the monitoring dashboard alongside scheduled tasks
  • Cron Expressions — Scheduling syntax used by all monitored scheduled tasks