Task Monitoring
Monitor scheduled tasks, MOP executions, and agent tasks in real time with execution logs, retry tracking, and status filtering.
Overview
The Task Monitoring dashboard provides real-time visibility into every automated operation running in your NetStacks environment. From scheduled backups and health checks to multi-step MOP executions and AI agent tasks, the monitoring system tracks execution state, captures logs, and surfaces failures so your team can respond quickly.
Monitoring covers three categories of automation:
- Scheduled Tasks — Recurring cron-based operations (backups, health checks, deployments, custom scripts, agent tasks)
- MOP Executions — Multi-step procedures running against target devices with phase-by-phase progress tracking
- Agent Tasks — AI-driven operations triggered by scheduled tasks or manually initiated through the NOC agent system
The monitoring dashboard is designed for NOC operators and senior engineers who need to see the health of all automated operations at a glance. Pin it to a NOC display or use it as a daily check-in dashboard.
How It Works
Execution States
Every task execution moves through a series of states:
| Status | Description |
|---|---|
| Pending | Task is queued and waiting for the executor to pick it up |
| Running | Task is actively executing against target devices |
| Completed | Task finished successfully with all steps passing |
| Failed | Task failed after exhausting all retry attempts |
| Cancelled | Task was manually cancelled by an operator before completion |
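The state set above can be sketched as a small enum with a terminal/non-terminal distinction (the names here are illustrative, not the actual NetStacks schema):

```python
from enum import Enum

class TaskStatus(Enum):
    """Execution states from the table above (illustrative names)."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

# Terminal states: the execution record will not change again for this run.
TERMINAL_STATES = {TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.CANCELLED}

def is_terminal(status: TaskStatus) -> bool:
    """True once the run has finished, failed, or been cancelled."""
    return status in TERMINAL_STATES
```

Pending and Running are the only states from which further transitions occur; dashboards typically poll only non-terminal executions.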
Retry Mechanism
When a task fails, the scheduler retries it according to the task's configuration:
- max_retries — How many times to retry before marking the task as failed (default: 3)
- retry_delay_seconds — How long to wait between retry attempts. The delay is fixed (not exponential) to keep behavior predictable for network operations.
- timeout_seconds — Maximum time a single attempt can run. If the timeout is exceeded, the attempt is killed and counted as a failure.
Retries happen within the same scheduled run. If all retries are exhausted, the task is marked as failed for that run, but it remains enabled and will attempt to execute again at its next scheduled time.
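The retry loop can be sketched as follows. This is an illustrative model, not the NetStacks source; it treats max_retries as the total number of attempts per run, matching the retry timeline example later in this page (three attempts with max_retries set to 3). Adjust if your deployment counts retries after an initial attempt.

```python
import time

def run_with_retries(attempt_fn, max_retries=3, retry_delay_seconds=60):
    """Run one scheduled execution: up to max_retries attempts with a
    fixed (non-exponential) delay between them. attempt_fn returns True
    on success and False on failure or timeout. Sketch only."""
    for attempt in range(max_retries):
        if attempt_fn():
            return "completed"
        if attempt < max_retries - 1:
            # Fixed delay keeps behavior predictable for network operations.
            time.sleep(retry_delay_seconds)
    # Exhausted for this run; the task stays enabled for its next
    # scheduled time.
    return "failed"
```

Note that the per-attempt timeout is enforced inside `attempt_fn` in this model: a timed-out attempt simply returns False and counts as a failure.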
MOP Execution Tracking
MOP executions provide deeper tracking than simple scheduled tasks. The monitoring system tracks:
- Overall execution status and current phase (pre-check, change, post-check)
- Per-device status, current step, and error messages
- Per-step status, output, duration, and AI feedback (when using AI Supervised mode)
- Execution timestamps for started, completed, and checkpoint events
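The tracked fields map naturally onto per-device and per-step records. A minimal sketch of that structure, with illustrative field names (not the actual NetStacks schema):

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    """Per-step tracking for a MOP execution (illustrative fields)."""
    name: str
    status: str = "pending"    # pending | running | passed | failed
    output: str = ""
    duration_seconds: float = 0.0
    ai_feedback: str = ""      # populated only in AI Supervised mode

@dataclass
class DeviceState:
    """Per-device tracking within a MOP execution."""
    hostname: str
    status: str = "pending"
    current_step: int = 0
    error: str = ""
    steps: list = field(default_factory=list)  # list[StepResult]
```

The overall execution record then carries its phase (pre-check, change, post-check), timestamps, and one DeviceState per target.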
Execution History
All execution records are stored in PostgreSQL with full detail. The monitoring dashboard queries this data to show recent history, and you can filter by date range, status, task type, or device to find specific executions.
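The dashboard's filtering behavior can be sketched as a simple predicate pass over history records, assuming each record carries status, type, and started_at fields (field names are illustrative, not the real table schema):

```python
from datetime import datetime

def filter_executions(records, status=None, task_type=None,
                      since=None, until=None):
    """Filter execution-history records the way the dashboard does.
    Any filter left as None is ignored. Records are dicts with
    'status', 'type', and 'started_at' (datetime) keys."""
    matched = []
    for rec in records:
        if status is not None and rec["status"] != status:
            continue
        if task_type is not None and rec["type"] != task_type:
            continue
        if since is not None and rec["started_at"] < since:
            continue
        if until is not None and rec["started_at"] > until:
            continue
        matched.append(rec)
    return matched
```

In practice these filters translate to WHERE clauses against the PostgreSQL history tables; the sketch just shows the semantics.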
Using the Monitoring Dashboard
Step 1: Navigate to the Dashboard
Open Automation → Monitoring in the main navigation. The dashboard loads with a summary of current and recent task activity.
Step 2: View Active Tasks
The top section shows all currently running tasks with real-time status updates. Each row displays the task name, type, target devices, start time, and current progress. MOP executions show the current phase and step.
Step 3: Filter by Status, Type, or Device
Use the filter controls to narrow the view:
- Status — Show only pending, running, completed, failed, or cancelled tasks
- Type — Filter by task type (backup, health check, deployment, etc.) or show only MOP executions
- Device — Search for tasks targeting a specific device
- Date range — View executions from a specific time period
Step 4: Drill into Task Details
Click any task to open the detail view. For scheduled tasks, you see the full execution log with timestamps, device outputs, and error messages. For MOP executions, you see phase-by-phase progress with per-device and per-step breakdowns.
Step 5: Configure Retry Settings
From the task detail view, you can adjust retry settings (max retries, retry delay, timeout) for future runs. Changes take effect on the next scheduled execution.
Step 6: Cancel or Retry a Task
Use the action buttons in the task detail view to:
- Cancel a running task (the current execution stops and the task is marked as cancelled)
- Retry a failed task immediately without waiting for the next scheduled run
Cancelling a MOP execution stops the procedure at the current step. Any commands already sent to devices are not rolled back automatically. You may need to execute the rollback phase manually or create a new MOP execution to clean up.
Code Examples
Scheduled Task Execution Log
A backup task execution showing device-by-device progress:
Task: Nightly Core Router Backup
Type: backup
Status: completed
Started: 2026-03-10 02:00:01 UTC
Completed: 2026-03-10 02:03:47 UTC
Duration: 3m 46s
Retry Attempts: 0
Device Results:
core-rtr-01.dc1 (10.1.0.1)
Status: success
Duration: 45s
Config size: 24,832 bytes
Snapshot: snap-20260310-020001-core-rtr-01
core-rtr-02.dc1 (10.1.0.2)
Status: success
Duration: 52s
Config size: 26,104 bytes
Snapshot: snap-20260310-020046-core-rtr-02
dist-sw-01.dc1 (10.1.1.1)
Status: success
Duration: 38s
Config size: 18,456 bytes
Snapshot: snap-20260310-020138-dist-sw-01
dist-sw-02.dc1 (10.1.1.2)
Status: success
Duration: 41s
Config size: 19,220 bytes
Snapshot: snap-20260310-020216-dist-sw-02
MOP Execution Progress
A MOP execution showing phase-by-phase tracking with per-step detail:
Execution: Deploy VLAN 100 - DC1 Access Switches
Strategy: sequential
Control Mode: ai_supervised (autonomy level 2)
Status: running
Current Phase: change
Current Device: acc-sw-01.dc1 (10.1.2.1)
Phase Progress:
[x] Pre-Check (completed - 2 of 2 steps passed)
Step 1: show vlan brief [passed] 1.2s
Step 2: show interfaces trunk [passed] 1.8s
[>] Change (in progress - 1 of 2 steps completed)
Step 3: conf t / vlan 100 / name ENG [passed] 2.1s
Step 4: conf t / interface range Gi1/0/1-24 [running]
[ ] Post-Check (pending - 0 of 2 steps)
Step 5: show vlan brief | include 100 [pending]
Step 6: show interfaces Gi1/0/1 switchport [pending]
[ ] Rollback (standby)
Step 7: no vlan 100 / revert ports [standby]
Retry Configuration
Configure retry behavior for a scheduled task:
{
"max_retries": 3,
"retry_delay_seconds": 60,
"timeout_seconds": 300
}
// Retry timeline for a failing task:
// Attempt 1: 02:00:00 - 02:05:00 (timeout after 300s) -> FAILED
// Wait 60 seconds
// Attempt 2: 02:06:00 - 02:11:00 (timeout after 300s) -> FAILED
// Wait 60 seconds
// Attempt 3: 02:12:00 - 02:17:00 (timeout after 300s) -> FAILED
// All retries exhausted -> Task marked as FAILED
// Next scheduled run proceeds at normal cron time
Health Check Failure Output
Example of a health check task that failed on one device:
Task: Edge Switch Health Monitor
Type: health_check
Status: failed (1 of 4 devices failed)
Retry Attempts: 2 of 2
Failing Device:
edge-sw-03.branch2 (10.5.3.1)
Check: ssh
Error: "Connection timed out after 30s - no route to host"
Attempt 1: 14:15:00 - timed out
Attempt 2: 14:15:30 - timed out (retry_delay: 30s)
Recommendation: Verify physical connectivity and check
upstream switch port status for edge-sw-03.branch2
Passing Devices:
edge-sw-01.branch1 (10.5.1.1) - all checks passed (1.1s)
edge-sw-02.branch1 (10.5.1.2) - all checks passed (0.9s)
edge-sw-04.branch2 (10.5.3.2) - all checks passed (1.3s)
Questions & Answers
- Q: How do I see which tasks are currently running?
- A: Open Automation → Monitoring. The top section of the dashboard shows all currently active tasks with real-time status. You can filter by the "Running" status to see only tasks that are currently executing.
- Q: What do the different task statuses mean?
- A: Pending means the task is queued and waiting to execute. Running means it is currently in progress. Completed means it finished successfully. Failed means it failed after exhausting all retry attempts. Cancelled means an operator manually stopped the execution.
- Q: How does retry logic work?
- A: When a task attempt fails or times out, the scheduler waits for the configured retry_delay_seconds and tries again, up to max_retries times. Each attempt runs with the same timeout_seconds limit. If all retries are exhausted, the task is marked as failed. The task remains enabled and will try again at its next scheduled time.
- Q: Can I cancel a running task?
- A: Yes. Click the task in the monitoring dashboard and use the Cancel button. The current execution stops and the task is marked as cancelled. For scheduled tasks, the task remains enabled and will run at its next scheduled time. For MOP executions, cancellation stops at the current step — you may need to run rollback manually.
- Q: How long is execution history retained?
- A: Execution history is stored in PostgreSQL and retained indefinitely by default. Each execution record includes start time, completion time, status, device outputs, error messages, and retry attempts. You can configure a data retention policy on the Controller to automatically purge old records if storage is a concern.
- Q: Can I get notified when a task fails?
- A: The monitoring dashboard highlights failed tasks prominently. For proactive alerting, configure notifications in Admin → Notification Settings to receive alerts via email or webhook when tasks fail. This is especially useful for critical tasks like backup jobs where silent failures could go unnoticed.
Troubleshooting
Dashboard not updating in real time
The monitoring dashboard uses polling to refresh task status. If the dashboard appears stale, try refreshing the page. If the issue persists, check that the Controller API is responding by visiting the health endpoint. Network connectivity issues between your browser and the Controller can also prevent updates.
Task showing wrong status
If a task appears stuck in "Running" but is no longer executing (e.g., the Controller was restarted), the status may be stale. The Controller's cleanup process detects abandoned executions and marks them as failed after the timeout period. If you need to resolve it immediately, cancel the task from the dashboard.
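The cleanup pass described here can be sketched as a sweep over execution records; the field names and the error wording below are assumptions for illustration, not the Controller's actual implementation:

```python
from datetime import datetime, timedelta, timezone

def mark_abandoned(executions, now=None):
    """Sweep execution records and fail any 'running' execution that
    has outlived its timeout window (e.g. after a Controller restart
    left the record stale). Mutates the records in place."""
    now = now or datetime.now(timezone.utc)
    for ex in executions:
        if ex["status"] != "running":
            continue
        deadline = ex["started_at"] + timedelta(seconds=ex["timeout_seconds"])
        if now > deadline:
            ex["status"] = "failed"
            ex["error"] = "abandoned: exceeded timeout with no progress"
    return executions
```

Cancelling from the dashboard has the same net effect on the record, just without waiting for the sweep.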
Excessive retries consuming resources
If a task with a short retry delay is failing repeatedly, it can create load on target devices and the Controller. Reduce the max_retries count or increase the retry_delay_seconds to space out attempts. For tasks that target unreachable devices, consider disabling the task until the connectivity issue is resolved.
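To judge how long a failing task can tie up resources, the worst-case wall-clock time per run follows directly from the three settings, assuming every attempt runs all the way to its timeout:

```python
def worst_case_run_seconds(max_retries, retry_delay_seconds, timeout_seconds):
    """Worst-case duration of one failing scheduled run: every attempt
    times out, with a fixed delay between consecutive attempts.
    Matches the retry timeline example earlier on this page."""
    return (max_retries * timeout_seconds
            + (max_retries - 1) * retry_delay_seconds)
```

With the defaults shown earlier (3 attempts, 60s delay, 300s timeout) that is 1020 seconds, i.e. the 02:00 to 02:17 window in the retry timeline example. Keep this figure well under the task's cron interval so runs cannot overlap.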
Execution logs missing device output
If the execution log shows the task completed but device outputs are empty, the SSH session may have disconnected before the output was captured. Check the Controller logs for SSH connection errors. Verify that the timeout_seconds is long enough for the command to complete and return output.
Export execution logs from the task detail view for offline analysis or to share with your team when investigating failures. The export includes all timestamps, device outputs, and error messages.
Related Features
Monitoring works alongside these NetStacks features:
- Scheduled Tasks — Create and configure the tasks that appear in the monitoring dashboard
- Method of Procedures (MOPs) — Multi-step procedures with phase-by-phase execution tracking
- Audit Logs — Complete audit trail of all automated operations and manual actions
- NOC Agents — AI agent tasks that appear in the monitoring dashboard alongside scheduled tasks
- Cron Expressions — Scheduling syntax used by all monitored scheduled tasks