<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>NetStacks Blog</title>
    <link>https://www.netstacks.net/blog</link>
    <description>Technical articles on network automation, terminal workflows, SSH security, and AI-assisted network engineering.</description>
    <language>en-us</language>
    <atom:link href="https://www.netstacks.net/blog/rss.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>AI Isn&apos;t Going to Fix Your BGP Convergence Time — But Here&apos;s What Will</title>
      <link>https://www.netstacks.net/blog/ai-isnt-fixing-your-bgp-converge-time</link>
      <description>The LLM hype promises autonomous networks that self-heal. The reality is messier. What network engineers actually need isn&apos;t autonomous AI — it&apos;s an AI that thinks with you when things break.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
      <guid>https://www.netstacks.net/blog/ai-isnt-fixing-your-bgp-converge-time</guid>
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[When was the last time an AI fixed your BGP convergence problem without you asking? Never. That's the gap between what vendors sell and what network engineers actually need.

The AI networking conversation is broken. Vendors show demos where a chatbot analyzes an outage, generates a fix, and pushes it — all while the audience nods politely and nobody asks what happens when it's wrong. Real networks don't work that way. Nobody pushes an AI-suggested BGP config change at 2 AM without understanding exactly what it does and why.

**The Convergence Problem No One Talks About**

When a BGP session drops across a multi-path fabric, you don't need a chatbot to tell you "BGP session went down." You need: every affected peer identified, the exact point of failure isolated across AS boundaries, and the downstream impact quantified across your VRFs. And you need it in the 90 seconds before your monitoring alert becomes an outage ticket.

What happens in reality? You're juggling three SSH sessions, a show ip bgp summary that's still rendering, and a Slack thread where someone is asking "is it the fiber or the configuration?" Your legacy terminal gives you a blinking cursor. Your NMS shows you a red dot. Nobody is helping.

**Where NetStacks Actually Helps**

NetStacks doesn't try to be your autonomous NOC replacement. It does something more practical: when you're troubleshooting, it reads the CLI output you've already gathered, understands protocol state, and suggests the next diagnostic step. It's the difference between staring at truncated show bgp neighbors output and having someone immediately point out that the Hold Timer expired on one side because of asymmetric MTU on the transit link.

The multi-send mode is where this gets powerful. Broadcast a show bgp summary to 200 devices simultaneously, and the AI immediately flags which neighbors are in Active state, flags which have zero prefixes received, and correlates those findings with your topology map so you can see the outage boundary visually. Five minutes of gathering output becomes 30 seconds.
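
To make the flagging step concrete, here is a minimal sketch of the same idea in plain Python. It assumes an IOS-style `show ip bgp summary` table with IPv4 neighbors, where the last column holds either a session state or a received-prefix count. The function name and the sample output are illustrative, not NetStacks internals.

```python
import re

# Matches one neighbor row of an IOS-style "show ip bgp summary" table.
# The column layout is an assumption; other platforms format this differently.
ROW = re.compile(r"^(?P<peer>\d+\.\d+\.\d+\.\d+)\s+\d+\s+\d+\s+.*\s(?P<state>\S+)$")

SAMPLE = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.1        4 65001    1200    1180       42    0    0 1d02h           150
10.0.0.2        4 65002       0       0        1    0    0 never    Active
10.0.0.3        4 65003     900     880       42    0    0 00:12:01          0
"""

def flag_neighbors(summary_text):
    """Return (peer, reason) pairs for neighbors that need attention."""
    flagged = []
    for line in summary_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, blank line, or a row this sketch does not understand
        state = match.group("state")
        if state.isdigit():
            # A number in the last column is the received-prefix count.
            if int(state) == 0:
                flagged.append((match.group("peer"), "established, zero prefixes"))
        else:
            # Any non-numeric value (Active, Idle, Connect) means not established.
            flagged.append((match.group("peer"), f"session state {state}"))
    return flagged
```

Running `flag_neighbors(SAMPLE)` returns the 10.0.0.2 and 10.0.0.3 rows and skips the healthy peer. Run it across the output of 200 devices and you have the raw material for an outage boundary.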

**The Automation Gap Nobody Admits**

Here's the uncomfortable truth: most network teams that claim to be doing "AI-driven automation" are running scheduled scripts that collect show commands, parse the output with regex, and dump it into a dashboard nobody looks at until Something Is Red appears. That's monitoring, not automation. And the AI piece is usually a bolt-on chatbot that you query separately — another tab, another context switch, another workflow.

The gap isn't in the AI models. GPT-4 already understands BGP better than most junior engineers. The gap is in the workflow integration. The moment between seeing a problem and knowing what command to run next is where engineers lose time. That's what needs to be closed — not with autonomous AI, but with AI that operates in the same terminal session where you're already working.

**What to Actually Do About Convergence Time**

1. Pre-stage diagnostic templates. When BGP drops, you should already have a Jinja2 template that renders every relevant show command for every affected device, with the results collected into a structured report. No typing during the outage.

2. Use multi-send to run them. NetStacks lets you define these templates once and broadcast them across device groups with a single click. The results come back in real time, side-by-side in split panes.

3. Let AI highlight the anomalies. Once you have the output, select the confusing bits. The AI assistant compares the output against what it knows about BGP state machines and tells you exactly which neighbor is stuck and why. Not after you finish troubleshooting — during.

4. Map it visually. The topology view shows you the outage boundary. Double-click a device to SSH into it. No context switching between your NMS dashboard and your terminal.
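
Step 1 is the one you can start on with nothing but a text editor. The sketch below shows the shape of a pre-staged diagnostic set. It uses `string.Template` from the Python standard library as a stand-in for Jinja2 so it runs without dependencies. The command list and the `render_diagnostics` helper are hypothetical and would need adjusting per platform.

```python
from string import Template

# Hypothetical pre-staged BGP diagnostic set. The commands are illustrative
# and will differ between vendors.
BGP_DIAGNOSTICS = [
    Template("show bgp neighbors $peer"),
    Template("show ip route $peer"),
    Template("show logging | include $peer"),
]

def render_diagnostics(devices):
    """Expand the templates into a per-device command list.

    `devices` maps a hostname to the peer address under investigation.
    """
    return {
        host: [template.substitute(peer=peer) for template in BGP_DIAGNOSTICS]
        for host, peer in devices.items()
    }

plan = render_diagnostics({"dc1-spine-01": "10.0.0.2"})
```

Here `plan` holds three ready-to-send commands for dc1-spine-01. The point is that the thinking happened before the outage, not during it.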

This isn't autonomous networking. It's augmented networking. And it's what you can implement today with tools you already have access to.

The teams that will actually move the needle aren't the ones waiting for fully autonomous AI. They're the ones who figure out how to close the gap between gathering data and understanding it. That gap is measured in minutes during an outage — and minutes compound into hours per engineer per week that you're never getting back.]]></content:encoded>
      <category>AI</category>
      <category>BGP</category>
      <category>Automation</category>
    </item>
    <item>
      <title>Running Methods of Procedure in Production Without Praying: The MOP Pattern That Actually Works</title>
      <link>https://www.netstacks.net/blog/method-of-procedures-without-the-risk</link>
      <description>Post-mortems love MOPs because they&apos;re supposed to prevent exactly the disasters that keep happening. The problem isn&apos;t the concept — it&apos;s that most MOP tooling makes them painful enough that teams skip them.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
      <guid>https://www.netstacks.net/blog/method-of-procedures-without-the-risk</guid>
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[Every post-mortem includes the phrase "no MOP was followed" or "the MOP was incomplete." MOPs — Methods of Procedure — are supposed to be the structured, pre-approved, step-by-step change plans that prevent production disasters. And somehow, they fail every single time.

The problem isn't that network engineers don't understand the need for structured change procedures. The problem is that most MOP tooling turns them into 40-line Word documents that nobody reads, that no one can execute directly, and that provide zero safety net when step 23 produces output the author didn't anticipate.

**Why MOPs Fail in the Real World**

A proper MOP has three sections that most tools get wrong:

Pre-checks verify the current state matches your assumptions before you touch anything. Is the BGP session up? Are the routes present? Is the target device reachable? If any pre-check fails, the MOP should stop immediately — but most teams run these manually and just "assume things are fine."

Change steps are the actual configuration commands. They need to be sequenced, validated, and — critically — each step needs to define what success looks like. "Add the route-map" isn't a complete step. "Add the route-map and verify it appears in show run | section route-map" is.

Post-checks confirm the change achieved what it was supposed to. Not "the device is reachable" — "the new next-hop is installed in the FIB and traffic is flowing through it."

The gap between what a MOP should be and what most teams actually use is enormous. Most MOPs live in Confluence, get copy-pasted into SSH sessions, and have zero programmatic validation of results.

**The Approval Workflow Nobody Builds**

Here's a gap most automation platforms ignore: MOPs need approval workflows. Not just "someone reviewed the Word doc." Actual, programmatic, pre-change approval gates where:

- The MOP author defines who needs to approve
- Approvers review it against the change window
- Each step requires sign-off before execution
- Rollback is pre-approved and pre-defined

Without this, you're just running scripts. The MOP isn't a script — it's a governance boundary that protects production.
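
A programmatic gate does not have to be elaborate. The following is a minimal sketch, assuming approval is tracked per MOP rather than per step. The `ApprovalGate` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ApprovalGate:
    """Tracks sign-off for one MOP. All names here are hypothetical."""
    required_approvers: set
    rollback_defined: bool
    approvals: set = field(default_factory=set)

    def approve(self, approver):
        # Only people the MOP author designated may sign off.
        if approver not in self.required_approvers:
            raise PermissionError(f"{approver} is not a designated approver")
        self.approvals.add(approver)

    def may_execute(self):
        # Execution is allowed only when every required approver has signed
        # off and a rollback plan exists.
        return self.rollback_defined and self.approvals >= self.required_approvers
```

The executor calls `may_execute()` before it touches a device. Per-step sign-off would extend the same idea with one gate per step.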

**How NetStacks Handles This Differently**

NetStacks treats MOPs as first-class automation objects, not scripts or documents. A MOP in NetStacks has structured pre-checks that run automatically and halt execution if conditions don't match. Each change step has its own success criteria that get evaluated against the device output in real time. Post-checks aren't an afterthought; they're baked into the execution flow.

The approval workflow means that before any MOP runs in production, the right people review and authorize it. No bypassing, no "I'll just run it quickly." The system enforces the process.

And when something goes wrong — because something always goes wrong — the rollback is already defined and tested. You don't figure out how to undo the change while the NOC is watching. You hit rollback and the MOP executes the reverse steps in order.

**The MOP Pattern Teams Should Copy**

Even if you're not using NetStacks, here's the pattern every team should adopt:

1. Every MOP gets pre-checks, change steps with success criteria, post-checks, and rollback — all structured, not free-text.

2. Pre-checks run automatically. If BGP isn't established on the peer you're modifying, the MOP stops. No human judgment call needed.

3. Each change step defines its own validation. After "ip route 10.0.0.0/8 192.168.1.1", the MOP immediately runs "show ip route 10.0.0.0/8" and verifies the next-hop.

4. Post-checks are independent verification, not just re-running the change command. They answer: "Is traffic actually flowing the way we intended?"

5. Rollback is defined before execution, not after failure. Every successful MOP should end with "post-check passed, no rollback needed."
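
The five rules above fit in a surprisingly small amount of control flow. This is a rough sketch, not how any particular product implements it. The `run` callable stands in for whatever transport sends a command and returns its output, and every check is a plain predicate on that output.

```python
def run_mop(run, pre_checks, steps, post_checks, rollback):
    """Execute a MOP and return a status string.

    run(command) sends a command and returns its output (hypothetical transport).
    pre_checks and post_checks are lists of (command, predicate) pairs.
    steps is a list of (change_command, verify_command, predicate) triples.
    rollback is a list of commands, defined before execution starts.
    """
    # Rule 2: pre-checks run automatically and stop the MOP on failure.
    for command, ok in pre_checks:
        if not ok(run(command)):
            return "aborted: pre-check failed, nothing changed"

    def undo():
        # Simplification: replays the whole rollback list regardless of
        # how many change steps were applied.
        for command in rollback:
            run(command)

    # Rule 3: every change step is followed by its own validation.
    for change, verify, ok in steps:
        run(change)
        if not ok(run(verify)):
            undo()
            return "rolled back: step validation failed"

    # Rule 4: post-checks are a separate, independent verification.
    for command, ok in post_checks:
        if not ok(run(command)):
            undo()
            return "rolled back: post-check failed"
    return "post-check passed, no rollback needed"
```

Because `rollback` is a required argument, the MOP cannot start without one. That is rule 5 enforced by the function signature instead of by a reviewer.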

The teams that take MOPs seriously don't just write better documents. They build enforcement into the tooling so the process can't be skipped. That's the difference between "we have MOPs" and "our MOPs actually prevent outages."]]></content:encoded>
      <category>Automation</category>
      <category>MOPs</category>
      <category>Change Management</category>
    </item>
    <item>
      <title>Your NetBox Isn&apos;t a Source of Truth (Yet) — Here&apos;s the Missing Piece Every Team Skips</title>
      <link>https://www.netstacks.net/blog/netbox-isnt-a-source-of-truth-yet</link>
      <description>NetBox adoption is near-universal, but most teams use it as a CMDB with a pretty UI. The gap between having NetBox and actually using it as a source of truth is smaller than you think — if you close one specific gap.</description>
      <pubDate>Mon, 18 May 2026 00:00:00 GMT</pubDate>
      <guid>https://www.netstacks.net/blog/netbox-isnt-a-source-of-truth-yet</guid>
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[Almost every network team has NetBox installed. Far fewer actually use it as their source of truth. The gap between "we have NetBox" and "NetBox is our source of truth" is where most network automation projects die.

The difference isn't the data model. NetBox already has interfaces, IPs, prefixes, VLANs, VRFs, circuits, and device roles. The difference is in the validation loop between what NetBox says and what your network actually shows.

**The Drift Nobody Measures**

When was the last time you verified that NetBox accurately reflects your network? Not "NetBox is up to date because we follow process." I mean, actually SSH'd into devices and compared show run against the NetBox data.

Most teams can't answer that question because nobody has built the verification step. You enter data into NetBox, and it stays there. Network changes happen through SSH sessions, change tickets, and Slack messages. NetBox gets updated "when there's time" — which is never during an outage window or a busy sprint.

This creates silent drift. NetBox says an interface is in VLAN 100. It's actually in VLAN 200. The interface shows as "up" in NetBox. The SFP failed three weeks ago. You're rendering stack configs from NetBox data that hasn't matched reality since the last person who bothered to update it.

**The Integration Most Teams Build**

The standard NetBox integration pattern looks like this: build a script that pulls device data from NetBox, generates configs, and pushes them to devices. This works — until NetBox is wrong. And because there's no continuous verification, NetBox is wrong more often than anyone admits.

The missing piece is the reverse direction. Not just NetBox → device, but device → NetBox. Every time you connect to a device, you should be able to:

- Snapshot the current config
- Compare it against the expected state from NetBox
- Flag any deltas immediately
- Propose NetBox updates when the delta is intentional

This isn't NetBox polling or SNMP scraping. This is interactive, engineer-driven verification at the moment of connection — when you already have the session open and the data is right there.
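
The comparison itself is simple once both sides are in the same shape. Here is a minimal sketch, assuming you have already turned the NetBox record and the device output into dictionaries keyed by interface name. The `diff_state` function and the attribute names are illustrative only.

```python
def diff_state(expected, observed):
    """Compare NetBox-derived expected state with what the device reports.

    Both arguments map an interface name to a dict of attributes.
    Returns a list of (interface, attribute, expected_value, observed_value).
    """
    deltas = []
    for interface in sorted(set(expected) | set(observed)):
        want = expected.get(interface, {})
        have = observed.get(interface, {})
        for attribute in sorted(set(want) | set(have)):
            if want.get(attribute) != have.get(attribute):
                deltas.append(
                    (interface, attribute, want.get(attribute), have.get(attribute))
                )
    return deltas

netbox = {"Ethernet1": {"vlan": 100, "enabled": True}}
device = {"Ethernet1": {"vlan": 200, "enabled": True}}
```

With this sample data, `diff_state(netbox, device)` reports a single delta: Ethernet1 is recorded in VLAN 100 but configured in VLAN 200. An interface that exists on only one side shows up with `None` for the missing values.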

**How NetStacks Closes This Loop**

NetStacks integrates with NetBox at the session level, not the config-generation level. When you connect to a device that exists in your NetBox instance:

- The device info from NetBox appears alongside your terminal session
- You can run a config snapshot and diff it against what NetBox expects
- The topology view uses NetBox data to render your network, and you can verify it live
- When NetBox is wrong, you flag it immediately — not as a separate "update NetBox" ticket

The crawler feature takes this further. It discovers devices across your network, maps their connections, and builds a topology that you can compare against NetBox. The drift shows up as a visible gap between what you discovered and what NetBox says — making it impossible to ignore.

**The Pattern Every Team Should Adopt**

1. Connect NetBox to your terminal workflow. If your SoT data isn't visible at the point of connection, it's just a database — not truth.

2. Run device discovery independently of NetBox. Compare the discovered topology against NetBox. The delta is your drift measurement.

3. Snapshot configs during every session. Not scheduled — during the session when you're already there. Compare against NetBox-derived expected state.

4. Make drift visible and actionable. If NetBox says VLAN 100 and the device says VLAN 200, the delta should be surfaced immediately, not discovered during an outage.

5. Treat NetBox as a living system. The moment you start treating it as "something we update quarterly," it becomes a historical record, not a source of truth.
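
Step 2 reduces to a set comparison. The sketch below assumes each link is a pair of (device, interface) endpoints, however you obtained them. The function names are hypothetical.

```python
def normalize(links):
    """Treat a link as unordered: (A, B) and (B, A) are the same cable."""
    return {frozenset(link) for link in links}

def topology_drift(discovered, recorded):
    """Split the difference between discovery results and NetBox records."""
    found, documented = normalize(discovered), normalize(recorded)
    return {
        "undocumented": found - documented,  # on the network, missing from NetBox
        "missing": documented - found,       # in NetBox, not seen on the network
    }
```

The size of those two sets is your drift measurement. Track it over time and you know whether the feedback loop is working.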

The teams who successfully use NetBox as an SoT don't have better discipline. They have better tooling that makes verification effortless. The gap isn't in the data model — it's in the feedback loop between what's recorded and what's real.]]></content:encoded>
      <category>NetBox</category>
      <category>SoT</category>
      <category>Automation</category>
    </item>
    <item>
      <title>Why Your Terminal Is Your Biggest Automation Bottleneck (And Nobody Realizes It)</title>
      <link>https://www.netstacks.net/blog/why-your-terminal-is-your-biggest-bottleneck</link>
      <description>Teams invest in Ansible, Python, NetBox, Windmill. But the terminal — where 80% of network engineering actually happens — is still PuTTY from 2019. Here&apos;s why that matters more than you think.</description>
      <pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate>
      <guid>https://www.netstacks.net/blog/why-your-terminal-is-your-biggest-bottleneck</guid>
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[Network teams have invested millions in automation infrastructure. Ansible Tower, Python libraries, NetBox instances, Windmill platforms, CI/CD pipelines for configs. And yet, the moment something breaks, everyone opens a terminal and starts typing show commands.

The terminal is where 80% of daily network engineering happens. It's where outages are investigated, changes are verified, and knowledge is captured. And for most teams, the terminal experience hasn't meaningfully changed since PuTTY in 2005.

**The Terminal Gap Nobody Measures**

Think about your current terminal workflow. How many SSH sessions are open right now? Where does the output go when you need to reference it later? How do you organize sessions across projects, sites, and device types? When you need to run the same command across 20 devices, what's the process?

Most engineers use a combination of PuTTY, SecureCRT, and Terminal.app, with saved session files scattered across folders. Output is captured in local log files (if capture is even enabled), which are never searched, never correlated, and almost never shared with the team. When someone asks "what did you see on that device?", you're scrolling through a local log file and hoping you captured enough.

This isn't a minor inconvenience. It's a structural bottleneck that affects every aspect of network operations:

- Knowledge capture happens in ad-hoc text files that nobody else can access
- Multi-device operations require manual session switching and copy-paste
- Outage investigation involves juggling 5-10 terminal windows with no correlation
- Change verification is manual — you run the command and visually check the output
- No structured link between what you see in the terminal and what's in your automation platform

**The Credential Problem Hidden in Plain Sight**

Here's a security issue most teams tolerate: engineers store device passwords locally. In PuTTY saved sessions, in SecureCRT connection files, in shell profiles, on sticky notes. Every laptop is a credential leak waiting to happen. When an engineer leaves, you're hoping they deleted their saved passwords.

The terminal should never hold credentials. The platform should. The terminal should authenticate through the platform, and the credentials should never leave the server-side vault. This isn't theoretical — it's how every other cloud service works. SSH sessions are the exception, and only because the tooling has never provided a better alternative.

**How NetStacks Addresses This**

NetStacks approaches the terminal as an engineering platform, not a connection utility. The differences are practical, not cosmetic:

Sessions are organized, not scattered. Folders, tags, and search across your entire device library. Every session is tracked, tagged, and searchable. When you need to find what happened on dc1-spine-01 last Tuesday, it's one search away — not a grep through three months of log files.

Multi-send eliminates the session juggling. Run the same command across a device group simultaneously. Results come back in real time, side-by-side. No copy-paste, no session switching, no "let me check each one."
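
For a sense of what the fan-out involves, here is a minimal sketch using a thread pool from the Python standard library. It is not the NetStacks implementation. The `send` callable is a hypothetical transport function, and any SSH library could sit behind it.

```python
from concurrent.futures import ThreadPoolExecutor

def multi_send(devices, command, send, max_workers=20):
    """Run one command on many devices concurrently.

    send(device, command) is a hypothetical transport function that returns
    the command output. Returns {device: output or the exception raised}.
    """
    def attempt(device):
        try:
            return device, send(device, command)
        except Exception as exc:
            # One unreachable device must not sink the whole batch.
            return device, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(attempt, devices))
```

Failures come back as values in the result rather than as a crash, so a single timeout shows up next to the successful outputs instead of hiding them.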

The workspace is integrated. File explorer, code editor with Python/YANG/XML support, integrated Git, and terminal sessions — all in one window. When you're debugging a Jinja2 template for config generation, you're not switching between a text editor, a file browser, and three SSH sessions. They're all there.

Credentials never touch your laptop. The Controller owns all credentials and proxies every SSH connection. No saved passwords, no credential files, no "I'll just store it in my config for now." This is the security model every platform should have.

AI assistance is built into the terminal, not bolted on as a separate tool. When CLI output is confusing, you select it and get an explanation in context. Not a chatbot in another tab — the assistant reads the output you've already gathered and tells you what matters about it.

**The Real Impact**

This isn't about making the terminal prettier. It's about closing the gap between the tools engineers use for investigation (terminals) and the tools teams use for automation (platforms). When these are disconnected, you get the pattern every team knows: build automation in the platform, debug in the terminal, update the platform based on what you found, and hope someone documents it.

NetStacks puts the investigation and automation tools in the same workflow. Terminal sessions, file management, config editing, AI assistance, topology visualization — all in one application. The gap between "I found the problem" and "I fixed the automation" shrinks from hours to minutes.

The question isn't whether your team needs a better terminal. It's how much time your team loses every week because the terminal experience hasn't evolved.]]></content:encoded>
      <category>Terminal</category>
      <category>Engineering</category>
      <category>Workflow</category>
    </item>
    <item>
      <title>The SSH CA Pattern Every Network Team Needs But Nobody Implements</title>
      <link>https://www.netstacks.net/blog/the-ssh-ca-pattern-every-team-needs</link>
      <description>Individual SSH keys are manageable. Individual SSH keys across 200 engineers, 5000 devices, and 17 automation platforms are not. SSH certificates aren&apos;t a nice-to-have anymore — they&apos;re the only way to scale access control.</description>
      <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
      <guid>https://www.netstacks.net/blog/the-ssh-ca-pattern-every-team-needs</guid>
      <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[If your network team manages device access by distributing SSH keys, you're already in technical debt. Not "legacy system that works fine" debt. "Every departing engineer could have taken a copy of every device key they ever received" debt.

SSH key management at scale breaks down into predictable stages that every team goes through:

Stage 1: "Just add my public key to authorized_keys on the jumpbox." This works for three engineers across ten devices. Then you get to eleven.

Stage 2: "We need a shared key management system." Someone sets up a script that pushes keys from a central list. It works until someone leaves and you need to rotate all their keys across every device. That's a weekend project.

Stage 3: "We need certificates." Someone reads about SSH CAs and it makes perfect sense. Sign once, trust everywhere. Expire automatically. Principals and extensions for fine-grained control. It's the right solution. Nobody implements it because the infrastructure is complex and the migration from key-based access is painful.

Stage 4: "We have SSH CAs now." This is what happens when the platform handles certificate issuance, signing, and distribution transparently. Engineers connect through the platform, get short-lived certificates, and never manage keys directly. This is where most teams want to be and very few actually are.

**The SSH CA Model That Actually Works**

The certificate-based approach should look like this from the engineer's perspective: they open their terminal, select a device, and connect. Behind the scenes, the platform requests a short-lived SSH certificate from an internal CA, the CA signs a certificate scoped to the engineer's identity and the target device, and the SSH connection uses that certificate. No key management, no authorized_keys files, no rotation projects.

The certificate should be short-lived. Hours, not months. If an engineer's access is revoked, their existing certificates expire naturally, with no forced key rotation across every device. The CA should enforce principal restrictions. An engineer authorized only for the spine layer shouldn't be able to obtain a certificate scoped to core routers.
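
With OpenSSH tooling, both properties map onto flags of `ssh-keygen`: `-V` sets the validity interval and `-n` lists the principals. The sketch below only builds the command line and does not run it. The `signing_command` helper, the file names, and the principal names are hypothetical, and a real CA would keep its private key somewhere better protected than a local file.

```python
def signing_command(ca_key, user_pubkey, identity, principals, hours=8):
    """Build the ssh-keygen argv that signs a user key with a CA key."""
    if not principals:
        # Design choice of this sketch: never issue a certificate without
        # explicit principals.
        raise ValueError("refusing to sign a certificate with no principals")
    return [
        "ssh-keygen",
        "-s", ca_key,                # CA private key used for signing
        "-I", identity,              # key identity, recorded in server logs
        "-n", ",".join(principals),  # principals the certificate is valid for
        "-V", f"+{hours}h",          # validity: from now until now + hours
        user_pubkey,
    ]
```

For example, `signing_command("ca_key", "alice.pub", "alice@example", ["spine-admin"], hours=4)` yields a command that produces a certificate valid for four hours and only for the `spine-admin` principal. How a principal maps to a device layer is decided on the server side and varies by platform.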

**What NetStacks Does Here**

NetStacks implements the SSH CA pattern at the platform level. The Controller acts as the certificate authority, issuing short-lived certificates scoped to the engineer's identity, role, and authorized device scope. Credentials never leave the server. Engineers authenticate to the Controller, which issues the certificate and proxies the connection.

This means:

- No SSH keys on engineer laptops. Zero. The Controller handles authentication and certificate issuance.
- Every connection is authenticated, authorized, and audited through the platform.
- RBAC determines who can connect to which devices, when, and with what privileges.
- Audit logging captures every session: not just the connection event, but the full session transcript.

When an engineer leaves, you revoke their platform access. Their certificates expire. There are no device-side keys to rotate.

**The Multi-Vendor Complication**

Here's where the SSH CA gets complicated: every vendor implements SSH differently. IOS-XR has different SSH server behaviors than Juniper, which differs from Arista's EOS, which differs from Linux-based NOSes. Certificate validation, key exchange algorithms, and cipher preferences all vary.

The SSH CA needs to understand these differences. A certificate that works on IOS-XR might fail on Juniper because of different key algorithm expectations. The CA needs to issue certificates that each platform's SSH server will accept. This is invisible to the engineer but critical for the platform.

**The Team That Should Adopt This**

Any team with more than 10 engineers and more than 50 devices should be on certificate-based SSH access. If you're still using key-based authentication, you're one departing engineer away from a key rotation project that nobody wants to do.

The migration path: keep the SSH CA alongside existing key-based access during the transition. Engineers authenticate through the platform, the CA issues certificates, and the old keys remain as a fallback. Once everyone is using the platform, disable key-based access entirely. Audit the transition. Confirm no device still has engineer keys in authorized_keys.
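
The final audit is easy to script for hosts that use OpenSSH-style authorized_keys files. This is a rough sketch under two assumptions: CA trust anchors are marked with the `cert-authority` option, and option strings contain no quoted spaces. Network operating systems that store keys differently need their own check. The `leftover_keys` name and the sample entries are hypothetical.

```python
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")

def leftover_keys(authorized_keys_text):
    """Return entries that are plain user keys rather than CA trust anchors."""
    leftovers = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        first = line.split()[0]
        # If the first field is not a key type, it is the options field.
        has_options = not first.startswith(KEY_TYPE_PREFIXES)
        if has_options and "cert-authority" in first.split(","):
            continue  # a CA entry, expected after the migration
        leftovers.append(line)
    return leftovers
```

An empty result from every device is the signal that key-based access can be switched off.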

This is the security model that scales. Not because it's complex, but because it eliminates the thing that makes SSH access unmanageable at scale: the multiplicative growth of key-device relationships (every engineer times every device) that no one can track or audit.]]></content:encoded>
      <category>Security</category>
      <category>SSH</category>
      <category>Access Control</category>
    </item>
  </channel>
</rss>