NetStacks

Web Sources

Teams · Enterprise

Automatically crawl and ingest web-based documentation into the Knowledge Base for vector-searchable AI-powered retrieval.

Overview

Web Sources automatically crawl and ingest web-based documentation into the NetStacks Knowledge Base. Instead of manually uploading files or pasting URLs one at a time, you can point a web source at a documentation site and NetStacks will crawl it using breadth-first search (BFS), extract content, and index it for vector search. This is ideal for ingesting vendor documentation (Cisco, Juniper, Arista docs), internal wikis, and any web-accessible knowledge that your team references during troubleshooting.

Web Source Capabilities

  • Configurable Crawl Depth — Control how many link levels deep the crawler follows from the starting URL.
  • Maximum Page Limits — Set an upper bound on the number of pages crawled to avoid runaway ingestion.
  • URL Normalization — Trailing slashes, query parameters, and fragments are normalized to prevent indexing the same page twice.
  • Content Extraction — Navigation, headers, footers, and boilerplate are automatically stripped, leaving only the main content body.
  • Crash Recovery — If the crawler is interrupted (server restart, OOM), it resumes from where it left off on the next startup.
  • Scheduled Re-crawls — Web sources can be configured to re-crawl periodically, checking content hashes to re-process only changed pages.
  • Tag-Based Organization — Apply tags to web sources for filtering search results by vendor, technology, or topic.

Teams Feature

Web Sources are available on the Teams tier. They extend the Knowledge Base with automated web crawling, complementing file uploads and individual URL sources.

How It Works

Crawl Pipeline

When you create a web source, the crawler executes a multi-stage pipeline to discover, extract, and index content:

  1. Source Configuration — You provide a starting URL, maximum crawl depth, maximum page count, and optional tags for organizing the ingested content.
  2. BFS Crawl Loop — The crawler discovers pages starting from the root URL, following internal links up to the configured depth. Pages are visited in breadth-first order to prioritize content closest to the root.
  3. URL Normalization — URLs are normalized before processing. Trailing slashes, query parameters, and fragment identifiers are standardized to prevent indexing the same page under different URL variations.
  4. Content Extraction — Raw HTML is parsed and navigation elements, sidebars, headers, footers, and other boilerplate are stripped, leaving the main content body for indexing.
  5. Ingestion Pipeline — Extracted content flows into the standard Knowledge Base pipeline: chunking → embedding → pgvector storage. The same chunking strategies and embedding models used by other Knowledge Base sources apply.
  6. Crash Recovery — If the crawler is interrupted by a server restart, out-of-memory condition, or other failure, it persists its crawl frontier and resumes from where it left off on next startup. No pages are re-crawled unnecessarily.
  7. Re-crawl Schedule — Web sources can be configured to re-crawl periodically. During a re-crawl, the system checks content hashes for each page and only re-processes pages whose content has changed since the last crawl.
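The crawl loop in steps 2–3 can be sketched as a standard BFS over a URL frontier, bounded by depth and page count. This is an illustrative sketch, not the NetStacks implementation: `fetch_links` and `normalize` are hypothetical helpers standing in for the crawler's fetch and URL-normalization stages.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(start_url, max_depth, max_pages, fetch_links, normalize):
    """Breadth-first crawl bounded by depth and page count.

    fetch_links(url) -> hrefs found on the page (hypothetical helper)
    normalize(url)   -> canonical form of the URL (hypothetical helper)
    """
    start = normalize(start_url)
    frontier = deque([(start, 0)])   # (url, depth) pairs; FIFO order = BFS
    visited = {start}
    crawled = []
    while frontier and len(crawled) < max_pages:
        url, depth = frontier.popleft()
        crawled.append(url)
        if depth == max_depth:
            continue                 # do not follow links past the depth limit
        for href in fetch_links(url):
            # Resolve relative links, drop fragments, then normalize
            link = normalize(urldefrag(urljoin(url, href)).url)
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return crawled
```

Because the frontier is a FIFO queue, pages closest to the root are always processed first, which is what makes the `max_pages` cutoff favor top-level content.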

Crawl States

A web source transitions through the following states during its lifecycle:

pending → crawling → indexed (or error if the crawl fails). Progress is tracked with pages_crawled / max_pages counts, giving you real-time visibility into how far along the crawl is.

Crawl Depth Guidance

Start with a depth of 2–3 for most documentation sites. Well-structured documentation sites typically organize content within 2–3 levels of their root. Higher depths can rapidly expand the number of pages discovered, potentially crawling into irrelevant areas of the site.
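To see why depth matters, consider a rough upper bound: if each page links to about b previously unseen internal pages, a crawl of depth d can reach on the order of b^d pages. The branching factor of 20 below is an arbitrary assumption for illustration:

```python
# Rough upper bound on discoverable pages: with branching factor b,
# depth d can reach up to b^0 + b^1 + ... + b^d pages.
def max_pages_reachable(branching, depth):
    return sum(branching ** d for d in range(depth + 1))

for depth in (2, 3, 4, 5):
    print(depth, max_pages_reachable(20, depth))
```

With 20 links per page, depth 3 already bounds out at several thousand pages, which is why conservative depths plus a max page limit are recommended.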

Step-by-Step Guide

Workflow 1: Add a Web Source

  1. Navigate to AI > Knowledge Base > Web Sources tab — Open the Knowledge Base management page and select the Web Sources tab.
  2. Click Add Web Source — Opens the web source configuration form.
  3. Enter the starting URL — Provide the root URL of the documentation site you want to crawl (e.g., https://docs.arista.com/eos/latest/).
  4. Configure crawl parameters — Set the maximum crawl depth (e.g., 3) and maximum page count (e.g., 500).
  5. Add tags — Apply tags to organize the content for filtered search (e.g., "arista", "eos", "switching").
  6. Click Start Crawl — The crawler begins processing the site in the background. You can navigate away and check progress later.

Workflow 2: Monitor Crawl Progress

  1. View the Web Sources list — Each web source displays its current state (pending, crawling, indexed, or error), pages crawled count, and any errors encountered.
  2. Check progress — The pages crawled counter updates as the crawler discovers and processes new pages. Compare against the max pages limit to gauge completion.
  3. View detailed crawl log — Click on a web source to see a detailed log of crawled URLs, extraction results, and any errors for individual pages.
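The monitoring steps above can also be scripted against the status endpoint shown in the Code Examples section. This is a minimal polling sketch assuming the endpoint and fields documented on this page; `get_status` is injected as a callable so the loop can be exercised without a live Controller.

```python
import json
import time
import urllib.request

def fetch_source(base_url, source_id, token):
    """GET one web source record from the Controller API documented on this page."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/knowledge/web-sources/{source_id}",
        headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_crawl(source_id, get_status, poll_seconds=30):
    """Poll until the source leaves the pending/crawling states; return the final record."""
    while True:
        source = get_status(source_id)
        if source["state"] not in ("pending", "crawling"):
            return source            # "indexed" or "error"
        time.sleep(poll_seconds)
```

Usage would be `wait_for_crawl("ws-a1b2c3d4", lambda sid: fetch_source(base, sid, token))`, leaving the network call swappable for testing.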

Workflow 3: Re-crawl an Existing Source

  1. Select a web source — From the Web Sources list, click on the source you want to update.
  2. Click Re-crawl — The crawler revisits all previously discovered URLs and checks for content changes using content hash comparison.
  3. Review results — Only pages with changed content are re-processed through the ingestion pipeline. Unchanged pages retain their existing embeddings.
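The change-detection step can be sketched with content hashes over the extracted page body. SHA-256 is an assumed choice here; the actual hash algorithm NetStacks uses is not specified on this page.

```python
import hashlib

def content_hash(page_text):
    """Hash the extracted page body (after boilerplate stripping). SHA-256 is assumed."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def pages_to_reprocess(previous_hashes, current_pages):
    """Return URLs whose extracted content changed since the last crawl.

    previous_hashes: {url: hash} recorded during the prior crawl
    current_pages:   {url: extracted_text} from the re-crawl
    """
    changed = []
    for url, text in current_pages.items():
        if previous_hashes.get(url) != content_hash(text):
            changed.append(url)      # page is new or its content changed
    return changed
```

Only the URLs returned here would flow back through the chunking and embedding pipeline; everything else keeps its existing embeddings.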

Large Sites

Be cautious with high depth and page limits on large sites. A depth of 5 on a site like docs.cisco.com could discover thousands of pages. Start with conservative settings and increase as needed.

Code Examples

Create a Web Source

Add a new web source to the Knowledge Base and start crawling:

create-web-source.sh
# Create a web source for Arista EOS documentation
curl -X POST https://controller.example.com/api/v1/knowledge/web-sources \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.arista.com/eos/latest/",
    "max_depth": 3,
    "max_pages": 500,
    "tags": ["arista", "eos", "switching"]
  }'

# Example response
{
  "id": "ws-a1b2c3d4",
  "url": "https://docs.arista.com/eos/latest/",
  "max_depth": 3,
  "max_pages": 500,
  "tags": ["arista", "eos", "switching"],
  "state": "pending",
  "pages_crawled": 0,
  "created_at": "2026-03-25T10:30:00Z"
}

Check Crawl Status

Query the status of an active or completed crawl:

check-crawl-status.sh
# Check the status of a web source crawl
curl -s https://controller.example.com/api/v1/knowledge/web-sources/ws-a1b2c3d4 \
  -H "Authorization: Bearer $TOKEN" | jq '.'

# Example response (crawl in progress)
{
  "id": "ws-a1b2c3d4",
  "url": "https://docs.arista.com/eos/latest/",
  "max_depth": 3,
  "max_pages": 500,
  "tags": ["arista", "eos", "switching"],
  "state": "crawling",
  "pages_crawled": 127,
  "errors": [],
  "started_at": "2026-03-25T10:30:05Z",
  "updated_at": "2026-03-25T10:42:18Z"
}

List All Web Sources

Retrieve all configured web sources and their current status:

list-web-sources.sh
# List all web sources
curl -s https://controller.example.com/api/v1/knowledge/web-sources \
  -H "Authorization: Bearer $TOKEN" | jq '.'

# Example response
[
  {
    "id": "ws-a1b2c3d4",
    "url": "https://docs.arista.com/eos/latest/",
    "state": "indexed",
    "pages_crawled": 487,
    "max_pages": 500,
    "tags": ["arista", "eos", "switching"]
  },
  {
    "id": "ws-e5f6g7h8",
    "url": "https://www.juniper.net/documentation/junos/",
    "state": "crawling",
    "pages_crawled": 203,
    "max_pages": 1000,
    "tags": ["juniper", "junos", "routing"]
  }
]

Trigger a Re-crawl

Re-crawl an existing web source to pick up content changes:

trigger-recrawl.sh
# Trigger a re-crawl of an existing web source
curl -X POST https://controller.example.com/api/v1/knowledge/web-sources/ws-a1b2c3d4/crawl \
  -H "Authorization: Bearer $TOKEN"

# The web source state transitions back to "crawling"
# Only pages with changed content hashes are re-processed

Delete a Web Source

Remove a web source and all of its indexed content from the Knowledge Base:

delete-web-source.sh
# Delete a web source and its indexed content
curl -X DELETE https://controller.example.com/api/v1/knowledge/web-sources/ws-a1b2c3d4 \
  -H "Authorization: Bearer $TOKEN"

# Returns 204 No Content on success
# All associated document chunks and embeddings are removed

Questions & Answers

What sites can be crawled?
Any publicly accessible or internally reachable website. The crawler respects robots.txt directives. For authenticated sites that require login, use the file upload or individual URL source types instead, as the crawler does not handle authentication.
How deep should I crawl?
A depth of 2–3 is usually sufficient for well-structured documentation sites. Higher depths may crawl into irrelevant pages such as changelogs, archive pages, or community forums. Start with a conservative depth and increase if you find that important content is missing from search results.
Is there rate limiting?
Yes. The crawler uses configurable delays between requests (default 1 second) to avoid overwhelming target servers. This ensures responsible crawling behavior and reduces the risk of being blocked by the target site's rate limiting or firewall rules.
What about JavaScript-rendered sites?
The crawler fetches raw HTML and does not execute JavaScript. Single-page applications (SPAs) and sites that rely heavily on client-side rendering may not extract content correctly. For these sites, consider exporting the documentation to PDF and using the file upload source type, or adding individual pages via the URL source type.
Can I crawl internal wikis?
Yes, if the Controller can reach the wiki server over the network. Configure network access and firewall rules accordingly. The crawler does not handle authentication, so the wiki must be accessible without login from the Controller's network location.
How are duplicate pages avoided?
The crawler normalizes URLs before processing to eliminate duplicates caused by trailing slashes, query parameters, and fragment identifiers. Additionally, content hashes are computed for each page to detect duplicate content served under different URLs.
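A minimal sketch of the kind of normalization described here, using Python's urllib.parse. Dropping query strings entirely is a simplifying assumption for illustration; the actual crawler may preserve query parameters that are meaningful to the site.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Illustrative normalization: drop fragments and query strings,
    lowercase the host, and strip trailing slashes from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))
```

Under this scheme, `https://docs.example.com/eos/`, `https://docs.example.com/eos#intro`, and `https://Docs.Example.com/eos` all collapse to the same canonical URL and are crawled once.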
What happens if the crawl is interrupted?
The crawler supports crash recovery. If the process is interrupted by a server restart, out-of-memory condition, or other failure, it persists its crawl frontier (the set of discovered but unvisited URLs) and resumes from where it left off on next startup. Pages that were already processed are not re-crawled.
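Frontier persistence of this kind can be sketched as an atomic checkpoint file. The file name and JSON layout below are hypothetical, not the crawler's actual on-disk format; the key idea is that the write is atomic, so a crash never leaves a half-written checkpoint.

```python
import json
import os

FRONTIER_FILE = "crawl_frontier.json"   # hypothetical checkpoint path

def save_checkpoint(frontier, visited, path=FRONTIER_FILE):
    """Persist the crawl frontier so an interrupted crawl can resume."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"frontier": frontier, "visited": sorted(visited)}, f)
    os.replace(tmp, path)               # atomic rename: no half-written file

def load_checkpoint(path=FRONTIER_FILE):
    """Resume from a previous checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return [], set()
    with open(path) as f:
        state = json.load(f)
    return state["frontier"], set(state["visited"])
```

On restart, the crawler reloads the frontier (discovered but unvisited URLs) and the visited set, so already-processed pages are skipped rather than re-crawled.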

Troubleshooting

  • Crawl stuck in pending — Possible cause: the Controller cannot reach the target URL, or the crawler process hit an error. Solution: Check Controller logs for crawler errors, verify the URL is reachable by testing with curl from the Controller host, and check DNS resolution and firewall rules.
  • Content not extracting properly — Possible cause: a JavaScript-rendered site or non-standard HTML structure. Solution: Some sites use heavy JavaScript rendering that the crawler cannot process. Try reducing depth or switching to the individual URL source type for specific pages; for SPAs, consider exporting to PDF and using file upload instead.
  • Too many pages crawled — Possible cause: a high crawl depth or a large site with many internal links. Solution: Reduce max_depth or max_pages. The crawler follows all internal links within the configured depth, which can expand quickly on large documentation sites. Start with depth 2 and 200 pages, then increase as needed.
  • Duplicate content in search results — Possible cause: a URL normalization edge case, or content served under multiple paths. Solution: URL normalization should prevent most duplicates. If duplicates persist, the site may serve identical content under different URL paths; delete the web source and re-crawl, or use tags to filter search results more precisely.
  • Crawl errors on specific pages — Possible cause: pages returning non-200 status codes or timing out. Solution: Check the detailed crawl log for the web source to identify which pages are failing. Common causes include authentication walls, rate limiting by the target server, or pages that have been removed. The crawler continues past individual page errors.
  • Re-crawl re-processing all pages — Possible cause: content hash mismatches due to dynamic page elements. Solution: If the target site includes timestamps, session tokens, or other dynamic content in the page body, content hashes will differ on every crawl, causing all pages to be re-processed. The content extraction pipeline strips most boilerplate, but some dynamic elements may persist.

Related Features

  • Knowledge Base — The parent feature that Web Sources feed into. Manage all document sources, chunking strategies, and search capabilities.
  • LLM Configuration — Configure the embedding model used for vectorizing crawled content and the LLM providers that power RAG responses.
  • AI Chat — Uses Knowledge Base content (including web sources) through RAG to provide organization-specific answers grounded in your documentation.