LLM Configuration
Set up AI providers (OpenAI, Anthropic, Ollama), configure BYOK models, manage token budgets, and deploy self-hosted LLMs.
Overview
LLM Configuration controls how NetStacks connects to AI language model providers. It covers setting up providers, managing API keys securely, selecting models for different AI features, configuring token budgets to control costs, and deploying self-hosted models for maximum privacy.
Supported Providers
| Provider | Models | API Key Required | Notes |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4, GPT-4o-mini | Yes | Widely available, fast inference |
| Anthropic | Claude Sonnet, Claude Opus | Yes | Strong reasoning, recommended for agents |
| Ollama | Llama, Mistral, Qwen, and others | No | Self-hosted, data never leaves your network |
| OpenRouter | Access to many models via unified API | Yes | OpenAI-compatible API, model marketplace |
| Custom | Any OpenAI-compatible endpoint | Varies | Azure OpenAI, vLLM, text-generation-inference |
NetStacks uses a BYOK (Bring Your Own Key) model. You provide your own API keys for cloud providers, giving you full control over costs, rate limits, and data handling. API keys are stored encrypted in the credential vault and are never exposed to Terminal clients.
How It Works
Provider Abstraction Layer
NetStacks provides a unified interface across all LLM providers. AI features (Chat, Suggestions, Agents, Knowledge Base embeddings) interact with a single API, and the provider abstraction layer routes requests to the configured provider. This means you can switch providers without changing any other configuration.
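The routing behavior described above can be sketched as follows. This is a simplified illustration, not the NetStacks internals; the class, field, and function names are hypothetical:

```python
# Sketch of priority-based provider selection with failover
# (hypothetical names, not the actual NetStacks implementation).
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    priority: int   # lower numbers are tried first
    enabled: bool
    healthy: bool   # stands in for a real connectivity check

def route_request(providers: list[Provider]) -> str:
    """Return the name of the first enabled, reachable provider."""
    for p in sorted(providers, key=lambda p: p.priority):
        if p.enabled and p.healthy:
            return p.name
    raise RuntimeError("no enabled LLM provider is reachable")

providers = [
    Provider("OpenAI Fallback", priority=2, enabled=True, healthy=True),
    Provider("Anthropic Primary", priority=1, enabled=True, healthy=False),
]
print(route_request(providers))  # Anthropic is down, so "OpenAI Fallback" is chosen
```

Because callers only see the unified interface, swapping the priority-1 provider changes nothing else in the configuration.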
Provider Configuration
Each provider configuration includes:
- Provider Type — The provider platform (OpenAI, Anthropic, Ollama, OpenRouter, or Custom).
- Display Name — A human-readable name for identification (e.g., “OpenAI Production”, “Local Ollama”).
- API Key — Stored as an encrypted credential reference. The actual key is never exposed in API responses. Ollama without authentication does not require a key.
- Base URL — Custom API endpoint for Ollama, Azure OpenAI, or other self-hosted deployments.
- Default Model — The model used when no specific model is requested (e.g., gpt-4o, claude-sonnet-4-20250514, llama3.3).
- Priority — Failover ordering. Lower priority numbers are tried first. If the primary provider is unavailable, requests automatically route to the next enabled provider.
- Model Parameters — Max input tokens, max output tokens, temperature (0.0–2.0), and top_p for nucleus sampling.
Token Budget System
The token budget system helps control AI costs by setting monthly limits on token consumption:
- Soft Limit — A warning threshold. When usage exceeds the soft limit, administrators receive a notification but AI features continue to operate.
- Hard Limit — A strict cap. When usage reaches the hard limit, AI features are disabled until the next billing period or the limit is raised.
- Warning Threshold — A percentage of the hard limit (default 80%) at which a warning notification is sent.
- Usage Tracking — Every LLM request logs input tokens, output tokens, estimated cost, provider, model, feature (chat, suggestions, agents), and user. Usage can be viewed as time-series charts, broken down by feature, user, or provider.
- Projection — The system projects month-end usage based on current consumption rate, helping administrators anticipate budget overruns before they happen.
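The budget arithmetic can be sketched as a linear extrapolation from mid-month usage. This is an illustration only: the function name and the non-"ok" status strings are assumptions, not the NetStacks API:

```python
# Illustrative token-budget math (names are not part of the NetStacks API).
def budget_status(used: int, soft: int, hard: int, warn_pct: int,
                  day_of_month: float, days_in_month: int):
    usage_percent = round(100 * used / hard, 1)
    # Linear projection: assume the current daily rate holds for the month.
    projected = round(used * days_in_month / day_of_month)
    if used >= hard:
        status = "hard_limit_exceeded"   # assumed label
    elif usage_percent >= warn_pct:
        status = "warning"               # assumed label
    elif used >= soft:
        status = "soft_limit_exceeded"   # assumed label
    else:
        status = "ok"
    return usage_percent, projected, status

# 3.25M tokens consumed by day 12.5 of a 30-day month, 10M hard limit
pct, projected, status = budget_status(
    used=3_250_000, soft=5_000_000, hard=10_000_000,
    warn_pct=80, day_of_month=12.5, days_in_month=30)
print(pct, projected, status)  # 32.5 7800000 ok
```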
Model Routing
Different AI features have different requirements. You can assign different models to different features to optimize cost and performance:
| Feature | Recommended Model | Reason |
|---|---|---|
| Command Suggestions | GPT-4o-mini / Llama 3.3 | Speed matters more than reasoning depth |
| AI Chat | GPT-4o / Claude Sonnet | Balance of speed and accuracy |
| NOC Agents | Claude Sonnet / GPT-4o | Complex reasoning and multi-step investigation |
| Knowledge Base Embeddings | text-embedding-3-small | Optimized for vector search, low cost |
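A feature-to-model assignment like the table above amounts to a lookup with a fallback to the provider's default model. A minimal sketch, with illustrative model IDs and a hypothetical helper name:

```python
# Hypothetical per-feature model routing table; feature keys follow the
# table above, model IDs are examples only.
FEATURE_MODELS = {
    "suggestions": "gpt-4o-mini",
    "chat": "gpt-4o",
    "agents": "claude-sonnet-4-20250514",
    "embeddings": "text-embedding-3-small",
}
DEFAULT_MODEL = "gpt-4o"

def model_for(feature: str) -> str:
    """Fall back to the provider's default model for unmapped features."""
    return FEATURE_MODELS.get(feature, DEFAULT_MODEL)

print(model_for("agents"))     # claude-sonnet-4-20250514
print(model_for("summarize"))  # gpt-4o (default)
```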
Configuring LLM Providers
Follow these steps to configure LLM providers for your NetStacks deployment.
Terminal (Standalone Mode)
- Open Settings > AI Assistant — Access the AI configuration from the Terminal's settings panel.
- Select provider — Choose from OpenAI, Anthropic, Ollama, OpenRouter, or Custom.
- Enter API key — Paste your API key. For Ollama, enter the local endpoint URL instead (e.g., http://localhost:11434).
- Choose model — Select the default model for AI features.
- Test connection — Click “Test Connection” to verify the provider is reachable and the API key is valid.
- Save — Save the configuration. AI features are now active.
Controller (Enterprise Mode)
- Navigate to Admin > AI Configuration — Open the AI management page in the Controller admin UI.
- Add LLM provider — Click “Add Provider” and select the provider type.
- Configure connection — Enter the API key (stored encrypted in the credential vault), set the base URL if using a custom endpoint, and select the default model.
- Set priority — Assign a priority number for failover ordering (lower = higher priority). Configure multiple providers for automatic failover.
- Test connection — Verify the provider is reachable and the API key is valid.
- Configure token budgets — Set monthly soft and hard limits for token consumption. Configure the warning threshold percentage.
- Assign models to features — Optionally assign different models to different AI features (chat, suggestions, agents, embeddings) to optimize cost and performance.
Self-Hosted with Ollama
- Install Ollama — Follow the Ollama installation guide for your platform.
- Download a model — Run ollama pull llama3.3 to download a model.
- Configure in NetStacks — Add Ollama as a provider with the base URL set to your Ollama instance (e.g., http://localhost:11434 or http://ollama.internal:11434).
- Set default model — Enter the model name as shown by ollama list (e.g., llama3.3, mistral).
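Before saving the configuration, you can confirm the instance is reachable by querying Ollama's /api/tags endpoint, which lists locally available models. A sketch with a parsing helper of our own (not part of any NetStacks tooling), run here against a sample response body:

```python
# Parse an Ollama GET /api/tags response to list locally available models.
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Against a live instance you would fetch the body with, e.g.:
#   urllib.request.urlopen("http://localhost:11434/api/tags").read()
# Sample response body for illustration:
sample = '{"models": [{"name": "llama3.3:latest"}, {"name": "mistral:latest"}]}'
print(installed_models(sample))  # ['llama3.3:latest', 'mistral:latest']
```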
In Enterprise mode, API keys are stored encrypted in the Controller's credential vault and are never sent to Terminal clients. All AI requests are proxied through the Controller, ensuring API keys remain secure and centralized.
Code Examples
OpenAI Provider Configuration
{
"provider_type": "openai",
"name": "OpenAI Production",
"enabled": true,
"priority": 1,
"api_key_credential_id": "cred-a1b2c3d4-...",
"default_model": "gpt-4o",
"max_input_tokens": 128000,
"max_output_tokens": 4096,
"temperature": 0.7,
"top_p": 0.9
}
Anthropic Provider Configuration
{
"provider_type": "anthropic",
"name": "Anthropic - Agent Reasoning",
"enabled": true,
"priority": 2,
"api_key_credential_id": "cred-e5f6g7h8-...",
"default_model": "claude-sonnet-4-20250514",
"max_input_tokens": 200000,
"max_output_tokens": 8192,
"temperature": 0.3
}
Ollama Self-Hosted Configuration
{
"provider_type": "ollama",
"name": "Local Ollama - Privacy Mode",
"enabled": true,
"priority": 3,
"base_url": "http://ollama.internal:11434",
"default_model": "llama3.3",
"max_output_tokens": 4096,
"temperature": 0.7
}
Token Budget Configuration
{
"monthly_soft_limit_tokens": 5000000,
"monthly_hard_limit_tokens": 10000000,
"warning_threshold_percent": 80
}
// Budget status response
{
"current_usage_tokens": 3250000,
"soft_limit_tokens": 5000000,
"hard_limit_tokens": 10000000,
"usage_percent": 32.5,
"status": "ok",
"projected_month_end_tokens": 7800000
}
Usage Analytics by Feature
# Get token usage breakdown by feature for the current month
curl -s https://controller.example.com/api/v1/ai/usage/by-feature \
-H "Authorization: Bearer $TOKEN" \
-G -d "start=2026-03-01T00:00:00Z" \
-d "end=2026-04-01T00:00:00Z" | jq '.'
# Example response
[
{
"feature": "chat",
"total_tokens": 1850000,
"total_cost_cents": 925,
"request_count": 342
},
{
"feature": "agents",
"total_tokens": 980000,
"total_cost_cents": 490,
"request_count": 45
},
{
"feature": "suggestions",
"total_tokens": 320000,
"total_cost_cents": 32,
"request_count": 1205
},
{
"feature": "embeddings",
"total_tokens": 100000,
"total_cost_cents": 1,
"request_count": 89
}
]
Multi-Provider Failover Setup
# Configure primary provider (priority 1)
curl -X POST https://controller.example.com/api/v1/ai/providers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"provider_type": "anthropic",
"name": "Anthropic Primary",
"priority": 1,
"api_key_credential_id": "cred-primary-...",
"default_model": "claude-sonnet-4-20250514"
}'
# Configure fallback provider (priority 2)
curl -X POST https://controller.example.com/api/v1/ai/providers \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"provider_type": "openai",
"name": "OpenAI Fallback",
"priority": 2,
"api_key_credential_id": "cred-fallback-...",
"default_model": "gpt-4o"
}'
# If Anthropic is unavailable, requests automatically route to OpenAI
Questions & Answers
- Which LLM providers does NetStacks support?
- NetStacks supports five provider types: OpenAI (GPT-4o, GPT-4, GPT-4o-mini), Anthropic (Claude Sonnet, Claude Opus), Ollama (Llama, Mistral, Qwen, and any locally hosted model), OpenRouter (access to many models via a unified OpenAI-compatible API), and Custom (any OpenAI-compatible endpoint including Azure OpenAI, vLLM, and text-generation-inference). Multiple providers can be configured simultaneously with priority-based failover.
- What is BYOK (Bring Your Own Key)?
- BYOK means you provide your own API keys for cloud LLM providers. NetStacks does not include any bundled AI credits or proxy your requests through its own infrastructure. You create API keys directly with OpenAI, Anthropic, or other providers and enter them into NetStacks. This gives you full control over costs, rate limits, usage policies, and data handling agreements with your provider.
- Can I use a self-hosted LLM?
- Yes. Configure Ollama as your LLM provider and point it to your locally hosted instance. When using Ollama, your data never leaves your network because all inference happens on your own hardware. You can also use the Custom provider type to connect to any OpenAI-compatible endpoint, including vLLM, text-generation-inference, or Azure OpenAI deployed in your own cloud subscription.
- How do token budgets work?
- Token budgets set monthly limits on LLM token consumption. You configure a soft limit (warning only, AI continues) and a hard limit (AI features disabled when exceeded). A warning threshold (default 80% of the hard limit) triggers a notification before you reach the hard limit. The system tracks usage per request with input/output token counts and estimated costs, and projects month-end usage based on the current consumption rate.
- Which model should I use for each feature?
- For Command Suggestions, use a fast, smaller model (GPT-4o-mini, Llama 3.3) because speed matters more than reasoning depth. For AI Chat, use a balanced model (GPT-4o, Claude Sonnet) that provides good accuracy with reasonable response times. For NOC Agents, use a powerful reasoning model (Claude Sonnet, GPT-4o) because agents need to perform multi-step analysis. For embeddings, use an embedding-specific model (text-embedding-3-small) optimized for vector search.
- Is my data sent to external AI providers?
- When using cloud providers (OpenAI, Anthropic, OpenRouter), your prompts and terminal context are sent to the provider's API. However, all data passes through the credential sanitization pipeline first, which removes passwords, API keys, SNMP communities, and other sensitive information. For maximum privacy, use Ollama with a locally hosted model so no data leaves your network. In Enterprise mode, the Controller proxies all AI requests, ensuring Terminal clients never communicate directly with external providers.
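The sanitization step described above can be pictured as a redaction pass over outbound text. The patterns and replacement token below are illustrative only, not the actual NetStacks pipeline:

```python
# Minimal illustration of credential redaction before prompts leave the network.
# Patterns and the [REDACTED] token are examples, not NetStacks internals.
import re

REDACTIONS = [
    (re.compile(r"(password\s*[:=]\s*)\S+", re.IGNORECASE), r"\1[REDACTED]"),
    (re.compile(r"(snmp-server community\s+)\S+"), r"\1[REDACTED]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "[REDACTED]"),  # API-key-shaped strings
]

def sanitize(text: str) -> str:
    for pattern, repl in REDACTIONS:
        text = pattern.sub(repl, text)
    return text

print(sanitize("username admin password = hunter2"))
# username admin password = [REDACTED]
print(sanitize("snmp-server community public RO"))
# snmp-server community [REDACTED] RO
```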
Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Provider connection failing | Invalid API key, wrong endpoint, or firewall blocking | Use the “Test Connection” button to diagnose. Verify the API key has not expired or been revoked. For custom endpoints, confirm the base URL is correct and accessible from the Controller. Check firewall rules for outbound HTTPS to the provider's API endpoint. |
| Token budget exceeded | Monthly usage hit the hard limit | Check the budget status page for current usage and projected month-end. Increase the hard limit if budget allows, or wait for the next month to reset. Review usage by feature and user to identify the highest consumers. Consider using a cheaper model for high-volume features like suggestions. |
| Ollama not responding | Service not running or model not loaded | Verify the Ollama service is running: ollama list should show available models. If the model is not listed, run ollama pull <model>. Check that the base URL in NetStacks matches the Ollama listen address. On first request, Ollama loads the model into memory which can take 30–60 seconds. |
| Slow AI responses | Large model, insufficient hardware, or network latency | For cloud providers, check the provider's status page for outages. For Ollama, ensure the host has sufficient RAM and GPU for the model size. Consider using a smaller model for latency-sensitive features. Review max_output_tokens setting — lower limits produce faster responses. |
| Failover not working | Backup provider not enabled or priority misconfigured | Verify that backup providers are enabled and have valid API keys. Check priority ordering — lower numbers are tried first. Test each provider individually to confirm they work. Failover triggers on connection errors and timeouts, not on content-level errors. |
Related Features
- AI Chat — Interactive AI assistance that uses the configured LLM providers for conversational network operations support.
- NOC Agents — Autonomous agents that use configured LLM providers for ReAct reasoning during network event investigation.
- Knowledge Base — Uses embedding models from configured providers to generate vector embeddings for document search.
- Command Suggestions — Context-aware command autocomplete powered by the configured LLM provider and model.
- System Settings — Global system configuration including credential vault settings where API keys are stored encrypted.