# Netdata Cloud On-Prem Troubleshooting
Your Netdata Cloud On-Prem deployment relies on several infrastructure components that need proper resources and configuration:
**Core Components:**
- Databases: PostgreSQL, Redis, Elasticsearch
- Message Brokers: Pulsar, EMQX
- Traffic Controllers: Ingress, Traefik
- Kubernetes Cluster
Monitor and manage these components according to your organization's established practices and requirements.
## Common Issues and Solutions
### Installation Timeout Error
If your installation fails with this error:
```text
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```
This typically means your cluster doesn't have enough resources. Here's how to diagnose and fix it:
#### Diagnose the Problem
Before you start:
- Full installation: Ensure you're in the correct cluster context (see the check after this list)
- Light PoC: SSH into your Ubuntu VM with `kubectl` pre-configured
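For the full installation case, you can confirm which cluster `kubectl` is pointing at before going further:

```bash
# Show the kubectl context currently in use
kubectl config current-context

# Switch to the correct context if needed (replace with your own context name)
kubectl config use-context <YOUR_CONTEXT>
```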
Always perform a complete uninstallation before attempting a new installation. A safety check in the installation script prevents you from running the installation a second time over an existing deployment:
```bash
./provision.sh uninstall
```
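If you want to confirm the previous attempt is fully gone before reinstalling, you can check for leftover releases and resources. This assumes the default `netdata-cloud` namespace used elsewhere in this guide:

```bash
# List any Helm releases still present in the target namespace
helm list -n netdata-cloud

# Check whether leftover resources from the previous attempt remain
kubectl get all -n netdata-cloud
```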
#### Troubleshooting individual pods in Kubernetes
**Step 1: Check for stuck pods**
```bash
kubectl get pods -n netdata-cloud
```
Look for pods that are not in the `Running` state, or that are running but not fully ready (for example, `0/1 Running`).
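If the namespace contains many pods, you can narrow the output to those not in the `Running` phase. Note that a pod stuck at `0/1 Running` still reports the `Running` phase, so keep an eye on the READY column in the full listing as well:

```bash
# Show only pods whose phase is not Running (Pending, Failed, etc.)
kubectl get pods -n netdata-cloud --field-selector=status.phase!=Running
```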
**Step 2: Examine resource constraints**
If you found `Pending` pods, check what's blocking them:
```bash
kubectl describe pod <POD_NAME> -n netdata-cloud
```
Look at the `Events` section for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues
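You can also scan for scheduling problems across the whole namespace at once by listing recent events, then looking for `FailedScheduling` entries:

```bash
# List events in the namespace sorted by time (newest last)
kubectl get events -n netdata-cloud --sort-by=.lastTimestamp
```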
Check your overall cluster capacity:
```bash
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
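For a compact view of how much each node can actually schedule, one option is custom columns over the allocatable fields:

```bash
# Print each node's allocatable CPU and memory in one table
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
```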
#### Fix the Issue
Compare your available resources against the minimum requirements. Then add more resources to your cluster or free up existing resources by removing unnecessary workloads.
### Login Problems After Installation
Your installation might complete successfully, but you can't log in. This usually happens due to configuration mismatches.
| Problem | What You'll See | Why It Happens | How to Fix It |
|---|---|---|---|
| SSO Login Failure | Can't authenticate via SSO providers | Invalid callback URLs, expired SSO tokens, untrusted certificates, incorrect FQDN in `global.public` | Update the SSO configuration in your `values.yaml`, verify certificates are valid and trusted, ensure the FQDN matches your certificate (see the sketch below this table) |
| MailCatcher Login (Light PoC) | Magic links don't arrive, "Invalid token" errors | Incorrect hostname during installation, modified default MailCatcher values | Reinstall with the correct FQDN, restore default MailCatcher settings, ensure the hostname matches your certificate |
| Custom Mail Server Login | Magic links don't arrive | Incorrect SMTP configuration, network connectivity issues | Update SMTP settings in your `values.yaml`, verify the network allows SMTP traffic, check your mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | Mismatched `netdata-cloud-common` secret, database hash mismatch, namespace change without secret migration | Migrate the secret before changing namespaces, perform a fresh installation, or contact support for data recovery |
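The FQDN-related fixes above typically come down to making the public hostname in your `values.yaml` match the certificate you serve. The exact key layout depends on your chart version, so treat the following as a hypothetical sketch rather than the definitive structure; only the `global.public` section name comes from the table above:

```yaml
# Hypothetical values.yaml excerpt - key names below global.public may differ in your chart version
global:
  public:
    # The public FQDN users reach; it must match the hostname in your TLS certificate
    cloudUrl: "https://netdata-cloud.example.com"
```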
#### Important for Namespace Changes

If you're changing your installation namespace, the `netdata-cloud-common` secret will be recreated. Before you proceed, back up your existing `netdata-cloud-common` secret or wipe your PostgreSQL database to prevent data conflicts.
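One way to take that backup, assuming the current installation lives in the `netdata-cloud` namespace:

```bash
# Export the existing secret so it can be restored in the new namespace or kept for reference
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml
```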
### Slow or Failing Charts
When your charts take a long time to load or fail with errors, the problem usually comes from data collection challenges. The `charts` service needs to gather data from multiple Agents in a Room, which requires successful responses from all queried Agents.
| Problem | What You'll See | Why It Happens | How to Fix It |
|---|---|---|---|
| Agent Connectivity | Queries stall or time out, inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional Parent nodes to provide reliable backends; the system automatically prefers them for queries when available |
| Kubernetes Resources | Service throttling, slow data processing, delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust your container resource limits and node capacity as needed |
| Database Performance | Slow query responses, increased latency across services | PostgreSQL performance bottlenecks in your setup | Monitor and optimize your database resource utilization: CPU usage, memory allocation, disk I/O performance |
| Message Broker Issues | Delayed node status updates (online/offline/stale), slow alert transitions, dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | Review your Pulsar configuration, adjust microservice resource allocation, monitor message processing rates |
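To check for resource saturation among the on-prem microservices themselves, per-pod usage is a quick first look (this requires the metrics-server add-on, same as `kubectl top nodes` above):

```bash
# Show the heaviest pods by memory in the installation namespace
kubectl top pods -n netdata-cloud --sort-by=memory

# Same view sorted by CPU
kubectl top pods -n netdata-cloud --sort-by=cpu
```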
## Get Help
If problems persist after trying these solutions:
1. Gather the following information:
   - Your installation logs
   - Your cluster specifications
   - A description of the specific issue you're experiencing
2. Contact our support team at [support@netdata.cloud](mailto:support@netdata.cloud)
We're here to help you get your monitoring running smoothly.