
Netdata Cloud On-Prem Troubleshooting

Your Netdata Cloud On-Prem deployment relies on several infrastructure components that need proper resources and configuration:

Core Components:

  • Databases: PostgreSQL, Redis, Elasticsearch
  • Message Brokers: Pulsar, EMQX
  • Traffic Controllers: Ingress, Traefik
  • Kubernetes Cluster

Monitor and manage these components according to your organization's established practices and requirements.
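For a quick view of what is actually deployed, you can list the Helm releases and the workloads behind these components. This is a minimal sketch that assumes the default netdata-cloud namespace; adjust it to your installation.

# List the Helm releases that make up the On-Prem installation
helm list -n netdata-cloud

# List the workloads backing the databases, brokers, and traffic controllers
kubectl get statefulsets,deployments -n netdata-cloud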

Common Issues and Solutions

Installation Timeout Error

If your installation fails with this error:

Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.

This typically means your cluster doesn't have enough resources. Here's how to diagnose and fix it:

Diagnose the Problem

note

Before you start:

  • Full installation: Ensure you're in the correct cluster context
  • Light PoC: SSH into your Ubuntu VM with kubectl pre-configured

Always perform a complete uninstallation before attempting a new installation; the installation script includes a safety check that blocks a second installation attempt.

./provision.sh uninstall
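Before retrying, you can confirm the uninstallation actually completed. A minimal sketch, assuming the default netdata-cloud namespace; depending on your setup, the namespace itself may remain even after the releases are gone:

# Confirm no Helm releases are left behind
helm list -n netdata-cloud

# Check whether the namespace still exists and what remains in it
kubectl get namespace netdata-cloud
kubectl get all -n netdata-cloud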
Troubleshooting individual pods in Kubernetes

Step 1: Check for stuck pods
kubectl get pods -n netdata-cloud
tip

Look for pods that are not in the Running state, or that report Running but are not fully ready - for example, 0/1 Running
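To narrow the listing down, you can ask kubectl for pods that are not in the Running phase. A small sketch, assuming the default netdata-cloud namespace; note that pods stuck at 0/1 Running still report the Running phase, so review the full listing as well:

# Show only pods whose phase is not Running (for example Pending or Failed)
kubectl get pods -n netdata-cloud --field-selector=status.phase!=Running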

Step 2: Examine resource constraints
If you found Pending pods, check what's blocking them:

kubectl describe pod <POD_NAME> -n netdata-cloud

Look at the Events section for messages about:

  • Insufficient CPU
  • Insufficient Memory
  • Node capacity issues
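If the relevant events have already rotated out of the describe output, you can query them directly. A sketch assuming the default netdata-cloud namespace; replace <POD_NAME> with the pod you are inspecting:

# List recent events for a specific pod, newest last
kubectl get events -n netdata-cloud --field-selector involvedObject.name=<POD_NAME> --sort-by=.lastTimestamp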
Step 3: View your cluster resources
Check your overall cluster capacity:

# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

Fix the Issue

Compare your available resources against the minimum requirements. Then add more resources to your cluster or free up existing resources by removing unnecessary workloads.
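To see where the resources are going before you add capacity or remove workloads, you can sort per-pod usage. A sketch assuming the default netdata-cloud namespace; like kubectl top nodes above, it requires metrics-server to be available:

# Identify the heaviest consumers by memory and CPU
kubectl top pods -n netdata-cloud --sort-by=memory
kubectl top pods -n netdata-cloud --sort-by=cpu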

Login Problems After Installation

Your installation might complete successfully, but you can't log in. This usually happens due to configuration mismatches.

SSO Login Failure
  • What you'll see: Can't authenticate via SSO providers
  • Why it happens: Invalid callback URLs, expired SSO tokens, untrusted certificates, incorrect FQDN in global.public
  • How to fix it: Update the SSO configuration in your values.yaml, verify certificates are valid and trusted, and ensure the FQDN matches your certificate

MailCatcher Login (Light PoC)
  • What you'll see: Magic links don't arrive, "Invalid token" errors
  • Why it happens: Incorrect hostname during installation, modified default MailCatcher values
  • How to fix it: Reinstall with the correct FQDN, restore the default MailCatcher settings, and ensure the hostname matches the certificate

Custom Mail Server Login
  • What you'll see: Magic links don't arrive
  • Why it happens: Incorrect SMTP configuration, network connectivity issues
  • How to fix it: Update the SMTP settings in your values.yaml, verify the network allows SMTP traffic (a quick connectivity check is sketched after this table), and check your mail server logs

Invalid Token Error
  • What you'll see: "Something went wrong - invalid token" message
  • Why it happens: Mismatched netdata-cloud-common secret, database hash mismatch, or a namespace change without secret migration
  • How to fix it: Migrate the secret before the namespace change, perform a fresh installation, or contact support for data recovery
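For the custom mail server case, one quick way to verify that SMTP traffic is allowed from inside the cluster is to run a throwaway pod and open a TCP connection to the mail server. This is a rough sketch: the image is only an example (any image that ships curl or netcat works), and <SMTP_HOST> and <SMTP_PORT> are placeholders for your environment.

# Test SMTP reachability from inside the cluster (curl's telnet:// scheme just opens a TCP connection)
kubectl run smtp-test -n netdata-cloud --rm -it --restart=Never \
  --image=curlimages/curl --command -- curl -v --max-time 10 telnet://<SMTP_HOST>:<SMTP_PORT>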
warning

Important for Namespace Changes

If you're changing your installation namespace, the netdata-cloud-common secret will be recreated.

Before you proceed

Back up your existing netdata-cloud-common secret, or wipe your PostgreSQL database to prevent data conflicts.
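A minimal sketch of taking that backup, assuming the secret currently lives in the netdata-cloud namespace:

# Save a copy of the secret before changing namespaces
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml

# Before restoring it into the new namespace, update metadata.namespace in the saved file
# (and remove server-generated fields such as resourceVersion and uid), then apply it
kubectl apply -f netdata-cloud-common.backup.yaml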

Slow or Failing Charts

When your charts are slow to load or fail with errors, the problem usually lies in data collection: the charts service gathers data from multiple Agents in a Room and needs successful responses from every queried Agent.

Agent Connectivity
  • What you'll see: Queries stall or time out, inconsistent chart loading
  • Why it happens: Slow Agents or unreliable network connections prevent timely data collection
  • How to fix it: Deploy additional Parent nodes to provide reliable backends; your system will automatically prefer these for queries when available

Kubernetes Resources
  • What you'll see: Service throttling, slow data processing, delayed dashboard updates
  • Why it happens: Resource saturation at the node level or restrictive container limits
  • How to fix it: Review and adjust your container resource limits and node capacity as needed

Database Performance
  • What you'll see: Slow query responses, increased latency across services
  • Why it happens: PostgreSQL performance bottlenecks in your setup
  • How to fix it: Monitor and optimize your database resource utilization: CPU usage, memory allocation, disk I/O performance

Message Broker Issues
  • What you'll see: Delayed node status updates (online/offline/stale), slow alert transitions, dashboard update delays
  • Why it happens: Message accumulation in Pulsar due to processing bottlenecks
  • How to fix it: Review your Pulsar configuration, adjust microservice resource allocation, and monitor message processing rates (a backlog check is sketched after this table)
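For the message broker case, a growing backlog is usually visible in Pulsar's topic statistics. The sketch below assumes the bundled Pulsar broker runs as a pod in the netdata-cloud namespace; the pod name and topic are placeholders you need to take from your own deployment.

# Find the Pulsar broker pod
kubectl get pods -n netdata-cloud | grep -i pulsar

# Inspect a topic's statistics (msgBacklog indicates accumulated, unprocessed messages)
kubectl exec -n netdata-cloud <PULSAR_BROKER_POD> -- bin/pulsar-admin topics stats <TOPIC>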

Get Help

If problems persist after trying these solutions:

  1. Gather the following information:

    • Your installation logs
    • Your cluster specifications
    • Description of the specific issue you're experiencing
  2. Contact our support team at support@netdata.cloud

We're here to help you get your monitoring running smoothly.
