Validation
This page describes how to use the validation Helm chart to check whether your cluster is in a good state for installing kdb Insights Enterprise.
Prerequisites
You must have credentials for the appropriate Helm and Docker registries to pull charts and images. The install prompts you for repository locations and access details. The default registries are:
- Helm: https://portal.dl.kx.com/assets/helm/
- Docker: portal.dl.kx.com
Install
Before using Helm to install the chart, you need to create an image pull secret named kxi-image-pull-secret using the following command:
kubectl create secret docker-registry kxi-image-pull-secret \
--docker-username=$USER \
--docker-password=$BEARER \
--docker-server=portal.dl.kx.com
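If you want to confirm the secret exists before proceeding, you can list it with kubectl. This assumes the secret was created in the namespace you intend to install the chart into:
kubectl get secret kxi-image-pull-secret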
Once the secret has been created, you can use Helm to install the chart. To do this, first add the Helm repository:
helm repo add --username $USER --password $BEARER \
kxi-portal-repo https://portal.dl.kx.com/assets/helm
Update your local Helm chart repository cache:
helm repo update
You should now be able to deploy the Helm chart, passing in your values file, using:
helm install kxi-cluster-validation kxi-portal-repo/kxi-cluster-validation
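The command above does not show the values file explicitly; if you have overrides to apply, you can pass them with the -f flag. The file name values.yaml below is only a placeholder:
helm install kxi-cluster-validation kxi-portal-repo/kxi-cluster-validation \
 -f values.yaml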
This creates a job that carries out a series of checks, which you can review once it has finished. The output indicates whether the cluster is ready for a deployment of kdb Insights Enterprise. To view the output, find the job and check its logs:
kubectl logs job/kxi-cluster-validation
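If you want to confirm the job has finished before reading the logs, you can inspect or wait on it. The job name matches the Helm release name used above, and the timeout value is arbitrary:
kubectl get job kxi-cluster-validation
kubectl wait --for=condition=complete job/kxi-cluster-validation --timeout=5m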
Node checks
The details below outline what node checks are carried out.
1. CPU Check
What it does: Retrieves and verifies each worker node’s total and allocatable CPU.
Why it matters: Ensures nodes have properly defined CPU resources.
2. Memory Check
What it does: Retrieves each worker node’s total and allocatable memory, converts it from KiB to GiB, and verifies if it meets the required threshold.
Why it matters: Ensures nodes have sufficient memory for workloads.
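If you want to inspect the same figures manually, each node's allocatable CPU and memory are visible with kubectl; the command below is just one way to view them:
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory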
Rook-Ceph Checks
The details below outline what rook-ceph checks are carried out.
1. CephFS MDS Lockup Check
What it does: Checks each node’s OS image, kubelet version, and kernel version to detect potential CephFS metadata server (MDS) lockup issues.
Why it matters: Ensures node compatibility with CephFS to prevent file system failures that could impact workloads.
2. Rook-Ceph Version Check
What it does: Retrieves the installed Rook-Ceph version from Helm.
Why it matters: Ensures the cluster is running a supported and stable version of Rook-Ceph.
3. Rook-Ceph Toolbox Deployment
What it does: Checks if the Rook-Ceph toolbox is deployed and installs it if missing.
Why it matters: Provides a debugging environment for troubleshooting Ceph issues.
4. Ceph Cluster Health Check
What it does: Retrieves the Ceph cluster's health status (HEALTH_OK, WARNING, or ERROR).
Why it matters: Identifies storage system issues that could affect data availability and cluster stability.
5. Ceph Storage Availability Check
What it does: Checks available storage in Ceph and ensures it exceeds 10GB.
Why it matters: Prevents storage shortages that could disrupt workloads and cluster operations.
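You can run equivalent health and capacity queries yourself through the Rook-Ceph toolbox once it is deployed. The namespace and deployment name below are the Rook defaults and may differ in your cluster:
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df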
Helm Component Version Checks
The details below outline what Helm checks are carried out.
- Checks if critical Helm components are installed:
    - cert-manager
    - rook-ceph
    - istio
    - ingress-nginx
    - prometheus
- Compares their installed versions against the required minimum versions (these can be found here).
- Flags missing components as ERROR and outdated versions as WARNING.
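To see which of these components are installed, and at what chart version, you can list the Helm releases across all namespaces yourself:
helm list -A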
Further Kubernetes Information Checks
The details below outline what additional checks are carried out.
1. Pods Not in Running State
- Lists pods that are not Running or Succeeded, along with their failure reasons.
- Helps identify stuck or failing workloads.
2. Pods with High Restart Counts
- Identifies pods with more than 5 restarts.
- Useful for catching crash loops and unstable applications.
3. Persistent Volume Claims (PVCs) Pending
- Lists any PVCs stuck in Pending status, indicating potential storage issues.
4. Persistent Volumes (PVs) in Failed or Released State
- Detects PVs in Failed or Released states, which may need manual intervention.
5. Node Readiness Check
- Ensures all nodes are Ready.
- Lists any nodes that are not, helping diagnose potential cluster issues.
6. Problematic Events
- Fetches Kubernetes events indicating failures like OOMKilled, CrashLoopBackOff, Evictions, etc.
- Helps catch systemic issues affecting workloads.
Why it matters
- Provides a quick cluster health snapshot, allowing DevOps engineers to proactively address potential problems.
- Helps reduce downtime by identifying critical issues early.
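If any of the checks above flag a problem, the same information can be gathered manually with kubectl for further investigation. The commands below are illustrative equivalents, not what the job runs internally:
- Pods not Running or Succeeded: kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
- PVCs stuck in Pending: kubectl get pvc -A | grep Pending
- Node readiness: kubectl get nodes
- Warning events (OOMKilled, CrashLoopBackOff, evictions): kubectl get events -A --field-selector type=Warning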