Validation
This page describes how to use the validation Helm chart to check whether your cluster is in a good state for installing kdb Insights Enterprise.
Prerequisites
You must have credentials for the appropriate Helm and Docker registries to pull charts and images. The install prompts you for repository locations and access details. The default registries are:
- Helm: https://portal.dl.kx.com/assets/helm/
- Docker: portal.dl.kx.com
Install
Before using Helm to install the chart, you need to create an image pull secret named kxi-image-pull-secret using the following command:
kubectl create secret docker-registry kxi-image-pull-secret \
--docker-username=$USER \
--docker-password=$BEARER \
--docker-server=portal.dl.kx.com
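If you want to confirm the secret exists before proceeding, you can list it with kubectl. This assumes the secret was created in the namespace you intend to install the chart into:
kubectl get secret kxi-image-pull-secret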
Once the secret has been created, you can use Helm to install the chart. To do this, first add the Helm repository:
helm repo add --username $USER --password $BEARER \
kxi-portal-repo https://portal.dl.kx.com/assets/helm
Update your local Helm chart repository cache:
helm repo update
You should now be able to deploy the Helm chart, passing in your values file, using:
helm install kxi-cluster-validation kxi-portal-repo/kxi-cluster-validation
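The command above does not show the values file explicitly; if you have overrides to apply, you can pass them with the -f flag. The file name values.yaml below is only a placeholder:
helm install kxi-cluster-validation kxi-portal-repo/kxi-cluster-validation \
 -f values.yaml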
This creates a job that carries out a series of checks, which you can review once it has finished. The output indicates whether the cluster is ready for a deployment of kdb Insights Enterprise. To view the output, find the job and check its logs:
kubectl logs job/kxi-cluster-validation
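If you want to confirm the job has finished before reading the logs, you can inspect or wait on it. The job name matches the Helm release name used above, and the timeout value is arbitrary:
kubectl get job kxi-cluster-validation
kubectl wait --for=condition=complete job/kxi-cluster-validation --timeout=5m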
Node checks
The details below outline what node checks are carried out.
1. CPU Check
What it does: Retrieves and verifies each worker node’s total and allocatable CPU.
Why it matters: Ensures nodes have properly defined CPU resources.
2. Memory Check
What it does: Retrieves each worker node’s total and allocatable memory, converts it from KiB to GiB, and verifies if it meets the required threshold.
Why it matters: Ensures nodes have sufficient memory for workloads.
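If you want to inspect the same figures manually, each node's allocatable CPU and memory are visible with kubectl; the command below is just one way to view them:
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory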
Rook-Ceph Checks
The details below outline what rook-ceph checks are carried out.
1. CephFS MDS Lockup Check
What it does: Checks each node’s OS image, kubelet version, and kernel version to detect potential CephFS metadata server (MDS) lockup issues.
Why it matters: Ensures node compatibility with CephFS to prevent file system failures that could impact workloads.
2. Rook-Ceph Version Check
What it does: Retrieves the installed Rook-Ceph version from Helm.
Why it matters: Ensures the cluster is running a supported and stable version of Rook-Ceph.
3. Rook-Ceph Toolbox Deployment
What it does: Checks if the Rook-Ceph toolbox is deployed and installs it if missing.
Why it matters: Provides a debugging environment for troubleshooting Ceph issues.
4. Ceph Cluster Health Check
What it does: Retrieves the Ceph cluster's health status (HEALTH_OK, WARNING, or ERROR).
Why it matters: Identifies storage system issues that could affect data availability and cluster stability.
5. Ceph Storage Availability Check
What it does: Checks available storage in Ceph and ensures it exceeds 10GB.
Why it matters: Prevents storage shortages that could disrupt workloads and cluster operations.
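You can run equivalent health and capacity queries yourself through the Rook-Ceph toolbox once it is deployed. The namespace and deployment name below are the Rook defaults and may differ in your cluster:
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df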
Helm Component Version Checks
The details below outline what Helm checks are carried out.
- Checks if critical Helm components are installed:
    - cert-manager
    - rook-ceph
    - istio
    - ingress-nginx
    - prometheus
- Compares their installed versions against the required minimum versions (these can be found here).
- Flags missing components as ERROR and outdated versions as WARNING.
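To see which of these components are installed, and at what chart version, you can list the Helm releases across all namespaces yourself:
helm list -A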
Further Kubernetes Information Checks
The details below outline what additional checks are carried out.
1. Pods Not in Running State
- Lists pods that are not Running or Succeeded, along with their failure reasons.
- Helps identify stuck or failing workloads.
2. Pods with High Restart Counts
- Identifies pods with more than 5 restarts.
- Useful for catching crash loops and unstable applications.
3. Persistent Volume Claims (PVCs) Pending
- Lists any PVCs stuck in Pending status, indicating potential storage issues.
4. Persistent Volumes (PVs) in Failed or Released State
- Detects PVs in Failed or Released states, which may need manual intervention.
5. Node Readiness Check
- Ensures all nodes are Ready.
- Lists any nodes that are not, helping diagnose potential cluster issues.
6. Problematic Events
- Fetches Kubernetes events indicating failures like OOMKilled, CrashLoopBackOff, Evictions, etc.
- Helps catch systemic issues affecting workloads.
Why it matters
- Provides a quick cluster health snapshot, allowing DevOps engineers to proactively address potential problems.
- Helps reduce downtime by identifying critical issues early.
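If any of the checks above flag a problem, the same information can be gathered manually with kubectl for further investigation. The commands below are illustrative equivalents, not what the job runs internally:
- Pods not Running or Succeeded: kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
- PVCs stuck in Pending: kubectl get pvc -A | grep Pending
- Node readiness: kubectl get nodes
- Warning events (OOMKilled, CrashLoopBackOff, evictions): kubectl get events -A --field-selector type=Warning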