# Severed-Infra: Health & Diagnostics Guide

### 1. The Foundation: Node & Storage Stability

Before troubleshooting apps, ensure the physical (Docker) layer is stable.

* **Node Readiness:** All 3 nodes (1 server, 2 agents) must be `Ready`.

```bash
kubectl get nodes
```
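
To make the readiness check scriptable, the sketch below filters the node listing down to nodes whose status is not exactly `Ready`. The function name `not_ready_nodes` is hypothetical and not part of this repository, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second). Empty output means every node is `Ready`.

```bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also reported.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```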

* **Storage Binding:** Verify that the OpenEBS Persistent Volume Claims (PVCs) for Loki and Prometheus are `Bound`.

```bash
kubectl get pvc -n monitoring
```

* **Pod Status:** List the pods in each namespace, including `openebs`, and confirm none are stuck or crash-looping.

```bash
kubectl get pods -n severed-apps
kubectl get pods -n monitoring
kubectl get pods -n kubernetes-dashboard
kubectl get pods -n openebs
```

* **Grafana Restart:** Restart the Grafana deployment if required.

```bash
kubectl rollout restart deployment grafana -n monitoring
```
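
The storage and pod checks in this section can be reduced to two small filters, sketched here under a few assumptions: the function names `unbound_pvcs` and `unhealthy_pods` are hypothetical, and the column positions match the default `--no-headers` output of `kubectl get pvc` (status second) and `kubectl get pods` (status third). No output means nothing needs attention.

```bash
# Print PVCs whose STATUS column is not "Bound".
# Reads `kubectl get pvc -n monitoring --no-headers` output on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Print pods whose STATUS column is neither "Running" nor "Completed".
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin.
# Note: a "Running" pod that is not yet ready (e.g. 0/1) is not flagged.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Usage:
#   kubectl get pvc -n monitoring --no-headers | unbound_pvcs
#   for ns in severed-apps monitoring kubernetes-dashboard openebs; do
#     kubectl get pods -n "$ns" --no-headers | unhealthy_pods
#   done
```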

---

### 2. The Telemetry Bridge: Alloy & Exporter

Check that Alloy is successfully translating the raw Nginx text output into Prometheus metrics.

* **Error Scan:** Check the Alloy logs specifically for `scrape_uri` or `connection refused` errors.

```bash
kubectl logs -n monitoring -l name=alloy --tail=50
```
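
Rather than reading all 50 lines by eye, the log output can be piped through a filter that keeps only the two failure signatures named above. This is a minimal sketch; the function name `alloy_scrape_errors` is hypothetical, and the exact wording of Alloy's log lines is not assumed beyond those two substrings.

```bash
# Keep only log lines that mention "scrape_uri" or "connection refused".
# Reads log text on stdin; exits non-zero when nothing matches,
# which is the healthy case for this check.
alloy_scrape_errors() {
  grep -E 'scrape_uri|connection refused'
}

# Usage:
#   kubectl logs -n monitoring -l name=alloy --tail=50 | alloy_scrape_errors
```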

[//]: # (kubectl apply -f infra/alloy-setup.yaml)
[//]: # (kubectl delete pods -n monitoring -l name=alloy)
[//]: # (kubectl get pods -n monitoring)
[//]: # (kubectl describe pod alloy-dq2cd -n monitoring)
[//]: # (kubectl logs -n monitoring -l name=alloy --tail=50)
[//]: # (kubectl get pod -n monitoring -l app=grafana -o jsonpath='{.items[0].spec.containers[0].env}' | jq)

[//]: # (kubectl apply -f apps/severed-blog-config.yaml)
[//]: # (kubectl rollout restart deployment severed-blog -n severed-apps)
[//]: # (kubectl logs -n severed-apps -l app=severed-blog -f)

[//]: # (kubectl logs loki-0 -n monitoring --tail=20)

* **Internal Handshake:** Run your `access-hub.sh` script and visit `localhost:12345`.
  * Find the `prometheus.exporter.nginx.blog` component.
  * Ensure the health status is **Green/Up**.

---

### 3. The Database: Prometheus Query Test

If the exporter is working, the metrics will appear in the Prometheus time-series database.

* **Live Traffic Check:** Verify that a query for `nginx_http_requests_total` returns a non-empty result vector (not an empty list `[]`).

```bash
kubectl exec -it prometheus-0 -n monitoring -- \
  wget -qO- "http://localhost:9090/api/v1/query?query=nginx_http_requests_total"
```
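
The "non-empty vector" condition can also be tested mechanically. The sketch below succeeds only when the response reports `"status":"success"` and its `result` array is not empty. The function name `has_samples` is hypothetical, and the string matching assumes the compact JSON (no spaces after colons) that the Prometheus HTTP API normally returns; a JSON parser such as `jq` would be more robust where available.

```bash
# Succeed only if a Prometheus instant-query response reports success
# and carries at least one sample. Reads the JSON response on stdin.
has_samples() {
  response="$(cat)"
  case "$response" in
    *'"status":"success"'*) ;;   # query was accepted
    *) return 1 ;;               # error response or unexpected body
  esac
  case "$response" in
    *'"result":[]'*) return 1 ;; # success, but no samples stored
  esac
  return 0
}

# Usage:
#   kubectl exec prometheus-0 -n monitoring -- \
#     wget -qO- "http://localhost:9090/api/v1/query?query=nginx_http_requests_total" \
#     | has_samples && echo "metrics present" || echo "no samples"
```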
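
* **Metric Discovery:** List all Nginx-related metrics currently being stored.

```bash
# The API returns a single JSON line, so split it on commas before filtering;
# otherwise grep prints the entire response. No TTY (-it) is needed when piping.
kubectl exec prometheus-0 -n monitoring -- \
  wget -qO- "http://localhost:9090/api/v1/label/__name__/values" | tr ',' '\n' | grep nginx
```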

---

### 4. The "Brain": Horizontal Pod Autoscaler (HPA)

The HPA is the final consumer of this data. If it reports real metric values, the whole pipeline is working and the cluster can auto-scale correctly.

* **Target Alignment:** The `TARGETS` column should show a real value (e.g., `0/10`) rather than `<unknown>`.

```bash
kubectl get hpa -n severed-apps
```
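
For a quick pass/fail signal, the HPA listing can be filtered for rows that still show `<unknown>`. This is a small sketch; the function name `hpa_unknown_targets` is hypothetical, and it relies only on the HPA name being the first column of `kubectl get hpa --no-headers`. Empty output means every HPA is reading a real value.

```bash
# Print the names of HPAs whose row still contains "<unknown>".
# Reads `kubectl get hpa -n severed-apps --no-headers` output on stdin.
hpa_unknown_targets() {
  awk '/<unknown>/ { print $1 }'
}

# Usage:
#   kubectl get hpa -n severed-apps --no-headers | hpa_unknown_targets
```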

* **Adapter Check:** Ensure the Custom Metrics API is serving the translated Nginx metrics to the Kubernetes control plane.

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/severed-apps/pods/*/nginx_http_requests_total"
```
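
If that query returns an error, it can help to see which metric names the adapter actually exposes. The sketch below extracts resource names from the Custom Metrics API discovery document and filters them by a substring. The function name `adapter_metric_names` is hypothetical, and the parsing assumes the discovery response lists each resource with a `"name":"..."` field; `jq` would be a sturdier choice where it is installed.

```bash
# Print the unique resource names in a Custom Metrics API discovery
# document that contain the given substring (default: "nginx").
# Reads the JSON document on stdin.
adapter_metric_names() {
  grep -oE '"name": *"[^"]*"' | cut -d'"' -f4 | grep -- "${1:-nginx}" | sort -u
}

# Usage:
#   kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | adapter_metric_names nginx
```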

### Cheat Sheet

| Symptom                    | Probable Cause              | Fix                                       |
|----------------------------|-----------------------------|-------------------------------------------|
| `502 Bad Gateway`          | Node resource exhaustion    | Restart K3d or increase Docker RAM        |
| `strconv.ParseFloat` error | Missing Nginx Exporter      | Use `prometheus.exporter.nginx` in Alloy  |
| HPA shows `<unknown>`      | Prometheus Adapter mismatch | Verify `adapter-values.yaml` metric names |
| `No nodes found`           | Corrupted cluster state     | Run `k3d cluster delete` and recreate     |