# Severed-Infra: Health & Diagnostics Guide
### 1. The Foundation: Node & Storage Stability
Before troubleshooting apps, ensure the physical (Docker) layer is stable.
* **Node Readiness:** All 3 nodes (1 server, 2 agents) must be `Ready`.
```bash
kubectl get nodes
```
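If a node is stuck in `NotReady`, describing it shows the kubelet's reported conditions and recent events (the node name below is a placeholder):
```bash
# Replace <node-name> with the NotReady node from `kubectl get nodes`
kubectl describe node <node-name>
```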
* **Storage Binding:** Verify that the OpenEBS Persistent Volume Claims (PVCs) for Loki and Prometheus are `Bound`.
```bash
kubectl get pvc -n monitoring
```
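If a PVC stays `Pending`, its events usually point at the OpenEBS storage class. A quick check (the PVC name is a placeholder):
```bash
# Show binding events for a stuck claim and confirm the storage class exists
kubectl describe pvc <pvc-name> -n monitoring
kubectl get storageclass
```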
* **Pod Health:** Confirm that the application, monitoring, dashboard, and OpenEBS pods are all `Running`.
```bash
kubectl get pods -n severed-apps
kubectl get pods -n monitoring
kubectl get pods -n kubernetes-dashboard
kubectl get pods -n openebs
```
* **Grafana Restart:** If Grafana is unhealthy or serving stale configuration, restart its deployment.
```bash
kubectl rollout restart deployment grafana -n monitoring
```
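For any pod stuck in `CrashLoopBackOff` or `Pending`, describe it and pull the previous container's logs (pod and namespace names below are placeholders):
```bash
# Events explain scheduling and storage problems; --previous shows the last crash's output
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```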
---
### 2. The Telemetry Bridge: Alloy & Exporter
Check whether Alloy is successfully translating the raw Nginx status output into Prometheus metrics.
* **Error Scan:** Check Alloy logs specifically for `scrape_uri` or `connection refused` errors.
```bash
kubectl logs -n monitoring -l name=alloy --tail=50
```
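To narrow the output to just the failure lines, the same command can be piped through `grep` on the host:
```bash
# Surface scrape and connection failures from a larger window of Alloy logs
kubectl logs -n monitoring -l name=alloy --tail=200 | grep -Ei "error|refused|scrape_uri"
```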
* **Recovery Reference:** Commands for redeploying and restarting the pieces of the telemetry pipeline.
```bash
# Re-apply the Alloy configuration and recycle its pods
kubectl apply -f infra/alloy-setup.yaml
kubectl delete pods -n monitoring -l name=alloy
kubectl get pods -n monitoring
kubectl describe pod <alloy-pod-name> -n monitoring   # use a pod name from the previous command
kubectl logs -n monitoring -l name=alloy --tail=50

# Inspect the environment injected into the Grafana container
kubectl get pod -n monitoring -l app=grafana -o jsonpath='{.items[0].spec.containers[0].env}' | jq

# Re-apply the blog config, restart the deployment, and follow its logs
kubectl apply -f apps/severed-blog-config.yaml
kubectl rollout restart deployment severed-blog -n severed-apps
kubectl logs -n severed-apps -l app=severed-blog -f

# Check Loki output and restart Grafana
kubectl logs loki-0 -n monitoring --tail=20
kubectl rollout restart deployment/grafana -n monitoring
```
* **Internal Handshake:** Run your `access-hub.sh` script and open the Alloy UI at `localhost:12345`.
* Find the `prometheus.exporter.nginx.blog` component.
* Ensure the health status is **Green/Up**.
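If `access-hub.sh` is not handy, a direct port-forward reaches the same UI (a sketch, assuming Alloy runs in the `monitoring` namespace behind the `name=alloy` label):
```bash
# Forward the Alloy UI port to localhost, then open http://localhost:12345
kubectl port-forward -n monitoring $(kubectl get pod -n monitoring -l name=alloy -o name | head -n 1) 12345:12345
```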
---
### 3. The Database: Prometheus Query Test
If the exporter is working, the metrics will appear in the Prometheus time-series database.
* **Live Traffic Check:** Verify that `nginx_http_requests_total` is returning a data vector (not an empty list `[]`).
```bash
kubectl exec -it prometheus-0 -n monitoring -- \
wget -qO- "http://localhost:9090/api/v1/query?query=nginx_http_requests_total"
```
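In the JSON that comes back, a healthy reply has `"resultType":"vector"` with a non-empty `"result"` array; an empty pipeline returns `"result":[]`.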
* **Metric Discovery:** List all Nginx-related metrics currently being stored.
```bash
kubectl exec -it prometheus-0 -n monitoring -- \
wget -qO- "http://localhost:9090/api/v1/label/__name__/values" | grep nginx
```
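If the metric exists but its value never moves, push some test traffic through the blog and re-run the query above (a sketch; the `severed-blog` Service name and port 80 are assumptions):
```bash
# Forward the blog service locally and fire a burst of requests
kubectl port-forward -n severed-apps svc/severed-blog 8080:80 &
sleep 2
for i in $(seq 1 50); do curl -s -o /dev/null http://localhost:8080/; done
kill %1  # stop the background port-forward
```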
---
### 4. The "Brain": Horizontal Pod Autoscaler (HPA)
The HPA is the final consumer of this data. If this is healthy, the cluster is auto-scaling correctly.
* **Target Alignment:** The `TARGETS` column should show a real value (e.g., `0/10`) rather than `<unknown>`.
```bash
kubectl get hpa -n severed-apps
```
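When the column shows `<unknown>`, the HPA's events usually name the failing metrics lookup:
```bash
# Conditions and events explain why the metric could not be read
kubectl describe hpa -n severed-apps
```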
* **Adapter Check:** Ensure the Custom Metrics API is serving the translated Nginx metrics to the Kubernetes control plane.
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/severed-apps/pods/*/nginx_http_requests_total"
```
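If that call returns an error, confirm the custom metrics APIService is registered and `Available`, and check the adapter's own logs (the `prometheus-adapter` deployment name and namespace are assumptions):
```bash
# The APIService must report Available=True for the HPA to see custom metrics
kubectl get apiservice v1beta1.custom.metrics.k8s.io
# Adapter logs reveal metric-name and query mismatches; deployment name assumed
kubectl logs -n monitoring deploy/prometheus-adapter --tail=50
```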
---
### Cheat Sheet
| Symptom | Probable Cause | Fix |
|----------------------------|-----------------------------|-------------------------------------------|
| `502 Bad Gateway` | Node resource exhaustion | Restart K3d or increase Docker RAM |
| `strconv.ParseFloat` error | Missing Nginx Exporter | Use `prometheus.exporter.nginx` in Alloy |
| HPA shows `<unknown>` | Prometheus Adapter mismatch | Verify `adapter-values.yaml` metric names |
| `No nodes found` | Corrupted cluster state | Run `k3d cluster delete` and recreate |