# Severed-Infra: Health & Diagnostics Guide
### 1. The Foundation: Node & Storage Stability
Before troubleshooting apps, ensure the physical (Docker) layer is stable.
* **Node Readiness:** All 3 nodes (1 server, 2 agents) must be `Ready`.
```bash
kubectl get nodes
```
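If a node is stuck in `NotReady`, describing it shows the kubelet's reported conditions and recent events (the node name below is a placeholder):
```bash
# Replace <node-name> with the NotReady node from `kubectl get nodes`
kubectl describe node <node-name>
```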
* **Storage Binding:** Verify that the OpenEBS Persistent Volume Claims (PVCs) for Loki and Prometheus are `Bound`.
```bash
kubectl get pvc -n monitoring
```
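If a PVC stays `Pending`, its events usually point at the OpenEBS storage class. A quick check (the PVC name is a placeholder):
```bash
# Show binding events for a stuck claim and confirm the storage class exists
kubectl describe pvc <pvc-name> -n monitoring
kubectl get storageclass
```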
* **Pod Health:** Confirm that the application, monitoring, dashboard, and OpenEBS pods are all `Running`.
```bash
kubectl get pods -n severed-apps
kubectl get pods -n monitoring
kubectl get pods -n kubernetes-dashboard
kubectl get pods -n openebs
```
* **Grafana Restart:** If Grafana is unhealthy or serving stale configuration, restart its deployment.
```bash
kubectl rollout restart deployment grafana -n monitoring
```
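For any pod stuck in `CrashLoopBackOff` or `Pending`, describe it and pull the previous container's logs (pod and namespace names below are placeholders):
```bash
# Events explain scheduling and storage problems; --previous shows the last crash's output
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```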
---
### 2. The Telemetry Bridge: Alloy & Exporter
Check whether Alloy is successfully translating the raw Nginx status output into Prometheus metrics.
* **Error Scan:** Check Alloy logs specifically for `scrape_uri` or `connection refused` errors.
```bash
kubectl logs -n monitoring -l name=alloy --tail=50
```
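To narrow the output to just the failure lines, the same command can be piped through `grep` on the host:
```bash
# Surface scrape and connection failures from a larger window of Alloy logs
kubectl logs -n monitoring -l name=alloy --tail=200 | grep -Ei "error|refused|scrape_uri"
```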
* **Recovery Reference:** Commands for redeploying and restarting the pieces of the telemetry pipeline.
```bash
# Re-apply the Alloy configuration and recycle its pods
kubectl apply -f infra/alloy-setup.yaml
kubectl delete pods -n monitoring -l name=alloy
kubectl get pods -n monitoring
kubectl describe pod <alloy-pod-name> -n monitoring   # use a pod name from the previous command
kubectl logs -n monitoring -l name=alloy --tail=50

# Inspect the environment injected into the Grafana container
kubectl get pod -n monitoring -l app=grafana -o jsonpath='{.items[0].spec.containers[0].env}' | jq

# Re-apply the blog config, restart the deployment, and follow its logs
kubectl apply -f apps/severed-blog-config.yaml
kubectl rollout restart deployment severed-blog -n severed-apps
kubectl logs -n severed-apps -l app=severed-blog -f

# Check Loki output and restart Grafana
kubectl logs loki-0 -n monitoring --tail=20
kubectl rollout restart deployment/grafana -n monitoring
```
* **Internal Handshake:** Run your `access-hub.sh` script and open the Alloy UI at `localhost:12345`.
* Find the `prometheus.exporter.nginx.blog` component.
* Ensure the health status is **Green/Up**.
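If `access-hub.sh` is not handy, a direct port-forward reaches the same UI (a sketch, assuming Alloy runs in the `monitoring` namespace behind the `name=alloy` label):
```bash
# Forward the Alloy UI port to localhost, then open http://localhost:12345
kubectl port-forward -n monitoring $(kubectl get pod -n monitoring -l name=alloy -o name | head -n 1) 12345:12345
```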
---
### 3. The Database: Prometheus Query Test
If the exporter is working, the metrics will appear in the Prometheus time-series database.
* **Live Traffic Check:** Verify that `nginx_http_requests_total` is returning a data vector (not an empty list `[]`).
```bash
kubectl exec -it prometheus-0 -n monitoring -- \
wget -qO- "http://localhost:9090/api/v1/query?query=nginx_http_requests_total"
```
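In the JSON that comes back, a healthy reply has `"resultType":"vector"` with a non-empty `"result"` array; an empty pipeline returns `"result":[]`.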
* **Metric Discovery:** List all Nginx-related metrics currently being stored.
```bash
kubectl exec -it prometheus-0 -n monitoring -- \
wget -qO- "http://localhost:9090/api/v1/label/__name__/values" | grep nginx
```
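If the metric exists but its value never moves, push some test traffic through the blog and re-run the query above (a sketch; the `severed-blog` Service name and port 80 are assumptions):
```bash
# Forward the blog service locally and fire a burst of requests
kubectl port-forward -n severed-apps svc/severed-blog 8080:80 &
sleep 2
for i in $(seq 1 50); do curl -s -o /dev/null http://localhost:8080/; done
kill %1  # stop the background port-forward
```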
---
### 4. The "Brain": Horizontal Pod Autoscaler (HPA)
The HPA is the final consumer of this data. If this is healthy, the cluster is auto-scaling correctly.
* **Target Alignment:** The `TARGETS` column should show a real value (e.g., `0/10`) rather than `<unknown>`.
```bash
kubectl get hpa -n severed-apps
```
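When the column shows `<unknown>`, the HPA's events usually name the failing metrics lookup:
```bash
# Conditions and events explain why the metric could not be read
kubectl describe hpa -n severed-apps
```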
* **Adapter Check:** Ensure the Custom Metrics API is serving the translated Nginx metrics to the Kubernetes control plane.
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/severed-apps/pods/*/nginx_http_requests_total"
```
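If that call returns an error, confirm the custom metrics APIService is registered and `Available`, and check the adapter's own logs (the `prometheus-adapter` deployment name and namespace are assumptions):
```bash
# The APIService must report Available=True for the HPA to see custom metrics
kubectl get apiservice v1beta1.custom.metrics.k8s.io
# Adapter logs reveal metric-name and query mismatches; deployment name assumed
kubectl logs -n monitoring deploy/prometheus-adapter --tail=50
```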
---
### Cheat Sheet
| Symptom | Probable Cause | Fix |
|----------------------------|-----------------------------|-------------------------------------------|
| `502 Bad Gateway` | Node resource exhaustion | Restart K3d or increase Docker RAM |
| `strconv.ParseFloat` error | Missing Nginx Exporter | Use `prometheus.exporter.nginx` in Alloy |
| HPA shows `<unknown>` | Prometheus Adapter mismatch | Verify `adapter-values.yaml` metric names |
| `No nodes found` | Corrupted cluster state | Run `k3d cluster delete` and recreate |