Building a GitOps-Owned Metrics Stack for the Home Lab

This is the metrics and dashboard setup we built for the Kubernetes home lab. The goal was not to recreate a full enterprise observability platform. The goal was to make the lab easier to operate: see whether the cluster is healthy, know which applications are failing, understand resource pressure before jobs get evicted, and keep the setup small enough that it can survive on home-lab hardware.

The resulting stack uses Prometheus for time-series metrics, Loki for logs, Grafana for dashboards, and OpenTelemetry Collector as the normalization layer for application metrics and log shipping. Everything is GitOps-managed through Flux, runs inside Kubernetes, and stores state on local ZFS-backed persistent volumes.

What We Wanted From Metrics

The home lab had moved from a collection of Docker Compose workloads and host services into Kubernetes. That made workloads more portable, but it also made the old troubleshooting habits weaker. Looking at one host with docker ps, docker logs, and top was no longer enough.

The first set of questions we wanted dashboards to answer was simple:

  • Is Kubernetes itself healthy?
  • Are both servers contributing, or did everything collapse onto one node?
  • Which workloads are restarting?
  • Are build jobs close to their CPU, memory, or ephemeral storage limits?
  • Is GitLab healthy enough to keep driving GitOps?
  • Is Plex alive from the user’s point of view, not just from Kubernetes’
    point of view?
  • Are Home Assistant, Nextcloud, and ingress producing useful metrics?
  • Can logs and metrics be viewed together from one place?

Those questions shaped the design. The stack needed Kubernetes-level metrics, application-level metrics, log correlation, and dashboards that explain the lab at a glance. It did not need global high availability, object storage, long retention, or cross-cluster federation yet.

The Shape of the Stack

The observability package lives under:

config/kubernetes/platform/observability/

The important pieces are:

  • helmrelease-prometheus.yaml for the Prometheus server.
  • helmrelease-loki.yaml for log storage.
  • helmrelease-grafana.yaml for dashboards and datasources.
  • helmrelease-opentelemetry-collector.yaml for Kubernetes and Plex log
    shipping.
  • helmrelease-opentelemetry-collector-metrics.yaml for application metrics
    scraping and normalization.
  • grafana-dashboards.yaml for GitOps-provisioned dashboards.
  • plex-exporter.yaml for the Plex Prometheus exporter.
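
For reference, the package's kustomization.yaml just enumerates those manifests so Flux applies them as one unit. This is a sketch; the actual file may include additional resources such as Secrets or a namespace definition:

# config/kubernetes/platform/observability/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - helmrelease-prometheus.yaml
  - helmrelease-loki.yaml
  - helmrelease-grafana.yaml
  - helmrelease-opentelemetry-collector.yaml
  - helmrelease-opentelemetry-collector-metrics.yaml
  - grafana-dashboards.yaml
  - plex-exporter.yaml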

The high-level flow looks like this:

Applications and Kubernetes metrics
        |
        v
OpenTelemetry Collector metrics Deployment
        |
        v
Prometheus
        |
        v
Grafana

Kubernetes pod logs and Plex application log files
        |
        v
OpenTelemetry Collector log DaemonSet
        |
        v
Loki
        |
        v
Grafana

Prometheus and Loki stay cluster-internal. Grafana is the human interface and is exposed on the LAN at:

https://<internal-metrics-hostname>

Grafana authenticates through the internal GitLab instance, so the metrics UI uses the same identity system as the source and CI platform.
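
One common way to wire that up is Grafana's built-in GitLab OAuth support, configured through the chart's grafana.ini values. A sketch, with the OAuth application ID/secret and hostnames as placeholders:

grafana.ini:
  server:
    root_url: https://<internal-metrics-hostname>
  auth.gitlab:
    enabled: true
    allow_sign_up: true
    client_id: <gitlab-oauth-application-id>
    client_secret: <gitlab-oauth-application-secret>
    scopes: read_api
    auth_url: https://<internal-gitlab-hostname>/oauth/authorize
    token_url: https://<internal-gitlab-hostname>/oauth/token
    api_url: https://<internal-gitlab-hostname>/api/v4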

Why Prometheus Stays Small

The Prometheus release is intentionally conservative. It runs a single server with one zfs-local PVC, 15 days of retention, and an 18 GB retention size cap. Alertmanager, Pushgateway, and node exporter are disabled in this slice.

That sounds limited, but it is deliberate. The home lab needs useful signals more than it needs a large monitoring estate. Prometheus currently scrapes:

  • cAdvisor metrics from Kubernetes nodes.
  • kube-state-metrics.
  • the OpenTelemetry metrics collector.
  • the internal ingress-nginx controller metrics endpoint.

The app-specific scrape work is delegated to OpenTelemetry Collector. That keeps Prometheus focused as the storage and query backend instead of making it own every detail of application discovery and relabeling.

The Prometheus chart is also configured without broad pod/service discovery. For this lab, explicit scrape targets are easier to review in Git and harder to accidentally expand into noisy or sensitive endpoints.
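
Expressed as Helm values, the shape is roughly the following. This is a sketch of the prometheus chart's values rather than the literal HelmRelease, and the storage class and job names are illustrative:

alertmanager:
  enabled: false
prometheus-pushgateway:
  enabled: false
prometheus-node-exporter:
  enabled: false
server:
  retention: 15d
  retentionSize: 18GB
  persistentVolume:
    storageClass: zfs-local
    size: 20Gi
extraScrapeConfigs: |
  - job_name: otel-collector-metrics
    static_configs:
      - targets:
          - <metrics-collector-service>.<namespace>.svc:9464
  - job_name: ingress-nginx-internal
    static_configs:
      - targets:
          - <internal-ingress-metrics-service>.<namespace>.svc:10254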

The Metrics Collector as a Boundary

The metrics collector is a separate OpenTelemetry Collector Deployment. It scrapes application endpoints and exports a Prometheus-compatible endpoint on port 9464. Prometheus then scrapes the collector.

That indirection is useful for three reasons.

First, application metrics are not all exposed the same way. GitLab uses Prometheus pod annotations. Nextcloud exposes OpenMetrics at /metrics. Home Assistant exposes metrics through /api/prometheus and requires a token. Plex does not expose Prometheus metrics directly, so it needs a dedicated exporter.

Second, credentials stay scoped. The Home Assistant token is mounted only into the metrics collector. The Plex token is mounted only into the Plex exporter. Prometheus does not need either secret.

Third, labels can be normalized before Prometheus stores the series. The collector adds a stable cluster resource attribute and sets explicit labels such as namespace, service, and app_kubernetes_io_name on scrape targets. That makes dashboards less dependent on whatever label shape each application happens to emit.

The current collector scrape targets include:

  • GitLab pod metrics from the gitlab namespace.
  • Nextcloud OpenMetrics from https://<nextcloud-hostname>/metrics.
  • Home Assistant metrics from
    <home-assistant-service>.<namespace>.svc:8123/api/prometheus.
  • Plex metrics from
    <plex-exporter-service>.<namespace>.svc:9594.

Prometheus then scrapes:

<metrics-collector-service>.<namespace>.svc:9464

This gives Prometheus one clean target for application metrics, while the collector owns the application-specific scrape details.
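
Put together, the metrics collector's configuration is roughly the following sketch, using the upstream opentelemetry-collector Helm chart layout, with only two of the scrape jobs shown and the cluster attribute value as a placeholder:

mode: deployment
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: nextcloud
            scheme: https
            metrics_path: /metrics
            static_configs:
              - targets: ['<nextcloud-hostname>']
          - job_name: plex
            static_configs:
              - targets: ['<plex-exporter-service>.<namespace>.svc:9594']
  processors:
    resource:
      attributes:
        # stable cluster attribute added to every series
        - key: cluster
          value: <cluster-name>
          action: upsert
  exporters:
    prometheus:
      endpoint: "0.0.0.0:9464"
  service:
    pipelines:
      metrics:
        receivers: [prometheus]
        processors: [resource]
        exporters: [prometheus]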

Kubernetes Health Without Node Exporter

The first host-health dashboards use kube-state-metrics and cAdvisor instead of node exporter. That keeps the initial footprint smaller while still giving enough information to operate the cluster.

The Home Lab Overview dashboard shows:

  • ready node count
  • cluster CPU usage percentage
  • cluster memory usage percentage
  • container filesystem usage
  • pod restarts over 24 hours
  • CPU usage by host
  • memory working set by host
  • container filesystem usage by host
  • pod counts by phase

For CPU, memory, and filesystem panels, the cAdvisor scrape labels hosts with kubernetes_io_hostname, so the dashboard uses that label instead of assuming a plain node label exists. That is a small detail, but it matters: dashboards should match the data that is actually being scraped, not the label shape we wish we had.
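
As a concrete example, the per-host CPU percentage panel can be written as a query along these lines (a sketch; it assumes cAdvisor's machine_cpu_cores series is present, and the dashboards add filters on top):

100 *
  sum by (kubernetes_io_hostname) (rate(container_cpu_usage_seconds_total{id="/"}[5m]))
/ sum by (kubernetes_io_hostname) (machine_cpu_cores)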

Metrics Server also exists in the platform, but it serves a different purpose. It backs Kubernetes’ live resource APIs, such as kubectl top, and is useful for point-in-time scheduling and troubleshooting. It is not the historical metrics store. Prometheus is where the longer-term resource history lives.

Ingress and Load Balancer Metrics

The internal ingress path is important enough to have its own dashboard. A lot of the lab’s user-facing services depend on the internal nginx ingress controller and the shared LAN VIP, so the question is not only “is the ingress pod running?” but also “is traffic flowing, are reloads healthy, and are responses degrading?”

The internal ingress-nginx HelmRelease enables controller metrics, which expose Prometheus data on port 10254. Prometheus scrapes:

<internal-ingress-metrics-service>.<namespace>.svc:10254
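
Enabling that in the ingress-nginx chart is a small values change; a sketch (the chart exposes the controller’s metrics endpoint on port 10254 by default):

controller:
  metrics:
    enabled: true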

The Internal Load Balancer dashboard includes panels for:

  • scrape target health
  • ready ingress hosts
  • request rate
  • p95 latency
  • config reload health
  • HTTP error rate
  • status classes
  • traffic by host and ingress
  • nginx connection state
  • bytes sent
  • ingress host CPU, memory, and filesystem usage

That dashboard is intentionally about the path users actually hit. It combines controller metrics with Kubernetes host metrics so ingress issues can be separated from node pressure.
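
Two of those panels, sketched as PromQL against the standard controller metric names:

# p95 request latency across the internal controller
histogram_quantile(0.95,
  sum by (le) (rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))

# 5xx responses as a fraction of all requests
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  / sum(rate(nginx_ingress_controller_requests[5m]))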

Application Metrics: GitLab, Nextcloud, Home Assistant, and Plex

Each app needed a slightly different collection strategy.

GitLab already emits Prometheus metrics from chart-managed pods. The collector uses Kubernetes pod discovery in the gitlab namespace and honors the chart’s Prometheus scrape annotations. It also avoids scraping the webservice pods in this slice because those endpoints were not the useful target for the initial dashboard.
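
Sketched as a collector scrape job, that is the usual annotation-honoring pattern with a drop rule for the webservice pods. The annotation and label names below follow the common prometheus.io/* convention and are illustrative rather than copied from the repository:

- job_name: gitlab-pods
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [gitlab]
  relabel_configs:
    # keep only pods that opt in via the chart's scrape annotations
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # honor a custom metrics path if the pod declares one
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # rewrite the target to the annotated port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: '([^:]+)(?::\d+)?;(\d+)'
      replacement: '$1:$2'
      target_label: __address__
    # skip the webservice pods in this slice
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: drop
      regex: webservice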

Nextcloud uses its OpenMetrics endpoint at:

https://<nextcloud-hostname>/metrics

Using the production hostname is intentional. Nextcloud still sees a trusted virtual host, while cluster DNS and ingress route the request internally. The application package has to allow the Kubernetes pod CIDR as an OpenMetrics client; otherwise the scrape fails.

Home Assistant uses the official Prometheus integration. The collector scrapes:

http://<home-assistant-service>.<namespace>.svc:8123/api/prometheus

That scrape uses a long-lived access token mounted from the home-assistant-prometheus-token Secret. The Home Assistant app package also ensures the prometheus: block exists in the host-mounted configuration.
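
The scrape job itself is an ordinary bearer-token scrape; a sketch, with the Secret mount path inside the collector as an assumption:

- job_name: home-assistant
  metrics_path: /api/prometheus
  static_configs:
    - targets: ['<home-assistant-service>.<namespace>.svc:8123']
  authorization:
    type: Bearer
    # long-lived access token mounted from the home-assistant-prometheus-token Secret
    credentials_file: /etc/secrets/home-assistant/token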

Plex was the most awkward because Plex itself does not expose native Prometheus metrics. The solution is plex-media-server-exporter, running in the observability namespace. It talks to Plex over:

https://<plex-service>.<namespace>.svc:32400

The exporter uses a seeded Plex token and exposes /metrics on port 9594. The first working validation showed plex_up=1, media count samples, and session gauges. That matters because the exporter being reachable is not the same as Plex being healthy. The dashboard needs Plex-level signals like sessions, transcodes, and media counts, not only Kubernetes pod readiness.

Logs Are Part of the Metrics Experience

Although this post is about metrics, the dashboards are more useful because Loki is in the same Grafana instance.

The log path uses a separate OpenTelemetry Collector DaemonSet. It collects Kubernetes pod stdout/stderr logs and exports them to Loki over OTLP HTTP. Loki runs in single-binary mode with filesystem storage on a zfs-local PVC, 14 day retention, and a 20 GiB volume.
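
The Loki side of that is a small values file. A rough sketch; key names shift between chart versions, so treat this as the shape rather than the literal release:

deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  limits_config:
    retention_period: 336h   # 14 days
  compactor:
    retention_enabled: true
singleBinary:
  replicas: 1
  persistence:
    storageClass: zfs-local
    size: 20Gi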

Plex needed one extra step. The Plex container is often quiet on stdout/stderr, while the useful operational details live in Plex’s own application log files. The log collector mounts the Plex log directory read-only from the host:

/opt/plex/config/Library/Application Support/Plex Media Server/Logs

Inside the collector, that becomes:

/var/log/plex-application/*.log

Those logs are labeled as:

k8s.namespace.name = plex
k8s.container.name = plex-file
service.namespace = plex
service.name = plex
log.source = plex-application-file

The receiver starts tailing at the end to avoid dumping a large backlog into Loki during rollout. That is the right default for a home lab: collect new evidence from now forward, then tune retention and file globs once real volume is known.
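
In collector terms, the Plex file pipeline is roughly this sketch; the hostPath mount into the DaemonSet is omitted, and the Loki service name is a placeholder:

receivers:
  filelog/plex:
    include:
      - /var/log/plex-application/*.log
    start_at: end            # tail from now on; do not replay the backlog into Loki
processors:
  resource/plex:
    attributes:
      - { key: k8s.namespace.name, value: plex, action: upsert }
      - { key: k8s.container.name, value: plex-file, action: upsert }
      - { key: service.namespace, value: plex, action: upsert }
      - { key: service.name, value: plex, action: upsert }
      - { key: log.source, value: plex-application-file, action: upsert }
exporters:
  otlphttp/loki:
    endpoint: http://<loki-service>.<namespace>.svc:3100/otlp
service:
  pipelines:
    logs/plex:
      receivers: [filelog/plex]
      processors: [resource/plex]
      exporters: [otlphttp/loki]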

Grafana Is Provisioned, Not Click-Managed

The dashboards are stored in:

config/kubernetes/platform/observability/grafana-dashboards.yaml

Grafana loads them from a ConfigMap through the Helm chart’s dashboard provider configuration. The dashboards are not editable drift inside the Grafana UI. Changes go through Git, Flux applies them, and the live dashboards match the repository.
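
In the Grafana chart, that provisioning looks roughly like the following sketch, with the ConfigMap name as a placeholder for whatever grafana-dashboards.yaml creates:

dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: home-lab
        orgId: 1
        folder: ""
        type: file
        disableDeletion: true
        allowUiUpdates: false      # dashboards stay read-only in the UI; changes go through Git
        options:
          path: /var/lib/grafana/dashboards/home-lab
dashboardsConfigMaps:
  home-lab: <grafana-dashboards-configmap>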

The current provisioned dashboards cover:

  • Home Lab Overview
  • Home Assistant
  • Internal Load Balancer
  • Plex
  • GitLab
  • Nextcloud
  • OpenTelemetry Collector health
  • Logs

The overview dashboard is the landing page for cluster health. App dashboards then answer service-specific questions. For example, the Plex dashboard combines Plex exporter metrics with Loki panels for warnings, errors, playback, stream, and session log lines.

This is a pattern worth keeping: dashboards should be owned beside the workloads and platform that produce the data. Otherwise the metrics stack slowly becomes a separate manual system with its own hidden state.

Storage and Retention Choices

The storage design is intentionally local and boring:

  • Loki: one 20 GiB zfs-local PVC, 14 day retention.
  • Prometheus: one 20 GiB zfs-local PVC, 15 day retention, 18 GB size cap.
  • Grafana: one 10 GiB zfs-local PVC.

There is no MinIO-backed object storage for Loki or Prometheus in this slice. That may come later, but it would add another dependency and another recovery problem. The first version should answer operational questions before it tries to become durable infrastructure for every possible failure mode.

The rollback story is also simpler this way. If log shipping causes trouble, the OpenTelemetry log collector can be suspended while kubectl logs and journalctl remain available. If metrics scraping gets noisy, scrape jobs can be removed from the collector without changing application deployments.
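
With Flux, suspending a slice is a single CLI call; the release name and namespace below are placeholders for the actual objects:

# pause reconciliation of the log collector without touching anything else
flux suspend helmrelease opentelemetry-collector -n observability

# let Flux reconcile it again once the problem is understood
flux resume helmrelease opentelemetry-collector -n observability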

What This Setup Already Helped With

The most useful part of the stack has been moving from “the pod is running” to “the service is behaving.”

For Plex, Kubernetes could show a Ready pod while the exporter showed plex_up=0. That forced the scrape path, TLS behavior, and Plex API access to be fixed until real Plex metrics appeared. The final validation showed plex_up=1, a nonzero media count, and session metrics.

For the cluster, the Home Lab Overview dashboard made host pressure visible. That matters for CI jobs, especially builds that can consume large amounts of ephemeral storage. Seeing CPU, memory, filesystem use, pod phases, and restarts in one place makes it easier to tune runner limits based on evidence instead of guessing.

For ingress, enabling nginx metrics gave the internal load balancer its own view. Instead of checking only whether the ingress controller pod is Ready, the dashboard shows traffic, latency, response classes, reload health, and the hosts carrying the workload.

Tradeoffs

This setup is intentionally not perfect.

There is no Alertmanager yet. That is acceptable while dashboards and metrics names are still settling. Alerting too early usually creates noisy rules that people learn to ignore.

There is no node exporter yet. cAdvisor and kube-state-metrics are enough for the first host-health dashboards, but node exporter will probably be useful later for ZFS, disk, network, and host-level signals that Kubernetes does not expose cleanly.

Loki uses filesystem storage rather than object storage. That is fine for a single-cluster home lab with short retention, but it is not the design for long-term archival logs.

Prometheus uses explicit scrape configuration instead of broad automatic discovery. That means new apps need deliberate scrape additions, but it also keeps the metrics surface reviewable.

Where This Goes Next

The next improvements should be driven by observed gaps rather than dashboard completeness.

The first likely addition is better CI runner visibility. Build jobs have already shown that CPU, memory, and ephemeral storage matter. The current cluster panels show the resource pressure, but a runner-specific dashboard would make it clearer which job classes need larger limits and which nodes are absorbing the load.

Node exporter is also likely. Once the Kubernetes-level picture is stable, host-level metrics for ZFS, disks, NICs, and system services will close a real blind spot.

Alerting should come after that. The first alerts should be boring and actionable: Prometheus scrape target down, Loki not ingesting logs, Grafana unavailable, Kubernetes node not Ready, PVC filling, and runner storage approaching eviction thresholds.
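
When those rules do land, they can stay just as boring in form. A sketch of two of them as plain Prometheus rules against metrics that are already being scraped:

groups:
  - name: home-lab-basics
    rules:
      - alert: ScrapeTargetDown
        expr: up == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.job }} has been down for 10 minutes"
      - alert: KubernetesNodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not Ready"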

The important part is that the foundation is now GitOps-owned. Metrics, dashboards, scrape targets, retention, and access are all declared in the same repository as the rest of the platform. That makes observability part of the home lab rather than an emergency tool bolted on after something breaks.
