Monitoring with Prometheus, Grafana, and Loki
By David Le -- Part 18 of the FhirHub Series
A deployed application isn't done. It's just the beginning. Without monitoring, you're flying blind -- you won't know about failures until users report them, you can't diagnose performance problems without metrics, and you can't trace issues without centralized logs.
This post covers the observability stack I built for FhirHub: Prometheus for metrics collection, Grafana for dashboards and alerting, and Loki for log aggregation. Combined with the developer experience tooling, this completes the full monitoring lifecycle.
Why This Stack?
Prometheus + Grafana + Loki vs. Datadog vs. ELK vs. CloudWatch
| Stack | Cost | Self-Hosted | K8s Native | Logs + Metrics |
|---|---|---|---|---|
| Prometheus + Grafana + Loki | Free (OSS) | Yes | ServiceMonitor CRD | Yes |
| Datadog | ~$15/host/month | No | Agent-based | Yes |
| ELK (Elastic) | Free (OSS) or paid | Yes | Filebeat/Metricbeat | Yes |
| CloudWatch | Pay-per-metric | No (AWS only) | Limited | Yes |
The Prometheus stack is the Kubernetes-native choice. ServiceMonitor CRDs auto-discover pods to scrape. Grafana dashboards are JSON files that live in Git. Loki uses the same label-based query language as Prometheus. No vendor lock-in, no per-host pricing.
Why Not Datadog?
Datadog is excellent -- unified metrics, logs, traces, and APM in one SaaS. But at ~$15/host/month, costs scale with infrastructure. For an open-source project like FhirHub, self-hosted monitoring means anyone can run the full stack without a subscription. The Prometheus ecosystem also integrates more deeply with Kubernetes through CRDs.
Why Not ELK?
The Elastic stack (Elasticsearch, Logstash, Kibana) is powerful but resource-heavy. Elasticsearch requires significant memory for indexing. Loki takes a fundamentally different approach -- it only indexes log labels (pod name, namespace, container), not the log content. This makes Loki dramatically cheaper to run. You trade full-text search speed for lower resource usage, which is the right trade-off for application logs where you typically filter by service first.
Application Metrics
Adding Prometheus metrics to the .NET API required two lines:
// Program.cs
using Prometheus; // from the prometheus-net.AspNetCore package

app.UseHttpMetrics(); // Records request duration, count, and status code for every request
app.MapMetrics();     // Exposes the /metrics endpoint for Prometheus to scrape
The prometheus-net.AspNetCore NuGet package instruments all HTTP requests automatically -- no manual counters needed for the basics. Out of the box, the exporter records:
- http_request_duration_seconds -- histogram with method, status code, and path labels
- http_requests_received_total -- counter by method and status code
- dotnet_gc_collection_count_total -- GC collections by generation
- dotnet_threadpool_num_threads -- active thread count
- process_working_set_bytes -- process memory usage
Why prometheus-net vs. OpenTelemetry?
| Library | Protocol | Metrics + Traces | Maturity (.NET) | Setup Complexity |
|---|---|---|---|---|
| prometheus-net | Prometheus scrape | Metrics only | Mature (v8+) | 2 lines |
| OpenTelemetry .NET | OTLP push | Yes | GA but evolving | Collector + config |
prometheus-net is simpler for metrics-only instrumentation. OpenTelemetry is the right choice if you also want distributed tracing, but FhirHub doesn't need trace correlation across services yet. Adding OpenTelemetry later is an additive change -- the Prometheus metrics stay the same.
Checkpoint: Verify Metrics Endpoint
Before continuing, verify the API exposes Prometheus metrics. In a separate terminal:
kubectl port-forward -n fhirhub-dev svc/fhirhub-api 5197:8080
Then in your main terminal:
curl -s http://localhost:5197/metrics | head -20
Expected output:
- Should show Prometheus-format metrics, with lines starting with # HELP and # TYPE followed by metric values
- Look for http_request_duration_seconds and dotnet_gc_collection_count_total in the output
If something went wrong:
- If the connection is refused, check that the API pod is running: kubectl get pods -n fhirhub-dev -l app.kubernetes.io/name=fhirhub-api
- If /metrics returns 404, verify that app.MapMetrics() is called in Program.cs and the prometheus-net.AspNetCore NuGet package is installed
ServiceMonitor CRDs
Each Helm sub-chart includes a ServiceMonitor template that tells Prometheus which pods to scrape:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fhirhub-api
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fhirhub-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
The release: prometheus label is how kube-prometheus-stack discovers ServiceMonitors. Without it, Prometheus ignores the resource.
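Under the hood, that label is matched by the selector on the Prometheus custom resource that kube-prometheus-stack creates. A minimal sketch of the relevant fields, assuming the chart was installed with the release name prometheus (the resource name and defaults depend on your Helm values):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-kube-prometheus-prometheus
  namespace: monitoring
spec:
  # Only ServiceMonitors carrying this label are picked up for scraping.
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  # An empty selector here means ServiceMonitors are discovered in all namespaces.
  serviceMonitorNamespaceSelector: {}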
Why ServiceMonitor vs. Annotations?
| Discovery Method | Typed | Scoped | Multiple Endpoints |
|---|---|---|---|
| prometheus.io/scrape annotation | No | Pod only | No |
| ServiceMonitor CRD | Yes | Namespace-scoped | Yes |
| PodMonitor CRD | Yes | Pod-scoped | Yes |
Annotations are the legacy approach -- a boolean on each pod. ServiceMonitors are typed CRDs with validation, namespace scoping, and support for multiple endpoints with different intervals. They're the standard for kube-prometheus-stack.
Grafana Dashboards
Two Grafana dashboards ship as JSON in the repository, loaded via ConfigMaps:
FhirHub API Dashboard
- HTTP request rate by method and endpoint
- Latency percentiles (p50, p95, p99)
- 5xx error rate as a percentage
- GC collection count by generation
- Thread pool thread count
- Process working set memory
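The latency and error-rate panels above boil down to a couple of PromQL expressions over the prometheus-net histogram and counter. Sketches of what those queries look like (label names such as code follow prometheus-net defaults; check your /metrics output for the exact labels):

# 5xx error rate as a percentage of all requests
100 * sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m]))

# p95 latency from the request-duration histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))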
FhirHub Overview Dashboard
- Service health status (up/down) for all 4 services
- Total request throughput stacked by service
- PostgreSQL active connections by state
- Pod CPU and memory usage by component
- Keycloak login count (success vs. failed)
Why Dashboards-as-Code?
| Approach | Version Controlled | Reproducible | Reviewable |
|---|---|---|---|
| Manual Grafana UI | No | No | No |
| Grafana provisioning API | Partially | Yes | Depends |
| JSON in Git + ConfigMap | Yes | Yes | PR review |
Storing dashboards as JSON in the repository means they're versioned, reviewable in PRs, and automatically deployed with the rest of the infrastructure. A new team member gets the same dashboards as everyone else. No manual setup.
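The mechanism behind this is the Grafana dashboard sidecar that ships with kube-prometheus-stack: it watches for ConfigMaps carrying a grafana_dashboard label and loads any JSON they contain. A rough sketch of such a ConfigMap (names and the folder annotation are illustrative, not FhirHub's exact manifests):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fhirhub-api-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the label the sidecar watches for by default
  annotations:
    grafana_folder: FhirHub  # applies only if the sidecar's folderAnnotation is enabled
data:
  # The value is the full dashboard JSON exported from Grafana.
  fhirhub-api.json: |
    { "title": "FhirHub API", "panels": [] }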
Alerting Rules
Seven PrometheusRule alerts cover the critical failure modes:
| Alert | Condition | For | Severity |
|---|---|---|---|
| ApiDown | up{job="fhirhub-api"} == 0 | 5m | Critical |
| HapiFhirDown | up{job="hapi-fhir"} == 0 | 5m | Critical |
| KeycloakDown | up{job="keycloak"} == 0 | 5m | Critical |
| PostgreSQLDown | up{job="postgresql"} == 0 | 5m | Critical |
| HighErrorRate | >5% HTTP 5xx | 5m | Warning |
| HighLatency | p99 > 2 seconds | 10m | Warning |
| HighMemory | >90% memory limit | 10m | Warning |
Critical alerts fire after 5 minutes of downtime. Warning alerts give longer windows to avoid noise from transient spikes.
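For reference, here is roughly what two of these rules look like as a PrometheusRule resource -- a sketch, not the exact rules in the repo. Note the same release: prometheus label that kube-prometheus-stack needs in order to load it:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fhirhub-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: fhirhub
      rules:
        - alert: ApiDown
          expr: up{job="fhirhub-api"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: FhirHub API has been down for 5 minutes
        - alert: HighLatency
          expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="fhirhub-api"}[5m]))) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: p99 request latency is above 2 seconds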
Why 5 Minutes for Critical?
| Threshold | False Positives | Detection Speed | Operator Fatigue |
|---|---|---|---|
| 1 minute | High (transient restarts) | Fast | High |
| 5 minutes | Low | Moderate | Low |
| 15 minutes | Very low | Slow | Very low |
Five minutes is the standard balance. Kubernetes restarts crashed pods quickly, so a brief restart shouldn't page anyone. If a service is still down after 5 minutes, something is genuinely wrong and needs human attention.
Checkpoint: Verify Monitoring Stack
Before continuing, verify the full monitoring stack is running:
make monitoring-up
kubectl get pods -n monitoring
Expected output:
- All monitoring pods (prometheus, grafana, alertmanager, node-exporter, kube-state-metrics) should be Running
Verify Prometheus is scraping targets:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
- Open http://localhost:9090/targets -- scrape targets should show UP state for the FhirHub services
- Open http://localhost:9090/alerts -- should list all 7 alert rules (ApiDown, HapiFhirDown, KeycloakDown, PostgreSQLDown, HighErrorRate, HighLatency, HighMemory)
Verify Grafana dashboards:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
- Open http://localhost:3000 -- log in with the admin credentials
- Navigate to Dashboards and look for the FhirHub folder -- it should contain the "FhirHub API" and "FhirHub Overview" dashboards
- Open "FhirHub API" -- panels should show data for request rate, latency, error rate, and runtime metrics
If something went wrong:
- If Grafana shows "No data" on dashboards, verify the Prometheus datasource is configured correctly in Grafana's datasource settings
Log Aggregation with Loki
Loki collects logs from all containers via Promtail. Unlike Elasticsearch, Loki doesn't index log content -- it indexes labels only and stores compressed log chunks. This makes it lightweight enough to run alongside Prometheus without doubling infrastructure costs.
Pipeline Stages
Promtail uses pipeline stages to parse different log formats:
| Service | Format | Extracted Fields |
|---|---|---|
| FhirHub API | Serilog JSON | Level, RenderedMessage, SourceContext, RequestPath, StatusCode, TraceId |
| Frontend | Next.js stdout | Regex + JSON stages |
| HAPI FHIR | Java Logback | Timestamp, level, thread, logger, message |
| Keycloak | Keycloak logs | Timestamp, level, thread, logger, message |
Each service gets a pipeline stage that extracts structured fields from its log format. In Grafana's Explore view, you can query all services with LogQL:
{namespace="fhirhub-prod", app="fhirhub-api"} |= "error"
{namespace="fhirhub-prod"} | json | level="Error" | line_format "{{.RenderedMessage}}"
Why Loki vs. Elasticsearch for Logs?
| Feature | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only | Full-text |
| Memory usage | Low (~256MB) | High (~2GB+) |
| Query speed (filtered) | Fast | Fast |
| Query speed (unfiltered) | Slower | Fast |
| Query language | LogQL | KQL / Lucene |
| Grafana integration | Native | Plugin |
Loki wins for application log aggregation where you almost always filter by service, namespace, or pod first. Elasticsearch wins when you need full-text search across all logs without knowing which service produced them. For FhirHub, the former is the common case.
Checkpoint: Verify Log Aggregation
Before continuing, verify Loki is collecting logs from FhirHub pods.
In Grafana (http://localhost:3000), go to Explore and select the Loki datasource. Run these queries:
{namespace="fhirhub-dev"}-- should return log lines from all FhirHub pods{namespace="fhirhub-dev", app="fhirhub-api"} |= "info"-- should return API info-level logs
Expected output:
- Log lines appear with timestamps, pod names, and log content. You should see startup messages at minimum
If something went wrong:
- If no logs appear, check that Promtail is running: kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
- If Promtail is running but Loki shows no data, verify that the Loki datasource URL in Grafana is correct (typically http://loki:3100)
- If only some pods show logs, check that Promtail's scrape config covers the fhirhub-dev namespace
Developer Experience
Makefile
Every common operation is a make target:
make up # Start all services
make dev # Start infrastructure only
make dev-frontend # Run frontend with hot-reload
make test # Run all tests
make lint # Run all linters
make build-api # Build API Docker image
make push-api # Push to Docker Hub
make k8s-create # Create local Kind cluster
make k8s-deploy # Deploy via Helm
make monitoring-up # Deploy monitoring stack
make clean # Remove everything
No one should need to remember Docker Compose flags or Helm command syntax.
Local Kubernetes with Kind
The scripts/setup-local-k8s.sh script creates a complete local environment:
- Creates a Kind cluster (1 control plane + 2 workers)
- Installs nginx-ingress for local routing
- Installs cert-manager for TLS
- Installs ArgoCD
- Builds and loads local Docker images
- Deploys FhirHub via Helm
- Applies ArgoCD ApplicationSet
One command: make k8s-create. The entire stack running on your laptop.
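The cluster topology from that first step is just a Kind config file. A minimal sketch of what gets passed to kind create cluster (the file name and port mappings are illustrative, not necessarily what setup-local-k8s.sh uses):

# kind-config.yaml -- 1 control plane + 2 workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # Map 80/443 to the host so nginx-ingress is reachable from the laptop.
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
      - containerPort: 443
        hostPort: 443
  - role: worker
  - role: worker

Something like kind create cluster --config kind-config.yaml (plus a --name) brings it up; everything after that is Helm and ArgoCD.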
How It All Connects
The full deployment flow ties every tool together:
Developer pushes to main
│
├── GitHub Actions CI/CD (Post 15)
│   ├── Run .NET tests
│   ├── Run frontend tests
│   ├── Build + push API image to Docker Hub
│   ├── Build + push frontend image to Docker Hub
│   ├── Trivy scan both images
│   └── Update helm/fhirhub/values.yaml with new image tag
│
├── ArgoCD detects values.yaml change (Post 17)
│   ├── Syncs dev namespace (automated)
│   ├── Syncs staging namespace (automated)
│   └── Prod waits for release branch merge
│
└── Prometheus + Grafana + Loki (this post)
    ├── Scrapes /metrics from all pods
    ├── Collects logs via Promtail
    ├── Fires alerts if services go down
    └── Dashboards show request rates, latency, errors
Git is the single source of truth. Push code, pipelines build and test it, ArgoCD deploys it, monitoring verifies it's healthy. No manual steps.
What's Next
In Part 19, we'll deploy the entire FhirHub stack -- application services, databases, and monitoring -- on a single machine using k3s. We'll compare k3s vs. MicroK8s vs. kubeadm, adapt the Helm values for constrained resources, and set up Traefik ingress with a backup strategy for single-node PostgreSQL.