Monitoring with Prometheus, Grafana, and Loki
By David Le -- Part 18 of the FhirHub Series
A deployed application isn't done. It's just the beginning. Without monitoring, you're flying blind -- you won't know about failures until users report them, you can't diagnose performance problems without metrics, and you can't trace issues without centralized logs.
This post covers the observability stack I built for FhirHub: Prometheus for metrics collection, Grafana for dashboards and alerting, and Loki for log aggregation. Combined with the developer experience tooling, this completes the full monitoring lifecycle.
Why This Stack?
Prometheus + Grafana + Loki vs. Datadog vs. ELK vs. CloudWatch
| Stack | Cost | Self-Hosted | K8s Native | Logs + Metrics |
|---|---|---|---|---|
| Prometheus + Grafana + Loki | Free (OSS) | Yes | ServiceMonitor CRD | Yes |
| Datadog | ~$15/host/month | No | Agent-based | Yes |
| ELK (Elastic) | Free (OSS) or paid | Yes | Filebeat/Metricbeat | Yes |
| CloudWatch | Pay-per-metric | No (AWS only) | Limited | Yes |
The Prometheus stack is the Kubernetes-native choice. ServiceMonitor CRDs auto-discover pods to scrape. Grafana dashboards are JSON files that live in Git. Loki uses the same label-based query language as Prometheus. No vendor lock-in, no per-host pricing.
Why Not Datadog?
Datadog is excellent -- unified metrics, logs, traces, and APM in one SaaS. But at ~$15/host/month, costs scale with infrastructure. For an open-source project like FhirHub, self-hosted monitoring means anyone can run the full stack without a subscription. The Prometheus ecosystem also integrates more deeply with Kubernetes through CRDs.
Why Not ELK?
The Elastic stack (Elasticsearch, Logstash, Kibana) is powerful but resource-heavy. Elasticsearch requires significant memory for indexing. Loki takes a fundamentally different approach -- it only indexes log labels (pod name, namespace, container), not the log content. This makes Loki dramatically cheaper to run. You trade full-text search speed for lower resource usage, which is the right trade-off for application logs where you typically filter by service first.
Application Metrics
Adding Prometheus metrics to the .NET API required two lines:
// Program.cs
using Prometheus; // from the prometheus-net.AspNetCore package

app.UseHttpMetrics(); // Records request duration, count, and status code for every request
app.MapMetrics();     // Exposes the /metrics endpoint for Prometheus to scrape
The prometheus-net.AspNetCore NuGet package instruments all HTTP requests automatically -- no manual counters needed for the basics. Out of the box, the exporter records:
- http_request_duration_seconds -- histogram with method, status code, and path labels
- http_requests_received_total -- counter by method and status code
- dotnet_gc_collection_count_total -- GC collections by generation
- dotnet_threadpool_num_threads -- active thread count
- process_working_set_bytes -- process memory usage
Why prometheus-net vs. OpenTelemetry?
| Library | Protocol | Metrics + Traces | Maturity (.NET) | Setup Complexity |
|---|---|---|---|---|
| prometheus-net | Prometheus scrape | Metrics only | Mature (v8+) | 2 lines |
| OpenTelemetry .NET | OTLP push | Yes | GA but evolving | Collector + config |
prometheus-net is simpler for metrics-only instrumentation. OpenTelemetry is the right choice if you also want distributed tracing, but FhirHub doesn't need trace correlation across services yet. Adding OpenTelemetry later is an additive change -- the Prometheus metrics stay the same.
Checkpoint: Verify Metrics Endpoint
Before continuing, verify the API exposes Prometheus metrics. In a separate terminal:
kubectl port-forward -n fhirhub-dev svc/fhirhub-api 5197:8080
Then in your main terminal:
curl -s http://localhost:5197/metrics | head -20
Expected output:
- Should show Prometheus-format metrics, with lines starting with # HELP and # TYPE followed by metric values
- Look for http_request_duration_seconds and dotnet_gc_collection_count_total in the output
If something went wrong:
- If the connection is refused, check that the API pod is running: kubectl get pods -n fhirhub-dev -l app.kubernetes.io/name=fhirhub-api
- If /metrics returns 404, verify that app.MapMetrics() is called in Program.cs and the prometheus-net.AspNetCore NuGet package is installed
ServiceMonitor CRDs
Each Helm sub-chart includes a ServiceMonitor template that tells Prometheus which pods to scrape:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fhirhub-api
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fhirhub-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
The release: prometheus label is how kube-prometheus-stack discovers ServiceMonitors. Without it, Prometheus ignores the resource.
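Under the hood, that label is matched by the selector on the Prometheus custom resource that kube-prometheus-stack creates. A minimal sketch of the relevant fields, assuming the chart was installed with the release name prometheus (the resource name and defaults depend on your Helm values):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-kube-prometheus-prometheus
  namespace: monitoring
spec:
  # Only ServiceMonitors carrying this label are picked up for scraping.
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  # An empty selector here means ServiceMonitors are discovered in all namespaces.
  serviceMonitorNamespaceSelector: {}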
Why ServiceMonitor vs. Annotations?
| Discovery Method | Typed | Scoped | Multiple Endpoints |
|---|---|---|---|
| prometheus.io/scrape annotation | No | Pod only | No |
| ServiceMonitor CRD | Yes | Namespace-scoped | Yes |
| PodMonitor CRD | Yes | Pod-scoped | Yes |
Annotations are the legacy approach -- a boolean on each pod. ServiceMonitors are typed CRDs with validation, namespace scoping, and support for multiple endpoints with different intervals. They're the standard for kube-prometheus-stack.
Grafana Dashboards
Two Grafana dashboards ship as JSON in the repository, loaded via ConfigMaps:
FhirHub API Dashboard
- HTTP request rate by method and endpoint
- Latency percentiles (p50, p95, p99)
- 5xx error rate as a percentage
- GC collection count by generation
- Thread pool thread count
- Process working set memory
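The latency and error-rate panels above boil down to a couple of PromQL expressions over the prometheus-net histogram and counter. Sketches of what those queries look like (label names such as code follow prometheus-net defaults; check your /metrics output for the exact labels):

# 5xx error rate as a percentage of all requests
100 * sum(rate(http_requests_received_total{code=~"5.."}[5m])) / sum(rate(http_requests_received_total[5m]))

# p95 latency from the request-duration histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))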
FhirHub Overview Dashboard
- Service health status (up/down) for all 4 services
- Total request throughput stacked by service
- PostgreSQL active connections by state
- Pod CPU and memory usage by component
- Keycloak login count (success vs. failed)
Why Dashboards-as-Code?
| Approach | Version Controlled | Reproducible | Reviewable |
|---|---|---|---|
| Manual Grafana UI | No | No | No |
| Grafana provisioning API | Partially | Yes | Depends |
| JSON in Git + ConfigMap | Yes | Yes | PR review |
Storing dashboards as JSON in the repository means they're versioned, reviewable in PRs, and automatically deployed with the rest of the infrastructure. A new team member gets the same dashboards as everyone else. No manual setup.
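The mechanism behind this is the Grafana dashboard sidecar that ships with kube-prometheus-stack: it watches for ConfigMaps carrying a grafana_dashboard label and loads any JSON they contain. A rough sketch of such a ConfigMap (names and the folder annotation are illustrative, not FhirHub's exact manifests):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fhirhub-api-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the label the sidecar watches for by default
  annotations:
    grafana_folder: FhirHub  # applies only if the sidecar's folderAnnotation is enabled
data:
  # The value is the full dashboard JSON exported from Grafana.
  fhirhub-api.json: |
    { "title": "FhirHub API", "panels": [] }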
Alerting Rules
Seven PrometheusRule alerts cover the critical failure modes:
| Alert | Condition | For | Severity |
|---|---|---|---|
| ApiDown | up{job="fhirhub-api"} == 0 | 5m | Critical |
| HapiFhirDown | up{job="hapi-fhir"} == 0 | 5m | Critical |
| KeycloakDown | up{job="keycloak"} == 0 | 5m | Critical |
| PostgreSQLDown | up{job="postgresql"} == 0 | 5m | Critical |
| HighErrorRate | >5% HTTP 5xx | 5m | Warning |
| HighLatency | p99 > 2 seconds | 10m | Warning |
| HighMemory | >90% memory limit | 10m | Warning |
Critical alerts fire after 5 minutes of downtime. Warning alerts give longer windows to avoid noise from transient spikes.
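For reference, here is roughly what two of these rules look like as a PrometheusRule resource -- a sketch, not the exact rules in the repo. Note the same release: prometheus label that kube-prometheus-stack needs in order to load it:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fhirhub-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: fhirhub
      rules:
        - alert: ApiDown
          expr: up{job="fhirhub-api"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: FhirHub API has been down for 5 minutes
        - alert: HighLatency
          expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="fhirhub-api"}[5m]))) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: p99 request latency is above 2 seconds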
Why 5 Minutes for Critical?
| Threshold | False Positives | Detection Speed | Operator Fatigue |
|---|---|---|---|
| 1 minute | High (transient restarts) | Fast | High |
| 5 minutes | Low | Moderate | Low |
| 15 minutes | Very low | Slow | Very low |
Five minutes is the standard balance. Kubernetes restarts crashed pods quickly, so a brief restart shouldn't page anyone. If a service is still down after 5 minutes, something is genuinely wrong and needs human attention.
Checkpoint: Verify Monitoring Stack
Before continuing, verify the full monitoring stack is running:
make monitoring-up
kubectl get pods -n monitoring
Expected output:
- All monitoring pods (prometheus, grafana, alertmanager, node-exporter, kube-state-metrics) should be Running
Verify Prometheus is scraping targets:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
- Open http://localhost:9090/targets -- scrape targets should show UP state for the FhirHub services
- Open http://localhost:9090/alerts -- should list all 7 alert rules (ApiDown, HapiFhirDown, KeycloakDown, PostgreSQLDown, HighErrorRate, HighLatency, HighMemory)
Verify Grafana dashboards:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
- Open http://localhost:3000 -- log in with the admin credentials
- Navigate to Dashboards and look for the FhirHub folder -- it should contain the "FhirHub API" and "FhirHub Overview" dashboards
- Open "FhirHub API" -- panels should show data for request rate, latency, error rate, and runtime metrics
If something went wrong:
- If Grafana shows "No data" on dashboards, verify the Prometheus datasource is configured correctly in Grafana's datasource settings
Log Aggregation with Loki
Loki collects logs from all containers via Promtail. Unlike Elasticsearch, Loki doesn't index log content -- it indexes labels only and stores compressed log chunks. This makes it lightweight enough to run alongside Prometheus without doubling infrastructure costs.
Pipeline Stages
Promtail uses pipeline stages to parse different log formats:
| Service | Format | Extracted Fields |
|---|---|---|
| FhirHub API | Serilog JSON | Level, RenderedMessage, SourceContext, RequestPath, StatusCode, TraceId |
| Frontend | Next.js stdout | Regex + JSON stages |
| HAPI FHIR | Java Logback | Timestamp, level, thread, logger, message |
| Keycloak | Keycloak logs | Timestamp, level, thread, logger, message |
Each service gets a pipeline stage that extracts structured fields from its log format. In Grafana's Explore view, you can query all services with LogQL:
{namespace="fhirhub-prod", app="fhirhub-api"} |= "error"
{namespace="fhirhub-prod"} | json | level="Error" | line_format "{{.RenderedMessage}}"
Why Loki vs. Elasticsearch for Logs?
| Feature | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only | Full-text |
| Memory usage | Low (~256MB) | High (~2GB+) |
| Query speed (filtered) | Fast | Fast |
| Query speed (unfiltered) | Slower | Fast |
| Query language | LogQL | KQL / Lucene |
| Grafana integration | Native | Plugin |
Loki wins for application log aggregation where you almost always filter by service, namespace, or pod first. Elasticsearch wins when you need full-text search across all logs without knowing which service produced them. For FhirHub, the former is the common case.
Checkpoint: Verify Log Aggregation
Before continuing, verify Loki is collecting logs from FhirHub pods.
In Grafana (http://localhost:3000), go to Explore and select the Loki datasource. Run these queries:
{namespace="fhirhub-dev"}-- should return log lines from all FhirHub pods{namespace="fhirhub-dev", app="fhirhub-api"} |= "info"-- should return API info-level logs
Expected output:
- Log lines appear with timestamps, pod names, and log content. You should see startup messages at minimum
If something went wrong:
- If no logs appear, check that Promtail is running: kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
- If Promtail is running but Loki shows no data, verify that the Loki datasource URL in Grafana is correct (typically http://loki:3100)
- If only some pods show logs, check that Promtail's scrape config covers the fhirhub-dev namespace
Developer Experience
Makefile
Every common operation is a make target:
make up # Start all services
make dev # Start infrastructure only
make dev-frontend # Run frontend with hot-reload
make test # Run all tests
make lint # Run all linters
make build-api # Build API Docker image
make push-api # Push to Docker Hub
make k8s-create # Create local Kind cluster
make k8s-deploy # Deploy via Helm
make monitoring-up # Deploy monitoring stack
make clean # Remove everything
No one should need to remember Docker Compose flags or Helm command syntax.
Local Kubernetes with Kind
The scripts/setup-local-k8s.sh script creates a complete local environment:
- Creates a Kind cluster (1 control plane + 2 workers)
- Installs nginx-ingress for local routing
- Installs cert-manager for TLS
- Installs ArgoCD
- Builds and loads local Docker images
- Deploys FhirHub via Helm
- Applies ArgoCD ApplicationSet
One command: make k8s-create. The entire stack running on your laptop.
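The cluster topology from that first step is just a Kind config file. A minimal sketch of what gets passed to kind create cluster (the file name and port mappings are illustrative, not necessarily what setup-local-k8s.sh uses):

# kind-config.yaml -- 1 control plane + 2 workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # Map 80/443 to the host so nginx-ingress is reachable from the laptop.
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
      - containerPort: 443
        hostPort: 443
  - role: worker
  - role: worker

Something like kind create cluster --config kind-config.yaml (plus a --name) brings it up; everything after that is Helm and ArgoCD.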
How It All Connects
The full deployment flow ties every tool together:
Developer pushes to main
│
├── GitHub Actions CI/CD (Post 15)
│   ├── Run .NET tests
│   ├── Run frontend tests
│   ├── Build + push API image to Docker Hub
│   ├── Build + push frontend image to Docker Hub
│   ├── Trivy scan both images
│   └── Update helm/fhirhub/values.yaml with new image tag
│
├── ArgoCD detects values.yaml change (Post 17)
│   ├── Syncs dev namespace (automated)
│   ├── Syncs staging namespace (automated)
│   └── Prod waits for release branch merge
│
└── Prometheus + Grafana + Loki (this post)
    ├── Scrapes /metrics from all pods
    ├── Collects logs via Promtail
    ├── Fires alerts if services go down
    └── Dashboards show request rates, latency, errors
Git is the single source of truth. Push code, pipelines build and test it, ArgoCD deploys it, monitoring verifies it's healthy. No manual steps.
What's Next
In Part 19, we'll deploy the entire FhirHub stack -- application services, databases, and monitoring -- on a single machine using k3s. We'll compare k3s vs. MicroK8s vs. kubeadm, adapt the Helm values for constrained resources, and set up Traefik ingress with a backup strategy for single-node PostgreSQL.