Tech · February 3, 2026 · 5 min read

Monitoring with Prometheus, Grafana, and Loki

FhirHub's observability stack: Prometheus for metrics, Grafana for dashboards and alerting, and Loki for log aggregation -- fully self-hosted and Kubernetes-native with no vendor lock-in.


By David Le -- Part 18 of the FhirHub Series

A deployed application isn't done. It's just the beginning. Without monitoring, you're flying blind -- you won't know about failures until users report them, you can't diagnose performance problems without metrics, and you can't trace issues without centralized logs.

This post covers the observability stack I built for FhirHub: Prometheus for metrics collection, Grafana for dashboards and alerting, and Loki for log aggregation. Combined with the developer experience tooling, this completes the full monitoring lifecycle.

Why This Stack?

Prometheus + Grafana + Loki vs. Datadog vs. ELK vs. CloudWatch

| Stack | Cost | Self-Hosted | K8s Native | Logs + Metrics |
|---|---|---|---|---|
| Prometheus + Grafana + Loki | Free (OSS) | Yes | ServiceMonitor CRD | Yes |
| Datadog | ~$15/host/month | No | Agent-based | Yes |
| ELK (Elastic) | Free (OSS) or paid | Yes | Filebeat/Metricbeat | Yes |
| CloudWatch | Pay-per-metric | No (AWS only) | Limited | Yes |

The Prometheus stack is the Kubernetes-native choice. ServiceMonitor CRDs auto-discover pods to scrape. Grafana dashboards are JSON files that live in Git. Loki uses the same label-based query language as Prometheus. No vendor lock-in, no per-host pricing.

Why Not Datadog?

Datadog is excellent -- unified metrics, logs, traces, and APM in one SaaS. But at ~$15/host/month, costs scale with infrastructure. For an open-source project like FhirHub, self-hosted monitoring means anyone can run the full stack without a subscription. The Prometheus ecosystem also integrates more deeply with Kubernetes through CRDs.

Why Not ELK?

The Elastic stack (Elasticsearch, Logstash, Kibana) is powerful but resource-heavy. Elasticsearch requires significant memory for indexing. Loki takes a fundamentally different approach -- it only indexes log labels (pod name, namespace, container), not the log content. This makes Loki dramatically cheaper to run. You trade full-text search speed for lower resource usage, which is the right trade-off for application logs where you typically filter by service first.

Application Metrics

Adding Prometheus metrics to the .NET API required two lines:

// Program.cs
app.UseHttpMetrics();   // Records request duration, count, status code
app.MapMetrics();       // Exposes /metrics endpoint

The prometheus-net.AspNetCore NuGet package instruments all HTTP requests automatically -- no manual counters needed for the basics. Out of the box, the /metrics endpoint exposes:

  • http_request_duration_seconds -- Histogram with method, status code, and path labels
  • http_requests_received_total -- Counter by method and status code
  • dotnet_gc_collection_count_total -- GC collections by generation
  • dotnet_threadpool_num_threads -- Active thread count
  • process_working_set_bytes -- Process memory usage
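
All of the above come from the UseHttpMetrics middleware and the package's default collectors. If the package isn't referenced yet, adding it is one command:

dotnet add package prometheus-net.AspNetCore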

Why prometheus-net vs. OpenTelemetry?

| Library | Protocol | Metrics + Traces | Maturity (.NET) | Setup Complexity |
|---|---|---|---|---|
| prometheus-net | Prometheus scrape | Metrics only | Mature (v8+) | 2 lines |
| OpenTelemetry .NET | OTLP push | Yes | GA but evolving | Collector + config |

prometheus-net is simpler for metrics-only instrumentation. OpenTelemetry is the right choice if you also want distributed tracing, but FhirHub doesn't need trace correlation across services yet. Adding OpenTelemetry later is an additive change -- the Prometheus metrics stay the same.

Checkpoint: Verify Metrics Endpoint

Before continuing, verify the API exposes Prometheus metrics. In a separate terminal:

kubectl port-forward -n fhirhub-dev svc/fhirhub-api 5197:8080

Then in your main terminal:

curl -s http://localhost:5197/metrics | head -20

Expected output:

  • Should show Prometheus-format metrics with lines starting with # HELP, # TYPE, and metric values
  • Look for http_request_duration_seconds and dotnet_gc_collection_count_total in the output

If something went wrong:

  • If the connection is refused, check that the API pod is running: kubectl get pods -n fhirhub-dev -l app.kubernetes.io/name=fhirhub-api
  • If /metrics returns 404, verify that app.MapMetrics() is called in Program.cs and the prometheus-net.AspNetCore NuGet package is installed

ServiceMonitor CRDs

Each Helm sub-chart includes a ServiceMonitor template that tells Prometheus which pods to scrape:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fhirhub-api
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fhirhub-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

The release: prometheus label is how kube-prometheus-stack discovers ServiceMonitors. Without it, Prometheus ignores the resource.
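
If you'd rather not carry the release: prometheus label on every ServiceMonitor, kube-prometheus-stack can be told to select all ServiceMonitors regardless of labels. A minimal values override sketch, assuming the chart's default behavior:

# kube-prometheus-stack values override (sketch)
prometheus:
  prometheusSpec:
    # Select ServiceMonitors in all namespaces, regardless of the release label
    serviceMonitorSelectorNilUsesHelmValues: false

FhirHub keeps the explicit label instead -- it makes it obvious which resources Prometheus is expected to scrape.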

Why ServiceMonitor vs. Annotations?

| Discovery Method | Typed | Scoped | Multiple Endpoints |
|---|---|---|---|
| prometheus.io/scrape annotation | No | Pod only | No |
| ServiceMonitor CRD | Yes | Namespace-scoped | Yes |
| PodMonitor CRD | Yes | Pod-scoped | Yes |

Annotations are the legacy approach -- a boolean on each pod. ServiceMonitors are typed CRDs with validation, namespace scoping, and support for multiple endpoints with different intervals. They're the standard for kube-prometheus-stack.

Grafana Dashboards

Two Grafana dashboards ship as JSON in the repository, loaded via ConfigMaps:

FhirHub API Dashboard

  • HTTP request rate by method and endpoint
  • Latency percentiles (p50, p95, p99)
  • 5xx error rate as a percentage
  • GC collection count by generation
  • Thread pool thread count
  • Process working set memory
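
The latency and error-rate panels above are typically backed by PromQL along these lines (a sketch, assuming prometheus-net's default metric and label names):

# p95 request latency across all endpoints
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# 5xx responses as a percentage of all requests
100 * sum(rate(http_requests_received_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_received_total[5m]))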

FhirHub Overview Dashboard

  • Service health status (up/down) for all 4 services
  • Total request throughput stacked by service
  • PostgreSQL active connections by state
  • Pod CPU and memory usage by component
  • Keycloak login count (success vs. failed)

Why Dashboards-as-Code?

| Approach | Version Controlled | Reproducible | Reviewable |
|---|---|---|---|
| Manual Grafana UI | No | No | No |
| Grafana provisioning API | Partially | Yes | Depends |
| JSON in Git + ConfigMap | Yes | Yes | PR review |

Storing dashboards as JSON in the repository means they're versioned, reviewable in PRs, and automatically deployed with the rest of the infrastructure. A new team member gets the same dashboards as everyone else. No manual setup.
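
With kube-prometheus-stack, the Grafana sidecar loads the JSON from any ConfigMap carrying its dashboard label. A sketch of how one of the FhirHub dashboards might be wrapped, assuming the chart's default grafana_dashboard label (names are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fhirhub-api-dashboard
  labels:
    grafana_dashboard: "1"   # label the Grafana sidecar watches for
data:
  # Full dashboard JSON exported from Grafana goes here
  fhirhub-api.json: |
    { "title": "FhirHub API", "panels": [] }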

Alerting Rules

Seven PrometheusRule alerts cover the critical failure modes:

| Alert | Condition | For | Severity |
|---|---|---|---|
| ApiDown | up{job="fhirhub-api"} == 0 | 5m | Critical |
| HapiFhirDown | up{job="hapi-fhir"} == 0 | 5m | Critical |
| KeycloakDown | up{job="keycloak"} == 0 | 5m | Critical |
| PostgreSQLDown | up{job="postgresql"} == 0 | 5m | Critical |
| HighErrorRate | >5% HTTP 5xx | 5m | Warning |
| HighLatency | p99 > 2 seconds | 10m | Warning |
| HighMemory | >90% memory limit | 10m | Warning |

Critical alerts fire after 5 minutes of downtime. Warning alerts give longer windows to avoid noise from transient spikes.
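
As a concrete example, the ApiDown alert from the table maps onto a PrometheusRule roughly like this (the group name and annotation text are illustrative; the release label is again what kube-prometheus-stack selects on):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fhirhub-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: fhirhub.availability
      rules:
        - alert: ApiDown
          expr: up{job="fhirhub-api"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "FhirHub API has been unreachable for 5 minutes"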

Why 5 Minutes for Critical?

| Threshold | False Positives | Detection Speed | Operator Fatigue |
|---|---|---|---|
| 1 minute | High (transient restarts) | Fast | High |
| 5 minutes | Low | Moderate | Low |
| 15 minutes | Very low | Slow | Very low |

Five minutes is the standard balance. Kubernetes restarts crashed pods quickly, so a brief restart shouldn't page anyone. If a service is still down after 5 minutes, something is genuinely wrong and needs human attention.

Checkpoint: Verify Monitoring Stack

Before continuing, verify the full monitoring stack is running:

make monitoring-up
kubectl get pods -n monitoring

Expected output:

  • All monitoring pods (prometheus, grafana, alertmanager, node-exporter, kube-state-metrics) should be Running

Verify Prometheus is scraping targets:

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
  • Open http://localhost:9090/targets -- scrape targets should show UP state for FhirHub services
  • Open http://localhost:9090/alerts -- should list all 7 alert rules (ApiDown, HapiFhirDown, KeycloakDown, PostgreSQLDown, HighErrorRate, HighLatency, HighMemory)

Verify Grafana dashboards:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
  • Open http://localhost:3000 -- log in with admin credentials
  • Navigate to Dashboards and look for the FhirHub folder -- should contain "FhirHub API" and "FhirHub Overview" dashboards
  • Open "FhirHub API" -- panels should show data for request rate, latency, error rate, and runtime metrics

If something went wrong:

  • If Grafana shows "No data" on dashboards, verify the Prometheus datasource is configured correctly in Grafana's datasource settings
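
For reference, kube-prometheus-stack provisions the Prometheus datasource automatically; if you ever need to recreate it by hand, the provisioning entry looks roughly like this (the URL assumes the in-cluster service name used in the port-forward command above):

# Grafana datasource provisioning entry (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
    access: proxy
    isDefault: true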

Log Aggregation with Loki

Loki collects logs from all containers via Promtail. Unlike Elasticsearch, Loki doesn't index log content -- it indexes labels only and stores compressed log chunks. This makes it lightweight enough to run alongside Prometheus without doubling infrastructure costs.

Pipeline Stages

Promtail uses pipeline stages to parse different log formats:

| Service | Format | Extracted Fields |
|---|---|---|
| FhirHub API | Serilog JSON | Level, RenderedMessage, SourceContext, RequestPath, StatusCode, TraceId |
| Frontend | Next.js stdout | Regex + JSON stages |
| HAPI FHIR | Java Logback | Timestamp, level, thread, logger, message |
| Keycloak | Keycloak logs | Timestamp, level, thread, logger, message |

Each service gets a pipeline stage that extracts structured fields from its log format. In Grafana's Explore view, you can query all services with LogQL:

{namespace="fhirhub-prod", app="fhirhub-api"} |= "error"
{namespace="fhirhub-prod"} | json | level="Error" | line_format "{{.RenderedMessage}}"

Why Loki vs. Elasticsearch for Logs?

| Feature | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only | Full-text |
| Memory usage | Low (~256MB) | High (~2GB+) |
| Query speed (filtered) | Fast | Fast |
| Query speed (unfiltered) | Slower | Fast |
| Query language | LogQL | KQL / Lucene |
| Grafana integration | Native | Plugin |

Loki wins for application log aggregation where you almost always filter by service, namespace, or pod first. Elasticsearch wins when you need full-text search across all logs without knowing which service produced them. For FhirHub, the former is the common case.

Checkpoint: Verify Log Aggregation

Before continuing, verify Loki is collecting logs from FhirHub pods.

In Grafana (http://localhost:3000), go to Explore and select the Loki datasource. Run these queries:

  • {namespace="fhirhub-dev"} -- should return log lines from all FhirHub pods
  • {namespace="fhirhub-dev", app="fhirhub-api"} |= "info" -- should return API info-level logs

Expected output:

  • Log lines appear with timestamps, pod names, and log content. You should see startup messages at minimum

If something went wrong:

  • If no logs appear, check that Promtail is running: kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
  • If Promtail is running but Loki shows no data, verify the Loki datasource URL in Grafana is correct (typically http://loki:3100)
  • If only some pods show logs, check Promtail's scrape config covers the fhirhub-dev namespace

Developer Experience

Makefile

Every common operation is a make target:

make up              # Start all services
make dev             # Start infrastructure only
make dev-frontend    # Run frontend with hot-reload
make test            # Run all tests
make lint            # Run all linters
make build-api       # Build API Docker image
make push-api        # Push to Docker Hub
make k8s-create      # Create local Kind cluster
make k8s-deploy      # Deploy via Helm
make monitoring-up   # Deploy monitoring stack
make clean           # Remove everything

No one should need to remember Docker Compose flags or Helm command syntax.
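
Under the hood, a target like make monitoring-up just wraps the Helm invocations you would otherwise type by hand -- roughly the following, assuming the upstream prometheus-community and grafana chart repositories:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
helm upgrade --install loki grafana/loki-stack -n monitoring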

Local Kubernetes with Kind

The scripts/setup-local-k8s.sh script creates a complete local environment:

  1. Creates a Kind cluster (1 control plane + 2 workers)
  2. Installs nginx-ingress for local routing
  3. Installs cert-manager for TLS
  4. Installs ArgoCD
  5. Builds and loads local Docker images
  6. Deploys FhirHub via Helm
  7. Applies ArgoCD ApplicationSet

One command: make k8s-create. The entire stack running on your laptop.
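
The cluster layout from step 1 corresponds to a Kind config along these lines (a sketch -- the actual script may differ):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker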

How It All Connects

The full deployment flow ties every tool together:

Developer pushes to main
  │
  ├── GitHub Actions CI/CD (Post 15)
  │   ├── Run .NET tests
  │   ├── Run frontend tests
  │   ├── Build + push API image to Docker Hub
  │   ├── Build + push frontend image to Docker Hub
  │   ├── Trivy scan both images
  │   └── Update helm/fhirhub/values.yaml with new image tag
  │
  ├── ArgoCD detects values.yaml change (Post 17)
  │   ├── Syncs dev namespace (automated)
  │   ├── Syncs staging namespace (automated)
  │   └── Prod waits for release branch merge
  │
  └── Prometheus + Grafana + Loki (this post)
      ├── Scrapes /metrics from all pods
      ├── Collects logs via Promtail
      ├── Fires alerts if services go down
      └── Dashboards show request rates, latency, errors

Git is the single source of truth. Push code, pipelines build and test it, ArgoCD deploys it, monitoring verifies it's healthy. No manual steps.

What's Next

In Part 19, we'll deploy the entire FhirHub stack -- application services, databases, and monitoring -- on a single machine using k3s. We'll compare k3s vs. MicroK8s vs. kubeadm, adapt the Helm values for constrained resources, and set up Traefik ingress with a backup strategy for single-node PostgreSQL.


Find the source code on GitHub · Connect on LinkedIn
