Advanced GPU Monitoring with DCGM-Exporter, Prometheus, and Grafana

Introduction

Every GPU in your environment emits metrics such as utilization, memory usage, clock speeds, temperature, and power draw. Analyzed properly, these metrics help you tune performance, reduce costs, and make faster decisions.

NVIDIA’s DCGM-Exporter exposes detailed GPU metrics in a format that Prometheus can scrape, so Grafana can turn them into dashboards and real-time visualizations.

Why GPU Monitoring Matters

Many companies miss opportunities by not properly tracking their GPU resources. With strategic monitoring, you can:

Detect performance bottlenecks before they impact production.
Optimize resource usage and reduce costs.
Make data-driven decisions instead of relying on assumptions.
Monitor multiple clusters in a centralized way.

The real value comes when you can centralize metrics and turn them into actionable insights.

What You’ll Learn in This Post

Step by step, you’ll learn how to:

Set up DCGM-Exporter quickly in Docker
Deploy it in Kubernetes with the official Helm chart
Integrate the metrics into a central Prometheus
Visualize them in Grafana dashboards

All with a focus on turning raw metrics into strategic information for your business.

Running DCGM-Exporter in Docker

To quickly run DCGM-Exporter on a GPU-enabled machine:

docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.4.1-4.5.2-ubuntu22.04

Test the metrics endpoint:

curl localhost:9400/metrics

Example output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139
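If you want to inspect these metrics from a script rather than by eye, the text format is simple enough to parse by hand. The following is a minimal sketch, not an official client: the function names `fetch_metrics` and `parse_metrics` are made up for this post, and the parser only handles plain `name{labels} value` lines like the one above.

```python
import re
import urllib.request

# Matches sample lines such as: NAME{label="value",...} 139
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Download the raw metrics text (same data as the curl command)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# The example output shown above, used here as test input.
SAMPLE_OUTPUT = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139
"""

parsed = parse_metrics(SAMPLE_OUTPUT)
```

On a machine where the exporter is running, `parse_metrics(fetch_metrics())` should return one tuple per exposed sample. For anything beyond a quick check, a maintained Prometheus client library is likely the better choice.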

Deploying DCGM-Exporter in Kubernetes

NVIDIA maintains an official Helm Chart to install DCGM-Exporter in Kubernetes clusters:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter

Check the pod:

kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" -n default

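Access metrics locally (because the release was installed with --generate-name, the Service name may carry a generated suffix; check it with kubectl get svc and adjust the command if needed):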

kubectl port-forward svc/dcgm-exporter 8080:9400
curl http://127.0.0.1:8080/metrics

Setting up Local Prometheus in Docker

Add a scrape job to collect DCGM-Exporter metrics in prometheus.yml:

scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["host.docker.internal:9400"]

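Use host.docker.internal in Docker Desktop (Windows/Mac). On Linux, replace it with the host machine IP, or with localhost:9400 when Prometheus runs with --network=host as in the command below.

Start (or recreate) the Prometheus container with this configuration: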

docker run -d --name prometheus --network=host \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

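Prometheus will now scrape the DCGM-Exporter metrics. To confirm, open the Targets page of the Prometheus UI (http://localhost:9090/targets by default) and check that the dcgm-exporter job is up.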
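You can also confirm that samples are arriving by asking Prometheus directly through its HTTP API. The sketch below assumes Prometheus is reachable on its default port 9090; the helper names (`build_query_url`, `extract_values`, `query_prometheus`) are invented for this post and are not part of any library.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def extract_values(payload):
    """Turn an instant-query (vector) response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    return [
        (item["metric"], float(item["value"][1]))  # value is [timestamp, "string"]
        for item in payload["data"]["result"]
    ]


def query_prometheus(base_url, expr, timeout=5.0):
    """Run an instant query against a live Prometheus (assumed reachable)."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_values(json.load(response))


# Example (requires a running Prometheus):
# query_prometheus("http://localhost:9090", "DCGM_FI_DEV_SM_CLOCK")
```

An empty result usually means the scrape target is down or the job has not completed its first scrape yet.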

Grafana Integration

NVIDIA provides an official dashboard for metric visualization:

Dashboard: DCGM Exporter Grafana Dashboard #12239
JSON for import: grafana/dcgm-exporter-dashboard.json

Just import the JSON into Grafana and start exploring real-time insights.

To get the most from your data, forward metrics from each local Prometheus to a central Prometheus (for example via remote write or federation) and use that central instance as the Grafana data source. This enables:

Accurate, real-time visualizations.
Actionable insights for strategic decisions.
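Once several clusters report to one place, a typical first question is how busy each cluster is on average. Here is a small sketch of that aggregation in plain Python. It assumes each sample is a (labels, value) pair, such as you might get from a query for DCGM_FI_DEV_GPU_UTIL, and that every local Prometheus attaches a `cluster` label; that label name and the helper `mean_by_label` are assumptions made for illustration.

```python
from collections import defaultdict


def mean_by_label(samples, label):
    """Average sample values grouped by the given label.

    samples: iterable of (labels_dict, value) pairs.
    Samples missing the label are grouped under "unknown".
    """
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels.get(label, "unknown")].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical GPU utilization samples from two clusters.
example_samples = [
    ({"cluster": "a", "gpu": "0"}, 10.0),
    ({"cluster": "a", "gpu": "1"}, 30.0),
    ({"cluster": "b", "gpu": "0"}, 50.0),
]
per_cluster = mean_by_label(example_samples, "cluster")
```

In practice the same grouping can be done on the server side with a PromQL aggregation such as `avg by (cluster) (DCGM_FI_DEV_GPU_UTIL)`, which is what a Grafana panel would normally use.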

Want to know more or implement this integration in your environment? Get in touch and turn your metrics into smart decisions!

Conclusion

With DCGM-Exporter, you can monitor GPUs on standalone on-premises hosts or in Kubernetes clusters, and feed the metrics into Prometheus and Grafana.
