
Advanced GPU Monitoring with DCGM-Exporter, Prometheus, and Grafana


Introduction
#

Every GPU in your environment generates valuable data, such as utilization, memory usage, temperature, and clock speeds. When properly analyzed, these metrics help you optimize performance, reduce costs, and make faster strategic decisions.

NVIDIA’s DCGM-Exporter allows you to extract detailed GPU metrics, seamlessly integrating with Prometheus and Grafana for smart dashboards and real-time visualizations.


Why GPU Monitoring Matters
#

Many companies miss opportunities by not properly tracking their GPU resources. With strategic monitoring, you can:

  • Detect performance bottlenecks before they impact production.
  • Optimize resource usage and reduce costs.
  • Make data-driven decisions instead of relying on assumptions.
  • Monitor multiple clusters in a centralized way.

The real value comes when you can centralize metrics and turn them into actionable insights.


What You’ll Learn in This Post
#

Here you’ll learn, step by step, how to:

  • Quickly set up DCGM-Exporter in Docker
  • Deploy it in Kubernetes using the Helm chart
  • Integrate its metrics into a central Prometheus
  • Visualize them in Grafana dashboards

All with a focus on turning raw metrics into strategic information for your business.


Running DCGM-Exporter in Docker
#

To quickly run DCGM-Exporter on a GPU-enabled machine:

docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.4.1-4.5.2-ubuntu22.04

Test the metrics endpoint:

curl localhost:9400/metrics

Example output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139
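If you want to work with these values in a script rather than read them by eye, the text format is simple enough to parse by hand. The following is a minimal sketch, not an official client: the function name `parse_metrics` and the simplified regular expressions are assumptions made for illustration, and the pattern does not handle escaped quotes inside label values.

```python
import re

# Simplified pattern for sample lines such as: NAME{label="value",...} 123.4
# Assumption: label values contain no escaped quotes.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# The sample output shown above, used here as test input.
SAMPLE = '''# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139
'''

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

To run it against a live exporter, you could replace `SAMPLE` with the body returned by `urllib.request.urlopen("http://localhost:9400/metrics")`, assuming the container from the previous step is running.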

Deploying DCGM-Exporter in Kubernetes
#

NVIDIA maintains an official Helm Chart to install DCGM-Exporter in Kubernetes clusters:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter

Check the pod:

kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" -n default

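Access metrics locally (if you installed with --generate-name, the Service name may carry a generated suffix; look it up with kubectl get svc and substitute it below):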

kubectl port-forward svc/dcgm-exporter 8080:9400
curl http://127.0.0.1:8080/metrics

Setting up Local Prometheus in Docker
#

Add a scrape job to collect DCGM-Exporter metrics in prometheus.yml:

scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["host.docker.internal:9400"]

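host.docker.internal resolves to the host machine on Docker Desktop (Windows/Mac). On Linux, replace it with the host machine's IP address, or with localhost:9400 when Prometheus runs with --network=host as in the command below.

Start Prometheus with this configuration mounted (remove any existing container named prometheus first):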

docker run -d --name prometheus --network=host \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

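Prometheus now scrapes the DCGM-Exporter endpoint. You can confirm that the target is up on the Status > Targets page, by default at http://localhost:9090/targets.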
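To confirm the scrape job from a script, you can ask Prometheus for the built-in `up` metric through its HTTP query API (`/api/v1/query`). This is a minimal sketch under a few assumptions: Prometheus is reachable at `http://localhost:9090`, the job is named `dcgm-exporter` as in the config above, and the helper names `build_query_url`, `targets_up`, and `check` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def targets_up(response_body):
    """Map each instance to True/False from an instant-query response for `up`."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    return {
        # Each result carries a [timestamp, "value"] pair; "1" means the scrape succeeded.
        result["metric"].get("instance", "?"): result["value"][1] == "1"
        for result in payload["data"]["result"]
    }


def check(base_url="http://localhost:9090"):
    """Query a running Prometheus (assumed address) for the dcgm-exporter job."""
    url = build_query_url(base_url, 'up{job="dcgm-exporter"}')
    with urllib.request.urlopen(url, timeout=5) as resp:
        return targets_up(resp.read().decode("utf-8"))
```

Calling `check()` returns a dictionary keyed by scrape target. An empty dictionary suggests that the job name does not match or that Prometheus has not loaded the config yet.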


Grafana Integration
#

NVIDIA provides an official Grafana dashboard for DCGM-Exporter metrics. Import its JSON into Grafana and start exploring the metrics in real time.

To fully leverage your data, send local Prometheus metrics to a central Prometheus (for example via remote write or federation) and add that central instance to Grafana as a data source, enabling:

  • Accurate, real-time visualizations.
  • Actionable insights for strategic decisions.

Want to know more or implement this integration in your environment? Get in touch and turn your metrics into smart decisions!


Conclusion
#

With DCGM-Exporter, you can monitor GPUs on on-premises machines or in Kubernetes clusters, and integrate the metrics with Prometheus and Grafana.