Open
Description
Depends on #30 (Prometheus metric exporter integration).
Energy metrics: How much energy is being consumed? How do users measure savings?
- Grafana dashboard showing cluster-wide energy usage, broken down by individual training jobs integrated with Zeus
- CPU and DRAM energy measurement (#36) will help differentiate this from DCGM
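
To make the dashboard idea concrete, here is a minimal sketch of how per-job, per-component energy counters could be rendered in the Prometheus text exposition format for Grafana to query. The metric name `zeus_job_energy_joules_total`, the `job_id` and `component` labels, and both helper functions are assumptions for illustration; none of them come from Zeus or from the exporter in #30.

```python
from collections import defaultdict

# Hypothetical metric name; not taken from Zeus or the exporter in #30.
METRIC = "zeus_job_energy_joules_total"


def render_exposition(samples):
    """Render (job_id, component, joules) samples as Prometheus text format.

    Samples sharing the same job and component are summed into one series.
    """
    totals = defaultdict(float)
    for job_id, component, joules in samples:
        totals[(job_id, component)] += joules

    lines = [
        f"# HELP {METRIC} Cumulative energy per training job and component.",
        f"# TYPE {METRIC} counter",
    ]
    for (job_id, component), joules in sorted(totals.items()):
        # Label names are assumptions; "job_id" avoids clashing with
        # the "job" label that Prometheus attaches to scrape targets.
        lines.append(
            f'{METRIC}{{job_id="{job_id}",component="{component}"}} {joules}'
        )
    return "\n".join(lines) + "\n"


def cluster_total(samples):
    """Cluster-wide energy is the sum over all jobs and components."""
    return sum(joules for _, _, joules in samples)
```

With series shaped like this, a Grafana panel could plausibly use a PromQL query such as `sum by (job_id) (rate(zeus_job_energy_joules_total[5m]))` for per-job power in watts, and drop the `by (job_id)` clause for the cluster-wide view.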
Experiment managers: Each training experiment can be associated with its energy consumption (aggregate and over time).
- Weights & Biases
- MLflow
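
One way the experiment-manager integration could look is sketched below: a small helper integrates power samples into cumulative energy and forwards both the over-time and aggregate values to a tracker through a generic callback. The `EnergyLogger` class, the `log_fn(metrics, step)` hook, and the metric keys are hypothetical and not part of Zeus; the power readings are assumed to come from whatever measurement source is available.

```python
import time


class EnergyLogger:
    """Integrates power samples into energy and forwards both to a tracker."""

    def __init__(self, log_fn, clock=time.monotonic):
        # Hypothetical hook with signature log_fn(metrics: dict, step: int).
        self.log_fn = log_fn
        self.clock = clock
        self.total_joules = 0.0
        self._last_time = None

    def record(self, power_watts, step):
        """Record one power sample taken at the given training step."""
        now = self.clock()
        if self._last_time is not None:
            # Rectangle rule: treat the new reading as constant over the
            # interval since the previous sample.
            self.total_joules += power_watts * (now - self._last_time)
        self._last_time = now
        self.log_fn(
            {
                "energy/power_watts": power_watts,      # over-time series
                "energy/total_joules": self.total_joules,  # running aggregate
            },
            step,
        )
```

The callback keeps the helper independent of any one tracker. For example, it could be wired up as `EnergyLogger(lambda m, s: wandb.log(m, step=s))` for Weights & Biases or `EnergyLogger(lambda m, s: mlflow.log_metrics(m, step=s))` for MLflow, assuming a run is already active in either case.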