This article walks the foundational practitioner through each of the three layers and the practical trade-offs that separate a first-week estimate from a production measurement program.
Layer 1: methodology
The methodology decision is whether to attribute energy top-down, bottom-up, or via a hybrid of the two.
Top-down attribution starts from the facility-level energy bill. The data center’s total electricity consumption is known from the utility meter. The IT-equipment consumption is derived by dividing by the Power Usage Effectiveness (PUE) ratio. The AI-workload share is derived by allocating IT-equipment consumption proportionally — typically by accelerator-hour share. The advantage of top-down is that the facility number is unambiguous and audit-grade. The disadvantage is that the per-workload allocation is an approximation that may be off by 30% to 50% for any individual workload.
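As a concrete illustration, the sketch below runs the top-down arithmetic in Python; the facility total, PUE, and accelerator-hour figures are placeholders, not measurements.

```python
# Top-down attribution sketch. All figures are illustrative placeholders.
facility_kwh = 1_200_000          # monthly total from the utility meter
pue = 1.4                         # facility Power Usage Effectiveness

it_kwh = facility_kwh / pue       # IT-equipment share of the facility total

# Allocate IT energy to workloads proportionally by accelerator-hour share.
accelerator_hours = {"training_run_a": 18_000, "inference_svc_b": 6_000, "other": 12_000}
total_hours = sum(accelerator_hours.values())

workload_kwh = {
    name: it_kwh * hours / total_hours
    for name, hours in accelerator_hours.items()
}
print(workload_kwh)  # training_run_a gets half the IT energy, ~428,571 kWh
```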
Bottom-up attribution starts from per-workload telemetry. A software library samples the accelerator’s power draw at one-second intervals during a training run, records the cumulative energy consumed, and reports the figure as the workload’s energy consumption. The advantage of bottom-up is that the per-workload figure is directly measured. The disadvantage is that the figure does not include the cooling, power-conversion, networking, and storage overhead that the facility level captures.
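The underlying arithmetic is a simple numerical integration of the power samples. The sketch below uses made-up one-second samples in place of real accelerator telemetry.

```python
# Bottom-up sketch: integrate one-second power samples (watts) into kWh.
# The sample values are invented for illustration.
power_samples_w = [412.0, 415.3, 408.9, 411.2]   # one reading per second
sample_interval_s = 1.0

energy_j = sum(p * sample_interval_s for p in power_samples_w)  # watt-seconds = joules
energy_kwh = energy_j / 3_600_000                # 1 kWh = 3.6e6 J
```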
Hybrid attribution reconciles the two. The bottom-up per-workload figure is multiplied by the facility’s effective PUE to add the overhead. The reconciled figure is checked against the facility-level total to verify that the sum across all workloads matches the meter. The hybrid approach is the production-grade methodology that most measurement programs converge on.
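A minimal sketch of the reconciliation step, assuming illustrative bottom-up figures, an effective PUE of 1.4, and a facility meter reading:

```python
# Hybrid reconciliation sketch. All figures are illustrative.
effective_pue = 1.4
bottom_up_kwh = {"training_run_a": 300_000, "inference_svc_b": 250_000}

# Gross up each directly measured workload by the facility overhead.
reconciled_kwh = {w: e * effective_pue for w, e in bottom_up_kwh.items()}

# Check the sum against the meter; a large residual signals unmetered
# workloads or a stale PUE assumption.
facility_total_kwh = 800_000
residual_kwh = facility_total_kwh - sum(reconciled_kwh.values())
print(f"unattributed: {residual_kwh:,.0f} kWh")   # 30,000 kWh here
```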
Layer 2: instrumentation tools
The instrumentation tool decision depends on where the workload runs.
For workloads on cloud accelerators, the cloud provider’s sustainability API is typically the easiest first measurement. Amazon Web Services, Microsoft Azure, and Google Cloud each publish per-account, per-service, per-region carbon-emissions figures with a one-month to three-month lag. The figures use cloud-provider-specific assumptions about PUE and grid mix that the organization should document but does not need to compute itself.
For workloads on owned or co-located hardware, an open-source library is typically the most accurate per-workload measurement. CodeCarbon is the most widely deployed open-source library for measuring training-run energy and emissions; it samples the GPU power draw using the NVIDIA Management Library (NVML), the CPU power draw using the Running Average Power Limit (RAPL) interface, and the memory power draw using vendor-specific tools, then multiplies by a configurable grid emission factor.1 An alternative is the Carbontracker library, which has similar capabilities and is frequently used in research settings.
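For orientation, the basic CodeCarbon usage pattern looks roughly like the sketch below; `train_model` is a hypothetical stand-in for the real training entry point.

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="llm-finetune")
tracker.start()
try:
    train_model()                  # hypothetical training entry point
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2e for the run
print(f"run emitted ~{emissions_kg:.3f} kg CO2e")
```

CodeCarbon also ships a `track_emissions` decorator for wrapping a training function wholesale, which suits MLOps pipelines where the training entry point is already a single function.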
For workloads that need to be benchmarked against industry peers, the Hugging Face AI Energy Score leaderboard provides a standardized methodology that the model-provider community has begun to adopt. The leaderboard publishes per-task, per-model energy scores that allow apples-to-apples comparison across different foundation models and serving stacks.2 Submitting an internally developed model to the leaderboard, or computing the leaderboard’s energy score for an internally deployed model, is an emerging practice for organizations that need to demonstrate sustainability claims to procurement teams or regulators.
For workloads that need to be measured at the facility level, rack-level PDU telemetry and Building Management System (BMS) integration are the audit-grade instrumentation. The PDU produces second-by-second power-draw data per rack; the BMS produces the cooling, lighting, and power-conversion overhead that the PDU does not see. The combined data feeds a facility-level energy ledger that the carbon-accounting program then attributes to workloads.
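A minimal sketch of how the two streams combine into that ledger, assuming illustrative daily totals in kilowatt-hours:

```python
# Facility-ledger sketch. Rack and BMS figures are illustrative daily kWh.
pdu_rack_kwh = {"rack-01": 910.0, "rack-02": 875.5, "rack-03": 640.2}  # PDU telemetry
bms_overhead_kwh = 1_120.0   # cooling, lighting, power conversion from the BMS

it_kwh = sum(pdu_rack_kwh.values())
facility_kwh = it_kwh + bms_overhead_kwh

# The same ledger yields the measured effective PUE that hybrid attribution needs.
effective_pue = facility_kwh / it_kwh   # about 1.46 with these figures
```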
Layer 3: reporting standards
The reporting-standard decision is determined by who is asking for the number.
For internal reporting and engineering dashboards, the standard is typically per-workload kilowatt-hours and per-workload tonnes of CO2 equivalent (tCO2e), broken down by training, inference, and supporting infrastructure. The cadence is typically weekly. The audience is the engineering and platform teams, who use the figure to identify optimization opportunities.
For corporate ESG reporting under the GHG Protocol Scope 2 (purchased electricity), the standard is the location-based or market-based emission factor multiplied by the kilowatt-hours consumed.3 The location-based method uses the grid average emission factor at the data-center location; the market-based method uses the contracted Renewable Energy Certificate (REC) or Power Purchase Agreement (PPA) emission factor that the organization has procured. Both methods are typically reported in parallel.
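A worked example of the dual reporting, with placeholder emission factors (real factors come from the grid operator and from the organization's energy contracts):

```python
# Dual Scope 2 sketch. Factors are placeholders, in kg CO2e per kWh.
kwh = 857_000                 # annual AI electricity consumption
location_factor = 0.35        # grid average at the data-center location
market_factor = 0.05          # per the contracted RECs/PPA

location_based_tco2e = kwh * location_factor / 1_000   # kg -> tonnes
market_based_tco2e = kwh * market_factor / 1_000

# Reported in parallel, per the GHG Protocol Scope 2 Guidance.
print(f"location-based: {location_based_tco2e:.1f} tCO2e")   # 300.0
print(f"market-based:   {market_based_tco2e:.1f} tCO2e")     # 42.9
```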
For EU regulatory reporting under the CSRD and the European Sustainability Reporting Standards (ESRS), the standard is the ESRS E1 climate-change disclosure, which requires Scope 1, Scope 2, and material Scope 3 categories with comparable prior-year figures and forward-looking transition-plan data.4 AI energy consumption is a Scope 2 category for owned data centers and a Scope 3 category for cloud-procured compute.
For EU AI Act compliance under Article 95’s voluntary codes of conduct on sustainability, providers of general-purpose AI models are encouraged to publish energy-consumption documentation for training and inference.5 The format is not fully prescribed at the time of writing but is expected to converge on the Hugging Face AI Energy Score methodology and on per-training-run kilowatt-hour disclosures.
Maturity Indicators
The COMPEL D19 maturity rubric uses the breadth and continuity of measurement as the indicator of progress. At Level 2 (Developing), the organization has performed at least a one-off measurement of its largest training runs. At Level 3 (Defined), the organization has continuous per-system kilowatt-hour and tCO2e tracking for every production AI system.6 The transition from Level 2 to Level 3 is typically the moment at which measurement is integrated into the MLOps platform — every training run automatically logs its energy figure to a central registry without practitioner intervention. This integration is the single most important investment that the foundational program makes.
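A minimal sketch of what that integration can look like, assuming a hypothetical `registry.log_run` API in place of whatever experiment-tracking client the platform actually uses:

```python
from codecarbon import EmissionsTracker

def tracked_training_run(run_id: str, train_fn, registry) -> None:
    """Run training with automatic energy logging; no per-run practitioner action."""
    tracker = EmissionsTracker(project_name=run_id)
    tracker.start()
    try:
        train_fn()
    finally:
        emissions_kg = tracker.stop()
    # final_emissions_data carries the measured energy in kWh in recent
    # CodeCarbon versions; registry.log_run is a hypothetical API.
    registry.log_run(
        run_id,
        energy_kwh=tracker.final_emissions_data.energy_consumed,
        emissions_kg=emissions_kg,
    )
```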
McKinsey’s State of AI surveys have documented that organizations with mature MLOps platforms — with experiment tracking, model registries, and automated deployment — are several times more likely to have continuous AI sustainability measurement than organizations without those platforms.7 The MLOps platform and the sustainability-measurement platform are the same platform.
Practical Application
A foundational practitioner who is building the measurement layer for the first time should sequence the work in three stages.
Stage 1: bootstrap with the cloud provider’s API. Pull the last twelve months of carbon-emissions data from the cloud provider’s sustainability dashboard. Allocate to AI workloads using accelerator-hour share. Produce the first-cut estimate. This stage takes one to two weeks and produces a Level 2 measurement.
Stage 2: instrument the largest training runs. Integrate CodeCarbon (or equivalent) into the MLOps training loop. Capture per-run energy, per-run grid emission factor, and per-run tCO2e. Reconcile against the cloud-provider figure. This stage takes one to two months and produces measurement that is accurate enough to support model-comparison decisions.
Stage 3: extend to inference and supporting infrastructure. Add per-inference-service telemetry. Add data-pipeline and storage telemetry. Reconcile against facility-level totals. Publish the consolidated figure to the engineering dashboard with weekly refresh. This stage takes three to six months and produces the Level 3 measurement that the rest of the program will build on.
The Organisation for Economic Co-operation and Development (OECD) AI Principles include sustainability as a value-based principle that AI actors should respect across the lifecycle, providing the high-level framing that the measurement program operationalizes.8 The IEA Electricity 2024 report provides the contextual data on data-center electricity growth that the program lead will use to set expectations with the executive sponsor.9
Summary
Measuring AI energy use is a three-layer problem: the methodology layer (top-down, bottom-up, hybrid), the instrumentation-tool layer (cloud APIs, CodeCarbon and similar libraries, the AI Energy Score leaderboard, facility-level PDU and BMS), and the reporting-standard layer (internal engineering dashboards, GHG Protocol Scope 2, CSRD/ESRS E1, EU AI Act Article 95). The COMPEL D19 maturity rubric uses the breadth and continuity of measurement as the indicator of progress from Level 2 to Level 3. The foundational program bootstraps with the cloud-provider API, instruments the largest training runs with an open-source library, and then extends to inference and supporting infrastructure to reach continuous Level 3 measurement. The next article in this module, M1.9, Sustainable Model Selection: Smaller Models, Better Outcomes, builds on the measurement layer to inform the model-selection decisions that drive the largest sustainability outcomes.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. CodeCarbon, “Track and reduce CO2 emissions from your computing.” https://codecarbon.io/ — accessed 2026-04-26.
2. Hugging Face, “AI Energy Score Leaderboard.” https://huggingface.co/spaces/AIEnergyScore/Leaderboard — accessed 2026-04-26.
3. Greenhouse Gas Protocol, “Scope 2 Guidance.” World Resources Institute. https://ghgprotocol.org/ — accessed 2026-04-26.
4. Directive (EU) 2022/2464 on Corporate Sustainability Reporting. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32022L2464 — accessed 2026-04-26.
5. Regulation (EU) 2024/1689 (EU AI Act), Article 95 (Codes of conduct for voluntary application of specific requirements). https://artificialintelligenceact.eu/ — accessed 2026-04-26.
6. COMPEL Domain D19 maturity rubric, Levels 2 and 3. See shared/data/compelDomains.ts.
7. McKinsey & Company, “The state of AI.” https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai — accessed 2026-04-26.
8. Organisation for Economic Co-operation and Development, “OECD AI Principles.” https://oecd.ai/en/ai-principles — accessed 2026-04-26.
9. International Energy Agency, “Electricity 2024.” https://www.iea.org/reports/electricity-2024 — accessed 2026-04-26.