AITF M1.9-Art03 v1.0 Reviewed 2026-04-06 Open Access

Sustainable Model Selection: Smaller Models, Better Outcomes


8 min read · Article 3 of 15

The principle of sustainable model selection has its roots in the "Green AI" paper by Schwartz et al. in Communications of the ACM, which argued that the AI research community's reward structure had drifted toward "Red AI" — the pursuit of marginal accuracy gains at exponentially growing compute cost — and that "Green AI" should reward efficiency-per-result alongside raw accuracy.1 The same principle, translated into the enterprise context, becomes the model-selection discipline that this article develops.

The accuracy-energy curve

Empirical work across natural-language, vision, and speech tasks has produced a consistent finding: model accuracy improves logarithmically with model size, while energy consumption grows linearly or super-linearly. The implication is that doubling the model size produces a small fraction of the accuracy improvement at twice the energy cost. The Hugging Face AI Energy Score leaderboard publishes per-task energy figures that allow direct comparison: for many enterprise tasks, a 7-billion-parameter model achieves accuracy within 5% of a 70-billion-parameter model at one-tenth the inference energy.2
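The cited comparison can be turned into a quick accuracy-per-energy calculation. The absolute accuracy and energy values below are placeholders chosen to be consistent with the "within 5% at one-tenth the energy" claim above; they are not leaderboard data.

```python
# Illustrative arithmetic only: a 7B model within 5% of a 70B model's
# accuracy at one-tenth the inference energy (figures are placeholders).
acc_70b, energy_70b = 0.90, 1.0    # accuracy, relative energy per query
acc_7b,  energy_7b  = 0.855, 0.1   # within 5% accuracy, 1/10 the energy

# Accuracy delivered per unit of inference energy:
eff_70b = acc_70b / energy_70b
eff_7b = acc_7b / energy_7b

print(f"7B delivers {eff_7b / eff_70b:.1f}x the accuracy-per-energy")
```

Under these placeholder numbers, the smaller model delivers roughly 9.5 times the accuracy per unit of inference energy, which is the shape of the curve the empirical work describes.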

The McKinsey State of AI surveys have documented that the most successful enterprise AI deployments are increasingly using smaller, fine-tuned models rather than the largest available foundation models, in part because the inference economics at scale make the larger models prohibitively expensive — and the energy economics follow the cost economics.3

The five-step selection method

A foundational practitioner who is asked to select a model for a new use case should follow a five-step method.

Step 1: define the requirement. State the use case in terms of the minimum acceptable accuracy on a representative evaluation set, the maximum acceptable latency at the expected query volume, the maximum acceptable error rate on safety-critical inputs, and the deployment constraints (on-premise, cloud, edge). The requirement is the upper bound on what the model needs to do — not what the largest available model could do.

Step 2: enumerate candidate models. List the candidate models at three sizes — small (1-10 billion parameters), medium (10-70 billion parameters), and large (70+ billion parameters). For each candidate, record the published accuracy on standard benchmarks, the published energy figures on the AI Energy Score leaderboard or equivalent, the licensing terms, and the deployment constraints.

Step 3: evaluate on the requirement. Run each candidate against the use case’s evaluation set. Record actual accuracy, actual latency, and actual energy consumed per query. Do not substitute published benchmark numbers for actual evaluation on the use case’s own data.

Step 4: compute the efficiency ratio. For each candidate that satisfies the minimum acceptable accuracy, compute the efficiency ratio as accuracy improvement per kilowatt-hour relative to the smallest passing candidate. Models with low efficiency ratios are over-specified for the use case.

Step 5: select the smallest passing candidate. The default selection is the smallest model that satisfies the requirement. The default is overridden only if a larger model offers a meaningful improvement on a dimension that the requirement undervalued — typically latency at extreme query volumes or accuracy on long-tail safety-critical inputs.
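The five steps can be sketched in code. In this minimal Python sketch the model names, accuracy figures, and energy numbers are hypothetical placeholders, not evaluation results; the logic is the point.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    params_b: float    # parameters, in billions
    accuracy: float    # measured on the use case's own evaluation set (step 3)
    energy_kwh: float  # measured energy per 1,000 queries (step 3)

# Step 1: the requirement (illustrative threshold).
MIN_ACCURACY = 0.85

# Steps 2-3: enumerated candidates with evaluated, not published, figures.
candidates = [
    Candidate("small-7b", 7, 0.86, 0.4),
    Candidate("medium-34b", 34, 0.89, 1.6),
    Candidate("large-70b", 70, 0.91, 4.0),
]

# Step 4: efficiency ratio relative to the smallest passing candidate.
passing = sorted((c for c in candidates if c.accuracy >= MIN_ACCURACY),
                 key=lambda c: c.params_b)
baseline = passing[0]
for c in passing[1:]:
    delta_acc = c.accuracy - baseline.accuracy
    delta_kwh = c.energy_kwh - baseline.energy_kwh
    ratio = delta_acc / delta_kwh if delta_kwh > 0 else float("inf")
    print(f"{c.name}: +{delta_acc:.2f} accuracy for +{delta_kwh:.1f} kWh "
          f"(efficiency ratio {ratio:.4f})")

# Step 5: default to the smallest model that satisfies the requirement.
selected = baseline
print("selected:", selected.name)
```

With these placeholder figures, the 34B and 70B candidates buy a few points of accuracy at several times the energy, so the 7B model is selected by default.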

Fine-tuning versus prompting

A related sustainability decision is whether to fine-tune a smaller pre-trained model or to prompt a larger general-purpose model. The energy economics typically favor fine-tuning. A one-time fine-tuning run on a 7-billion-parameter model produces emissions on the order of 1-10 tCO2e and produces a model that serves inference at a fraction of the per-query energy of a 100-billion-parameter general-purpose model. Over the lifetime of a high-traffic service, the fine-tuned model accumulates orders-of-magnitude lower emissions than the prompted general-purpose model.
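The amortization argument can be made concrete with a back-of-envelope calculation. The per-query energy figures and grid intensity below are assumptions for illustration; only the 1-10 tCO2e fine-tuning range comes from the paragraph above.

```python
# Back-of-envelope break-even for fine-tuning versus prompting.
# All figures below are illustrative assumptions, not measurements.
FINE_TUNE_TCO2E = 5.0    # one-time fine-tuning run, mid-range of 1-10 tCO2e
E_SMALL_WH = 0.3         # assumed Wh per query, fine-tuned 7B model
E_LARGE_WH = 3.0         # assumed Wh per query, 100B general-purpose model
GRID_KG_PER_KWH = 0.4    # assumed grid carbon intensity, kgCO2e per kWh

# Emissions saved per query by serving the smaller model:
saved_kg_per_query = (E_LARGE_WH - E_SMALL_WH) / 1000 * GRID_KG_PER_KWH

# Queries needed to amortize the one-time fine-tuning emissions:
break_even = FINE_TUNE_TCO2E * 1000 / saved_kg_per_query

print(f"break-even after about {break_even / 1e6:.1f}M queries")
```

Under these assumptions the break-even point falls below five million queries, which a high-traffic service can pass in days; every query after that widens the gap in the fine-tuned model's favor.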

The Stanford Foundation Model Transparency Index (FMTI) compute-layer scores have made it increasingly possible to compare the per-token inference energy of different foundation models, which is the input that the fine-tuning-versus-prompting decision needs.4

The retrieval-augmented alternative

For use cases that need to access proprietary or recent knowledge, the third option is retrieval-augmented generation (RAG): a smaller foundation model paired with a vector store of the proprietary knowledge. RAG is typically more sustainable than fine-tuning a large model on the same knowledge, because the retrieval step adds a small constant energy cost per query while allowing a much smaller generation model to produce equivalent or better answers. The Green Software Foundation has documented case studies of RAG architectures producing order-of-magnitude energy reductions versus equivalent large-model approaches.5
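The same kind of back-of-envelope comparison applies to RAG: a small constant retrieval cost per query plus a small generation model, against a large model prompted directly. The energy figures below are illustrative assumptions.

```python
# Illustrative per-query energy comparison for a RAG pipeline versus a
# large model prompted directly. All Wh figures are assumptions.
RETRIEVAL_WH = 0.05    # assumed embedding + vector-store lookup per query
SMALL_GEN_WH = 0.3     # assumed small generation model per query
LARGE_GEN_WH = 3.0     # assumed large general-purpose model per query

rag_wh = RETRIEVAL_WH + SMALL_GEN_WH   # constant retrieval cost + generation
reduction = LARGE_GEN_WH / rag_wh

print(f"RAG pipeline: {rag_wh:.2f} Wh/query, "
      f"{reduction:.1f}x less energy than the large model")
```

The retrieval step is a fixed overhead, so the comparison improves as the gap between the generation models widens; under these assumptions the RAG pipeline uses less than an eighth of the large model's per-query energy.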

Maturity Indicators

The COMPEL D19 maturity rubric specifies that at Level 3 (Defined), sustainability criteria are included in model selection and deployment checklists, and that model efficiency metrics (e.g., performance per watt) are included in model cards.6 At Level 4 (Advanced), model efficiency optimization is standard practice. The model-selection discipline that this article develops is the practice that produces the Level 3 indicator. An organization that has not standardized the five-step selection method — or its equivalent — cannot satisfy the Level 3 indicator regardless of how mature its measurement layer is.

The EU AI Act Article 95 voluntary code of conduct on sustainability is expected to encourage providers of general-purpose AI models to publish per-token inference energy figures, which would make the five-step selection method significantly easier to apply at scale.7

Practical Application

A foundational practitioner who is rolling out the selection discipline across an enterprise should produce three artifacts.

Artifact 1: the model-selection checklist. A one-page document that captures the five steps, the required evidence at each step, and the approval threshold for selecting a model larger than the smallest passing candidate. The checklist is added to the standard MLOps onboarding for any new AI use case.

Artifact 2: the efficiency-ratio dashboard. A platform-wide dashboard that, for every production AI system, displays the model size, the inference energy per query, and the efficiency ratio relative to the smallest passing alternative for that use case. Systems with low efficiency ratios become candidates for the optimization or replacement program.

Artifact 3: the over-specification audit. An annual review that examines every production AI system to identify over-specified models — models that could be replaced with a smaller model that satisfies the original requirement. The audit produces a prioritized backlog of replacement opportunities, ranked by annualized energy reduction.
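The audit's ranking step can be sketched directly: for each system, the annualized energy reduction is the per-query saving from the smallest passing alternative multiplied by annual query volume. The system names and figures below are hypothetical.

```python
# Hypothetical over-specification audit backlog, ranked by annualized
# energy reduction. All systems and figures are illustrative.
systems = [
    # (system, current kWh/query, smallest-passing kWh/query, queries/year)
    ("invoice-triage", 0.004, 0.0004, 50_000_000),
    ("hr-chatbot", 0.003, 0.0003, 2_000_000),
    ("contract-summaries", 0.002, 0.002, 10_000_000),  # already right-sized
]

backlog = sorted(
    ((name, (current - alternative) * queries_per_year)
     for name, current, alternative, queries_per_year in systems),
    key=lambda item: item[1],
    reverse=True,
)

for name, kwh_saved in backlog:
    print(f"{name}: {kwh_saved:,.0f} kWh/year reduction if replaced")
```

Already right-sized systems fall to the bottom with zero reduction, so the backlog surfaces the highest-impact replacements first.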

The Organisation for Economic Co-operation and Development (OECD) AI Principles’ framing of sustainability as a lifecycle responsibility supports the discipline of choosing the right-sized model at the start of the lifecycle, rather than over-specifying and managing the consequences downstream.8

Summary

Sustainable model selection is the discipline of choosing the smallest model that satisfies the use case requirement. The accuracy-energy curve pairs logarithmic accuracy improvement with linear or super-linear energy growth, which means that over-specification is almost always inefficient. The five-step selection method — define, enumerate, evaluate, compute efficiency ratio, select smallest passing — operationalizes the discipline. Fine-tuning a smaller model and retrieval-augmented generation are typically more sustainable than prompting a larger general-purpose model. The COMPEL D19 maturity rubric requires sustainability criteria in selection checklists at Level 3 and standardized efficiency optimization at Level 4. The next article, M1.9 Inference Optimization for Sustainability: Quantization, Distillation, Pruning, develops the technical practices that make a selected model more efficient at inference time.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. “Green AI.” Communications of the ACM, December 2020. https://cacm.acm.org/research/green-ai/ — accessed 2026-04-26.

  2. Hugging Face, “AI Energy Score Leaderboard.” https://huggingface.co/spaces/AIEnergyScore/Leaderboard — accessed 2026-04-26.

  3. McKinsey & Company, “The state of AI.” https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai — accessed 2026-04-26.

  4. Stanford CRFM, “Foundation Model Transparency Index.” https://crfm.stanford.edu/fmti/ — accessed 2026-04-26.

  5. Green Software Foundation, principles and case studies. https://greensoftware.foundation/ — accessed 2026-04-26.

  6. COMPEL Domain D19 maturity rubric, Levels 3 and 4. See shared/data/compelDomains.ts.

  7. Regulation (EU) 2024/1689 (EU AI Act), Article 95. https://artificialintelligenceact.eu/ — accessed 2026-04-26.

  8. Organisation for Economic Co-operation and Development, “OECD AI Principles.” https://oecd.ai/en/ai-principles — accessed 2026-04-26.