AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 2 of 35
The first architecture decision the AITE-SAT holder makes on any project is which foundation model the system will use. It is also the decision most often made for the wrong reasons. It is made because an executive read an analyst report; because the cloud account has a credit balance with one provider; because an engineer had a good week with one model during a hackathon; because a procurement officer standardized on one vendor two years ago. None of those reasons is architectural. This article gives the architect an eight-criterion decision framework that works whether the candidate is a closed-weight API (Anthropic Claude, OpenAI GPT-class, Google Gemini, Cohere Command, Mistral’s managed service), an open-weight model self-hosted or managed (Meta Llama 3, Mistral open-weight, Alibaba Qwen, DeepSeek), or a cloud-platform offering (AWS Bedrock, Azure AI Foundry, Google Vertex AI, Databricks Foundation Model APIs) that aggregates several.
The eight criteria
Model selection is multi-criteria because no single criterion tells the architect which model to use. A model that wins on capability may lose on data residency. A model that wins on cost may lose on latency. A framework that forces the architect to score every candidate against every criterion produces defensible decisions that survive procurement review, security review, and the first two budget cycles.
The criteria are fit to task, capability, cost, latency, data residency and sovereignty, customization, operational maturity, and exit cost.
Fit to task asks whether the model is suited to the kind of work the use case demands. A customer-support summarization task is not the same as a complex reasoning task; a structured-output extraction task is not the same as a creative-writing task; a short-context question-answering task is not the same as a long-context document review. Each model family has a shape of capability that makes it better for some tasks than others. Stanford’s HELM benchmark decomposes general capability into many narrow evaluations; an architect uses HELM and task-specific evaluations to disqualify models that the capability shape does not fit.1
Capability asks whether the model meets a minimum quality bar on the specific task. Capability is measured by offline evaluation against a golden set for the task, not by general benchmarks. Article 11 develops evaluation architecture; for model selection, the rule is that capability is proved by the use case’s own golden set and by nothing else.
Cost asks what the model costs per unit of work. For managed APIs the unit is input tokens plus output tokens at the published rate, with caching discounts applied. For self-hosted open-weight models the unit is the amortized cost of the GPU or inference accelerator capacity that serves the workload, divided by the throughput achieved. For cloud-platform offerings the unit is the per-token or per-hour cost published by the platform. A model that is 5x cheaper per token but half as capable may cost more per completed task if the application has to retry failed responses; the architect reasons in cost per successful task, not cost per token.
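The cost-per-successful-task reasoning can be made concrete with a small calculation. The sketch below assumes failed responses are retried until one succeeds, so expected attempts per success is the reciprocal of the success rate; all prices, token counts, and success rates are hypothetical illustrations, not figures for any real model.

```python
def cost_per_successful_task(price_per_1k_tokens, tokens_per_attempt, success_rate):
    """Expected cost of one completed task, assuming failed attempts are
    retried; expected attempts per success = 1 / success_rate."""
    cost_per_attempt = price_per_1k_tokens * tokens_per_attempt / 1000
    return cost_per_attempt / success_rate

# Hypothetical numbers: the cheap model is 5x cheaper per token but fails
# often enough on this task that it costs more per successful task.
cheap   = cost_per_successful_task(price_per_1k_tokens=0.002,
                                   tokens_per_attempt=1500, success_rate=0.18)
capable = cost_per_successful_task(price_per_1k_tokens=0.010,
                                   tokens_per_attempt=1500, success_rate=0.96)

print(f"cheap model:   ${cheap:.4f} per successful task")
print(f"capable model: ${capable:.4f} per successful task")
```

Under these assumed numbers the nominally cheaper model loses; the crossover point depends entirely on the measured success rate from the use case's own evaluation.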
Latency asks how fast the model responds. The relevant figures are time-to-first-token (TTFT), inter-token latency, and end-to-end p95 and p99 for the expected prompt and completion length. Latency varies across providers and across model variants within a provider; a managed API at one region is often faster than the same model at a distant region because network round-trip time dominates for short prompts.
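The percentile figures the criterion asks for are straightforward to compute once a harness has recorded timings. The sketch below uses synthetic samples in place of real measurements; a real harness would timestamp a streaming API call to capture TTFT, and inter-token latency would be summarized the same way.

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over a list of measurements."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Synthetic stand-ins for measured latencies, in milliseconds.
random.seed(7)
ttft_ms = [random.gauss(450, 80) for _ in range(1000)]    # time to first token
e2e_ms  = [t + random.gauss(2200, 400) for t in ttft_ms]  # end-to-end

print(f"TTFT p50: {percentile(ttft_ms, 50):.0f} ms")
print(f"E2E  p95: {percentile(e2e_ms, 95):.0f} ms")
print(f"E2E  p99: {percentile(e2e_ms, 99):.0f} ms")
```

The comparison across providers is only meaningful when every candidate is measured with the same prompt lengths, completion lengths, and client region, since all three dominate the tail.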
Data residency and sovereignty asks where the model runs and where its inputs and outputs flow. For a system handling EU personal data an architect must confirm that inputs never cross a data-residency boundary the regulator disallows; for a system handling regulated financial data, the architect must confirm that logs and telemetry comply with the supervisor’s expectations. The EU AI Act’s high-risk obligations do not impose a data-residency rule directly, but the GDPR and sector supervisors routinely do.2 AWS has published its European Sovereign Cloud roadmap for EU-only operation; Microsoft has published its EU Data Boundary; Google Cloud publishes its Sovereign Controls documentation. Residency is a veto criterion: a model that cannot satisfy residency is disqualified, regardless of how strongly it scores on every other criterion.
Customization asks whether the model can be customized in the ways the use case needs. Customization spans prompt engineering (supported by every model), few-shot examples (supported by every model), retrieval-augmented context (supported by every model), parameter-efficient fine-tuning such as LoRA and QLoRA (supported by most open-weight models and a subset of closed-weight offerings), full fine-tuning (supported by most open-weight and by some closed-weight with provider assistance), and continued pretraining (rare, expensive, and subject to licensing terms). The use case determines how much customization is required; Article 10 develops the decision tree.
Operational maturity asks whether the provider’s operational stance is credible for production. For a managed API, operational maturity is SLA transparency, incident history, post-mortem quality, and deprecation policy. For an open-weight model, operational maturity is the ecosystem: how many serving stacks the weights run on, how many quantizations exist, how many inference frameworks optimize for the architecture, how actively maintained the reference implementation is. OpenAI publishes a deprecation schedule that has retired multiple models since 2023; an architect who depended on a deprecated model without reading the schedule pays the cost of a forced migration.3 A model with a well-documented deprecation policy is more mature than one with none.
Exit cost asks how expensive it will be to switch away from this model later. Exit cost is the cost of abstraction (how much the application code assumes this model’s specific behavior), the cost of data (whether fine-tuning weights are portable), and the cost of integrations (whether the application’s tool-calling schema binds to this provider’s format). A decision that locks the system to one model family for five years is a higher-cost decision than one that preserves, over those same five years, the option to switch. Exit cost is the criterion most often ignored at selection and most often paid at the second procurement cycle.
[DIAGRAM: MatrixDiagram — aite-sat-article-2-selection-matrix — A 2x2 matrix with axes “Data sensitivity” (low/high) and “Customization need” (low/high), with quadrant recommendations: “Managed API” (low/low), “Managed API + fine-tuning” (low/high), “Open-weight self-hosted” (high/high), “Hybrid with sovereign managed layer” (high/low). Each quadrant lists two concrete examples of the recommended stance.]
The selection scorecard
The selection scorecard is the working artifact of the framework. It lists every shortlisted model on the rows and the eight criteria on the columns, with a score per cell, a weight per criterion, and a weighted total per row. The weights vary by use case: a customer-support assistant may weight cost heavily and residency lightly, while a regulated-document assistant inverts the weighting. The scorecard is not a mechanical selector; it is a structured way of making the decision legible to everyone who will review it.
| Criterion | Managed API (e.g., Anthropic Claude, OpenAI GPT-class, Google Gemini) | Open-weight self-hosted (e.g., Llama 3, Mistral, Qwen, DeepSeek) | Cloud-platform (e.g., AWS Bedrock, Azure AI Foundry, Vertex AI) |
|---|---|---|---|
| Fit to task | High on general reasoning; depends on model family | Depends on weights and fine-tuning | Strong on integrated platforms; depends on hosted model |
| Capability | Usually highest raw capability per dollar of effort | Competitive with the largest closed models in most use cases; requires evaluation | Inherits capability of underlying model; adds platform abstractions |
| Cost | Pay per token; no infrastructure | Amortized GPU cost; fixed capacity | Pay per token on hosted models; optional reserved capacity |
| Latency | Depends on region; TTFT typically sub-second | Dependent on serving stack; vLLM and TGI give predictable TTFT | Platform-dependent; comparable to managed API |
| Data residency | Constrained to provider’s regions; EU boundary options exist | Full control; deploy anywhere | Constrained to cloud region; sovereign options available |
| Customization | Limited fine-tuning; prompt + few-shot + RAG | Full control: LoRA, QLoRA, full fine-tune, continued pretraining | Platform-managed fine-tuning where supported |
| Operational maturity | Mature, with published deprecation policies | Depends on model lineage and community | Mature; bound to cloud provider’s operational maturity |
| Exit cost | Moderate to high depending on API-specific features | Low if deployment is containerized | High for cloud-specific integrations; lower for model-only use |
The table is a typology, not a ranking. Every real decision needs its own scorecard with its own weights.
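The mechanics of the scorecard can be sketched as a few lines of code: candidates on the rows, the eight criteria as keys, a weight vector per use case, and residency treated as the veto criterion described above. Every score, weight, and candidate name below is an illustrative assumption, not a recommendation.

```python
CRITERIA = ["fit", "capability", "cost", "latency", "residency",
            "customization", "ops_maturity", "exit_cost"]

def weighted_total(scores, weights, veto=("residency",), veto_floor=3):
    """Weighted sum of 1-5 scores; a candidate scoring below the floor on a
    veto criterion is disqualified regardless of its other scores."""
    for criterion in veto:
        if scores[criterion] < veto_floor:
            return None  # disqualified
    return sum(scores[c] * weights[c] for c in CRITERIA)

# Hypothetical residency-heavy weighting for a regulated-document assistant.
weights = {"fit": 0.20, "capability": 0.20, "cost": 0.10, "latency": 0.05,
           "residency": 0.20, "customization": 0.10, "ops_maturity": 0.10,
           "exit_cost": 0.05}

candidates = {
    "managed_api":  {"fit": 5, "capability": 5, "cost": 3, "latency": 4,
                     "residency": 3, "customization": 2, "ops_maturity": 5,
                     "exit_cost": 3},
    "self_hosted":  {"fit": 4, "capability": 4, "cost": 3, "latency": 3,
                     "residency": 5, "customization": 5, "ops_maturity": 3,
                     "exit_cost": 4},
    "no_eu_region": {"fit": 5, "capability": 5, "cost": 4, "latency": 4,
                     "residency": 1, "customization": 2, "ops_maturity": 4,
                     "exit_cost": 3},
}

for name, scores in candidates.items():
    total = weighted_total(scores, weights)
    print(name, "disqualified (residency veto)" if total is None
          else f"{total:.2f}")
```

With these assumed weights the self-hosted candidate edges out the managed API and the residency-failing candidate is vetoed outright; invert the weights toward cost and operational maturity and the ranking flips, which is exactly the legibility the scorecard exists to provide.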
Two worked decisions
BloombergGPT. In 2023 Bloomberg announced BloombergGPT, a 50-billion-parameter language model trained from scratch on 363 billion tokens of Bloomberg’s proprietary financial data combined with 345 billion tokens of public datasets. The accompanying technical report explains that the team chose to train their own model rather than fine-tune an existing one because they believed the base training mixture for a finance-domain model should differ enough from general web text to justify the cost.4 That decision is a legitimate application of the framework: the scorecard weighted customization and fit to task above cost and operational maturity, and the weighting reflected Bloomberg’s estimate of the long-term value of owning the weights. A scorecard is the way to convince the procurement board that such a decision was deliberate rather than ambitious.
Morgan Stanley. In the same year Morgan Stanley partnered with OpenAI to build the wealth-management assistant cited in Article 1. The public press material explains that Morgan Stanley indexed approximately 100,000 internal research documents and exposed the corpus to GPT-class models through a RAG architecture.5 That decision is the opposite application of the framework: cost, operational maturity, and time-to-value were weighted above customization and exit cost. Customization was largely unnecessary because RAG carried the domain-specificity; operational maturity mattered because the use case ran in front of regulated advisors. Both Bloomberg and Morgan Stanley reached defensible selections; they reached opposite selections because their scorecards carried different weights.
Eliminating hidden selection
The hardest failure mode in model selection is the hidden selection: the architect lists “managed API” and “self-hosted” as the two candidates but every practical decision the team makes tacitly assumes one of them. If the orchestration layer is scaffolded against OpenAI’s function-calling schema before the evaluation has concluded, the selection was made at scaffold time, not at review time. If the retrieval layer’s chunking has been tuned to one provider’s embeddings without evaluation against another provider’s, the selection happened in chunking. The discipline of the framework is keeping the candidates live until the scorecard is complete.
Hidden selection is prevented by three practices. The first is an evaluation harness (Article 11) that measures every candidate on the same golden set with the same metrics, so no candidate gets an early advantage from partial implementation. The second is an abstraction layer in the orchestration plane that isolates provider-specific APIs behind a shared interface; the implementation detail is swapped by changing a factory, not by rewriting code. The third is a procurement review sign-off before any candidate-specific infrastructure is provisioned; the procurement team gets to see the scorecard, the evaluation results, and the hypothetical migration plan to each other candidate.
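The second practice, the abstraction layer in the orchestration plane, can be sketched as follows. Application code depends on one interface, and the factory is the only place a provider is named; the provider classes and their internals here are placeholders, not real SDK calls.

```python
from abc import ABC, abstractmethod

class ChatModel(ABC):
    """The shared interface the application codes against."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class ProviderAClient(ChatModel):
    def complete(self, prompt: str) -> str:
        # A real implementation would call provider A's SDK here.
        return f"[provider-a] {prompt}"

class ProviderBClient(ChatModel):
    def complete(self, prompt: str) -> str:
        return f"[provider-b] {prompt}"

def make_model(provider: str) -> ChatModel:
    """The factory: the only line in the codebase that names a provider."""
    registry = {"provider-a": ProviderAClient, "provider-b": ProviderBClient}
    return registry[provider]()

# Application code is written once against the interface; swapping the
# candidate under evaluation changes one argument, not the call sites.
model = make_model("provider-a")
print(model.complete("Summarize the quarterly report."))
```

A real layer would also normalize tool-calling schemas and streaming behavior behind the same interface, since those are where provider-specific assumptions most often leak into application code.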
[DIAGRAM: StageGateFlow — aite-sat-article-2-selection-gates — A horizontal stage-gate flow from “Candidate longlist (8-12 models)” through “Desk evaluation against fit and residency (5-7 models)” through “Golden-set benchmark” to “Shortlist of 2-3” to “Pilot evaluation” to “Selected model”. Each gate lists the artifact required to pass (scorecard row, benchmark summary, pilot result, signed-off recommendation).]
Selection in regulated contexts
In regulated industries — banking, insurance, health, pharmacy, public sector — selection is also a regulatory event. The EU AI Act’s Article 9 requires a risk management system for high-risk AI, and the risk management system’s records must show what models were considered and why the selected one is fit for purpose.6 ISO/IEC 42001:2023 Clause 6.1.2 requires the organization to identify, analyze, and evaluate AI risks, and the model selection record is one of the inputs to that clause.7 The Bank of England Prudential Regulation Authority’s SS1/23 supervisory statement on model risk management expects firms to document the model they chose, the alternatives considered, and the basis for the choice.8 An architect who submits a scorecard with eight criteria, weighted and signed, has already produced most of what the supervisor wants to see.
For EU AI Act General-Purpose AI Models, the provider of the GPAI model has its own obligations under Articles 51 through 56 of the AI Act, including technical documentation and systemic-risk assessment for models above the threshold.9 The downstream architect inherits some of those obligations when they deploy the model in a high-risk context; the model card and technical documentation the provider must publish under Article 53 become the architect’s evidence for their own risk file. Selection of a model whose provider does not publish compliant documentation is therefore a compliance selection as much as a technical selection.
Re-selection
A selection that cannot be re-run is not a selection; it is a commitment. The architect’s scorecard is the living artifact that gets re-scored every six to twelve months as model capabilities change, prices change, and new entrants appear. The open-weight ecosystem of 2024 did not exist in 2022; the next two years will see similar shifts. An architect who treats the first selection as final is building a system that will be obsolete at renewal; an architect who treats the first selection as the current best estimate of an ongoing decision is building a system that remains current.
Summary
Model selection is a multi-criteria decision that the architect owns. Eight criteria — fit to task, capability, cost, latency, data residency, customization, operational maturity, and exit cost — cover the space. The scorecard carries the weights per use case; the weights make the decision legible. Managed APIs, open-weight self-hosted models, and cloud-platform offerings each win in different weighting regimes; Bloomberg and Morgan Stanley reached opposite decisions from the same framework. Hidden selection is prevented by evaluation harnesses, orchestration abstractions, and procurement sign-off. In regulated contexts the scorecard is also the risk-management record, and re-selection every six to twelve months keeps the system current.
Further reading in the Core Stream: The AI Technology Landscape, Generative AI and Large Language Models, and Technology Decision Framework for Transformation Leaders.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Percy Liang et al., “Holistic Evaluation of Language Models,” Transactions on Machine Learning Research, 2023, and the HELM live leaderboard. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/helm/ — accessed 2026-04-19.
2. European Data Protection Board, Guidelines 01/2025 on pseudonymisation and related opinions referencing international data transfer under Chapter V GDPR. https://edpb.europa.eu/our-work-tools/general-guidance/guidelines-recommendations-best-practices_en — accessed 2026-04-19.
3. OpenAI model deprecation policy and schedule. OpenAI Platform Documentation. https://platform.openai.com/docs/deprecations — accessed 2026-04-19.
4. Shijie Wu et al., “BloombergGPT: A Large Language Model for Finance,” arXiv:2303.17564, March 2023. https://arxiv.org/abs/2303.17564 — accessed 2026-04-19.
5. Morgan Stanley Wealth Management deploys OpenAI-powered AI @ Morgan Stanley Assistant. Morgan Stanley press release, September 2023. https://www.morganstanley.com/press-releases/key-milestone-in-innovation-journey-with-openai — accessed 2026-04-19.
6. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (EU AI Act), Article 9. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.
7. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, Clause 6.1.2. International Organization for Standardization. https://www.iso.org/standard/81230.html — accessed 2026-04-19.
8. Bank of England Prudential Regulation Authority, Supervisory Statement SS1/23 — Model risk management principles for banks. May 2023. https://www.bankofengland.co.uk/prudential-regulation/publication/2023/may/model-risk-management-principles-for-banks — accessed 2026-04-19.
9. Regulation (EU) 2024/1689, Articles 51-56 (General-Purpose AI Models). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.