AITE M1.1-Art10 v1.0 Reviewed 2026-04-06 Open Access

Fine-Tuning Decision Tree: RAG → Few-Shot → PEFT → Full Fine-Tune



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 10 of 35


Fine-tuning is the option every team considers first and should consider last. A new enterprise AI project reaches the end of a feasibility workshop and someone asks, “So, we need to fine-tune the model on our data, right?” The answer is almost always no, and the harder answer is why. Fine-tuning is expensive in compute, expensive in engineering time, expensive in ongoing governance burden, and in most enterprise cases the quality gain it delivers could have been achieved with prompt engineering, retrieval, or few-shot patterns at a small fraction of the cost. The architect’s job is to teach the escalation discipline — try the cheap moves first, measure the gap, escalate only when the gap is real and the cheaper moves cannot close it. This article gives the AITE-SAT learner that escalation ladder and the thresholds at which each rung becomes the right choice.

The escalation ladder

The ladder has five rungs, and the architect climbs them in order.

Rung 1 — Prompt engineering. Rewrite the prompt. Add system-prompt constraints. Add output-format scaffolding. Add explicit instructions for the edge cases the model is getting wrong. Article 3 gave the tools; rung 1 is their disciplined application. Most quality gaps that feel like “the model doesn’t understand our domain” close at rung 1 when a senior architect rewrites the prompt with the same rigor a senior developer would apply to a piece of production code.
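Rung 1's discipline can be made concrete. The sketch below shows the shape of a rigorously rewritten system prompt: an explicit role, hard constraints, an output schema, and named edge cases. The task, schema, and field names are hypothetical.

```python
# Illustrative rung-1 prompt rewrite. Every element below is an assumption
# chosen for the example, not a prescribed template.
SYSTEM_PROMPT = """You are a contracts analyst. Follow every rule below.

Rules:
- Answer only from the supplied document; if the answer is absent, reply NOT_FOUND.
- Output valid JSON matching: {"party": string, "effective_date": "YYYY-MM-DD"}.

Edge cases:
- Multiple amendments: use the effective date of the latest amendment.
- Undated signature page: set effective_date to NOT_FOUND."""
```

The point is not the wording but the structure: constraints and edge cases that were implicit in the team's heads become explicit instructions the model can follow and the evaluation can check.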

Rung 2 — RAG. If rung 1 leaves a gap because the model lacks domain-specific knowledge, add retrieval. Article 4 gave the framework. The model’s training cutoff does not know the organization’s internal policies, product catalog, case law, or regulatory filings; retrieval puts that content in the prompt at query time. A large fraction of “we need to fine-tune” conclusions are actually “we need to retrieve” conclusions, and the architect who skips rung 2 and escalates directly to fine-tuning pays 100× the cost to solve a problem at rung 4 that would have closed at rung 2.
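The rung-2 pattern reduces to assembling retrieved passages into the prompt at query time. Below is a minimal sketch, with a toy keyword-overlap retriever standing in for whatever vector or hybrid search the stack actually uses; function names are hypothetical.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Toy relevance score: count of query words appearing in each document.
    # A real stack would use embedding or hybrid search here.
    words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def assemble_rag_prompt(query: str, docs: list[str]) -> str:
    # Retrieved content enters the prompt at query time; the model's weights
    # never change.
    passages = retrieve(query, docs)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below; cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```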

Rung 3 — Few-shot prompting. If the task is structural — the output must match a specific format the model keeps deviating from — add exemplars. Two to eight high-quality input-output pairs in the prompt teach the model the structural pattern more reliably than a verbose instruction. Few-shot is particularly effective for structured extraction, classification with nuanced categories, and rubric-bound generation. The prompt gets longer, but the quality gain on structural consistency usually justifies the cost.
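The exemplar pattern can be sketched as follows; the extraction task and its JSON fields are hypothetical, and a production prompt would carry two to eight such pairs.

```python
# Hypothetical input-output pairs for a structured-extraction task. The point
# is the prompt shape: instruction, exemplars, then the live query.
EXEMPLARS = [
    ("Invoice #4417, net 30, EUR 1,200",
     '{"doc": "invoice", "terms": "net30", "amount": 1200, "currency": "EUR"}'),
    ("PO 88-231: 40 units at USD 12.50",
     '{"doc": "purchase_order", "units": 40, "unit_price": 12.5, "currency": "USD"}'),
]

def few_shot_prompt(instruction: str, query: str) -> str:
    blocks = [instruction]
    for given, expected in EXEMPLARS:
        blocks.append(f"Input: {given}\nOutput: {expected}")
    blocks.append(f"Input: {query}\nOutput:")  # model completes from here
    return "\n\n".join(blocks)
```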

Rung 4 — Parameter-efficient fine-tuning (PEFT). If rungs 1–3 leave a gap that cannot be closed by better prompts, more retrieval, or more exemplars — typically because the task requires a behavioral change the base model’s alignment does not permit or a stylistic register that prompt instructions cannot reliably enforce — consider PEFT. LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA for training on smaller hardware), and DoRA are the dominant parameter-efficient methods; they train a small set of adapter weights that layer on top of the frozen base model.1 A LoRA adapter is often a few megabytes, trains on a single GPU in hours, and can be swapped at inference time. The cost and complexity are real but orders of magnitude lower than full fine-tuning.
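The low-rank idea behind LoRA fits in a few lines of numpy. This is an illustrative sketch of the math, not the peft library API; the dimensions are arbitrary.

```python
import numpy as np

d, r = 1024, 8                            # layer width, adapter rank (assumed)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))           # frozen base weight
A = 0.01 * rng.standard_normal((r, d))    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection (zero-init)

def lora_forward(x: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    # Base path plus scaled low-rank update: equivalent to
    # x @ (W + (alpha / r) * B @ A).T, but only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

trainable = A.size + B.size               # 16,384 adapter parameters
frozen = W.size                           # 1,048,576 base parameters
print(f"adapter is {trainable / frozen:.2%} of the layer")  # → adapter is 1.56% of the layer
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer, which is what makes the adapter safe to train from the base model's behavior and swap at inference time.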

Rung 5 — Full fine-tuning. Full fine-tuning updates all of the model’s weights (or a substantial fraction). It suits workloads where PEFT’s adapter-sized capacity is insufficient — typically because the domain shift is very large (specialized legal, scientific, or clinical vocabulary the base model barely saw in pretraining) or because the training objective itself is non-standard (preference-tuning to the organization’s house style). Full fine-tuning is the most expensive rung in every dimension: training compute, evaluation burden, ongoing maintenance, and the risk of catastrophic forgetting that makes the model worse on general tasks it used to handle well. The architect who climbs to rung 5 does so with a written justification that names the evaluation result at rungs 1–4 and the specific capability gap that full fine-tuning is expected to close.

[DIAGRAM: StageGateFlow — aite-sat-article-10-escalation-ladder — Vertical ladder with five rungs ordered bottom-to-top: “Rung 1: Prompt engineering”, “Rung 2: RAG”, “Rung 3: Few-shot”, “Rung 4: PEFT (LoRA / QLoRA / DoRA)”, “Rung 5: Full fine-tune”. Between each rung, a gate annotated with the decision question (“Does the gap persist after rung N on golden set?”), the measurement (recall, precision, rubric score, task metric), and the cost-of-escalation estimate. Side labels show cumulative compute cost, engineering time, and ongoing governance burden increasing sharply from rung 1 to rung 5.]

The governance risk of premature escalation

Full fine-tuning is not just expensive. It creates a new model artifact the organization is now responsible for. That artifact needs its own evaluation lineage, its own model card, its own registration in the model registry (Article 21), its own re-evaluation cadence when the underlying base model is deprecated, and its own conformity-assessment evidence if the use case is high-risk under the EU AI Act.2 Each of those governance obligations is an ongoing cost that does not appear on the training-run invoice but is paid every quarter for as long as the fine-tuned model is in production. An architect who did not need fine-tuning and escalated anyway has committed the organization to that cost structure unnecessarily.

PEFT has lower governance burden because the adapter is lightweight and the base model is shared — but PEFT is still a new artifact that needs registration, evaluation, and lineage. The architect treats PEFT adapters as first-class models in the registry.

When PEFT is the right move

PEFT is the right move when three conditions hold. First, prompt engineering, RAG, and few-shot together leave a measured gap on the golden set that matters to the business outcome. Second, the gap has a behavioral or stylistic character that fits within the capacity of a LoRA adapter — typical examples include enforcing a corporate tone-of-voice register the base model resists, adding a narrow domain vocabulary the model keeps getting wrong, or teaching an instruction-following pattern the model did not learn in its alignment phase. Third, the team has the evaluation infrastructure (Article 11) to validate the adapter and the monitoring infrastructure (Article 13) to catch regressions after deployment.

Replit’s Ghostwriter fine-tuning, described in their 2023 public blog, is an example of PEFT applied to a narrow task — code completion in specific languages with the team’s preferred patterns — where the gap was real, measurable, and closable with adapter-sized training.3 The architectural point is that Replit evaluated the gap, chose the right rung, and documented the decision; other teams looking at similar use cases could calibrate their own escalation against Replit’s published numbers.

When full fine-tuning is the right move

Full fine-tuning is the right move in a small number of cases: training on a new language or dialect that the base model handles poorly; specialized scientific or clinical domains where terminology, reasoning patterns, and conventions diverge materially from the pretraining distribution; and continued pretraining to extend a model’s knowledge with a large corpus of domain text. BloombergGPT, the first widely cited example of continued pretraining on financial data for a dedicated model, illustrates the pattern and also the cost — months of GPU time on a dedicated cluster, a new evaluation corpus, a new model card, and the ongoing governance of a bespoke model.4 BloombergGPT’s publication was valuable precisely because it made the cost structure transparent; subsequent teams could weigh their own case against it.

More recently, some organizations have concluded that the cost of continued pretraining for a specialized model is no longer justified, because frontier base models are improving on specialized domains faster than the dedicated specialized models themselves are improving. The architect does not treat “fine-tune from scratch” as an eternal answer; the base-model capability curve keeps eroding the case for dedicated training.

The evaluation discipline that makes escalation honest

Every rung is evaluated the same way: a fixed golden set, a fixed set of metrics, and a fixed pass/fail threshold. Rung 1 is evaluated, and the result is the baseline. Rung 2 is evaluated and compared to rung 1. Rung 3 is evaluated and compared to rung 2. And so on. The architect who cannot produce a rung-level evaluation sheet should not be climbing to rung 4 or 5. The evaluation is the evidence the business uses to decide whether the escalation was worth the cost.

The golden set must be curated before rung 1 — not after rung 4, when it becomes tempting to define success as “whatever rung 4 produced.” Article 11 develops evaluation architecture at depth; the escalation ladder’s integrity depends on Article 11 being applied rigorously.
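The rung-level evaluation sheet can be sketched as a small harness. The golden set, threshold, and dummy predictors below are illustrative; the discipline is that every rung runs against the same set and the same threshold.

```python
# Hypothetical golden set, curated before rung 1.
GOLDEN_SET = [
    {"input": "doc-1", "expected": "invoice"},
    {"input": "doc-2", "expected": "purchase_order"},
    {"input": "doc-3", "expected": "invoice"},
    {"input": "doc-4", "expected": "credit_note"},
]

def evaluate_rung(rung: str, predict, threshold: float = 0.75) -> dict:
    # Same set, same metric, same pass/fail threshold at every rung.
    hits = sum(predict(ex["input"]) == ex["expected"] for ex in GOLDEN_SET)
    score = hits / len(GOLDEN_SET)
    return {"rung": rung, "score": score, "passed": score >= threshold}

# Dummy predictors standing in for the real rung-1 and rung-2 systems.
rung1 = evaluate_rung("prompt-engineering", lambda x: "invoice")
rung2 = evaluate_rung("rag", {"doc-1": "invoice", "doc-2": "purchase_order",
                              "doc-3": "invoice", "doc-4": "credit_note"}.get)
# Here rung 1 scores 0.5 and fails; rung 2 scores 1.0 and passes, so
# escalation stops at rung 2.
```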

[DIAGRAM: BridgeDiagram — aite-sat-article-10-cost-and-quality-bridge — A left-to-right bridge showing five pillars labelled “Prompt engineering”, “RAG”, “Few-shot”, “PEFT”, “Full fine-tune”. The deck of the bridge is annotated with three parallel bands: “Cost” (rising from very low on the left to very high on the right), “Latency at inference” (flat across the first four, rising only at rung 5 because of model-size increase), and “Governance burden” (rising from very low at rung 1 to very high at rung 5 because of model-card, evaluation, registry, and conformity-assessment obligations). An annotation above the bridge reads “Climb only when the rung below leaves a measured gap on the golden set.”]

Rollback and the fine-tune exit plan

Every PEFT or full fine-tune plan includes an exit plan. If the fine-tuned model underperforms or regresses unexpectedly, traffic reverts to the base-model path. The production architecture keeps both the base-model and the fine-tuned-model paths operational during the rollout and canary phases, with a feature flag or traffic-split mechanism that allows instant revert. The architect who does not plan the exit discovers the absence of the exit when they need it most — mid-incident, under time pressure, with the number of affected users climbing.
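A minimal sketch of the dual-path mechanism follows; the class and method names are hypothetical, not a specific feature-flag product's API.

```python
import random

class DualPathRouter:
    """Keeps both model paths live; a flag decides where traffic goes."""

    def __init__(self, canary_fraction: float = 0.05, seed: int = 0):
        self.canary_fraction = canary_fraction
        self._rng = random.Random(seed)

    def route(self, base_path, tuned_path):
        # Both paths stay deployed during rollout and canary phases.
        if self._rng.random() < self.canary_fraction:
            return tuned_path
        return base_path

    def revert(self):
        # Instant rollback: all traffic returns to the base-model path.
        self.canary_fraction = 0.0

router = DualPathRouter(canary_fraction=0.10)
router.revert()  # mid-incident: every subsequent request takes the base path
```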

Ongoing evaluation continues after launch. The fine-tuned model’s quality is re-measured against the golden set weekly or monthly; drift triggers the rollback or a retraining cycle. The fine-tuned model is re-evaluated when the base model it was trained on is deprecated by the provider; the team knows in advance that base-model deprecation will force a retraining cycle and budgets for it.
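The cadence check reduces to a comparison against the score recorded at launch; the tolerance value below is an assumed example.

```python
def post_launch_check(current_score: float, launch_baseline: float,
                      tolerance: float = 0.02) -> str:
    # Re-run the launch golden set on a weekly or monthly cadence and compare
    # the fresh score against the baseline recorded at launch.
    if launch_baseline - current_score > tolerance:
        return "rollback_or_retrain"  # drift exceeds the agreed tolerance
    return "ok"
```

A drift trigger wired to this check makes the rollback decision mechanical rather than a judgment call made under incident pressure.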

Two real-world examples

BloombergGPT. Bloomberg’s 2023 technical paper documented the training of a 50-billion-parameter model on a 363-billion-token mix of financial text and general corpora. The paper was valuable as an honest presentation of the cost-benefit trade-off: the specialized model outperformed general models on financial tasks at the time of publication but required enormous compute and an ongoing evaluation burden that only an organization of Bloomberg’s size and use-case density could justify.4 The architect reading BloombergGPT recognizes the ceiling — this is what full fine-tuning from scratch looks like — and calibrates their own escalation accordingly.

Hugging Face PEFT documentation and Llama 3 + QLoRA examples. Hugging Face maintains the reference documentation for parameter-efficient fine-tuning libraries, including production-grade examples of Llama 3 fine-tuned with QLoRA on consumer or prosumer GPUs.5 The examples are the open-weight PEFT reference that the AITE-SAT learner can reproduce locally. The architectural point is that PEFT is accessible — not a research curiosity — and a team with a modest budget and a well-prepared training dataset can run a fine-tune cycle in a day. The accessibility is exactly what makes disciplined escalation necessary; without discipline, teams fine-tune because they can, not because they should.

Data preparation is most of the work

Teams that discover fine-tuning for the first time expect the training run to be the hard part. It is not. Data preparation — collecting training examples, labeling them consistently, validating labels, filtering out noisy or duplicate entries, structuring prompts in the format the base model expects, holding out an evaluation split that was not used in training — is typically 70–80% of the effort and the component most responsible for the success or failure of the fine-tune. A training run on poorly prepared data produces a model that memorizes noise; a training run on well-prepared data on a modest budget produces a model that generalizes to the intended distribution.

The architect specifies the data-preparation pipeline as carefully as the training pipeline. Labels are generated by humans, by an LLM-as-judge validated against humans, or by a hybrid; the validation protocol is documented; inter-rater reliability is measured; the final training set is reviewed for distribution coverage against the target workload. A team that cannot document these steps does not have fine-tune-ready data regardless of how many examples it has collected.
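The filtering, deduplication, and holdout steps named above can be sketched as a single function; the field names and holdout fraction are assumptions, and label validation and distribution review would sit alongside this in a real pipeline.

```python
import hashlib
import random

def prepare_dataset(examples: list[dict], holdout_fraction: float = 0.1,
                    seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Return (train, heldout_eval); the eval split is never used in training."""
    seen, clean = set(), []
    for ex in examples:
        # Filter empty or truncated rows before anything else.
        if not ex.get("prompt", "").strip() or not ex.get("completion", "").strip():
            continue
        # Deduplicate on normalized prompt text.
        key = hashlib.sha256(ex["prompt"].strip().lower().encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        clean.append(ex)
    rng = random.Random(seed)
    rng.shuffle(clean)
    cut = max(1, int(len(clean) * holdout_fraction))
    return clean[cut:], clean[:cut]
```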

Regulatory alignment

Fine-tuning triggers EU AI Act obligations for any high-risk use case. Article 10 on data governance applies to the training data; Article 11 on technical documentation applies to the trained artifact; Article 12 on record-keeping applies to the training and evaluation logs; Article 15 on accuracy, robustness, and cybersecurity applies to the deployed model’s behavior.6 The fine-tuned model is its own governed artifact with its own conformity-assessment obligations. A PEFT adapter shares the base model’s risk profile but adds its own obligations on top. The architect plans the regulatory posture of the fine-tune before starting the training run, not after.

Summary

The fine-tuning decision is an escalation discipline. Five rungs — prompt engineering, RAG, few-shot, PEFT, full fine-tune — increase in capability and in cost. The architect climbs in order, evaluates at each rung against the golden set, and justifies each escalation with the measured gap that the cheaper rung could not close. Premature escalation commits the organization to ongoing governance burden for a fine-tuned artifact that did not need to exist. Replit’s Ghostwriter is an example of correct PEFT; BloombergGPT is an example of the ceiling full fine-tuning reaches at scale. Hugging Face’s PEFT documentation is the reference open-weight stack. Rollback and re-evaluation are non-negotiable parts of any fine-tuning deployment. Regulatory alignment under the EU AI Act applies to fine-tuned artifacts as to any other high-risk model.

Further reading in the Core Stream: Model Development Patterns for Enterprise AI and Continuous Evaluation of AI Systems.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Edward Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR 2022 (arXiv 2106.09685). https://arxiv.org/abs/2106.09685 — accessed 2026-04-20. Tim Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” NeurIPS 2023 (arXiv 2305.14314). https://arxiv.org/abs/2305.14314 — accessed 2026-04-20. Shih-Yang Liu et al., “DoRA: Weight-Decomposed Low-Rank Adaptation” (arXiv 2402.09353). https://arxiv.org/abs/2402.09353 — accessed 2026-04-20.

  2. Regulation (EU) 2024/1689, Articles 10–15 and Annex IV. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20.

  3. Replit engineering blog on Ghostwriter fine-tuning. https://blog.replit.com/ — accessed 2026-04-20.

  4. Shijie Wu et al., “BloombergGPT: A Large Language Model for Finance” (arXiv 2303.17564). https://arxiv.org/abs/2303.17564 — accessed 2026-04-20.

  5. Hugging Face PEFT library documentation and Llama 3 QLoRA examples. https://huggingface.co/docs/peft — accessed 2026-04-20.

  6. Regulation (EU) 2024/1689, Articles 9–15. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20.