AITM M1.3-Art05 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Hyperparameter Search and Model Selection

Hyperparameter Search and Model Selection — Maturity Assessment & Diagnostics — Applied depth — COMPEL Body of Knowledge.


AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 5 of 14


Hyperparameter search sits at a peculiar intersection in AI experimentation: the mechanics are well-understood, the literature is mature, and yet teams routinely waste compute on poorly-designed searches and ship models that are subtly worse than ones a smarter search would have found. Hyperparameter search is also where compute cost and time-to-result compound fastest, because a poorly-bounded search can easily cost ten or a hundred times what a well-bounded one would have. This article teaches how to design a search that finds a good model without burning the budget.

The search space

Hyperparameters are the configuration choices that are not learned from data: learning rate, batch size, regularization strength, architecture depth, embedding dimension, temperature, top-k and top-p for generation, retrieval chunk size, reranker depth, and so on. The search space is the set of hyperparameters the search will explore and the ranges they will take.

Two practical principles govern search-space design.

Distinguish high-impact from low-impact hyperparameters. Not all hyperparameters matter equally. Learning rate almost always does. Batch size often matters but its effect is usually coupled with learning rate. Weight decay, gradient clipping thresholds, and warmup steps often matter less. The published literature and practitioner experience support narrow ranges for low-impact parameters and wider ranges for high-impact ones. The practitioner’s job is to think about the parameter list before the search, not after.

Prefer log-uniform ranges where appropriate. Learning rate is usually explored log-uniformly, not uniformly, because the sensitive region spans orders of magnitude. Regularization strength is similar. Most hyperparameter-search libraries, including Optuna, Ray Tune, scikit-learn, Weights & Biases, MLflow, and SageMaker Automatic Model Tuning, accept log-uniform specifications directly.
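To make the log-uniform idea concrete, here is a minimal pure-Python sketch (function name and ranges are illustrative, not from any particular library): draw uniformly in log space, then exponentiate back.

```python
import math
import random

def sample_log_uniform(low, high, rng=random):
    """Draw uniformly in log space, then exponentiate back."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

# Learning rate explored over four orders of magnitude: each decade
# (1e-5..1e-4, 1e-4..1e-3, ...) is equally likely to be sampled.
lr = sample_log_uniform(1e-5, 1e-1)
assert 1e-5 <= lr <= 1e-1
```

A uniform draw over the same range would spend almost all of its samples above 1e-2; the log-space draw covers each decade equally.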

The size of the search space is the principal driver of search cost. Before running the search, the practitioner writes down the search space, computes the number of unique configurations a grid search would explore, and decides which strategy is appropriate for the size.

Search strategies

Four strategies cover most of what a practitioner will do.

Grid search. Exhaustively evaluate every combination of a discrete set of values per hyperparameter. Grid search is appropriate when the space is small (2–3 hyperparameters with 3–5 values each) and the per-configuration training cost is modest. It becomes infeasible fast; a 5-hyperparameter grid with 5 values each is 3,125 configurations.
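The arithmetic is worth doing explicitly before launching anything. A quick sketch (hyperparameter names and candidate values are illustrative):

```python
from itertools import product

# Five hyperparameters with five candidate values each -- a plausible,
# and already infeasible, grid for a deep-learning model.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "batch_size":    [16, 32, 64, 128, 256],
    "weight_decay":  [0.0, 1e-5, 1e-4, 1e-3, 1e-2],
    "dropout":       [0.0, 0.1, 0.2, 0.3, 0.5],
    "warmup_steps":  [0, 100, 500, 1000, 2000],
}

configs = list(product(*grid.values()))
print(len(configs))  # 5**5 = 3125 configurations
```

At even one GPU-hour per configuration, this grid is thousands of GPU-hours before a single rerun.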

Random search. Sample configurations uniformly (or log-uniformly) from the space. Bergstra and Bengio (JMLR 2012) showed that random search outperforms grid search in most realistic settings because most hyperparameters have low effective dimensionality — a few matter, and grid’s uniform coverage wastes evaluations on dimensions that do not [1]. Random search is the correct default for 4+ hyperparameters.
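A random search itself is only a few lines. The sketch below stands a toy objective in for a real training run (the objective, ranges, and budget are all illustrative):

```python
import math
import random

def toy_objective(cfg):
    """Stand-in for 'train a model, return the validation score';
    peaks at lr = 1e-3 and weight_decay = 1e-4."""
    return -((math.log10(cfg["lr"]) + 3) ** 2
             + (math.log10(cfg["weight_decay"]) + 4) ** 2)

rng = random.Random(0)
best_score, best_cfg = float("-inf"), None
for _ in range(60):                                  # fixed evaluation budget
    cfg = {"lr": 10 ** rng.uniform(-5, -1),          # log-uniform sampling
           "weight_decay": 10 ** rng.uniform(-6, -2)}
    score = toy_objective(cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg
```

Note that the evaluation budget is fixed up front, and every sampled configuration and score is kept, not just the winner.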

Bayesian optimization. Model the relationship between hyperparameter configurations and objective values as a probabilistic surface (often a Gaussian process or a tree-structured Parzen estimator), then choose the next configuration to maximize an acquisition function (often expected improvement). Bayesian methods use far fewer evaluations than random search to reach a given quality and are the right choice when each evaluation is expensive (long training runs, large datasets). Libraries including Optuna, Hyperopt, Ax, and Ray Tune implement the standard Bayesian methods [2].
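For reference, expected improvement has a closed form under a Gaussian-process posterior. With posterior mean μ(x), posterior standard deviation σ(x), and f* the best objective value observed so far (maximization convention):

```latex
\mathrm{EI}(x) = \mathbb{E}\left[\max\bigl(f(x) - f^{*},\, 0\bigr)\right]
= \bigl(\mu(x) - f^{*}\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)},
```

where Φ and φ are the standard normal CDF and density. The first term rewards configurations whose predicted mean beats the incumbent; the second rewards uncertainty, which is what drives exploration.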

Hyperband and its successors. Hyperband (Li et al., JMLR 2018) treats hyperparameter search as a bandit problem over partial training runs, allocating more compute to promising configurations and early-stopping the unpromising ones [3]. Hyperband works well when partial-training results are predictive of fully-trained results (usually true for deep learning). BOHB (Bayesian Optimization and Hyperband) combines the two: Bayesian selection of configurations with Hyperband’s early-stopping schedule. Population-Based Training (PBT) from DeepMind evolves hyperparameters during training rather than choosing them up front.
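Hyperband’s core mechanism, successive halving, is simple to sketch. The version below is illustrative (real implementations live in Ray Tune, Optuna, and similar libraries): each round it keeps the top 1/eta of configurations and multiplies the per-configuration budget by eta.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2, rounds=3):
    """configs: list of candidate configurations.
    evaluate(cfg, budget): partial-training score at the given budget.
    Each round keeps the top 1/eta configs; survivors get eta x more budget."""
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        survivors.sort(key=lambda c: evaluate(c, budget), reverse=True)
        survivors = survivors[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0]

# Toy check: partial-budget scores are predictive of full-budget scores,
# which is the assumption Hyperband relies on.
best = successive_halving(
    configs=[0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5],
    evaluate=lambda c, b: -abs(c - 0.9) * b,  # optimum at c = 0.9
)
print(best)  # 0.9
```

When partial scores are not predictive (the ranking at small budgets differs from the ranking at full budget), this scheme early-stops the eventual winners, which is why the predictiveness assumption matters.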

[DIAGRAM: MatrixDiagram — aitm-eci-article-5-strategy-matrix — 2x2 of “Dimensionality (low vs. high)” by “Per-evaluation cost (low vs. high)” mapping to grid (low/low), random (high/low), Bayesian (low/high), and Hyperband/BOHB (high/high).]

Compute budget

Hyperparameter search is almost always the largest compute consumer in offline evaluation. Budget discipline is how programs stay within their allocations.

A budget is authored in three parts. First, the practitioner estimates the cost of one full training run (GPU-hours, dollars, or tokens, depending on the model class). Second, the practitioner multiplies by the expected number of configurations the chosen strategy will evaluate. For random search this is a planning number; for Bayesian and Hyperband it is a function of the early-stopping schedule. Third, the practitioner adds a multiplier (1.5x to 2x is typical) for rerun overhead, failed configurations, and debugging.
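The three-part arithmetic fits in a few lines (the numbers are illustrative):

```python
def search_budget(cost_per_run, n_configs, overhead=1.75):
    """Ceiling = per-run cost x expected evaluations x overhead multiplier
    (1.5x-2x covers reruns, failed configurations, and debugging)."""
    return cost_per_run * n_configs * overhead

# e.g. 4 GPU-hours per full training run, 60 random-search configurations
ceiling = search_budget(cost_per_run=4, n_configs=60)
print(ceiling)  # 420.0 GPU-hours
```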

The budget is a ceiling, not a target. An experienced practitioner stops the search when the marginal improvement from additional configurations is small — typically when the best configuration has not improved across the last 20% of evaluations. Stopping rules are specified in advance, alongside the other elements of the hypothesis.
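The “no improvement across the last 20% of evaluations” rule, written down in advance, might look like this (the window fraction and minimum-evaluation threshold are illustrative):

```python
def should_stop(scores, window_frac=0.20, min_evals=10):
    """Stop when the best score in the last window_frac of evaluations
    does not beat the best score from everything before it."""
    if len(scores) < min_evals:
        return False
    window = max(1, int(len(scores) * window_frac))
    return max(scores[-window:]) <= max(scores[:-window])

improving = [0.1 * i for i in range(1, 16)]              # still improving
plateaued = [0.1 * i for i in range(1, 11)] + [0.9] * 5  # flat tail
print(should_stop(improving), should_stop(plateaued))    # False True
```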

Spot and preemptible compute deserve a mention. Major cloud providers (AWS EC2 Spot, Google Cloud Spot VMs, Azure Spot Virtual Machines) price interruptible compute at a 60–90% discount versus on-demand. Hyperparameter search is an almost-ideal workload for spot compute because individual configurations are independent and failed configurations can be restarted. A practitioner who designs the search to tolerate preemption (checkpointing every N steps, using a job scheduler that re-queues preempted configurations) can multiply the effective budget by 3–5x. That saving often changes what it is feasible to search at all.
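Designing for preemption mostly means checkpointing and resuming. A minimal sketch (the file format and training-step stand-in are illustrative; a real job would checkpoint model weights and optimizer state):

```python
import json
import os

def train(total_steps, ckpt_path, checkpoint_every=100):
    """Resume from the last checkpoint if one exists, so a preempted
    spot instance loses at most checkpoint_every steps of work."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)                 # resume, don't restart
    for _ in range(state["step"], total_steps):
        state["step"] += 1
        state["loss"] *= 0.999                   # stand-in for one training step
        if state["step"] % checkpoint_every == 0:
            with open(ckpt_path, "w") as f:
                json.dump(state, f)              # survives preemption
    return state
```

Paired with a scheduler that re-queues preempted jobs, this makes each configuration restartable at the cost of at most one checkpoint interval of repeated work.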

The tuning-to-validation pitfall

The single most common hyperparameter-search error is over-tuning the validation set. The validation set is used to choose between configurations. Each configuration produces a validation score. The best validation score is, by construction, an optimistic estimate of the model’s true generalization: the search has explicitly optimized for validation performance, and the winning configuration is the one that happened to align best with the validation set’s idiosyncrasies.

The correction is to reserve a held-out test set that the hyperparameter search never touches. The search chooses the configuration on validation; the final metric on the unseen test set is what the practitioner reports. This re-anchors the reported performance to a slice the search has not overfit to.

Two complementary practices help.

Nested cross-validation. The outer loop estimates generalization; the inner loop selects hyperparameters. Nested cross-validation is more expensive than flat cross-validation, but it produces honest generalization estimates and is appropriate when data is small and the risk of validation overfitting is high.
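The two-loop structure can be sketched in pure Python with a toy one-parameter ridge model (the model and helpers are illustrative; in practice this would use a library such as scikit-learn):

```python
import random

def kfold(n, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        yield [j for f in folds[:i] + folds[i + 1:] for j in f], folds[i]

def fit_predict(xs, ys, test_xs, ridge):
    """1-D ridge regression: slope shrunk toward zero by `ridge`."""
    w = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + ridge)
    return [w * x for x in test_xs]

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def nested_cv(xs, ys, ridges, outer_k=5, inner_k=4):
    outer_scores = []
    for tr, te in kfold(len(xs), outer_k):
        # Inner loop: choose the ridge strength using inner folds only.
        def inner_score(r):
            total = 0.0
            for itr, ite in kfold(len(tr), inner_k):
                itr, ite = [tr[i] for i in itr], [tr[i] for i in ite]
                pred = fit_predict([xs[i] for i in itr], [ys[i] for i in itr],
                                   [xs[i] for i in ite], r)
                total += mse(pred, [ys[i] for i in ite])
            return total
        best_r = min(ridges, key=inner_score)
        # Outer loop: score the chosen configuration on data the inner
        # search never saw -- an honest generalization estimate.
        pred = fit_predict([xs[i] for i in tr], [ys[i] for i in tr],
                           [xs[i] for i in te], best_r)
        outer_scores.append(mse(pred, [ys[i] for i in te]))
    return sum(outer_scores) / len(outer_scores)
```

The key property is that the outer test folds are never visible to the hyperparameter selection, so the returned average is not inflated by the search.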

Validation-set refresh. If the validation set is small and hyperparameters have been searched against it many times, refresh the validation set before the next major search. The refreshed set is unpolluted by prior tuning.

[DIAGRAM: ScoreboardDiagram — aitm-eci-article-5-search-leaderboard — A table with columns for configuration, validation score, test score, compute cost (GPU-hours), and a “winning” flag, illustrating that the winning configuration is decided on validation and confirmed on test.]

The winning configuration is not the full artifact. The full artifact is the search: the space that was explored, the strategy that was used, the budget that was spent, every configuration that was evaluated, and its resulting metrics. Without the search record, the winning configuration cannot be reproduced, cannot be audited, and cannot be improved.

The search record is a structured artifact:

  • The search space, exactly as specified to the search library, with per-hyperparameter ranges and scales.
  • The strategy (grid, random, Bayesian, Hyperband, PBT, other), with library version.
  • The seed used for randomization.
  • The budget, expressed in GPU-hours or dollars, and the actual spend.
  • The complete configuration-and-metric table for every evaluated configuration.
  • The winning configuration, clearly flagged.
  • A held-out test-set metric for the winning configuration.
  • A brief narrative: what surprised the practitioner, which hyperparameters mattered, which did not.
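Serialized, a minimal search record might look like this. The field names are illustrative, not a standard schema, and the library version and scores are invented for the example; tracking tools store richer equivalents natively.

```python
import json

search_record = {
    "search_space": {
        "learning_rate": {"low": 1e-5, "high": 1e-1, "scale": "log"},
        "batch_size": {"choices": [16, 32, 64, 128]},
    },
    "strategy": {"name": "random", "library": "optuna", "version": "3.x"},
    "seed": 42,
    "budget": {"ceiling_gpu_hours": 420, "actual_gpu_hours": 388},
    "trials": [  # one row per evaluated configuration, winner flagged
        {"config": {"learning_rate": 3e-4, "batch_size": 64},
         "validation_score": 0.861, "gpu_hours": 4.2, "winner": True},
        {"config": {"learning_rate": 1e-2, "batch_size": 16},
         "validation_score": 0.792, "gpu_hours": 3.9, "winner": False},
    ],
    "winner_test_score": 0.848,   # held-out test metric, reported metric
    "narrative": "Learning rate dominated; batch size barely mattered.",
}
record_json = json.dumps(search_record, indent=2)
```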

Modern experiment-tracking tools (MLflow, Weights & Biases, Neptune, Comet, Aim, Kubeflow Metadata, and the native tracking in Azure ML, SageMaker, Vertex AI, and Databricks) store the search record natively; the practitioner’s job is to use them, not to pretend a spreadsheet is sufficient [4].

Two foundational references in the hyperparameter-search literature

Bergstra and Bengio — random over grid. Bergstra and Bengio (JMLR 2012) established that random search dominates grid search in high-dimensional spaces for most realistic problems [1]. The paper is the single most-cited reference in hyperparameter-search practice and should be read by every practitioner before their first large search.

Li et al. — Hyperband. Li et al. (JMLR 2018) established Hyperband as a compute-efficient strategy when partial training is predictive [3]. Hyperband’s successors (BOHB, ASHA, and others) are standard in Ray Tune, Optuna, and Determined AI. The paper formalized the bandit approach to search, and its lessons generalize to any setting where partial-training results are informative.

Summary

Hyperparameter search has four main strategies. Grid is for small spaces. Random is the default for high-dimensional spaces. Bayesian is for expensive evaluations. Hyperband and its relatives are for settings where partial training is predictive. Budget discipline comes from estimating per-evaluation cost, multiplying by evaluation count, and adding overhead. Spot and preemptible compute effectively multiply the budget. Over-tuning the validation set is the main pitfall; a held-out test set and nested cross-validation mitigate it. The search record — space, strategy, budget, full configuration table, winning configuration, test metric — is the reproducibility artifact. Tooling is broad and vendor-neutral.

Further reading in the Core Stream: Machine Learning Fundamentals for Decision Makers.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. James Bergstra, Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, Vol. 13, 2012. https://www.jmlr.org/papers/v13/bergstra12a.html — accessed 2026-04-19.

  2. Optuna, Hyperopt, Ax, and Ray Tune documentation. https://optuna.org/ ; http://hyperopt.github.io/hyperopt/ ; https://ax.dev/ ; https://docs.ray.io/en/latest/tune/ — accessed 2026-04-19.

  3. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, Vol. 18, 2018. https://jmlr.org/papers/v18/16-558.html — accessed 2026-04-19.

  4. MLflow, Weights & Biases, Neptune, Aim, Comet, and Kubeflow Metadata documentation. https://mlflow.org/ ; https://wandb.ai/ ; https://neptune.ai/ ; https://aimstack.io/ ; https://www.comet.com/ ; https://www.kubeflow.org/ — accessed 2026-04-19.