This article maps the unique characteristics of AI workloads onto disaster recovery patterns, describes the recovery time objective (RTO) and recovery point objective (RPO) considerations specific to AI, and outlines the testing discipline that distinguishes a real plan from a paper plan.
Why AI Recovery Is Distinctive
Conventional disaster recovery focuses on databases, applications, and storage. AI workloads add four distinctive components that all need recovery treatment.
First, model artefacts. A trained model is often gigabytes to terabytes of data, costly to recompute, and not always reproducible exactly even with the original data and code (per the reproducibility article in Module 1.22). A model that cannot be restored from backup must either be retrained — slow and expensive — or replaced with a less-capable fallback.
Second, feature stores and embeddings. Modern AI workloads depend on materialised features and pre-computed embeddings whose recomputation can take hours or days. Restoring the underlying raw data is necessary but not sufficient; the derived state must also be restorable.
Third, prompt and configuration libraries. For Generative AI workloads, the system prompt, retrieval configuration, and tool definitions are the live “code” of the system. They must be version-controlled and recoverable as deployment artefacts, not as ad-hoc strings in production.
Fourth, external AI service dependencies. A workload that calls a third-party Large Language Model (LLM) provider has a recovery dependency outside its control. Recovery planning must include vendor-failure scenarios, not just internal failure scenarios.
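One practical way to keep these four components visible is a machine-readable recovery inventory. The sketch below is illustrative only; the class names, fields, and example entry are assumptions rather than any standard schema:

```python
from dataclasses import dataclass
from enum import Enum

class ComponentKind(Enum):
    MODEL_ARTEFACT = "model_artefact"
    DERIVED_STATE = "derived_state"        # feature stores, embeddings
    PROMPT_CONFIG = "prompt_config"        # prompts, retrieval config, tool definitions
    EXTERNAL_SERVICE = "external_service"  # third-party LLM providers

@dataclass
class RecoveryInventoryEntry:
    name: str
    kind: ComponentKind
    primary_recovery_path: str   # e.g. "restore from backup"
    fallback_recovery_path: str  # e.g. "retrain" / "regenerate" / "alternate vendor"

# Hypothetical entry showing the retrain-vs-restore trade-off for a model.
inventory = [
    RecoveryInventoryEntry(
        name="fraud-scoring-model-v12",
        kind=ComponentKind.MODEL_ARTEFACT,
        primary_recovery_path="restore weights from cross-region object storage",
        fallback_recovery_path="retrain from versioned data (slow, expensive)",
    ),
]
```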
The U.S. National Institute of Standards and Technology Special Publication 800-34, the Contingency Planning Guide for Federal Information Systems, at https://csrc.nist.gov/publications/detail/sp/800-34/rev-1/final provides the foundational framework that AI extensions build on.
Recovery Tiers and Objectives
Different AI workloads warrant different recovery tiers based on business criticality.
Tier 0 (mission-critical). AI workloads whose unavailability causes immediate, material business or safety impact. Examples: fraud detection in payments, clinical decision support in active care settings, agentic systems controlling physical processes. RTO measured in minutes to low single-digit hours; RPO near-zero.
Tier 1 (business-critical). AI workloads whose extended unavailability causes material impact but can be tolerated for hours. Examples: most customer-facing recommendation systems, content moderation in social platforms, automated underwriting. RTO measured in hours; RPO measured in minutes.
Tier 2 (operationally important). AI workloads that support but do not directly drive customer-facing operations. Examples: marketing campaign optimisation, internal analytics, knowledge management. RTO measured in single-digit days; RPO in hours.
Tier 3 (development and exploratory). Pre-production workloads where recovery is desirable but not time-critical. Best-effort recovery from backup with no formal RTO commitment.
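Encoding the tiers gives monitoring and post-incident reviews something concrete to check against. The sketch below restates the objectives above in code; the exact timedelta values are placeholders to be set per workload:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class RecoveryObjectives:
    rto: Optional[timedelta]  # maximum tolerated downtime; None = best effort
    rpo: Optional[timedelta]  # maximum tolerated data loss; None = best effort

# Placeholder values restating the tiers above; real values are set per workload.
TIERS = {
    0: RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(0)),
    1: RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    2: RecoveryObjectives(rto=timedelta(days=2), rpo=timedelta(hours=4)),
    3: RecoveryObjectives(rto=None, rpo=None),  # best effort, no commitment
}

def breaches_rto(tier: int, observed_downtime: timedelta) -> bool:
    """True if an observed outage exceeded the tier's RTO commitment."""
    objective = TIERS[tier].rto
    return objective is not None and observed_downtime > objective
```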
The Information Systems Audit and Control Association (ISACA) Disaster Recovery Plan resources at https://www.isaca.org/resources/it-audit/audit-resources frame the tiering exercise in business-impact terms that translate well to AI.
Backup Patterns for Each Component
Model Artefacts
Model weights, configuration files, vocabularies, and tokeniser artefacts should be stored in immutable, content-addressed object storage with cross-region replication. The model registry (typically MLflow, Vertex AI Model Registry, SageMaker Model Registry, or a custom system) should treat backup as a first-class concern.
For very large models (10+ GB), incremental backup of weight changes between fine-tunes is more economical than full backup of every checkpoint.
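A minimal sketch of both ideas, content-addressed storage plus incremental upload, follows. It assumes an S3-compatible bucket with cross-region replication already configured; the bucket name, key layout, and function names are placeholders:

```python
import hashlib
import json
from pathlib import Path

import boto3  # assumes an S3-compatible store with replication configured

BUCKET = "model-artefacts-replicated"  # placeholder bucket name
s3 = boto3.client("s3")

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 to produce its content address."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_checkpoint(checkpoint_dir: Path, previous_manifest: dict) -> dict:
    """Upload only shards whose content hash changed since the last backup.

    Keys are content-addressed, so unchanged shards are deduplicated and
    existing objects are never overwritten (immutability by construction).
    """
    manifest = {}
    for shard in sorted(checkpoint_dir.glob("*")):
        digest = sha256_of(shard)
        key = f"sha256/{digest}"
        manifest[shard.name] = key
        if previous_manifest.get(shard.name) != key:  # changed or new shard
            s3.upload_file(str(shard), BUCKET, key)
    # The manifest itself is the restore point: file name -> content address.
    s3.put_object(Bucket=BUCKET, Key=f"manifests/{checkpoint_dir.name}.json",
                  Body=json.dumps(manifest).encode())
    return manifest
```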
Training and Reference Data
The data versioning systems described in Module 1.22 should themselves be backed up. Data Version Control (DVC), Delta Lake time travel, and Apache Iceberg snapshots all provide the technical mechanism; the operational discipline is to verify backup integrity and to test restore.
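As one illustration of testing the restore, the sketch below uses the deltalake package's time travel to spot-check a restored Delta table against invariants captured at backup time. The table path, version, expected row count, and column name are all placeholders:

```python
from deltalake import DeltaTable  # delta-rs Python bindings

def verify_restored_snapshot(table_path: str, version: int,
                             expected_rows: int) -> bool:
    """Spot-check a restored Delta table by reading a pinned snapshot version.

    A restore test should assert invariants known to hold at backup time
    (row counts, schema, checksums of key columns), not just that files exist.
    """
    table = DeltaTable(table_path, version=version)  # time travel to snapshot
    frame = table.to_pandas()
    return len(frame) == expected_rows and "label" in frame.columns  # placeholder column

# Example invocation against a restored copy, not the production table:
# verify_restored_snapshot("s3://restore-test/training-data", version=42,
#                          expected_rows=1_250_000)
```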
Vector Stores and Embeddings
Vector databases (Pinecone, Weaviate, Milvus, Qdrant, pgvector) require either native backup support or scheduled export of the embeddings. Embeddings can usually be regenerated from the source documents, but regeneration may take hours and should be considered a fallback rather than a primary recovery path.
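The sketch below shows a scheduled export using the Qdrant client's scroll API as one concrete example; other stores expose equivalent scan or native snapshot mechanisms, and the collection name and JSONL output format here are assumptions:

```python
import json

from qdrant_client import QdrantClient  # one example; other stores differ

def export_collection(client: QdrantClient, collection: str, out_path: str) -> int:
    """Export every point (id, vector, payload) to a JSONL file for backup.

    Regenerating embeddings from source documents is the fallback path;
    a scheduled export like this is the faster primary recovery path.
    """
    count, offset = 0, None
    with open(out_path, "w") as out:
        while True:
            points, offset = client.scroll(
                collection_name=collection, limit=1000,
                with_vectors=True, with_payload=True, offset=offset,
            )
            for point in points:
                out.write(json.dumps({"id": str(point.id),
                                      "vector": point.vector,
                                      "payload": point.payload}) + "\n")
                count += 1
            if offset is None:  # no more pages to scroll
                return count
```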
Configuration and Prompts
Application configuration, system prompts, retrieval templates, and tool definitions should live in version control and be backed up as part of the conventional source control backup. They should be deployed by the same release pipeline as application code, enabling restore-by-redeploy.
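A minimal sketch of restore-by-redeploy for prompts follows, assuming the bundle ships as a YAML file inside each release; the file name and required keys are illustrative:

```python
from pathlib import Path

import yaml  # PyYAML; prompts live in the repo, not as ad-hoc production strings

def load_prompt_bundle(release_dir: Path) -> dict:
    """Load the system prompt and tool definitions from the deployed release.

    Because the bundle ships through the same release pipeline as code,
    "restore" is simply redeploying a known-good release tag.
    """
    bundle = yaml.safe_load((release_dir / "prompts.yaml").read_text())
    required = {"system_prompt", "retrieval_config", "tool_definitions"}
    missing = required - bundle.keys()
    if missing:
        raise ValueError(f"prompt bundle incomplete, missing: {missing}")
    return bundle
```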
Audit Trails and Decision Logs
Per the audit trail discussion in Module 1.21, decision-level logs should be backed up to immutable storage with retention that satisfies regulatory requirements. Backup of audit trails is itself an audit-relevant control.
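One way to make audit-trail backups immutable is write-once storage. The sketch below uses S3 Object Lock as an example; the seven-year retention default is a placeholder, since the actual period comes from the applicable regulation:

```python
import datetime

import boto3

s3 = boto3.client("s3")

def archive_decision_log(bucket: str, key: str, payload: bytes,
                         retention_years: int = 7) -> None:
    """Write a decision-log batch to WORM storage using S3 Object Lock.

    The bucket must be created with Object Lock enabled; COMPLIANCE mode
    prevents deletion or overwrite until the retention date, even by admins.
    """
    retain_until = (datetime.datetime.now(datetime.timezone.utc)
                    + datetime.timedelta(days=365 * retention_years))
    s3.put_object(
        Bucket=bucket, Key=key, Body=payload,
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=retain_until,
    )
```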
Recovery Patterns
Three recovery patterns dominate AI workloads.
Active-Active Multi-Region
The workload runs continuously in two or more regions, with traffic split across them. Failure of one region results in traffic redistribution to the others with no service interruption. Active-active is the highest-availability pattern but also the most expensive — model serving, vector stores, and feature stores must all be replicated and kept consistent.
Active-Passive Hot Standby
The workload runs continuously in one region with a parallel deployment kept warm in a second region. Failover to the standby is fast (minutes) but not instantaneous. Most regulated AI workloads in financial services and healthcare adopt this pattern. The Federal Reserve Supervisory Letter SR 20-24, the Interagency Paper on Sound Practices to Strengthen Operational Resilience, at https://www.federalreserve.gov/supervisionreg/srletters/sr2024.htm articulates the supervisory expectations that drive this pattern.
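The failover decision itself is usually simple logic in front of a DNS or load-balancer change. A minimal sketch of that decision loop, with placeholder thresholds and a caller-supplied trigger_failover callable, follows:

```python
import time
import urllib.request

def primary_is_healthy(url: str, timeout_s: float = 2.0) -> bool:
    """Probe the primary region's health endpoint; any error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        return False

def monitor_and_failover(health_url: str, trigger_failover,
                         threshold: int = 3, interval_s: float = 10.0) -> None:
    """Promote the warm standby after N consecutive failed probes.

    Requiring consecutive failures avoids failing over on a transient blip;
    trigger_failover would flip DNS weights or a load-balancer target.
    """
    consecutive_failures = 0
    while True:
        if primary_is_healthy(health_url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                trigger_failover()
                return
        time.sleep(interval_s)
```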
Cold Restore From Backup
The workload is rebuilt in a target region from backup. RTO measured in hours to days. Acceptable for tier 2 and tier 3 workloads.
Vendor Failure Scenarios
Recovery planning must include scenarios where an external AI service is unavailable.
Single-vendor outage. The primary LLM provider experiences a regional or global outage. Plans should include either an alternate vendor with comparable capability and pre-tested integration, or a degraded-mode fallback (smaller model, rule-based response, human routing). The OpenAI status page at https://status.openai.com/ and equivalent vendor status mechanisms should be monitored, with automated failover triggers where feasible.
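A minimal sketch of that fallback chain follows. The provider callables and degraded-mode message are hypothetical placeholders for pre-tested vendor integrations:

```python
def call_with_fallback(prompt: str, providers: list, degraded_answer: str) -> str:
    """Try each configured provider in order; fall back to degraded mode.

    providers is a list of callables wrapping pre-tested vendor integrations
    (primary first, alternate vendor second). If all fail, return a safe
    degraded-mode response (or route to a human queue) instead of erroring.
    """
    for provider in providers:
        try:
            return provider(prompt)
        except Exception:
            continue  # provider outage, timeout, or auth failure: try the next
    return degraded_answer

# Example wiring with hypothetical client wrappers:
# answer = call_with_fallback(
#     prompt,
#     providers=[call_primary_llm, call_alternate_llm],
#     degraded_answer="Our assistant is temporarily unavailable; "
#                     "a support agent will follow up.",
# )
```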
Vendor deprecation. The vendor announces end-of-life for the model the workload depends on. Plans should include a migration window and a tested replacement path before the deprecation date.
Vendor commercial failure. The vendor exits the market or is acquired and its service is discontinued. Plans should include data and prompt portability — the ability to migrate to a different vendor with manageable rework.
The European Union Digital Operational Resilience Act (DORA) at https://eur-lex.europa.eu/eli/reg/2022/2554/oj imposes formal third-party risk management obligations on financial-sector AI deployments; those obligations are a useful template for other sectors.
Testing Discipline
Plans that have never been tested will fail when needed. The testing discipline distinguishes paper plans from real ones.
Quarterly tabletop exercises. The team walks through a scenario without actually executing the recovery. Tabletops surface assumptions and gaps cheaply.
Semi-annual partial restore tests. The team actually restores a model artefact, a feature store snapshot, or a vector index in a non-production environment. Restore time, integrity, and operational effects are measured.
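A partial restore test can be automated so every run produces comparable numbers. The sketch below assumes a caller-supplied restore_fn and a hash captured at backup time; both are placeholders:

```python
import hashlib
import time
from pathlib import Path

def timed_restore_test(restore_fn, restored_path: Path,
                       expected_sha256: str) -> dict:
    """Run a restore into a non-production location and measure it.

    restore_fn performs the actual restore (e.g. downloading a model artefact
    from backup storage). The test records elapsed time, for comparison against
    the tier's RTO, and verifies integrity against the hash from backup time.
    """
    start = time.monotonic()
    restore_fn()
    elapsed = time.monotonic() - start

    h = hashlib.sha256()
    with restored_path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)

    return {
        "restore_seconds": elapsed,
        "integrity_ok": h.hexdigest() == expected_sha256,
    }
```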
Annual full-failover tests. For tier 0 and tier 1 workloads, the team executes a real failover to the standby environment, runs the workload there, and fails back. Full failover tests are disruptive and require executive support, but they produce the only credible evidence the plan works.
The U.S. Federal Financial Institutions Examination Council Information Technology Examination Handbook on Business Continuity Management at https://ithandbook.ffiec.gov/it-booklets/business-continuity-management/ describes test patterns that translate well to AI workloads.
Looking Forward
The next article in Module 1.24 turns to capacity planning — the upstream discipline that ensures the resources needed for normal operation and recovery are available when called for. Recovery and capacity planning are two sides of the same operational-readiness coin.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.