This article provides governance professionals with the methodology, architecture, and practical guidance for implementing policy-to-code within the COMPEL framework.
The Enforcement Gap
Every organisation with an AI governance programme has policies. Many have comprehensive, well-written policies covering risk classification, data governance, model documentation, fairness assessment, human oversight, and incident reporting. The problem is not policy quality — it is policy enforcement.
A policy that states “all high-risk AI systems must have a completed fairness assessment before production deployment” is only effective if: every high-risk system is correctly classified, every system classified as high-risk is actually prevented from reaching production without a fairness assessment, and the fairness assessment meets quality standards.
In most organisations, this enforcement relies on manual checks: a governance reviewer examines a deployment request, verifies that the system has been classified, checks whether the fairness assessment exists, evaluates its quality, and approves or rejects the deployment. This process is slow, inconsistent, and does not scale.
Policy-to-code addresses the enforcement gap by encoding governance rules into the CI/CD pipeline, the governance platform, and the deployment infrastructure so that compliance is checked automatically, continuously, and consistently.
What Can Be Encoded — And What Cannot
Not all governance requirements can be translated into machine-enforceable rules. The policy-to-code landscape divides into three categories:
Fully Automatable Rules
These rules can be evaluated by a machine with no human judgment required:
- Existence checks. “Does this system have a model card?” — a binary check against the evidence repository.
- Completeness checks. “Are all required fields populated in the impact assessment?” — structural validation against a schema.
- Threshold checks. “Does the fairness metric disparity exceed the acceptable threshold?” — numerical comparison against a configured value.
- Temporal checks. “Was the bias audit completed within the last 12 months?” — date arithmetic against a deadline.
- Classification-triggered requirements. “If the system is classified as high-risk, is a conformity assessment present?” — conditional logic against registry metadata.
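Fully automatable rules of this kind reduce to pure functions over registry and evidence metadata. The sketch below illustrates three of the five check types; the field names (`risk_tier`, `fairness_disparity`, `bias_audit`) are illustrative assumptions, not COMPEL-defined identifiers:

```python
from datetime import date

def existence_check(evidence: dict, artefact: str) -> bool:
    """Binary check: does the required artefact exist in the evidence repository?"""
    return artefact in evidence

def threshold_check(metrics: dict, metric: str, limit: float) -> bool:
    """Numerical comparison against a configured threshold value."""
    return metrics.get(metric, float("inf")) <= limit

def temporal_check(dates: dict, artefact: str, max_age_days: int, today: date) -> bool:
    """Date arithmetic: was the artefact produced within the allowed window?"""
    produced = dates.get(artefact)
    return produced is not None and (today - produced).days <= max_age_days

# Hypothetical system record drawn from a governance registry
system = {
    "evidence": {"model_card": "...", "bias_audit": "..."},
    "metrics": {"fairness_disparity": 0.03},
    "dates": {"bias_audit": date(2025, 1, 10)},
}

print(existence_check(system["evidence"], "model_card"))               # True
print(threshold_check(system["metrics"], "fairness_disparity", 0.05))  # True
print(temporal_check(system["dates"], "bias_audit", 365, date(2025, 6, 1)))  # True
```

Because each check is a pure function of structured data, the same rule can run unchanged at design time, in the pipeline, and in production monitoring.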
Partially Automatable Rules
These rules can be partially checked by a machine, with human judgment required for final determination:
- Quality checks. “Is the model card adequate?” — a machine can verify completeness and structural quality, but evaluating the substantive adequacy of explanations requires human judgment.
- Proportionality assessments. “Is the risk classification proportionate?” — auto-classification rules can suggest a classification, but edge cases require human review.
- Stakeholder consultation adequacy. “Was meaningful community engagement conducted?” — a machine can verify that consultation records exist, but assessing whether the engagement was genuinely meaningful requires human evaluation.
Human-Only Rules
These rules cannot be meaningfully automated and must remain human-evaluated:
- Ethical judgment. “Is the residual risk acceptable?” — this requires weighing values, not computing metrics.
- Strategic alignment. “Does this AI system align with our ethical principles?” — this requires interpretive judgment.
- Contextual appropriateness. “Is this the right AI solution for this problem?” — this requires understanding the organisational and social context.
The governance professional’s task is to maximise the proportion of routine compliance checking that falls into the fully automatable category, freeing human capacity for the judgment-intensive work that genuinely requires it.
Architecture for Policy-to-Code
Layer 1: Policy Decomposition
Natural-language policies must be decomposed into atomic, testable assertions. A single policy statement may contain multiple enforceable rules:
Policy: “All high-risk AI systems must complete a fairness assessment, approved by the ethics review board, within 30 days of risk classification, and the assessment must be renewed annually.”
Decomposed rules:
- Rule 1: IF system.risk_tier = 'high' THEN evidence.fairness_assessment.exists = true
- Rule 2: IF system.risk_tier = 'high' THEN evidence.fairness_assessment.approved_by IN ethics_review_board.members
- Rule 3: IF system.risk_tier = 'high' THEN (evidence.fairness_assessment.date - system.classification_date) <= 30 days
- Rule 4: IF system.risk_tier = 'high' THEN (current_date - evidence.fairness_assessment.date) <= 365 days
Each decomposed rule is independently testable and can be assigned a severity (blocking vs. warning), a remediation owner, and an escalation path.
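The four decomposed rules above can be sketched directly as executable checks. This is an illustrative encoding only: the field names and the membership set for the ethics review board are assumptions, not a prescribed schema:

```python
from datetime import date

# Assumed membership set for the ethics review board (illustrative)
ETHICS_REVIEW_BOARD = {"a.khan", "m.ortega", "j.lin"}

def evaluate_fairness_rules(system: dict, today: date) -> dict:
    """Evaluate Rules 1-4 for one system; returns pass/fail per rule."""
    if system.get("risk_tier") != "high":
        return {}  # the rules are conditioned on high-risk classification
    fa = system.get("fairness_assessment")
    return {
        "rule_1_exists": fa is not None,
        "rule_2_approved": fa is not None
            and fa["approved_by"] in ETHICS_REVIEW_BOARD,
        "rule_3_timely": fa is not None
            and (fa["date"] - system["classification_date"]).days <= 30,
        "rule_4_current": fa is not None
            and (today - fa["date"]).days <= 365,
    }

system = {
    "risk_tier": "high",
    "classification_date": date(2025, 3, 1),
    "fairness_assessment": {"date": date(2025, 3, 20), "approved_by": "m.ortega"},
}
results = evaluate_fairness_rules(system, today=date(2025, 9, 1))
print(all(results.values()))  # True: all four rules pass for this system
```

Note that each rule evaluates independently, so a failure report can name the specific rule violated rather than a blanket "non-compliant" verdict.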
Layer 2: Rule Engine
The rule engine evaluates decomposed rules against the governance data model. It operates at four integration points:
Design time: When a new AI system is proposed, classification rules automatically suggest a risk tier and identify the governance requirements triggered by that classification.
Build time: CI/CD pipeline gates check that required governance artefacts exist and meet quality thresholds before the build proceeds to the next stage.
Deployment time: Deployment gates verify that all pre-deployment requirements are satisfied before the system reaches production.
Runtime: Continuous monitoring rules check that ongoing requirements (metric thresholds, assessment currency, monitoring configurations) remain satisfied in production.
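At the build-time and deployment-time integration points, the engine's output reduces to a gate decision: blocking failures stop the pipeline, warnings are reported but do not. A minimal sketch, assuming rule names and severities are configured elsewhere:

```python
def deployment_gate(rule_results: dict, severities: dict) -> tuple:
    """Aggregate rule results into (allowed, blocking_failures, warnings)."""
    blocking = [r for r, ok in rule_results.items()
                if not ok and severities.get(r) == "blocking"]
    warnings = [r for r, ok in rule_results.items()
                if not ok and severities.get(r) == "warning"]
    return (len(blocking) == 0, blocking, warnings)

# Hypothetical rule outcomes and severity configuration
results = {"fairness_assessment_exists": True, "model_card_exists": False}
severities = {"fairness_assessment_exists": "blocking",
              "model_card_exists": "warning"}

allowed, blocking, warnings = deployment_gate(results, severities)
print(allowed, blocking, warnings)  # True [] ['model_card_exists']
```

The same aggregation logic serves all four integration points; only the action taken on a failure differs (suggest, fail the build, block the deployment, or raise a monitoring alert).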
Layer 3: Exception Handling
No rule system covers every scenario. The policy-to-code architecture must include a structured exception process:
- Exception request: When a rule blocks a deployment or flags a violation that the system owner believes is inappropriate, they can request an exception.
- Exception review: A governance authority (not the system owner) evaluates the exception request against the policy’s intent, the specific circumstances, and the risk implications.
- Exception grant: If approved, the exception is recorded with rationale, scope, duration, and conditions. Exceptions are time-limited and subject to periodic review.
- Exception audit: All granted exceptions are visible in governance reports and subject to audit.
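A granted exception can be represented as a structured, time-limited record so that scope, duration, and auditability fall out of the data model. The field names below are illustrative assumptions, not a mandated schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RuleException:
    rule_id: str
    system_id: str
    rationale: str
    granted_by: str    # the governance authority, never the system owner
    granted_on: date
    expires_on: date   # exceptions are time-limited by design

    def is_active(self, today: date) -> bool:
        """An exception only suppresses a rule within its validity window."""
        return self.granted_on <= today <= self.expires_on

# Hypothetical granted exception
exc = RuleException(
    rule_id="rule_3_timely",
    system_id="credit-scoring-v2",
    rationale="Assessment delayed by vendor data access; interim review completed.",
    granted_by="governance-board",
    granted_on=date(2025, 4, 1),
    expires_on=date(2025, 7, 1),
)
print(exc.is_active(date(2025, 5, 15)))  # True
print(exc.is_active(date(2025, 8, 1)))   # False: expired, rule re-applies
```

Because the record is immutable and expiry is computed rather than asserted, lapsed exceptions surface automatically in periodic review rather than lingering silently.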
Layer 4: Feedback and Refinement
Policy-to-code is not a one-time translation exercise. The rules must be continuously refined based on:
- False positives: Rules that block legitimate deployments indicate that the rule is too restrictive or the policy is ambiguous.
- False negatives: Governance failures that the rules did not catch indicate missing rules or inadequate data.
- Policy changes: When policies are updated, the corresponding rules must be updated in lockstep.
- Regulatory changes: New regulatory requirements must be translated into new rules and integrated into the rule engine.
Implementation Guidance
Start with High-Value, Low-Ambiguity Rules
Begin with rules that are: clearly defined in existing policy (no interpretation required), high-value (catching violations prevents significant risk), and low-ambiguity (the rule can be evaluated from available structured data).
Existence checks and temporal checks are the best starting points. Threshold checks follow once fairness metrics are systematically captured. Quality checks come last because they require more sophisticated evaluation logic.
Maintain the Policy-Rule Traceability Matrix
Every machine-enforceable rule must be traceable to the policy statement it implements. This matrix serves as the authoritative record of how policies are operationalised and is essential for: audit (demonstrating that policies are enforced), policy revision (understanding which rules are affected when a policy changes), and rule justification (explaining to system owners why a rule exists).
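In its simplest form the traceability matrix is a mapping from rule identifier to policy identifier, queryable in both directions. The identifiers below are illustrative assumptions:

```python
# Hypothetical policy-rule traceability matrix: rule ID -> policy ID
TRACEABILITY = {
    "rule_1_exists": "POL-FAIR-01",
    "rule_2_approved": "POL-FAIR-01",
    "rule_3_timely": "POL-FAIR-01",
    "rule_4_current": "POL-FAIR-01",
    "model_card_exists": "POL-DOC-03",
}

def rules_for_policy(policy_id: str) -> list:
    """Which rules must be revisited when this policy changes?"""
    return sorted(r for r, p in TRACEABILITY.items() if p == policy_id)

def policy_for_rule(rule_id: str) -> str:
    """Which policy justifies this rule's existence to a system owner?"""
    return TRACEABILITY[rule_id]

print(rules_for_policy("POL-FAIR-01"))
print(policy_for_rule("model_card_exists"))  # POL-DOC-03
```

A real implementation would likely live in the governance platform rather than code, but the invariant is the same: no rule without a policy, and no policy change without a rule impact list.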
Avoid Rule Proliferation
The temptation is to encode every possible governance check as a rule. Resist this. A rule engine with 500 rules that no one understands is worse than 50 well-designed rules that the governance team can explain, maintain, and defend. Quality over quantity.
Test Rules Before Enforcement
Before a new rule blocks deployments, run it in observation mode — flag violations but do not block. Analyse the flagged violations to verify that the rule is working as intended. Only switch to enforcement mode after a confidence period (typically 2–4 weeks of observation).
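The observation-versus-enforcement distinction can be made explicit in the rule engine so that promoting a rule is a configuration change, not a code change. A minimal sketch, with the mode names as assumptions:

```python
def apply_rule(passed: bool, mode: str) -> tuple:
    """Return (deployment_allowed, log_messages) for one rule evaluation.

    In "observe" mode a violation is flagged for analysis but never blocks;
    only in "enforce" mode does a failure stop the deployment.
    """
    if passed:
        return True, []
    if mode == "observe":
        return True, ["violation flagged (observation mode, not blocking)"]
    return False, ["violation blocks deployment (enforcement mode)"]

print(apply_rule(False, "observe"))  # allowed, with a flagged violation
print(apply_rule(False, "enforce"))  # blocked
```

During the confidence period, the flagged-violation log from observation mode is exactly the dataset needed to measure the rule's false-positive rate before enforcement begins.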
The Limits of Policy-to-Code
Policy-to-code is a powerful tool for scaling governance enforcement, but it has fundamental limits:
It enforces the letter, not the spirit. A rule that checks for the existence of a fairness assessment cannot evaluate whether the assessment was conducted in good faith. Compliance with the rule is necessary but not sufficient for compliance with the policy’s intent.
It can create a compliance mindset. Teams may focus on passing the automated checks rather than genuinely engaging with the governance process. The governance programme must maintain a culture where automated checks are a floor, not a ceiling.
It requires data quality. Rules can only evaluate data that exists in the governance platform. If system metadata is incomplete or inaccurate, the rules will produce incorrect results. Data quality is a prerequisite for policy-to-code effectiveness.
Policy-to-code is most effective when combined with human governance oversight — automated rules handle the routine checks, freeing human governance professionals to focus on judgment, stakeholder engagement, and the strategic dimensions of AI governance that no rule engine can address.
This article is part of the COMPEL Body of Knowledge v2.5 and supports the AI Transformation Governance Professional (AITGP) certification.