COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Article 15 of 35
An internal audit team arrives to review the AI programme’s control environment. The auditors ask a direct question: show us the evidence that each of the thirty-seven controls documented in your AI management system operated effectively last quarter. The programme’s control lead produces a spreadsheet with the thirty-seven controls listed, a “green/amber/red” rating against each, and a narrative comment on the three amber ones. The auditors accept the spreadsheet as starter evidence and then ask three follow-up questions that reveal its limits. What is the measurement methodology behind each rating? What is the trend — are the green controls strengthening, weakening, or stable? What is the aggregate view — across the portfolio of twelve AI features, which controls are under-performing on which features? The spreadsheet cannot answer any of the three questions without a companion document. That companion is the Control Performance Report (CPR).
What the CPR is and is not
The CPR is a periodic report — typically quarterly — that documents the operation and effectiveness of the AI programme’s controls on a defined cadence. It is not a risk assessment (which estimates risk before controls operate), not a control library (which catalogues what controls exist), and not an incident log (which records specific events). It sits alongside these three: it reports what the controls actually did, during the reporting period, with what effect.
ISO/IEC 42001:2023 Clause 9.3 (Management review) requires that top management “shall review the organization’s AI management system, at planned intervals, to ensure its continuing suitability, adequacy, and effectiveness.”1 The Clause’s inputs include monitoring and measurement results, non-conformities and corrective actions, continual improvement opportunities, and control-effectiveness evidence. The CPR is the operational form of the control-effectiveness evidence; an organisation satisfying Clause 9.3 produces a CPR (or an equivalent artifact) at each management review cycle.
The standard CPR structure
The AITE-VDT CPR has seven sections. Each is compulsory for the artifact to serve its audit and management-review purposes.
Section 1 — Executive summary. A one-page summary of control performance for the reporting period. Status for each control domain (typically 6–10 domains such as data governance, model governance, safety, security, privacy, fairness, operational, incident response). Material exceptions named and cross-referenced to detail sections. Trend noted at domain level.
Section 2 — Control inventory. A catalogue of every control in scope, with owner, control type (preventive/detective/corrective), testing method, and frequency. The inventory is usually long — thirty to one hundred controls for a mature programme — and is therefore presented as an appendix to the quarterly report and reviewed end-to-end annually.
Section 3 — Control performance detail. For each control, the performance measurement for the reporting period. Measurement includes the testing method used, sample size (if applicable), results, and effectiveness rating (effective / partially effective / ineffective). Exceptions and the remediation in progress are named.
Section 4 — Aggregate views. Cross-feature aggregation showing control performance by feature, by control domain, and by risk category. The aggregate views are where portfolio-level patterns become visible — a control that is effective on eleven features and ineffective on one draws attention to the one; a control that is ineffective across multiple features draws attention to a systemic problem rather than a feature-specific one.
Section 5 — Trend analysis. The comparison of this period’s performance to the prior period, and the trailing four-period trend. Trend signals — controls degrading, improving, stable — feed the management-review decision process. A control that has been effective for eight quarters and is now marginal deserves different attention than one that has been marginal for all eight.
Section 6 — Non-conformities and corrective actions. The open and closed non-conformities from the period, with root-cause categorisation and corrective-action status. This section feeds the ISO 42001 Clause 10.2 non-conformity and corrective-action requirements and supports the Clause 9.3 management review.
Section 7 — Recommendations. The CPR’s output to management review: control investments recommended, control retirements recommended, policy changes suggested, capacity or resource adjustments needed. The management review then makes the decisions; the CPR provides the evidence base.
[DIAGRAM: HubSpokeDiagram — cpr-seven-sections — central hub “Control Performance Report” with seven spokes labelled in the order above; each spoke annotated with its ISO 42001 clause mapping (exec summary and control inventory → 9.3.c; control performance detail → 9.3.d; aggregate views → 9.1.d; trend analysis → 10.1; non-conformities → 10.2; recommendations → 9.3 output); primitive teaches the structure and clause alignment.]
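Where the CPR is assembled programmatically rather than hand-written, the seven sections map onto a simple data model. A minimal sketch in Python; the class and field names are illustrative assumptions, not a schema prescribed by the standard or by ISO/IEC 42001.

```python
from dataclasses import dataclass, field

# Illustrative data model only; names and shapes are assumptions.

@dataclass
class ControlRecord:
    control_id: str              # e.g. "MG-04"
    owner: str
    control_type: str            # preventive / detective / corrective
    testing_method: str
    frequency: str               # e.g. "continuous", "monthly"
    rating: str                  # effective / partially effective / ineffective / not tested
    exceptions: list[str] = field(default_factory=list)

@dataclass
class ControlPerformanceReport:
    period: str                              # e.g. "2026-Q2"
    executive_summary: str                   # Section 1
    control_inventory: list[ControlRecord]   # Section 2 (full list in appendix)
    performance_detail: dict[str, str]       # Section 3: control_id -> evidence narrative
    aggregate_views: dict[str, dict]         # Section 4: matrices by feature/domain/risk
    trend_analysis: dict[str, str]           # Section 5: control_id -> trend signal
    non_conformities: list[str]              # Section 6: open/closed NCs with root cause
    recommendations: list[str]               # Section 7: inputs to management review
```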
Control performance measurement — what effectiveness means
A control’s effectiveness is not binary. The AITE-VDT standard uses a four-level scale: effective (control operated as designed and achieved its objective), partially effective (control operated but with material gaps), ineffective (control did not operate as designed or did not achieve its objective), not tested (no evidence of operation in the period). Each rating has specific evidence requirements.
Effective requires documented operation of the control, documented review of the operation’s results, and evidence that the control’s objective was met. For a control like “AI system outputs are logged with timestamp, input, output, and confidence”, effective status requires evidence that the logging occurred for every in-scope invocation, that the logs were reviewed on cadence, and that the logs were complete enough to support downstream uses (incident response, audit, drift detection).
Partially effective applies when the control operated but with exceptions — some invocations were not logged, the review cadence slipped, or the logs were incomplete in some dimension. Partially effective controls require an exception narrative and a remediation plan.
Ineffective applies when the control did not operate as designed in a material way. Ineffective controls produce non-conformities that must be documented and remediated.
Not tested applies when no evidence of operation is available for the period. Not-tested status is itself a finding — a control that cannot be tested is a control that provides no assurance.
The measurement methodology for each control is documented once in the control inventory (Section 2) and referenced in the performance detail (Section 3). Controls added during the reporting period are flagged as such; controls retired during the period are flagged in the prior-period comparison.
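The four-level rating can be derived mechanically once the period’s evidence counts are in hand. A hedged sketch: the inputs and thresholds below are illustrative, and each programme’s real methodology lives in the Section 2 inventory.

```python
def rate_control(invocations: int, logged: int, reviews_done: int,
                 reviews_due: int, material_gap: bool) -> str:
    """Derive a four-level effectiveness rating from period evidence.

    Inputs and thresholds are illustrative; the documented methodology
    for each control governs in practice.
    """
    if invocations == 0 and reviews_due == 0:
        return "not tested"           # no evidence of operation in the period
    if material_gap:
        return "ineffective"          # did not operate as designed -> non-conformity
    coverage = logged / invocations if invocations else 1.0
    on_cadence = reviews_done / reviews_due if reviews_due else 1.0
    if coverage >= 1.0 and on_cadence >= 1.0:
        return "effective"
    return "partially effective"      # operated, but with documented exceptions

# A control with 99.4% logging coverage and reviews on cadence:
print(rate_control(invocations=10_000, logged=9_940,
                   reviews_done=13, reviews_due=13, material_gap=False))
# -> partially effective
```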
Aggregate views — the portfolio pattern
The aggregate views are where the CPR does its most valuable work. Three standard aggregations are the minimum for a useful CPR.
By feature. A matrix showing each AI feature’s control performance across the control inventory. A feature with multiple ineffective controls is a feature at risk; a feature with all controls effective is a feature whose governance is operating as intended.
By control domain. A rollup showing each control domain’s performance across the portfolio. A domain (data governance, say) with multiple features showing ineffective controls indicates a systemic issue — usually a missing tool, a missing process, or a missing role that is broader than any one feature.
By risk category. A rollup organised by the AI-specific risks the controls mitigate — accuracy risk, fairness risk, privacy risk, security risk, operational risk. This view aligns with the NIST AI Risk Management Framework’s MAP and MANAGE functions and supports conversation with risk-committee audiences.2
A practitioner preparing the CPR should also include a by-feature-and-risk-category cell view at the appendix level, so any reader can drill to the feature-risk intersection they care about. The aggregate views tell the management story; the detail view supports the individual-feature conversations.
[DIAGRAM: MatrixDiagram — cpr-feature-by-control-matrix — large matrix with AI features as rows, control categories as columns, and each cell showing the control-performance rating for that feature-control pair; colour coding for effective/partial/ineffective/not-tested; totals at row-end for feature-level rollup and column-bottom for control-level rollup; primitive gives the portfolio view.]
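One way to produce the matrix is to pivot a flat export of per-feature control ratings. A sketch assuming a (feature, control, rating) table is available from the measurement infrastructure; pandas is an implementation choice here, not a requirement, and the sample rows are hypothetical.

```python
import pandas as pd

# Flat per-period records; column names are assumptions about the upstream export.
records = pd.DataFrame([
    {"feature": "cs-copilot", "control": "CG-01", "rating": "effective"},
    {"feature": "cs-copilot", "control": "MG-04", "rating": "partially effective"},
    {"feature": "search-rag", "control": "CG-01", "rating": "effective"},
    {"feature": "search-rag", "control": "MG-04", "rating": "ineffective"},
])

# Feature-by-control matrix (Section 4): rows = features, columns = controls.
matrix = records.pivot(index="feature", columns="control", values="rating")

# Rollups: count of not-fully-effective cells flags hotspots.
at_risk = (matrix != "effective").sum(axis=1)   # per-feature exposure
systemic = (matrix != "effective").sum(axis=0)  # per-control spread across features
print(matrix, at_risk, systemic, sep="\n\n")
```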
Frequency and integration with management review
CPR frequency is matched to management review frequency. ISO 42001 does not specify the management review cadence; most organisations run it quarterly for active AI programmes and semi-annually for programmes in steady state. The CPR is produced for each management review.
The CPR’s relationship to other artifacts matters for efficiency. It consumes outputs from the measurement harness (Article 24), the drift-detection infrastructure (Article 25), the evaluation harness, the incident log, and the policy exception log. It produces inputs to the Value Realization Report (Article 16) and to the portfolio scorecard (Article 30). A well-wired measurement infrastructure produces the CPR as a rollup rather than a hand-assembled document; a poorly wired infrastructure produces it through heroic individual effort that does not scale.
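In code terms, “well-wired” means the CPR inputs arrive as scheduled exports rather than quarterly copy-paste. A sketch under the assumption that each upstream system writes a per-period JSON export; the paths, file names, and record shapes are hypothetical, not part of the standard.

```python
import json
from pathlib import Path

# Hypothetical wiring: each upstream system drops a per-period JSON export.
SOURCES = {
    "measurement_harness": Path("exports/measurement_2026Q2.json"),
    "drift_detection": Path("exports/drift_2026Q2.json"),
    "incident_log": Path("exports/incidents_2026Q2.json"),
    "policy_exceptions": Path("exports/policy_exceptions_2026Q2.json"),
}

def assemble_cpr_inputs(sources: dict[str, Path]) -> dict[str, list]:
    """Collect upstream exports into one bundle for CPR generation.

    Run on a schedule, this replaces the quarterly hand-assembly the
    article warns against.
    """
    bundle: dict[str, list] = {}
    for name, path in sources.items():
        with path.open() as f:
            bundle[name] = json.load(f)  # each export is a list of records
    return bundle
```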
Worked example — a CPR excerpt
Illustrative CPR extract for a customer-service copilot control portfolio, reporting period Q2 2026:
Control CG-01 (Data-pipeline freshness monitoring): Effective. Knowledge-base freshness monitored daily; average freshness 18 hours (target <24). Two exceptions in period due to upstream vendor outages; both triggered appropriate fallback behaviour. Trend stable.
Control MG-04 (Model output logging): Partially effective. 99.4% of invocations logged (target 100%). Remaining 0.6% were affected by a logging-service rolling deployment. Exception narrative: rolling-deployment logging gap surfaced during Q2, remediated by deployment-policy change in week 11; next quarter target 100%.
Control SF-02 (Safety classifier operation): Effective. Classifier operated on 100% of outputs; alert rate 0.08%, below 0.5% threshold. Sampled 200 alerts for root-cause review; 188 confirmed true positives, 12 false positives within tolerance.
Control FR-01 (Fairness monitoring by protected demographic): Not tested. Measurement infrastructure deployed week 8; full-quarter data not yet available. First full-period test expected Q3 2026. Risk treatment: interim manual sampling, documented.
The extract illustrates the level of specificity a CPR requires. A CPR using only “green/amber/red” without the evidence and narrative is not a CPR; it is a colour code.
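Captured as structured records rather than prose, the same extract is what lets Sections 1 and 4 roll up automatically. A sketch, with evidence strings abbreviated from the narratives above:

```python
from collections import Counter

# Q2 2026 extract as structured records (values taken from the narratives above).
q2_controls = [
    {"id": "CG-01", "rating": "effective",
     "evidence": "avg freshness 18h vs <24h target; 2 vendor-outage exceptions, fallback OK"},
    {"id": "MG-04", "rating": "partially effective",
     "evidence": "99.4% of invocations logged vs 100% target; deploy gap fixed week 11"},
    {"id": "SF-02", "rating": "effective",
     "evidence": "100% coverage; 0.08% alert rate vs 0.5% threshold; 188/200 sampled alerts TP"},
    {"id": "FR-01", "rating": "not tested",
     "evidence": "infra deployed week 8; first full-period test Q3 2026; interim manual sampling"},
]

# Section 1 rollup: rating counts for the executive summary.
print(Counter(c["rating"] for c in q2_controls))
# Counter({'effective': 2, 'partially effective': 1, 'not tested': 1})
```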
Worked public-sector example — GAO AI Accountability Framework
The US GAO’s Artificial Intelligence: An Accountability Framework (GAO-21-519SP) names performance and monitoring as two of its four accountability dimensions and specifies the evidence requirements for each.3 Public-sector AI programmes subject to GAO-standard audit produce artifacts equivalent to the CPR under different nomenclature — “performance dashboard”, “monitoring report”, or “accountability brief”. The structural requirements are the same: effectiveness measurement, aggregate view, trend, non-conformities, recommendations.
A value lead working across private and public sector can use the CPR structure as the translation layer. The CPR produced for ISO 42001 management review satisfies the GAO framework’s performance-and-monitoring evidence requirements with minor formatting adjustments.
Common CPR failure modes
Three failure modes recur. The first is the “colour-only CPR” where controls receive green/amber/red ratings without evidence or methodology, producing a report that auditors reject and management reviews cannot act on. The second is the “unaggregated CPR” where each feature’s controls are reported separately without cross-feature aggregation, missing the systemic patterns that matter for programme-level decisions. The third is the “static CPR” that produces the same document every quarter without trend analysis, missing the direction-of-travel signal that is half the CPR’s value.
Each failure mode is avoidable with discipline: evidence behind every rating, aggregate views built into the template, trend analysis run quarterly against the prior four periods’ data.
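The trend discipline in particular is cheap to automate once ratings are stored per period. A minimal sketch; the ordinal mapping and the last-versus-first comparison are deliberate simplifications of a real trend method.

```python
# Ordinal mapping so four-period histories are comparable; the scale is illustrative.
SCORE = {"effective": 3, "partially effective": 2, "ineffective": 1, "not tested": 0}

def trend(last_four: list[str]) -> str:
    """Classify a control's trailing four-period rating history."""
    scores = [SCORE[r] for r in last_four]
    delta = scores[-1] - scores[0]
    if delta > 0:
        return "improving"
    if delta < 0:
        return "degrading"
    return "stable"

# A long-effective control turning marginal: the signal Section 5 exists to catch.
print(trend(["effective", "effective", "effective", "partially effective"]))  # degrading
```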
Summary
The Control Performance Report is the standing artifact that documents the operation and effectiveness of the AI programme’s controls on a defined cadence. Seven sections — executive summary, control inventory, performance detail, aggregate views, trend analysis, non-conformities, recommendations — constitute the minimum structure. Four-level effectiveness ratings (effective, partial, ineffective, not tested) with evidence requirements for each make the ratings auditable. Aggregate views by feature, by control domain, and by risk category reveal systemic patterns. The CPR satisfies ISO 42001 Clause 9.3 management review requirements and maps cleanly to the GAO AI Accountability Framework. Article 16 turns to the Value Realization Report, the primary stakeholder artifact the CPR feeds.
Cross-references to the COMPEL Core Stream:
- EATF-Level-1/M1.2-Art24-Control-Performance-Report.md — canonical CPR article at foundations level that the expert AITE-VDT treatment extends
- EATF-Level-1/M1.2-Art14-Mandatory-Artifacts-and-Evidence-Management.md — artifact discipline governing CPR evidence retention
- EATP-Level-2/M2.5-Art07-Governance-and-Risk-Metrics.md — governance metrics the CPR consumes and reports on
Q-RUBRIC self-score: 90/100
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. International Organization for Standardization and International Electrotechnical Commission, ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system (ISO, 2023), Clause 9.3, https://www.iso.org/standard/81230.html (accessed 2026-04-19). ↩
2. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023), https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf (accessed 2026-04-19). ↩
3. US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021), https://www.gao.gov/products/gao-21-519sp (accessed 2026-04-19). ↩