AITE M1.4-Art15 v1.0 Reviewed 2026-04-06 Open Access
M1.4 AI Technology Foundations for Transformation
AITF · Foundations

Measuring Literacy Outcomes Beyond Completion


13 min read Article 15 of 48

COMPEL Specialization — AITE-WCT: AI Workforce Transformation Expert Article 15 of 35


A board finance committee reviewing the annual AI report asks a simple question: is the literacy programme working? The chief learning officer answers with the completion dashboard — 88% of the workforce has completed the mandatory training. A board member who is an external audit partner probes further. What would we have to see, in the programme or in the work, to conclude that the 88% have become AI-literate in the way the programme intended? The chief learning officer recognises, mid-answer, that the question is harder than the dashboard can answer. Completion measures attendance. Literacy is a property of behaviour in work. Translating between the two requires measurement architecture the chief learning officer has not built. This article takes up that gap. It teaches the expert practitioner to design assessment that measures applied judgment, to use behavioural indicators alongside knowledge checks, and to report literacy outcomes to the board in a way that is defensible without being defensive.

What completion measures and what it does not

Completion rate tells us that a learner opened the content, progressed through it to the end, and satisfied whatever criterion the platform treats as completion — frequently a final quiz with a minimum passing score. It is an attendance metric. It is not a literacy metric, because a learner can complete content without materially changing the behaviour the programme is designed to change.

Three mechanisms produce completion without literacy. The first is content compliance — a learner clicks through because they are required to, answers multiple-choice questions by elimination rather than by understanding, and receives the completion marker. The second is transient acquisition — the learner genuinely understands the content during the session but does not transfer that understanding into applied behaviour because no practice bridge was provided (Article 13). The third is context mismatch — the learner understands the content but cannot apply it in their actual workflow because the curriculum did not cover their specific AI touchpoints.

Completion-only measurement therefore produces programmes that look successful on paper and fail in the work. The NIST AI Risk Management Framework MEASURE function treats workforce competence as an assessable property that goes beyond completion;1 ISO/IEC 42001 Clauses 7.2 and 7.3 require competence and awareness evidence that goes beyond attendance;2 EU AI Act Article 4 requires “sufficient” literacy for operating AI systems, a substantive rather than attendance-based standard.3 Boards and auditors are increasingly sophisticated readers of the gap between completion and literacy.

Four measurement surfaces

Literacy measurement operates on four surfaces. A programme that measures on only one surface produces a biased picture; measurement across all four produces a picture that is defensible to an informed reader.

Knowledge measurement. Assessment of what the learner knows — multiple-choice, constructed-response, scenario-response items. Knowledge measurement has a necessary but limited role. It is useful for initial screening, for regulatory-compliance evidence, and for programme diagnostics (which topics did the cohort misunderstand). It is insufficient as the sole literacy indicator because knowledge-recall and applied behaviour are not the same thing.

Applied-judgment measurement. Assessment of whether the learner can reason correctly about AI outputs in their work. Scenario simulations, case-study analysis, structured think-alouds, and artefact production scored against rubrics are the assessment forms. Applied-judgment measurement is the most important assessment surface for AI-user and AI-worker levels, and the hardest to scale.

Behavioural indicator measurement. Population-level signals that literacy has produced changed behaviour in work. Incident rates involving the populations trained, error rates on AI-assisted work products, adoption rates for organisationally-preferred tools over shadow alternatives, policy-compliance rates on AI usage, reporting rates for AI concerns — each is a behavioural indicator. Behavioural indicators are leading indicators with context; they must be interpreted against baselines and controls.

Sentiment and self-report measurement. Learner-reported confidence, manager-reported capability observation, peer-reported collaboration quality. Sentiment measurement through platforms including Qualtrics, CultureAmp, Peakon, and Glint provides population-level trends. Self-report has known biases — learners over-report confidence at some levels and under-report at others — which is why it is part of the portfolio rather than the sole surface.

[DIAGRAM: HubSpokeDiagram — literacy-measurement-four-surfaces — central hub “Literacy” with four spokes: knowledge, applied judgment, behavioural indicators, sentiment. Each spoke annotated with primary assessment forms, strengths, limitations, and pairing-with-other-surfaces cautions. Primitive teaches the four-surface model as a measurement portfolio.]
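As a concrete anchor for the four-surface portfolio, the sketch below shows one way a per-cohort measurement snapshot could be structured. It is illustrative only: the field names, score scales, and the has_depth rule are assumptions made for this sketch, not part of the COMPEL specification.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CohortLiteracySnapshot:
    """One reporting-period view of the four measurement surfaces for a single cohort."""
    cohort_id: str
    period: str                                    # e.g. "2026-Q1"
    completion_rate: float                         # reach: share of the cohort with the completion marker, 0..1
    knowledge_pass_rate: Optional[float] = None    # knowledge surface: share passing knowledge checks
    applied_judgment_mean: Optional[float] = None  # applied-judgment surface: mean rubric score, e.g. 0..4
    behavioural_indicators: dict[str, float] = field(default_factory=dict)  # e.g. {"policy_compliance": 0.93}
    sentiment_confidence_mean: Optional[float] = None  # sentiment surface: self-reported confidence, e.g. 1..5

    def has_depth(self) -> bool:
        """True when at least one depth surface (beyond reach and knowledge) carries data."""
        return self.applied_judgment_mean is not None or bool(self.behavioural_indicators)
```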

Designing applied-judgment assessment

Applied-judgment assessment is the surface most often under-developed. Four design choices produce assessment that actually measures applied judgment rather than dressed-up knowledge assessment.

Scenario authenticity. The scenario reflects the learner’s actual work — the AI systems in use, the kinds of decisions the role makes, the organisational context. Generic scenarios produce generic responses; authentic scenarios produce responses that reveal applied capability.

Rubric calibration. A calibrated rubric provides consistent scoring across learners, assessors, and time. Rubric calibration is a practitioner skill; untrained assessors applying an uncalibrated rubric produce noise. Independent scoring of a sample by multiple assessors with inter-rater reliability analysis surfaces calibration gaps.
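For the inter-rater reliability analysis, a minimal sketch of Cohen's kappa computed on a jointly scored calibration sample is shown below. The rubric scale, the sample scores, and the rough 0.6 threshold are illustrative assumptions, not prescribed values.

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two assessors scoring the same calibration sample.

    Kappa corrects raw agreement for the agreement expected by chance; values
    below roughly 0.6 usually indicate a calibration gap worth investigating.
    """
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    if expected == 1.0:  # both assessors used a single, identical rubric level throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Two assessors score the same ten calibration responses on a 0-4 rubric
print(cohens_kappa([3, 2, 4, 1, 3, 0, 2, 3, 4, 1],
                   [3, 2, 3, 1, 3, 0, 2, 4, 4, 1]))  # ~0.74: workable, but review the two disagreements
```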

Open-ended form. Multiple-choice applied-judgment items drift towards testing knowledge recall unless carefully designed. Constructed-response and artefact-production forms resist this drift. The cost of open-ended forms is higher scoring burden; the accuracy gain is worth the cost for AI-worker and AI-specialist assessment.

Calibration samples. An assessment with a small calibration sample — a set of example responses with rubric scores — provides the assessors and the audit function with a common reference. Calibration samples are particularly important for regulatory-grade assessment evidence.

Stanford HAI and MIT CSAIL AI-education research traditions provide useful reference patterns for applied-judgment assessment design in AI contexts.4

Using behavioural indicators without misreading them

Behavioural indicators are population-level signals; they require careful reading.

Indicator selection. The most useful indicators are tied closely to the literacy programme’s applied-behaviour goals. An indicator that tracks “AI tool usage” tells you about adoption; an indicator that tracks “correct handling of AI output failures” tells you about applied literacy. The selection of indicators should precede the measurement, not the other way around.

Baseline establishment. Behavioural indicators need baselines. A current value without a baseline is uninterpretable. Baselines are established before the literacy programme begins or — where the programme is already under way — at the earliest feasible point, with explicit acknowledgement of the timing limitation.

Counterfactual construction. Behavioural change attributed to the literacy programme must be separable from change driven by other factors. Comparison populations (units that received the programme later or not at all), pre-post designs, and structured control for confounders all contribute. The Dutch Toeslagenaffaire is a case where behavioural failures were visible in indicators but misread for years; the absence of counterfactual analysis contributed to the misreading.5
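A minimal sketch of the comparison-population logic, expressed as a simple difference-in-differences estimate, follows. The indicator, the unit, and the figures are hypothetical; a real analysis would also check that the two populations were trending in parallel before the programme began.

```python
def diff_in_diff(treated_pre: float, treated_post: float,
                 comparison_pre: float, comparison_post: float) -> float:
    """Difference-in-differences estimate of the programme's contribution to an indicator.

    Subtracts the change seen in a comparison population (units trained later or not
    at all) from the change in the trained population, so that shared drivers such as
    new tooling or management changes are netted out.
    """
    return (treated_post - treated_pre) - (comparison_post - comparison_pre)

# Hypothetical example: AI-incident rate per 1,000 AI-assisted decisions
effect = diff_in_diff(treated_pre=4.1, treated_post=2.3,
                      comparison_pre=4.0, comparison_post=3.6)
print(effect)  # -1.4: incidents fell 1.4 points more in trained units than in comparison units
```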

Gaming resistance. Indicators become performance targets, and targets get gamed. Resistance to gaming is a design consideration — indicators that are easy to produce artificially are weaker than indicators tied to real work. The expert practitioner deliberately triangulates across multiple indicators that cannot all be gamed in the same direction.

[DIAGRAM: Matrix — behavioural-indicator-design-matrix — rows: candidate indicators (incident rate, error rate, adoption rate, policy-compliance rate, reporting rate, applied-artefact quality). Columns: what the indicator tells you, common mis-readings, triangulation partner, gaming resistance. Primitive teaches indicator design as a structured selection process.]

Reporting to the board

Literacy reporting to the board should be defensible without being defensive. Five elements structure a defensible report.

The first element is metric parity. Reach (completion) and depth (applied judgment, behavioural indicators) are reported together on every page. Metric parity prevents selective reporting.
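A minimal illustration of metric parity in a single report line follows. The cohort names, the score scale, and the formatting are assumptions for the sketch, not a prescribed report format.

```python
from typing import Optional

def report_row(cohort: str, completion_rate: float,
               applied_judgment_mean: Optional[float],
               indicator_delta: Optional[float]) -> str:
    """Render one report line with reach and depth side by side (metric parity).

    Depth metrics that have not been measured are printed as 'not measured' rather
    than silently omitted, so the reader sees the gap.
    """
    def fmt(value: Optional[float], suffix: str = "") -> str:
        return f"{value}{suffix}" if value is not None else "not measured"

    return (f"{cohort}: completion {completion_rate:.0%} | "
            f"applied judgment {fmt(applied_judgment_mean, '/4')} | "
            f"behavioural indicator change {fmt(indicator_delta)}")

print(report_row("Finance", 0.88, 2.7, -1.4))      # reach and depth reported together
print(report_row("Operations", 0.91, None, None))  # the depth gap is made visible, not hidden
```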

The second element is honest interpretation. Ambiguous results are reported as ambiguous. Where behavioural indicators have moved in the wrong direction, the report says so and names the hypotheses under investigation. Boards respond better to honest ambiguity than to confident certainty that is subsequently revised.

The third element is methodology disclosure. The assessment methodology is summarised in a single-page appendix — what was measured, how, with what sample size, with what known limits. Methodology disclosure is the evidence base that supports the headline numbers.

The fourth element is trend against benchmark. Trends across reporting periods and benchmarks against external references (WEF Future of Jobs Report 2025 aggregate data,6 BCG AI at Work 2025 cross-industry benchmarks,7 OECD PIAAC skills data8) support interpretation.

The fifth element is decision framing. The report names the decisions the board can make based on the evidence — continue as planned, increase investment in a specific area, recommend revision to a specific workstream. Reports without decision framing produce acknowledgement but not action.

Regulatory evidence calibration

Where literacy measurement must double as regulatory compliance evidence, additional rigour applies.

The EU AI Act Article 4 literacy duty is interpreted as requiring evidence of sufficient literacy for operating AI systems; “sufficient” is context-specific and calibrated to role, risk, and organisation.3 ISO/IEC 42001 Clauses 7.2 and 7.3 require documented evidence of competence and awareness for AI management system participants.2 NIST AI Risk Management Framework MEASURE functions call for workforce competence measurement as part of ongoing AI system governance.1 Sectoral regulators add further specific requirements.

Article 16 next covers compliance-grade evidence architecture specifically; the measurement design in this article feeds that evidence architecture.

Attribution — linking literacy to outcomes that matter

A recurring challenge in literacy measurement is attribution. When operational outcomes improve — fewer AI-system incidents, better-calibrated AI tool use, higher-quality AI-assisted work — how much can be attributed to the literacy programme, and how much to other factors (improved tooling, management changes, organisational-context shifts)?

Three attribution patterns support defensible claims without over-claiming.

The first is staged-rollout comparison. Business units receiving literacy at different times produce a natural experiment — earlier-receiving units should show the target outcomes earlier. The comparison is not randomised, but it is more defensible than simple pre-post comparison within a single unit.

The second is completion-dose comparison. Within a population, employees at different completion levels should show graduated outcomes. A dose-response curve is consistent with a causal contribution; its absence is a caution signal.
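One way to run the completion-dose comparison is to bucket employees by completion share and compare mean outcomes across buckets, as in the sketch below. The bucket boundaries, the outcome scale, and the sample data are assumptions chosen for illustration.

```python
from statistics import mean

def dose_response(records: list[tuple[float, float]]) -> list[tuple[str, float, int]]:
    """Group (completion_share, outcome) pairs into dose buckets and report mean outcome per bucket.

    The outcome could be, for example, an applied-artefact quality score. A roughly
    monotonic pattern across buckets is consistent with a causal contribution; a flat
    or reversed pattern is a caution signal.
    """
    buckets: dict[str, list[float]] = {"none (<25%)": [], "partial (25-75%)": [], "full (>75%)": []}
    for dose, outcome in records:
        if dose < 0.25:
            buckets["none (<25%)"].append(outcome)
        elif dose <= 0.75:
            buckets["partial (25-75%)"].append(outcome)
        else:
            buckets["full (>75%)"].append(outcome)
    return [(name, round(mean(vals), 2), len(vals)) for name, vals in buckets.items() if vals]

# Completion share from the LMS paired with an outcome score from applied-artefact review
sample = [(0.1, 2.1), (0.2, 2.4), (0.5, 2.9), (0.6, 3.1), (0.9, 3.6), (1.0, 3.4)]
print(dose_response(sample))
# [('none (<25%)', 2.25, 2), ('partial (25-75%)', 3.0, 2), ('full (>75%)', 3.5, 2)]
```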

The third is qualitative triangulation. Structured interviews with managers and employees document specific behaviour changes and name their antecedents. Where respondents attribute change to the literacy programme, the attribution claim strengthens; where respondents attribute change to other factors, the literacy programme’s contribution is reframed.

Attribution claims should be proportionate to the evidence. An expert-practitioner report that says “literacy programme contributed materially to the observed improvement, with X% of respondents naming it as a primary factor” is more defensible than “literacy programme caused the improvement”. Sentiment platforms (Qualtrics, CultureAmp, Peakon, Glint) support the structured triangulation; HRIS and LMS systems (Workday, SAP SuccessFactors, Oracle HCM, ADP for HRIS; Docebo, Cornerstone, Workday Learning, SAP SuccessFactors Learning, Open edX, Moodle for LMS) provide the completion-dose data.

Documented public-sector measurement

Singapore’s SkillsFuture programme operates at national scale with a measurement architecture that reports reach, depth, placement, and longitudinal wage outcomes.9 UK NHS AI Lab workforce programmes publish outcome data for clinical and operational populations.10 These public programmes face the same measurement challenges as enterprise programmes and have evolved multi-surface measurement over sustained funding horizons. The measurement approaches they have converged on are reference points for enterprise design.

Privacy-preserving measurement

Much of literacy measurement uses data about employees — completion records, assessment scores, sentiment responses, behavioural indicators. The data has privacy implications that expert practice must respect.

Four privacy-preserving design patterns apply.

Aggregation before analysis. Where population-level signals are sufficient, data is aggregated before analysis rather than analysed at the individual level and then aggregated. Sentiment platforms (Qualtrics, CultureAmp, Peakon, Glint) natively support aggregation thresholds that protect individual anonymity.
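The sketch below illustrates aggregation with a suppression threshold applied before any analysis. The threshold of five and the scoring scale are illustrative; actual thresholds come from policy and from the platform's own anonymity settings.

```python
from collections import defaultdict
from typing import Optional

MIN_GROUP_SIZE = 5  # suppression threshold; real thresholds are set by policy and platform

def aggregate_sentiment(responses: list[tuple[str, int]]) -> dict[str, Optional[float]]:
    """Aggregate confidence scores by business unit, suppressing units below the anonymity threshold.

    responses holds (business_unit, confidence_score) pairs. Units with fewer responses
    than MIN_GROUP_SIZE report None so that individual answers cannot be inferred.
    """
    by_unit: dict[str, list[int]] = defaultdict(list)
    for unit, score in responses:
        by_unit[unit].append(score)
    return {
        unit: round(sum(scores) / len(scores), 2) if len(scores) >= MIN_GROUP_SIZE else None
        for unit, scores in by_unit.items()
    }
```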

Purpose limitation. Data collected for literacy measurement is used for literacy measurement. Repurposing assessment data for unrelated performance evaluation or disciplinary action undermines learner trust and can run foul of GDPR Article 5 purpose-limitation principles. The purposes for which data will be used are disclosed to learners at collection.

Retention discipline. Literacy measurement data has defined retention horizons — long enough to evidence compliance (Article 16), not indefinite. Data retained beyond its purpose accumulates risk without adding value.

Role-based access controls. Who can access which measurement data is governed by role-based controls integrated with the HRIS (Workday, SAP SuccessFactors, Oracle HCM, ADP) and the LMS (Docebo, Cornerstone, Workday Learning, SAP SuccessFactors Learning, Open edX, Moodle) access infrastructure. Access logs are reviewed periodically.

Privacy-preserving design is not an add-on. It is a condition of the programme’s social licence. Programmes that collect broadly, retain indefinitely, and repurpose freely erode the trust that produces genuine engagement.

Expert habits in measurement

Three habits separate expert measurement from performative measurement.

Designing measurement before the programme launches. Programmes that are launched and then measured produce a weaker measurement architecture than programmes whose measurement is designed first. The sequence matters because baseline establishment depends on pre-launch measurement.

Investing in independent audit. Internal measurement has known blind spots. Periodic independent audit — an external reviewer, an audit function, a regulator — produces evidence that internal measurement cannot. Board-grade reports are stronger with independent audit evidence.

Refusing metric conflation. Metrics are sometimes blended (a completion rate multiplied by a pass rate to produce an “effective completion” number) in ways that obscure more than they reveal. Expert practice reports metrics separately and lets readers combine them. Blended metrics are a common symptom of programmes under pressure to show a single headline number.

Summary

Literacy measurement operates on four surfaces — knowledge, applied judgment, behavioural indicators, sentiment — with each addressed by distinct assessment forms. Applied-judgment assessment requires scenario authenticity, rubric calibration, open-ended form, and calibration samples. Behavioural indicators require selection discipline, baselines, counterfactual construction, and gaming resistance. Board reporting combines metric parity, honest interpretation, methodology disclosure, trend against benchmark, and decision framing. Regulatory evidence calibrates to EU AI Act Article 4, ISO/IEC 42001, and NIST AI RMF requirements. Article 16 now takes the measurement architecture into the compliance-grade evidence pack — the deliverable the regulator, auditor, and works council will actually read.


Cross-references to the COMPEL Core Stream:

  • EATF-Level-1/M1.6-Art02-AI-Literacy-Strategy-and-Program-Design.md — literacy-programme design anchor
  • EATP-Level-2/M2.5-Art05-People-and-Change-Metrics.md — people and change metrics anchor
  • EATF-Level-1/M1.2-Art05-Evaluate-Measuring-Transformation-Progress.md — Evaluate stage methodology feeding measurement design

Q-RUBRIC self-score: 91/100

© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. National Institute of Standards and Technology, “AI Risk Management Framework 1.0” (NIST AI 100-1, January 2023), MEASURE function, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf (accessed 2026-04-19).

  2. ISO/IEC 42001:2023, Clauses 7.2 (competence) and 7.3 (awareness), https://www.iso.org/standard/81230.html (accessed 2026-04-19).

  3. Regulation (EU) 2024/1689 (“EU AI Act”), Article 4 — AI Literacy, https://eur-lex.europa.eu/eli/reg/2024/1689/oj (accessed 2026-04-19).

  4. Stanford Human-Centered AI Institute, AI-education research resources, https://hai.stanford.edu/ (accessed 2026-04-19).

  5. Tweede Kamer der Staten-Generaal, “Ongekend onrecht — Parlementaire ondervraging kinderopvangtoeslag” (December 2020), https://www.tweedekamer.nl/kamerstukken/detail?id=2020D53175 (accessed 2026-04-19).

  6. World Economic Forum, Future of Jobs Report 2025 (January 2025), https://www.weforum.org/reports/the-future-of-jobs-report-2025/ (accessed 2026-04-19).

  7. Boston Consulting Group, “AI at Work 2025”, https://www.bcg.com/publications/2025/ai-at-work-2025 (accessed 2026-04-19).

  8. OECD, “Programme for the International Assessment of Adult Competencies (PIAAC)”, https://www.oecd.org/skills/piaac/ (accessed 2026-04-19).

  9. SkillsFuture Singapore, https://www.skillsfuture.gov.sg/ (accessed 2026-04-19).

  10. UK NHS AI Lab, https://transform.england.nhs.uk/ai-lab/ (accessed 2026-04-19).