AITGP M3.5-Art12 · v1.0 · Reviewed 2026-04-06 · Open Access
M3.5 Teaching, Training, and Methodology Evolution
AITGP · Governance Professional

Advanced Fairness Metrics — Beyond Demographic Parity


7 min read · Article 12 of 16

This article provides governance professionals with the deep understanding of fairness metrics needed to make and defend metric selection decisions in high-stakes AI governance contexts.

The Impossibility Landscape

Before exploring individual metrics, governance professionals must internalise a foundational mathematical result: you cannot have it all.

Chouldechova (2017) proved that when base rates differ between groups, no classifier can simultaneously satisfy calibration, predictive parity, and error rate balance (equalized odds). Kleinberg, Mullainathan, and Raghavan (2016) proved a closely related result: no risk score can be simultaneously calibrated within each group and balanced in its average scores for the positive and negative classes, unless prediction is perfect or base rates are equal.

The practical implication is profound: for any AI system operating in a domain where different demographic groups have different outcome rates — which is nearly every consequential domain due to historical inequality — the governance professional must choose which fairness properties to prioritise. This is not a technical decision. It is a value judgment that determines who bears the cost of the system’s errors.

Consider a criminal justice risk assessment tool where the recidivism base rate differs between racial groups (due to systemic factors including differential policing, sentencing, and socio-economic conditions). If the tool is calibrated — meaning a risk score of 7 means the same probability of recidivism regardless of race — then it will inevitably have different false positive rates between groups. Black defendants at a given risk level will have the same recidivism probability as white defendants at that level, but more Black defendants will be incorrectly classified as high-risk because the base rate is higher.
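The arithmetic behind this can be made concrete. The sketch below uses hypothetical score distributions (the bucket counts and the 0.5 threshold are illustrative, not drawn from any real tool): both groups receive the same perfectly calibrated scores, yet because the base rates differ, the false positive rates diverge.

```python
# Hypothetical calibrated scores: within each score bucket, a fraction of
# people equal to the score actually recidivates (that is what calibration means).
# Each group maps score -> number of people with that score.
groups = {
    "A": {0.2: 600, 0.8: 400},   # base rate (600*0.2 + 400*0.8) / 1000 = 0.44
    "B": {0.2: 300, 0.8: 700},   # base rate (300*0.2 + 700*0.8) / 1000 = 0.62
}

THRESHOLD = 0.5  # scores >= 0.5 are classified "high risk"

fpr_by_group = {}
for name, buckets in groups.items():
    # Expected non-recidivists in the whole group.
    negatives = sum(n * (1 - s) for s, n in buckets.items())
    # False positives: non-recidivists whose score clears the threshold.
    fp = sum(n * (1 - s) for s, n in buckets.items() if s >= THRESHOLD)
    fpr_by_group[name] = fp / negatives

print(fpr_by_group)  # group B's FPR is more than double group A's
```

The scores "mean what they say" for both groups, yet group B (the higher-base-rate group) absorbs a much larger share of false "high risk" labels: roughly 14% of group A's non-recidivists are misclassified versus roughly 37% of group B's.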

Calibration says: “the score means what it says for everyone.” Equalized odds says: “the error burden should be distributed equally.” These are both reasonable fairness goals, and they cannot both be satisfied simultaneously when base rates differ. The governance professional must decide — and document — which goal takes precedence and why.

The Metric Families

Group Fairness Metrics

Group fairness metrics compare outcomes across predefined demographic groups. They are the most widely used fairness metrics and the basis for most regulatory fairness requirements.

Demographic Parity requires equal selection rates across groups. It is the most intuitive metric but the most controversial: it can conflict with accuracy, it ignores legitimate differences in qualifications, and it can be satisfied trivially by random selection. The EEOC’s 80% rule (four-fifths rule) operationalises a relaxed version: the selection rate for any group should be at least 80% of the highest group’s rate.
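The four-fifths rule reduces to simple arithmetic on selection rates. A minimal sketch, with hypothetical applicant counts:

```python
# Hypothetical selection counts per group (not real data).
selected = {"group_a": 120, "group_b": 60}
total    = {"group_a": 400, "group_b": 300}

# Selection rate per group.
rates = {g: selected[g] / total[g] for g in selected}

# Compare each group's rate to the best-off group's rate.
best = max(rates.values())
impact_ratios = {g: r / best for g, r in rates.items()}

# The four-fifths rule: every ratio must be at least 0.8.
passes_four_fifths = all(ratio >= 0.8 for ratio in impact_ratios.values())
```

Here group_a selects at 30% and group_b at 20%, giving group_b an impact ratio of about 0.67, so the check fails even though both rates might look "reasonable" in isolation.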

Equalized Odds requires equal true positive rates and false positive rates across groups. It is more nuanced than demographic parity because it conditions on the true outcome. But it assumes that ground truth labels are reliable — a problematic assumption when labels themselves reflect historical bias (e.g., arrest data as a proxy for criminal behaviour).
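Equalized odds can be audited directly from per-group confusion matrices. A sketch with hypothetical counts and an illustrative 0.05 tolerance:

```python
# Hypothetical per-group confusion counts: (tp, fp, fn, tn).
confusion = {
    "group_a": (80, 20, 20, 80),
    "group_b": (60, 40, 40, 60),
}

def odds(tp, fp, fn, tn):
    tpr = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)   # false positive rate
    return tpr, fpr

group_odds = {g: odds(*c) for g, c in confusion.items()}

# Equalized odds requires both gaps to be (approximately) zero.
tpr_gap = abs(group_odds["group_a"][0] - group_odds["group_b"][0])
fpr_gap = abs(group_odds["group_a"][1] - group_odds["group_b"][1])
satisfies_equalized_odds = tpr_gap < 0.05 and fpr_gap < 0.05
```

In this example both gaps are 0.2, so the model fails the check: group_b's members are both missed more often when they are true positives and flagged more often when they are true negatives.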

Predictive Parity requires equal positive predictive value across groups. It ensures that a positive prediction means the same thing regardless of group membership. It is critical in domains where positive predictions trigger resource allocation (e.g., treatment, intervention).

Calibration requires that predicted probabilities match actual outcomes across groups. A model predicting 70% risk should be right about 70% of the time for every group. This is essential when risk scores are used directly for decision thresholds.

Treatment Equality requires equal ratios of false negatives to false positives across groups. It captures whether the pattern of errors — not just the rate — differs between groups.
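Predictive parity and treatment equality can both be read off the same per-group confusion counts (calibration additionally needs the underlying scores, so it is omitted here). A sketch with hypothetical counts:

```python
# Hypothetical per-group confusion counts: (tp, fp, fn, tn).
confusion = {
    "group_a": (80, 20, 20, 80),
    "group_b": (60, 15, 40, 85),
}

metrics = {}
for g, (tp, fp, fn, tn) in confusion.items():
    metrics[g] = {
        "ppv": tp / (tp + fp),        # predictive parity compares this across groups
        "fn_fp_ratio": fn / fp,       # treatment equality compares this ratio
    }
```

These numbers are chosen to show how the metrics can disagree: both groups have a PPV of 0.8, so predictive parity holds, but group_b's false negative to false positive ratio is nearly three times group_a's, so treatment equality fails. The pattern of errors differs even though a positive prediction means the same thing for everyone.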

Individual Fairness

Individual fairness requires that similar individuals receive similar predictions. Unlike group fairness, it does not partition people into demographic groups. Instead, it asks whether the model treats comparable people comparably, using a task-specific distance metric.

The challenge of individual fairness is defining the distance metric. What makes two job applicants “similar”? Education? Experience? Skills? Potential? These are normative questions with no objectively correct answer. The distance metric encodes a theory of justice, and different theories yield different fairness conclusions.

Individual fairness is most useful when: the context demands person-level assessment (sentencing, admissions), group-level metrics are insufficient because groups are internally heterogeneous, and a meaningful similarity metric can be constructed with domain expert input.
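One common formalisation of individual fairness is a Lipschitz condition: the difference in predictions must be bounded by the distance between individuals. The sketch below audits a tiny applicant pool; the distance function, model, and Lipschitz constant are all illustrative stand-ins, since — as noted above — the real work is justifying the metric itself.

```python
import itertools
import math

def d(x, y):
    # Hypothetical task distance: Euclidean distance over normalised features.
    return math.dist(x, y)

def f(x):
    # Hypothetical model score over two features.
    return min(1.0, 0.3 * x[0] + 0.5 * x[1])

applicants = [(0.9, 0.8), (0.88, 0.79), (0.2, 0.3)]
LIPSCHITZ = 1.0  # fairness requires |f(x) - f(y)| <= LIPSCHITZ * d(x, y)

# Collect every pair of applicants that violates the condition.
violations = [
    (x, y) for x, y in itertools.combinations(applicants, 2)
    if abs(f(x) - f(y)) > LIPSCHITZ * d(x, y)
]
```

Here the two near-identical applicants receive near-identical scores, so no pair violates the bound. In practice the audit is only as meaningful as the distance function, which is where domain experts and stakeholders must be involved.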

Counterfactual Fairness

Counterfactual fairness asks: would the prediction have been the same if the individual’s protected attribute were different? This requires a causal model (a directed acyclic graph) specifying how the protected attribute influences other features and the outcome.

Counterfactual fairness is philosophically elegant but practically demanding. It requires specifying a causal model that may be contested, it raises metaphysical questions about counterfactual identity (“what would it mean for this person to be a different race?”), and it cannot be verified from observational data alone.

Its value lies in surfacing proxy discrimination: when ostensibly neutral features (postcode, name, school) carry causal information about protected attributes and influence predictions through those pathways.
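A counterfactual-fairness check can be sketched with a toy structural model: hold the exogenous background factors fixed, flip the protected attribute, and compare predictions. Every equation and coefficient below is hypothetical, chosen only to illustrate the proxy pathway (here, a postcode-derived feature).

```python
def structural_model(a, u_postcode, u_income):
    # The protected attribute A causally influences the postcode feature
    # (a proxy pathway); income is independent of A in this toy model.
    postcode_risk = 0.6 * a + u_postcode
    income = u_income
    return postcode_risk, income

def predict(postcode_risk, income):
    # The model never sees A directly, but it uses the proxy feature.
    return 0.7 * postcode_risk - 0.2 * income

u = (0.1, 0.5)  # fixed exogenous background factors for one individual
y_a0 = predict(*structural_model(0, *u))  # prediction with A = 0
y_a1 = predict(*structural_model(1, *u))  # counterfactual prediction with A = 1
counterfactually_fair = abs(y_a1 - y_a0) < 1e-9
```

The prediction shifts by 0.42 when the protected attribute is flipped, even though the model never uses it directly: the postcode feature carries the causal influence. That is exactly the proxy discrimination this criterion is designed to surface.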

Intersectional Fairness

Standard group fairness metrics assess fairness along single demographic dimensions. Intersectional fairness — inspired by Crenshaw’s (1989) intersectionality theory — assesses fairness at the intersection of multiple dimensions. A model might satisfy demographic parity for gender and for race independently, but Black women might experience significantly worse outcomes than any single-dimension analysis reveals.

The practical challenge is the curse of intersectionality: as the number of protected attributes increases, the number of intersectional subgroups grows exponentially, and sample sizes per subgroup shrink to the point where statistical assessment becomes unreliable. Governance professionals must balance the imperative to detect intersectional discrimination with the statistical limitations of small subgroup analysis.
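The subgroup explosion and the resulting small-sample problem can be made visible with a few lines of enumeration. The counts and the 30-person floor below are hypothetical:

```python
from collections import Counter
from itertools import product

# Hypothetical dataset: each record is (gender, race).
records = (
    [("female", "black")] * 12 + [("female", "white")] * 140 +
    [("male", "black")] * 90 + [("male", "white")] * 160
)
counts = Counter(records)

MIN_N = 30  # below this, subgroup fairness estimates are statistically unreliable

# Enumerate every intersectional subgroup and flag the under-sampled ones.
subgroups = list(product(["female", "male"], ["black", "white"]))
too_small = [g for g in subgroups if counts[g] < MIN_N]
```

With just two binary attributes there are four subgroups; adding a third binary attribute doubles that to eight, and so on. Here the (female, black) subgroup has only 12 members, so any disparity measured for it would need wide uncertainty bounds, which is precisely the tension described above.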

Metric Selection: A Governance Decision

The choice of fairness metric is a governance decision, not a technical one. Governance professionals should follow a structured selection process:

Step 1: Understand the domain context. What decisions does the AI system inform? What are the consequences of false positives versus false negatives? Who bears those consequences? What does “fairness” mean to the affected communities?

Step 2: Consult affected stakeholders. Different stakeholder groups may have different fairness intuitions. Applicants for a loan may prioritise equal acceptance rates (demographic parity). The bank may prioritise equal default rates among accepted applicants (calibration). Both are legitimate perspectives.

Step 3: Consider regulatory requirements. Some jurisdictions prescribe specific fairness metrics. The EEOC’s four-fifths rule effectively mandates a relaxed demographic parity standard. The EU AI Act’s risk management requirements imply the need for systematic fairness evaluation without prescribing specific metrics.

Step 4: Acknowledge trade-offs explicitly. Document which fairness properties the chosen metric satisfies and which it sacrifices. Explain why the chosen trade-off is appropriate for the specific context.

Step 5: Commit to ongoing evaluation. Fairness metrics should be monitored continuously in production, not assessed once and forgotten. Fairness properties can degrade as the model, the population, and the environment change over time.

Communicating Fairness to Non-Technical Stakeholders

Governance professionals must translate technical fairness analysis into language that board members, regulators, and affected communities can understand and engage with.

Avoid: “The model satisfies equalized odds with TPR differential < 0.05 across racial subgroups.”

Use instead: “Among people who would actually repay their loans, the model approves the same percentage regardless of race. And among people who would default, the model flags the same percentage regardless of race. The difference in these rates between racial groups is less than 5 percentage points.”

The translation must preserve accuracy — oversimplification that misrepresents the metric or hides its limitations is worse than technical jargon. But the translation must be accessible enough that a non-technical stakeholder can evaluate whether the fairness standard is adequate for the decisions at stake.


This article is part of the COMPEL Body of Knowledge v2.5 and supports the AI Transformation Governance Professional (AITGP) certification.