ERROR ANALYSIS¶


What this section does:

Instead of looking at averages, we inspect the mistakes. Where does the model fail? Who gets false positives (innocent people blocked)? Who gets false negatives (fraud missed)? Are errors concentrated in specific groups?

This is one of the most underrated steps in an AI audit. Aggregate performance metrics tell you how often the model is right. Error analysis tells you who pays the price when it is wrong.

In a fraud detection context:

  • A false positive means a legitimate customer gets blocked. Their transaction is stopped. Their account may be frozen. They have to go through a dispute process to get their money.
  • A false negative means a fraud transaction goes through undetected. The victim loses money. The fraudster escapes.

Neither type of error is acceptable, but their consequences are different for different user groups. That is what we are measuring here.
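Both error types can be read straight off a confusion matrix. A minimal sketch on toy labels (not the audit data), using the same sklearn call the analysis cells below rely on:

```python
# Toy illustration — invented labels, not the audit data. 1 = fraud, 0 = legitimate.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]  # one legit tx blocked, one fraud missed

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f'False positives (legitimate customer blocked): {fp}')
print(f'False negatives (fraud slipped through):       {fn}')
```

With `labels=[0, 1]`, sklearn fixes the row/column order so the `(tn, fp, fn, tp)` unpacking is stable even if one class is absent from a slice.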


In [4]:
# ── SECTION 1: LOAD CHECKPOINT ───────────────────────────────────────────────

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder
from scipy.stats import ks_2samp
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference
from sklearn.metrics import matthews_corrcoef, log_loss, confusion_matrix, precision_recall_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
# Balance-tier bin edges and labels used to build the balance_group proxy
bins_proxy   = [-np.inf, 0.0, 50397.0, np.inf]
labels_proxy = ['Low-Balance', 'Mid-Balance', 'High-Balance']

df = pd.read_csv('checkpoint.csv')

print(f'Checkpoint loaded: {df.shape[0]:,} rows x {df.shape[1]} columns')
print(df.columns.tolist())
Checkpoint loaded: 6,362,620 rows x 11 columns
['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud']
In [5]:
import pandas as pd

results = pd.read_csv('model_results.csv')

y_test = results['y_test']
y_pred = results['y_pred']
y_prob = results['y_prob']


# Group-membership columns (fairness proxies) for each test-set row
s_test = results[['balance_group', 'tx_type_group']].astype(str)

print("All audit variables (including s_test) are now defined.")
All audit variables (including s_test) are now defined.
In [6]:
# Per-group confusion-matrix counts for each fairness proxy
fairness_results = {}

for proxy in ['tx_type_group', 'balance_group']:
    sf     = s_test[proxy].astype(str).values
    groups = np.unique(sf)

    fairness_results[proxy] = {'group_stats': {}}

    for g in groups:
        mask = sf == g

        y_true_g = y_test[mask]
        y_pred_g = y_pred[mask]

        # confusion matrix: tn, fp, fn, tp
        tn, fp, fn, tp = confusion_matrix(y_true_g, y_pred_g, labels=[0,1]).ravel()

        fairness_results[proxy]['group_stats'][g] = {
            'tn': tn,
            'fp': fp,
            'fn': fn,
            'tp': tp
        }
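The nested fairness_results dict is convenient for lookups but awkward to eyeball. One option (a helper sketch, not part of the original audit code) is to flatten it into a tidy DataFrame; the demo dict below uses invented counts purely to show the shape, and in the notebook you would pass fairness_results itself:

```python
import pandas as pd

def flatten_group_stats(fairness_results):
    """Turn {proxy: {'group_stats': {group: {tn, fp, fn, tp}}}} into tidy rows."""
    rows = [{'proxy': proxy, 'group': group, **cm}
            for proxy, res in fairness_results.items()
            for group, cm in res['group_stats'].items()]
    return pd.DataFrame(rows, columns=['proxy', 'group', 'tn', 'fp', 'fn', 'tp'])

# Demo with made-up counts to illustrate the output shape.
demo = {'tx_type_group': {'group_stats': {
    'CASH_OUT': {'tn': 95, 'fp': 3, 'fn': 1, 'tp': 1},
    'OTHER':    {'tn': 98, 'fp': 1, 'fn': 0, 'tp': 1},
}}}
print(flatten_group_stats(demo).to_string(index=False))
```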
In [7]:
# ── SECTION 9: ERROR ANALYSIS ─────────────────────────────────────────────────
print('=' * 65)
print('ERROR ANALYSIS — WHERE DOES THE MODEL FAIL?')
print('=' * 65)

for proxy in ['tx_type_group', 'balance_group']:
    sf     = s_test[proxy].astype(str).values
    groups = sorted(np.unique(sf))

    if proxy == 'balance_group':
        order  = ['Low-Balance', 'Mid-Balance', 'High-Balance']
        groups = [g for g in order if g in groups]

    print(f'\nProxy: {proxy}')
    print(f'{"Group":<14} {"False Positives":>16} {"FP Rate":>10} '
          f'{"False Negatives":>16} {"FN Rate":>10} {"Miss Rate":>10}')
    print('-' * 82)

    for g in groups:
        stats_g = fairness_results[proxy]['group_stats'][g]
        n_legit = stats_g['tn'] + stats_g['fp']
        n_fraud = stats_g['tp'] + stats_g['fn']
        fp_rate = stats_g['fp'] / n_legit if n_legit > 0 else 0
        fn_rate = stats_g['fn'] / n_fraud if n_fraud > 0 else 0
        miss_pct = stats_g['fn'] / n_fraud * 100 if n_fraud > 0 else 0

        print(f'{g:<14} {stats_g["fp"]:>16,} {fp_rate:>10.6f} '
              f'{stats_g["fn"]:>16,} {fn_rate:>10.6f} {miss_pct:>9.1f}%')

print('\nKey question: Are false positives and false negatives')
print('concentrated in specific groups, or spread evenly?')
=================================================================
ERROR ANALYSIS — WHERE DOES THE MODEL FAIL?
=================================================================

Proxy: tx_type_group
Group           False Positives    FP Rate  False Negatives    FN Rate  Miss Rate
----------------------------------------------------------------------------------
CASH_OUT                  2,866   0.006420                8   0.010025       1.0%
OTHER                     3,751   0.004550                4   0.004734       0.5%

Proxy: balance_group
Group           False Positives    FP Rate  False Negatives    FN Rate  Miss Rate
----------------------------------------------------------------------------------
Low-Balance                 168   0.000399                2   0.333333      33.3%
Mid-Balance               4,171   0.009958                6   0.026667       2.7%
High-Balance              2,278   0.005279                4   0.002833       0.3%

Key question: Are false positives and false negatives
concentrated in specific groups, or spread evenly?

Note on miss rates: These percentages measure what fraction of each group's own fraud cases the model failed to catch. Low-Balance users have a 33.3% internal miss rate: one in three fraud cases targeting this group goes completely undetected. High-Balance users have a 0.3% miss rate, meaning the model almost never misses their fraud. This more-than-100-fold difference in miss rates is the clearest expression of the protection gap this audit set out to find. The absolute fraud case counts are small for Low-Balance users, but the direction and magnitude of the disparity are unambiguous and consistent with the EOD finding in the fairness and bias section.
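The quoted miss rates can be re-derived from the raw counts in the findings table below. Note that the Low/High ratio computed from unrounded rates comes out near 118-fold; the rounded percentages (33.3% vs 0.3%) would suggest about 111. A quick arithmetic check:

```python
# Re-derive each group's miss rate (missed / actual fraud) from raw counts.
actual_fraud = {'Low-Balance': 6, 'Mid-Balance': 225, 'High-Balance': 1412}
missed       = {'Low-Balance': 2, 'Mid-Balance': 6,   'High-Balance': 4}

miss_rate = {g: missed[g] / actual_fraud[g] for g in actual_fraud}
for g, r in miss_rate.items():
    print(f'{g:<13} miss rate = {r:.1%}')

ratio = miss_rate['Low-Balance'] / miss_rate['High-Balance']
print(f'Low vs High miss-rate ratio: {ratio:.0f}x')
```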

In [8]:
# ── SECTION 9: ERROR DISTRIBUTION CHARTS ─────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle(
    'Section 9 — Error Analysis: Who Bears the Cost of Model Mistakes?\n'
    'ClearBoxAI Audit CBA-2026-002',
    fontsize=12, fontweight='bold'
)

bg_groups = ['Low-Balance', 'Mid-Balance', 'High-Balance']
bg_stats  = fairness_results['balance_group']['group_stats']
existing  = [g for g in bg_groups if g in bg_stats]

fp_counts = [bg_stats[g]['fp'] for g in existing]
fn_counts = [bg_stats[g]['fn'] for g in existing]
colors_bg = ['#FF6B6B', '#FFC107', '#4CAF50'][:len(existing)]

# False positives (innocent people wrongly blocked)
axes[0].bar(existing, fp_counts, color=colors_bg, edgecolor='black', alpha=0.85)
axes[0].set_title('False Positives by Balance Tier\n(Legitimate customers wrongly blocked)',
                  fontweight='bold')
axes[0].set_ylabel('Number of false positives')
for i, v in enumerate(fp_counts):
    axes[0].text(i, v + 5, f'{v:,}', ha='center', fontsize=10, fontweight='bold')

# False negatives (fraud missed)
axes[1].bar(existing, fn_counts, color=colors_bg, edgecolor='black', alpha=0.85)
axes[1].set_title('False Negatives by Balance Tier\n(Fraud missed — victims unprotected)',
                  fontweight='bold')
axes[1].set_ylabel('Number of false negatives')
for i, v in enumerate(fn_counts):
    pct_of_all = v / sum(fn_counts) * 100 if sum(fn_counts) > 0 else 0
    axes[1].text(i, v + 0.1,
                 f'{v}\n({pct_of_all:.1f}% of all\nmissed fraud)',
                 ha='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig('fig_error_01.png', dpi=150, bbox_inches='tight')
plt.show()
[Figure fig_error_01.png — Error Analysis: Who Bears the Cost of Model Mistakes? False positives (left) and false negatives (right) by balance tier]

Note on chart percentages: The percentages shown above represent each group's share of the total missed fraud cases across all groups combined. Mid-Balance accounts for 50% of all missed fraud in absolute terms, High-Balance 33.3%, and Low-Balance 16.7%. These figures reflect how missed fraud is distributed, not how reliable the model is within each group. A group with few fraud cases can have a small share of total misses while still having a very high internal miss rate. The per-group miss rates in the table below tell that more important story.
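The distinction in the note above can be made concrete with this audit's own counts: a group's share of all missed fraud and its within-group miss rate can disagree sharply, and Low-Balance is the extreme case (smallest share of misses, largest miss rate).

```python
# Missed-fraud counts and per-group fraud totals from this audit's test set.
missed = {'Low-Balance': 2, 'Mid-Balance': 6, 'High-Balance': 4}
fraud  = {'Low-Balance': 6, 'Mid-Balance': 225, 'High-Balance': 1412}

total_missed = sum(missed.values())
for g in missed:
    share = missed[g] / total_missed   # slice of all missed fraud (chart labels)
    rate  = missed[g] / fraud[g]       # fraction of this group's own fraud missed
    print(f'{g:<13} share of misses = {share:5.1%}   own miss rate = {rate:5.1%}')
```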

Error Analysis Findings¶

False Positives - Innocent Customers Wrongly Blocked¶

False positives follow the size of each group. Mid-Balance accounts generate the most in absolute terms at 4,171, followed by High-Balance at 2,278, and Low-Balance at 168. On the surface this looks proportionate.

But the false positive rate tells a different story. Mid-Balance users are blocked at a rate of 0.996% of their legitimate transactions. High-Balance users at 0.528%. Low-Balance users at just 0.040%.

This means Low-Balance users are actually the least likely to be wrongly blocked. That sounds like good news until you consider what it reflects: the model barely flags this group at all, for fraud or for anything else. It is not protecting them more carefully. It has simply learned so little about their transaction patterns that it defaults to calling almost everything they do legitimate - including actual fraud.


False Negatives - Fraud Missed, Victims Unprotected¶

This is where the numbers become most serious.

Group          Actual Fraud in Test   Missed   Miss Rate
Low-Balance                       6        2       33.3%
Mid-Balance                     225        6        2.7%
High-Balance                  1,412        4        0.3%

The miss rate climbs sharply as account balance falls. High-Balance users lose 0.3% of their fraud cases to model failure. Mid-Balance users lose 2.7%. Low-Balance users lose 33.3%.

That is more than a 100-fold difference in miss rate between the best-protected and worst-protected groups.

In absolute terms, Mid-Balance has the most missed fraud cases at 6. But miss rate is the right metric here, not absolute count. A group with only 6 fraud cases in the test set is not well protected simply because it has fewer misses than a group with 1,412. The question is what fraction of each group's own fraud the model is failing to catch. For Low-Balance users, that fraction is one in three.


What This Means Practically¶

Think about what a 33.3% miss rate means in a real deployment. For every three low-balance mobile money users who experience fraud, the AI system catches two and lets one go completely undetected. That user has no automated protection at the moment it matters most. If they report it afterwards, a human investigator may follow up. But the first line of defence has already failed them.

Compare that to a high-balance user. For every 350 or so high-balance fraud cases, the model misses one. The protection is not perfect but it is reliable. For low-balance users it is not reliable. One in three fraudsters targeting this group walks away unchallenged by the system.

The market trader in Kumasi with GHS 200 stolen from her account faces a one-in-three chance that the fraud detection AI does not respond at all. That is not a modelling edge case. It is a structural feature of how the system was built and what data it was trained on.


Conclusion¶

This error analysis confirms what the fairness metrics in the fairness and bias section measured statistically. The model is not equally wrong for everyone. It is most wrong for the users who are least equipped to absorb the consequences.


Regulatory Assessment¶

Regulation                         Provision                                                       Status
BoG CISD 2026, Annexure E §l(i)    Material bias confirmed through error distribution analysis    TRIGGERED
NIST AI RMF 1.0, MEASURE 2.11      Fairness and bias evaluated and documented                     Complete
NIST AI RMF 1.0, §3.7              Harmful Bias Managed — trustworthiness characteristic          Breached

Risk Level: HIGH - Error analysis confirms the fairness finding from the fairness and bias section. The miss rate disparity of 33.3% versus 0.3% between the lowest and highest balance tiers is a structural protection gap, not statistical noise.

Auditor: Kwadwo Amponsah, ClearBoxAI - April 2026

In [10]:
df.to_csv('checkpoint_v2.csv', index=False)
print("Checkpoint v2 saved.")
Checkpoint v2 saved.
In [ ]: