PRE-TRAINING PROXY DETECTION¶
Goal: Before training any model, check whether the data itself already treats our proxy groups differently. If the data is biased going in, the model will be biased coming out.
What tools are we using?
Crosstab — A table that counts fraud cases broken down by group. Shows us the raw numbers.
Chi-Square Test — A statistical test that asks: is the difference between groups real, or could it just be random variation? If p is less than 0.05, the difference is very unlikely to be explained by chance alone.
Cramer's V — A score from 0 to 1 that measures how strong the relationship between group and fraud actually is. A statistically significant result does not automatically mean a practically important one. Cramer's V separates those two things.
V below 0.10 means the relationship is negligible in practical terms. V from 0.10 to 0.20 is weak, 0.20 to 0.40 is moderate, and above 0.40 is strong. (These are the bands the code below uses to label association strength.)
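To see how the two measures fit together before running them on the real data, here is a minimal sketch on a made-up 2×2 crosstab (the counts are hypothetical, not from the dataset below):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 crosstab: rows = group, cols = [legitimate, fraud]
toy = np.array([[9900, 100],
                [9800, 200]])

# Chi-square answers "is the difference real?"
chi2, p, dof, _ = stats.chi2_contingency(toy)

# Uncorrected Cramer's V answers "is it strong enough to matter?"
n = toy.sum()
r, k = toy.shape
v = np.sqrt((chi2 / n) / min(r - 1, k - 1))

print(f"chi2={chi2:.2f}, p={p:.3g}, V={v:.4f}")
```

On this toy table the difference is statistically significant (p well below 0.05) yet V lands below 0.10, i.e. in the negligible band — exactly the combination the real results below will show.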
Understanding What These Tests Are Looking For:
The model learns from examples. If 87% of all fraud examples in the training data come from one type of user, the model becomes an expert on that user. Everyone else gets protection built from the remaining 13%.
This is not a model flaw. It is a data problem. And a data problem going in becomes a fairness problem coming out.
The proxy detection step tests whether that unevenness exists before training begins, so we know exactly what we are walking into.
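The "87% of fraud examples from one group" check itself is a one-liner. A sketch on hypothetical toy data (group names and counts are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: 1,000 fraud rows and 9,000 legitimate rows,
# with fraud examples concentrated in group A
df_toy = pd.DataFrame({
    'group':   ['A'] * 870 + ['B'] * 130 + ['A'] * 4000 + ['B'] * 5000,
    'isFraud': [1] * 1000 + [0] * 9000,
})

# Share of fraud examples each group contributes to training
fraud_share = (df_toy.loc[df_toy['isFraud'] == 1, 'group']
               .value_counts(normalize=True))
print(fraud_share)  # A: 0.87, B: 0.13
```

A model trained on this toy data would see group A's fraud patterns 870 times and group B's only 130 times, which is the unevenness the proxy detection step below quantifies.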
import pandas as pd
import numpy as np
from scipy import stats
DATA_FILE = 'PS_20174392719_1491204439457_log.csv'
df = pd.read_csv(DATA_FILE)
# ── SECTION 3: PROXY VARIABLE ENGINEERING ─────────────────────────────────────
print('Engineering proxy columns...')
# Proxy 1: Binary economic role
# CASH_OUT = informal economy users (market traders, gig workers, smallholder farmers)
# OTHER = broader mix of formal and informal activity
df['tx_type_group'] = df['type'].apply(
    lambda x: 'CASH_OUT' if x == 'CASH_OUT' else 'OTHER'
)
# Proxy 2: Wealth tier using fixed GHS boundaries
# These boundaries are derived from the actual data distribution and held constant
# across all sections to ensure consistency
bins_proxy = [-np.inf, 0.0, 50397.0, np.inf]
labels_proxy = ['Low-Balance', 'Mid-Balance', 'High-Balance']
df['balance_group'] = pd.cut(
    df['oldbalanceOrg'],
    bins=bins_proxy,
    labels=labels_proxy
)
print(f' tx_type_group : {df["tx_type_group"].value_counts().to_dict()}')
print(f' balance_group : {df["balance_group"].value_counts().to_dict()}')
print('Proxy columns created.')
Engineering proxy columns...
tx_type_group : {'OTHER': 4125120, 'CASH_OUT': 2237500}
balance_group : {'High-Balance': 2163289, 'Low-Balance': 2102449, 'Mid-Balance': 2096882}
Proxy columns created.
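One detail worth pinning down is where the tier boundaries fall: `pd.cut` uses right-closed intervals by default, so a balance of exactly 0.0 lands in Low-Balance and exactly 50,397.0 lands in Mid-Balance. A quick check with the same bins (the sample balances here are made up):

```python
import numpy as np
import pandas as pd

bins_proxy = [-np.inf, 0.0, 50397.0, np.inf]
labels_proxy = ['Low-Balance', 'Mid-Balance', 'High-Balance']

# Hypothetical balances straddling each boundary
sample = pd.Series([-5.0, 0.0, 0.01, 50397.0, 50397.01])
print(pd.cut(sample, bins=bins_proxy, labels=labels_proxy))
```

This confirms the three tiers partition every possible balance with no gaps and no overlap, which is what lets the same bins be reused consistently across sections.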
# ── SECTION 3: CHI-SQUARE AND CRAMER'S V ──────────────────────────────────────
def cramers_v(ct):
    """
    Calculates Cramer's V with bias correction.
    Measures association strength between 0 and 1.
    1 = perfect association, 0 = no association.
    """
    chi2 = stats.chi2_contingency(ct)[0]
    n = ct.sum()
    phi2 = chi2 / n
    r, k = ct.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
proxy_results = {}
for proxy, label in [
    ('tx_type_group', 'Transaction Type (Economic Role Proxy)'),
    ('balance_group', 'Account Balance Tier (Wealth Proxy)')
]:
    print(f'\n{"=" * 58}')
    print(f'Proxy: {label}')
    print('=' * 58)

    ct_raw = pd.crosstab(df[proxy], df['isFraud'])
    chi2, p, dof, _ = stats.chi2_contingency(ct_raw.values)
    v = cramers_v(ct_raw.values)

    ct_display = ct_raw.copy()
    ct_display.columns = ['Legitimate', 'Fraud']
    ct_display['Total'] = ct_display.sum(axis=1)
    ct_display['Fraud_Rate_%'] = (
        ct_display['Fraud'] / ct_display['Total'] * 100
    ).round(4)
    print('\nCrosstab results:')
    print(ct_display.to_string())

    strength = ('NEGLIGIBLE' if v < 0.10 else
                'WEAK' if v < 0.20 else
                'MODERATE' if v < 0.40 else 'STRONG')
    sig = p < 0.05

    print('\nStatistical Test Results:')
    print(f'  Chi-square statistic : {chi2:>12,.2f}')
    print(f'  p-value              : {p:>12.2e}')
    print(f"  Cramer's V           : {v:>12.4f}")
    print(f'  Significance         : {"SIGNIFICANT (real, not random)" if sig else "NOT significant"}')
    print(f'  Association strength : {strength}')

    proxy_results[proxy] = {
        'chi2': chi2, 'p': p, 'v': v,
        'strength': strength, 'ct': ct_display
    }
print('\nProxy detection complete.')
==========================================================
Proxy: Transaction Type (Economic Role Proxy)
==========================================================
Crosstab results:
Legitimate Fraud Total Fraud_Rate_%
tx_type_group
CASH_OUT 2233384 4116 2237500 0.1840
OTHER 4121023 4097 4125120 0.0993
Statistical Test Results:
Chi-square statistic : 805.43
p-value : 3.57e-177
Cramer's V : 0.0112
Significance : SIGNIFICANT (real, not random)
Association strength : NEGLIGIBLE
==========================================================
Proxy: Account Balance Tier (Wealth Proxy)
==========================================================
Crosstab results:
Legitimate Fraud Total Fraud_Rate_%
balance_group
Low-Balance 2102408 41 2102449 0.0020
Mid-Balance 2095860 1022 2096882 0.0487
High-Balance 2156139 7150 2163289 0.3305
Statistical Test Results:
Chi-square statistic : 10,494.67
p-value : 0.00e+00
Cramer's V : 0.0406
Significance : SIGNIFICANT (real, not random)
Association strength : NEGLIGIBLE
Proxy detection complete.
[Proxy detection findings]¶
What These Numbers Are Telling Us¶
We just ran pre-training proxy detection. Before handing any data to the model, we checked whether fraud is already unevenly distributed across our proxy groups. If it is, the model will learn that unevenness as if it is the truth.
Proxy 1: Transaction Type — CASH_OUT vs OTHER
CASH_OUT users face a fraud rate almost double that of OTHER users. The chi-square test confirms this difference is real and not random. The p-value is essentially zero.
But look at Cramer's V: it is 0.0112. Association strength: NEGLIGIBLE.
This seems contradictory. How can something be statistically significant and negligible at the same time?
Think of it this way. A hospital-grade scale can detect the weight of a single sheet of paper. If you weigh yourself before and after drinking a glass of water, the scale will show a real measurable difference. But you would not say water makes you gain weight. The difference is real but tiny in practical terms.
The chi-square test is that hospital scale. With 6.3 million transactions, it finds real patterns in almost anything. Cramer's V asks whether that pattern is strong enough to matter. At 0.0112 out of a possible 1.0, the answer is no. Transaction type alone is a negligible signal in practical terms.
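The hospital-scale effect can be demonstrated directly: scale the same proportions up to a larger sample and chi-square grows linearly with n, while Cramer's V never moves. A small sketch with made-up proportions (not the dataset above):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of proportions: rows = group, cols = [legit, fraud]
base = np.array([[990, 10],
                 [980, 20]])

# Same proportions at three sample sizes: chi-square (and significance)
# scales with n, but the effect size V stays fixed.
for scale in (1, 100, 1000):
    ct = base * scale
    chi2, p = stats.chi2_contingency(ct, correction=False)[:2]
    n = ct.sum()
    v = np.sqrt((chi2 / n) / (min(ct.shape) - 1))
    print(f"n={n:>9,}  chi2={chi2:>10.1f}  p={p:.2e}  V={v:.4f}")
```

At n = 2,000 this pattern is not even significant at the 0.05 level; at n = 2,000,000 the p-value is astronomically small. V is identical in every row, because the underlying relationship never changed — only the resolution of the scale did.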
Proxy 2: Account Balance Tier
Now look at the balance tier results. The chi-square statistic is 10,494.67. The p-value is effectively zero. This pattern is real and structural.
Cramer's V is 0.0406. Still NEGLIGIBLE by our threshold but meaningfully stronger than the transaction type result. And in the context of model training, the practical consequence is severe.
The model will see 7,150 High-Balance fraud examples and just 41 Low-Balance fraud examples. It will learn what High-Balance fraud looks like with high confidence. For Low-Balance fraud, it will have almost nothing to anchor its learning.
This is not a flaw in the algorithm we are about to train. It is a problem with the data the algorithm is being asked to learn from. And it is the exact kind of structural problem that a pre-training audit is designed to catch before it becomes a deployed system affecting real users.