DATA AUDIT AND INTEGRITY CHECKS¶
What this section does:
Before we trust any data to train an AI model, we must interrogate it. A model trained on dirty, biased, or misunderstood data will produce dirty, biased, or wrong predictions.
We run five checks in this section:
- Load the dataset and measure basic shape and fraud rate
- Check for missing values — are they random or systematic?
- Check for duplicate rows
- Check for impossible values — negative amounts, structural contradictions
- Check for data leakage — does any feature contain information the model should not have access to at prediction time?
The output of this section is a data quality certificate. If anything fails, we document it and decide how to handle it before proceeding.
# ── SECTION 2: DATA LOADING AND INITIAL SCAN ─────────────────────────────────
import pandas as pd

DATA_FILE = 'PS_20174392719_1491204439457_log.csv'

print('Scanning full dataset for fraud rate...')
print('Scanning in chunks to manage memory on 6M+ row file.\n')

total_rows = 0
total_fraud = 0

# Stream the file in 200k-row chunks so this initial scan never holds
# all 6M+ rows in memory at once.
for chunk in pd.read_csv(DATA_FILE, chunksize=200_000):
    total_rows += len(chunk)
    total_fraud += chunk['isFraud'].sum()

full_fraud_rate = total_fraud / total_rows * 100

print(f'  Total transactions : {total_rows:>12,}')
print(f'  Total fraud cases  : {total_fraud:>12,}')
print(f'  Fraud rate         : {full_fraud_rate:>11.4f}%')
print(f'\n  Approximately 1 in every {round(100 / full_fraud_rate):,} transactions is fraud.')

print(f'\n  Observation: Target variable is highly imbalanced ({full_fraud_rate:.2f}%).')
print('  Standard accuracy is not a valid performance metric for this dataset.')
print('  This justifies the use of MCC and PR-AUC as primary performance metrics.')
print('  It also justifies SMOTE during training to counter the imbalance.')

# Now load the full dataset into memory for detailed inspection
print('\nLoading full dataset into memory for quality checks...')
df = pd.read_csv(DATA_FILE)
print(f'Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns')
print(f'\nColumn names: {df.columns.tolist()}')
Scanning full dataset for fraud rate...
Scanning in chunks to manage memory on 6M+ row file.

  Total transactions :    6,362,620
  Total fraud cases  :        8,213
  Fraud rate         :      0.1291%

  Approximately 1 in every 775 transactions is fraud.

  Observation: Target variable is highly imbalanced (0.13%).
  Standard accuracy is not a valid performance metric for this dataset.
  This justifies the use of MCC and PR-AUC as primary performance metrics.
  It also justifies SMOTE during training to counter the imbalance.

Loading full dataset into memory for quality checks...
Loaded: 6,362,620 rows x 11 columns

Column names: ['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud']
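To make the metric justification concrete, here is a minimal sketch of why accuracy fails at this level of imbalance while MCC and PR-AUC do not. The labels below are simulated at the same 0.13% positive rate, not drawn from the PaySim file, and the all-negative "model" is a toy baseline, not our actual classifier.

# Sketch only: why accuracy misleads on a ~0.13% positive rate.
# Uses simulated labels, not the PaySim data; scikit-learn assumed installed.
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, average_precision_score

rng = np.random.default_rng(42)
y_true = (rng.random(100_000) < 0.0013).astype(int)   # ~0.13% positives
y_pred = np.zeros_like(y_true)                        # predict "not fraud" for everything
y_score = np.zeros(len(y_true), dtype=float)          # constant scores

print(f'Accuracy : {accuracy_score(y_true, y_pred):.4f}')          # ~0.9987, looks excellent
print(f'MCC      : {matthews_corrcoef(y_true, y_pred):.4f}')       # 0.0, exposes the useless model
print(f'PR-AUC   : {average_precision_score(y_true, y_score):.4f}') # ~base rate, also near zero

# During training (a later section), SMOTE from imbalanced-learn would be
# applied to the training split only, e.g.:
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

A do-nothing model scores near-perfect accuracy here, which is exactly why MCC and PR-AUC are the primary metrics for this project.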
# ── SECTION 2: DATA QUALITY REPORT ───────────────────────────────────────────
print('=' * 60)
print('DATA QUALITY REPORT')
print('=' * 60)

# A. Missing values
print('\n[A] Missing Values:')
missing = df.isnull().sum()
if missing.sum() == 0:
    print('  No missing values found across all columns. Data is complete.')
else:
    print('  Missing values found:')
    print(missing[missing > 0])

# B. Duplicate rows
print('\n[B] Duplicate Rows:')
dupes = df.duplicated().sum()
if dupes == 0:
    print('  No duplicate rows found.')
else:
    print(f'  WARNING: {dupes:,} duplicate rows detected. Review required.')

# C. Impossible values
print('\n[C] Impossible Values:')
neg_amounts = (df['amount'] < 0).sum()
neg_old_orig = (df['oldbalanceOrg'] < 0).sum()
neg_new_orig = (df['newbalanceOrig'] < 0).sum()
print(f'  Negative transaction amounts : {neg_amounts:,}  {"Clean" if neg_amounts == 0 else "FLAG"}')
print(f'  Negative origin old balance  : {neg_old_orig:,}  {"Clean" if neg_old_orig == 0 else "FLAG"}')
print(f'  Negative origin new balance  : {neg_new_orig:,}  {"Clean" if neg_new_orig == 0 else "FLAG"}')

# D. Data leakage check
print('\n[D] Data Leakage Check:')
print('  isFlaggedFraud column: is this a leakage risk?')
flagged = df['isFlaggedFraud'].sum()
actual_fraud = df['isFraud'].sum()
print(f'  Actual fraud cases            : {actual_fraud:,}')
print(f'  Cases flagged by existing rule: {flagged:,}')
print(f'  System catch rate             : {flagged / actual_fraud * 100:.2f}%')
print('\n  Decision: isFlaggedFraud will be EXCLUDED from model features.')
print('  Reason: it is a rule-based system output derived from the same data.')
print('  Including it would allow the model to "cheat" by learning from another')
print("  system's predictions rather than raw transaction patterns.")
print(f'  It is also nearly useless: it catches only {flagged / actual_fraud * 100:.2f}% of actual fraud.')

# E. nameOrig and nameDest check
print('\n[E] Customer ID Columns (nameOrig, nameDest):')
print(f'  Unique senders    : {df["nameOrig"].nunique():,}')
print(f'  Unique recipients : {df["nameDest"].nunique():,}')
print('  Decision: Both ID columns will be EXCLUDED from model features.')
print('  Reason: Customer IDs are identifiers, not behavioural signals.')
print('  A model that learns individual IDs will fail on new customers (overfitting).')
print('  It would also raise serious customer privacy concerns under the Data Protection')
print('  Act 2012 (Act 843) and BoG CISD 2026 §99.')

print('\n' + '=' * 60)
print('DATA QUALITY SUMMARY: Dataset is clean and suitable for modelling.')
print('Three columns excluded for leakage/privacy reasons: isFlaggedFraud, nameOrig, nameDest.')
print('=' * 60)
============================================================
DATA QUALITY REPORT
============================================================

[A] Missing Values:
  No missing values found across all columns. Data is complete.

[B] Duplicate Rows:
  No duplicate rows found.

[C] Impossible Values:
  Negative transaction amounts : 0  Clean
  Negative origin old balance  : 0  Clean
  Negative origin new balance  : 0  Clean

[D] Data Leakage Check:
  isFlaggedFraud column: is this a leakage risk?
  Actual fraud cases            : 8,213
  Cases flagged by existing rule: 16
  System catch rate             : 0.19%

  Decision: isFlaggedFraud will be EXCLUDED from model features.
  Reason: it is a rule-based system output derived from the same data.
  Including it would allow the model to "cheat" by learning from another
  system's predictions rather than raw transaction patterns.
  It is also nearly useless: it catches only 0.19% of actual fraud.

[E] Customer ID Columns (nameOrig, nameDest):
  Unique senders    : 6,353,307
  Unique recipients : 2,722,362
  Decision: Both ID columns will be EXCLUDED from model features.
  Reason: Customer IDs are identifiers, not behavioural signals.
  A model that learns individual IDs will fail on new customers (overfitting).
  It would also raise serious customer privacy concerns under the Data Protection
  Act 2012 (Act 843) and BoG CISD 2026 §99.

============================================================
DATA QUALITY SUMMARY: Dataset is clean and suitable for modelling.
Three columns excluded for leakage/privacy reasons: isFlaggedFraud, nameOrig, nameDest.
============================================================
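Check [C] above only covers negative values. The introduction also promised a check for structural contradictions, and one natural form of that is verifying the origin-side balance arithmetic itself. The sketch below is one way to do this; the 0.01 rounding tolerance is our assumption, not a documented property of the dataset.

# Sketch extending check [C]: flag rows where the origin balance arithmetic
# does not add up for debit-type transactions. The 0.01 tolerance for float
# rounding is an assumption.
debits = df[df['type'].isin(['TRANSFER', 'CASH_OUT', 'PAYMENT', 'DEBIT'])]
expected_new = debits['oldbalanceOrg'] - debits['amount']
inconsistent = (expected_new - debits['newbalanceOrig']).abs() > 0.01
print(f'Origin balance contradictions : {inconsistent.sum():,} of {len(debits):,} debit-type rows')

PaySim is known to contain many rows with zeroed-out balances that fail this arithmetic, so a non-zero count here would be documented as a known simulator artefact rather than treated as a hard failure of the audit.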
Data quality findings¶
The dataset is structurally clean: no missing values, no duplicate rows, and no impossible (negative) values were found.
Three columns are excluded from the model before training:
isFlaggedFraud is excluded because it is the output of an existing rule-based system applied to the same data. Including it would allow the model to learn from another system's outputs rather than raw transaction patterns. This is a form of data leakage. It would produce artificially inflated performance metrics that would not generalise to real deployment. Additionally, the existing system only catches 0.19% of actual fraud, making it nearly useless as a signal.
nameOrig and nameDest are customer identifiers. A model that memorises individual customer IDs would fail completely on any new customer who was not in the training data. Beyond the performance problem, using customer IDs as model features raises serious privacy concerns under the Data Protection Act 2012 (Act 843) and the requirements in BoG CISD 2026 §99 on data privacy and confidentiality.
These exclusion decisions are documented here before training begins, consistent with BoG CISD 2026 Annexure E §f(i) on data lineage and provenance.
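For completeness, applying the three exclusions is a single drop before feature engineering begins. The variable names below (EXCLUDED_COLS, model_df) are illustrative, not fixed elsewhere in the notebook.

# Apply the documented exclusions before any feature engineering.
# EXCLUDED_COLS mirrors the decisions in checks [D] and [E] above.
EXCLUDED_COLS = ['isFlaggedFraud', 'nameOrig', 'nameDest']
model_df = df.drop(columns=EXCLUDED_COLS)
print(f'Modelling frame: {model_df.shape[0]:,} rows x {model_df.shape[1]} columns')
print(f'Excluded for leakage/privacy: {EXCLUDED_COLS}')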