FEATURE ENGINEERING¶
What this section does:
Feature engineering is where raw data gets transformed into the inputs a machine learning model can learn from. This is also where many bias problems get introduced or amplified without anyone noticing.
We document every transformation here, explain why it was made, and flag any risk it carries.
A note on encoding:
You might wonder why we do not one-hot encode the transaction type column. One-hot encoding converts a categorical column with N categories into N separate binary columns. For a 5-category column like type, that means 5 new columns.
For linear models this is necessary because those models assume numeric inputs have ordinal meaning, i.e. that the number 3 is "higher" than 2. Encoding TRANSFER as 4 and PAYMENT as 3 would imply TRANSFER is somehow greater than PAYMENT, which is nonsensical.
XGBoost does not have this problem. It builds decision trees by finding the best split point for each feature at each node. It discovers the optimal split empirically from the data without assuming numeric order means anything. One-hot encoding with XGBoost would add 4 extra columns, increase memory use, and slow training with no performance benefit. Label encoding is the correct choice here.
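The column-count difference is easy to see on a toy frame. This is a minimal sketch: `toy` is illustrative data mirroring the five values of the type column, not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the five transaction types from this dataset.
toy = pd.DataFrame({'type': ['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']})

# Label encoding: the column stays a single column of integer codes.
labeled = LabelEncoder().fit_transform(toy['type'])
print(labeled)          # [0 1 2 3 4]

# One-hot encoding: the same column becomes five binary columns.
onehot = pd.get_dummies(toy['type'])
print(onehot.shape)     # (5, 5)
```

Note that LabelEncoder assigns codes in sorted order of the class labels, which is why CASH_IN gets 0 and TRANSFER gets 4.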
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
# Balance-tier edges and labels for the balance_group fairness proxy
# (used for bias testing only, never as model features).
bins_proxy = [-np.inf, 0.0, 50397.0, np.inf]
labels_proxy = ['Low-Balance', 'Mid-Balance', 'High-Balance']

# PaySim transaction log
DATA_FILE = 'PS_20174392719_1491204439457_log.csv'
df = pd.read_csv(DATA_FILE)
# ── SECTION 5: FEATURE ENGINEERING ───────────────────────────────────────────
print('Step 1: Encoding transaction type...')
le = LabelEncoder()
df['type_encoded'] = le.fit_transform(df['type'])
print(f' Encoding map: {dict(zip(le.classes_, le.transform(le.classes_)))}')
print(' Method: Label Encoding (not one-hot)')
print(' Reason: XGBoost uses tree-based splits and does not need ordinal separation.')
print('\nStep 2: Engineering balance-derived features...')
print(' These capture the behavioural signature of fraud.')
print(' Fraud often involves abnormal movement or disappearance of funds.\n')
# Balance movement features
df['balance_diff_orig'] = df['oldbalanceOrg'] - df['newbalanceOrig']
df['balance_diff_dest'] = df['newbalanceDest'] - df['oldbalanceDest']
print(' balance_diff_orig: Drop in sender balance.')
print(' Large positive values may indicate account drainage.')
# Ratio of the transaction amount to the sender's pre-transaction balance;
# the 1e-6 term guards against instability at very small balances.
df['amount_to_orig_balance'] = np.where(
    df['oldbalanceOrg'] > 0,
    df['amount'] / (df['oldbalanceOrg'] + 1e-6),
    0
)
print('\n amount_to_orig_balance: Fraction of sender balance being moved.')
print(' Values close to 1.0 indicate full account depletion.')
df['orig_balance_zeroed'] = (df['newbalanceOrig'] == 0).astype(int)
print('\n orig_balance_zeroed: Did sender account end at zero?')
print(' Captures account depletion behavior seen more often in fraud.')
df['dest_balance_zeroed'] = (df['oldbalanceDest'] == 0).astype(int)
print('\n dest_balance_zeroed: Was destination empty before receiving funds?')
print('\n hour_of_day: Already engineered in the EDA section.')
print(' Captures time-based fraud patterns within a 24-hour cycle.')
print('\nStep 3: Defining model features...')
FEATURES = [
    'step', 'type_encoded', 'amount',
    'hour_of_day',
    'oldbalanceOrg', 'newbalanceOrig',
    'oldbalanceDest', 'newbalanceDest',
    'balance_diff_orig', 'balance_diff_dest',
    'amount_to_orig_balance',
    'orig_balance_zeroed', 'dest_balance_zeroed'
]
print(f' Total features: {len(FEATURES)}')
for f in FEATURES:
    print(f' {f}')
# EXCLUDED features — document why
EXCLUDED = {
    'isFlaggedFraud': 'Data leakage — another system output on same data',
    'nameOrig': 'Customer ID — overfitting and privacy risk (BoG CISD 2026 §99)',
    'nameDest': 'Customer ID — overfitting and privacy risk (BoG CISD 2026 §99)',
    'type': 'Raw string replaced by type_encoded (numeric version)',
    'tx_type_group': 'Proxy variable — fairness testing only, never a model feature',
    'balance_group': 'Proxy variable — fairness testing only, never a model feature',
}
print('\nExcluded from model (with reasons):')
for col, reason in EXCLUDED.items():
    print(f' {col:<20} {reason}')
print('\nFeature engineering complete.')
Step 1: Encoding transaction type...
Encoding map: {'CASH_IN': np.int64(0), 'CASH_OUT': np.int64(1), 'DEBIT': np.int64(2), 'PAYMENT': np.int64(3), 'TRANSFER': np.int64(4)}
Method: Label Encoding (not one-hot)
Reason: XGBoost uses tree-based splits and does not need ordinal separation.
Step 2: Engineering balance-derived features...
These capture the behavioural signature of fraud.
Fraud often involves abnormal movement or disappearance of funds.
balance_diff_orig: Drop in sender balance.
Large positive values may indicate account drainage.
amount_to_orig_balance: Fraction of sender balance being moved.
Values close to 1.0 indicate full account depletion.
orig_balance_zeroed: Did sender account end at zero?
Captures account depletion behavior seen more often in fraud.
dest_balance_zeroed: Was destination empty before receiving funds?
hour_of_day: Already engineered in the EDA section.
Captures time-based fraud patterns within a 24-hour cycle.
Step 3: Defining model features...
Total features: 13
step
type_encoded
amount
hour_of_day
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
balance_diff_orig
balance_diff_dest
amount_to_orig_balance
orig_balance_zeroed
dest_balance_zeroed
Excluded from model (with reasons):
isFlaggedFraud Data leakage — another system output on same data
nameOrig Customer ID — overfitting and privacy risk (BoG CISD 2026 §99)
nameDest Customer ID — overfitting and privacy risk (BoG CISD 2026 §99)
type Raw string replaced by type_encoded (numeric version)
tx_type_group Proxy variable — fairness testing only, never a model feature
balance_group Proxy variable — fairness testing only, never a model feature
Feature engineering complete.
Feature Engineering Audit Summary¶
13 features will be used for training. All are derived from legitimate transaction behaviour. None encode protected attributes directly.
Two proxy variables (tx_type_group, balance_group) are deliberately excluded from model features. The model must never see the variables we use to test it: if we trained on them, we would be shaping the model around the audit rather than auditing it.
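A cheap guard against accidental leakage is to assert the separation in code. This is a sketch; the feature list and proxy names are copied from this section.

```python
# Model feature list and proxy variables, as defined in this section.
FEATURES = [
    'step', 'type_encoded', 'amount', 'hour_of_day',
    'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest',
    'balance_diff_orig', 'balance_diff_dest', 'amount_to_orig_balance',
    'orig_balance_zeroed', 'dest_balance_zeroed',
]
PROXIES = ['tx_type_group', 'balance_group']

# Fail loudly if a proxy ever slips into the model's inputs.
leaked = [p for p in PROXIES if p in FEATURES]
assert not leaked, f'Proxy variables leaked into FEATURES: {leaked}'
print('No proxy leakage.')
```

Running a check like this before every training run makes the exclusion an enforced invariant rather than a convention.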
One documented risk: the feature amount_to_orig_balance uses the sender's account balance as its denominator. For a high-balance user, a large transaction produces a small ratio; for a low-balance user, the same transaction produces a large ratio that looks like an account drain. This creates an implicit correlation between the feature and our balance-tier proxy.
This is not a reason to exclude the feature; it is a genuine fraud signal. But it is a reason to watch what SHAP tells us about it. If the model weights this feature heavily, it may be partially driving the fairness failure we expect to find in the fairness and bias section.
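The mechanism is easy to demonstrate on synthetic data. The balances and amounts below are randomly generated stand-ins, not the real dataset; only the tier edges are copied from bins_proxy above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the transaction frame.
df = pd.DataFrame({
    'oldbalanceOrg': rng.uniform(0, 200_000, 10_000),
    'amount': rng.uniform(100, 50_000, 10_000),
})
df['amount_to_orig_balance'] = np.where(
    df['oldbalanceOrg'] > 0,
    df['amount'] / (df['oldbalanceOrg'] + 1e-6),
    0,
)

# Tier edges copied from the section's bins_proxy.
bins = [-np.inf, 0.0, 50397.0, np.inf]
labels = ['Low-Balance', 'Mid-Balance', 'High-Balance']
df['balance_group'] = pd.cut(df['oldbalanceOrg'], bins=bins, labels=labels)

# Per-tier means make the implicit correlation visible:
# lower balance tiers produce systematically larger ratios.
print(df.groupby('balance_group', observed=True)['amount_to_orig_balance'].mean())
```

Even with amounts drawn independently of balances, the Mid-Balance tier's mean ratio exceeds the High-Balance tier's, purely because of the denominator. This is the feature-to-proxy correlation SHAP should be read against.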