Who Does AI Protect?¶
A Fairness Audit of Mobile Money Fraud Detection in Ghana's Informal Economy¶
Kwadwo Amponsah
ClearBoxAI / Baystate Bank
kwadwo_amponsah@yahoo.com
Track: Practitioner | Format: Conference | April 2026
ClearBoxAI · Audit CBA-2026-002 · AI Summit Ghana 2026
Abstract¶
Ghana lost GHS 346 million to mobile money fraud in 2023. Financial institutions have responded by deploying machine learning models to detect fraud automatically. These systems make decisions in milliseconds, clearing or flagging transactions before any human review occurs. The question this paper asks is whether these systems work equally well for all users.
This paper presents an independent fairness and performance audit of an XGBoost fraud detection model trained on 6.3 million mobile money transactions from the PaySim simulation dataset. The audit applies SHAP values for explainability and evaluates findings against the Bank of Ghana Cyber and Information Security Directive 2026, NIST AI RMF, and the EU AI Act.
The headline finding is a confirmed fairness failure. The model achieves 99.27% overall fraud recall but shows a True Positive Rate of 66.67% for low-balance users compared to 99.72% for high-balance users. The Equalized Odds Difference is 0.3305, more than three times the 0.10 threshold. The miss rate disparity is 33.3% versus 0.3%, a 110-fold gap. The root cause is structural: only 41 of 8,213 fraud cases in the dataset come from low-balance accounts, leaving the model with almost no examples of fraud affecting that group.
These findings are evaluated against the Bank of Ghana CISD 2026, which mandates fairness testing before AI deployment and would trigger mandatory review and potential suspension under Annexure E Section l(i) when material bias is detected. The audit methodology is fully reproducible and directly transferable to live Ghanaian datasets.
Keywords: AI fairness audit, mobile money fraud detection, model bias, equalized odds, Bank of Ghana CISD 2026, financial inclusion, informal economy, XGBoost, SHAP explainability, PaySim
1. Introduction¶
Mobile money has changed how people in Ghana manage their finances. MTN Mobile Money, Telecel Cash, and AirtelTigo Money together serve millions of users, many of whom have no traditional bank account. For a market trader in Kumasi who collects daily payments through her phone, mobile money is not a convenience. It is her bank.
As transaction volumes have grown, so has fraud. Ghana lost GHS 346 million to mobile money fraud in 2023. Financial institutions have responded by deploying machine learning models to detect fraudulent transactions automatically. These models make decisions in milliseconds, flagging transactions as suspicious or clearing them as legitimate before any human intervention.
The promise is faster, more accurate fraud detection. The question this paper asks is whether that promise holds equally for all users.
Machine learning models learn from data. When a fraud detection model is trained, it studies historical examples of fraud and builds a picture of what fraud looks like. If the training data contains far more fraud examples from wealthy, high-balance accounts than from low-balance informal economy accounts, the model becomes expert at protecting the first group and unreliable for the second. It does not discriminate intentionally. It learns what it is shown.
This creates a problem that does not show up in standard performance reports. A model can achieve outstanding accuracy metrics while simultaneously failing an entire segment of users, because those users contribute so few fraud cases to the overall count that their failures barely register in aggregate numbers. This is the accuracy paradox, and it is the central concern this paper addresses.
The broader implication connects directly to the conference theme of economic growth and inclusion. If mobile money AI systems protect wealthier accounts more reliably than low-balance ones, the infrastructure built to extend financial access to Ghana's informal economy is quietly delivering different levels of safety depending on how much money a user already has. Adoption of mobile money in the informal sector depends on trust. If trust erodes because users experience unaddressed fraud, digitization of the informal economy slows, and that has measurable GDP implications.
This paper presents a complete practitioner audit of an XGBoost fraud detection model trained on the PaySim mobile money simulation dataset. It documents a confirmed and measurable fairness failure, traces the root cause to a structural data coverage problem, evaluates the finding against the Bank of Ghana Cyber and Information Security Directive 2026, and provides a replicable methodology for institutions required to conduct this testing.
The paper proceeds as follows. Section 2 describes the dataset and audit methodology. Section 3 presents results across performance and fairness metrics. Section 4 discusses implications for financial inclusion, regulatory compliance, and the compounding nature of the problem. Section 5 concludes with recommendations.
2. Methodology¶
2.1 Dataset¶
The PaySim synthetic mobile money dataset contains 6,362,620 simulated transactions generated from real transaction logs of a mobile financial service operating across 14 African countries. The simulation replicates the transaction types, balance mechanics, and fraud patterns of real mobile money systems. It contains five transaction types: CASH_OUT, TRANSFER, PAYMENT, CASH_IN, and DEBIT. The fraud rate is 0.13%, with 8,213 confirmed fraud cases out of 6.3 million total transactions.
Real mobile money transaction data from Ghanaian operators is not publicly available due to confidentiality and commercial restrictions. Researchers at Makerere University encountered the same barrier when conducting fraud detection research in Uganda, requiring them to build a synthetic dataset because operators declined to share real transaction logs (Azamuke et al., 2025). PaySim was built specifically to address this gap for the African mobile money context.
This audit therefore demonstrates a credible and measurable risk pattern that is likely present in Ghanaian systems: the structural data distribution found in PaySim (specifically, the concentration of fraud examples in high-balance accounts) reflects the same data collection dynamics that operate in live operator environments. The methodology is directly transferable to any live Ghanaian dataset the moment access becomes available.
2.2 Why This Problem Is Hard in Africa¶
The low fraud count for low-balance users in this dataset is not simply a matter of those users experiencing less fraud. Several structural factors reduce how many low-balance fraud cases appear in any mobile money training dataset in the African context.
Underreporting is significant. Informal economy users may not report fraud when it happens due to limited trust in resolution mechanisms, the perceived cost of pursuing small amounts, or simply not knowing how. Fraud that goes unreported never enters training data as a confirmed case.
Detection and labelling gaps compound the problem. Fraud must be detected and confirmed to enter training data. Existing systems that are already weaker at identifying low-value patterns produce fewer confirmed low-balance fraud labels, creating a feedback loop that this audit traces through from data to model to deployment consequences.
Digital footprint thinness matters as well. Informal users often transact less frequently and with less consistent patterns. This makes their behaviour harder to characterise and their fraud harder to distinguish from legitimate anomalous activity.
These factors mean the 41 low-balance fraud cases in this dataset are not the true number of fraud events affecting low-balance users. They are the number that was successfully detected, labelled, and recorded. The actual number is almost certainly higher and unknown.
2.3 Proxy Variable Construction¶
PaySim contains no demographic information. There is no column for income level, occupation, or geographic location. To test fairness across economically meaningful groups, two proxy variables were constructed, consistent with Bank of Ghana CISD 2026 Annexure E Section e(i)(3), which requires assessment of risks of bias and unfair outcomes based on proxy attributes when protected class data is not directly available.
These proxies are imperfect and may not fully capture socioeconomic status in every case, but they provide a practical approximation consistent with regulatory guidance and established practice in proxy-based fairness auditing.
The first proxy uses transaction type as a stand-in for economic role. CASH_OUT transactions are associated with informal economy users including market traders and gig workers, for whom cash withdrawal is the primary mode of accessing and spending mobile money. All other transaction types are grouped as OTHER, representing a broader mix of activity.
The second proxy uses pre-transaction account balance as a stand-in for wealth. Accounts with a balance of exactly GHS 0.00 at the time of the transaction are classified as Low-Balance. Accounts between GHS 0.01 and GHS 50,397 are classified as Mid-Balance. Accounts above GHS 50,397 are classified as High-Balance.
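The two proxies above can be sketched in a few lines of pandas. This is an illustrative reconstruction, not the audit notebook itself; the column names `type` and `oldbalanceOrg` follow the PaySim schema, and the GHS 50,397 cut point is the tier boundary stated above.

```python
import pandas as pd

def add_proxies(df: pd.DataFrame) -> pd.DataFrame:
    # Attach the two audit proxies to a PaySim-style frame.
    # Proxy 1: economic role, keyed off transaction type.
    # Proxy 2: balance tier, keyed off pre-transaction sender balance.
    df = df.copy()
    df["econ_role"] = df["type"].where(df["type"] == "CASH_OUT", "OTHER")
    df["balance_tier"] = pd.cut(
        df["oldbalanceOrg"],
        bins=[-0.001, 0.0, 50_397.0, float("inf")],  # {0}, (0, 50397], (50397, inf)
        labels=["Low-Balance", "Mid-Balance", "High-Balance"],
    )
    return df
```

The proxy columns are stored alongside the data for evaluation only; as noted in Section 2.5, they are never passed to the model as features.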
2.4 Audit Structure and Threshold Design¶
The audit follows a 13-section methodology covering data integrity checks, pre-training proxy detection, exploratory data analysis, feature engineering, model training, performance evaluation, fairness and bias measurement, error analysis, explainability using SHAP values (Lundberg and Lee, 2017), deployment readiness assessment, monitoring planning, and a final scorecard.
All pass and fail thresholds were defined before any model results were examined. This prevents threshold adjustment after the fact, which would undermine the integrity of the audit. The fairness library used throughout is Fairlearn, an industry-standard open-source toolkit developed by Microsoft Research for measuring and improving fairness in machine learning systems.
The performance thresholds were: MCC above 0.50, Log Loss below 0.40, PR-AUC above 0.70, KS Statistic above 0.30. The fairness thresholds were: absolute SPD below 0.10 and absolute EOD below 0.10. These thresholds are consistent with BoG CISD 2026 Annexure E §g(iii)(3), which requires fairness testing against defined criteria before deployment.
2.5 Model Design¶
An XGBoost binary classifier (Chen and Guestrin, 2016) was trained with 300 decision trees, a maximum depth of 6, and a learning rate of 0.1. The training set comprised 5,090,096 transactions (80% of the dataset) and the test set comprised 1,272,524 transactions (20%). SMOTE oversampling (Chawla et al., 2002) was applied to the training data to address the overall class imbalance.
Thirteen features were used for training, all derived from legitimate transaction behaviour. Three columns were excluded: isFlaggedFraud (data leakage, being the output of an existing rule-based system on the same data), nameOrig, and nameDest (customer identifiers creating overfitting risk and raising privacy concerns under the Data Protection Act 2012 and BoG CISD 2026 Section 99). The two proxy variables were stored separately and never shown to the model.
2.6 Fairness Metrics¶
Statistical Parity Difference (SPD) measures whether the model predicts fraud at the same rate across groups. Equalized Odds Difference (EOD) measures whether the model makes errors of the same quality across groups, comparing True Positive Rates and False Positive Rates simultaneously.
EOD was treated as the primary fairness metric for this audit because the central concern is whether all user groups receive equally reliable fraud detection, not merely whether they are flagged at equal rates. A model could pass SPD while failing EOD if it flags all groups equally but catches fraud reliably only for some groups.
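The EOD calculation can be written out directly, which makes the definition concrete: take the spread (max minus min) of TPR across groups, the spread of FPR across groups, and report the larger of the two. This is the same definition Fairlearn's `equalized_odds_difference` implements; the NumPy sketch below is illustrative and assumes every group contains both fraud and non-fraud cases.

```python
import numpy as np

def group_rates(y_true, y_pred):
    # TPR and FPR for one group of transactions (binary 0/1 arrays).
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = y_pred[y_true == 1].mean()
    fpr = y_pred[y_true == 0].mean()
    return tpr, fpr

def equalized_odds_difference(y_true, y_pred, groups):
    # Largest across-group spread in either TPR or FPR.
    y_true, y_pred, groups = (np.asarray(a) for a in (y_true, y_pred, groups))
    tprs, fprs = [], []
    for g in np.unique(groups):
        tpr, fpr = group_rates(y_true[groups == g], y_pred[groups == g])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

Note how a model can pass SPD and fail EOD with this definition: equal flag rates across groups say nothing about whether the flags land on actual fraud within each group.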
2.7 Regulatory Framework¶
Three regulatory frameworks govern the evaluation in this paper. The Bank of Ghana Cyber and Information Security Directive 2026 is the primary instrument. It is a binding directive applicable to all Regulated Financial Institutions in Ghana deploying AI systems. Key provisions include Annexure E Section e(i)(3) requiring proxy-based fairness assessment, Section g(iii)(3) requiring fairness testing before deployment, Section l(i) on bias mitigation and suspension requirements, and Section 115(2)(b) on notification obligations.
The NIST AI Risk Management Framework 1.0 (NIST, 2023) provides the measurement framework, specifically MEASURE 2.11 which requires fairness and bias evaluation results to be documented. The EU AI Act serves as a reference classification standard, particularly Article 6(3) which classifies AI systems that perform profiling of natural persons as High-Risk, a definition that covers fraud scoring systems.
| Instrument | Provision | Relevance to This Audit |
|---|---|---|
| BoG CISD 2026, Annexure E §e(i)(3) | Proxy-based fairness risk assessment required | Justifies proxy variable methodology |
| BoG CISD 2026, Annexure E §g(iii)(3) | Fairness testing required before deployment | Mandates the testing this audit performs |
| BoG CISD 2026, Annexure E §l(i) | Material bias triggers review and potential suspension | Regulatory consequence of EOD finding |
| BoG CISD 2026, §115(2)(b) | Notify BoG of systemic bias or customer harm | Notification obligation triggered |
| NIST AI RMF 1.0, MEASURE 2.11 | Fairness and bias evaluated and documented | Measurement framework |
| EU AI Act, Article 6(3) | Systems performing user profiling are High-Risk | Risk classification reference |
3. Results¶
3.1 Data Audit Findings¶
The dataset was structurally clean. No missing values, duplicates, or impossible values were found across 6.3 million rows. Three columns were excluded for leakage or privacy reasons as described in Section 2.5.
Pre-training proxy detection confirmed that fraud is not evenly distributed across proxy groups. A chi-square test on the balance tier proxy produced a statistic of 10,494.67 with a p-value of effectively zero, confirming that the distributional imbalance is structural rather than random. Cramér's V was 0.0406, indicating a weak practical association in isolation but severe training implications given the absolute case counts involved.
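The chi-square test and effect size reported above can be reproduced from a tier-by-outcome contingency table with SciPy. The sketch below is illustrative; the per-tier non-fraud counts are not reproduced in this paper, so the test data uses hypothetical counts rather than the audit's actual table.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_and_cramers_v(table):
    # Chi-square test of independence plus Cramér's V effect size for a
    # contingency table of counts (rows: balance tier, columns: fraud /
    # non-fraud). V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
    table = np.asarray(table, dtype=float)
    chi2, p, dof, expected = chi2_contingency(table)
    n = table.sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return chi2, p, v
```

As the audit notes, a small V alongside an enormous chi-square statistic is exactly what a structural but proportionally small imbalance looks like at 6.3 million rows.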
The core finding from exploratory analysis (referred to throughout this audit as the Invisible 41) is that only 41 of the 8,213 total fraud cases in the dataset come from Low-Balance accounts. That is 0.50% of all fraud cases. High-Balance accounts contribute 7,150 fraud cases, or 87.06% of total fraud. After the 80/20 train-test split, the training set contained approximately 35 Low-Balance fraud cases and 5,738 High-Balance fraud cases, a ratio of 164 to 1.
The fact that only 6 Low-Balance fraud cases appear in the test set is not a problem with the audit design. It is the primary evidence of the structural exclusion this audit is measuring. The system produces so few Low-Balance fraud labels that even a 20% test split returns only 6 cases. That scarcity is the finding.
Fraud is also structurally confined to two transaction types. Of all TRANSFER transactions in the dataset, 0.769% are fraudulent. Of all CASH_OUT transactions, 0.184% are fraudulent. PAYMENT, CASH_IN, and DEBIT transactions have zero fraud cases across 6.3 million transactions combined.
| Balance Group | Fraud Cases | Share of All Fraud |
|---|---|---|
| High-Balance (above GHS 50,397) | 7,150 | 87.06% |
| Mid-Balance (GHS 0.01 to 50,397) | 1,022 | 12.44% |
| Low-Balance (GHS 0.00) | 41 | 0.50% |
3.2 Performance Results¶
Three of four performance metrics pass. MCC fails because the model generates 6,617 false positives against 1,631 true positives, producing a precision of 19.77% at the default 0.5 threshold. Recall is 99.27%, meaning the model catches nearly all fraud in aggregate. It is worth acknowledging this directly: at 19.77% precision, roughly four out of five fraud flags are false alarms. If deployed, this operational inefficiency compounds the fairness issue, since the fairness failure occurs within an already imperfect system rather than an otherwise optimal one.
| Metric | Value | Threshold | Result |
|---|---|---|---|
| Matthews Correlation Coefficient (MCC) | 0.4419 | above 0.50 | FAIL |
| Log Loss | 0.0135 | below 0.40 | PASS |
| Precision-Recall AUC (PR-AUC) | 0.9357 | above 0.70 | PASS |
| KS Statistic | 0.9916 | above 0.30 | PASS |
The per-group MCC breakdown reveals the structural gradient. High-Balance users receive MCC 0.6155 on 1,412 test fraud cases. Mid-Balance users receive MCC 0.2192 on 225 test fraud cases. Low-Balance users receive MCC 0.1245 on only 6 test fraud cases, barely above random guessing. The gradient is direct: the more fraud examples a group contributes to training, the better the model performs for that group.
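The per-group breakdown is computed by scoring each proxy group separately with the same metric. A minimal sketch using scikit-learn's `matthews_corrcoef` (the function name and grouping logic here are illustrative, not the audit notebook's exact code):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def per_group_mcc(y_true, y_pred, groups):
    # MCC computed within each proxy group separately, mirroring the
    # per-tier gradient reported above. Groups with very few positives
    # (like the 6 Low-Balance test fraud cases) yield unstable values,
    # which is itself part of the finding.
    y_true, y_pred, groups = (np.asarray(a) for a in (y_true, y_pred, groups))
    return {g: matthews_corrcoef(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}
```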
3.3 Fairness Results¶
The transaction type proxy passes both fairness metrics. CASH_OUT users have a TPR of 99.00% versus 99.53% for OTHER users, a difference of 0.53 percentage points, well within acceptable limits. The model does not discriminate against informal economy users at the transaction type level.
The balance tier proxy produces the critical finding.
| Metric | Value | Threshold | Result |
|---|---|---|---|
| SPD (Transaction Type) | +0.0026 | below 0.10 | PASS |
| EOD (Transaction Type) | +0.0053 | below 0.10 | PASS |
| SPD (Balance Tier) | +0.0101 | below 0.10 | PASS |
| EOD (Balance Tier) | +0.3305 | below 0.10 | FAIL |
The EOD of 0.3305 is more than three times the 0.10 threshold. The per-group TPR breakdown shows the full extent of the disparity.
| Group | Actual Fraud (Test) | Recall Catch Rate (TPR) | Miss Rate (FNR) |
|---|---|---|---|
| Low-Balance | 6 | 0.6667 | 33.3% |
| Mid-Balance | 225 | 0.9733 | 2.7% |
| High-Balance | 1,412 | 0.9972 | 0.3% |
The miss rate disparity between Low-Balance and High-Balance users is 110-fold. One in three fraud cases targeting a low-balance user goes undetected. For high-balance users, the model misses fewer than one in three hundred.
3.4 Error Analysis¶
False negatives are not evenly distributed. Low-Balance users have a 33.3% internal miss rate. Mid-Balance users have a 2.7% miss rate. High-Balance users have a 0.3% miss rate.
False positives in absolute terms concentrate in Mid-Balance and High-Balance accounts, reflecting group size. The false positive rate for Low-Balance users is actually the lowest of the three groups at 0.040%. This appears reassuring until the underlying reason is considered. The model has learned so little about Low-Balance transaction patterns that it defaults to calling almost everything they do legitimate, including actual fraud. The low false positive rate is not careful protection. It is indifference.
3.5 Explainability¶
SHAP values (Lundberg and Lee, 2017) were computed on a stratified sample of 2,000 test transactions. The top five features by mean absolute SHAP value were newbalanceOrig (5.2540), oldbalanceOrg (4.9669), amount (2.2655), step (0.7908), and oldbalanceDest (0.5503).
All five are legitimate fraud signals. None encode protected attributes directly. The model is not using anything it should not be using.
However, the two highest-weighted features are both raw balance values. Together, they track one thing: how much money was in the account before and after the transaction. For a High-Balance user with GHS 100,000, even a large transfer of GHS 50,000 leaves plenty of money behind. The model sees a normal account and moves on. For a Low-Balance user with GHS 500 buying groceries from a market trader, that single payment can bring the account from GHS 500 to zero. The model sees an emptied account and flags the transaction as fraud, because that is exactly the pattern it learned to detect. The user did nothing wrong, but the outcome looks identical to fraud. High-balance users rarely trigger this because their accounts almost never hit zero, no matter how much they spend. The bias is not intentional. It is built into what it means to have a low balance.
But 56.68% of all legitimate transactions also result in a zero origin balance, because many normal transactions (particularly payments to merchants) drain small accounts entirely. The model needs enough Low-Balance fraud examples to learn that this pattern means something different in small accounts than in large ones. It had 35 training examples. It never learned that distinction.
This is the mechanism connecting the feature engineering risk documented before training, the fairness failure measured in Section 3.3, and the error distribution in Section 3.4 into one coherent chain. The features are legitimate. The training data coverage was insufficient for one group. The consequence is a model that cannot reliably contextualise its own signals for Low-Balance users.
| Feature | Mean absolute SHAP | What It Measures |
|---|---|---|
| newbalanceOrig | 5.2540 | Sender balance after transaction. Full drain is a fraud signal. |
| oldbalanceOrg | 4.9669 | Sender balance before transaction. Together with above, measures liquidation. |
| amount | 2.2655 | Raw transaction size. Large amounts carry elevated fraud risk. |
| step | 0.7908 | Time step of transaction. Fraud clusters at specific hours. |
| oldbalanceDest | 0.5503 | Recipient balance before transfer. Empty staging accounts are a fraud pattern. |
3.6 Deployment Readiness¶
Threshold sensitivity analysis showed that lowering the decision threshold from 0.5 to 0.1 improves Low-Balance TPR from 0.6667 to 0.8333. However, this comes at a cost of approximately 11,000 additional false positives to achieve one additional Low-Balance fraud catch. That tradeoff is not operationally acceptable without a more targeted intervention.
Against the Bank of Ghana CISD 2026 deployment checklist, the model did not meet the requirements for production deployment. Material bias has been found, a monitoring and drift plan is needed, human oversight mechanisms must be defined, and a dispute resolution process for AI decisions must be in place.
4. Discussion¶
4.1 The Root Cause Is Data Coverage, Not Algorithm Failure¶
The fairness failure documented in this audit is not caused by a flawed algorithm. XGBoost is an appropriate and well-validated choice for fraud detection, widely used both in Ghanaian banking and in financial services globally. The failure is caused by the data the algorithm was trained on.
When 87% of fraud examples come from one type of user, the model naturally learns what fraud looks like for that user and fails to generalize to others. SMOTE was applied to address the overall class imbalance, but it does not solve within-fraud imbalance across groups. SMOTE generates synthetic fraud examples by interpolating between existing ones. With only 35 Low-Balance fraud cases in the training set, it produced almost no synthetic Low-Balance fraud. The 164:1 ratio between High-Balance and Low-Balance fraud examples in training was not corrected. If anything, the synthetic generation amplified the existing skew by producing thousands of additional High-Balance fraud examples.
This means the problem cannot be solved by switching algorithms or by applying standard oversampling. It requires deliberate intervention in either the training data itself or in how predictions are calibrated per group after training.
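One form of deliberate training-data intervention is group-aware resampling: oversampling the under-represented group's fraud cases specifically, rather than fraud overall. The sketch below uses simple duplication with replacement rather than SMOTE-style interpolation (with 35 real cases, interpolation has little to work with); the function and parameter names are illustrative, and it assumes the target group has at least one fraud example.

```python
import numpy as np

def oversample_group_fraud(X, y, groups, target_group, target_count, seed=0):
    # Duplicate fraud rows from one under-represented group, with
    # replacement, until that group contributes `target_count` fraud
    # examples. A crude stand-in for group-aware resampling; real
    # remediation would pair this with targeted data collection.
    rng = np.random.default_rng(seed)
    X, y, groups = (np.asarray(a) for a in (X, y, groups))
    mask = (y == 1) & (groups == target_group)
    n_need = max(0, target_count - int(mask.sum()))
    idx = rng.choice(np.flatnonzero(mask), size=n_need, replace=True)
    return (np.concatenate([X, X[idx]]),
            np.concatenate([y, y[idx]]),
            np.concatenate([groups, groups[idx]]))
```

Duplication only reweights the few examples that exist; it cannot invent the missing diversity of Low-Balance fraud patterns, which is why targeted data collection remains the primary fix.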
4.2 The Compounding Effect¶
The problem is not static. In production, when the model misses a Low-Balance fraud case, that transaction is recorded in system logs as legitimate. When the model is retrained on updated data, it learns from its own past mistakes. Missed cases get labelled as normal. The current miss rate for Low-Balance users is 33.3%. Left unmonitored, that gap will widen with every retraining cycle because the feedback loop continuously reinforces the model's existing blind spot.
This is compounded by the underreporting dynamic described in Section 2.2. Low-balance informal users may not report fraud when it happens. Fraud that goes unreported never enters the training data as a confirmed fraud case. The effective Low-Balance fraud sample is already smaller than the 41 cases in this dataset suggest, and retraining without intervention will make it smaller still.
4.3 Regulatory Implications¶
The Bank of Ghana Cyber and Information Security Directive 2026 contains specific and binding requirements for AI systems deployed in financial services.
Annexure E Section e(i)(3) requires institutions to assess risks of bias and unfair outcomes based on proxy attributes. This audit demonstrates what that assessment looks like in practice using a replicable, documented methodology.
Annexure E Section g(iii)(3) requires fairness testing across relevant demographic groups before any AI model is deployed. The 13-section methodology presented here constitutes that testing.
Annexure E Section l(i) states that if material bias is detected, the model must be reviewed and potentially suspended until effective mitigation is applied. An EOD of 0.3305 against a threshold of 0.10 would likely trigger that review under this directive.
The governance gap is confirmed by this audit. As noted in a post on the Ghana AI Summit website titled "3 Reasons African Mobile Money Programs Have a $1.7 Trillion AI Problem" (2026), no independent published audit of this kind exists for any major African mobile money fraud detection system. The required testing is either not being done, or if it is, the results are not disclosed publicly. This audit shows what proper disclosure looks like and why it matters.
4.4 Comparison with International Standards¶
A brief comparison with international frameworks illustrates both the strength of Ghana's regulatory approach and the gap between requirements and practice.
The EU AI Act 2024/1689 classifies credit scoring AI as High-Risk under Annex III Section 5(b) and fraud scoring under Article 6(3), requiring explainability, documentation, and human oversight before deployment. The BoG CISD 2026 aligns with this classification implicitly through its requirements on fairness testing, explainability, and bias mitigation. Ghana's framework is not aspirational. It is operationally equivalent to the EU standard for the AI systems it covers.
NIST AI RMF provides a governance structure across four functions: GOVERN, MAP, MEASURE, and MANAGE. The 13-section audit methodology in this paper maps directly onto the MEASURE function, specifically MEASURE 2.11 on fairness and bias documentation, and feeds the MANAGE function with actionable risk register findings.
The difference between Ghana and the EU or US context is not the quality of the regulatory framework. It is the current absence of enforcement mechanisms and independent audit culture. This paper argues that building that culture begins with practitioners demonstrating what compliant testing looks like before it is mandated by examination.
4.5 Implications for Economic Growth and Inclusion¶
The connection between this audit finding and the conference theme of economic growth and inclusion is direct and not theoretical.
Mobile money adoption in Ghana's informal economy depends on trust. If market traders, smallholder farmers, and gig workers experience fraud that their mobile money provider's AI system fails to detect, and they have no visibility into why the system failed them, trust erodes. Eroded trust slows adoption. Slower adoption in the informal sector means less digitization of informal economic activity, which means less data, less credit access, and weaker integration with formal financial services.
The World Bank has documented that financial inclusion is one of the strongest enablers of GDP growth in low- and middle-income countries. Mobile money is the primary vehicle for financial inclusion in sub-Saharan Africa. An AI system that provides systematically weaker protection to low-balance informal users is not neutral. It is a structural barrier embedded in the infrastructure that is supposed to remove barriers.
Building AI systems that protect all users equally is not just a compliance requirement under BoG CISD 2026. It is a prerequisite for the financial inclusion agenda that mobile money was built to serve.
4.6 Recommendations¶
Three remediation approaches are recommended based on the audit findings.
The first is tier-specific threshold calibration. The model currently uses a single prediction threshold of 0.5 for all users. Lowering that threshold to 0.1 improves Low-Balance fraud detection from 67% to 83%, but adds more false positives across the board. Applying separate thresholds per balance tier allows the model to improve protection for low-balance users without degrading performance for everyone else.
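Mechanically, tier-specific calibration means looking up a threshold per transaction based on its balance tier instead of applying one global cut. A minimal sketch (the threshold values shown are illustrative; in practice each would be tuned on validation data per tier):

```python
import numpy as np

def apply_tier_thresholds(scores, tiers, thresholds, default=0.5):
    # Flag a transaction when its fraud score reaches the threshold set
    # for its balance tier. Unlisted tiers fall back to the default.
    scores = np.asarray(scores)
    cuts = np.array([thresholds.get(t, default) for t in tiers])
    return (scores >= cuts).astype(int)
```

This lets the Low-Balance threshold drop toward 0.1 without inflicting the blanket false-positive cost that the global 0.1 threshold incurred in Section 3.6.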
The second is targeted data collection. The core problem is that the training data contains too few confirmed Low-Balance fraud cases for the model to learn from. Working with regulators and mobile money operators to systematically collect and label more Low-Balance fraud incidents would directly fix this. BoG incident reporting mechanisms are a practical vehicle for this. Where labelled data remains too scarce, unsupervised anomaly detection can identify unusual Low-Balance activity without needing fraud labels at all.
The third is per-group monitoring from the first day of deployment. Monitoring only overall model performance while ignoring how each balance tier is being served is non-compliant with BoG CISD 2026 Annexure E. It also means a growing fairness gap can go undetected until it becomes a regulatory and reputational problem. Per-group monitoring is not optional. It is the mechanism that keeps the other two interventions honest.
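A per-group monitor can be as simple as two checks on each tier's recall: how far it trails the best-served group, and how far it has drifted from its own deployment baseline. The sketch below is illustrative; the 0.10 gap limit echoes the audit's EOD threshold, but the drift limit is an assumption, not a regulatory figure.

```python
def check_group_recall(tpr_by_group, baseline_tpr, max_gap=0.10, max_drift=0.05):
    # Return alerts when any group trails the best-served group's TPR by
    # more than `max_gap`, or drifts from its own deployment baseline by
    # more than `max_drift`.
    best = max(tpr_by_group.values())
    alerts = []
    for group, tpr in tpr_by_group.items():
        if best - tpr > max_gap:
            alerts.append((group, "gap", round(best - tpr, 4)))
        if abs(tpr - baseline_tpr[group]) > max_drift:
            alerts.append((group, "drift", round(tpr - baseline_tpr[group], 4)))
    return alerts
```

Run against this audit's numbers, such a check fires immediately for Low-Balance users, which is the point: the gap is detectable on day one, before retraining widens it.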
5. Conclusion¶
This paper presents a complete practitioner audit of a mobile money fraud detection model, documenting a confirmed and measurable fairness failure that disproportionately affects low-balance users in the informal economy.
The model achieves strong overall fraud detection, catching 99.27% of fraud in aggregate. But that headline figure conceals a 110-fold difference in miss rate between the most and least wealthy user groups. One in three fraud cases targeting low-balance accounts goes undetected. For high-balance accounts, the model misses fewer than one in three hundred.
The root cause is structural. Only 41 of 8,213 fraud cases in the training data come from low-balance accounts. Standard oversampling techniques do not correct this imbalance because they cannot generate meaningful synthetic examples from so few real ones. Without deliberate intervention in data collection or model calibration, this gap will compound over time as the model is retrained on its own past predictions.
The Bank of Ghana Cyber and Information Security Directive 2026 already mandates the testing this paper demonstrates. Fairness assessment before deployment is a binding requirement for Regulated Financial Institutions, not a best practice recommendation. The audit methodology presented here provides a replicable framework for meeting that requirement, evaluated against real model outputs and documented against the specific regulatory provisions that apply.
The broader point is simple. A fraud detection model that protects high-balance accounts better than low-balance ones is not just a technical flaw. It is a bias outcome, one that fairness auditing can measure, document, and fix. This audit shows how.
References¶
Azamuke, D. N., Nabukenya, J., and Abayomi-Alli, O. (2025). A synthetic mobile money dataset for fraud detection research in the absence of real transaction data. In Proceedings of the International Conference on Intelligent Systems and Computer Vision. IEEE.
Bank of Ghana. (2026). Cyber and Information Security Directive 2026. Bank of Ghana.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chen, T., and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
European Parliament. (2024). Regulation 2024/1689 laying down harmonised rules on artificial intelligence. Official Journal of the European Union.
GSMA. (2025). State of the industry report on mobile money 2025. GSMA Mobile Money Programme. Retrieved from gsma.com/sotir
INTERPOL. (2025). Africa cyberthreat assessment 2025. INTERPOL Cybercrime Directorate.
Knowledge Innovations and Partners. (2026, April). 3 reasons African mobile money programs have a $1.7 trillion AI problem. AI Summit and Awards 2026 Conference Publication.
Lopez-Rojas, E. A., Elmir, A., and Axelsson, S. (2016). PaySim: A financial mobile money simulator for fraud detection. In Proceedings of the 28th European Modeling and Simulation Symposium, 249–255.
Lundberg, S. M., and Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
National Institute of Standards and Technology. (2023). Artificial intelligence risk management framework (AI RMF 1.0). NIST AI 100-1. U.S. Department of Commerce.
Republic of Ghana. (2012). Data Protection Act 2012 (Act 843). Parliament of Ghana.
Amponsah, K. (2026). Technical Audit Artifacts and Fairness Scorecards for Mobile Money Fraud Detection: A Practitioner's Methodology for Ghanaian Financial Inclusion. ClearBoxAI Research. Available at: https://kwadwoai.com/paysim-fraud.html
Acknowledgments¶
I would like to thank the AI governance and fairness research community whose published frameworks made this methodology possible, and the organizers of the AI Summit and Awards 2026 for creating a platform for practitioner-led work in the African context.
Contact Information¶
Kwadwo Amponsah
ClearBoxAI / Baystate Bank
kwadwo_amponsah@yahoo.com
The complete audit notebook, code, methodology documentation, and interactive findings for this paper are available at www.kwadwoai.com.