This is the repository for the "Simulation and Detection of Healthcare Fraud in German Inpatient Claims Data" paper submitted to ICCS 2024 in the Health Thematic Track.
This project consists of two parts: Claims Simulation and Fraud Detection.
The Simulator generates German inpatient claims according to the regulations valid in 2021. The generated claims are then modified in a fraudulent way.
The fraud types included are:
- Increases in ventilation hours
- Changing vaginal births to cesarean sections
- Decreasing the weight of newborns
- Adding the need for personal care to a newborn's treatment
- Discharging patients from the hospital too early (bloody release)
- Changing the order of ICD codes
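Two of the fraud types listed above could be sketched as follows. This is a minimal illustration, not the simulator's actual code: the `Claim` record, its field names, the helper functions, and the ICD strings are all hypothetical placeholders.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Claim:
    # Hypothetical, heavily simplified claim record (not the simulator's schema).
    claim_id: int
    ventilation_hours: int
    icd_codes: Tuple[str, ...]
    is_fraud: bool = False


def increase_ventilation(claim: Claim, extra_hours: int) -> Claim:
    """Fraud type: report more ventilation hours than were provided."""
    return replace(
        claim,
        ventilation_hours=claim.ventilation_hours + extra_hours,
        is_fraud=True,
    )


def swap_icd_order(claim: Claim) -> Claim:
    """Fraud type: swap the first two ICD codes (needs at least two codes)."""
    if len(claim.icd_codes) < 2:
        return claim
    codes = (claim.icd_codes[1], claim.icd_codes[0]) + claim.icd_codes[2:]
    return replace(claim, icd_codes=codes, is_fraud=True)


def inject_fraud(claims: List[Claim], fraud_rate: float, rng: random.Random) -> List[Claim]:
    """Manipulate each claim with probability fraud_rate (assumed selection rule)."""
    result = []
    for claim in claims:
        if rng.random() < fraud_rate:
            if claim.ventilation_hours > 0:
                claim = increase_ventilation(claim, extra_hours=rng.randint(1, 48))
            else:
                claim = swap_icd_order(claim)
        result.append(claim)
    return result
```

Keeping the original claims immutable and returning modified copies makes it easy to compare a claim before and after manipulation.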
Factors not simulated:
- inpatient wards
- the outcome of a treatment (cured, death, etc.)
- vacations during long hospital stays
- the reason for admission
The Detection uses the generated data to train models with algorithms from Scikit-Learn.
The models with the best results are Gradient Boosting and Random Forest.
1. Start Simulation: Patients and Hospitals
2. Initialize Treatment: Get ICD- and OPS-Codes, ventilation, duration
3. Adjust Treatments: align the treatments with coding guidelines
4. Inject Fraud: following the fraud patterns
5. Finish Up: adjust the fraudulent claims to coding guidelines and calculate claims
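The five steps above can be read as a simple sequential pipeline. The sketch below only illustrates that ordering. Every function name, field name, and rule in it is a made-up placeholder, and the real simulator derives DRGs and claim amounts through a DRG-Grouper rather than the toy formula used here.

```python
import random
from typing import Any, Dict, List

Case = Dict[str, Any]


def start_simulation(n_cases: int, rng: random.Random) -> List[Case]:
    # Step 1: create patients and assign them to hospitals (placeholder IDs).
    return [{"patient_id": i, "hospital_id": rng.randint(1, 3)} for i in range(n_cases)]


def initialize_treatment(cases: List[Case], rng: random.Random) -> List[Case]:
    # Step 2: attach codes, ventilation, and duration (placeholder values).
    for case in cases:
        case["icd_codes"] = ["PLACEHOLDER-ICD"]
        case["ops_codes"] = ["PLACEHOLDER-OPS"]
        case["duration_days"] = rng.randint(1, 14)
        case["ventilation_hours"] = rng.choice([0, 0, 0, 48, 500])
    return cases


def adjust_to_guidelines(cases: List[Case]) -> List[Case]:
    # Step 3: toy rule standing in for coding guidelines:
    # ventilation cannot last longer than the stay itself.
    for case in cases:
        limit = case["duration_days"] * 24
        case["ventilation_hours"] = min(case["ventilation_hours"], limit)
    return cases


def inject_fraud(cases: List[Case], fraud_rate: float, rng: random.Random) -> List[Case]:
    # Step 4: mark some cases as fraudulent and manipulate them.
    for case in cases:
        case["fraud"] = rng.random() < fraud_rate
        if case["fraud"]:
            case["ventilation_hours"] += 24
    return cases


def finish_up(cases: List[Case]) -> List[Case]:
    # Step 5: re-apply the guidelines, then compute a made-up claim amount.
    adjust_to_guidelines(cases)
    for case in cases:
        case["claim"] = 100.0 * case["duration_days"]
    return cases


def run_simulation(n_cases: int, fraud_rate: float, seed: int = 0) -> List[Case]:
    rng = random.Random(seed)
    cases = start_simulation(n_cases, rng)
    cases = initialize_treatment(cases, rng)
    cases = adjust_to_guidelines(cases)
    cases = inject_fraud(cases, fraud_rate, rng)
    return finish_up(cases)
```

Re-applying the guideline adjustment after fraud injection mirrors step 5: manipulated claims still have to look plausible under the coding rules.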
More visualizations and UML diagrams can be found in the directory doc.
- Download this repository
- Install requirements with pip:
pip install -r requirements.txt
- Install a DRG-Grouper (this project uses the grouper from IMC Clinicon: https://www.imc-clinicon.de/tools/imc-navigator/index_ger.html)
- Adjust config_template.py to your requirements and save it as config.py
IMPORTANT: This project is built and tested with Python 3.9!
Before running the simulation, install the code and adjust config_template.py as described in Installation.
If you want to use another DRG-Grouper, modify grouper_wrapping.py accordingly.
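One possible way to keep a grouper exchangeable is to hide it behind a small interface. The sketch below is an assumption about how such a wrapper could look. It does not reflect the actual contents of grouper_wrapping.py, and `GrouperWrapper`, `DummyGrouper`, and `add_drgs` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Treatment = Dict[str, Any]


class GrouperWrapper(ABC):
    """Hypothetical interface: one DRG code per treatment, in input order."""

    @abstractmethod
    def group(self, treatments: List[Treatment]) -> List[str]:
        ...


class DummyGrouper(GrouperWrapper):
    """Stand-in that assigns a placeholder DRG; a real wrapper would call the grouper tool."""

    def group(self, treatments: List[Treatment]) -> List[str]:
        return ["DRG-UNKNOWN" for _ in treatments]


def add_drgs(treatments: List[Treatment], grouper: GrouperWrapper) -> List[Treatment]:
    """Return copies of the treatments with a 'drg' field added."""
    drgs = grouper.group(treatments)
    if len(drgs) != len(treatments):
        raise ValueError("grouper must return exactly one DRG per treatment")
    return [{**treatment, "drg": drg} for treatment, drg in zip(treatments, drgs)]
```

With such an interface, switching groupers would only require a new subclass, while the rest of the simulation keeps calling `add_drgs`.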
If everything is set up, execute from the project's root directory:
python simulation/simulate.py
Make sure you have configured config.py correctly.
If everything works, several CSV files are generated in the directory data/generated data:
- claims.csv: initial inpatient treatments without fraud, DRGs, or claims
- claims_with_fraud.csv: claims.csv with injected fraud
- claims_with_drg.csv: claims_with_fraud.csv after grouping the treatments
- claims_final.csv: final inpatient treatments
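To check quickly that all four files were written, a small helper like the one below can count rows and columns per file. It is only a sketch: it assumes comma-separated, UTF-8 encoded files with a header row, which may not match the simulator's actual output format, and `summarize_outputs` is a hypothetical helper name.

```python
import csv
from pathlib import Path
from typing import Dict, Optional

# File names as listed above, in the order the simulation produces them.
STAGE_FILES = [
    "claims.csv",
    "claims_with_fraud.csv",
    "claims_with_drg.csv",
    "claims_final.csv",
]


def summarize_outputs(directory: str) -> Dict[str, Optional[Dict[str, int]]]:
    """Return column and row counts per file, or None for files that are missing."""
    summary: Dict[str, Optional[Dict[str, int]]] = {}
    for name in STAGE_FILES:
        path = Path(directory) / name
        if not path.exists():
            summary[name] = None
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)  # assumes comma as the delimiter
            header = next(reader, [])
            rows = sum(1 for _ in reader)
        summary[name] = {"columns": len(header), "rows": rows}
    return summary
```

Pass the output directory of your local setup as `directory`. A `None` entry indicates that the corresponding stage did not produce its file.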
First, preprocess your data according to preprocessing.py. Then select your classifier by commenting out all the others (to train all classifiers in one run, do not change anything). To train the models, execute
python detection/classifying.py
The trained models are saved in the directory models.
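For orientation, training the two best-performing model types with Scikit-Learn could look roughly like the sketch below. It is not the code in detection/classifying.py: it uses synthetic data from `make_classification` instead of the simulated claims, and the default hyperparameters, the F1 metric, the pickle format, and the `train_and_save` helper are assumptions made for illustration.

```python
import pickle
from pathlib import Path
from typing import Dict

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def train_and_save(model_dir: str) -> Dict[str, float]:
    """Train both classifiers on synthetic data, save them, and return F1 scores."""
    # Synthetic, imbalanced stand-in for preprocessed claims (about 10% positives).
    X, y = make_classification(
        n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )

    # Remove an entry here to train only a subset of the classifiers.
    classifiers = {
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }

    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    scores: Dict[str, float] = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        scores[name] = float(f1_score(y_test, clf.predict(X_test)))
        with (out_dir / f"{name}.pkl").open("wb") as handle:
            pickle.dump(clf, handle)
    return scores
```

Because fraudulent claims are typically a small minority, a metric such as F1 on the fraud class is usually more informative than plain accuracy.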
The simulated data used for training the machine learning algorithms can be accessed at zenodo.org
If you have questions, contact me or create an issue.
This code is no longer maintained. Further developments that would be necessary:
- Improve the OPS-Code generation
- Model the treatment outcome
- Simulate inpatient ward (via simulating outpatient treatment)
- etc.
Special thanks to my supervisors René Raab, Kai Klede and Prof. Dr. Bjoern Eskofier.
Furthermore, thanks to AOK Bayern and Dominik Schirmer for providing the necessary validation data.
Thanks to IMC Clinicon and Gunter Damian for giving me access to IMC Navigator, a certified DRG Grouper.
Until further notice, development of this project has been stopped since 29.11.2023. Feel free to contact me (see Support) if you have ideas or use cases for collaboration.