In this tutorial we will use PySyft to study heart disease, and by doing so we will try to answer the following question:
Can we run Machine Learning experiments on multiple and distributed medical datasets, without seeing the data?
We are going to learn how! All you need to get started is PySyft and a Jupyter notebook! 🚀
Get the tutorial materials by cloning the repository with the git command from the terminal:
$ git clone https://github.com/openmined/syft-heart-disease-tutorial
or by clicking on Code >> Local >> Download ZIP on the repository main page.
The repository includes a requirements.txt file with the list of all the Python packages required to work with the notebooks.
You can install all these dependencies using pip:
$ pip install -r requirements.txt
Please refer to the Quick Install guide to learn how to install PySyft.
Note: It is recommended to install PySyft and all the dependencies within a dedicated Python virtual environment (using the virtual environment manager of your choice, e.g. Miniconda or pyenv).
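Before installing anything, it can help to confirm that the interpreter you are about to use is actually running inside an isolated environment. The snippet below is a minimal sketch of such a check and is not part of the tutorial repository: the helper name `in_virtual_env` is hypothetical, and the conda check assumes that an activated conda or Miniconda environment exports the `CONDA_PREFIX` variable (which is also set when the conda `base` environment is active).

```python
import os
import sys


def in_virtual_env() -> bool:
    """Return True if the interpreter appears to run inside an isolated environment."""
    # venv / virtualenv: sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # conda / Miniconda (assumption): an activated environment exports CONDA_PREFIX.
    in_conda = "CONDA_PREFIX" in os.environ
    return in_venv or in_conda


if __name__ == "__main__":
    print("Isolated environment detected:", in_virtual_env())
```

If the check reports `False`, you are most likely about to install the packages into the system-wide Python.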
Set up and launch the PySyft Datasites using the launch_datasites.py script included in the repository. From the command line:
$ python launch_datasites.py
Note: Please keep the terminal open, as this will keep all the servers running in the background. You can stop all the servers and terminate the program by pressing Ctrl+C.
- 🧭 (Intro) Setup Datasites: Familiarise yourself with the data and the Datasites.
- 📊 1. Compare Demographics: Study the distribution of the demographics in the data, using PySyft.
- 🤖 2. ML Model Training Experiment: Use PySyft to train a Machine learning classifier, using data across the four distributed datasites, and without seeing the data! (🌟)
- 📝 3. ML Model Evaluation Experiment: Assess the performance of the trained classifiers on each remote datasite. (🌟🌟)
- 🗳️ 4. Ensemble Learning Experiment: Create an Ensemble using all the models trained remotely and independently on each dataset. We will test this strategy to obtain an ML predictive model that has seen 4x more medical data in training. (🌟🌟🌟)
- ⚗️ 5. Federated Learning Experiment: Run a full Federated Learning experiment using PySyft and Scikit-learn. We'll train a linear classifier on each datasite and explore how to pass model parameters as inputs to a Syft function. (🌟🌟🌟🌟)
- 🔮 6. Federated Learning Experiment with PyTorch: Run a complete Federated Learning experiment using PySyft and PyTorch. We'll train a non-linear Neural Network across multiple datasites and learn how to leverage PyTorch within PySyft to seamlessly execute FL experiments. (🌟🌟🌟🌟🌟)
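To give a rough idea of the two aggregation strategies mentioned in the list above, here is a minimal, library-free sketch. It is not the code used in the notebooks and does not use the PySyft API: the function names `majority_vote` and `federated_average`, the tie-breaking rule, and all the numbers are assumptions made purely for illustration. The first function combines the predictions of independently trained models (ensemble by majority vote); the second averages model parameters weighted by the number of samples at each site (a common way to aggregate models in Federated Learning).

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label predictions for the same samples by majority vote.

    Ties are broken in favour of the smallest label (an arbitrary choice for this sketch).
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


def federated_average(site_params, site_sizes):
    """Average per-site parameter vectors, weighted by each site's sample count."""
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical predictions from four independently trained models on five samples.
votes = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0],
]
print(majority_vote(votes))  # [1, 0, 1, 1, 0]

# Hypothetical linear-model parameters from two sites with 100 and 300 samples.
print(federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300]))  # [2.5, 3.5]
```

In the tutorial itself, the models are trained remotely on each datasite; only predictions or parameters like the ones above would be combined on the data scientist's side.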
We will use the full version of the Heart Disease dataset, as available on UCI ML.
This database is the result of a study for the diagnosis of coronary artery disease, as presented in this paper.
The full dataset contains the data collected from patients in four different hospitals, in 1988:
- Cleveland Clinic in Cleveland, Ohio (303 patients);
- Hungarian Institute of Cardiology in Budapest, Hungary (425 patients);
- Veterans Administration Medical Center in Long Beach, California (200 patients);
- University Hospitals in Zurich and Basel (143 patients).
Each hospital will correspond to a single PySyft Datasite, hosting its own version of the Heart Study Data.
This dataset is quite popular and well known in the data science and machine learning community. However, the Cleveland database is the only one that has actually been used by ML researchers to date [1]. The "target" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. In our machine learning experiments we will treat this as a binary (presence vs absence) classification problem.
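As an illustration of this binary formulation, the following minimal sketch collapses the 0 to 4 target into two classes. It assumes that every value greater than 0 counts as "presence"; the helper name `binarize_target` and the sample values are hypothetical and are not taken from the notebooks.

```python
def binarize_target(values):
    """Map the integer target (0 to 4) to 0 (absence) or 1 (presence)."""
    binary = []
    for value in values:
        if value not in (0, 1, 2, 3, 4):
            raise ValueError(f"unexpected target value: {value!r}")
        # Assumption: any non-zero value indicates presence of heart disease.
        binary.append(1 if value > 0 else 0)
    return binary


# Made-up target values, for illustration only.
raw_targets = [0, 2, 1, 0, 4, 3]
print(binarize_target(raw_targets))  # [0, 1, 1, 0, 1, 1]
```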
The authors of the dataset have requested that any use of the data include the names of the principal investigators responsible for the data collection at each institution. They are:
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
If you spot any error or mistake, please feel free to reach out directly to me via email, or to open an Issue on the repository.
Any feedback will be very much appreciated! Thank you! 🙏
For any technical question, clarification, or request for assistance with PySyft, please consider joining the OpenMined Slack and posting your question in the #support channel.
Author: Valerio Maggio (@leriomaggio), Researcher, SSI Fellow, and Education Team @ OpenMined.
All the code material is distributed under the terms of the Apache License. See the LICENSE file for additional details.
All the instructional materials in this repository are free to use, and made available under the Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.
You are free:
- to Share: copy and redistribute the material in any medium or format
- to Adapt: remix, transform, and build upon the material
for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.