This repository provides Python tools to quickly start using the UK-BioBank dataset before UKB RAP. The folder has the following structure:
├── commands/
│   ├── create_data.py
│   ├── create_eu_set.py
│   └── get_newest_baskets.py
└── ukb_tools/
    ├── preprocess/
    │   ├── filtering.py
    │   ├── labeling.py
    │   └── utils.py
    ├── __init__.py
    ├── data.py
    ├── logger.py
    └── tools.py
Clone the repository:
git clone https://github.com/TemryL/UKB-Tools.git
Move to the directory:
cd UKB-Tools
Create a virtual environment with Python 3.11 installed. Then install the dependencies:
pip install -r requirements.txt
UK-BioBank data is organized by projects and baskets. Each project ID can have several basket IDs associated with it. When someone requests new fields or a data update under the same project ID, a new basket is created. Data across projects cannot be merged, because participant IDs (eids) are randomized for each project. However, data across baskets of the same project can be merged, and it is preferable to take the data for a given UKB field from the most recent basket.
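The "most recent basket per field" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in `commands/get_newest_baskets.py`: the function name `newest_basket_per_field`, the basket names, the release dates and the tuple layout are all assumptions made up for the example.

```python
from datetime import date


def newest_basket_per_field(baskets, wanted_fields):
    """Map each wanted field to the most recent basket that contains it.

    `baskets` is a hypothetical list of (basket_id, release_date, fields) tuples.
    """
    result = {}
    # Visit baskets from newest to oldest so the first hit per field wins.
    for basket_id, _released, fields in sorted(
        baskets, key=lambda b: b[1], reverse=True
    ):
        for field in wanted_fields:
            if field in fields and field not in result:
                result[field] = basket_id
    return result


# Made-up baskets, for illustration only.
example_baskets = [
    ("basket_A", date(2020, 1, 15), {"31", "3066"}),
    ("basket_B", date(2022, 6, 1), {"31", "131369"}),
]
mapping = newest_basket_per_field(example_baskets, ["31", "131369", "3066"])
```

Field 31 is present in both baskets, so it resolves to the newer `basket_B`, while field 3066 only exists in `basket_A` and falls back to it.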
Suppose we want to create a dataset with UKB fields 31, 131369 and 3066. The fields can be stored in a text file, one per line, as follows:
ukb_fields.txt:
31
131369
3066
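A file in this format is simple to read back. The snippet below is a minimal sketch, assuming one field ID per line and that blank lines should be ignored; the helper name `parse_fields` is hypothetical and not part of this repository.

```python
def parse_fields(text):
    """Return the field IDs listed one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Same content as the ukb_fields.txt example, inlined as a string.
fields = parse_fields("31\n131369\n3066\n")
```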
Run the following command to retrieve, for a given project ID, the most recent basket that contains the given UKB fields:
python commands/get_newest_baskets.py ${/dir/to/ukb_folder} ${project_id} ${data/ukb_fields.txt} ${data/field_to_basket.json}
The results will be stored in a JSON file as follows:
field_to_basket.json:
{
"31": "project_52887_41230",
"131369": "project_52887_676883",
"3066": "project_52887_669338"
}
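When working with this mapping directly, it can be convenient to invert it so that each basket is opened only once. The following is a small sketch using only the standard library; the variable names are illustrative and the JSON is inlined as a string rather than read from disk.

```python
import json
from collections import defaultdict

# Same content as the field_to_basket.json example above.
mapping_text = """
{
  "31": "project_52887_41230",
  "131369": "project_52887_676883",
  "3066": "project_52887_669338"
}
"""
field_to_basket = json.loads(mapping_text)

# Group the requested fields by the basket they should be read from.
fields_by_basket = defaultdict(list)
for field, basket in field_to_basket.items():
    fields_by_basket[basket].append(field)
```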
Finally, to merge the data in a single CSV file, run the following command:
python commands/create_data.py ${/dir/to/ukb_folder} ${data/field_to_basket.json} ${data.csv}
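Conceptually, the merge joins the per-basket tables on the participant ID. The sketch below illustrates that idea with the standard library only. It is not the implementation of `commands/create_data.py`: the function name `merge_on_eid`, the assumption that every table has an `eid` column, the column names and the sample rows are all made up for the example.

```python
import csv
import io


def merge_on_eid(tables):
    """Outer-join several CSV tables (given as strings) on their `eid` column."""
    merged = {}  # eid -> combined row
    columns = ["eid"]
    for text in tables:
        reader = csv.DictReader(io.StringIO(text))
        # Collect column names in first-seen order.
        for name in reader.fieldnames:
            if name not in columns:
                columns.append(name)
        for row in reader:
            merged.setdefault(row["eid"], {"eid": row["eid"]}).update(row)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for eid in sorted(merged):
        writer.writerow(merged[eid])  # missing columns are written as empty cells
    return out.getvalue()


# Made-up data: two baskets of the same project, sharing eids.
basket_a = "eid,31-0.0\n1000001,1\n1000002,0\n"
basket_b = "eid,3066-0.0\n1000001,52\n"
merged_csv = merge_on_eid([basket_a, basket_b])
```

A participant missing from one basket keeps an empty cell for that basket's columns. This join is only meaningful within a single project, since eids differ across projects.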
Feel free to contribute to this repo by fixing issues, improving performance or adding new features!