The data_wrangling repo collects Python mini-apps for common data processing tasks.
Each application lives in its own directory for tidy organization, where you can find:
- the Python script (.py) of the application
- example inputs
- documentation in the README.md file, including example usage variations
All the applications have a built-in set of options provided as command-line arguments. Thanks to that, users never need to modify the source code (e.g., to change the input filename or tune parameters), which also makes the apps more flexible and robust.
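For illustration, here is a minimal sketch of the command-line pattern these apps follow; the option names below are hypothetical, so check each app's README for its actual arguments:

```python
# minimal_app.py - hypothetical sketch of the CLI pattern used by the mini-apps
import argparse

def main():
    parser = argparse.ArgumentParser(description="Example data-wrangling mini-app.")
    parser.add_argument("-i", "--input", required=True, help="path to the input file")
    parser.add_argument("-o", "--output", default="output.csv", help="path to the output file")
    parser.add_argument("-s", "--separator", default=",", help="column separator in the input")
    args = parser.parse_args()

    # the real apps do their processing here; we only echo the settings
    print(f"Processing {args.input} -> {args.output} (sep={args.separator!r})")

if __name__ == "__main__":
    main()
```

With such a setup, the same script handles any input without edits, e.g., `python minimal_app.py -i data.csv -o results.csv`.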
More advanced (multi-purpose or multi-option) applications have a built-in logger that reports the analysis progress, with the level of detail depending on the selected verbosity.
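A minimal sketch of how such verbosity-controlled logging can work; the `-v` flag and the level mapping here are assumptions for illustration, not the apps' exact implementation:

```python
# verbose_logging.py - hypothetical sketch of verbosity-controlled logging
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="increase verbosity (-v: INFO, -vv: DEBUG)")
args = parser.parse_args()

# map the number of -v flags to a logging level
level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
logging.basicConfig(level=level, format="%(levelname)s: %(message)s")

logging.info("analysis started")       # shown with -v or -vv
logging.debug("detailed step report")  # shown only with -vv
```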
*(env = environment)*
To get started, please visit the Data Wrangling: use ready-made apps ⤴ section in the Data Science Workbook ⤴. In that practical tutorial, you will find all the information you need to set up a universal conda environment ⤴ that works for all the applications in the data_wrangling repository. It is the first step in creating your computational environment and familiarizing yourself with the tools and techniques used in the data wrangling process.
While the tutorial provides detailed instructions with explanations, below you can find a code snippet that aggregates all the necessary commands to get you started (recommended for Conda-experienced or returning users):
Here we assume that you have Conda installed. If not, follow the workbook's Environment setup ⤴ section first.
On HPC systems, conda can usually be loaded from the module manager:
```
module load conda
```
Create new Conda environment (do it only once on a given computing machine)
```
conda create -n data_wrangling python=3.9
```
Activate Conda environment (do it in every new session to run data_wrangling apps)
```
conda activate data_wrangling
```
^ On some HPC systems, you may need to replace the `conda` keyword with `source`.
Install basic dependencies within the environment (do it only once, at the initial creation of the conda env)
```
pip install pandas
pip install numpy
pip install openpyxl
```
^ Some applications may have additional requirements listed at the top of the corresponding README file in the application's folder. When necessary, you can install them in the conda environment using the `pip` command.
Deactivate Conda environment (do it to 'close' the env once you are done running the data_wrangling apps)
```
conda deactivate
```
^ On some HPC systems, you may need to replace the `conda` keyword with `source`.
List all your conda envs (do it when you can't remember ☺ the name of the env you need)
```
conda info -e
```
| APP | description |
|---|---|
| assign_colors | value-to-color mapping based on value ranges (intervals); includes the convert_for_ideogram app [see ideogram visualization ⤴] |
| bin_data | grouping, slicing, and aggregating data |
| data_merge | merging multiple files using a matching column |
The application enables value-to-color mapping; in other words, it assigns colors to ranges/intervals of numerical values. The colors (in a user-selected scale) can then be used in various visualization programs, including directly in Python (see the sketch after the links below).
Programmatically created and saved color scales help maintain color reproducibility across future repetitions or similar projects.
- app in the repo: ISUgenomics/data_wrangling/assign_colors ⤴
- docs: README ⤴
- assign_colors tutorial: coming soon; for now, see the comprehensive docs
- convert_for_ideogram tutorial: Bioinformatics_Workbook/Data Visualization/Ideogram: display chromosome bands ⤴
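As a rough illustration of the value-to-color mapping idea (not the app's actual code; the column name, bin edges, and hex colors here are made up), a few lines of pandas are enough:

```python
# hypothetical sketch of value-to-color mapping with pandas
import pandas as pd

df = pd.DataFrame({"value": [0.1, 3.7, 5.2, 9.8]})

# assign a color to each value range (bin edges and colors are arbitrary here)
bins = [0, 2.5, 5.0, 7.5, 10.0]
colors = ["#d7191c", "#fdae61", "#abdda4", "#2b83ba"]
df["color"] = pd.cut(df["value"], bins=bins, labels=colors)

print(df)
```

Saving such a mapping to a file is what makes the color assignment reproducible in later analyses.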
The application groups/slices the data into ensembles of rows and aggregates the observables from numerical columns by calculating the sum or mean within each group/slice.
- app in the repo: ISUgenomics/data_wrangling/bin_data ⤴
- docs: README ⤴
The figure shows the main steps of the bin_data algorithm. First, you can group data by unique values in the Label column, creating data chunks (marked as different background colors at step 2). Each data chunk can be further sliced based on the value ranges of the numerical data stored in the Ranges column (see step 3). Finally, you can aggregate the data of each slice to a single value, which can represent the sum or average of the aggregated values, separately for each STATS column (see step 4).
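A hedged sketch of the same group-slice-aggregate idea in pandas (the column names Label, Ranges, and STATS follow the figure; the data and bin edges are made up):

```python
# hypothetical sketch of the group -> slice -> aggregate steps with pandas
import pandas as pd

df = pd.DataFrame({
    "Label":  ["A", "A", "A", "B", "B"],
    "Ranges": [1.0, 4.5, 8.0, 2.0, 9.0],
    "STATS":  [10,  20,  30,  40,  50],
})

# step 2: group rows by unique values in the Label column
# step 3: slice each group by value ranges of the Ranges column
df["slice"] = pd.cut(df["Ranges"], bins=[0, 5, 10])

# step 4: aggregate each slice to a single value (sum or mean)
result = df.groupby(["Label", "slice"], observed=True)["STATS"].sum()
print(result)
```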
The application enables the merging of two (or more) files by a matching column (a column with the same values in all merged files) and assigns a custom error_value to missing records (from any file).
- app in the repo: ISUgenomics/data_wrangling/merge_data ⤴
- docs: README ⤴
The figure shows the algorithm for merging two files by a common column. The dark teal color corresponds to records available in only one of the input files. The red color corresponds to the missing records (error_value) in the merged output.
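A minimal sketch of merging on a matching column and filling missing records with a custom error_value (the data, column names, and the -9999 value are assumptions for illustration, not the app's defaults):

```python
# hypothetical sketch of merging two tables on a matching column with pandas
import pandas as pd

left  = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [200, 300, 400]})

# outer merge keeps records present in only one of the inputs
merged = pd.merge(left, right, on="id", how="outer")

# assign a custom error_value to missing records (from either file)
merged = merged.fillna(-9999)
print(merged)
```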