The data_wrangling repo collects Python mini-apps for common data processing tasks.
Each application lives in its own directory for tidy organization, where you can find:
- the Python script (.py) of the application
- example inputs
- documentation in the README.md file, including example usage variations
All the applications have a built-in set of options provided as command-line arguments. Thanks to that, users never need to modify the source code (e.g., to change the input filename or tune parameters), which also makes the apps more flexible and robust.
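For illustration, here is a minimal sketch of the command-line pattern these apps follow; the option names below are hypothetical, so check each app's README for its actual arguments:

```python
# minimal_app.py - hypothetical sketch of the CLI pattern used by the mini-apps
import argparse

def main():
    parser = argparse.ArgumentParser(description="Example data-wrangling mini-app.")
    parser.add_argument("-i", "--input", required=True, help="path to the input file")
    parser.add_argument("-o", "--output", default="output.csv", help="path to the output file")
    parser.add_argument("-s", "--separator", default=",", help="column separator in the input")
    args = parser.parse_args()

    # the real apps do their processing here; we only echo the settings
    print(f"Processing {args.input} -> {args.output} (sep={args.separator!r})")

if __name__ == "__main__":
    main()
```

With such a setup, the same script handles any input without edits, e.g., `python minimal_app.py -i data.csv -o results.csv`.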
More advanced (multi-purpose or multi-option) applications have a built-in logger that reports the analysis progress, with the level of detail depending on the selected verbosity.
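A minimal sketch of how such verbosity-controlled logging can work; the `-v` flag and the level mapping here are assumptions for illustration, not the apps' exact implementation:

```python
# verbose_logging.py - hypothetical sketch of verbosity-controlled logging
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="increase verbosity (-v: INFO, -vv: DEBUG)")
args = parser.parse_args()

# map the number of -v flags to a logging level
level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
logging.basicConfig(level=level, format="%(levelname)s: %(message)s")

logging.info("analysis started")       # shown with -v or -vv
logging.debug("detailed step report")  # shown only with -vv
```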
*(env = environment)*
To get started, please visit the Data Wrangling: use ready-made apps ⤴ section in the Data Science Workbook ⤴. In that practical tutorial, you will find all the information you need to set up a universal conda environment ⤴ that works for all the applications in the data_wrangling repository. It is the first step in creating your computational environment and familiarizing yourself with the tools and techniques used in the data wrangling process.
While the tutorial provides detailed instructions with explanations, below you can find a code snippet that aggregates all the necessary commands to get you started (recommended for Conda-experienced or returning users):
Here we assume that you have Conda installed. If not, follow the workbook's Environment setup ⤴ section first.
On HPC systems, conda can usually be loaded from the module manager:
```
module load conda
```
Create new Conda environment (do it only once on a given computing machine)
```
conda create -n data_wrangling python=3.9
```
Activate Conda environment (do it in every new session to run data_wrangling apps)
```
conda activate data_wrangling
```
^ On some HPC systems, you may need to replace the `conda` keyword with `source`.
Install basic dependencies within the environment (do it only once, at the initial creation of the conda env)
```
pip install pandas
pip install numpy
pip install openpyxl
```
^ Some applications may have additional requirements listed at the top of the corresponding README file in the application's folder. When necessary, you can install them in the conda environment using the `pip` command.
Deactivate Conda environment (do it to 'close' the env once you are done running the data_wrangling apps)
```
conda deactivate
```
^ On some HPC systems, you may need to replace the `conda` keyword with `source`.
List all your conda envs (do it when you can't remember ☺ the name of the env you need)
```
conda info -e
```
| APP | description |
|---|---|
| assign_colors | value-to-color mapping based on value ranges (intervals); includes the convert_for_ideogram app [see ideogram visualization ⤴] |
| bin_data | grouping, slicing, and aggregating data |
| data_merge | merging multiple files using a matching column |
The application enables value-to-color mapping; in other words, it assigns colors to ranges/intervals of numerical values. The colors (in a user-selected scale) can then be used in various visualization programs, including directly in Python (see the sketch after the links below).
Programmatically created and saved color scales help maintain color reproducibility across future repetitions or similar projects.
- app in the repo: ISUgenomics/data_wrangling/assign_colors ⤴
- docs: README ⤴
- assign_colors tutorial: coming soon; for now, see the comprehensive docs
- convert_for_ideogram tutorial: Bioinformatics_Workbook/Data Visualization/Ideogram: display chromosome bands ⤴
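As a rough illustration of the value-to-color mapping idea (not the app's actual code; the column name, bin edges, and hex colors here are made up), a few lines of pandas are enough:

```python
# hypothetical sketch of value-to-color mapping with pandas
import pandas as pd

df = pd.DataFrame({"value": [0.1, 3.7, 5.2, 9.8]})

# assign a color to each value range (bin edges and colors are arbitrary here)
bins = [0, 2.5, 5.0, 7.5, 10.0]
colors = ["#d7191c", "#fdae61", "#abdda4", "#2b83ba"]
df["color"] = pd.cut(df["value"], bins=bins, labels=colors)

print(df)
```

Saving such a mapping to a file is what makes the color assignment reproducible in later analyses.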
The application groups/slices the data into ensembles of rows and aggregates the observables from numerical columns by calculating the sum or mean within each group/slice.
- app in the repo: ISUgenomics/data_wrangling/bin_data ⤴
- docs: README ⤴
The figure shows the main steps of the bin_data algorithm. First, you can group data by unique values in the Label column, creating data chunks (marked as different background colors at step 2). Each data chunk can be further sliced based on the value ranges of the numerical data stored in the Ranges column (see step 3). Finally, you can aggregate the data of each slice to a single value, which can represent the sum or average of the aggregated values, separately for each STATS column (see step 4).
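A hedged sketch of the same group-slice-aggregate idea in pandas (the column names Label, Ranges, and STATS follow the figure; the data and bin edges are made up):

```python
# hypothetical sketch of the group -> slice -> aggregate steps with pandas
import pandas as pd

df = pd.DataFrame({
    "Label":  ["A", "A", "A", "B", "B"],
    "Ranges": [1.0, 4.5, 8.0, 2.0, 9.0],
    "STATS":  [10,  20,  30,  40,  50],
})

# step 2: group rows by unique values in the Label column
# step 3: slice each group by value ranges of the Ranges column
df["slice"] = pd.cut(df["Ranges"], bins=[0, 5, 10])

# step 4: aggregate each slice to a single value (sum or mean)
result = df.groupby(["Label", "slice"], observed=True)["STATS"].sum()
print(result)
```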
The application enables the merging of two (or more) files by a matching column (a column with the same values in all merged files) and assigns a custom error_value to missing records (from any file).
- app in the repo: ISUgenomics/data_wrangling/merge_data ⤴
- docs: README ⤴
The figure shows the algorithm for merging two files by a common column. The dark teal color corresponds to records available in only one of the input files. The red color corresponds to the missing records (error_value) in the merged output.
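A minimal sketch of merging on a matching column and filling missing records with a custom error_value (the data, column names, and the -9999 value are assumptions for illustration, not the app's defaults):

```python
# hypothetical sketch of merging two tables on a matching column with pandas
import pandas as pd

left  = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [200, 300, 400]})

# outer merge keeps records present in only one of the inputs
merged = pd.merge(left, right, on="id", how="outer")

# assign a custom error_value to missing records (from either file)
merged = merged.fillna(-9999)
print(merged)
```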