This package provides functions that help data scientists with common tasks during the exploratory data analysis (EDA) stage of a data science project. The functions run preliminary analysis on common column types (numeric, categorical and text columns) and conduct several experimental clusterings on the dataset.
Our functions are tailored to our own experience. Similar packages are also published on PyPI; a few good ones worth mentioning:
Several dependencies are not available on test.pypi, so please use the exact command below to install our package.
$ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple datascience-eda
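After installation, a quick check that the interpreter can locate the package may be useful. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and `datascience_eda` is the import name used in the usage example in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True when the package is visible to this interpreter, False otherwise.
    print(is_installed("datascience_eda"))
```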
- explore_numeric_columns: conducts common exploratory analysis on numeric columns. It generates a heatmap of correlation coefficients (using pearson, kendall or spearman correlation, as chosen by the user), histograms and SPLOM plots for all numeric columns or for a list of columns specified by the user. It returns a list of plot objects so that the user can save and reuse them later.
- explore_categorical_columns: performs exploratory analysis on categorical features. It returns a dataframe containing column names, the corresponding unique categories, counts of null values, percentages of null values and the most frequent categories. It also generates and displays countplots for a list of categorical columns chosen by the user.
- explore_text_columns: performs exploratory data analysis on text features. It prints summary statistics of character length and word count. It also plots a word cloud and the distributions of character length, word count, polarity scores and subjectivity scores. Bar charts of the top n stopwords, top n words other than stopwords, top n bigrams, sentiments, named entities and part-of-speech tags are visualized as well. It returns a list of plot objects.
- explore_clustering: fits K-Means and DBSCAN clustering algorithms on the dataset and visualizes Elbow, Silhouette Score and PCA plots. It returns a dictionary in which each key is the name of a clustering algorithm and each value is a list of plots generated by the models.
- explore_KMeans_clustering: fits the K-Means clustering algorithm on the dataset and visualizes Elbow, Silhouette Score and PCA plots. It returns a dictionary in which each key is the name of a plot type and each value is a list of plots generated for that type.
- explore_DBSCAN_clustering: fits the DBSCAN clustering algorithm on the dataset and visualizes Silhouette Score and PCA plots. It returns a tuple containing a list of the n_clusters values returned by the DBSCAN models and a dictionary in which each key is the name of a plot type and each value is a list of plots generated for that type.
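To make the summaries described above more concrete, here is a rough sketch of the kind of tables that the numeric and categorical functions are described as producing, written with plain pandas. It is not the package's implementation: the helper names `correlation_table` and `categorical_summary`, the output column names and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def correlation_table(df, method="pearson"):
    """Correlation coefficients between numeric columns (hypothetical helper)."""
    if method not in {"pearson", "kendall", "spearman"}:
        raise ValueError(f"unsupported correlation method: {method}")
    # Non-numeric columns are dropped before computing the correlations.
    return df.select_dtypes(include="number").corr(method=method)


def categorical_summary(df, columns):
    """One row per column: categories, null counts and mode (hypothetical helper)."""
    rows = []
    for col in columns:
        series = df[col]
        modes = series.mode(dropna=True)
        rows.append(
            {
                "column_name": col,
                "unique_items": sorted(series.dropna().unique()),
                "no_of_nulls": int(series.isna().sum()),
                "percentage_missing": round(100 * series.isna().mean(), 2),
                "most_frequent": modes.iloc[0] if not modes.empty else None,
            }
        )
    return pd.DataFrame(rows)


# Toy data, invented for this example only.
toy = pd.DataFrame(
    {
        "calories": [250, 400, 150, 520],
        "fat": [9.0, 23.0, 2.0, 30.0],
        "category": ["Breakfast", "Beef", None, "Beef"],
    }
)

corr = correlation_table(toy)
summary = categorical_summary(toy, ["category"])
print(corr.round(2))
print(summary)
```

The correlation table is the data a heatmap would be drawn from, and the summary frame mirrors the fields listed for the categorical function.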
The list of dependencies can be found at: https://github.com/UBC-MDS/datascience_eda/blob/main/pyproject.toml
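import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import datascience_eda as eda

original_df = pd.read_csv("/data/menu.csv")

# Impute and scale the numeric columns before exploring them
numeric_features = eda.get_numeric_columns(original_df)
numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features)
)
df = pd.DataFrame(
    data=preprocessor.fit_transform(original_df), columns=numeric_features
)

eda.explore_numeric_columns(df)
# df holds only the scaled numeric columns, so the categorical and text
# functions are given the original dataframe; replace the placeholder names
eda.explore_categorical_columns(original_df, ["categorical_column1", "categorical_column2"])
eda.explore_text_columns(original_df)
eda.explore_clustering(df)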
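As a rough illustration of the Elbow and Silhouette Score diagnostics that the clustering functions are described as visualizing, the sketch below computes the underlying numbers directly with scikit-learn on synthetic data. It is not how explore_clustering is implemented: the helper name `kmeans_diagnostics`, the range of `k` and the synthetic dataset are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def kmeans_diagnostics(X, k_values, random_state=0):
    """Return {k: (inertia, silhouette)} for each k (hypothetical helper)."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # inertia_ is the within-cluster sum of squared distances (elbow plot y-axis).
        results[k] = (model.inertia_, silhouette_score(X, labels))
    return results


# Synthetic data with three blobs, generated only for this example.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
diagnostics = kmeans_diagnostics(X, range(2, 7))

for k, (inertia, silhouette) in diagnostics.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={silhouette:.3f}")
```

An elbow plot draws inertia against k and looks for the point where the curve flattens, while a higher silhouette score suggests better separated clusters.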
The official documentation is hosted on Read the Docs: https://datascience_eda.readthedocs.io/en/latest/
We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. Please check out our CONDUCTING.rst if you are interested in contributing to this project.
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.