Team members: Claude Hu, Caitlyn Nguyen, Luenna Wu
This tool produces graphical displays without requiring the end user to write any code. It gives researchers and scientists a quick preliminary analysis to help them understand their results.
A graphical user interface (GUI) was coded in Python. The GUI instructs the user to upload a .csv file containing a dataset with an outcome (y) and predictor(s) (x). The back-end code then reads in the data. The next screen asks the user to indicate which variable is the outcome and which are the predictors. Next, the user selects the type of data (binary, categorical, discrete, or continuous) for the outcome and for each predictor. The GUI then presents checkboxes for the displays the tool can generate, including:
- Boxplots: Show the five-number summary of each set of data. They are helpful for comparing distributions across groups and for revealing potential outliers.
- Scatterplot matrix: Displays scatterplots of the outcome against each continuous predictor, showing the relationship between the outcome and the predictors.
- Correlation matrix: Displays a heatmap of the correlation coefficients between continuous predictors. It is useful for investigating the dependence between multiple variables at the same time and for checking for multicollinearity.
- Histograms: Display the distribution of each continuous or discrete variable in the dataset. They are useful for showing where the peaks of the distribution are, whether the distribution is skewed or symmetric, and whether there are potential outliers.
- Pairplots: Plot pairwise relationships of continuous variables in the dataset. A grid of axes is created in which each numeric variable is shared across the y-axes of a single row and the x-axes of a single column.
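As a rough illustration of what a boxplot summarizes, the five-number summary and the common 1.5 × IQR outlier rule can be sketched with the standard library alone. This is not the tool's plotting code; the function names and the sample data are hypothetical, and the "inclusive" quartile method is an assumption (plotting libraries may compute quartiles slightly differently).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # Assumption: the "inclusive" method, which treats the data as a full population.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def outliers_by_iqr(values):
    """Return the points lying more than 1.5 * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one unusually large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = five_number_summary(sample)
flagged = outliers_by_iqr(sample)
print(summary)
print(flagged)
```

Points returned by outliers_by_iqr correspond to the markers a boxplot typically draws beyond its whiskers.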
Once the user submits their selection, a new folder is created within the local directory. The graphical displays are saved as .png files in that folder.
To install Python and all necessary packages listed in Requirements.txt, please refer to Python Packaging Installation Instructions.
To install Git, please refer to Git Guides.
To install Pytest, please refer to Pytest Documentation.
Data should be recorded in a .csv file with one column per variable. Each column should have a header (the variable name) in the first row.
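The expected layout, a header row followed by one column per variable, can be illustrated with a minimal sketch that reads such a table into per-column lists. This is only an illustration using the standard csv module, not the tool's own loading code; read_columns and the sample column names are hypothetical.

```python
import csv
import io


def read_columns(csv_file):
    """Read a CSV with a header row into {column_name: [values as strings]}."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None:
        raise ValueError("the file has no header row")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(row[name])
    return columns


# A small in-memory table stands in for a real .csv file here.
example = io.StringIO("weight,group,smoker\n61.5,A,0\n72.0,B,1\n58.3,A,0\n")
columns = read_columns(example)
print(list(columns))
print(columns["weight"])
```

With a file on disk, the same function could be called inside `with open(path, newline="") as f:`.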
The accepted variable types are:
- Binary
- Categorical
- Discrete
- Continuous
The following .py scripts should be downloaded and saved in the same folder on your local drive:
Variable_Class.py
final_project_main.py
plots.py
The Variable class is initialized as: Variable(name: str, values: list).
Each instance of the Variable class stores attributes for a single variable listed as a column in the uploaded dataset.
The Variable class has the following instance attributes:
- name
- values
The Variable class has the following properties:
- get_type
- get_x_or_y
The method set_type(self, var_type: str) sets the variable type for the variable. The variable type can be set to "Binary", "Categorical", "Discrete", or "Continuous". A ValueError is raised if the given var_type is not one of these four types. Accessing get_type returns, as a str, the variable type that was set with set_type.
The method set_x_or_y(self, x_or_y_type: str) sets whether the variable is an x or a y variable. The value can be set to "x" or "y". A ValueError is raised if the given x_or_y_type is not "x" or "y". Accessing get_x_or_y returns, as a str, whether the variable is "x" or "y", as set with set_x_or_y.
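The interface described above can be sketched as follows. This is an illustrative stand-in written from the description, not the contents of Variable_Class.py: the private attribute names, the error messages, and the choice to return None before a value is set are all assumptions.

```python
class Variable:
    """Illustrative stand-in for the described interface, not the project's source."""

    VALID_TYPES = ("Binary", "Categorical", "Discrete", "Continuous")
    VALID_ROLES = ("x", "y")

    def __init__(self, name: str, values: list):
        self.name = name
        self.values = values
        # Assumption: both settings are unset (None) until explicitly assigned.
        self._type = None
        self._x_or_y = None

    def set_type(self, var_type: str):
        """Set the variable type, rejecting anything outside the four accepted types."""
        if var_type not in self.VALID_TYPES:
            raise ValueError(f"var_type must be one of {self.VALID_TYPES}")
        self._type = var_type

    @property
    def get_type(self):
        """Return the variable type set by set_type."""
        return self._type

    def set_x_or_y(self, x_or_y_type: str):
        """Mark the variable as a predictor ("x") or the outcome ("y")."""
        if x_or_y_type not in self.VALID_ROLES:
            raise ValueError('x_or_y_type must be "x" or "y"')
        self._x_or_y = x_or_y_type

    @property
    def get_x_or_y(self):
        """Return "x" or "y" as set by set_x_or_y."""
        return self._x_or_y


# Hypothetical usage with made-up data.
age = Variable("age", [34, 51, 29])
age.set_type("Continuous")
age.set_x_or_y("x")
print(age.get_type, age.get_x_or_y)
```

Because get_type and get_x_or_y are properties, they are read without parentheses, as in the last line.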
The .csv file should be saved on your local drive in a location that can be easily accessed again.
- Set the folder containing all of your scripts as the working directory within VS Code.
- In the cmd terminal, write:
python final_project_main.py
- A GUI window will appear with a "Select Input Table" button. Click the "Select Input Table" button.
- A file browser will open. Find and select your .csv file, then click "Open".
- The file path should now populate within the GUI. Click on "Next".
- A list of all of the variable names will appear. Select your outcome variable and click on "Next".
- Select the type of variable for the outcome from the list of variable types and click on "Next".
- Select which predictors you would like to include and the variable type for each predictor. Click on "Next" when done.
- Select the visualizations you would like to produce. Click on "Run" to generate the plots.
- A new folder named EDA_[year]_[month]_[day]_[hour]_[minute]_[second] will be created in your local directory. All plot figures are saved to this folder as .png files.
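The timestamped folder name in the last step could be produced along these lines. This is a minimal sketch rather than the tool's own code: make_output_folder is a hypothetical helper, and zero-padded date and time fields are an assumption, since the pattern above does not say whether the fields are padded.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_output_folder(parent, now=None):
    """Create and return an EDA_<timestamp> folder inside the given parent folder."""
    now = now or datetime.now()
    # Assumption: zero-padded fields, e.g. EDA_2024_03_05_14_07_09.
    folder = Path(parent) / now.strftime("EDA_%Y_%m_%d_%H_%M_%S")
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demonstrate with a fixed timestamp inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = make_output_folder(tmp, datetime(2024, 3, 5, 14, 7, 9))
    folder_name = folder.name
    folder_existed = folder.is_dir()
print(folder_name)
```

Including the seconds in the name keeps repeated runs from writing into the same folder unless they start within the same second.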
An example output of each plot type is given below.
The tests are located in test_Variable_Class.py. Running them requires pytest. The files test_data.csv and test_data_2.csv are included for use in testing.
To test the Variable_Class.py module, run the test_Variable_Class.py module by typing in the console:
pytest test_Variable_Class.py
If all tests pass, the output should look similar to:
test_Variable_Class.py .... [100%]
======================================== 1 passed in 0.12s ========================================