CA04_Ensemble_Models

Overview

This project involves the use of ensemble models to analyze salary data obtained from the Census Bureau. The analysis aims to predict whether individuals earn more or less than $50K per year based on seven demographic variables. This assignment follows up on a previous analysis (CA03), utilizing the same dataset but focusing on ensemble methods to improve prediction accuracy and model performance.

Data Source

The dataset is reused from CA03, featuring demographics and salary information:

Number of target classes: 2 ('>50K' and '<=50K') [Labels: 1, 0]
Number of attributes (Columns): 7
Number of instances (Rows): 48,842

Data can be accessed at the following URL: https://github.com/ArinB/MSBA-CA-03-Decision-Trees/blob/master/census_data.csv?raw=true

Objectives

Data Preparation: Utilize the provided path to load the dataset. Split the data into training and test sets based on a specific column in the dataset. Ensure data cleaning and transformations are applied as done in CA03.
Finding Optimal Value of a Key Ensemble Method Hyper-parameter: Analyze the effect of the number of estimators on model performance. This involves plotting Accuracy vs. n_estimators and AUC vs. n_estimators graphs to find the optimal number of estimators.
Building Ensemble Models:
- Train a Random Forest Model with the dataset. Explore the model's behavior and find an optimal number of estimators from the set [50, 100, 150, 200, 250, 300, 350, 400, 450, 500].
- Repeat the analysis using AdaBoost, Gradient Boost, and XGB Classifiers.
Performance Comparison: Compare the best values of Accuracy and AUC for the four models (Random Forest, AdaBoost, Gradient Boost, XGB) and document the findings in a table.

Deliverables

A fully functional and executed Jupyter notebook.
This README file documenting the project overview, data source, objectives, and analysis steps.

Installation and Usage

Ensure you have Jupyter Notebook or JupyterLab installed to run the .ipynb file. Dependencies include pandas, numpy, matplotlib, sklearn, and xgboost. Install these packages via pip or conda if not already installed.

pip install pandas numpy matplotlib scikit-learn xgboost

To run the notebook:

Clone the repository or download the .ipynb file
Open the notebook in Jupyter Notebook/Lab.
Execute the cells in order to reproduce the analysis. If you have all of the installed packages, I recommend starting at cell 2, otherwise run from the beginning

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
CA04.ipynb		CA04.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CA04_Ensemble_Models

Overview

Data Source

Objectives

Deliverables

Installation and Usage

About

Releases

Packages

Languages

SkiDutch/CA04_Ensemble_Models

Folders and files

Latest commit

History

Repository files navigation

CA04_Ensemble_Models

Overview

Data Source

Objectives

Deliverables

Installation and Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages