UCI Madelon Dataset: Feature Selection + Classification

Data

The goal is to demonstrate a capacity to identify relevant features using machine learning. The dataset is Madelon, and the UCI description of it reads: "MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). We added a number of distractor features called 'probes' having no predictive power. The order of the features and patterns were randomized."

The Madelon dataset ships without attribute information, to avoid biasing feature selection.

| MADELON | Positive ex. | Negative ex. | Total |
| --- | --- | --- | --- |
| Training set | 1000 | 1000 | 2000 |
| Validation set | 300 | 300 | 600 |
| Test set | 900 | 900 | 1800 |
| All | 2200 | 2200 | 4400 |

Number of variables/features/attributes: 20 real, 480 probes, 500 total.
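To make the structure described above concrete, here is a minimal sketch that generates a Madelon-like dataset with scikit-learn's `make_classification`, which builds clusters on the vertices of a hypercube in the same spirit. The sample count, the random seed, and the mapping of "32 clusters" onto `n_clusters_per_class=16` are assumptions for illustration. Labels come out as 0/1 rather than +1/-1, and this is not the actual UCI data.

```python
from sklearn.datasets import make_classification

# Assumed parameters mirroring the description above: 5 informative
# features, 15 redundant linear combinations, and 480 noise "probes".
X, y = make_classification(
    n_samples=2000,
    n_features=500,
    n_informative=5,
    n_redundant=15,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 clusters
    shuffle=True,             # randomize the order of features and samples
    random_state=0,
)

print(X.shape, y.shape)
```

A synthetic copy like this is handy for checking a feature selection method, because the number of relevant columns is known in advance.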

Problem Statement

Your challenge here is to develop a series of models for two purposes:

- identifying relevant features
- generating predictions

Content

Data Sampling

Do substantive work on at least six subsets of the data:

- 3 sets of 10% of the data from the UCI Madelon set
- 3 sets of 10% of the data from the Madelon set made available by your instructors
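One possible way to draw the three 10% subsets from a single source is sketched below with pandas. The helper name `draw_subsets`, the seeds, and the random placeholder frame are hypothetical. In practice the frame would be loaded from the UCI files or from the instructor-provided source.

```python
import numpy as np
import pandas as pd


def draw_subsets(df, n_subsets=3, frac=0.10, seed=0):
    """Return n_subsets independent random samples, each a fraction frac of df."""
    return [df.sample(frac=frac, random_state=seed + i) for i in range(n_subsets)]


# Placeholder frame standing in for the real data (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(2000, 500)))

subsets = draw_subsets(demo)
print([s.shape for s in subsets])
```

Each subset is sampled independently, so the three samples may overlap. If disjoint subsets are preferred, shuffling once and slicing is an alternative.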

EDA

Perform EDA on each subset as you see fit.
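As one example of what EDA could look like here, the sketch below computes summary statistics and then ranks features by their strongest absolute correlation with any other feature. The reasoning is an assumption worth verifying on the real data: redundant informative features are linear combinations of the same five dimensions and so should correlate with each other, while probes should not. The data is a small synthetic stand-in and the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Small synthetic stand-in for one 10% subset (not the real Madelon data).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_redundant=15, random_state=0)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

# Per-feature summary statistics.
summary = df.describe().T[["mean", "std", "min", "max"]]

# Absolute feature-to-feature correlations, ignoring self-correlation.
corr = np.abs(np.corrcoef(X, rowvar=False))
np.fill_diagonal(corr, 0.0)
max_corr = pd.Series(corr.max(axis=1), index=df.columns)

# Features that correlate strongly with some other feature are candidates.
candidates = max_corr.sort_values(ascending=False).head(20)
print(candidates)
```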

Benchmarking

Perform a naive fit for each of the base model classes:
- logistic regression
- decision tree
- k nearest neighbors
- support vector classifier
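A naive benchmark over the four model classes listed above might look like the following sketch. It uses default hyperparameters (apart from a raised `max_iter` so logistic regression converges), a synthetic stand-in for the data, and an assumed 75/25 split. The scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one data subset.
X, y = make_classification(n_samples=400, n_features=100, n_informative=5,
                           n_redundant=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k nearest neighbors": KNeighborsClassifier(),
    "support vector classifier": SVC(),
}

# Fit each model on all features and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

These numbers serve only as a baseline to compare against once feature selection is applied.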

Identify Features & Feature Importance

Considering these results, build a final predictive model.

Approaches:
- Use feature selection to reduce the dataset to a manageable size, then use conventional methods
- Use an iterative model training method to find relevant features (ANOVA)
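The two approaches above can be sketched side by side. The ANOVA F-test (`f_classif`) is a univariate filter, so it is shown here through `SelectKBest`. For the iterative approach, recursive feature elimination (`RFE`) around logistic regression is used as one assumed example, since the text does not name a specific method. The data is synthetic, and `k=20` is chosen only because the dataset description mentions 20 relevant features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in. With shuffle=False the 5 informative and 15 redundant
# features occupy columns 0-19, so selections can be checked against them.
X, y = make_classification(n_samples=400, n_features=100, n_informative=5,
                           n_redundant=15, shuffle=False, random_state=0)
relevant = set(range(20))

# Filter approach: rank features by ANOVA F-score and keep the top 20.
anova = SelectKBest(score_func=f_classif, k=20).fit(X, y)
anova_idx = anova.get_support(indices=True)

# Iterative approach (an assumed choice): recursive feature elimination,
# dropping the 10 weakest features after each logistic regression fit.
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=20, step=10).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

print("ANOVA hits:", len(relevant & set(anova_idx.tolist())))
print("RFE hits:  ", len(relevant & set(rfe_idx.tolist())))
```

On the real data the relevant columns are unknown, so agreement between several selection methods across subsets is one way to build confidence in a candidate set.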

Build Model

Implement the final model.
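As an illustration only, a final model could be assembled as a scikit-learn `Pipeline` that scales, selects features, and classifies in one object. The choice of k nearest neighbors, `k=20`, and 5-fold cross-validation are assumptions rather than the model this project settled on. Keeping selection inside the pipeline means it is refit on each training fold, which avoids leaking validation data into the selection step.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected data subset.
X, y = make_classification(n_samples=400, n_features=100, n_informative=5,
                           n_redundant=15, random_state=0)

# Scale, keep the 20 highest-scoring features, then classify.
final_model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv_scores = cross_val_score(final_model, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```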

Additional Items to Add (forthcoming):

- ROC visualizations
- Comparative score visualizations for different classification pipelines
- Tune hyperparameters to improve accuracy/precision/recall and reduce log loss
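The planned tuning and ROC work could start from something like the sketch below. It runs a small grid search scored on negative log loss, then computes the ROC curve points, AUC, and log loss on a held-out split. The parameter grid, the logistic regression classifier, and the synthetic data are all assumptions. Plotting `fpr` against `tpr` (for example with matplotlib) would give the ROC visualization.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one data subset.
X, y = make_classification(n_samples=400, n_features=100, n_informative=5,
                           n_redundant=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: number of kept features and regularization strength.
param_grid = {"select__k": [10, 20], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, scoring="neg_log_loss", cv=5)
search.fit(X_train, y_train)

# Evaluate the refit best pipeline on the held-out split.
proba = search.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
loss = log_loss(y_test, proba)

print(search.best_params_)
print(f"AUC: {auc:.3f}  log loss: {loss:.3f}")
```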

godsylla/UCI-Madelon-Dataset
