Demonstrate a capacity to identify relevant features using machine learning.
Madelon
"MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). We added a number of distractor feature called 'probes' having no predictive power. The order of the features and patterns were randomized."
The Madelon dataset provides no attribute information, so that feature selection is not biased by feature names or descriptions.
| MADELON | Positive ex. | Negative ex. | Total |
| --- | --- | --- | --- |
| Training set | 1000 | 1000 | 2000 |
| Validation set | 300 | 300 | 600 |
| Test set | 900 | 900 | 1800 |
| All | 2200 | 2200 | 4400 |
Number of variables/features/attributes: Real: 20, Probes: 480, Total: 500
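To make the structure described above concrete, here is a minimal sketch that builds a Madelon-like dataset with scikit-learn's `make_classification`, whose documentation describes it as adapted from the Madelon generator. The parameter values are assumptions chosen to mirror the description (5 informative features, 15 redundant linear combinations, 480 noise probes, 32 clusters); it does not reproduce the actual UCI files.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumed parameters that mirror the description above; this is an
# illustrative stand-in, not the official Madelon generator.
X, y = make_classification(
    n_samples=4400,
    n_features=500,           # 20 real features + 480 probes
    n_informative=5,          # the five hypercube dimensions
    n_redundant=15,           # random linear combinations of the informative features
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 clusters
    shuffle=True,             # randomize the order of samples and features
    random_state=0,
)
y = np.where(y == 1, 1, -1)   # relabel {0, 1} as {-1, +1}

print(X.shape, np.unique(y))
```

The remaining 480 columns are filled with random noise, which plays the role of the probes. Because the columns are shuffled, nothing in the column order reveals which features are informative.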
Your challenge here is to develop a series of models for two purposes:
- identifying relevant features
- generating predictions from the model
Do substantive work on at least six subsets of the data.
- 3 sets of 10% of the data from the UCI Madelon set
- 3 sets of 10% of the data from the Madelon set made available by your instructors
- Perform EDA on each set as you see fit
- Perform a naive fit for each of the base model classes:
  - logistic regression
  - decision tree
  - k nearest neighbors
  - support vector classifier
- Considering these results, build a final predictive model
- Approaches:
  - Use feature selection to reduce the dataset to a manageable size then use conventional methods
  - Use an iterative model training method to find relevant features (ANOVA)
- Implement final model
- ROC visualizations
- Comparative score visualizations for different classification pipelines
- Tune hyperparameters to improve accuracy/precision/recall and reduce log loss
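The subset sampling and naive-fit steps in the list above could be sketched roughly as follows. The data here is a synthetic stand-in from `make_classification` (an assumption, to keep the example self-contained); in practice `X` and `y` would be the Madelon features and labels. "Naive" is taken to mean default hyperparameters, apart from a larger `max_iter` for logistic regression, and the 75/25 split inside each subset is likewise an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace with the real Madelon features and labels.
X, y = make_classification(n_samples=2000, n_features=500, n_informative=5,
                           n_redundant=15, n_clusters_per_class=16,
                           random_state=0)

# Draw three independent 10% subsets (rows are unique within a subset,
# but the subsets themselves may overlap).
rng = np.random.default_rng(42)
n_subset = len(X) // 10
subsets = [rng.choice(len(X), size=n_subset, replace=False) for _ in range(3)]

# The four base model classes, left at their defaults.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k nearest neighbors": KNeighborsClassifier(),
    "support vector classifier": SVC(),
}

scores = {}
for i, idx in enumerate(subsets):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[idx], y[idx], test_size=0.25, stratify=y[idx], random_state=0)
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[(i, name)] = model.score(X_te, y_te)  # held-out accuracy
        print(f"subset {i}: {name:26s} accuracy = {scores[(i, name)]:.3f}")
```

With 500 features and only a couple of hundred rows per subset, these baseline scores are mainly useful as a reference point for the later feature-selection work, not as a verdict on any model class.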
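One possible shape for the final pipeline is sketched below: a univariate ANOVA F-test filter (`SelectKBest` with `f_classif`) feeding a k nearest neighbors classifier, tuned with a grid search scored on log loss. The choice of classifier, the grid values, and the synthetic stand-in data are all assumptions for illustration. The ANOVA filter scores each feature on its own, so it is only one way to read the "ANOVA" item in the list above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import log_loss, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the real Madelon features and labels.
X, y = make_classification(n_samples=1000, n_features=500, n_informative=5,
                           n_redundant=15, n_clusters_per_class=16,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scale, keep the k features with the highest ANOVA F-scores, then classify.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid; the values are illustrative, not recommendations.
param_grid = {"select__k": [10, 20, 40], "knn__n_neighbors": [3, 7, 15]}
search = GridSearchCV(pipe, param_grid, scoring="neg_log_loss", cv=3)
search.fit(X_tr, y_tr)

# Evaluate the refit best pipeline on the held-out split.
proba = search.predict_proba(X_te)
auc = roc_auc_score(y_te, proba[:, 1])
ll = log_loss(y_te, proba)
fpr, tpr, _ = roc_curve(y_te, proba[:, 1])  # points for an ROC plot

# Column indices kept by the selector in the best pipeline.
selected = search.best_estimator_.named_steps["select"].get_support(indices=True)
print(search.best_params_)
print(f"ROC AUC = {auc:.3f}, log loss = {ll:.3f}")
print("selected feature indices:", selected)
```

Putting the selector inside the pipeline means it is refit on each cross-validation fold, so the held-out folds do not leak into the feature choice. The `fpr` and `tpr` arrays can be passed to any plotting library for the ROC visualization, and `search.cv_results_` holds the per-configuration scores for comparing pipelines.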