The attached scripts provide two alternative uses of the algorithm described in: Sengupta, D.; Bandyopadhyay, S.; Sinha, D., "A Scoring Scheme for Online Feature Selection: Simulating Model Performance Without Retraining," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1-10, doi: 10.1109/TNNLS.2016.2514270.
Kindly cite this article if you use the algorithm for your research.
Pre-requisites:
The following Python packages must be installed on the host machine before executing the scripts.
numpy
scipy
sklearn
matplotlib
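If any of these are missing, they can typically be installed with pip (note that the sklearn package is distributed under the name scikit-learn):

pip install numpy scipy scikit-learn matplotlib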
The software simulates an online/streaming feature scenario. It first builds a base model with a handful of features, against which each newly arriving feature is evaluated for goodness. Execute the following command:
python demo_ofs.py
Input:
- A data matrix with binary labels. The rows represent features and the columns represent samples. The first row contains the labels.
- A linear classifier: Logistic Regression with a suitable regularization parameter (we chose a high lambda value to minimize the regularizing effect); see the set-up sketch below
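As a rough illustration of the expected input layout and classifier set-up, a minimal sketch follows. The file name, delimiter, and parameter value here are assumptions for illustration only, not what the script hard-codes:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical whitespace-delimited data matrix: features in rows,
    # samples in columns, binary labels in the first row.
    data = np.loadtxt("data_matrix.txt")

    y = data[0, :]       # first row: binary class labels, one per sample
    X = data[1:, :].T    # remaining rows: features; transpose to samples x features

    # Linear classifier with a weak regularizing effect (in scikit-learn the
    # parameter C is the inverse regularization strength, so a large C keeps
    # the penalty small; the value here is illustrative).
    clf = LogisticRegression(C=1e4)
    clf.fit(X, y)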
Output:
The following files are generated along with a figure:
- mfeat.init # a set of base features
- ent # evaluation score corresponding to each feature, sorted rank-wise
- mfeat.entrank # evaluation score corresponding to each feature
- mfeat.entauc # evaluation score and improvement in AUC corresponding to each feature
- mfeat.contable # contingency table to perform statistical significance test
The script also produces a figure representing the contingency table. The first quadrant shows how the evaluated score correlates with the actual improvement in AUC. The result of a statistical significance test on the contingency table is also displayed in the output.
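To make the contingency-table output concrete, the fragment below sketches one way such a table and its significance test could be assembled from per-feature scores and actual AUC improvements. The placeholder values, the thresholds, and the choice of Fisher's exact test are assumptions for illustration; they are not necessarily what demo_ofs.py does internally.

    import numpy as np
    from scipy.stats import fisher_exact

    # Placeholder per-feature evaluation scores and the actual AUC improvement
    # obtained when each candidate feature is added to the base model
    # (the quantities reported in mfeat.entauc).
    scores = np.array([0.42, 0.05, 0.31, 0.02, 0.27])
    auc_gain = np.array([0.013, -0.004, 0.009, -0.001, 0.006])

    # Dichotomise both quantities (thresholds are illustrative):
    # "high score" vs "low score", and "AUC improved" vs "not improved".
    high_score = scores > np.median(scores)
    improved = auc_gain > 0

    # 2x2 contingency table: rows = score level, columns = actual improvement.
    table = np.array([
        [np.sum(high_score & improved),  np.sum(high_score & ~improved)],
        [np.sum(~high_score & improved), np.sum(~high_score & ~improved)],
    ])

    # Significance of the association between the score and the improvement
    # (Fisher's exact test; the script may use a different test).
    odds_ratio, p_value = fisher_exact(table)
    print(table)
    print(p_value)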
The same program can also be used for simple feature selection while avoiding over-fitting. Execute the following command:
python demo_fs.py
Input:
- A data matrix with binary labels. The rows represent features and the columns represent samples. The first row contains the labels.
- A linear classifier: Logistic Regression with a suitable regularization parameter (we chose a high lambda value to minimize the regularizing effect)
- The initial set of base features, as generated by the first usage (mfeat.init)
- The evaluation score corresponding to each feature, as generated by the first usage
Output:
The script reports model performance as the feature subset is grown in batches of fixed size. Features are added cumulatively, batch by batch, from the ranked list produced by the proposed algorithm, and the classifier is retrained each time the feature subset is incremented.
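A minimal sketch of this incremental loop, written against a recent scikit-learn API, is given below. The synthetic data, hold-out split, batch size, and AUC metric are assumptions for illustration and do not reproduce the script's exact procedure; in a real run the columns would already be ordered by the ranking produced above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in data: samples x features, with the columns assumed to
    # be already ordered by the feature ranking produced by demo_ofs.py.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)
    y = rng.randint(0, 2, 200)

    # Simple hold-out split for measuring performance (illustrative).
    X_tr, X_te = X[:140], X[140:]
    y_tr, y_te = y[:140], y[140:]

    batch_size = 5  # fixed batch size (value is illustrative)
    for n_feat in range(batch_size, X.shape[1] + 1, batch_size):
        # Grow the feature subset cumulatively and retrain the classifier.
        clf = LogisticRegression(C=1e4)
        clf.fit(X_tr[:, :n_feat], y_tr)
        auc = roc_auc_score(y_te, clf.decision_function(X_te[:, :n_feat]))
        print("features: %d  AUC: %.3f" % (n_feat, auc))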
NB: All scripts have been tested on Python version 2.6.