Skip to content

Current implementation of SPORF

parthgvora edited this page May 3, 2021 · 6 revisions

The current implementation of SPORF is located on the main branch of the repository.

USAGE

The usage of SPORF is the same as scikit-learn's Random Forest algorithm, but with two extra parameters. These parameters are implemented in the same way as RerF

  1. max_features is a float (0.0, 1.0) that controls the output dimension of SPORF after applying the projection matrix. Specifically, the output dimension is equal to int(max_features * n_features), where n_features is the number of features of the data.
  2. feature_combinations is a float (0.0, n_features) that controls the average number of features to consider at a split.

To build and run SPORF, simply use the following commands:

from oblique_forests.sporf import ObliqueForestClassifier

clf = ObliqueForestClassifier(**params)

clf.fit(X_train, y_train)

y_predict = clf.predict(X_test)

For speed, I highly recommend setting the parameter n_jobs=-1 to use all available threads.

IMPLEMENTATION

There are three components that make up SPORF: a splitter, a tree, and the forest. Each component's current implementation on the main branch is detailed below. Many of these will change when developing a sklearn PR ready SPORF.

ObliqueSplitter

The oblique splitter is a class that selects the next best oblique split of the data when building the tree. Located in oblique_tree.py, the oblique splitter has three functions: sample_proj_mat, leaf_label_proba, and split. sample_proj_mat is a function called at the beginning of each split to sample a new projection matrix; each vector of the matrix is an independent transformation to test. In the case of SPORF, each vector is a random sample of linear combinations to test for splitting. leaf_label_proba is a function called to determine the label & probability at a leaf node. split is called to determine the best oblique split, returning a SplitInfo container class with all the relevant information.

The actual searching for the best split is implemented in a cython class called BaseObliqueSplitter, in _split.pyx. The BaseObliqueSplitter takes in data transformed with a projection matrix and searches for the best split that minimizes gini impurity.

NOTE: any decision tree that applies a transformation to the data at each split can be implemented by simply inheriting the Oblique Splitter and specifying your own "sample_proj_mat" function. For example, to build MORF, everything else can remain the same and only the sample_proj_mat function needs changing.

ObliqueTree

The oblique tree, located in oblique_tree.py, is a binary decision tree that stores the information of each split for future predictions. The tree has three functions: add_node, build, and predict. The add_node function adds a node at each split, storing relevant information such as the parent, leaf status, and the projection vector to apply to data at predict time. The build function builds the tree in a depth-first manner, making the use of a stack to keep track of what samples to split next. The builder tries to follow sklearn's implementation, with nodes being stored in an array. The predict function predicts which leaf node a new data point belongs to, returning an array of node_ids for the number of data points tested.

ObliqueTreeClassifier

The oblique tree classifier, located in oblique_tree.py, is essentially just a wrapper around the ObliqueTree to implement sklearn's BaseEstimator API and other sklearn decision tree functions. This class has five functions: fit, apply, predict, predict_proba, and predict_log_proba. The fit function initializes the ObliqueSplitter and ObliqueTree, and calls the ObliqueTree's build function. The apply function calls the ObliqueTree's predict function to return an array of node ids that new data points belong to. The predict function returns a list of predicted classes for new data. The predict_proba function returns the probability of prediction labels for new data. The predict_log_proba returns the log probability of prediction labels for new data.

ObliqueForestClassifier

The oblique forest classifier, located in sporf.py, builds a Random Forest of ObliqueTreeClassifiers to implement SPORF. This class simply inherits from sklearn's ForestClassifier, changing only the base decision tree and adding the max_features and feature_combinations parameters as specified earlier. Many of sklearn's Random Forest parameters are commented out, as they have not been implemented in the ObliqueTreeClassifier (i.e. weighted samples).