Augmented Random Forest with Kernel Convolution + predictRFStat extension - JIRAO2-5110 #359
Link to the JIRA ticket:
Presentation slide, to add to the RootInteractive full presentation:
Test of the AugmentedRandomForest

Requirements / pseudocode for the implementation:

```python
class AugmentedRandomForestArray:
    # Array of N standard random forests.
    # Input parameters include standard options plus the kernel width.

    def fit(self, X, Y):
        # Fit each random forest with augmented input data
        ...

    def predict(self, X):
        # Use an ensemble of forests for prediction.
        # Unlike standard RFStat, which uses a single array of trees,
        # this method uses an array of arrays.
        ...

def predictRFStat(rfArray, X, statDictionary, n_jobs):
    # Detailed statistical prediction using random forests
    ...
```
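A minimal runnable sketch of this pseudocode, assuming scikit-learn regressors and a Gaussian smearing kernel (the constructor parameters `nForests`, `nRepetitions`, and `rng` are illustrative names, not from the ticket):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class AugmentedRandomForestArray:
    """Sketch: N standard forests, each trained on kernel-smeared inputs."""

    def __init__(self, nForests=4, sigmaVec=None, nRepetitions=2, rng=0, **rfOptions):
        self.forests = [RandomForestRegressor(**rfOptions) for _ in range(nForests)]
        self.sigmaVec = sigmaVec          # per-feature Gaussian kernel widths
        self.nRepetitions = nRepetitions  # smeared copies of the data per forest
        self.rng = np.random.default_rng(rng)

    def fit(self, X, Y):
        X = np.asarray(X)
        Y = np.asarray(Y)
        for rf in self.forests:
            # Each forest sees its own randomly smeared copies of the data
            augX = np.vstack([X + self.rng.normal(0, self.sigmaVec, X.shape)
                              for _ in range(self.nRepetitions)])
            rf.fit(augX, np.tile(Y, self.nRepetitions))
        return self

    def predict(self, X):
        # Ensemble of forests: average the per-forest predictions
        return np.mean([rf.predict(X) for rf in self.forests], axis=0)
```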
Simple version of the AugmentedKernelRandomForest code, used recently to evaluate a smooth function with a kernel width defined by sigmaVec:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def makeAugmentRF(X, Y, rfArray, nRepetitions, sigmaVec):
    """
    Augments training data by adding Gaussian noise and trains multiple Random Forest models.

    Parameters:
        X (np.array): Feature matrix.
        Y (np.array): Target vector.
        rfArray (list of RandomForestClassifier): List of RF models to train.
        nRepetitions (int): Number of times to repeat the augmentation.
        sigmaVec (list or np.array): Standard deviations for the Gaussian noise, one per feature.

    Returns:
        list of RandomForestClassifier: Trained RF models.
    """
    X = np.array(X)
    Y = np.array(Y)
    # Ensure sigmaVec is correctly sized
    if len(sigmaVec) != X.shape[1]:
        raise ValueError("sigmaVec must have the same length as the number of features in X")
    total_samples = X.shape[0]
    nRF = len(rfArray)
    indices = np.random.permutation(total_samples)
    subset_size = total_samples // nRF  # Integer division for indexing
    for i in range(nRF):
        # Each forest is trained on a disjoint subset of the shuffled data
        start_idx = i * subset_size
        end_idx = (i + 1) * subset_size if i < nRF - 1 else total_samples
        X_train = X[indices[start_idx:end_idx]]
        Y_train = Y[indices[start_idx:end_idx]]
        # Pre-allocate arrays to hold the augmented data
        augmented_X = np.zeros((nRepetitions * X_train.shape[0], X_train.shape[1]))
        augmented_Y = np.zeros(nRepetitions * X_train.shape[0], dtype=Y.dtype)
        # Fill the pre-allocated arrays with smeared copies of the subset
        for j in range(nRepetitions):
            noise = np.random.normal(0, sigmaVec, X_train.shape)
            augmented_X[j * X_train.shape[0]:(j + 1) * X_train.shape[0], :] = X_train + noise
            augmented_Y[j * X_train.shape[0]:(j + 1) * X_train.shape[0]] = Y_train
        # Train the RF model on the augmented data
        rfArray[i].fit(augmented_X, augmented_Y)
    return rfArray
```
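A hedged sketch of the predictRFStat extension for an array of forests, computing per-point statistics over all individual tree predictions. The statDictionary keys used here ("mean", "std", "median") are assumptions, and n_jobs is accepted only for interface compatibility; this sketch runs serially:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def predictRFStat(rfArray, X, statDictionary, n_jobs=1):
    """Per-point statistics over all trees of all forests in rfArray (sketch)."""
    X = np.asarray(X)
    # Stack predictions of every tree of every forest: shape (nTrees, nPoints).
    # This is the "array of arrays" structure, in contrast to a single forest.
    treePred = np.stack([tree.predict(X)
                         for rf in rfArray for tree in rf.estimators_])
    result = {}
    if "mean" in statDictionary:
        result["mean"] = treePred.mean(axis=0)
    if "std" in statDictionary:
        result["std"] = treePred.std(axis=0)
    if "median" in statDictionary:
        result["median"] = np.median(treePred, axis=0)
    return result
```

Weighted means, linear fits, and kernel weighting in the local neighborhood would extend this same tree-prediction matrix.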
Augmented Kernel Random Forest: error estimate using predictRFStat

Task Description: We have
Objective: Develop an optimal estimator for
Methodology: Typically, we would calculate the mean and standard deviation of the measurements and use
Let's assume the correlation between the measurements is represented by
Questions:
Solution Approach: To account for the correlation
Using this approach, you can adjust your estimates for correlated data and obtain more accurate and reliable estimates of the parameters and their uncertainties.
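As a concrete illustration of why the correlation matters (assuming, for simplicity, N measurements with common variance sigma² and equal pairwise correlation rho, which is a simplification of the general correlated case discussed above), the variance of their mean is sigma²/N · (1 + (N − 1)·rho), interpolating between the uncorrelated case sigma²/N at rho = 0 and no gain from averaging at rho = 1:

```python
import numpy as np

def meanVarianceCorrelated(sigma, N, rho):
    # Variance of the mean of N measurements with common variance sigma**2
    # and equal pairwise correlation rho: sigma**2/N * (1 + (N-1)*rho).
    return sigma**2 / N * (1.0 + (N - 1) * rho)
```

Equivalently, the effective number of independent measurements is N_eff = N / (1 + (N − 1)·rho), which is what a naive sigma/sqrt(N) error estimate silently overstates when rho > 0.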
Adding an augmented makeAugmentXGBoost
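The same augmentation loop carries over to boosted trees. A sketch using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the name makeAugmentGB, and training each model on the full dataset rather than the per-model subsets of makeAugmentRF, are simplifications of this sketch, not the ticket's design):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def makeAugmentGB(X, Y, modelArray, nRepetitions, sigmaVec, seed=None):
    """Train each boosted-tree model on Gaussian-smeared copies of the data (sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    Y = np.asarray(Y)
    for model in modelArray:
        # nRepetitions smeared copies, same kernel widths as makeAugmentRF
        augX = np.vstack([X + rng.normal(0, sigmaVec, X.shape)
                          for _ in range(nRepetitions)])
        augY = np.tile(Y, nRepetitions)
        model.fit(augX, augY)
    return modelArray
```

For real XGBoost, the same loop applies with its scikit-learn-compatible estimators, which expose the same fit(X, y) interface.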
Augmented Random Forest with Kernel Convolution
For fast prototyping, a smooth and flexible representation of functions is essential. Traditional approaches using trees or forests for function representation typically result in a piecewise constant output, which is a significant limitation.
To achieve a smoother representation, we propose data augmentation by randomly smearing the input vector of explanatory variables (X_n) with a user-defined kernel function (default is Gaussian), denoted as (W_n).
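A toy sketch of the effect of this smearing on a single-feature function (all numbers below, such as the grid, the kernel width, and the repetition count, are illustrative choices, not from the ticket):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0])

# Augmentation: nRep Gaussian-smeared copies of the inputs with kernel width sigma
sigma, nRep = 0.1, 4
augX = np.vstack([X + rng.normal(0, sigma, X.shape) for _ in range(nRep)])
augY = np.tile(y, nRep)

rfPlain = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)
rfSmooth = RandomForestRegressor(n_estimators=40, random_state=0).fit(augX, augY)

grid = np.linspace(-0.8, 0.8, 200).reshape(-1, 1)
# Roughness proxy: largest jump between neighbouring grid predictions.
# Smearing the training inputs is expected to shrink these piecewise-constant steps.
stepPlain = np.abs(np.diff(rfPlain.predict(grid))).max()
stepSmooth = np.abs(np.diff(rfSmooth.predict(grid))).max()
```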
Three functionalities should be implemented:
Training Augmentation: Each tree in the forest should be augmented using a random vector (E_n), enhancing the diversity and robustness of the model.
Smoothed Mean: Calculate a weighted mean of the tree outputs in the local neighborhood to produce a smoother result.
Statistical Analysis of Predictions: Provide functionality to calculate various statistics from different tree predictions, including weighted mean, standard deviation, median, and linear fits (possibly enhanced with kernel methods).