
Augmented Random Forest with Kernel Convolution + predictRFStat extension - JIRAO2-5110 #359

Open
miranov25 opened this issue Jun 28, 2024 · 6 comments

Comments

@miranov25 (Owner)

Augmented Random Forest with Kernel Convolution

For fast prototyping, a smooth and flexible representation of functions is essential. Traditional approaches using trees or forests for function representation typically result in a piecewise constant output, which is a significant limitation.

To achieve a smoother representation, we propose data augmentation by randomly smearing the input vector of explanatory variables X_n with a user-defined kernel function, denoted W_n (default is Gaussian).

Three functionalities should be implemented:

  1. Training Augmentation: Each tree in the forest should be augmented using a random vector E_n, enhancing the diversity and robustness of the model.

  2. Smoothed Mean: Calculate a weighted mean of the tree outputs in the local neighborhood to produce a smoother result.

  3. Statistical Analysis of Predictions: Provide functionality to calculate various statistics from different tree predictions, including weighted mean, standard deviation, median, and linear fits (possibly enhanced with kernel methods).

    • Note: It is unclear whether the scikit-learn trees can provide information about the "box" defining cube properties. I would need additional investigation to figure out if this aspect is feasible.
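For point 3, a minimal sketch of the proposed statistical prediction is possible with scikit-learn alone, since a fitted `RandomForestRegressor` exposes its individual trees via the `estimators_` attribute. The function name `predict_rf_stat` and its `statistics` parameter are illustrative, not an existing API:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def predict_rf_stat(rf, X, statistics=("mean", "std", "median")):
    """Collect per-tree predictions and reduce them to summary statistics.

    Sketch of the proposed predictRFStat: `rf` is a fitted
    RandomForestRegressor, `X` the query points, `statistics` selects
    which reductions over the tree ensemble to return.
    """
    # Shape (n_trees, n_samples): one row of predictions per tree
    tree_pred = np.stack([tree.predict(X) for tree in rf.estimators_])
    reducers = {"mean": np.mean, "std": np.std, "median": np.median}
    return {name: reducers[name](tree_pred, axis=0) for name in statistics}

# Minimal usage: fit on noisy 1D data, then inspect the tree spread
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))
y = np.cos(np.pi * X[:, 0]) + rng.normal(0, 0.1, 500)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
stats = predict_rf_stat(rf, X[:5])
```

The per-tree standard deviation gives a cheap local error estimate; weighted means or local linear fits would replace the simple reducers in the dictionary.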
@miranov25 miranov25 changed the title Augmented Random Forest with Kernel Convolution + predictRFStat extension Augmented Random Forest with Kernel Convolution + predictRFStat extension - JIRAO2-5110 Aug 1, 2024
@miranov25 (Owner, Author)

Link to the JIRA ticket:

https://its.cern.ch/jira/browse/O2-5110

@miranov25 (Owner, Author)

Presentation slide - to add to the RootInteractive full presentation:
https://docs.google.com/presentation/d/1lb3vvhp4iKfLoJXW1nnZwGNp5XlhouItc98PZx1nP44/edit#slide=id.g2789cb2e63c_0_12

@miranov25 (Owner, Author)

Test of the AugmentedRandomForest

The goal is to generate a pandas DataFrame containing points (X_0, X_1, …, X_n) that are uniformly distributed between 0 and 1. Each point should be evaluated using a specified function, such as cos(πX_0 + 2πX_1) + cos(3πX_2 + 4πX_4). The data frame should include noise addition, function evaluation, and predictions using varying kernel widths to assess the model's sensitivity to these parameters.

Requirements:

  1. Data Generation:

    • Generate multi-dimensional points (X_0, X_1, …, X_n), where each X_i is uniformly distributed between 0 and 1.
    • Evaluate these points using a specified function with the ability to pass different mathematical expressions as parameters.
    • Add configurable noise to the function output.
  2. Model Prediction:

    • Implement predictions using different kernel widths, e.g., 0.05, 0.02, 0.01.
    • Extract and group the standard deviations of the predictions for each kernel width to compare against the exact function.
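The data-generation requirement can be sketched as follows; the helper name `make_test_frame` and the use of plain `eval` with a restricted namespace are assumptions for illustration, not part of the RootInteractive API:

```python
import numpy as np
import pandas as pd

def make_test_frame(n_points, n_dim, expression, noise_sigma, seed=0):
    """Generate uniform points X_0..X_{n_dim-1} in [0, 1], evaluate a
    user-supplied expression string, and add Gaussian noise.
    The expression may reference columns X_0, X_1, ... and use cos/sin/pi.
    """
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.uniform(0, 1, size=(n_points, n_dim)),
                      columns=[f"X_{i}" for i in range(n_dim)])
    # Restricted namespace so the expression string sees only the
    # columns and a few whitelisted math symbols (sketch, not hardened)
    namespace = {f"X_{i}": df[f"X_{i}"].to_numpy() for i in range(n_dim)}
    namespace.update({"cos": np.cos, "sin": np.sin, "pi": np.pi})
    df["value"] = eval(expression, {"__builtins__": {}}, namespace)
    df["valueNoise"] = df["value"] + rng.normal(0, noise_sigma, n_points)
    return df

df = make_test_frame(
    1000, 5,
    "cos(pi * X_0 + 2 * pi * X_1) + cos(3 * pi * X_2 + 4 * pi * X_4)",
    noise_sigma=0.1)
```

Passing the function as a string keeps the benchmark configurable, matching the requirement to vary the mathematical expression as a parameter.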

Pseudocode for Implementation:

class AugmentedRandomForestArray:
    # Array of N standard random forests
    # Input parameters include the standard options plus the kernel width

    def fit(self, X, Y):
        # Fit each random forest on kernel-augmented input data

    def predict(self, X):
        # Use the ensemble of forests for prediction
        # Unlike standard RFStat, which uses a single array of trees,
        # this method uses an array of arrays

def predictRFStat(rfArray, X, statDictionary, n_jobs):
    # Detailed statistical prediction (mean, std, median, fits) using the forests

@miranov25 (Owner, Author)

miranov25 commented Oct 27, 2024

Simple version of AugmentedKernelRandomForest

Code used recently to evaluate a smooth function with a kernel whose widths are defined by sigmaVec:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def makeAugmentRF(X, Y, rfArray, nRepetitions, sigmaVec):
    """
    Augments training data by adding Gaussian noise and trains multiple Random Forest models.
    Parameters:
        X (np.array): Feature matrix.
        Y (np.array): Target vector.
        rfArray (list of RandomForestClassifier): List of RF models to train.
        nRepetitions (int): Number of times to repeat the augmentation.
        sigmaVec (list or np.array): Standard deviations for Gaussian noise.
    Returns:
        list of RandomForestClassifier: Trained RF models.
    """
    X = np.array(X)
    Y = np.array(Y)
    # Ensure sigmaVec is correctly sized
    if len(sigmaVec) != X.shape[1]:
        raise ValueError("sigmaVec must have the same length as the number of features in X")
    
    total_samples = X.shape[0]
    nRF = len(rfArray)
    indices = np.random.permutation(total_samples)
    subset_size = total_samples // nRF  # Use integer division for indexing

    for i in range(nRF):
        start_idx = i * subset_size
        end_idx = (i + 1) * subset_size if i < nRF - 1 else total_samples

        X_train = X[indices[start_idx:end_idx]]
        Y_train = Y[indices[start_idx:end_idx]]

        # Pre-allocate arrays to hold augmented data
        augmented_X = np.zeros((nRepetitions * X_train.shape[0], X_train.shape[1]))
        augmented_Y = np.zeros(nRepetitions * X_train.shape[0], dtype=Y.dtype)

        # Fill the pre-allocated arrays
        for j in range(nRepetitions):
            noise = np.random.normal(0, sigmaVec, X_train.shape)
            augmented_X[j * X_train.shape[0]:(j + 1) * X_train.shape[0], :] = X_train + noise
            augmented_Y[j * X_train.shape[0]:(j + 1) * X_train.shape[0]] = Y_train

        # Train the RF model on the augmented data
        rfArray[i].fit(augmented_X, augmented_Y)

    return rfArray
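The kernel-width sensitivity mentioned in the test plan (e.g. widths 0.05, 0.02, 0.01) can be checked with a self-contained variant of the augmentation loop above. Since the target here is a smooth function, this sketch assumes the regression counterpart `RandomForestRegressor` in place of the classifier; all names below are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(2000, 1))
y = np.cos(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 2000)

models = {}
for sigma in (0.05, 0.02, 0.01):
    # Same augmentation idea as makeAugmentRF: repeat the sample and
    # smear X with Gaussian noise of width sigma, keeping y unchanged
    n_rep = 5
    X_aug = np.tile(X, (n_rep, 1)) + rng.normal(0, sigma, (n_rep * len(X), 1))
    y_aug = np.tile(y, n_rep)
    models[sigma] = RandomForestRegressor(
        n_estimators=50, random_state=0).fit(X_aug, y_aug)

# Compare each model against the exact function away from the edges
X_grid = np.linspace(0.05, 0.95, 50).reshape(-1, 1)
truth = np.cos(2 * np.pi * X_grid[:, 0])
rmse = {s: float(np.sqrt(np.mean((m.predict(X_grid) - truth) ** 2)))
        for s, m in models.items()}
```

A wider kernel smooths more aggressively (lower variance, more bias near sharp features), which is exactly the trade-off the grouped standard deviations in the requirements are meant to expose.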

@miranov25 (Owner, Author)

Augmented Kernel Random Forest: error estimate using predictRFStat

Task Description:

We have n measurements of the same variable, which follow a probability distribution characterized by parameters mean (μ) and standard deviation (σ).

Objective:

Develop an optimal estimator for μ and σ and their associated uncertainties.

Methodology:

Typically, we would calculate the mean and standard deviation of the measurements and use std/sqrt(n) to estimate the errors. However, the measurements may be correlated, and we need to account for this correlation in our estimates.

Let's assume the correlation between the measurements is represented by ρ.

Questions:

  1. How can we incorporate the correlation ρ into our estimation of μ and σ?
  2. How do we adjust our uncertainty calculations to reflect this correlation?

Solution Approach:

To account for the correlation ρ between measurements in estimating μ and σ, and to adjust the uncertainty calculations accordingly, you can consider the following approach:

  1. Estimate the Mean μ:

    The mean can still be estimated using the usual sample mean formula:

    μ_hat = (1/n) * sum(x_i for i=1 to n)
    
  2. Adjust for Correlation in Uncertainty Estimation:

    When measurements are correlated, the standard error of the mean (SEM) is no longer simply σ/sqrt(n). Instead, it should be adjusted to:

    SEM = σ * sqrt((1 + (n-1) * ρ) / n)
    

    This formula accounts for the increased uncertainty due to correlation among measurements.

  3. Estimate the Standard Deviation σ:

    The standard deviation in the presence of correlation can be more complex to estimate directly and might require more specific assumptions about the nature of the correlation or the use of computational methods such as bootstrapping to account for correlation effects.

  4. Estimating σ with Computational Methods:

    You might consider using bootstrapping, where each bootstrap sample accounts for correlation structure (e.g., resampling blocks of correlated data), to estimate the distribution of σ and hence its uncertainty.

  5. Implement and Test:

    Implement these calculations in a programming environment that supports statistical analysis, such as Python with libraries like NumPy and SciPy. Validate your implementation by simulating data with known correlation and verifying that your estimators recover the true parameters accurately.

Using this approach, you can adjust your estimations for correlated data and provide more accurate and reliable estimates of parameters and their uncertainties.
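Step 5 can be carried out directly: simulate equicorrelated Gaussian measurements with known ρ and check that the empirical spread of the sample mean matches the adjusted formula SEM = σ·sqrt((1 + (n-1)ρ)/n). The parameter values below are arbitrary test choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma, n_trials = 10, 0.5, 1.0, 20000

# Equicorrelated covariance: sigma^2 on the diagonal, rho*sigma^2 off it
cov = sigma**2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))
samples = rng.multivariate_normal(np.zeros(n), cov, size=n_trials)

# Empirical standard error of the mean vs. the correlation-adjusted formula
empirical_sem = samples.mean(axis=1).std()
predicted_sem = sigma * np.sqrt((1 + (n - 1) * rho) / n)
```

With ρ = 0.5 the adjusted SEM is more than twice the naive σ/sqrt(n), illustrating how strongly correlation inflates the uncertainty of the mean.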

@miranov25 (Owner, Author)

Adding an augmented XGBoost variant: makeAugmentXGBoost
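makeAugmentXGBoost is only named here, so the following is a hypothetical sketch: the augmentation loop from makeAugmentRF is estimator-agnostic, so a single generic function can serve both backends. Passing `xgboost.XGBRegressor` instances (which follow the scikit-learn fit/predict convention) would give the XGBoost variant; the function and variable names are my own, not the repository's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def make_augment_models(X, Y, models, n_repetitions, sigma_vec, seed=0):
    """Estimator-agnostic version of makeAugmentRF: smear X with Gaussian
    noise of widths sigma_vec and fit each model on a disjoint augmented
    subset.  Passing xgboost.XGBRegressor instances would give the
    proposed makeAugmentXGBoost (assumed scikit-learn-compatible .fit)."""
    X, Y = np.asarray(X), np.asarray(Y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    subset = len(X) // len(models)
    for i, model in enumerate(models):
        end = (i + 1) * subset if i < len(models) - 1 else len(X)
        sel = indices[i * subset:end]
        # Repeat the subset and add per-feature Gaussian noise to X only
        X_rep = np.tile(X[sel], (n_repetitions, 1))
        noise = rng.normal(0, sigma_vec, X_rep.shape)
        model.fit(X_rep + noise, np.tile(Y[sel], n_repetitions))
    return models

# Demonstrated with plain sklearn trees as a stand-in estimator
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (400, 2))
Y = X[:, 0] + X[:, 1]
trained = make_augment_models(X, Y,
                              [DecisionTreeRegressor() for _ in range(4)],
                              n_repetitions=3, sigma_vec=[0.05, 0.05])
```

Usage with XGBoost would only change the model list, e.g. `[xgboost.XGBRegressor(n_estimators=200) for _ in range(4)]`.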
