
Update ComBat.R #58

Open

elies-ramon wants to merge 1 commit into master
Conversation

elies-ramon

@jtleek @zhangyuqing
I am currently working with machine learning models for metabolomic data. In that context, a model is first trained on a training set and its prediction performance is then assessed on an independent test set. It is therefore key to prevent information leaking from the test set into the training set, including through data-driven pre-processing or normalization parameters. For this reason, I modified the ComBat function so that it now allows the user to adjust a training set and a test set for batch effects separately.

First, ComBat is applied to dat as usual (dat is always assumed to be the training set). Then, dat_test (the test set) is adjusted using the coefficients computed from the training data. The training and test data should have at least some batches in common. The output of the ComBat function is now a list with two elements: first the adjusted training data and second the adjusted test data. If dat_test is NULL, the output is identical to that of the current version of ComBat.
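For illustration, here is a minimal sketch of how the modified function might be called, based only on the description above. The simulated matrices and batch labels are made up, and batch_test is an assumed argument name for the test-set batch labels, which this PR text does not spell out:

```r
library(sva)

## Simulated example: 1000 features, 40 training samples and 10 test samples.
set.seed(1)
dat        <- matrix(rnorm(1000 * 40), nrow = 1000)  # training set
dat_test   <- matrix(rnorm(1000 * 10), nrow = 1000)  # independent test set
batch      <- rep(c("A", "B", "C", "D"), each = 10)  # training batches
batch_test <- rep(c("A", "B"), each = 5)             # shares batches with training

## Fit ComBat on the training data only, then adjust the test set with the
## coefficients estimated from training. `batch_test` is a hypothetical
## argument name; the PR description only names `dat` and `dat_test`.
res <- ComBat(dat = dat, batch = batch,
              dat_test = dat_test, batch_test = batch_test)
adjusted_train <- res[[1]]  # element 1: adjusted training data
adjusted_test  <- res[[2]]  # element 2: adjusted test data
```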

If you find this modified ComBat function useful, feel free to merge it into the devel version of sva or to modify it further.

Thank you!

ComBat function modification: it now allows the user to adjust a training set and a test set for batch effects separately. This update is useful for machine learning applications.
@wevanjohnson
Contributor

wevanjohnson commented Mar 2, 2023 via email

@elies-ramon
Author

Thank you for your quick answer! Let me try to explain the differences between the reference-batch approach and my update, as the two do not serve exactly the same purpose.

As you say in the paper, the reference batch is useful in scenarios where one batch is of better quality or is considered the “baseline”, so the rest of the data/batches are adjusted to it. In that case, the reference-batch data should be used as the training data and the rest as the test data. With that approach, the independence of the test set is guaranteed.
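For reference, the existing reference-batch option in sva::ComBat is used roughly like this (ref.batch is a real argument of ComBat; the simulated data here are made up):

```r
library(sva)

set.seed(1)
dat   <- matrix(rnorm(1000 * 50), nrow = 1000)  # 1000 features, 50 samples
batch <- rep(1:5, each = 10)                    # 5 batches of 10 samples

## All batches are adjusted towards batch 1, which acts as the baseline;
## the reference batch itself is left essentially unchanged.
adjusted <- ComBat(dat = dat, batch = batch, ref.batch = 1)
```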

However, sometimes it is not possible to choose a reference batch. For instance, one may observe a striking batch effect (clearly traced to a systematic origin) that affects a whole dataset, yet no batch is superior, as they are all equally “wrong”. Furthermore, if the final goal of the study is to estimate the performance of a machine learning model on some data, or to compare the performance of several ML methods, it is very important that the training data has a reasonable size so as not to underestimate the performance. In ML, usually 75–80% of the data is reserved for training the model. Unless the total amount of data is very large, selecting one of the batches at random as the reference would lead to a very small training set. I modified the ComBat code after seeing these two issues arise twice in my research group, so I thought this update could be useful to more people.

In the end, I consider these to be two different ways of managing the available data. If you have 50 samples in 5 batches of equal size, you may use one batch as the training set (for example batch 1, i.e. samples 1 to 10) to fit ComBat. In my approach, the training set would instead consist of 40 samples (8 samples coming from each batch). The ComBat model generated with this data would then also be used to adjust the test data (the remaining 2 samples per batch).
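As a concrete sketch of that split, using base R only (the 50-sample layout matches the example above):

```r
## 50 samples in 5 batches of 10: keep 8 per batch for training, 2 for test.
batch <- rep(1:5, each = 10)
set.seed(42)
train_idx <- unlist(lapply(split(seq_along(batch), batch),
                           function(i) sample(i, 8)))
test_idx  <- setdiff(seq_along(batch), train_idx)

length(train_idx)  # 40 training samples (8 per batch)
length(test_idx)   # 10 test samples   (2 per batch)
```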

Hope that this helps,

Elies

@wevanjohnson
Contributor

wevanjohnson commented Mar 3, 2023 via email
