I hope this message finds you well. My name is Roberto, and I've recently been reading your paper, which I found to be a very interesting study.
I have a few questions about the utilization of NGS files containing antibodies categorized as binders and non-binders. In the 'scripts/main.py' file of the repository, there are the following lines of code:
If I understand correctly, these lines of code read NGS-processed files for binders and non-binders from three distinct DMS enrichment rounds, merging them to create two files: 'data/mHER_H3_AgNeg.csv' and 'data/mHER_H3_AgPos.csv'. Subsequently, in a later step, these files are concatenated, and the class ratio is adjusted. Notably, a subset of non-binders is removed during this adjustment, only to be reintroduced incrementally into the training set later.
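To make sure I've understood the pipeline correctly, here is a minimal sketch of what I believe the merging and ratio-adjustment steps do. This is only my reconstruction, not the repository's actual code; the column name `AASeq` and the function names are placeholders I chose for illustration:

```python
import pandas as pd

def merge_rounds(round_dfs, seq_col="AASeq"):
    """Concatenate per-round NGS tables into one table.

    round_dfs: list of DataFrames, one per DMS enrichment round.
    Keeps one row per unique sequence (placeholder column name).
    """
    merged = pd.concat(round_dfs, ignore_index=True)
    return merged.drop_duplicates(subset=seq_col).reset_index(drop=True)

def adjust_class_ratio(pos_df, neg_df, ratio=0.5, seed=0):
    """Downsample non-binders so binders / non-binders == ratio.

    The removed non-binders could then be reintroduced incrementally,
    as I understand happens later in the benchmarking step.
    """
    n_neg = int(len(pos_df) / ratio)
    neg_kept = neg_df.sample(n=min(n_neg, len(neg_df)), random_state=seed)
    neg_removed = neg_df.drop(neg_kept.index)
    return pos_df, neg_kept, neg_removed
```

Please correct me if this deviates from what `scripts/main.py` actually does.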
The incremental imbalancing of the training dataset is only done when benchmarking the different ML algorithms; in the later step, when training only the CNN, the training set is not incremented.
Upon looking into the incremented datasets employed for model training, I observed that, except for the training set with a binders-to-non-binders ratio of 0.5, all other sets contain instances where the same CDR3H sequence appears with different labels. I suspect this happens because an antibody can appear as a binder in, for instance, the first DMS enrichment round but as a non-binder in the last.
This is an image showing an example of two sequences in which this happens:
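For reference, the kind of check I ran to find such sequences looks like the sketch below. The column names `CDR3H` and `label` are placeholders for whatever the actual dataset files use:

```python
import pandas as pd

def conflicting_sequences(df, seq_col="CDR3H", label_col="label"):
    """Return sequences that appear with more than one distinct label.

    Groups the table by sequence and counts distinct labels per group;
    any sequence with >1 distinct label is reported as a conflict.
    """
    label_counts = df.groupby(seq_col)[label_col].nunique()
    return sorted(label_counts[label_counts > 1].index)
```

For example, a table where sequence "X" is labeled both 0 and 1 would be flagged, while a sequence seen only with one label would not.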
- Is this observation correct? Does this labeling duality serve a specific purpose in the training process? If so, could you shed light on the benefits of adopting this approach when benchmarking the ML algorithms?
- Could you share your perspective on the advantages of incorporating data from different rounds of DMS enrichment in the training set?
Thank you in advance for your time and insights.
Kind regards,
Roberto