bugs in TF10 dataset #10

Open
tung-nd opened this issue Apr 25, 2023 · 2 comments
Labels: bug (Something isn't working)

Comments


tung-nd commented Apr 25, 2023

Hi. Thank you for the great work and for publishing the benchmark. I am experimenting with the TF10 task and found that the public dataset has more than 4M data points and the hidden dataset more than 8M. Do you know if this is a bug? These numbers are larger than the number of all possible configurations (4^10 ≈ 1M).
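For reference, a minimal sketch of the check described above: count the rows and the distinct rows, then compare against the 4^10 bound. The toy array `x` is made up for illustration; it assumes the sequences are stored as integer-encoded rows (values 0 to 3, one row per datapoint), which may not match how the real dataset is stored.

```python
import numpy as np

def count_unique_sequences(x):
    """Return (number of rows, number of distinct rows) for a 2-D integer array."""
    x = np.asarray(x)
    unique_rows = np.unique(x, axis=0)
    return x.shape[0], unique_rows.shape[0]

# Toy stand-in for the real data (assumption): 3 distinct length-10
# sequences over a 4-letter alphabet, each stored 4 times.
base = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
])
x = np.repeat(base, 4, axis=0)

n_rows, n_unique = count_unique_sequences(x)
max_possible = 4 ** 10  # every length-10 sequence over 4 letters

print(f"rows: {n_rows}, distinct sequences: {n_unique}, upper bound: {max_possible}")
```

If the number of rows exceeds the upper bound, or simply exceeds the number of distinct rows, some sequences must appear more than once.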

@TsingQAQ

Hi @tung-nd, have you found anything since? It seems both TF8 and TF10 contain duplicate inputs. For TF8 I can safely remove the duplicates because their outputs are exactly the same, but for TF10 the duplicates have clearly different outputs.
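One way to tell the two cases apart is to measure, for each distinct input, how far apart its outputs are. The following is a rough sketch on made-up arrays, not the actual datasets; `duplicate_output_spread` is a hypothetical helper name, and the row-per-datapoint layout of `x` and `y` is an assumption.

```python
import numpy as np

def duplicate_output_spread(x, y):
    """For each distinct row of x, return max(y) - min(y) over its copies."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float).reshape(-1)
    unique_rows, inverse = np.unique(x, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard: some NumPy versions return a 2-D inverse here
    n = unique_rows.shape[0]
    y_max = np.full(n, -np.inf)
    y_min = np.full(n, np.inf)
    np.maximum.at(y_max, inverse, y)  # running maximum per distinct row
    np.minimum.at(y_min, inverse, y)  # running minimum per distinct row
    return unique_rows, y_max - y_min

# Toy data (assumption): the first input repeats with identical outputs,
# the second repeats with different outputs.
x = np.array([[0, 1], [0, 1], [2, 3], [2, 3]])
y = np.array([1.0, 1.0, 0.5, 0.9])

rows, spread = duplicate_output_spread(x, y)
consistent = np.isclose(spread, 0.0)
print(rows[~consistent])  # inputs whose duplicates disagree
```

A spread of zero everywhere means duplicates can simply be dropped; a nonzero spread means dropping them would discard information.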

brandontrabucco self-assigned this Jan 29, 2024
brandontrabucco (Owner) commented Jan 29, 2024

Hello tung-nd and TsingQAQ,

Thanks for bringing this to my attention! After inspecting the TFBind10 dataset, it appears that each 10-mer sequence is evaluated 4 times to compute the ddG score. In the current benchmark, each trial was stored as an additional datapoint.

Now that this repetition is known, the 4 trials for each sequence should be averaged and treated as a single datapoint so that there is no overlap between training and testing datasets. I'm working on a patch for this in the form of a TFBind10-Exact-v1 task.
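The averaging step can be sketched roughly as below. This is only an illustration on made-up arrays, not the actual patch; `average_repeated_trials` is a hypothetical helper name, and it assumes `x` holds one integer-encoded sequence per row with the matching score in `y`.

```python
import numpy as np

def average_repeated_trials(x, y):
    """Collapse repeated rows of x into one row each, averaging their y values."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float).reshape(-1)
    unique_rows, inverse = np.unique(x, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard: some NumPy versions return a 2-D inverse here
    n = unique_rows.shape[0]
    counts = np.bincount(inverse, minlength=n)            # trials per distinct row
    sums = np.bincount(inverse, weights=y, minlength=n)   # summed score per distinct row
    return unique_rows, sums / counts

# Toy data (assumption): two sequences, two trials each, interleaved.
x = np.array([[0, 1], [2, 3], [0, 1], [2, 3]])
y = np.array([1.0, 0.5, 3.0, 1.5])

x_avg, y_avg = average_repeated_trials(x, y)
print(x_avg, y_avg)
```

Averaging before splitting into training and testing sets is what guarantees that no sequence lands on both sides of the split.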

The original task with duplicate datapoints will continue to be served through TFBind10-Exact-v0, which is the current id for that task in design-bench.

I will add a similar patch for TFBind8.

-Brandon

brandontrabucco added the bug label Jan 29, 2024

3 participants