Describe the issue
Hi, sorry if I missed this in your paper or somewhere in your GitHub repository.
I tried to execute all existing test cases for a given SWE-bench project, for example by using the Docker image (which I assume the SWE-bench team uploaded) from https://hub.docker.com/r/swebench/sweb.eval.x86_64.psf_1776_requests-1142:
Here is the full list of test results:
The first question is: why do not all test cases pass? I found that this is also the case for several other instances.
The second question is: what is your reasoning for not allowing the use of `PASS_TO_PASS` test cases in the evaluation of SWE-bench? This question is related to the first one. The test cases that pass before the issue is fixed shouldn't be a secret, and having existing tests fail makes it harder to determine whether the failures are caused by the bug raised in the issue or by a problem in SWE-bench itself.
I come from an automated program repair background, where the usual assumption is that the existing test cases pass, so they can be used as regression tests to check that a patch did not break existing functionality.
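To make the regression-test assumption concrete, here is a minimal sketch of the check I have in mind. The test names, the status strings, and the `find_regressions` helper are all hypothetical and are not part of SWE-bench; the sketch only assumes that test results are available as a mapping from test name to status before and after a patch.

```python
def find_regressions(before, after):
    """Return tests that passed before the patch but do not pass after it.

    `before` and `after` are hypothetical dicts mapping test name -> status.
    A test missing from `after` is treated as a regression as well.
    """
    return sorted(
        name
        for name, status in before.items()
        if status == "PASSED" and after.get(name) != "PASSED"
    )


# Hypothetical results, for illustration only.
before = {"test_a": "PASSED", "test_b": "FAILED", "test_c": "PASSED"}
after = {"test_a": "PASSED", "test_b": "PASSED", "test_c": "FAILED"}

print(find_regressions(before, after))  # ['test_c']
```

If some tests already fail before the patch (like `test_b` above), they cannot serve as regression tests, which is why knowing the pre-patch passing set matters.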
Suggest an improvement to documentation
No response