Failing benchmark instances #167
Comments
Thanks for making an issue about this!
Hi @aorwall! Thanks so much for the kind words; swe-bench-docker was a huge inspiration for our release. Really appreciate all the past and ongoing work w/ Moatless + SWE-bench evals 😄 Ok, so regarding the issue...
I'm actively working on 2, and have gotten some help from Stanford folks on this as well. I think you can expect a dataset update that addresses these problems by the end of next week at the latest!
Looks like #166 fixed the Django issues 👍
I'm currently working on a hosted solution where I'm running benchmarks on virtual machines in Azure. I've gotten very stable results, with no instances failing (except for …).

One thing that would be worth investigating is the performance of ….

And ping me on Discord if you'd like to try my solution. I might have something up and running in the next few days...
I think I was a bit too quick there. It seems like there are still some shaky tests in ….
For other sympy tests, I haven't found a solution. It's always recursion depth errors. I've tried increasing the recursion depth with ….

So either we can comment out some flaky tests or try with different dependencies. The tests for ….

I've set up my hosted version for testing at eval.moatless.ai. Feel free to try it out. The report of the latest gold patch run can be found here.
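For reference, here is a minimal sketch of one way to raise Python's recursion limit before running a test suite. The comment above doesn't show the exact mechanism that was tried, so this is only an assumption about the general approach, not the setup used in the runs:

```python
# Minimal sketch (assumption): raising CPython's recursion limit before
# running the sympy tests. This is not the exact mechanism tried above,
# only an illustration of the general idea.
import sys

# CPython defaults to a limit of 1000 frames; deeply recursive sympy
# routines can exceed that on some inputs.
sys.setrecursionlimit(10_000)

# ... run the test suite here, e.g. via pytest.main([...]) ...
```

Raising the limit only helps if the recursion actually terminates; if the depth errors come from genuinely unbounded recursion (or from exhausting the C stack), commenting out the flaky tests or pinning different dependency versions, as suggested above, is the safer route.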
Hmm, ok yeah, I'm definitely also noticing this recursion depth limit issue. Currently, it does look like most of the shaky tests are pass-to-pass tests. I'm working with another team who've been able to identify these shaky tests by just running multiple times w/ the gold patch and seeing if there are any tests that don't consistently pass. I think the resolution will likely just be eliminating shaky pass-to-pass tests from the SWE-bench dataset. This has not happened yet, but I think we'll try to make it happen by the end of the month.

Also, the hosted testing version looks beautiful! We're working on something similar, so it's great to see a version of it already out 😄
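A rough sketch of that "run the gold patch multiple times and diff the outcomes" idea, purely illustrative; `run_gold_patch_tests` is a hypothetical helper, not part of the SWE-bench harness:

```python
# Illustrative sketch: flag tests that pass in some gold-patch runs but not
# all of them. `run_gold_patch_tests` is a hypothetical callable that returns
# the set of test IDs that passed in one evaluation run.
from collections import Counter
from typing import Callable, Set


def find_shaky_tests(run_gold_patch_tests: Callable[[], Set[str]],
                     n_runs: int = 5) -> Set[str]:
    pass_counts: Counter = Counter()
    for _ in range(n_runs):
        for test_id in run_gold_patch_tests():
            pass_counts[test_id] += 1
    # A test is "shaky" if it passed at least once but not in every run;
    # tests that fail every time are consistently broken, not flaky.
    return {t for t, c in pass_counts.items() if c < n_runs}
```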
Great job with the new containerized evaluation tool! I've run it a couple of times on the gold patches on SWE-bench Lite, and overall it gives more stable results than my swe-bench-docker setup. There are a few instances that fail intermittently, though. Some I recognize from tests in swe-bench-docker, and some are new. None of them fail in 100% of the runs.
Django instances
In all the failing Django instances I've checked, the tests seem to pass but are marked as failed because other logs are being printed in the test results.
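To illustrate that failure mode (this is an assumed, simplified line-oriented parser, not the actual SWE-bench log parsing): when extra log output gets interleaved with the Django test runner's status lines, a strict per-line pattern can stop matching, and a passing test ends up counted as failed.

```python
# Assumed, simplified status-line parser; the real SWE-bench log parsing is
# more involved. This only shows why interleaved log output can flip a
# passing test to "failed".
import re

STATUS_RE = re.compile(r"^(test_\w+ \([\w.]+\)) \.\.\. (ok|FAIL|ERROR)$")

clean_line = "test_example (app.tests.SomeTests) ... ok"
# Stray application output ends up on the same line as the test's status,
# so the strict pattern no longer matches.
noisy_line = "test_example (app.tests.SomeTests) ... some stray log output ok"

print(bool(STATUS_RE.match(clean_line)))  # True  -> counted as passing
print(bool(STATUS_RE.match(noisy_line)))  # False -> looks like a failure
```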
Here's an example of a test that is marked as failed:
The same test in a successful test output log:
Other instances
In the following instances, different tests fail intermittently, and I haven't found the root cause. I saw the same issues in swe-bench-docker with matplotlib and sympy instances. I haven't had issues with psf__requests, though.
Have you experienced the same issues? Would it also be possible for you to share your `run_instance_logs` somewhere, to compare against your successful evaluation runs? Would be nice to nail this once and for all :)

I've run the benchmarks on Ubuntu 22 VMs with 16 cores on Azure (`max_workers = 14`).