
5 Instances in Verified Fail for Gold Patch #267

Open
wistuba opened this issue Nov 30, 2024 · 4 comments

wistuba (Contributor) commented Nov 30, 2024

Describe the bug

There are 5 cases in which the gold patch fails on Verified:

  • astropy__astropy-7606
  • astropy__astropy-8707
  • astropy__astropy-8872
  • matplotlib__matplotlib-20488
  • django__django-10097

The reason why astropy__astropy-7606 fails seems to be that Verified was not updated (see #223). It works when using princeton-nlp/SWE-bench instead.

Steps/Code to Reproduce

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 5 \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --run_id validate-gold \
    --instance_ids astropy__astropy-7606 matplotlib__matplotlib-20488 django__django-10097 astropy__astropy-8872 astropy__astropy-8707 \
    --cache_level instance

Expected Results

All 5 problems are resolved

Actual Results

None of them is resolved

System Information

latest version on main

@wistuba wistuba added the bug Something isn't working label Nov 30, 2024
@wistuba wistuba changed the title 5 Instances in Verified Fail 5 Instances in Verified Fail for Gold Patch Nov 30, 2024
wistuba (Contributor, Author) commented Nov 30, 2024

I was looking into it. The matplotlib problem could be something on my end:

Test failing: lib/matplotlib/tests/test_image.py::test_https_imread_smoketest
Reason: urllib.error.HTTPError: HTTP Error 403: Forbidden

This is a simple test that tries to read https://matplotlib.org/1.5.0/_static/logo2.png. Things work fine on my machine when I set up the test manually, so it could be that the urllib request is getting blocked. Something similar has happened before with django (that case was fixable, though, since the request was part of the swebench code rather than the benchmark itself). A quick way to check whether the server is rejecting the request is sketched below.
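
Here is a minimal sketch for probing the URL with urllib, first with its default User-Agent and then with a browser-like one. The User-Agent theory is just my guess at the cause, not something I've confirmed:

import urllib.error
import urllib.request

# URL fetched by the failing test
URL = "https://matplotlib.org/1.5.0/_static/logo2.png"

# Try the default urllib User-Agent first, then a browser-like one.
# If only the second request succeeds, the server is probably
# filtering on User-Agent (an assumption, not a confirmed root cause).
for headers in ({}, {"User-Agent": "Mozilla/5.0"}):
    label = headers.get("User-Agent", "default urllib UA")
    req = urllib.request.Request(URL, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{label}: HTTP {resp.status}, {len(resp.read())} bytes")
    except urllib.error.HTTPError as e:
        print(f"{label}: HTTP Error {e.code}: {e.reason}")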

john-b-yang (Member) commented
I ran the provided script, but I observed that the instances all pass.

(sweb) john-b-yang@bitbop:~/swe-bench/private$ ./test.sh
/opt/miniconda3/envs/sweb/lib/python3.10/runpy.py:126: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Using gold predictions - ignoring predictions_path
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.30k/3.30k [00:00<00:00, 29.3MB/s]
test-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.10M/2.10M [00:00<00:00, 31.7MB/s]
Generating test split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 29878.22 examples/s]
Running 5 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
Total environment images to build: 2
2 ran successfully, 0 failed: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:58<00:00, 59.15s/it]
All environment images built successfully.
Running 5 instances...
5 ran successfully, 0 failed: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:16<00:00, 63.25s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 5
Instances submitted: 5
Instances completed: 0
Instances incomplete: 0
Instances resolved: 0
Instances unresolved: 0
Instances with empty patches: 0
Instances with errors: 5
Unstopped containers: 0
Unremoved images: 5
Report written to gold.validate-gold.json
(sweb) john-b-yang@bitbop:~/swe-bench/private$ cat test.sh
python -m swebench.harness.run_evaluation \
 --predictions_path gold \
 --max_workers 5 \
 --dataset_name princeton-nlp/SWE-bench_Verified \
 --run_id validate-gold \
 --instance_ids astropy__astropy-7606 matplotlib__matplotlib-20488 django__django-10097 astropy__astropy-8872 astropy__astropy-8707 \
 --cache_level instance

john-b-yang (Member) commented
Thanks for the details on the matplotlib instance, good to know. For the astropy instances, I'm not sure there's an issue: I started containers for the respective images as well, and pytest seems to run fine.

wistuba (Contributor, Author) commented Dec 10, 2024

I've just tried again with the latest code. I can confirm that astropy 8707 and 8872 work now. The problem with 7606 remains: it is resolved when using SWE-bench but not when using Verified. I deleted ~/.cache/huggingface/datasets before running the harness, and I checked the website; the row doesn't seem to have been updated. The diff between the two datasets can be checked directly, as sketched below.
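
For reference, a small sketch of how one could compare the 7606 row across the two datasets. It assumes the standard SWE-bench schema with an instance_id column; which fields actually differ is exactly what the check would reveal:

from datasets import load_dataset

INSTANCE_ID = "astropy__astropy-7606"

def get_row(dataset_name):
    # Compare the test split of each dataset.
    ds = load_dataset(dataset_name, split="test")
    return next(row for row in ds if row["instance_id"] == INSTANCE_ID)

verified = get_row("princeton-nlp/SWE-bench_Verified")
original = get_row("princeton-nlp/SWE-bench")

# Report which shared fields differ; if Verified had been updated per
# #223, the relevant fields should match the original dataset.
for key in verified:
    if key in original and verified[key] != original[key]:
        print(f"{key} differs between Verified and SWE-bench")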
