
5 Instances in Verified Fail for Gold Patch #267

Open
wistuba opened this issue Nov 30, 2024 · 4 comments

wistuba (Contributor) commented Nov 30, 2024

Describe the bug

There are 5 cases in which the gold patch fails on Verified:

  • astropy__astropy-7606
  • astropy__astropy-8707
  • astropy__astropy-8872
  • matplotlib__matplotlib-20488
  • django__django-10097

The reason why astropy__astropy-7606 fails seems to be that Verified was not updated (see #223). It works when using princeton-nlp/SWE-bench instead.

Steps/Code to Reproduce

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 5 \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --run_id validate-gold \
    --instance_ids astropy__astropy-7606 matplotlib__matplotlib-20488 django__django-10097 astropy__astropy-8872 astropy__astropy-8707 \
    --cache_level instance

Expected Results

All 5 problems are resolved

Actual Results

None of them is resolved

System Information

latest version on main

@wistuba wistuba added the bug Something isn't working label Nov 30, 2024
@wistuba wistuba changed the title 5 Instances in Verified Fail 5 Instances in Verified Fail for Gold Patch Nov 30, 2024
wistuba (Contributor, Author) commented Nov 30, 2024

I was looking into it. The matplotlib problem could be something on my end:

Test failing: lib/matplotlib/tests/test_image.py::test_https_imread_smoketest
Reason: urllib.error.HTTPError: HTTP Error 403: Forbidden

This is a simple test that tries to read https://matplotlib.org/1.5.0/_static/logo2.png. Things work fine on my machine when I set up the test manually, so it could be that the urllib request is getting blocked. Something similar has happened before with django (that case was fixable, though, since the request was part of the swebench code rather than the benchmark itself). A quick way to check whether the server is rejecting the request is sketched below.
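
Here is a minimal sketch for probing the URL with urllib, first with its default User-Agent and then with a browser-like one. The User-Agent theory is just my guess at the cause, not something I've confirmed:

import urllib.error
import urllib.request

# URL fetched by the failing test
URL = "https://matplotlib.org/1.5.0/_static/logo2.png"

# Try the default urllib User-Agent first, then a browser-like one.
# If only the second request succeeds, the server is probably
# filtering on User-Agent (an assumption, not a confirmed root cause).
for headers in ({}, {"User-Agent": "Mozilla/5.0"}):
    label = headers.get("User-Agent", "default urllib UA")
    req = urllib.request.Request(URL, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{label}: HTTP {resp.status}, {len(resp.read())} bytes")
    except urllib.error.HTTPError as e:
        print(f"{label}: HTTP Error {e.code}: {e.reason}")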

john-b-yang (Member) commented
I ran the provided script, but I observed that the instances all pass.

(sweb) john-b-yang@bitbop:~/swe-bench/private$ ./test.sh
/opt/miniconda3/envs/sweb/lib/python3.10/runpy.py:126: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Using gold predictions - ignoring predictions_path
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.30k/3.30k [00:00<00:00, 29.3MB/s]
test-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.10M/2.10M [00:00<00:00, 31.7MB/s]
Generating test split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 29878.22 examples/s]
Running 5 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
Total environment images to build: 2
2 ran successfully, 0 failed: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:58<00:00, 59.15s/it]
All environment images built successfully.
Running 5 instances...
5 ran successfully, 0 failed: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:16<00:00, 63.25s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 5
Instances submitted: 5
Instances completed: 0
Instances incomplete: 0
Instances resolved: 0
Instances unresolved: 0
Instances with empty patches: 0
Instances with errors: 5
Unstopped containers: 0
Unremoved images: 5
Report written to gold.validate-gold.json
(sweb) john-b-yang@bitbop:~/swe-bench/private$ cat test.sh
python -m swebench.harness.run_evaluation \
 --predictions_path gold \
 --max_workers 5 \
 --dataset_name princeton-nlp/SWE-bench_Verified \
 --run_id validate-gold \
 --instance_ids astropy__astropy-7606 matplotlib__matplotlib-20488 django__django-10097 astropy__astropy-8872 astropy__astropy-8707 \
 --cache_level instance

john-b-yang (Member) commented
Thanks for the details on the matplotlib instance, good to know. For the astropy instances, I'm not sure there's an issue: I started containers for the respective images as well, and pytest seems to run fine.

wistuba (Contributor, Author) commented Dec 10, 2024

I've just tried again with the latest code. I can confirm that astropy 8707 and 8872 work now. The problem with 7606 remains: it is resolved when using SWE-bench but not when using Verified. I deleted ~/.cache/huggingface/datasets before running the harness, and I checked the website; the row doesn't seem to have been updated. The diff between the two datasets can be checked directly, as sketched below.
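
For reference, a small sketch of how one could compare the 7606 row across the two datasets. It assumes the standard SWE-bench schema with an instance_id column; which fields actually differ is exactly what the check would reveal:

from datasets import load_dataset

INSTANCE_ID = "astropy__astropy-7606"

def get_row(dataset_name):
    # Compare the test split of each dataset.
    ds = load_dataset(dataset_name, split="test")
    return next(row for row in ds if row["instance_id"] == INSTANCE_ID)

verified = get_row("princeton-nlp/SWE-bench_Verified")
original = get_row("princeton-nlp/SWE-bench")

# Report which shared fields differ; if Verified had been updated per
# #223, the relevant fields should match the original dataset.
for key in verified:
    if key in original and verified[key] != original[key]:
        print(f"{key} differs between Verified and SWE-bench")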
