
How to access the unlisted datasets in PWC? #24

Open

zhimin-z opened this issue Apr 23, 2024 · 13 comments

Comments

@zhimin-z

zhimin-z commented Apr 23, 2024

I discovered that the main dataset page mentions the availability of up to 9,753 machine-learning datasets:

[screenshot: dataset count on the main dataset page]

However, when navigating from page 1 to page 100, I found no way to access the datasets not listed within the first 100 pages. Even when I manually requested pages beyond 100, the website returned the same dataset list as page 100.

[screenshot: page-100 dataset listing]

Could you please advise whether there is a way to retrieve datasets beyond the first 100 pages? Your assistance would be greatly appreciated. @alefnula @lambdaofgod @rstojnic @mkardas

@rstojnic
Contributor

We temporarily disabled dataset browsing because someone was DDoS-ing the website with a bot. It looks like they are running a broken bot that tries all kinds of nonsensical dataset filters, which is why we've disabled the filters for now. Browsing should be back shortly, once we fully identify and block them.

@zhimin-z
Author

zhimin-z commented Apr 23, 2024

> We temporarily disabled dataset browsing because someone was DDOS-ing the website using a bot. It looks like they are running a broken bot that's trying all kinds of nonsensical dataset filters, which is why we've disabled them for now. Should be back shortly after we fully identify and block them.

Dear @rstojnic ,

I hope this message finds you well. After reading your comment, I wanted to reach out and clarify that the activity you've observed may be related to my research efforts, though I am not certain. I've been collecting dataset information for research on dataset evolution, which involves gathering data from various sources, including your platform. Here is my code:

```python
from paperswithcode import PapersWithCodeClient

client = PapersWithCodeClient(token=XXXX)

page = 1
scrape = True
dataset_full = {}

# Walk the paginated dataset listing until a request fails
# (e.g. past the last page).
while scrape:
    try:
        dataset_page = client.dataset_list(page=page)
    except Exception:
        scrape = False
    else:
        for dataset in dataset_page.results:
            dataset_full[dataset.id] = {
                'name': dataset.name,
                'url': dataset.url,
            }
    page += 1
```
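As an aside, a gentler variant of this loop would pause between requests and stop on an empty page rather than only on an exception. This is a sketch, assuming the `dataset_list` client method and the `id`/`name`/`url` attributes shown above:

```python
import time


def collect_datasets(client, delay=1.0, max_pages=None):
    """Collect dataset metadata page by page, pausing between requests."""
    datasets = {}
    page = 1
    while max_pages is None or page <= max_pages:
        try:
            batch = client.dataset_list(page=page)
        except Exception:
            break  # past the last page, or a transient server error
        if not batch.results:
            break  # empty page: nothing more to fetch
        for ds in batch.results:
            datasets[ds.id] = {'name': ds.name, 'url': ds.url}
        page += 1
        time.sleep(delay)  # be gentle on the server
    return datasets
```

The `delay` keeps the request rate well below anything that could look like abusive traffic.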

Please note that my intentions are purely academic, and I sincerely apologize for any unintended strain my actions may have placed on your website. I can assure you that I am not engaged in any malicious activity, such as DDOS-ing.

Would there be a more appropriate method for me to collect this dataset information for research purposes without causing any issues to your platform? Your guidance and support in this matter would be greatly appreciated.

Thank you for your understanding, and I look forward to hearing from you.

Best regards,
Jimmy

@zhimin-z
Author

The full script additionally saves the collected metadata to disk:

```python
import pickle

with open(f'{path_meta}/dataset_full.pkl', 'wb') as f:
    pickle.dump(dataset_full, f)
```

I did write an email clarifying this a few days ago, but there has been no reply yet, so in the meantime I am collecting the data through this API.

@rstojnic
Contributor

Hi @zhimin-z, there is no need to scrape the website; all the data is available at https://github.com/paperswithcode/paperswithcode-data

@rstojnic
Contributor

The repo itself looks old because it's just a README. The links point back to our S3 bucket, which should be updated every day.
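For reference, a minimal sketch of fetching and parsing one of those dumps. The file names follow the paperswithcode-data README; the base URL is an assumption and may have changed:

```python
import gzip
import io
import json
import urllib.request

# Base URL is an assumption based on the links in the
# paperswithcode-data README; verify against the current README.
BASE_URL = "https://production-media.paperswithcode.com/about/"
DUMPS = ["datasets.json.gz", "evaluation-tables.json.gz"]


def load_gzipped_json(raw_bytes):
    """Decompress a .json.gz payload and parse it into Python objects."""
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8") as f:
        return json.load(f)


def fetch_dump(name):
    """Download one dump from the (assumed) bucket URL and parse it."""
    with urllib.request.urlopen(BASE_URL + name) as resp:
        return load_gzipped_json(resp.read())


if __name__ == "__main__":
    records = fetch_dump(DUMPS[0])
    print(f"{len(records)} dataset records")
```

Downloading the daily dump once is far cheaper for both sides than paging through the API.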

@zhimin-z
Author

zhimin-z commented Apr 23, 2024

> The repo itself is old because it's just a README. The links point back to our S3 bucket that should be updated every day.

Thanks, but I found that some datasets are not available in the downloadable JSON files. For example, HELM and HEIM are not in the Datasets file. That is why I initially thought these files might be obsolete. What are the criteria for generating the Datasets and Evaluation tables files?

@rstojnic
Contributor

They should all be there. If they are not, the export might be stuck. @alefnula @andrewkuanop

@zhimin-z
Author

zhimin-z commented Apr 23, 2024

There seem to be a lot of leaderboards missing from the Evaluation tables file compared with the website.

Here is what the Evaluation tables file gives (9,238 datasets in total):

[screenshot: dataset count in the Evaluation tables file]

Here is what I collected (within the 100 displayable pages of the PWC datasets, 4,800 datasets in total):

  1. Number of evaluation records from paper mining:
     [screenshot]
  2. Number of evaluation records from model cards:
     [screenshot]

Overall, at least tens of thousands of records are missing from your online archive, and this does not even account for evaluations from datasets beyond the first 100 pages of the PWC website. @rstojnic @alefnula @andrewkuanop

@andrewkuanop

andrewkuanop commented Apr 25, 2024 via email

@zhimin-z
Author

zhimin-z commented Apr 25, 2024

> https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum

Thanks for your reply, @andrewkuanop

For evaluation tables, I found https://paperswithcode.com/sota/text-classification-on-glue is available in the Evaluation tables, but https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum is still not.

For datasets, I found that both HELM and HEIM are still missing from the Datasets file.

I think the issue still persists...

@andrewkuanop

andrewkuanop commented Apr 26, 2024 via email

@zhimin-z
Author

zhimin-z commented May 26, 2024

> https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum

Thanks, @andrewkuanop

After checking, I found that the dataset issue is solved: both HELM and HEIM are in the Datasets file now.

However, while https://paperswithcode.com/sota/text-classification-on-glue is available in the Evaluation tables file, https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum still is not.

So the issue still persists for specific evaluation tables.

@zhimin-z
Author

Hmm... still not available. Any further update?
