Error happened with load data #314

StefanIsSmart · 2024-09-16T02:33:18Z

Describe the bug
The bug occurs while loading the data.

To Reproduce
Steps to reproduce the behavior:

from tdc.single_pred import Yields
data = Yields(name = 'Buchwald-Hartwig')
split = data.get_split()

Expected behavior

Get a dataframe.

Screenshots

Environment:

OS: Linux
Python version: 3.8
TDC version: 0.4.1
Any other relevant information: None

Additional context

flogrammer · 2024-10-02T18:52:01Z

I'm having the same issue

Any ideas?

flogrammer · 2024-10-02T18:58:08Z

The problem seems to be that the downloaded .tab file is empty (zinc.tab in my case, since I was loading the zinc dataset).
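To check whether other downloads are affected, a minimal sketch like the following lists zero-byte files in the data directory. The helper name find_empty_files is hypothetical, and the "./data" default is an assumption about where the files were saved; pass your own path if it differs.

```python
from pathlib import Path


def find_empty_files(directory="./data"):
    """Return the zero-byte files under `directory`, sorted by path.

    Hypothetical diagnostic helper, not part of TDC. The "./data"
    default is an assumption about the download location.
    """
    root = Path(directory)
    if not root.is_dir():
        # Nothing has been downloaded to this location yet.
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0
    )
```

Any file this returns (for example a zero-byte zinc.tab) should be deleted before retrying, in case the loader would otherwise reuse it.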

mxfly14 · 2024-10-03T20:00:45Z

Hi,
I have the same issue (same message and an empty .tab file). When I run it in my terminal, I get this:

Maybe it is a bad request to https://dataverse.harvard.edu/ ?

Arslan-Masood · 2024-10-04T09:59:46Z

If you just want to download the data, you can download it directly from here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/21LKWG

jepdavidson · 2024-10-07T09:45:02Z

Hi,

I am seeing the same (misleading) "TDC is hosted in Harvard Dataverse and it is currently under maintenance" message.
As @flogrammer and @mxfly14 said, this appears to be due to empty files being retrieved.

The underlying cause (in my environment at least) is that the GET request returns a 202 response instead of 200.
Here's the code for the dataverse_download function (from tdc.utils.load):

def dataverse_download(url, path, name, types, id=None):
    """dataverse download helper with progress bar

    Args:
        url (str): the url of the dataset
        path (str): the path to save the dataset
        name (str): the dataset name
        types (dict): a dictionary mapping from the dataset name to the file format
    """
    if id is None:
        save_path = os.path.join(path, name + "." + types[name])
    else:
        save_path = os.path.join(path, name + "-" + str(id) + "." + types[name])
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get("content-length", 0))
    block_size = 1024
    progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
    with open(save_path, "wb") as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()

With a 202 status, response.iter_content() yields nothing, so the function ends up writing an empty file.
The 202 status can be reproduced simply like this:

import requests
r = requests.get("https://dataverse.harvard.edu/api/access/datafile/4267146")
print(r.status_code)

202

Strangely, the same behaviour is not observed when running in a Google Colab environment (I haven't figured out why that is yet!).
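One way to avoid silently writing an empty file is to check the status code before saving anything. The following is only a sketch, not the TDC implementation: download_checked is a hypothetical helper, and treating 202 as "not ready yet, try again shortly" is an assumption about how the server behaves, not something confirmed in this thread. The `get` parameter is there only so the logic can be exercised without network access.

```python
import time


def download_checked(url, save_path, retries=5, wait_seconds=2.0, get=None):
    """Download `url` to `save_path`, writing the file only on a 200 response.

    Hypothetical helper, not part of TDC. A 202 is retried after a short
    wait (an assumption about the server); other statuses raise.
    """
    if get is None:
        import requests  # imported lazily so a custom `get` can be supplied

        get = requests.get
    for _ in range(retries):
        response = get(url, stream=True)
        if response.status_code == 200:
            with open(save_path, "wb") as file:
                for chunk in response.iter_content(1024):
                    file.write(chunk)
            return save_path
        if response.status_code != 202:
            # Raises for 4xx/5xx; anything else falls through to the error below.
            response.raise_for_status()
            raise RuntimeError(f"Unexpected status {response.status_code} for {url}")
        time.sleep(wait_seconds)
    raise RuntimeError(f"Still getting 202 after {retries} attempts for {url}")
```

With this shape, a persistent 202 surfaces as an explicit error instead of an empty file and a misleading maintenance message.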

Kind regards

James

abearab mentioned this issue Oct 7, 2024

Dataverse data access module pachterlab/gget#124
