Error happened with load data #314

Open
StefanIsSmart opened this issue Sep 16, 2024 · 5 comments
Comments

StefanIsSmart commented Sep 16, 2024

Describe the bug
An error occurs while loading the data.

To Reproduce
Steps to reproduce the behavior:

from tdc.single_pred import Yields
data = Yields(name = 'Buchwald-Hartwig')
split = data.get_split()

Expected behavior

Get a DataFrame.

Screenshots
[Screenshot 2024-09-16 10:33:08 AM]

Environment:

  • OS: Linux
  • Python version: 3.8
  • TDC version: 0.4.1
  • Any other relevant information: None

Additional context
Uploading Screenshot 2024-09-16 10.34.25 AM.png…

@flogrammer

I'm having the same issue

Any ideas?

flogrammer commented Oct 2, 2024

The problem seems to be that the downloaded .tab file is empty (zinc.tab in my case).
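
To check whether a local copy is affected, one option is to list the zero-byte files in the download directory. The following is only a minimal sketch: `find_empty_files` is a hypothetical helper, not part of TDC, and the directory you pass it is whatever path your datasets were saved to.

```python
import os


def find_empty_files(directory):
    """Return the sorted paths of zero-byte regular files directly inside `directory`.

    Hypothetical helper for illustration; it is not part of TDC.
    Subdirectories are not searched.
    """
    empty = []
    for entry in os.scandir(directory):
        # Only regular files count; directories are skipped.
        if entry.is_file() and entry.stat().st_size == 0:
            empty.append(entry.path)
    return sorted(empty)
```

For example, `find_empty_files("./data")` would return the empty files in a `./data` folder, assuming that is where the downloads were written. Deleting them before retrying avoids loading an empty file by mistake.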

mxfly14 commented Oct 3, 2024

Hi,
I have the same issue (same message and an empty .tab file). When I run it in my terminal, I get this:
[image]
Could it be a bad request to https://dataverse.harvard.edu/ ?

@Arslan-Masood

If you just want the data, you can download it directly from here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/21LKWG

@jepdavidson

Hi,

I am seeing the same (misleading) "TDC is hosted in Harvard Dataverse and it is currently under maintenance" message.
As @flogrammer and @mxfly14 said, this appears to be due to empty files being retrieved.

The underlying cause (in my environment at least) is that the GET request receives a 202 response instead of 200.
Here's the code for the dataverse_download function (from tdc.utils.load):

def dataverse_download(url, path, name, types, id=None):
    """dataverse download helper with progress bar

    Args:
        url (str): the url of the dataset
        path (str): the path to save the dataset
        name (str): the dataset name
        types (dict): a dictionary mapping from the dataset name to the file format
    """
    if id is None:
        save_path = os.path.join(path, name + "." + types[name])
    else:
        save_path = os.path.join(path, name + "-" + str(id) + "." + types[name])
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get("content-length", 0))
    block_size = 1024
    progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
    with open(save_path, "wb") as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()

With the 202 status, response.iter_content() yields nothing, so the function ends up writing an empty file.
The 202 status can be reproduced simply like this:

import requests
r = requests.get("https://dataverse.harvard.edu/api/access/datafile/4267146")
print(r.status_code)

202
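
One way to avoid silently writing an empty file is to check the status code before streaming and to refuse an empty body. The following is only a sketch of that idea, not the TDC implementation: `write_response_to_file` and the `.part` temporary suffix are hypothetical names, and the function accepts any object exposing `status_code` and `iter_content`, such as a `requests.Response`.

```python
import os


def write_response_to_file(response, save_path, block_size=1024):
    """Stream `response` to `save_path`, refusing non-200 statuses and empty bodies.

    Hypothetical helper for illustration; it is not part of TDC.
    `response` must expose `status_code` and `iter_content(block_size)`.
    Returns the number of bytes written.
    """
    if response.status_code != 200:
        # A 202 (or any other non-200 status) stops here, before any file is created.
        raise RuntimeError(
            f"download failed: expected HTTP 200, got {response.status_code}"
        )

    # Write to a temporary file first so a failed download never leaves
    # an empty file at the final location.
    tmp_path = save_path + ".part"
    bytes_written = 0
    with open(tmp_path, "wb") as file:
        for chunk in response.iter_content(block_size):
            file.write(chunk)
            bytes_written += len(chunk)

    if bytes_written == 0:
        os.remove(tmp_path)
        raise RuntimeError("download failed: response body was empty")

    os.replace(tmp_path, save_path)
    return bytes_written
```

With requests, this could be called as `write_response_to_file(requests.get(url, stream=True), save_path)`. The caller then sees an explicit error naming the status code instead of the maintenance message.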

Strangely, the same behaviour is not observed when running in a Google Colab environment (I haven't figured out why yet!).

Kind regards

James
