BadZipFile: Bad CRC-32 for file 'citi_bike_data_00001.csv' #10

Open
stoneyv opened this issue Nov 15, 2022 · 1 comment

Comments

@stoneyv

stoneyv commented Nov 15, 2022

The make-dataset command failed with a Bad CRC-32 error while unzipping the Kaggle download. I will try to manually download the data and convert it.

(medium-data-bakeoff-py3.9) stoney@laptop2:~/Desktop/medium-data-bakeoff/src$ python -m medium_data_bakeoff make-dataset
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/stoney/.kaggle/kaggle.json'
2022-11-14 22:11:15.730 | INFO     | medium_data_bakeoff.data:construct_dataset:64 - Downloading 'rosenthal/citi-bike-stations' dataset from kaggle.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/stoney/.cache/pypoetry/virtualenvs/medium-data-bakeoff-2erTjUlV-py3.9/lib/python3.9/site-p │
│ ackages/kaggle/api/kaggle_api_extended.py:1246 in dataset_download_files                         │
│                                                                                                  │
│   1243 │   │   │   if unzip:                                                                     │
│   1244 │   │   │   │   try:                                                                      │
│   1245 │   │   │   │   │   with zipfile.ZipFile(outfile) as z:                                   │
│ ❱ 1246 │   │   │   │   │   │   z.extractall(effective_path)                                      │
│   1247 │   │   │   │   except zipfile.BadZipFile as e:                                           │
│   1248 │   │   │   │   │   raise ValueError(                                                     │
│   1249 │   │   │   │   │   │   'Bad zip file, please report on '                                 │
│                                                                                                  │
│ ╭────────────────────────────────────────── locals ───────────────────────────────────────────╮  │
│ │        dataset = 'rosenthal/citi-bike-stations'                                             │  │
│ │   dataset_slug = 'citi-bike-stations'                                                       │  │
│ │   dataset_urls = ['rosenthal', 'citi-bike-stations']                                        │  │
│ │     downloaded = True                                                                       │  │
│ │ effective_path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          force = False                                                                      │  │
│ │        outfile = '/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi-bike-stations.zip' │  │
│ │     owner_slug = 'rosenthal'                                                                │  │
│ │           path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          quiet = True                                                                       │  │
│ │       response = <urllib3.response.HTTPResponse object at 0x7fb5beb1f3a0>                   │  │
│ │           self = <kaggle.api.kaggle_api_extended.KaggleApi object at 0x7fb5bed08850>        │  │
│ │          unzip = True                                                                       │  │
│ │              z = <zipfile.ZipFile [closed]>                                                 │  │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────╯  │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:1633 in extractall                   │
│                                                                                                  │
│   1630 │   │   │   path = os.fspath(path)                                                        │
│   1631 │   │                                                                                     │
│   1632 │   │   for zipinfo in members:                                                           │
│ ❱ 1633 │   │   │   self._extract_member(zipinfo, path, pwd)                                      │
│   1634 │                                                                                         │
│   1635 │   @classmethod                                                                          │
│   1636 │   def _sanitize_windows_name(cls, arcname, pathsep):                                    │
│                                                                                                  │
│ ╭─────────────────────────── locals ────────────────────────────╮                                │
│ │ members = [                                                   │                                │
│ │           │   'citi_bike_data_00000.csv',                     │                                │
│ │           │   'citi_bike_data_00001.csv',                     │                                │
│ │           │   'citi_bike_data_00002.csv',                     │                                │
│ │           │   'citi_bike_data_00003.csv',                     │                                │
│ │           │   'citi_bike_data_00004.csv',                     │                                │
│ │           │   'citi_bike_data_00005.csv',                     │                                │
│ │           │   'citi_bike_data_00006.csv',                     │                                │
│ │           │   'citi_bike_data_00007.csv',                     │                                │
│ │           │   'citi_bike_data_00008.csv',                     │                                │
│ │           │   'citi_bike_data_00009.csv',                     │                                │
│ │           │   ... +40                                         │                                │
│ │           ]                                                   │                                │
│ │    path = '/home/stoney/Desktop/medium-data-bakeoff/data/csv' │                                │
│ │     pwd = None                                                │                                │
│ │    self = <zipfile.ZipFile [closed]>                          │                                │
│ │ zipinfo = 'citi_bike_data_00001.csv'                          │                                │
│ ╰───────────────────────────────────────────────────────────────╯                                │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:1688 in _extract_member              │
│                                                                                                  │
│   1685 │   │                                                                                     │
│   1686 │   │   with self.open(member, pwd=pwd) as source, \                                      │
│   1687 │   │   │    open(targetpath, "wb") as target:                                            │
│ ❱ 1688 │   │   │   shutil.copyfileobj(source, target)                                            │
│   1689 │   │                                                                                     │
│   1690 │   │   return targetpath                                                                 │
│   1691                                                                                           │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │            arcname = 'citi_bike_data_00001.csv'                                              │ │
│ │ invalid_path_parts = ('', '.', '..')                                                         │ │
│ │             member = <ZipInfo filename='citi_bike_data_00001.csv' compress_type=deflate      │ │
│ │                      file_size=383898052 compress_size=58998990>                             │ │
│ │                pwd = None                                                                    │ │
│ │               self = <zipfile.ZipFile [closed]>                                              │ │
│ │             source = <zipfile.ZipExtFile [closed]>                                           │ │
│ │             target = <_io.BufferedWriter                                                     │ │
│ │                      name='/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi_bike_data… │ │
│ │         targetpath = '/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi_bike_data_0000… │ │
│ │          upperdirs = '/home/stoney/Desktop/medium-data-bakeoff/data/csv'                     │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/shutil.py:205 in copyfileobj                    │
│                                                                                                  │
│    202 │   fsrc_read = fsrc.read                                                                 │
│    203 │   fdst_write = fdst.write                                                               │
│    204 │   while True:                                                                           │
│ ❱  205 │   │   buf = fsrc_read(length)                                                           │
│    206 │   │   if not buf:                                                                       │
│    207 │   │   │   break                                                                         │
│    208 │   │   fdst_write(buf)                                                                   │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │        buf = b',043450N63715450N63715\\N,0,30406,04345027,715450N637151496218044,043450N637… │ │
│ │       fdst = <_io.BufferedWriter                                                             │ │
│ │              name='/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi_bike_data_00001.c… │ │
│ │ fdst_write = <built-in method write of _io.BufferedWriter object at 0x7fb5beb2bb40>          │ │
│ │       fsrc = <zipfile.ZipExtFile [closed]>                                                   │ │
│ │  fsrc_read = <bound method ZipExtFile.read of <zipfile.ZipExtFile [closed]>>                 │ │
│ │     length = 65536                                                                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:922 in read                          │
│                                                                                                  │
│    919 │   │   self._readbuffer = b''                                                            │
│    920 │   │   self._offset = 0                                                                  │
│    921 │   │   while n > 0 and not self._eof:                                                    │
│ ❱  922 │   │   │   data = self._read1(n)                                                         │
│    923 │   │   │   if n < len(data):                                                             │
│    924 │   │   │   │   self._readbuffer = data                                                   │
│    925 │   │   │   │   self._offset = n                                                          │
│                                                                                                  │
│ ╭─────────────── locals ───────────────╮                                                         │
│ │  buf = b''                           │                                                         │
│ │  end = 65536                         │                                                         │
│ │    n = 65536                         │                                                         │
│ │ self = <zipfile.ZipExtFile [closed]> │                                                         │
│ ╰──────────────────────────────────────╯                                                         │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:1012 in _read1                       │
│                                                                                                  │
│   1009 │   │   self._left -= len(data)                                                           │
│   1010 │   │   if self._left <= 0:                                                               │
│   1011 │   │   │   self._eof = True                                                              │
│ ❱ 1012 │   │   self._update_crc(data)                                                            │
│   1013 │   │   return data                                                                       │
│   1014 │                                                                                         │
│   1015 │   def _read2(self, n):                                                                  │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ data = b'0,304015,0,1,17,0,1,1,1,1605744509,Broadway & Battery                               │ │
│ │        Pl,18270463334,-74.0136170'+53620                                                     │ │
│ │    n = 65536                                                                                 │ │
│ │ self = <zipfile.ZipExtFile [closed]>                                                         │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/stoney/.pyenv/versions/3.9.5/lib/python3.9/zipfile.py:940 in _update_crc                   │
│                                                                                                  │
│    937 │   │   self._running_crc = crc32(newdata, self._running_crc)                             │
│    938 │   │   # Check the CRC if we're at the end of the file                                   │
│    939 │   │   if self._eof and self._running_crc != self._expected_crc:                         │
│ ❱  940 │   │   │   raise BadZipFile("Bad CRC-32 for file %r" % self.name)                        │
│    941 │                                                                                         │
│    942 │   def read1(self, n):                                                                   │
│    943 │   │   """Read up to n bytes with at most one read() system call."""                     │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ newdata = b'0,304015,0,1,17,0,1,1,1,1605744509,Broadway & Battery                            │ │
│ │           Pl,18270463334,-74.0136170'+53620                                                  │ │
│ │    self = <zipfile.ZipExtFile [closed]>                                                      │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
BadZipFile: Bad CRC-32 for file 'citi_bike_data_00001.csv'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/stoney/Desktop/medium-data-bakeoff/src/medium_data_bakeoff/cli.py:10 in make_dataset       │
│                                                                                                  │
│    7 def make_dataset() -> None:                                                                 │
│    8 │   from medium_data_bakeoff.data import construct_dataset                                  │
│    9 │                                                                                           │
│ ❱ 10 │   construct_dataset()                                                                     │
│   11                                                                                             │
│   12                                                                                             │
│   13 @app.command(                                                                               │
│                                                                                                  │
│ ╭────────────────────────────── locals ──────────────────────────────╮                           │
│ │ construct_dataset = <function construct_dataset at 0x7fb5bebcb820> │                           │
│ ╰────────────────────────────────────────────────────────────────────╯                           │
│                                                                                                  │
│ /home/stoney/Desktop/medium-data-bakeoff/src/medium_data_bakeoff/data.py:66 in construct_dataset │
│                                                                                                  │
│    63 │   # Download the dataset from Kaggle                                                     │
│    64 │   logger.info("Downloading {!r} dataset from kaggle.", config.KAGGLE_DATASET)            │
│    65 │   config.CSV_PATH.mkdir(parents=True, exist_ok=True)                                     │
│ ❱  66 │   kaggle.api.dataset_download_files(                                                     │
│    67 │   │   config.KAGGLE_DATASET, config.CSV_PATH, unzip=True                                 │
│    68 │   )                                                                                      │
│    69                                                                                            │
│                                                                                                  │
│ /home/stoney/.cache/pypoetry/virtualenvs/medium-data-bakeoff-2erTjUlV-py3.9/lib/python3.9/site-p │
│ ackages/kaggle/api/kaggle_api_extended.py:1248 in dataset_download_files                         │
│                                                                                                  │
│   1245 │   │   │   │   │   with zipfile.ZipFile(outfile) as z:                                   │
│   1246 │   │   │   │   │   │   z.extractall(effective_path)                                      │
│   1247 │   │   │   │   except zipfile.BadZipFile as e:                                           │
│ ❱ 1248 │   │   │   │   │   raise ValueError(                                                     │
│   1249 │   │   │   │   │   │   'Bad zip file, please report on '                                 │
│   1250 │   │   │   │   │   │   'www.github.com/kaggle/kaggle-api', e)                            │
│   1251                                                                                           │
│                                                                                                  │
│ ╭────────────────────────────────────────── locals ───────────────────────────────────────────╮  │
│ │        dataset = 'rosenthal/citi-bike-stations'                                             │  │
│ │   dataset_slug = 'citi-bike-stations'                                                       │  │
│ │   dataset_urls = ['rosenthal', 'citi-bike-stations']                                        │  │
│ │     downloaded = True                                                                       │  │
│ │ effective_path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          force = False                                                                      │  │
│ │        outfile = '/home/stoney/Desktop/medium-data-bakeoff/data/csv/citi-bike-stations.zip' │  │
│ │     owner_slug = 'rosenthal'                                                                │  │
│ │           path = PosixPath('/home/stoney/Desktop/medium-data-bakeoff/data/csv')             │  │
│ │          quiet = True                                                                       │  │
│ │       response = <urllib3.response.HTTPResponse object at 0x7fb5beb1f3a0>                   │  │
│ │           self = <kaggle.api.kaggle_api_extended.KaggleApi object at 0x7fb5bed08850>        │  │
│ │          unzip = True                                                                       │  │
│ │              z = <zipfile.ZipFile [closed]>                                                 │  │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────╯  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: ('Bad zip file, please report on www.github.com/kaggle/kaggle-api', BadZipFile("Bad CRC-32 for file 
'citi_bike_data_00001.csv'"))
@EthanRosenthal
Owner

Weird, I have not run into this before. I honestly have no idea what's going on. I believe that construct_dataset first downloads a single large zip file from Kaggle into ./data/ and then unzips that file. As you mentioned, maybe manually downloading and unzipping yourself will work? Let me know what you find out.
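
One way to narrow this down before extracting anything by hand is to check whether the downloaded archive itself is damaged. The snippet below is only a sketch using the standard library's zipfile module; the helper names first_bad_member and extract_if_intact are made up for illustration and are not part of medium-data-bakeoff or the Kaggle client.

```python
import zipfile
from pathlib import Path
from typing import Optional


def first_bad_member(zip_path: Path) -> Optional[str]:
    """Return the name of the first member that fails its check, or None."""
    # Note: opening the archive can itself raise zipfile.BadZipFile when the
    # file is truncated badly enough that its central directory is missing.
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() reads every member and verifies its CRC-32 and header.
        return archive.testzip()


def extract_if_intact(zip_path: Path, dest: Path) -> bool:
    """Extract the archive only when every member passes the check."""
    bad = first_bad_member(zip_path)
    if bad is not None:
        print(f"Corrupt member: {bad}; delete {zip_path.name} and download it again.")
        return False
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return True
```

With the paths from the traceback, this would be called as extract_if_intact(Path("data/csv/citi-bike-stations.zip"), Path("data/csv")). If it reports citi_bike_data_00001.csv again, the zip on disk is probably corrupted or incomplete, and a fresh download is likely the first thing to try.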
