ValueError: Seek before start of file using custom open_with function or S3 file object #875

Open
soerenbrandt opened this issue Jul 31, 2023 · 3 comments

@soerenbrandt

I am running into a very weird issue when passing a wrapper around S3FileSystem().open as open_with to fastparquet.ParquetFile:

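import fastparquet
from s3fs import S3FileSystem

s3 = S3FileSystem()

# Placeholder: replace with the path to a Parquet file on AWS S3
path: str = "<path to a Parquet file on AWS S3>"

def test(*args, **kwargs):
    # Plain function that forwards everything to the bound method s3.open
    return s3.open(*args, **kwargs)

# Raises "ValueError: Seek before start of file"
fastparquet.ParquetFile(path, open_with=test)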

The issue resolves when I use open_with=s3.open instead.

The trace is below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 11
      8     yield s3.open(*args, **kwargs)
     10 with s3.open(path) as f:
---> 11     fastparquet.ParquetFile(f)
     12 # fastparquet.ParquetFile(str(path), open_with=test)

File ~/.cache/pants/named_caches/pex_root/venvs/78afaaf29abf624f78018e2ea235624e188d1920/a54cf194a38a982fc7904d648eb451a554889da0/lib/python3.10/site-packages/fastparquet/api.py:134, in ParquetFile.__init__(self, fn, verify, open_with, root, sep, fs, pandas_nulls, dtypes)
    131 elif hasattr(fn, 'read'):
    132     # file-like
    133     self.fn = None
--> 134     self._parse_header(fn, verify)
    135     if self.file_scheme not in ['simple', 'empty']:
    136         raise ValueError('Cannot use file-like input '
    137                          'with multi-file data')

File ~/.cache/pants/named_caches/pex_root/venvs/78afaaf29abf624f78018e2ea235624e188d1920/a54cf194a38a982fc7904d648eb451a554889da0/lib/python3.10/site-packages/fastparquet/api.py:209, in ParquetFile._parse_header(self, f, verify)
    207 if verify:
    208     assert f.read(4) == b'PAR1'
--> 209 f.seek(-8, 2)
    210 head_size = struct.unpack('<I', f.read(4))[0]
    211 if verify:

File ~/.cache/pants/named_caches/pex_root/venvs/78afaaf29abf624f78018e2ea235624e188d1920/a54cf194a38a982fc7904d648eb451a554889da0/lib/python3.10/site-packages/fsspec/spec.py:1575, in AbstractBufferedFile.seek(self, loc, whence)
   1573     raise ValueError("invalid whence (%s, should be 0, 1 or 2)" % whence)
   1574 if nloc < 0:
-> 1575     raise ValueError("Seek before start of file")
   1576 self.loc = nloc
   1577 return self.loc

ValueError: Seek before start of file
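For readers following the trace: `f.seek(-8, 2)` positions the file 8 bytes before its end, so the error can only fire when the computed position `size + loc` is negative, that is, when the file object reports a size below 8. The sketch below reproduces that arithmetic with a hypothetical `SizedReader` class. It is a stand-in written for illustration, not fsspec's `AbstractBufferedFile`, and the assumption that whence=2 is computed as `size + loc` is mine. Why the size would be wrong in this issue is not established in the thread.

```python
import io
import struct


class SizedReader:
    """Hypothetical stand-in for a buffered remote file (not fsspec's class)."""

    def __init__(self, data, size=None):
        self._buf = io.BytesIO(data)
        # `size` can be overridden to simulate a misreported file size.
        self.size = len(data) if size is None else size
        self.loc = 0

    def seek(self, loc, whence=0):
        if whence == 0:
            nloc = loc
        elif whence == 1:
            nloc = self.loc + loc
        elif whence == 2:
            # Assumed: offset relative to the reported size.
            nloc = self.size + loc
        else:
            raise ValueError("invalid whence (%s, should be 0, 1 or 2)" % whence)
        if nloc < 0:
            raise ValueError("Seek before start of file")
        self.loc = nloc
        return self.loc

    def read(self, n):
        self._buf.seek(self.loc)
        out = self._buf.read(n)
        self.loc += len(out)
        return out


def read_footer_length(f):
    # Same two steps as in the trace: jump to 8 bytes before the end,
    # then decode a little-endian 4-byte length.
    f.seek(-8, 2)
    return struct.unpack("<I", f.read(4))[0]


# Fake file: leading magic, 10 payload bytes, footer length, trailing magic.
data = b"PAR1" + b"\x00" * 10 + struct.pack("<I", 10) + b"PAR1"

footer_len = read_footer_length(SizedReader(data))

try:
    read_footer_length(SizedReader(data, size=0))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With the correct size the footer length is decoded normally. With a reported size of 0 the same call fails with the message seen in the trace.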
@martindurant
Member

It certainly looks like it ought to work, but can I ask why you are calling it in that manner? For your information, fastparquet does try to guess whether the function passed is actually an fsspec filesystem implementation method, so that must be the difference - you can equally pass fs= in that case.
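To illustrate the kind of guess described above, here is a minimal sketch of one way a library could tell a filesystem's bound method apart from a plain wrapper function. The names `guess_filesystem` and `FakeFileSystem` are hypothetical, and this is an assumption about the general technique, not fastparquet's actual code.

```python
def guess_filesystem(open_with):
    """Return the object a bound method belongs to, or None for a plain function."""
    # Bound methods carry their instance in __self__; plain functions do not.
    return getattr(open_with, "__self__", None)


class FakeFileSystem:
    """Hypothetical stand-in for a filesystem object with an open() method."""

    def open(self, path, mode="rb"):
        return (path, mode)


fs = FakeFileSystem()


def wrapper(*args, **kwargs):
    # Behaves like fs.open when called, but hides the instance from inspection.
    return fs.open(*args, **kwargs)


bound_result = guess_filesystem(fs.open)
wrapped_result = guess_filesystem(wrapper)
```

A check of this kind would recognise `s3.open` but not a wrapper such as `test`, which would send the two calls down different code paths. Passing the filesystem explicitly through the `fs` parameter, which appears in the `ParquetFile.__init__` signature in the trace, avoids relying on any such guess.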

@soerenbrandt
Author

I am working with both pathlib and S3 paths in our pipeline, and I want to avoid checking before each call which of the two I have. So I implemented an S3Path class analogous to pathlib.Path, with a method similar to the one below:

def open(self, *args, **kwargs):
    return S3FileSystem().open(self, *args, **kwargs)

@martindurant
Member

I would say this is a genuine bug, and the code deciding what to do with file-like objects versus fsspec-specific files should be revisited. Are you able to do any debugging?
