
S3.read(-1) for a large file (2^31+α bytes) fails due to an SSL OverflowError #271

Open · belltailjp opened this issue Mar 5, 2022 · 4 comments


belltailjp commented Mar 5, 2022

I found that reading the entire content of a 2 GiB + α file from S3 fails with an OverflowError: signed integer is greater than maximum exception raised from the Python SSL library.

Here is a minimal reproduction:

import os

import pfio

path = 's3://<bucket>/foo.dat'
# size = 2**31 + 7 * 1024   # No error
size = 2**31 + 8 * 1024     # Raises OverflowError

# Write `size` bytes of random test data in 128 MiB chunks
bs = 128 * 1024 * 1024
with pfio.v2.open_url(path, 'wb') as f:
    while 0 < size:
        s = min(bs, size)
        print('remaining={}, writing={}'.format(size, s))
        f.write(os.urandom(s))  # urandom already returns bytes
        size -= s

# Read the entire content in a single call
with pfio.v2.open_url(path, 'rb') as f:
    assert len(f.read(-1))

The last line fails with the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/s3.py", line 149, in readall
    return self.read(-1)
  File "/usr/local/lib/python3.8/site-packages/pfio/v2/s3.py", line 82, in read
    data = body.read()
  File "/usr/local/lib/python3.8/site-packages/botocore/response.py", line 95, in read
    chunk = self._raw_stream.read(amt)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 515, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/local/lib/python3.8/http/client.py", line 468, in read
    s = self._safe_read(self.length)
  File "/usr/local/lib/python3.8/http/client.py", line 609, in _safe_read
    data = self.fp.read(amt)
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum

From the error message I expected that reading a file of 2^31 bytes would succeed and 2^31 + 1 bytes would fail, but the actual threshold is slightly different: it lies somewhere between 2147490816 (2^31 + 7 KiB) and 2147491840 (2^31 + 8 KiB) bytes.

I think the S3 API itself supports reading files this large; the issue is in the Python SSL library layer (if so, it may be worth trying Python 3.9 or 3.10).

Here is my environment:

% python --version
Python 3.8.10
% python -c "import pfio; print(pfio.__version__)"
2.2.0
belltailjp (Member, author) commented:

This Python bug seems closely related; it is apparently addressed in Python 3.10 and later:
https://bugs.python.org/issue42853
https://stackoverflow.com/questions/70905872

For pfio, since we cannot drop support for Python 3.8 right now, I guess we need a workaround that avoids attempting to read the whole content in a single call even when _ObjectReader.read(-1) or _ObjectReader.readall() is called.

The naive approach would be to modify _ObjectReader.read to split the underlying get_object API call when necessary, though that sounds like re-implementing a kind of buffering, which duplicates BufferedReader (#247).
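
A minimal sketch of that idea, assuming a plain boto3 client; the helper name read_in_chunks and the 1 GiB chunk size are illustrative, not pfio's actual implementation:

import io

import boto3

CHUNK = 2**30  # 1 GiB per request, safely below the 2**31-byte SSL limit


def read_in_chunks(client, bucket, key):
    """Read a whole S3 object via ranged GetObject calls."""
    length = client.head_object(Bucket=bucket, Key=key)['ContentLength']
    buf = io.BytesIO()
    pos = 0
    while pos < length:
        end = min(pos + CHUNK, length) - 1  # Range header is inclusive
        resp = client.get_object(
            Bucket=bucket, Key=key, Range='bytes={}-{}'.format(pos, end))
        buf.write(resp['Body'].read())
        pos = end + 1
    return buf.getvalue()

Each ranged response body stays under 2^31 bytes, so the SSL layer never sees an oversized read.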

I wonder if there is a way to force BufferedReader to keep buffering even when read(-1) is called, although it currently calls _ObjectReader.readall directly.
cf. https://github.com/python/cpython/blob/v3.11.0a5/Lib/_pyio.py#L1096
In that case we would also need to handle "rt" mode, which uses TextIOWrapper instead of BufferedReader. In addition, it would be preferable to avoid this issue even without the buffering wrapper (buffering=0).
(Note: the reported issue reproduces regardless of the buffering option and of "rb"/"rt" mode.)
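
One possibility, shown as a hedged sketch (ChunkedReadallMixin is a hypothetical name, not pfio code): if the raw reader's own readall() loops in bounded chunks, then both BufferedReader.read(-1), which delegates to the raw object's readall, and the unbuffered buffering=0 path avoid a single oversized SSL read.

import io

_SAFE_CHUNK = 2**30  # 1 GiB, below the 2**31-byte SSL limit


class ChunkedReadallMixin(io.RawIOBase):
    """Hypothetical mixin: make readall() loop instead of issuing one huge read."""

    def readall(self):
        chunks = []
        while True:
            chunk = self.read(_SAFE_CHUNK)  # bounded read, never read(-1)
            if not chunk:
                break
            chunks.append(chunk)
        return b''.join(chunks)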

kuenishi (Member) commented:

Strictly speaking, bpo-42853 was fixed in Python 3.9.7 (release notes). I learned of this issue in January but didn't report it here, sorry! At the time I thought that reading a fairly large file (>2 GB) at once was a rare enough use case that it didn't pay to implement a workaround. Given what you report here, did you run into this issue in an actual application?

Python 3.8 EoL is scheduled for 2024-10. That is more than two years from today, and 3.8 is in security-fix-only maintenance. bpo-42853 isn't a vulnerability, so it won't be fixed in the 3.8 branch. Hmmm....

kuenishi commented:

We just observed an internal use case where loading a large pickle file failed like this:

import pickle

from pfio.v2 import open_url

with open_url("s3://very/large/file.pickle", "rb") as fp:
    pickle.load(fp)  # Raises the OverflowError above
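
For reference, a workaround sketch that may help on Python < 3.10 (the 512 MiB chunk size and the path are illustrative): stage the object into memory with bounded reads, then unpickle from the in-memory buffer, so no single SSL read exceeds 2^31 bytes.

import io
import pickle

from pfio.v2 import open_url

CHUNK = 512 * 1024 * 1024  # 512 MiB per read, well below 2**31 bytes

buf = io.BytesIO()
with open_url("s3://very/large/file.pickle", "rb") as fp:
    while True:
        chunk = fp.read(CHUNK)  # bounded read instead of read(-1)
        if not chunk:
            break
        buf.write(chunk)
buf.seek(0)
obj = pickle.load(buf)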

kuenishi commented:

Update: even on Python 3.9.7 this issue reproduced when loading large pickled ndarray files, possibly because the pickle binary protocol forces a read of more than 2 GB from SSL at once. This is fixed in Python 3.10, which uses SSL_read_ex():

Python 3.10 will use SSL_write_ex() and SSL_read_ex(), which support > 2 GB data.

So the complete resolution for this issue is to use Python 3.10. ¯\_(ツ)_/¯
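
In the meantime, an application stuck on an older interpreter could add a defensive check along these lines (a sketch, not part of pfio):

import sys
import warnings

if sys.version_info < (3, 10):
    warnings.warn(
        "Single reads larger than 2 GiB over SSL may raise OverflowError "
        "on this Python version (bpo-42853); prefer chunked reads.")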
