ENH: Create index on reads, not on seeks #74
Comments
@pauldmccarthy also bitten by this… is it enough if I manually call …? EDIT: OK, I tried the above and it didn't work. In the sense that I guess I don't understand how this works; I will have to study the code in detail.
Hi @piskvorky, I don't fully understand the problem you are experiencing - can you paste some example code to use as a basis for discussion?
Sure, thanks for looking into this. I created this minimal example:

```python
import io
import tarfile
import random
from string import ascii_letters

from indexed_gzip import IndexedGzipFile


def generate_tgz(path, num_files=10000, file_size=5*1024):
    """Create a .tgz file containing lots of random files inside."""
    random.seed(42)
    with tarfile.open(name=path, mode='w:gz') as tf:
        for fileno in range(num_files):
            content = (''.join(random.choice(ascii_letters) for i in range(file_size))).encode('utf8')
            ti = tarfile.TarInfo(name=f'{fileno}.txt')
            ti.size = len(content)
            tf.addfile(ti, io.BytesIO(content))


if __name__ == '__main__':
    fname = '/tmp/test_igzip.tgz'
    generate_tgz(fname)

    # Simulate data coming from a stream. In reality, this is from network and not a local disk file.
    raw_obj = open(fname, mode='rb')
    gzip_wrapped = IndexedGzipFile(fileobj=raw_obj, mode='rb', auto_build=True, spacing=256 * 1024, buffer_size=4 * 1024**2)
    tf = tarfile.open(fileobj=gzip_wrapped, mode='r')

    # Simulate advancing through the archive, extracting members.
    # My expectation was that seek points are being created periodically, so we can seek back later.
    last_seek_points = -1
    while True:
        fileinfo = tf.next()
        if not fileinfo:
            break
        content = tf.extractfile(fileinfo)
        gzip_wrapped.seek(gzip_wrapped.tell())  # a no-op seek, attempting to force creating a seek point
        seek_points = list(gzip_wrapped.raw.seek_points())
        if last_seek_points != len(seek_points):
            print(f"at file {fileinfo.name}, {len(seek_points)} seek points, last one {seek_points[-1] if seek_points else '-'}")
        last_seek_points = len(seek_points)
```

The output is:
… while I expected the first seek point to appear much earlier. Basically, my mental model for when seek points are created is broken.
My running hypothesis is that it's somehow related to input buffering – either inside tarfile, or inside igzip's own buffer (set to 4 MB above).
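One way to probe that hypothesis (a sketch only, reusing the parameters from the script above with a hypothetically smaller readahead buffer) is to shrink igzip's buffer and re-run the same loop, checking whether the seek points then appear more gradually:

```python
# Hypothetical experiment, not a fix: same setup as in the script above,
# but with a much smaller readahead buffer. If the seek points now grow in
# smaller steps, the 4 MB buffer is the likely cause of the bursty index growth.
gzip_wrapped = IndexedGzipFile(
    fileobj=raw_obj,
    mode='rb',
    auto_build=True,
    spacing=256 * 1024,
    buffer_size=64 * 1024,   # was 4 * 1024**2
)
```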
Hi @piskvorky, I'm sorry, I still don't understand what the problem is here - it looks to me like the seek points are being created at approximately 256 KB intervals? e.g.: …
The problem is that seek points are not being created as the file is being read (which is why I'm hijacking this particular issue – I thought it was related). Instead, there are no seek points at all until there are 20 seek points all at once, after ~5 MB have been read. In other words, at no point is there a single seek point, two seek points, three seek points, etc. The number of seek points jumps from 0 to 20 to 33 to 49 to 64… This despite the explicit no-op seek() call in my loop above.
Ok, sorry for my slowness in understanding. Unfortunately, I came to the conclusion that resolving this particular issue would require a major refactor (explanation here). I originally wrote this library to work with regular on-disk files, and never envisaged it being applied to data streams. It wasn't too difficult to bolt on support for in-memory file-likes (thanks to the work by @epicfaace in #55), but I feel that adding true support for streamed data would be a substantial undertaking.

To be honest, I am leaning towards the idea of a from-scratch re-write in pure Python (using ctypes for the zlib interactions that aren't covered by the standard zlib module - I really should have done this from the outset, rather than use C). The logic in …

Unfortunately, I can't envisage having the time to sit down and do this work any time soon, as I am continually tied up with unrelated work (and most of my free time is spent away from the screen!). However, if I don't get around to it, I am more than happy to support/review work done by somebody else.
I understand, thank you for your support. Being maximally away from the screen and juggling time between open source & family is also something I aspire to!
Currently the index is built:

- `zran_seek(n)` - the index is built to cover up to position `n` in the uncompressed data
- `zran_read` - when reading from position 0, one seek point is created at the beginning of the stream, but then no more points are created. If another call to `zran_read` is then made, the index is built to cover up to the position that the previous `zran_read` call finished - data is re-read and re-decompressed from the beginning of the file/stream. Then the read is initialised from that point.

This is complicated and inefficient - `zran_seek` should just update an internal offset into the uncompressed data, without doing anything else. Then `zran_read` should both read data and build the index in a single pass.

To support pre-generating an index, without reading any data, `zran_read` could allow `buf` to be `NULL`, which would cause it to read/build the index, but not copy any data.
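For illustration only, here is a minimal sketch in plain Python (not indexed_gzip's actual implementation; the function name and chunk sizes below are made up) of what building the index during reads could look like: a single forward pass that records a seek point roughly every `spacing` bytes of uncompressed output, in the spirit of a `zran_read` call with `buf` set to `NULL`. Restoring a seek point would additionally need the saved window plus bit-level inflate priming, which the standard zlib module does not expose (hence the ctypes idea above).

```python
import zlib


def build_index(fileobj, spacing=256 * 1024, window_size=32 * 1024):
    """Single forward pass over a gzip stream: decompress, record a seek point
    roughly every `spacing` bytes of uncompressed output, copy nothing out.

    Each point is (compressed offset, uncompressed offset, trailing window).
    The compressed offsets here are only chunk-accurate; a real implementation
    would track the exact bit position within the compressed stream.
    """
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect gzip framing
    points = []
    window = b''
    cmp_off = unc_off = last_point = 0
    while True:
        chunk = fileobj.read(64 * 1024)
        if not chunk:
            break
        data = decomp.decompress(chunk)
        cmp_off += len(chunk)
        unc_off += len(data)
        window = (window + data)[-window_size:]
        if unc_off - last_point >= spacing:
            points.append((cmp_off, unc_off, window))
            last_point = unc_off
    return points
```

In a real reader, the same loop would also hand `data` back to the caller, so that reading and index building happen in one pass, while `zran_seek` would only need to update the requested offset.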