Commit

2.5 million batch size
orf committed Oct 20, 2024
1 parent 540e97b commit 1169378
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/pypi_data/combine_parquet.py
@@ -64,7 +64,7 @@ async def fill_buffer(
     log.info(f"Downloaded, reading {path}")
     table = pq.read_table(path, memory_map=True).combine_chunks()
 
-    for idx, batch in enumerate(table.to_batches(max_chunksize=2_000_000)):
+    for idx, batch in enumerate(table.to_batches(max_chunksize=2_500_000)):
         batch: RecordBatch
         digest = hashlib.sha256()
         for item in batch.column("path").cast(pyarrow.large_binary()).to_pylist():
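
To illustrate what the larger `max_chunksize` changes, here is a minimal sketch of the row counts each batch would carry. It assumes the table is a single contiguous chunk after `combine_chunks()`, so that batches are just consecutive slices of at most `max_chunksize` rows. The helper name `batch_sizes` and the 5,000,000-row example are hypothetical and not part of this commit.

```python
def batch_sizes(num_rows: int, max_chunksize: int) -> list[int]:
    """Row counts of consecutive slices of at most max_chunksize rows.

    Hypothetical helper: models slicing one contiguous chunk, which is an
    assumption about the table's layout after combine_chunks().
    """
    if max_chunksize <= 0:
        raise ValueError("max_chunksize must be positive")
    return [
        min(max_chunksize, num_rows - start)
        for start in range(0, num_rows, max_chunksize)
    ]


# Illustrative row count only; the real tables may be any size.
old = batch_sizes(5_000_000, 2_000_000)  # batch size before this commit
new = batch_sizes(5_000_000, 2_500_000)  # batch size after this commit
print(len(old), len(new))
```

Under these assumptions, a larger batch size means fewer, bigger batches per table, so each `hashlib.sha256()` digest in the loop covers more rows.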