This is my attempt at the One Billion Row Challenge by Gunnar Morling, in Python. The original challenge only accepts solutions written in Java 21, but the Show & Tell section features all kinds of implementations.
- Generate the `measurements.txt` file using `create_measurements.py`:

  ```sh
  python create_measurements.py -r 1_000_000_000
  ```

  This takes a few minutes to generate a file with a billion lines. The file size is around 13 GB, so make sure you have enough space on your machine. It took 816.72 seconds, or around 13 minutes, on my 16 GB M2 Air.
- Run `calculate_avg.py` and record the time.

Results from running the program on an M2 MacBook Air (16 GB RAM, 8-core CPU):
| Attempt | Time (sec) |
|---|---|
| 4 | 64 |
| 3 | 162 |
| 2 | 319 |
| 1 | 870 |
- No tricks, just basic Python. It took 869.78 seconds and it's painful to look at.
- Read the file line by line and aggregate the values in a Python `dict`:

  ```python
  with open('measurements.txt') as f_reader:
      for line in f_reader:
          # aggregate values here
  ```

- This avoids loading the entire file into memory all at once. (A fuller sketch of the aggregation follows this list.)
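For concreteness, here is a minimal sketch of what that baseline aggregation might look like. It assumes the challenge's `<station>;<temperature>` line format; the `stats` name and the dict-of-dicts layout are illustrative, not the repo's exact code:

```python
# Minimal baseline sketch: one dict keyed by station, each value a
# dict of running min/max/sum/count. Names here are illustrative.
stats = {}

with open('measurements.txt') as f_reader:
    for line in f_reader:
        station, _, value = line.partition(';')
        temp = float(value)
        s = stats.get(station)
        if s is None:
            stats[station] = {'min': temp, 'max': temp, 'sum': temp, 'count': 1}
        else:
            s['min'] = min(s['min'], temp)
            s['max'] = max(s['max'], temp)
            s['sum'] += temp
            s['count'] += 1

# Report min/mean/max per station, as the challenge asks.
for station in sorted(stats):
    s = stats[station]
    print(f"{station}={s['min']:.1f}/{s['sum'] / s['count']:.1f}/{s['max']:.1f}")
```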
- Processed the file in chunks of 100 million lines each. Took 319 seconds.
- Used multiprocessing to process the chunks concurrently. The aggregation stays the same, using a Python `dict`.
- Instead of one process reading line by line, multiple processes read the same file line by line. The per-chunk results are then merged into a final Python `dict`. (A sketch follows this list.)
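A rough sketch of that layout, assuming chunks are handed to workers as `(start_line, num_lines)` pairs; the function and variable names are illustrative, not the repo's actual identifiers:

```python
# Sketch of the chunked, multiprocessing version. Chunks are described
# by (start_line, num_lines) pairs; all names here are illustrative.
import multiprocessing as mp
from itertools import islice

FILE = 'measurements.txt'
CHUNK_LINES = 100_000_000
TOTAL_LINES = 1_000_000_000

def process_chunk(args):
    start_line, num_lines = args
    stats = {}
    with open(FILE) as f:
        # Each worker opens the file independently and skips ahead to its
        # chunk; this line-counting skip is the overhead attempt 3 removes.
        for line in islice(f, start_line, start_line + num_lines):
            station, _, value = line.partition(';')
            temp = float(value)
            s = stats.get(station)
            if s is None:
                stats[station] = {'min': temp, 'max': temp, 'sum': temp, 'count': 1}
            else:
                s['min'] = min(s['min'], temp)
                s['max'] = max(s['max'], temp)
                s['sum'] += temp
                s['count'] += 1
    return stats

def merge(chunk_results):
    # Fold the per-chunk dicts into one final dict.
    merged = {}
    for stats in chunk_results:
        for station, s in stats.items():
            m = merged.get(station)
            if m is None:
                merged[station] = s
            else:
                m['min'] = min(m['min'], s['min'])
                m['max'] = max(m['max'], s['max'])
                m['sum'] += s['sum']
                m['count'] += s['count']
    return merged

if __name__ == '__main__':
    chunks = [(start, CHUNK_LINES) for start in range(0, TOTAL_LINES, CHUNK_LINES)]
    with mp.Pool() as pool:
        merged = merge(pool.map(process_chunk, chunks))
```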
- A better way of splitting the file: by size in bytes, not by number of lines. This removes the extra steps needed to advance the file-stream position in the file. Time down to 162 seconds.
- Split the file into 8 (or max CPU cores) parts of `total_file_size / max_cpu_cores` bytes each, then advance the file-stream position to the next `\n` character.
- This makes sure that `readline()` always yields a complete line, with no breaks in between. (A sketch of the boundary computation follows this list.)
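Computing those newline-aligned boundaries might look something like this; `chunk_boundaries` is an illustrative name, not necessarily the repo's:

```python
# Sketch of byte-based chunking: cut the file into roughly equal byte
# ranges, then nudge each cut forward to just past the next '\n' so no
# worker ever sees a partial line. Names are illustrative.
import os

FILE = 'measurements.txt'

def chunk_boundaries(path, n_parts):
    size = os.path.getsize(path)
    part_size = size // n_parts
    boundaries = [0]
    with open(path, 'rb') as f:
        for i in range(1, n_parts):
            f.seek(i * part_size)
            f.readline()  # advance to just past the next '\n'
            boundaries.append(f.tell())
    boundaries.append(size)
    # Worker i processes bytes [start, end) of the file.
    return list(zip(boundaries[:-1], boundaries[1:]))

print(chunk_boundaries(FILE, os.cpu_count() or 8))
```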
- Memory-mapped the chunks, then read them line by line. Replaced the `min()` and `max()` functions with basic `if a > b` checks. Finished in 76 seconds.
- The values in the `dict` were also a `dict`; replacing them with a `list` shaved off another ~10 seconds. Time down to 64 seconds.
- Used 16 worker processes instead of the default value, `os.cpu_count()` (8 in my case). (A sketch of one worker follows this list.)
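Putting those pieces together, one worker in the final version might look roughly like this, assuming the newline-aligned byte ranges from attempt 3; the list layout `[min, max, sum, count]` matches the description above, but the names are illustrative:

```python
# Sketch of one mmap-based worker: list-valued stats [min, max, sum,
# count] and plain comparisons instead of min()/max() in the hot loop.
import mmap

def process_chunk(path, start, end):
    # `start` and `end` are assumed to be newline-aligned byte offsets.
    stats = {}
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            mm.seek(start)
            while mm.tell() < end:
                line = mm.readline()
                station, _, value = line.partition(b';')
                temp = float(value)
                s = stats.get(station)
                if s is None:
                    stats[station] = [temp, temp, temp, 1]
                else:
                    # Plain comparisons are cheaper than min()/max() calls here.
                    if temp < s[0]:
                        s[0] = temp
                    if temp > s[1]:
                        s[1] = temp
                    s[2] += temp
                    s[3] += 1
    return stats
```

These workers can then be fanned out with `multiprocessing.Pool(processes=16)`; oversubscribing the 8 cores presumably helps because workers spend part of their time stalled on page faults while reading the mapped file.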