
1 Billion Row Challenge

The challenge is simple: given a text file containing 1 billion rows, where each row is a temperature reading from a given weather station, calculate the min, max, and mean temperatures for each weather station as fast as possible. I chose to tackle the challenge using Go because I wanted to make use of Goroutines.
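For reference, the standard 1BRC measurements file has one <station>;<temperature> pair per line (for example, Hamburg;12.3), and the mean can be tracked as a running sum and count. A minimal sketch of a per-station aggregate under those assumptions (illustrative names, not necessarily this repository's code):

type stats struct {
	min, max, sum float64
	count         int
}

// newStats seeds min and max with the station's first reading.
func newStats(t float64) *stats {
	return &stats{min: t, max: t, sum: t, count: 1}
}

// add folds one more reading into the aggregate.
func (s *stats) add(t float64) {
	if t < s.min {
		s.min = t
	}
	if t > s.max {
		s.max = t
	}
	s.sum += t
	s.count++
}

func (s *stats) mean() float64 {
	return s.sum / float64(s.count)
}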

Sequential Solution

Reading a file, splitting lines, and computing statistics is not incredibly complicated. The hard part is making it fast. I knew reading the 14 GB measurements file sequentially, line-by-line, would be slow. Nevertheless, I created a sequential solution so I could benchmark it against the parallel solution.
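For illustration, a sequential baseline along these lines could scan the file with bufio.Scanner and update a map keyed by station name. This is a sketch reusing the stats type above (imports: bufio, os, strconv, strings), not the repository's exact implementation:

func runSequential(path string) (map[string]*stats, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	results := make(map[string]*stats)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line is assumed to be "<station>;<temperature>".
		name, value, ok := strings.Cut(sc.Text(), ";")
		if !ok {
			continue
		}
		t, err := strconv.ParseFloat(value, 64)
		if err != nil {
			return nil, err
		}
		if s, seen := results[name]; seen {
			s.add(t)
		} else {
			results[name] = newStats(t)
		}
	}
	return results, sc.Err()
}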

Parallel Solution

Because I can only compute results as fast as I can read data from the operating system, my idea was to split the file into chunks and send each chunk to one of n goroutines to be parsed.

As of July 2024, my solution creates chunks and sends them through a channel to one of n goroutines. Each goroutine parses every line in its chunk and sends the weather station name and reading through a second channel to a single goroutine, which updates a map containing the min, max, and mean readings for each station.
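A sketch of how that pipeline could be wired is below. It assumes the Buffer type (a chunk's offset and size) from the chunking code further down and the stats type from above; run, reading, numWorkers, createChunks (a wrapper around the chunking loop shown below), and parseChunk (sketched after that loop) are illustrative names, not necessarily the repository's. Requires the os and sync imports.

// One record sent from a parser goroutine to the aggregator.
type reading struct {
	station string
	temp    float64
}

func run(file *os.File, numWorkers int) map[string]*stats {
	chunks := make(chan *Buffer, numWorkers)
	readings := make(chan reading, 1024)

	// Producer: the chunking loop feeds chunks, then closes the channel.
	go func() {
		defer close(chunks)
		createChunks(file, chunks)
	}()

	// Fan out: n goroutines each pull chunks and parse their lines.
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range chunks {
				parseChunk(file, chunk, readings)
			}
		}()
	}

	// Close readings once every parser goroutine has finished.
	go func() {
		wg.Wait()
		close(readings)
	}()

	// Fan in: a single goroutine owns the map, so no locking is needed.
	results := make(map[string]*stats)
	for r := range readings {
		if s, seen := results[r.station]; seen {
			s.add(r.temp)
		} else {
			results[r.station] = newStats(r.temp)
		}
	}
	return results
}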

I create each chunk by using the ReadAt method to peek at the byte bufferSize bytes past the chunk's start, which lets me jump to a specific offset in the file instead of reading line-by-line. If the byte ReadAt lands on is not a newline, I keep reading byte-by-byte, incrementing offsetEnd, until I hit a newline character. Then, from offsetStart and offsetEnd, I determine the size of the chunk and send the offset and size through the chunks channel to one of the goroutines that parses the chunk. After the chunk is sent, I set offsetStart to just past offsetEnd and jump ahead bufferSize bytes again.

offsetStart := int64(0)
offsetEnd := bufferSize
b := make([]byte, 1)

var err error
for err != io.EOF {
	// Jump ahead to the tentative end of the chunk...
	_, err = file.ReadAt(b, int64(offsetEnd))
	// ...then walk forward byte-by-byte until the chunk ends on a newline,
	// so that no line is split across two chunks.
	for err == nil && b[0] != '\n' {
		offsetEnd++
		_, err = file.ReadAt(b, int64(offsetEnd))
	}
	// io.EOF just means this is the last chunk; anything else is fatal.
	if err != nil && err != io.EOF {
		check(err)
	}
	// Hand the chunk's offset and size to a parser goroutine (the final chunk
	// may run past the end of the file, so the consumer should trim it on io.EOF).
	chunks <- &Buffer{offset: offsetStart, size: offsetEnd - int(offsetStart)}
	offsetStart = int64(offsetEnd + 1)
	offsetEnd += bufferSize
}
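The worker side of the channel is not shown above. A sketch of how a parser goroutine could consume one Buffer follows: it reads the whole chunk with a single ReadAt, trims on io.EOF in case the final chunk overshoots the end of the file, and parses each line. parseChunk and reading are the illustrative names from the pipeline sketch earlier, and check is the error helper used above (imports: bytes, io, os, strconv).

func parseChunk(file *os.File, chunk *Buffer, readings chan<- reading) {
	buf := make([]byte, chunk.size)
	n, err := file.ReadAt(buf, chunk.offset)
	if err != nil && err != io.EOF {
		check(err)
	}
	// Trim in case the final chunk extends past the end of the file.
	buf = buf[:n]

	for _, line := range bytes.Split(buf, []byte{'\n'}) {
		if len(line) == 0 {
			continue
		}
		name, value, ok := bytes.Cut(line, []byte{';'})
		if !ok {
			continue
		}
		t, err := strconv.ParseFloat(string(value), 64)
		check(err)
		readings <- reading{station: string(name), temp: t}
	}
}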

Unfortunately, this solution is currently slower than the sequential one, which I'm working to resolve whenever I have time. Right now, sequential parsing takes approximately 3-4 minutes, whereas parallel parsing takes approximately 9-10 minutes.
