The challenge is simple: given a text file containing 1 billion rows, where each row is a temperature reading from a given weather station, calculate the min, max, and mean temperatures for each weather station as fast as possible. I chose to tackle the challenge using Go because I wanted to make use of Goroutines.
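For context, each row in the measurements file pairs a station name with a temperature reading, separated by a semicolon (this is the format the original challenge specifies), so the input looks something like:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8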
Reading a file, splitting lines, and computing statistics is not incredibly complicated. The hard part is making it fast. I knew reading the 14 GB measurements file sequentially, line by line, would be slow. Nevertheless, I created a sequential solution first so I would have a baseline to benchmark the parallel solution against.
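For reference, here is a minimal sketch of what the sequential baseline looks like; the Stats struct and sequential function are illustrative names, not my exact code:

import (
    "bufio"
    "math"
    "os"
    "strconv"
    "strings"
)

// Stats accumulates the values needed for min, max, and mean.
type Stats struct {
    min, max, sum float64
    count         int
}

func sequential(path string) (map[string]*Stats, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    stats := make(map[string]*Stats)
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Each line is "station;reading".
        name, reading, _ := strings.Cut(scanner.Text(), ";")
        v, err := strconv.ParseFloat(reading, 64)
        if err != nil {
            return nil, err
        }
        if s, ok := stats[name]; ok {
            s.min = math.Min(s.min, v)
            s.max = math.Max(s.max, v)
            s.sum += v
            s.count++
        } else {
            stats[name] = &Stats{min: v, max: v, sum: v, count: 1}
        }
    }
    // The mean for a station is s.sum / float64(s.count).
    return stats, scanner.Err()
}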
Because I can only process the data as fast as I can read it from the operating system, my idea was to split the file into chunks and send each chunk to 1 of n routines to be parsed.
As of July 2024, my solution is to create chunks and send them through a channel to 1 of n routines; each routine parses the lines in its chunk and sends the weather station name and reading through a second channel to a single routine that updates a map holding the min, max, and mean for each station.
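Wired together, the pipeline looks roughly like the sketch below. Buffer, Reading, numWorkers, parseChunk, and updateStats are stand-in names I'm using for illustration (parseChunk is sketched after the chunking code further down), and it assumes the sync package is imported:

// Sketch of the fan-out/fan-in pipeline: one chunker, n parsing
// routines, one aggregator. Types and helper names are stand-ins.
type Buffer struct {
    offset int64 // where the chunk starts in the file
    size   int   // number of bytes in the chunk
}

type Reading struct {
    station string
    temp    float64
}

chunks := make(chan *Buffer, numWorkers)
readings := make(chan Reading, 1024)

// n parsing routines: each turns chunks into individual readings.
var wg sync.WaitGroup
for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for buf := range chunks {
            parseChunk(file, buf, readings) // sketched below
        }
    }()
}

// Close readings once every parsing routine has finished.
go func() {
    wg.Wait()
    close(readings)
}()

// Single aggregator routine: folds readings into the per-station map,
// updating min, max, sum, and count as in the sequential version.
for r := range readings {
    updateStats(stats, r)
}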
I create each chunk by jumping n bytes into the file with the ReadAt method, which lets me read at a specific offset instead of scanning line by line. If the byte ReadAt lands on is not a newline, I keep reading byte by byte, incrementing offsetEnd, until I hit a newline character. Then, from offsetStart and offsetEnd, I determine the size of the chunk and send the offset and size through the chunks channel to one of the routines that parses the chunk. After the chunk is sent, I set offsetStart to just past offsetEnd (skipping the newline) and read n bytes forward again.
offsetStart := int64(0)
offsetEnd := int64(bufferSize)
b := make([]byte, 1)
for {
    _, err := file.ReadAt(b, offsetEnd)
    // If the byte at offsetEnd is not a newline, walk forward until it
    // is, so every chunk ends on a line boundary.
    for err == nil && b[0] != '\n' {
        offsetEnd++
        _, err = file.ReadAt(b, offsetEnd)
    }
    if err == io.EOF {
        // Ran past the end of the file: send whatever remains as the
        // final chunk. fileSize is assumed to come from file.Stat().
        chunks <- &Buffer{offset: offsetStart, size: int(fileSize - offsetStart)}
        break
    }
    check(err)
    // Send the chunk [offsetStart, offsetEnd); the next chunk starts
    // just past the newline at offsetEnd.
    chunks <- &Buffer{offset: offsetStart, size: int(offsetEnd - offsetStart)}
    offsetStart = offsetEnd + 1
    offsetEnd += int64(bufferSize)
}
close(chunks) // lets the parsing routines' range loops finish
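For completeness, here is a sketch of what each parsing routine might do with a chunk. parseChunk is an illustrative name, and it reuses the Buffer and Reading types from the pipeline sketch above (imports: io, os, strconv, strings):

// Sketch of a parsing routine's work: read the chunk's bytes with a
// single ReadAt call, then split and parse it line by line.
func parseChunk(file *os.File, buf *Buffer, readings chan<- Reading) {
    data := make([]byte, buf.size)
    _, err := file.ReadAt(data, buf.offset)
    if err != nil && err != io.EOF {
        check(err) // io.EOF is expected for the final chunk
    }
    for _, line := range strings.Split(string(data), "\n") {
        if line == "" {
            continue // empty slot left by a trailing newline
        }
        name, reading, _ := strings.Cut(line, ";")
        v, err := strconv.ParseFloat(reading, 64)
        check(err)
        readings <- Reading{station: name, temp: v}
    }
}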
Unfortunately, this solution is currently slower than sequential parsing, something I'm working to resolve as I find time. Right now, sequential parsing takes roughly 3-4 minutes, whereas parallel parsing takes roughly 9-10 minutes.