
Parallel processing #111

Closed
im-n1 opened this issue Dec 28, 2021 · 9 comments · Fixed by #114

Comments

im-n1 commented Dec 28, 2021

Hi,

first of all, thanks for this crate; I use it in my project, transpors. One thing I noticed is that loading/parsing GTFS takes a lot of time (a ~20 MB file takes about 30 seconds on my 8th-gen i7). That is understandable, but I noticed that all the work runs in one thread. I don't know what the core of this crate looks like, but I assume it is a lot of for loops, and those can often be parallelized in literally one line thanks to rayon.

I did the same thing in my application, and walking through the data now takes only ~20% of the time it used to.
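
For illustration, a minimal sketch of the kind of one-line change rayon makes possible (the data and closure here are stand-ins, not actual gtfs_structures types):

    use rayon::prelude::*;

    fn lengths(lines: &[String]) -> Vec<usize> {
        // Sequential version: lines.iter().map(|l| l.len()).collect()
        // Swapping `iter` for `par_iter` spreads the map across
        // rayon's worker thread pool.
        lines.par_iter().map(|l| l.len()).collect()
    }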

@Tristramg (Collaborator)

Thank you for the request, and for transpors!
Indeed, the bottleneck is parsing the stop_times.txt file, which should be straightforward to parallelize. I’ll have a look into it.

In the meantime, did you compile your project with --release? It makes quite a lot of difference.

im-n1 (Author) commented Dec 28, 2021

Sure, I compile and use my app as a release build. If you could look into that critical code, that would be nice 👍

@Tristramg (Collaborator)

It turns out it’s a bit trickier than expected. Maybe there could be some gain in optimizing how the times from stop_times.txt are parsed.

I used a 200 MB file (https://transitfeeds.com/p/ov/814) on a 6th-generation i7, and the parsing takes about 50 seconds.
What is your GTFS source file, so I can benchmark what takes so long in yours?
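
For context on that optimization idea: GTFS times are plain "HH:MM:SS" strings, so a hand-rolled parser can skip a general-purpose datetime library entirely. A minimal sketch (not the crate's actual code):

    fn parse_gtfs_time(s: &str) -> Option<u32> {
        // GTFS times may exceed 23:59:59 (e.g. 25:10:00 for a trip
        // running past midnight), so we return plain seconds since
        // the start of the service day instead of a clock time.
        let mut parts = s.split(':');
        let h: u32 = parts.next()?.parse().ok()?;
        let m: u32 = parts.next()?.parse().ok()?;
        let sec: u32 = parts.next()?.parse().ok()?;
        Some(h * 3600 + m * 60 + sec)
    }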

im-n1 (Author) commented Dec 29, 2021

What exactly is tricky? I was hoping the hidden gem would be parallelizing the loops. With today's machines it should be a piece of cake to parse a ~20 MB file (my case). Imagine doing that 15 years ago.

I use this file.

@Tristramg (Collaborator)

The problem is that rayon needs access to the whole data structure to balance the workload.

The CSV library reads each record one at a time, so rayon can’t work on it directly. Storing the data in a Vec first helps a bit (~25% speed improvement): https://github.com/rust-transit/gtfs-structure/pull/112/files#diff-5eefb32e95318a7427d74dcc3d96ad47392dcce676e949788c9bb53367a2f1d6R303
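
Roughly, the approach looks like this (a sketch assuming the csv, serde, and rayon crates; RawStopTime is a simplified stand-in for the crate's raw record type):

    use rayon::prelude::*;
    use serde::Deserialize;

    // Simplified stand-in for the crate's raw stop_times record.
    #[derive(Deserialize)]
    struct RawStopTime {
        trip_id: String,
        arrival_time: String,
        departure_time: String,
    }

    fn load(path: &str) -> csv::Result<Vec<RawStopTime>> {
        // Materialize every record into a Vec first: the csv reader
        // is a sequential iterator that rayon cannot split on its own.
        csv::Reader::from_path(path)?.deserialize().collect()
    }

    fn trip_ids(records: Vec<RawStopTime>) -> Vec<String> {
        // Once the records sit in a Vec, rayon can chunk the
        // per-record work across its worker threads.
        records.into_par_iter().map(|r| r.trip_id).collect()
    }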

I agree that such low parsing speeds are frustrating. The big problem is stop_times.txt, which is about 100 MB unzipped in your source file. It consists of 1.7 M lines, so 3.4 M time values to parse, and as many HashMap lookups to store the times on the right journey.

I have some leads for improving it a bit more.

I am still a bit surprised that you need more than 10 seconds to parse your file in release mode.

im-n1 (Author) commented Jan 2, 2022

The edit looks good. If you noticed an improvement, that's good news, and I wouldn't wait to release a new version ;)

@m0n0chr0m3

The improvements are nice, but they aren't free. The ideas currently in the PR have the following side effects:

  • They change gtfs_structures's public API by replacing std::String with smol_str::SmolStr in some of the core GTFS types.
  • They force users of gtfs_structures to depend on rayon, and thus to build rayon, which increases build time and build size.
  • They make gtfs_structures spawn a worker thread pool on every run, even when reading, e.g., a tiny GTFS archive that could have been parsed in less time than it takes to spawn the threads. The impact on working-memory consumption hasn't been established yet, but might be significantly negative too.
  • They force users of gtfs_structures to depend on nom (with similar, though smaller, implications).

Another idea discussed in the comments of PR #112 would also yield nice performance improvements, but there the choices are either:

  • wait until the csv crate accepts their PR #247, or
  • have the gtfs-structure crate depend on some random branch of some random fork of the csv crate.

All in all, I think it makes sense to study the effects a bit more before committing to these ideas. In the meantime, you could speed up your own development by temporarily depending on PR #112's branch; in your Cargo.toml, change the gtfs-structures dependency to:

    gtfs-structures = { git = "https://github.com/rust-transit/gtfs-structure.git", branch = "parallel_parse" }

im-n1 (Author) commented Jan 2, 2022

What comes to my mind is offering parallel processing as a Cargo "feature", which would solve almost everything from your previous comment; see the sketch below. Anyway, the branch is a cool compromise. Well done.
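
For illustration, a hypothetical sketch of that feature idea (the `parallel` feature name and these functions are assumptions, not the crate's actual API; rayon would become an optional dependency enabled by the feature):

    // Compiled only when the (hypothetical) `parallel` feature is enabled.
    #[cfg(feature = "parallel")]
    fn process<T: Send, R: Send>(items: Vec<T>, f: impl Fn(T) -> R + Send + Sync) -> Vec<R> {
        use rayon::prelude::*;
        items.into_par_iter().map(f).collect()
    }

    // Default build: no rayon dependency, no thread pool spawned.
    #[cfg(not(feature = "parallel"))]
    fn process<T, R>(items: Vec<T>, f: impl Fn(T) -> R) -> Vec<R> {
        items.into_iter().map(f).collect()
    }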

@Tristramg (Collaborator)

Hello, your issue led us on quite a journey.
In the end, we managed to improve the performance; nothing too spectacular, but with almost the same gain as using rayon.

Your data seems to be well structured, so you can do the following to get quite a performance boost.
And of course, always build with --release:

-    let gtfs = Gtfs::new(&file_path);
+    let gtfs = gtfs_structures::GtfsReader::default()
+        .trim_fields(false)
+        .read(&file_path);
