
Parallel processing #111

Closed
im-n1 opened this issue Dec 28, 2021 · 9 comments · Fixed by #114

Comments

im-n1 commented Dec 28, 2021

Hi,

first of all, thanks for this crate; I use it in my project, transpors. One thing I noticed is that loading/parsing GTFS takes a lot of time (a ~20 MB file takes about 30 seconds on my 8th-gen i7). That is understandable, but I noticed that all the work runs in one thread. I don't know what the core of this crate looks like, but I assume it is a lot of for loops, and those can often be parallelized in literally one line thanks to rayon.

I did the same thing in my application, and walking through the data now takes only ~20% of the time it used to.
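
For illustration, a minimal sketch of the kind of one-line change rayon makes possible (the data and closure here are stand-ins, not actual gtfs_structures types):

    use rayon::prelude::*;

    fn lengths(lines: &[String]) -> Vec<usize> {
        // Sequential version: lines.iter().map(|l| l.len()).collect()
        // Swapping `iter` for `par_iter` spreads the map across
        // rayon's worker thread pool.
        lines.par_iter().map(|l| l.len()).collect()
    }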

@Tristramg (Collaborator)

Thank you for the request, and for transpors!
Indeed, the bottleneck is parsing the stop_times.txt file, which should be straightforward to parallelize. I’ll have a look into it.

In the meantime, did you compile your project with --release? It makes quite a lot of difference.

im-n1 (Author) commented Dec 28, 2021

Sure, I compile and use my app as a release build. If you could look into that critical code, that would be nice 👍

@Tristramg (Collaborator)

It turns out it’s a bit trickier than expected. Maybe there could be some gain in optimizing how the times from stop_times.txt are parsed.

I used a 200 MB file (https://transitfeeds.com/p/ov/814) on a 6th-generation i7, and the parsing takes about 50 seconds.
What is your GTFS source file, so I can benchmark what takes so long in yours?
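
For context on that optimization idea: GTFS times are plain "HH:MM:SS" strings, so a hand-rolled parser can skip a general-purpose datetime library entirely. A minimal sketch (not the crate's actual code):

    fn parse_gtfs_time(s: &str) -> Option<u32> {
        // GTFS times may exceed 23:59:59 (e.g. 25:10:00 for a trip
        // running past midnight), so we return plain seconds since
        // the start of the service day instead of a clock time.
        let mut parts = s.split(':');
        let h: u32 = parts.next()?.parse().ok()?;
        let m: u32 = parts.next()?.parse().ok()?;
        let sec: u32 = parts.next()?.parse().ok()?;
        Some(h * 3600 + m * 60 + sec)
    }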

im-n1 (Author) commented Dec 29, 2021

What exactly is tricky? I was hoping the hidden gem would be parallelizing the loops. With today's machines it should be a piece of cake to parse a ~20 MB file (my case). Imagine doing that 15 years ago.

I use this file.

@Tristramg (Collaborator)

The problem is that rayon needs access to the whole data structure to balance the workload.

The CSV library reads each record one at a time, so rayon can’t work on it directly. Storing the data in a Vec first helps a bit (~25% speed improvement): https://github.com/rust-transit/gtfs-structure/pull/112/files#diff-5eefb32e95318a7427d74dcc3d96ad47392dcce676e949788c9bb53367a2f1d6R303
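
Roughly, the approach looks like this (a sketch assuming the csv, serde, and rayon crates; RawStopTime is a simplified stand-in for the crate's raw record type):

    use rayon::prelude::*;
    use serde::Deserialize;

    // Simplified stand-in for the crate's raw stop_times record.
    #[derive(Deserialize)]
    struct RawStopTime {
        trip_id: String,
        arrival_time: String,
        departure_time: String,
    }

    fn load(path: &str) -> csv::Result<Vec<RawStopTime>> {
        // Materialize every record into a Vec first: the csv reader
        // is a sequential iterator that rayon cannot split on its own.
        csv::Reader::from_path(path)?.deserialize().collect()
    }

    fn trip_ids(records: Vec<RawStopTime>) -> Vec<String> {
        // Once the records sit in a Vec, rayon can chunk the
        // per-record work across its worker threads.
        records.into_par_iter().map(|r| r.trip_id).collect()
    }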

I agree that such low parsing speeds are frustrating. The big problem is stop_times.txt, which is about 100 MB unzipped in your source file. It consists of 1.7 M lines, so 3.4 M time values to parse, and as many HashMap lookups to store the times on the right journey.

I have some leads for improving it a bit more.

I am still a bit surprised that you need more than 10 seconds to parse your file in release mode.

im-n1 (Author) commented Jan 2, 2022

The edit looks good. If you noticed an improvement, that's good news, and I wouldn't wait to release a new version ;)

@m0n0chr0m3

The improvements are nice, but they aren't free. The ideas currently in the PR have the following side effects:

  • They change gtfs_structures's public API by replacing std::String with smol_str::SmolStr in some of the core GTFS types.
  • They force users of gtfs_structures to depend on rayon, and thus to build rayon, which increases build time and build size.
  • They make gtfs_structures spawn a worker thread pool on every run, even when reading, e.g., a tiny GTFS archive that could have been parsed in less time than it takes to spawn the threads. The impact on working-memory consumption hasn't been established yet, but might be significantly negative too.
  • They force users of gtfs_structures to depend on nom (with similar, though smaller, implications).

Another idea discussed in the comments of PR #112 would also yield nice performance improvements, but there the choices are either:

  • wait until the csv crate accepts their PR #247, or
  • have the gtfs-structure crate depend on some random branch of some random fork of the csv crate.

All in all, I think it makes sense to study the effects a bit more before committing to these ideas. In the meantime, you could speed up your own development by temporarily depending on PR #112's branch; in your Cargo.toml, change the gtfs-structures dependency to:

    gtfs-structures = { git = "https://github.com/rust-transit/gtfs-structure.git", branch = "parallel_parse" }

im-n1 (Author) commented Jan 2, 2022

What comes to my mind is offering parallel processing as a Cargo "feature", which would solve almost everything from your previous comment; see the sketch below. Anyway, the branch is a cool compromise. Well done.
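
For illustration, a hypothetical sketch of that feature idea (the `parallel` feature name and these functions are assumptions, not the crate's actual API; rayon would become an optional dependency enabled by the feature):

    // Compiled only when the (hypothetical) `parallel` feature is enabled.
    #[cfg(feature = "parallel")]
    fn process<T: Send, R: Send>(items: Vec<T>, f: impl Fn(T) -> R + Send + Sync) -> Vec<R> {
        use rayon::prelude::*;
        items.into_par_iter().map(f).collect()
    }

    // Default build: no rayon dependency, no thread pool spawned.
    #[cfg(not(feature = "parallel"))]
    fn process<T, R>(items: Vec<T>, f: impl Fn(T) -> R) -> Vec<R> {
        items.into_iter().map(f).collect()
    }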

@Tristramg (Collaborator)

Hello, your issue led us on quite a journey.
In the end, we managed to improve the performance; nothing too spectacular, but with almost the same gain as using rayon.

Your data seems to be well structured, so you can do the following to get quite a performance boost.
And of course, always build with --release:

-    let gtfs = Gtfs::new(&file_path);
+    let gtfs = gtfs_structures::GtfsReader::default()
+        .trim_fields(false)
+        .read(&file_path);
