Full Scale Performance: Single Process, Sharrow On #6
On a Windows machine with 512 GB RAM and a 2.44 GHz processor, using the blosc2:zstd compressed skims. The compute_accessibility step (the sharp yellow spike on the far left of the charts) consistently takes 314 GB of memory, because it runs on the full set of zones and does not change with household sample size. The memory peak discussed below excludes compute_accessibility.
With 100% of households, we estimate the run time to be about 16 hours and the memory peak to be about 500 GB. WSP is currently running this on the same 512 GB RAM machine. We should probably run it on a larger machine to trace memory fairly. Note: Sharrow is set to skip Trip Destination in ABM3. @jpn-- is working on turning it on. Hopefully we will see trip destination memory come down with Sharrow turned on, as we saw in the 1-zone model.
My sharrow-on run is still going, but I am seeing some very large increases in runtime compared to the sharrow-off runs on this same machine. I will let it finish, but wanted to bring this up ASAP. Any ideas on what is going on here @jpn-- ? I do notice that CPU usage on the server spikes to 100% even though I am only running a single process. I assume this is because multi-threading is turned on and this is expected behavior?
Update on run with full sample, sharrow on, and single process (1 TB memory, Intel Xeon Gold 6342 @ 2.8GHz machine):
Update: the longer runtime is no longer happening under sharrow with multiprocess. Total runtime is approximately 150 minutes (full sample). Note that I added these constraints to simulation.py; it would be good to know if they are required for sharrow with multiprocess: os.environ["OMP_NUM_THREADS"] = "1"
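For reference, a minimal sketch of how such thread caps are usually applied at the top of simulation.py, assuming they must be set before the numerical libraries are imported. Only OMP_NUM_THREADS is confirmed from the comment above; the other variables are common companions and are assumptions here, not taken from the original post.

    import os

    # cap the OpenMP pool, as reported in the comment above
    os.environ["OMP_NUM_THREADS"] = "1"
    # the following caps are assumptions, not confirmed from the post
    os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL
    os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
    os.environ["NUMBA_NUM_THREADS"] = "1"     # numba threading layer

    # ...import activitysim / numba only after these are set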
@aletzdy - what was the peak memory usage for the run under sharrow with multiprocess? Can you post the performance profile?
On the memory trace, I see lots of flat sections, all with about the same slope. Can you confirm that there are memory pings filling out the middle of these slopes at the constant ping rate, and that it's not just the computer completely locking up periodically? Reviewing the log file, it appears the problem is entirely in the interaction components. The flat and consistent slope of the memory profile every time it happens suggests we are somehow leaning against some memory speed constraint, as otherwise I'd expect the slopes to differ more drastically for different components. @aletzdy, can you try:
Thanks
@aletzdy can you also share here the raw data file for the memory log (which backs the figure you posted)? I want to more closely cross-reference the log output with the tall but very short-lived memory spikes to see if I can narrow down the causes.
I edited my original post to add the memory_profile.csv. We do get the memory pings, and it does not appear the system is locking up. I will try your suggestions. Thanks.
My multiprocess run is not creating the memory_profile.csv, so it is hard to say what the peak memory usage is. I am not sure if the memory profiler simply does not work with MP? @jpn-- or @i-am-sijia might know more.
It doesn't. We've talked about this in our recent meetings. Measuring memory usage in MP is complex.
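As a rough external workaround, peak memory across the parent process and its workers can be approximated by polling the process tree with psutil. The sketch below is a minimal polling loop under that assumption, not ActivitySim's own memory profiler, and the function name is hypothetical.

    import time
    import psutil

    def sample_peak_rss(pid, interval=1.0, duration=3600.0):
        """Poll total RSS of a process tree and return the observed peak in bytes."""
        parent = psutil.Process(pid)
        peak = 0
        deadline = time.time() + duration
        while time.time() < deadline:
            total = 0
            for proc in [parent] + parent.children(recursive=True):
                try:
                    total += proc.memory_info().rss
                except psutil.NoSuchProcess:
                    pass  # a worker exited between listing and sampling
            peak = max(peak, total)
            time.sleep(interval)
        return peak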
I'm running on SFCTA's 1 TB RAM, 2.29 GHz server. The server has 80 cores and 160 logical processors. Single process:
Update (6/5): The 100% HH, single-process, sharrow-on production run finished on SFCTA's server. It took 56 hours to complete, with a memory peak of 389 GB. The "flat valleys" in the chart take up much of the run time; they are the location choice components.
30/05/2024 10:00:23 - INFO - sharrow - using existing flow code ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF
30/05/2024 10:00:23 - INFO - activitysim.core.flow - completed setting up flow workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils in 0:00:00.139339
30/05/2024 10:00:23 - INFO - activitysim.core.flow - begin flow_ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF.load workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils
30/05/2024 10:20:58 - INFO - activitysim.core.flow - completed flow_ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF.load in 0:20:35.055109 workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils
30/05/2024 10:20:58 - INFO - activitysim.core.flow - completed apply_flow in 0:20:35.196635
...
30/05/2024 11:41:42 - INFO - sharrow - flow exists in library: QW257FB6ESHEH4IIQUSVA7677OVHLFGU
30/05/2024 11:41:42 - INFO - activitysim.core.flow - completed setting up flow school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils in 0:00:00.265573
30/05/2024 11:41:42 - INFO - activitysim.core.flow - begin flow_QW257FB6ESHEH4IIQUSVA7677OVHLFGU.load school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils
30/05/2024 12:52:07 - INFO - activitysim.core.flow - completed flow_QW257FB6ESHEH4IIQUSVA7677OVHLFGU.load in 1:10:24.558034 school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils
30/05/2024 12:52:07 - INFO - activitysim.core.flow - completed apply_flow in 1:10:24.823607
...
31/05/2024 04:43:17 - INFO - sharrow - flow exists in library: CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR
31/05/2024 04:43:17 - INFO - activitysim.core.flow - completed setting up flow non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils in 0:00:00
31/05/2024 04:43:17 - INFO - activitysim.core.flow - begin flow_CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR.load non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils
31/05/2024 07:12:50 - INFO - activitysim.core.flow - completed flow_CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR.load in 2:29:33.130106 non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils
31/05/2024 07:12:50 - INFO - activitysim.core.flow - completed apply_flow in 2:29:33.130106
Update on running the model with full sample, single process, and sharrow on, AND the following env variables set to 1: os.environ["OMP_NUM_THREADS"] = "1". The runtime, at 2289.1 minutes, appears to be somewhat longer, although I cannot be sure why. So in single-process mode, setting these variables to 1 either makes no difference or makes things slightly worse. However, I observed that CPU usage was at 100% with all cores involved for some model steps. I could confirm this high CPU usage for the school and workplace location models, but did not keep a close eye during other model steps. We will likely be able to fix this issue if we can figure out why all cores get involved.
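One way to check why all cores still spike despite the env vars is to dump the active thread pools from inside the run. A small diagnostic sketch, assuming threadpoolctl and a recent numba are installed in the environment:

    import numba
    from threadpoolctl import threadpool_info

    print("numba threads:", numba.get_num_threads())
    for pool in threadpool_info():
        # each entry reports the threading library in use and its pool size
        print(pool["user_api"], pool["num_threads"])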
I did a 100% household sample, single-process, sharrow-on run on WSP's machine, using the latest Sharrow (including Jeff's code changes to work with np.where). The run time is only 6.9 hours (compared to the 36 and 56 hours reported before), with a memory peak of 411 GB. Looking at the memory chart, I don't see the long "flat valleys" I reported previously. It seems sharrow now works efficiently with np.where. ActivitySim: pr/867@b465dd0e4. I can do the same run with the Sharrow version before the np.where changes, to have an apples-to-apples comparison on WSP's machine.
Below is a comparison of two Sharrow runs on the same WSP machine. Both runs used ActivitySim pr/867@b465dd0e4. The only difference is the Sharrow version: the run on top used v2.9.1, the run on the bottom used main@8d63a66. With v2.9.1, the run time was 489.3 minutes (8.2 hours). With main@8d63a66, the run time was 413.9 minutes (6.9 hours), a 1.3-hour saving. WSP_server_single_process_sharrow_v291.zip WSP_server_single_process_sharrow_8d63a66.zip The 8.2-hour run time on WSP's machine is much shorter than the 36 hours (RSG) and 56 hours (SFCTA), all using Sharrow v2.9.1. Next step: we can cross-check the versions of dependencies (like xarray, numba) of these two runs against the other runs. Could those be different and causing the difference in run time?
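For the dependency cross-check suggested above, a quick sketch that prints the installed versions in each environment so the runs can be compared side by side; the package list is an assumption about which dependencies matter most.

    import importlib.metadata as md

    for pkg in ("activitysim", "sharrow", "numba", "numpy", "xarray", "pandas"):
        try:
            print(pkg, md.version(pkg))
        except md.PackageNotFoundError:
            print(pkg, "not installed")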
To examine the hypothesis that something other than raw CPU and total memory is limiting performance, I ran a new test on the SFCTA machine: identical versions of the model, run once with unlimited multi-threading and once with multi-threading limited to 36 threads (slightly less than 25% of the number of CPUs on the machine). Unfortunately I forgot to compile sharrow first, so the unlimited run includes compile time as well. Those times are now reported in the log, and I have netted them out of the results shown below. (I am also re-running the relevant test line to double-check the results.)

It appears that, on this machine, limiting sharrow/numba to use only 25% of the CPUs makes the overall model run twice as fast. If you watch the performance monitor during the model run (which I did), it does look like under the unlimited-thread case we are "using" all available cores to the max, while under the limited-thread case we are only using about 1/4 of the machine's capacity. However, the total runtimes tell a different story: with limited threads, many of the sharrow-heavy model components run about twice as fast.

I do not know for sure what part of the hardware is the bottleneck. I had previously hypothesized that it might be memory bandwidth, but now I suspect that's not it (or not the whole story): if that were the only constraint, I'd expect the limited-thread run to be about as fast as the unlimited run, but not a lot faster. I think it might instead be the on-chip memory cache (L1/L2/L3 cache). This cache is like super-fast RAM located on-chip and used for intermediate calculations, and on the SFCTA server there is 226 MB of it available. If we run 36 threads in parallel, each thread gets roughly 6 MB of cache to play with (enough to hold a couple of rows of skim data at once), while with 160 threads each gets only about 1.4 MB, which may not be enough, so the code has to fetch data from relatively slow system RAM much more frequently. (This could also explain why the code runs very efficiently on my M2 Max Mac laptop, which offers 7 MB of L2+L3 cache per CPU core.) Some more experimentation will be required to confirm these findings and develop hardware- and/or model-size/complexity-related guidance on what might work best for different models.
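The post does not show exactly how the 36-thread cap was applied; below is a sketch of the two standard numba mechanisms, either of which would limit the parallel loops that sharrow's compiled flows run on.

    import os
    os.environ["NUMBA_NUM_THREADS"] = "36"  # must be set before numba is first imported

    import numba
    numba.set_num_threads(36)  # runtime adjustment, cannot exceed NUMBA_NUM_THREADS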
I have extended the experiment above to further reduce the number of threads that Numba can use. I also re-ran the "unlimited" test with the precompiled sharrow flows, for a better apples-to-apples comparison.
On WSP's machine, the effect seems to be the opposite: fewer numba threads led to longer run times. This is the machine that has always had a significantly lower run time among all test machines, even before we started exploring numba threading.
The L1+L2+L3 cache on the WSP machine is not larger than on the other machines, yet it runs faster than the other machines when given similar cache per numba thread. Note also how 32 and 64 threads have almost the same run time. I don't think we have ruled out memory bandwidth. Could it be interacting with the numba thread count? For example, on a machine with fast memory bandwidth, the more threads the better, while on a machine with slow memory bandwidth, the fewer threads the better?
This is the issue to report on memory usage and runtime performance...
data_dir: "data-full"          # full scale skims (24333 MAZs)
households_sample_size: 0      # full scale 100% sample of households
sharrow: "require"
multiprocess: false            # single process