
Full Scale Performance: Single Process, Sharrow On #6

Open · jpn-- opened this issue Apr 26, 2024 · 19 comments
Labels: performance-checks (Issues that report on model performance)

@jpn-- (Member) commented Apr 26, 2024

This is the issue to report on memory usage and runtime performance...

  • data_dir: "data-full" full scale skims (24333 MAZs)
  • households_sample_size: 0 (full scale 100% sample of households)
  • sharrow: "require"
  • multiprocess: false single process
jpn-- added the performance-checks label Apr 26, 2024
@i-am-sijia (Contributor)

On a Windows machine with 512 GB of RAM and a 2.44 GHz processor, using the blosc2:zstd compressed skims.

The compute_accessibility step (the sharp yellow spike at the far left of the charts) consistently takes 314 GB of memory because it runs over the full set of zones, so its memory use does not change with household sample size. The memory peaks discussed below exclude compute_accessibility.

  • Sharrow Compile with 100,000 households. Completed in 3 hours. Memory peak in Trip Destination, 165 GB.
  • Sharrow Production with 300,000 households (~25%). Completed in 4.5 hours. Memory peak in Trip Destination, 218 GB.
  • Sharrow Production with 640,000 households (50%). Completed in 8 hours. Memory peak in Trip Destination, 311 GB.

With 100% of households, we estimate the run time to be about 16 hours and the memory peak to be about 500 GB. WSP is currently running this on the same 512 GB RAM machine. We should probably run it on a larger machine to trace memory fairly.
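
A rough linear extrapolation from the two production runs above lands close to these numbers (a sketch, not how the estimate was produced, assuming the full population is about 1.28 million households since 640,000 is 50%):

# Sketch: linear extrapolation of run time and peak memory to the full population,
# using only the 300k and 640k production points reported above.
import numpy as np

households = np.array([300_000, 640_000])
runtime_hr = np.array([4.5, 8.0])
peak_gb = np.array([218.0, 311.0])

full = 1_280_000  # 640,000 households is 50%, so 100% is roughly 1.28 million
print(f"runtime  ~{np.polyval(np.polyfit(households, runtime_hr, 1), full):.0f} hours")  # ~15 hours
print(f"peak mem ~{np.polyval(np.polyfit(households, peak_gb, 1), full):.0f} GB")        # ~486 GB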

Note: Sharrow is set to skip Trip Destination in ABM3. @jpn-- is working on turning it on. Hopefully we will see trip destination memory come down with Sharrow turned on, as we saw in the 1-zone model.

ABM3 Sharrow Single Process

@dhensle (Contributor) commented May 24, 2024

My sharrow-on run is still going, but I am seeing some very large increases in runtime compared to the sharrow-off runs on this same machine:

image

I will let it finish, but wanted to bring this up asap... Any ideas on what is going on here @jpn-- ?

I do notice that the CPU usage on the server will spike to 100% even though I am only running single process. I assume this is because multi-threading is turned on and this is expected behavior?

@aletzdy commented May 28, 2024

Update on the run with the full sample, sharrow on, and single process (1 TB memory, Intel Xeon Gold 6342 @ 2.8 GHz machine):

  • Total runtime: 2,177.9 minutes
  • Peak memory usage: 417.5 GB, occurring in mandatory tour scheduling
  • The longer-runtime issue is still present, and I had a similar problem in last week's SEMCOG sharrow run (and MWCOG last year). @jpn-- any suggestions yet?
    activitysim.log
    timing_log.csv

memory_profile.csv

image

@aletzdy commented May 29, 2024

Update: the longer runtime is no longer happening under sharrow with multiprocess. Total runtime is approximately 150 minutes (full sample).

Note that I added these constraints to simulation.py; it would be good to know whether they are required for sharrow MP:


import os

# Limit each numerical library (OpenMP, OpenBLAS, Numba, vecLib, NumExpr) to a single thread.
# These environment variables generally need to be set before the libraries are imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

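For what it's worth, the same limits can also be applied at runtime instead of through environment variables; a minimal sketch (not from simulation.py, and assuming the optional threadpoolctl package is installed) would be:

# Sketch: runtime alternatives to the environment variables above.
import numba
from threadpoolctl import threadpool_limits

numba.set_num_threads(1)           # cap numba's parallel threads at runtime
with threadpool_limits(limits=1):  # cap OpenMP/BLAS thread pools inside this block
    pass                           # place the heavy computation here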

@joecastiglione
@aletzdy - what was peak memory usage for run under sharrow with multi process? Can you post performance profile?

@jpn-- (Member, Author) commented May 29, 2024

On the memory trace, I see lots of flat sections, all with about the same slope. Can you confirm that there are memory pings filling out the middle of these slopes at the constant ping rate, and that it's not just the computer completely locking up periodically?

Reviewing the log file, it appears the problem is entirely in the interaction components. The flat and consistent slope of the memory profile each time it happens suggests we are somehow leaning against some memory speed constraint; otherwise I'd expect the slopes to differ more drastically across components. @aletzdy, can you try:

  1. running the single process with these same environment flags?
  2. running the single process, changing nothing except using a 10% household sample instead of the full population?

Thanks
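
A quick way to check the ping spacing from the raw memory log would be something like the sketch below (assuming memory_profile.csv has a timestamp column named "time"; adjust the column name if it differs):

# Sketch: check whether memory pings arrive at a roughly constant interval.
import pandas as pd

df = pd.read_csv("memory_profile.csv", parse_dates=["time"])
gaps = df["time"].diff().dt.total_seconds().dropna()
print(gaps.describe())  # a large max relative to the median would indicate stalls between pings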


@jpn-- (Member, Author) commented May 29, 2024

@aletzdy can you also share here the raw data file for the memory log (the one behind the figure you posted)? I want to more closely cross-reference the log output against the tall but very short-lived memory spikes to see if I can narrow down the causes.

@aletzdy commented May 29, 2024


I edited my original post to add the memory_profile.csv. We do get the memory pings and it does not appear the system is locking up.

I will try your suggestions. Thanks.

@aletzdy commented May 29, 2024

> @aletzdy - what was peak memory usage for run under sharrow with multi process? Can you post performance profile?

My multiprocess run is not creating the memory_profile.csv so it is hard to say what the peak memory usage is. I am not sure if memory profile simply does not work with MP? @jpn-- or @i-am-sijia might know more.

@jpn-- (Member, Author) commented May 29, 2024

> I am not sure if memory profile simply does not work with MP?

It doesn't. We've talked about this in our recent meetings. Measuring memory usage in MP is complex.
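
As a rough workaround (not the built-in profiler), one could poll the parent process and sum memory across its children with psutil; a sketch, with a placeholder PID for illustration:

# Sketch: approximate total memory of a multiprocess run by summing RSS across
# the main process and its children. Not equivalent to memory_profile.csv, and
# it double-counts pages shared between processes.
import psutil

def total_rss_gb(pid: int) -> float:
    parent = psutil.Process(pid)
    procs = [parent] + parent.children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / 1024**3

# print(total_rss_gb(12345))  # 12345 is a placeholder for the ActivitySim parent PID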

@i-am-sijia (Contributor) commented May 30, 2024

I'm running on SFCTA's 1 TB RAM, 2.29 GHz server. The server has 80 cores and 160 logical processors.

Single process:

  • Sharrow compile took 7.5 hours to complete. I noticed that when running begin flow_xxxxxx.load, the CPU spikes to 100% and it is very slow.
  • Sharrow production with 100% hh - currently running

Update (6/5)

The 100% HH, single-process, sharrow production run finished on SFCTA's server. It took 56 hours to complete, with a memory peak of 389 GB.

abm3 production on sfcta machine

The "flat valleys" in the chart take up much of the run time; they are the location choice components.

30/05/2024 10:00:23 - INFO - sharrow - using existing flow code ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF
30/05/2024 10:00:23 - INFO - activitysim.core.flow - completed setting up flow workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils in 0:00:00.139339 
30/05/2024 10:00:23 - INFO - activitysim.core.flow - begin flow_ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF.load workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils
30/05/2024 10:20:58 - INFO - activitysim.core.flow - completed flow_ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF.load in 0:20:35.055109 workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils
30/05/2024 10:20:58 - INFO - activitysim.core.flow - completed apply_flow in 0:20:35.196635 
...
30/05/2024 11:41:42 - INFO - sharrow - flow exists in library: QW257FB6ESHEH4IIQUSVA7677OVHLFGU
30/05/2024 11:41:42 - INFO - activitysim.core.flow - completed setting up flow school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils in 0:00:00.265573 
30/05/2024 11:41:42 - INFO - activitysim.core.flow - begin flow_QW257FB6ESHEH4IIQUSVA7677OVHLFGU.load school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils
30/05/2024 12:52:07 - INFO - activitysim.core.flow - completed flow_QW257FB6ESHEH4IIQUSVA7677OVHLFGU.load in 1:10:24.558034 school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils
30/05/2024 12:52:07 - INFO - activitysim.core.flow - completed apply_flow in 1:10:24.823607 
...
31/05/2024 04:43:17 - INFO - sharrow - flow exists in library: CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR
31/05/2024 04:43:17 - INFO - activitysim.core.flow - completed setting up flow non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils in 0:00:00 
31/05/2024 04:43:17 - INFO - activitysim.core.flow - begin flow_CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR.load non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils
31/05/2024 07:12:50 - INFO - activitysim.core.flow - completed flow_CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR.load in 2:29:33.130106 non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils
31/05/2024 07:12:50 - INFO - activitysim.core.flow - completed apply_flow in 2:29:33.130106 
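
To quantify how much of the run these flow loads account for, the durations can be scraped out of activitysim.log; a rough sketch based on the "completed flow_....load in ..." lines above:

# Sketch: total time spent in sharrow flow loads, scraped from activitysim.log
# using the "completed flow_XXXX.load in H:MM:SS.ffffff" lines shown above.
import re
from collections import defaultdict
from datetime import timedelta

pattern = re.compile(r"completed (flow_\w+)\.load in (\d+):(\d+):([\d.]+)")
totals = defaultdict(timedelta)
with open("activitysim.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            flow, hh, mm, ss = match.groups()
            totals[flow] += timedelta(hours=int(hh), minutes=int(mm), seconds=float(ss))

for flow, spent in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(flow, spent)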

@aletzdy commented Jun 12, 2024

Update on running the model with full sample, single-process, and sharrow on, AND the following env variables set to 1:

import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

The runtime, at 2289.1 minutes, appears to be somewhat longer, although I cannot be sure why. So in single-process mode, setting these variables to 1 either makes no difference or makes things slightly worse. However, I observed that CPU usage was at 100%, with all cores involved, for some model steps. I could confirm this high CPU usage for the school and workplace location models, but did not keep a close eye during other model steps. We will likely fix this issue if we can figure out why all cores get involved.
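
A small diagnostic could confirm whether the limits actually reached numba and the BLAS pools; a sketch, assuming it is added near the top of simulation.py after the environment variables are set and that threadpoolctl is installed:

# Sketch: report the thread counts that numba and the loaded BLAS/OpenMP
# libraries are actually using after the environment variables are set.
import numba
from threadpoolctl import threadpool_info

print("numba threads:", numba.get_num_threads())
for pool in threadpool_info():
    print(pool["internal_api"], "threads:", pool["num_threads"])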

memory_profile.csv
timing_log.csv
activitysim.log

@i-am-sijia (Contributor)

I did a 100% household sample, single-process, sharrow-on run on WSP's machine, using the latest Sharrow (including Jeff's code changes to work with np.where). The run time is only 6.9 hours (compared to the 36 and 56 hours reported before), with a memory peak of 411 GB. Looking at the memory chart, I don't see the long "flat valleys" I reported previously. It seems sharrow is now working efficiently with np.where.

ActivitySim: pr/867@b465dd0e4
Sharrow: main@8d63a66
Example: pr/21@57915ab

I can do the same run with the Sharrow version before the np.where changes, to have an apples-to-apples comparison on WSP's machine.

abm3 production on WSP machine

@dhensle (Contributor) commented Jun 18, 2024

Trying to reproduce the above low runtimes on an RSG machine was unsuccessful -- the run times were still quite long with sharrow turned on.

Run time was 1266.9 minutes = 21.1 hours. The memory profile actually looks quite similar; the run time is just much slower.

Ran on a 512 GB machine with 24 cores running 2.1 GHz Intel Xeon processors.

image

Confirmation that I am indeed using the same version of sharrow:
image

We saw the same thing on the SANDAG machine, but even slower at 2235 mins = 37.3 hours (consistent with the earlier runtimes on this server using sharrow v2.9.1 before the np.where fix).

@i-am-sijia (Contributor)

Below is a comparison of two Sharrow runs on the same WSP machine. Both runs used:

ActivitySim: pr/867@b465dd0e4
Example: pr/21@57915ab

The only difference is the Sharrow version: the run on the top used v2.9.1, the one on the bottom used main@8d63a66.

With v2.9.1, the run time was 489.3 mins, 8.2 hours. With main@8d63a66, the run time was 413.9 mins, 6.9 hours, a 1.3-hour saving.

ABM Single Process Sharrow Comparison

WSP_server_single_process_sharrow_v291.zip

WSP_server_single_process_sharrow_8d63a66.zip

The 8.2 hour run time on WSP's machine is much shorter than the 36 hours (RSG) and 56 hours (SFCTA), all using Sharrow v2.9.1.

Next step: cross-check the versions of dependencies (like xarray and numba) in these two runs against the other runs. Could those differ and be causing the difference in run time?
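
A quick way to dump the relevant versions on each machine for that cross-check (a sketch; the package list is just a guess at the likely suspects):

# Sketch: print versions of the packages most likely to affect sharrow performance.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["activitysim", "sharrow", "numba", "numpy", "xarray", "pandas"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")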

@jpn-- (Member, Author) commented Jun 27, 2024

To examine the hypothesis that there is something other than raw CPU and total memory that is blocking performance, I ran a new test on the SFCTA machine: Running identical versions of the model, once with unlimited multi-threading, and once with multithreading limited to 36 threads (slightly less than 25% of the number of CPUs on the machine).

Unfortunately I forgot to compile sharrow first, so the unlimited run has compile time in it as well. Those times are now reported in the log, and I have netted them out of the results shown below. (I am also re-running the relevant test line to double-check the results.)

It appears that, on this machine, limiting sharrow/numba to use only 25% of the CPUs makes the overall model run about twice as fast. If you look at the performance monitor during the model run (which I did), it does look like under the unlimited-thread case we are "using" all available cores to the max, while under the limited-thread case we are only using about a quarter of the machine's capacity. However, the total runtimes tell a different story: with limited threads, many of the sharrow-heavy model components run about twice as fast.

I do not know for sure what part of the hardware is the bottleneck. I had previously hypothesized that it might be memory bandwidth, but now I suspect that's not it (or at least not the whole story): if it were only bandwidth, I'd expect the limited-thread run to be about as fast as the unlimited one, not a lot faster. I think it might instead be the on-chip memory cache (L1/L2/L3). This cache is like super-fast RAM located on the chip and used for intermediate calculations, and the SFCTA server has 226 MB of it. If we run 36 threads in parallel, each thread might get about 6 MB of cache to work with (enough to hold a couple of rows of skim data at once), while with 160 threads each gets only about 1.4 MB, which may not be enough, so the code has to fetch data from the relatively slow system RAM much more frequently. (This could also explain why the code runs very efficiently on my M2 Max Mac laptop, which offers 7 MB of L2+L3 cache per CPU core.)
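
The per-thread cache arithmetic behind that guess, as a small worked example using the numbers quoted above:

# Sketch: cache available per numba thread on the SFCTA server (~226 MB combined L1/L2/L3).
total_cache_mb = 226
for threads in (36, 160):
    print(f"{threads} threads -> ~{total_cache_mb / threads:.1f} MB of cache per thread")
# 36 threads -> ~6.3 MB per thread; 160 threads -> ~1.4 MB per thread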

Some more experimentation will be required to confirm these findings and develop hardware and/or model-size/complexity related guidance on what might work best for different models.

  • ActivitySim: pr/867@d98f776af
  • sandag-abm3-example: main@8b58e69
  • Sharrow: Release 2.10.0
  • numba: v0.60.0
  • full-scale skims
  • household sample size: 200,000 (not 100%, to run this experiment faster)
image

@jpn-- (Member, Author) commented Jul 2, 2024

I have extended the experiment above to further reduce the number of threads that Numba can use. I also re-ran the "unlimited" test with the precompiled sharrow flows, for a better apples-to-apples comparison.

  • ActivitySim: pr/867@d98f776af
  • sandag-abm3-example: main@8b58e69
  • Sharrow: Release 2.10.0
  • numba: v0.60.0
  • full-scale skims
  • household sample size: 200,000 (not 100%, to run this experiment faster)
image

@dhensle (Contributor) commented Jul 2, 2024

Ran essentially the same test as above, except on the 24-core RSG machine.

100% sample used, comparing the result between using all 24 threads and only 1 thread:
image

@i-am-sijia (Contributor)

On WSP's machine, the effect seems to be the opposite: fewer numba threads led to a longer run time. This is the machine that has always had a significantly lower run time than the other test machines, before we started exploring numba threading.

  • Sharrow: v2.10.0
  • ActivitySim: pr/867@c9d420550
  • ABM3: main@8b58e69
  • numba: 0.59.1
  • full-scale skims
  • household sample size: 100,000 (not 100%, to run this experiment faster)

image

The combined L1+L2+L3 cache on the WSP machine is not larger than on the other machines, yet it runs faster than the other machines with a similar amount of cache per numba thread.

Note also how 32 and 64 threads have almost the same run time.

I don't think we have ruled out memory bandwidth. Could it be interacting with the numba thread count? For example, on a machine with fast memory bandwidth, more threads might be better, while on a machine with slow memory bandwidth, fewer threads might be better.
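
One crude way to compare machines on that hypothesis would be a simple copy-bandwidth probe (a sketch, not a substitute for a proper STREAM benchmark; run the same script on each machine and compare):

# Sketch: rough single-thread memory bandwidth probe using a large numpy copy.
# Results are only indicative; allocates roughly 8 GB in total.
import time
import numpy as np

n = 500_000_000  # ~4 GB of float64
a = np.ones(n)
b = np.empty_like(a)

start = time.perf_counter()
np.copyto(b, a)
elapsed = time.perf_counter() - start
gbytes = 2 * a.nbytes / 1e9  # read a + write b
print(f"~{gbytes / elapsed:.1f} GB/s effective copy bandwidth")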
