
Full Scale Performance: Single Process, Sharrow On #6

Open · jpn-- opened this issue Apr 26, 2024 · 19 comments
Labels: performance-checks (Issues that report on model performance)

@jpn-- (Member) commented Apr 26, 2024

This is the issue to report on memory usage and runtime performance...

  • data_dir: "data-full" full scale skims (24333 MAZs)
  • households_sample_size: 0 (full scale 100% sample of households)
  • sharrow: "require"
  • multiprocess: false single process
jpn-- added the performance-checks label Apr 26, 2024
@i-am-sijia (Contributor)

On a Windows machine with 512 GB of RAM and a 2.44 GHz processor, using the blosc2:zstd compressed skims.

The compute_accessibility step (the sharp yellow spike at the far left of the charts) consistently takes 314 GB of memory because it runs over the full set of zones, so its memory use does not change with household sample size. The memory peaks discussed below exclude compute_accessibility.

  • Sharrow Compile with 100,000 households. Completed in 3 hours. Memory peak in Trip Destination, 165 GB.
  • Sharrow Production with 300,000 households (~25%). Completed in 4.5 hours. Memory peak in Trip Destination, 218 GB.
  • Sharrow Production with 640,000 households (50%). Completed in 8 hours. Memory peak in Trip Destination, 311 GB.

With 100% of households, we estimate the run time to be about 16 hours and the memory peak to be about 500 GB. WSP is currently running this on the same 512 GB RAM machine. We should probably run it on a larger machine to trace memory fairly.
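
A rough linear extrapolation from the two production runs above lands close to these numbers (a sketch, not how the estimate was produced, assuming the full population is about 1.28 million households since 640,000 is 50%):

# Sketch: linear extrapolation of run time and peak memory to the full population,
# using only the 300k and 640k production points reported above.
import numpy as np

households = np.array([300_000, 640_000])
runtime_hr = np.array([4.5, 8.0])
peak_gb = np.array([218.0, 311.0])

full = 1_280_000  # 640,000 households is 50%, so 100% is roughly 1.28 million
print(f"runtime  ~{np.polyval(np.polyfit(households, runtime_hr, 1), full):.0f} hours")  # ~15 hours
print(f"peak mem ~{np.polyval(np.polyfit(households, peak_gb, 1), full):.0f} GB")        # ~486 GB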

Note: Sharrow is set to skip Trip Destination in ABM3. @jpn-- is working on turning it on. Hopefully we will see trip destination memory come down with Sharrow turned on, as we saw in the 1-zone model.

ABM3 Sharrow Single Process

@dhensle (Contributor) commented May 24, 2024

My sharrow-on run is still going, but I am seeing some very large increases in runtime compared to the sharrow-off runs on this same machine:

image

I will let it finish, but wanted to bring this up asap... Any ideas on what is going on here @jpn-- ?

I do notice that the CPU usage on the server will spike to 100% even though I am only running single process. I assume this is because multi-threading is turned on and this is expected behavior?

@aletzdy commented May 28, 2024

Update on the run with the full sample, sharrow on, and single process (1 TB memory, Intel Xeon Gold 6342 @ 2.8 GHz machine):

  • Total runtime: 2,177.9 minutes
  • Peak memory usage: 417.5 GB, occurring in mandatory tour scheduling
  • The longer-runtime issue is still present, and I had a similar problem in last week's SEMCOG sharrow run (and MWCOG last year). @jpn-- any suggestions yet?
    activitysim.log
    timing_log.csv

memory_profile.csv

image

@aletzdy commented May 29, 2024

Update: the longer runtime is no longer happening under sharrow with multiprocess. Total runtime is approximately 150 minutes (full sample).

Note that I added these constraints to simulation.py; it would be good to know whether they are required for sharrow MP:


import os

# Limit each numerical library (OpenMP, OpenBLAS, Numba, vecLib, NumExpr) to a single thread.
# These environment variables generally need to be set before the libraries are imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

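For what it's worth, the same limits can also be applied at runtime instead of through environment variables; a minimal sketch (not from simulation.py, and assuming the optional threadpoolctl package is installed) would be:

# Sketch: runtime alternatives to the environment variables above.
import numba
from threadpoolctl import threadpool_limits

numba.set_num_threads(1)           # cap numba's parallel threads at runtime
with threadpool_limits(limits=1):  # cap OpenMP/BLAS thread pools inside this block
    pass                           # place the heavy computation here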

@joecastiglione
@aletzdy - what was peak memory usage for run under sharrow with multi process? Can you post performance profile?

@jpn-- (Member, Author) commented May 29, 2024

On the memory trace, I see lots of flat sections, all with about the same slope. Can you confirm that there are memory pings filling out the middle of these slopes at the constant ping rate, and that it's not just the computer completely locking up periodically?

Reviewing the log file, it appears the problem is entirely in the interaction components. The flat and consistent slope of the memory profile each time it happens suggests we are somehow leaning against some memory speed constraint; otherwise I'd expect the slopes to differ more drastically across components. @aletzdy, can you try:

  1. running the single process with these same environment flags?
  2. running the single process, changing nothing except using a 10% household sample instead of the full population?

Thanks
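
A quick way to check the ping spacing from the raw memory log would be something like the sketch below (assuming memory_profile.csv has a timestamp column named "time"; adjust the column name if it differs):

# Sketch: check whether memory pings arrive at a roughly constant interval.
import pandas as pd

df = pd.read_csv("memory_profile.csv", parse_dates=["time"])
gaps = df["time"].diff().dt.total_seconds().dropna()
print(gaps.describe())  # a large max relative to the median would indicate stalls between pings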


@jpn-- (Member, Author) commented May 29, 2024

@aletzdy can you also share here the raw data file for the memory log (the one behind the figure you posted)? I want to more closely cross-reference the log output against the tall but very short-lived memory spikes to see if I can narrow down the causes.

@aletzdy commented May 29, 2024


I edited my original post to add the memory_profile.csv. We do get the memory pings and it does not appear the system is locking up.

I will try your suggestions. Thanks.

@aletzdy commented May 29, 2024

> @aletzdy - what was peak memory usage for run under sharrow with multi process? Can you post performance profile?

My multiprocess run is not creating the memory_profile.csv so it is hard to say what the peak memory usage is. I am not sure if memory profile simply does not work with MP? @jpn-- or @i-am-sijia might know more.

@jpn-- (Member, Author) commented May 29, 2024

> I am not sure if memory profile simply does not work with MP?

It doesn't. We've talked about this in our recent meetings. Measuring memory usage in MP is complex.
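
As a rough workaround (not the built-in profiler), one could poll the parent process and sum memory across its children with psutil; a sketch, with a placeholder PID for illustration:

# Sketch: approximate total memory of a multiprocess run by summing RSS across
# the main process and its children. Not equivalent to memory_profile.csv, and
# it double-counts pages shared between processes.
import psutil

def total_rss_gb(pid: int) -> float:
    parent = psutil.Process(pid)
    procs = [parent] + parent.children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / 1024**3

# print(total_rss_gb(12345))  # 12345 is a placeholder for the ActivitySim parent PID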

@i-am-sijia (Contributor) commented May 30, 2024

I'm running on SFCTA's 1 TB RAM, 2.29 GHz server. The server has 80 cores and 160 logical processors.

Single process:

  • Sharrow compile took 7.5 hours to complete. I noticed that when running begin flow_xxxxxx.load, the CPU spikes to 100% and it is very slow.
  • Sharrow production with 100% hh - currently running

Update (6/5)

The 100% HH, single-process, sharrow production run finished on SFCTA's server. It took 56 hours to complete, with a memory peak of 389 GB.

abm3 production on sfcta machine

The "flat valleys" in the chart take up much of the run time; they are the location choice components.

30/05/2024 10:00:23 - INFO - sharrow - using existing flow code ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF
30/05/2024 10:00:23 - INFO - activitysim.core.flow - completed setting up flow workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils in 0:00:00.139339 
30/05/2024 10:00:23 - INFO - activitysim.core.flow - begin flow_ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF.load workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils
30/05/2024 10:20:58 - INFO - activitysim.core.flow - completed flow_ST3RMKSZCPRZMY3GTE45CCEMBVJPZRYF.load in 0:20:35.055109 workplace_location.accessibilities.sample.mngt_busi_scic_arts.presample.interaction_sample.eval_interaction_utils
30/05/2024 10:20:58 - INFO - activitysim.core.flow - completed apply_flow in 0:20:35.196635 
...
30/05/2024 11:41:42 - INFO - sharrow - flow exists in library: QW257FB6ESHEH4IIQUSVA7677OVHLFGU
30/05/2024 11:41:42 - INFO - activitysim.core.flow - completed setting up flow school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils in 0:00:00.265573 
30/05/2024 11:41:42 - INFO - activitysim.core.flow - begin flow_QW257FB6ESHEH4IIQUSVA7677OVHLFGU.load school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils
30/05/2024 12:52:07 - INFO - activitysim.core.flow - completed flow_QW257FB6ESHEH4IIQUSVA7677OVHLFGU.load in 1:10:24.558034 school_location.i1.sample.preschool.presample.interaction_sample.eval_interaction_utils
30/05/2024 12:52:07 - INFO - activitysim.core.flow - completed apply_flow in 1:10:24.823607 
...
31/05/2024 04:43:17 - INFO - sharrow - flow exists in library: CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR
31/05/2024 04:43:17 - INFO - activitysim.core.flow - completed setting up flow non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils in 0:00:00 
31/05/2024 04:43:17 - INFO - activitysim.core.flow - begin flow_CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR.load non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils
31/05/2024 07:12:50 - INFO - activitysim.core.flow - completed flow_CFAQQFMLZ57BYCAEL7762KYMCCKRT3IR.load in 2:29:33.130106 non_mandatory_tour_destination.othmaint.sample.presample.interaction_sample.eval_interaction_utils
31/05/2024 07:12:50 - INFO - activitysim.core.flow - completed apply_flow in 2:29:33.130106 
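
To quantify how much of the run these flow loads account for, the durations can be scraped out of activitysim.log; a rough sketch based on the "completed flow_....load in ..." lines above:

# Sketch: total time spent in sharrow flow loads, scraped from activitysim.log
# using the "completed flow_XXXX.load in H:MM:SS.ffffff" lines shown above.
import re
from collections import defaultdict
from datetime import timedelta

pattern = re.compile(r"completed (flow_\w+)\.load in (\d+):(\d+):([\d.]+)")
totals = defaultdict(timedelta)
with open("activitysim.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            flow, hh, mm, ss = match.groups()
            totals[flow] += timedelta(hours=int(hh), minutes=int(mm), seconds=float(ss))

for flow, spent in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(flow, spent)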

@aletzdy commented Jun 12, 2024

Update on running the model with full sample, single-process, and sharrow on, AND the following env variables set to 1:

import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

The runtime, at 2289.1 minutes, appears to be somewhat longer, although I cannot be sure why. So in single-process mode, setting these variables to 1 either makes no difference or makes things slightly worse. However, I observed that CPU usage was at 100%, with all cores involved, for some model steps. I could confirm this high CPU usage for the school and workplace location models, but did not keep a close eye during other model steps. We will likely fix this issue if we can figure out why all cores get involved.
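
A small diagnostic could confirm whether the limits actually reached numba and the BLAS pools; a sketch, assuming it is added near the top of simulation.py after the environment variables are set and that threadpoolctl is installed:

# Sketch: report the thread counts that numba and the loaded BLAS/OpenMP
# libraries are actually using after the environment variables are set.
import numba
from threadpoolctl import threadpool_info

print("numba threads:", numba.get_num_threads())
for pool in threadpool_info():
    print(pool["internal_api"], "threads:", pool["num_threads"])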

memory_profile.csv
timing_log.csv
activitysim.log

@i-am-sijia (Contributor)

I did a 100% household sample, single-process, sharrow-on run on WSP's machine, using the latest Sharrow (including Jeff's code changes to work with np.where). The run time is only 6.9 hours (compared to the 36 and 56 hours reported before), with a memory peak of 411 GB. Looking at the memory chart, I don't see the long "flat valleys" I reported previously. It seems sharrow is now working efficiently with np.where.

ActivitySim: pr/867@b465dd0e4
Sharrow: main@8d63a66
Example: pr/21@57915ab

I can do the same run with the Sharrow version before the np.where changes, to have an apples-to-apples comparison on WSP's machine.

abm3 production on WSP machine

@dhensle (Contributor) commented Jun 18, 2024

Trying to reproduce the above low runtimes on an RSG machine was unsuccessful -- the run times were still quite long with sharrow turned on.

Run time was 1266.9 minutes = 21.1 hours. The memory profile actually looks quite similar; the run time is just much slower.

Ran on a 512 GB machine with 24 cores running 2.1 GHz Intel Xeon processors.

image

Confirmation that I am indeed using the same version of sharrow:
image

We saw the same thing on the SANDAG machine, but even slower at 2235 mins = 37.3 hours (consistent with the earlier runtimes on this server using sharrow v2.9.1 before the np.where fix).

@i-am-sijia (Contributor)

Below is a comparison of two Sharrow runs on the same WSP machine. Both runs used:

ActivitySim: pr/867@b465dd0e4
Example: pr/21@57915ab

The only difference is the Sharrow version: the run on the top used v2.9.1, the one on the bottom used main@8d63a66.

With v2.9.1, the run time was 489.3 mins, 8.2 hours. With main@8d63a66, the run time was 413.9 mins, 6.9 hours, a 1.3-hour saving.

ABM Single Process Sharrow Comparison

WSP_server_single_process_sharrow_v291.zip

WSP_server_single_process_sharrow_8d63a66.zip

The 8.2 hour run time on WSP's machine is much shorter than the 36 hours (RSG) and 56 hours (SFCTA), all using Sharrow v2.9.1.

Next step: cross-check the versions of dependencies (like xarray and numba) in these two runs against the other runs. Could those differ and be causing the difference in run time?
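
A quick way to dump the relevant versions on each machine for that cross-check (a sketch; the package list is just a guess at the likely suspects):

# Sketch: print versions of the packages most likely to affect sharrow performance.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["activitysim", "sharrow", "numba", "numpy", "xarray", "pandas"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")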

@jpn-- (Member, Author) commented Jun 27, 2024

To examine the hypothesis that there is something other than raw CPU and total memory that is blocking performance, I ran a new test on the SFCTA machine: Running identical versions of the model, once with unlimited multi-threading, and once with multithreading limited to 36 threads (slightly less than 25% of the number of CPUs on the machine).

Unfortunately I forgot to compile sharrow first, so the unlimited run has compile time in it as well. Those times are now reported in the log, and I have netted them out of the results shown below. (I am also re-running the relevant test line to double-check the results.)

It appears that, on this machine, limiting sharrow/numba to use only 25% of the CPUs makes the overall model run about twice as fast. If you look at the performance monitor during the model run (which I did), it does look like under the unlimited-thread case we are "using" all available cores to the max, while under the limited-thread case we are only using about a quarter of the machine's capacity. However, the total runtimes tell a different story: with limited threads, many of the sharrow-heavy model components run about twice as fast.

I do not know for sure what part of the hardware is the bottleneck. I had previously hypothesized that it might be memory bandwidth, but now I suspect that's not it (or at least not the whole story): if it were only bandwidth, I'd expect the limited-thread run to be about as fast as the unlimited one, not a lot faster. I think it might instead be the on-chip memory cache (L1/L2/L3). This cache is like super-fast RAM located on the chip and used for intermediate calculations, and the SFCTA server has 226 MB of it. If we run 36 threads in parallel, each thread might get about 6 MB of cache to work with (enough to hold a couple of rows of skim data at once), while with 160 threads each gets only about 1.4 MB, which may not be enough, so the code has to fetch data from the relatively slow system RAM much more frequently. (This could also explain why the code runs very efficiently on my M2 Max Mac laptop, which offers 7 MB of L2+L3 cache per CPU core.)
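
The per-thread cache arithmetic behind that guess, as a small worked example using the numbers quoted above:

# Sketch: cache available per numba thread on the SFCTA server (~226 MB combined L1/L2/L3).
total_cache_mb = 226
for threads in (36, 160):
    print(f"{threads} threads -> ~{total_cache_mb / threads:.1f} MB of cache per thread")
# 36 threads -> ~6.3 MB per thread; 160 threads -> ~1.4 MB per thread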

Some more experimentation will be required to confirm these findings and develop hardware and/or model-size/complexity related guidance on what might work best for different models.

  • ActivitySim: pr/867@d98f776af
  • sandag-abm3-example: main@8b58e69
  • Sharrow: Release 2.10.0
  • numba: v0.60.0
  • full-scale skims
  • household sample size: 200,000 (not 100%, to run this experiment faster)
image

@jpn-- (Member, Author) commented Jul 2, 2024

I have extended the experiment above to further reduce the number of threads that Numba can use. I also re-ran the "unlimited" test with the precompiled sharrow flows, for a better apples-to-apples comparison.

  • ActivitySim: pr/867@d98f776af
  • sandag-abm3-example: main@8b58e69
  • Sharrow: Release 2.10.0
  • numba: v0.60.0
  • full-scale skims
  • household sample size: 200,000 (not 100%, to run this experiment faster)
image

@dhensle (Contributor) commented Jul 2, 2024

Ran essentially the same test as above, except on the 24-core RSG machine.

100% sample used, comparing the result between using all 24 threads and only 1 thread:
image

@i-am-sijia (Contributor)

On WSP's machine, the effect seems to be the opposite: fewer numba threads led to a longer run time. This is the machine that has always had a significantly lower run time than the other test machines, before we started exploring numba threading.

  • Sharrow: v2.10.0
  • ActivitySim: pr/867@c9d420550
  • ABM3: main@8b58e69
  • numba: 0.59.1
  • full-scale skims
  • household sample size: 100,000 (not 100%, to run this experiment faster)

image

The combined L1+L2+L3 cache on the WSP machine is not larger than on the other machines, yet it runs faster than the other machines with a similar amount of cache per numba thread.

Note also how 32 and 64 threads have almost the same run time.

I don't think we have ruled out memory bandwidth. Could it be interacting with the numba thread count? For example, on a machine with fast memory bandwidth, more threads might be better, while on a machine with slow memory bandwidth, fewer threads might be better.
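
One crude way to compare machines on that hypothesis would be a simple copy-bandwidth probe (a sketch, not a substitute for a proper STREAM benchmark; run the same script on each machine and compare):

# Sketch: rough single-thread memory bandwidth probe using a large numpy copy.
# Results are only indicative; allocates roughly 8 GB in total.
import time
import numpy as np

n = 500_000_000  # ~4 GB of float64
a = np.ones(n)
b = np.empty_like(a)

start = time.perf_counter()
np.copyto(b, a)
elapsed = time.perf_counter() - start
gbytes = 2 * a.nbytes / 1e9  # read a + write b
print(f"~{gbytes / elapsed:.1f} GB/s effective copy bandwidth")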
