
Callable memory estimate #221

Draft · wants to merge 31 commits into master

Conversation

@BradGreig (Member) commented Mar 26, 2021:

Hey @steven-murray I've gone through and added a function to estimate the memory usage for a light-cone. Co-eval will follow a similar structure.

Outputting of information is still a work in progress (currently done with prints), but the main functionality is there.

I'll need to think about how best to check that these estimates actually match full runs of 21cmFAST. At the moment, the only thing I can think of is to check against the measured memory usage on my machine for an equivalent run. I guess this is the downside of these functions: they are not robust to changes elsewhere in the code and rely on me getting the behaviour correct.

Also, for the same reason, coming up with tests is problematic, as such tests will always pass unless someone manually changes these functions.

I'd greatly appreciate your thoughts/comments/feedback on what I have thus far.

Fixes #26


@steven-murray (Member) left a comment:

Thanks @BradGreig, this is very promising. Unfortunately I think the fact that it is not "automatic" is something we just have to accept -- I can't think of a non-brittle way to do this. On the other hand, I think the python memory part of it can be a little more automatic, as I commented in-line.

As for tests -- I wonder if it'd be possible to write a test in which we actually run a lightcone/coeval and somehow measure the memory with some tool, and compare to the estimate? In fact, we could even try doing this for all the integration tests, so that we test all the different options!

# First, calculate the memory usage for the initial conditions
memory_ics = mem_initial_conditions(user_params=user_params)

memory_data = {"ics_%s" % k: memory_ics["%s" % k] for k in memory_ics.keys()}
@steven-murray (Member): Instead of "%s" % k, you can just have k.

@BradGreig (Member Author): Yep, I may have just been copy-pasting that from something else. You are 100 per cent correct, of course!
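
For reference, a minimal before/after of the comprehension in question (the example values here are made up):

memory_ics = {"python": 1.2e9, "c": 3.4e9}  # illustrative values only

# Original: redundant string interpolation of keys that are already strings
memory_data = {"ics_%s" % k: memory_ics["%s" % k] for k in memory_ics.keys()}

# Suggested: index directly with k; an f-string also reads more cleanly
memory_data = {f"ics_{k}": v for k, v in memory_ics.items()}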

if flag_options.PHOTON_CONS:
# First need to create new structs for the photon-conservation correction
astro_params_photoncons = deepcopy(astro_params)
astro_params_photoncons._R_BUBBLE_MAX = astro_params.R_BUBBLE_MAX
@steven-murray (Member): Why is this necessary? Does it not get copied automatically in the deepcopy?

@BradGreig (Member Author): I copied this exactly from how it's done in the run_lightcone function. I didn't check too closely, but it seemed a little odd to me too. I'll double-check.

"""All declared HII_DIM boxes"""
# lowres_density, lowres_vx, lowres_vy, lowres_vz, lowres_vcb
# lowres_vx_2LPT, lowres_vy_2LPT, lowres_vz_2LPT
num_py_boxes_HII_DIM = 8.0
@steven-murray (Member): Here, I think we can do better after we get #220 in. That introduces an update where each class can return a dict of all its arrays and their shapes before being instantiated.

@BradGreig (Member Author): Yeah, that sounds pretty nice and will be very useful for these functions. This and #220 will probably need to be done hand in hand.

# These are all float arrays
size_py = (np.float32(1.0).nbytes) * size_py

"""Memory usage within GenerateICs"""
@steven-murray (Member): Prefer using comments rather than string literals.

@BradGreig (Member Author): Visually, I think I do this to break things up more clearly. But I'll change them.

return {"python": 0.0, "c": size_c}


def format_output(
@steven-murray (Member): I think this function name should perhaps be something like print_memory_estimate, and it should be imported into the top-level namespace.

@BradGreig (Member Author): Yep, I threw this together in a few minutes at the end of the day. I tried to use the built-in logging rather than printing, but it didn't work (hence the printing). I didn't try too hard to fix the issue, though.

@BradGreig (Member Author):

Hey @steven-murray, thanks for the quick feedback.

With respect to the tests, I think that would be great. However, I wonder how accurate it would be: different machines/compilers might allocate different amounts of memory, so these estimates may only be approximate relative to actual usage. You might have to allow some tolerance (maybe 10 per cent) to get the two numbers matching closely enough to pass the test.

I'm about to generate some runs to test how these functions perform relative to actual usage, and also how they scale with HII_DIM: if I get the scaling correct, that gives me confidence that the memory is being accurately captured by these functions.

@steven-murray (Member):

@BradGreig: perfect. Yeah, we'll put in a fairly wide tolerance. The idea would just be that we should capture the broad trends. It would be unusual for someone to require the exact memory to within 10%.
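
A sketch of what such a wide-tolerance test could look like. estimate_memory_lightcone is the estimator discussed in this PR, but the import path and both test fixtures (lightcone_kwargs, measured_peak_bytes) are hypothetical and would depend on whatever measurement tool is settled on:

import pytest

from py21cmfast import estimate_memory_lightcone  # import path is an assumption

TOLERANCE = 0.10  # the "fairly wide tolerance" discussed above

def test_lightcone_memory_estimate(lightcone_kwargs, measured_peak_bytes):
    # Compare the analytic estimate against the measured peak for a small run.
    est = estimate_memory_lightcone(**lightcone_kwargs)
    total = est["python"] + est["c"]  # keys follow the {"python": ..., "c": ...} returns above
    assert total == pytest.approx(measured_peak_bytes, rel=TOLERANCE)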

@BradGreig (Member Author):

Hey @steven-murray, I have been playing around with comparing the light-cone memory function to actual usage. For larger boxes it seems to do relatively well (still testing that, though); however, for smaller boxes it ends up being an underestimate.

The underestimate primarily comes from the fact that, for small HII_DIM, the memory declared for some small interpolation tables and other variables (which aren't included in this function) is comparable to the 3D boxes themselves. As HII_DIM increases this becomes less of a problem, since those allocations are dwarfed by the 3D boxes.

Therefore, I'm not sure one can simply have tests with a relatively large tolerance to capture this for the entire test suite.

@steven-murray (Member):

Hey @BradGreig, I guess that makes some sense. Two options: 1) capture at least the interpolation tables in the memory function (not sure how difficult that is); or 2) just report in the print_ function that this is a lower estimate if HII_DIM < some_number, and, when we test, just check that the estimate is below the actual memory usage (since we essentially only test small boxes).

@BradGreig (Member Author) commented Mar 31, 2021:

Hey @steven-murray, yeah, it is possible to capture the interpolation tables. It's straightforward; it's just that they weren't relevant for the main usage of this (trying to fit large boxes into memory, etc.). But I'll go away and add those. By the way, another reason I wasn't tracking these tables is that their size is defined by a C-side variable that is not accessible to Python (i.e. not in global_params). I don't think those variables will ever change, though.

Upon further research into memory recording (with help from @qyx268), one thing I am realising is that accurately determining the memory usage of a process is non-trivial and potentially inaccurate. It's OK for a single-threaded process, but one can run into issues with multi-threading (e.g. what is considered shared, private, etc.). Also, we have no control over whether, and at what point, swap memory is used. Further, we can't guarantee that a compiler is being intelligent in how it allocates/uses page memory. Thus I think dynamically estimating the memory and comparing against these functions is fraught with danger. One could do this with valgrind, but valgrind is super slow, so I'm not going to consider that.

Thus, I think the best way to think of these functions is that they provide an upper limit on the physical memory required to perform a calculation. That is, provided you have that much physical memory (or very close to it), the calculation should be able to proceed without crashing for lack of memory. That makes sense to me.

As an example (in the attached file), I am showing the tracked physical memory for a full light-cone (no minihaloes, but everything else on). This is for a DIM=525, HII_DIM=175 box, saving 4 light-cones. The horizontal dashed line corresponds to the peak physical memory predicted by my function; the blue curve is the physical memory reported by ps. The coloured patches denote which section of the code is running: grey = ICs, orange = perturbed field, green = spin temperature, and purple = ionisation field. As you can see, the actual physical memory is always below my estimate. I don't anticipate the interpolation tables making up the difference, so I am confident the function serves its purpose of providing an upper limit. It appears to be correct for smaller mini-halo boxes too.

MemoryUsageOutput.pdf
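
For context, a minimal sketch of the kind of external RSS polling used to produce a trace like this (the ps invocation is standard; the one-second sampling interval and kB units are arbitrary choices):

import subprocess
import time

def track_rss(pid, interval=1.0):
    # Poll the resident set size (in kB) of `pid` via ps until it exits.
    samples = []
    while True:
        out = subprocess.run(
            ["ps", "-o", "rss=", "-p", str(pid)],
            capture_output=True,
            text=True,
        )
        if out.returncode != 0:  # ps fails once the process has exited
            break
        samples.append((time.time(), int(out.stdout.strip())))
        time.sleep(interval)
    return samples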

@steven-murray (Member):

Hey @BradGreig, that sounds perfectly reasonable. Yes, I don't think we should be trying to predict the behaviour of the OS! If we do add the interpolation tables, we can properly call it an upper limit, and I think that's the most useful thing we can do. We should be able to test that as well.

@qyx268 (Contributor) left a comment:

Probably use a similar naming scheme as estimate_memory_lightcone for e.g. mem_initial_conditions, mem_perturb_field, etc., so users can also just run e.g. estimate_memory_initial_conditions if needed.

"""All declared 2LPT boxes (DIM)"""
# phi_1 (6 components)
if global_params.SECOND_ORDER_LPT_CORRECTIONS:
num_c_boxes += 6
@qyx268 (Contributor): Just a reminder to change this to 1.

@BradGreig (Member Author): Yep, this will be changed as the various PRs are merged.

/ np.log(global_params.DELTA_R_HII_FACTOR)
)
) + 1
else:
@qyx268 (Contributor): Would you reckon we should do the same for minihalo filtering if MINIMIZE_MEMORY?

@BradGreig (Member Author): This will also be updated once #220 is merged.

)

# dxheat_dt_box, dxion_source_dt_box, dxlya_dt_box, dstarlya_dt_box
num_c_boxes_initialised += 2.0 * 4.0 # factor of 2. as these are doubles
@qyx268 (Contributor): Shall we reduce it to float32 then?

@BradGreig (Member Author): No, these should stay as double for numerical accuracy.
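
For reference, the byte sizes behind the factor of 2 in the snippet above:

import numpy as np

# A float32 cell costs 4 bytes; the double-precision boxes discussed here
# cost 8 bytes per cell, hence the factor of 2 in the accounting.
assert np.float32(1.0).nbytes == 4
assert np.float64(1.0).nbytes == 8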

@BradGreig (Member Author):

Hey @qyx268, while individual functions can be easily implemented, I'm not really sure how useful they'll be.

I can see some value for initial_conditions, but for the rest I don't see the benefit of having individual functions, particularly given that all the requisite information will be provided by a run_coeval or run_lightcone call.

But I guess it can be added if it is deemed important and useful enough.

@BradGreig (Member Author):

Hey @steven-murray, following the completion of #220, I just want to clarify something. In the default case (i.e. MINIMIZE_MEMORY=False), does the memory usage remain as it was prior to #220? If so (and I am assuming it does), then it should only be a matter of adding a MINIMIZE_MEMORY=True output to what I have already done.

@steven-murray (Member):

Hi @BradGreig, no, the memory usage is not quite the same for run_coeval or run_lightcone, because the initial conditions are now (at least in most cases) purged from memory before anything after the perturb field is done. The general flow of these functions is now:

  • compute ICs
  • purge hires or lowres boxes from ICs depending on PERTURB_ON_HIGH_RES
  • compute all necessary PFs, then if write=True, purge them from memory
  • purge the whole ICs (unless they don't exist on disk -- then we're conservative and keep them around)
  • compute spin_temp/ionize etc. for all redshifts
  • save the whole coeval/lightcone (which may involve reading in some of the earlier boxes that were purged)

This should in general reduce the peak memory. Also, each box now only initializes the arrays that it actually needs, given the input parameters (so for example, the _2LPT arrays in the ICs are only allocated if USE_2LPT=True). This will reduce the memory in most cases, not just the peak memory.

You can determine the exact OUTPUT memory required for each kind of output more easily now as well (without actually allocating it all). If you have an instantiated output object, you can use _array_structure to get the various box sizes, e.g.:

init = p21c.outputs.InitialConditions(user_params={...}, cosmo_params={...})
print(init._array_structure)

Just to be clear, creating the init here does not actually allocate any arrays -- they are allocated only as required, just before passing the object into C. The _array_structure is a dictionary whose keys correspond to arrays; the values are usually tuples giving the array shape, but are sometimes dicts themselves, with entries "shape" and "init", where "init" is a function used to initialize the array (by default np.zeros, but for some arrays it has to be np.ones).

So you can easily use this dictionary to compute the total output size of the struct. You'll then just need to add all the arrays that are allocated INSIDE the function; and of course, for any of the _c_based_pointers in the struct, you can't know their size before they are returned from C (at the moment this is limited to the halo field).

I think that covers all the bases? Lmk if you need any further info!
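
As an illustration, a minimal sketch of totalling the struct size from an _array_structure dict of the form just described; the 4-byte (float32) item size is an assumption, since some arrays may use other dtypes:

import numpy as np

def struct_size_bytes(array_structure, itemsize=4):
    # Each value is either a shape tuple, or a dict with "shape" (and "init")
    # entries, as described above. itemsize=4 assumes float32 storage.
    total = 0
    for spec in array_structure.values():
        shape = spec["shape"] if isinstance(spec, dict) else spec
        total += int(np.prod(shape)) * itemsize
    return total

# e.g. struct_size_bytes(init._array_structure) for the init object created above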

@BradGreig (Member Author):

Hey @steven-murray, awesome, thanks for the run-down. I just needed a quick idea of what I was going to have to deal with, and how much time I might need to set aside. It seems a bit more involved than I was expecting, but it's good to know beforehand.

@BradGreig (Member Author):

Hey @steven-murray once #320 gets merged in I might look at finally completing this PR. Not that it actually requires #320, but I think it'll be a clean/stable base to use.

@steven-murray (Member):

@BradGreig that'd be great!

@BradGreig mentioned this pull request May 16, 2023
@BradGreig (Member Author):

Hey @steven-murray, so I've tried to update all the memory functions I had, following all the recent changes. But I've noticed a couple of issues; I'm not sure whether it's my interpretation or something else that's not quite right. To be honest, I didn't look too closely at the behaviour of the functions.

As of now, the functions return estimates of the necessary memory for various options. However, they do not resemble what I find when performing an actual run.

The primary issue appears to be with the purging (either it doesn't behave like it should, or I'm misinterpreting its behaviour).

Basically, the memory functions return a close match for the peak memory, but they don't remotely match the ongoing memory throughout the calculation. For example, under the logic you described above (regarding purging), after the perturbed fields are computed the ICs are removed from memory; thus, the peak memory should always occur during the IC generation and/or the perturb-field calculation.

However, what I find in practice is that the memory only ever increases throughout a run. It never shows a significant drop following a purge of the ICs. Perhaps the compiler doesn't properly release the memory? In that case, once it is allocated it never gets properly released to be used again.

This behaviour will result in an underestimate of the peak memory usage, since with purging it is assumed that the peak happens during IC generation. In practice, since this memory doesn't appear to be fully released, the original allocation plus the memory of everything else results in a larger peak.

Secondly, if MINIMIZE_MEMORY is False, it appears all perturbed_field boxes are kept in memory at once. That is, for a regular light-cone (z = 5.5), that constitutes 88 perturbed fields held in memory. Is this desirable behaviour? I guess it comes about following the restructure of the code to purge the ICs, but I can see a few instances where this is far from ideal.

@steven-murray (Member):

@BradGreig, what are you using to capture the memory usage? I know the OS sometimes reserves memory for an application even after it has been deallocated, just in case it is needed again. I think there's some distinction between the kinds of memory the OS allocates for different purposes, and I can never quite remember which kind of RAM estimate to use.

Certainly the idea is that once we do perturbed fields, the initial conditions should be dropped from memory, so you should get lower current memory usage then.

Keeping all the perturbed field boxes in memory sounds like a terrible idea, so I'll try fixing that.

@BradGreig (Member Author):

Hey @steven-murray, in this case I was scraping the information from top on the command line during the run. Additionally, I grabbed the maximum memory from the HPC job monitor for the submitted job.

I think storing all perturbs has some merit, but it is highly dependent on the actual setup, which should be determinable within the function. So, if possible, you probably want to allow both options.

For example, if the total memory required by the perturbed fields is less than that of the ICs, then keeping them all is preferable. Otherwise, if the number of redshifts is too large, holding only two perturbs at any one time makes more sense, as sketched below.
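
A sketch of that decision rule (the function and its inputs are hypothetical, not existing API):

def keep_all_perturbs(n_redshifts, perturb_field_bytes, ics_bytes):
    # Cache every perturbed field only when, together, they cost no more
    # memory than the initial conditions they would replace; otherwise
    # hold just two (the bracketing redshifts) at a time.
    return n_redshifts * perturb_field_bytes <= ics_bytes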

Successfully merging this pull request may close: [Feature Req.] Add a function to estimate the amount of memory usage given the setup (#26)