Iteration could compute masks more efficiently #15

Open
kyleaoman opened this issue Sep 20, 2024 · 0 comments
In the initial SWIFTGalaxies iterator class, masks are calculated for each galaxy here:

    swift_galaxy = self._server[
        self.halo_catalogue._get_extra_mask(self._server)
    ]

within the loop over galaxies. This means that for each galaxy we evaluate:

    def _generate_bound_only_mask(self, SG: "SWIFTGalaxy") -> MaskCollection:
        # The halo_catalogue_index is the index into the full (HBT+ not SOAP) catalogue;
        # this is what group_nr_bound matches against.
        masks = MaskCollection(
            **{
                group_name: getattr(
                    SG, group_name
                )._particle_dataset.group_nr_bound.to_value(u.dimensionless)
                == self.input_halos.halo_catalogue_index.to_value(u.dimensionless)
                for group_name in SG.metadata.present_group_names
            }
        )
        if not self._multi_galaxy:
            for group_name in SG.metadata.present_group_names:
                del getattr(SG, group_name)._particle_dataset.group_nr_bound
        return masks

The == operation is fairly expensive. Perhaps the masks can be pre-computed for all target galaxies in a region just after the data preloading loop:

    for preload_field in self._init_args["preload"]:

Here, instead of looping over the galaxies with == and finding the matches in group_nr_bound, a more efficient solution needs to be found. The inputs are:

  • the target galaxies in the region (contained in solution["region_target_indices"]; these need to be converted to halo catalogue indices by looking up the corresponding rows in self.halo_catalogue.input_halos.halo_catalogue_index);
  • the particle group membership information, accessible as self._server.gas._particle_dataset.group_nr_bound and similarly for other particle types.

The desired output is:

  • a list of masks ([True, False, False, ...]), one for each particle type, that pick out the particles bound to each galaxy in the list of targets for this region.

This needs to be calculated more efficiently than a loop over == or a similar operation for this improvement to make sense. Probably this is a clever usage of numpy.unique(..., return_inverse=True).
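As a rough illustration of the inputs and outputs described above, here is a minimal sketch of one possible approach. It is not the numpy.unique idea itself but a related alternative: sort the membership array once with numpy.argsort, then find each target's run of particles with numpy.searchsorted. The function name bound_masks_by_sorting and the plain integer arrays are hypothetical stand-ins; in SWIFTGalaxies the arrays would come from group_nr_bound and halo_catalogue_index after stripping units. Whether this is actually faster than a loop over == has not been measured here.

```python
import numpy as np


def bound_masks_by_sorting(group_nr_bound, target_ids):
    """Return one boolean mask per target id, using a single sort.

    group_nr_bound: 1-D integer array, one entry per particle (hypothetical
        stand-in for the unit-stripped group_nr_bound dataset).
    target_ids: 1-D integer array of halo catalogue indices to pick out.
    """
    group_nr_bound = np.asarray(group_nr_bound)
    target_ids = np.asarray(target_ids)
    # Sort the particles once; particles sharing an id become contiguous runs.
    order = np.argsort(group_nr_bound, kind="stable")
    sorted_ids = group_nr_bound[order]
    # Binary-search the start and end of each target's run.
    starts = np.searchsorted(sorted_ids, target_ids, side="left")
    ends = np.searchsorted(sorted_ids, target_ids, side="right")
    masks = []
    for start, end in zip(starts, ends):
        mask = np.zeros(group_nr_bound.shape, dtype=bool)
        # The slice is empty when a target has no bound particles.
        mask[order[start:end]] = True
        masks.append(mask)
    return masks


# Tiny made-up example: ids 3 and 7 are present, 42 is not.
group_nr_bound = np.array([7, 3, 7, -1, 5, 3, 7, 9])
target_ids = np.array([3, 7, 42])
masks = bound_masks_by_sorting(group_nr_bound, target_ids)
```

The sort costs O(N log N) once per particle type and region, after which each target only touches its own particles, apart from allocating the output mask. If index arrays rather than boolean masks were acceptable downstream, order[start:end] could be used directly and the per-target allocation avoided.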

A good starting point would be to make some dummy data: an array of target IDs (say ~10 integers) and a big array of integers that contains those 10 integers many times each, plus other integers that are not among the targets. Then try to get out the corresponding masks as efficiently as possible (for example, see whether numpy.unique outperforms a loop over ==).
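The experiment suggested above could be set up along these lines. This is only a sketch with made-up sizes and ids; masks_loop and masks_unique are hypothetical helper names, and masks_unique is one guess at what a numpy.unique(..., return_inverse=True) based solution might look like (it assumes the target ids are unique and the particle array is non-empty). The timings are printed rather than asserted, since which variant wins will depend on array sizes and the number of targets.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Dummy data: 10 target ids, and 200,000 "particles" with ids drawn from
# 0..999, so each target appears many times among many non-target ids.
target_ids = np.arange(100, 110)
group_nr_bound = rng.integers(0, 1000, size=200_000)


def masks_loop(ids, targets):
    """Baseline: one full-array == comparison per target."""
    return [ids == t for t in targets]


def masks_unique(ids, targets):
    """Build all masks from a single np.unique call plus one scatter."""
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    # Find where each target would sit among the sorted unique ids.
    pos = np.minimum(np.searchsorted(unique_ids, targets), unique_ids.size - 1)
    found = unique_ids[pos] == targets
    # Map each unique id to its row in `targets`, or -1 if it is not a target.
    row_of_unique = np.full(unique_ids.size, -1, dtype=np.intp)
    row_of_unique[pos[found]] = np.nonzero(found)[0]
    row = row_of_unique[inverse]  # target row for every particle, or -1
    masks = np.zeros((len(targets), ids.size), dtype=bool)
    hit = np.nonzero(row >= 0)[0]
    masks[row[hit], hit] = True  # set all masks in one indexed assignment
    return masks


loop_result = np.array(masks_loop(group_nr_bound, target_ids))
unique_result = masks_unique(group_nr_bound, target_ids)

t_loop = timeit.timeit(lambda: masks_loop(group_nr_bound, target_ids), number=5)
t_unique = timeit.timeit(lambda: masks_unique(group_nr_bound, target_ids), number=5)
print(f"loop over ==: {t_loop:.4f} s, np.unique: {t_unique:.4f} s (5 runs each)")
```

Repeating the comparison while varying the number of targets and the array length should show where, if anywhere, the single-pass approach starts to pay off.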

This optimization only makes sense for the bound_only mask option, so we will need to consider whether and how to support the other modes, and should apply it only in bound_only mode.
