
Map function over detector frames #333

Draft · wants to merge 4 commits into master
Conversation

takluyver (Member)

This aims to make it easier to apply a function to each frame of multi-module detector data, like in this screenshot:

[screenshot: example notebook usage with Dask]

Azimuthal integration is one particular motivating use case.

Design:

  • The core idea is to batch frames together, so we submit fewer, larger tasks
    • Each task loads a chunk of data (by default ~1000 frames) and then runs the function on those frames sequentially
  • Ideally you can use any map method - local thread/process pools, Dask (as in the screenshot), clusterfutures...
  • If the per-frame function has parameter names like mask or cellId, the corresponding data will be loaded and passed in for each frame.
  • It returns a list (one result per frame) by default, but there's also an option to get an array back, which should be a bit more efficient than calling np.stack() on the list. A rough usage sketch follows below.
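Putting the design points together, here is a hypothetical usage sketch. The open_run/AGIPD1M calls are the existing EXtra-data API, but the map_frames name, its signature, and the proposal/run numbers are placeholders made up to illustrate the interface described above, not necessarily what this PR exposes:

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from extra_data import open_run
from extra_data.components import AGIPD1M

run = open_run(proposal=700000, run=1)   # placeholder proposal/run numbers
agipd = AGIPD1M(run)

def frame_mean(image, mask=None):
    # Because this function has a parameter named 'mask', the corresponding
    # data would be loaded and passed in alongside each frame.
    data = np.where(mask, np.nan, image) if mask is not None else image
    return np.nanmean(data)

with ProcessPoolExecutor() as pool:
    # Frames are batched into larger tasks behind the scenes; any map-like
    # callable should work here (pool.map, a Dask client, clusterfutures, ...).
    results = agipd.map_frames(frame_mean, map=pool.map)   # hypothetical method name
```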

Concerns & questions:

  • I needed some kludgy Dask-specific workarounds to get this working nicely with Dask, which was one of my main goals. In particular, Dask was spending an inordinately long time generating unique names for the tasks, until I overrode that with random names.
  • Azimuthal integration was the motivating use case, but this design is actually somewhat inefficient for it. If you construct the AzimuthalIntegrator outside the function, you have to send about 100 MB of data with each batch task (for AGIPD-1M: the 3D positions of each corner of ~1 million pixels). If you construct it inside the function, you redo that work for every frame (see the per-worker caching sketch below). 🤔
  • Possible extension: add a parameter so that with out_shared=True, workers write directly to a shared output array rather than serialising results to send back (a rough sketch of this idea is at the end).
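
One possible middle ground for the integrator problem (not something this PR implements): construct it lazily, once per worker process, so the geometry is neither pickled with every batch task nor rebuilt for every frame. A rough sketch, where the pyFAI calls are real but the geometry file name is a placeholder:

```python
import functools
import pyFAI

@functools.lru_cache(maxsize=1)
def get_integrator():
    # Built on first call in each worker process, then reused for all frames
    # that worker handles. The .poni file name is a placeholder.
    return pyFAI.load("agipd_geometry.poni")

def integrate_frame(image):
    ai = get_integrator()                  # cheap after the first call per process
    return ai.integrate1d(image, 1000)     # 1000 radial bins
```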

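For the out_shared idea, here is a rough standard-library sketch (not code from this PR) of workers writing results straight into a shared output array instead of pickling them back; the per-frame work is a dummy sum and the frames are random data so the example stays self-contained:

```python
import numpy as np
from multiprocessing import shared_memory
from concurrent.futures import ProcessPoolExecutor

def process_batch(shm_name, n_frames, start, count):
    # In the real design each task would load its own chunk of detector data;
    # random frames stand in for that here.
    frames = np.random.rand(count, 16, 128, 128)
    shm = shared_memory.SharedMemory(name=shm_name)
    out = np.ndarray((n_frames,), dtype=np.float64, buffer=shm.buf)
    for i in range(count):
        out[start + i] = frames[i].sum()   # stand-in for the real per-frame function
    del out          # release the buffer view before closing the shared memory
    shm.close()

if __name__ == '__main__':
    n_frames, batch = 100, 25
    shm = shared_memory.SharedMemory(create=True, size=n_frames * 8)
    results = np.ndarray((n_frames,), dtype=np.float64, buffer=shm.buf)
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(process_batch, shm.name, n_frames, s, batch)
                   for s in range(0, n_frames, batch)]
        for f in futures:
            f.result()
    print(results[:5])   # one result per frame, with no result arrays pickled back
    del results
    shm.close()
    shm.unlink()
```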