Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully parallelise batch_read when 1 or more of the symbols requested is recursively normalised #1968

Open
alexowens90 opened this issue Oct 31, 2024 · 0 comments
Labels
enhancement New feature or request refactor

Comments

@alexowens90
Copy link
Collaborator

alexowens90 commented Oct 31, 2024

Currently, the code flow when reading and batch reading pointlessly pivots back off the Python layer in the middle of the call, before calling batch_read_keys.
This is not too inefficient in the non-batch case, as batch_read_keys is parallelised. However, in batch_read, the calls to batch_read_keys are called sequentially for each symbol in the batch that is recursively normalized, and so this will be very inefficient if there are many such small symbols.

There is no conceptual need to pivot off Python mid-call, this could all be handled in the C++ layer with appropriate structures returned to Python.

@alexowens90 alexowens90 added enhancement New feature or request refactor labels Oct 31, 2024
alexowens90 added a commit that referenced this issue Nov 7, 2024
#### Reference Issues/PRs
Closes #1895 
Fixes man-group/arcticdb-man#171
Fixes #1936 
Fixes #1939 
Fixes #1940 

#### What does this implement or fix?
Schedules all work asynchronously in batch reads when processing is
involved, as well as when all symbols are being read directly.
Previously, symbols were processed sequentially, leading to idle CPUs
when processing lots of smaller symbols.

This works by making `read_frame_for_version` schedule work and return
futures, rather than actually performing the processing. This
implementation can then be used for all 4 combinations of
batch/non-batch and direct/with processing reads, significantly
simplifying the code and removing the now redundant `async_read_direct`
(the fact that there were two different implementations to achieve
effectively the same thing is what led to 2 of the bugs in the first
place).

Several bugs that were discovered during the implementation (flagged
above) have also been fixed.

Further work in this area covered in #1968
grusev pushed a commit that referenced this issue Nov 25, 2024
#### Reference Issues/PRs
Closes #1895 
Fixes man-group/arcticdb-man#171
Fixes #1936 
Fixes #1939 
Fixes #1940 

#### What does this implement or fix?
Schedules all work asynchronously in batch reads when processing is
involved, as well as when all symbols are being read directly.
Previously, symbols were processed sequentially, leading to idle CPUs
when processing lots of smaller symbols.

This works by making `read_frame_for_version` schedule work and return
futures, rather than actually performing the processing. This
implementation can then be used for all 4 combinations of
batch/non-batch and direct/with processing reads, significantly
simplifying the code and removing the now redundant `async_read_direct`
(the fact that there were two different implementations to achieve
effectively the same thing is what led to 2 of the bugs in the first
place).

Several bugs that were discovered during the implementation (flagged
above) have also been fixed.

Further work in this area covered in #1968
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refactor
Projects
None yet
Development

No branches or pull requests

2 participants