Meta issue: ongoing work as of early April 2023 #220

Closed
douglasdavis opened this issue Apr 7, 2023 · 0 comments

This is a meta issue summarizing the status of ongoing work, written
in preparation for my leave from April 11 to May 18.

I'll likely be editing this issue until Monday, April 10.

Here are summaries of the ongoing items, in no particular order of
priority:


Form re-hydration

When we perform the necessary columns (or "column projection")
optimization, we'd like to retain the complete form.

Relevant issues/PRs

Latest status

@jpivarski is iterating on an unproject_layout routine in #203 that
will do the heavy lifting. We'll need a function, for both internal
use and library author use, that applies this routine to arrays
instantiated at input nodes in a graph. One example of such an input
node is the __call__ method of the
dask_awkward.lib.io.parquet._FromParquetFileWiseFn class.


Handling missing implementations

There are still parts of the awkward API that need to be covered by
dask-awkward.

Relevant issues/PRs

Latest status

In recent weeks @lgray has been raising issues for missing
implementations (and including PRs for many!). Issue #214 asks for a
few more, and I created issue #215 to track a general need: there are
many cases in the code base where we explicitly support axis >= 1
without supporting negative axes. "Fixing" #215 (probably with some
kind of utility function to be used at the top of all dask-awkward
implementations) will help in a lot of places.
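
The kind of utility function mentioned above could look roughly like
the sketch below. The name normalize_axis and its signature are
hypothetical, not part of dask-awkward, and the sketch assumes the
array has a single well-defined depth (ndim). Awkward arrays whose
record branches have different depths would need more care than this.

```python
def normalize_axis(axis: int, ndim: int) -> int:
    """Map a possibly negative axis onto the range [0, ndim).

    Hypothetical helper: `ndim` is assumed to be the single, known
    depth of the array. Branching layouts are not handled here.
    """
    if not isinstance(axis, int):
        raise TypeError(f"axis must be an integer, got {type(axis).__name__}")
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis={axis} is out of range for ndim={ndim}")
    # Negative axes count from the innermost dimension.
    return axis + ndim if axis < 0 else axis


# An implementation that only supports axis >= 1 could call this first
# and then apply its existing check to the normalized value.
axis = normalize_axis(-1, ndim=3)
if axis < 1:
    raise NotImplementedError("axis=0 is not supported in this sketch")
```

Calling something like this at the top of each implementation would
let the existing axis >= 1 checks stay as they are while also
accepting the equivalent negative values.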


Greedy necessary columns

@lgray discovered a problem in typetracer reports where more columns
than necessary are reported as touched. @jpivarski and @agoose77 have
been digging into this and finding fixes.

Relevant issues/PRs

...

Latest status

There are no relevant issues or PRs in the dask-awkward repository, but
we should follow this closely and perhaps add some tests for
dak.necessary_columns that specifically target the calls originally
affected by the greedy touching bug.


DataFrame interop

There was a recent feature request to add awkward's to_dataframe to
dask-awkward.

Relevant issues/PRs

Latest status

@jpivarski found a bug in awkward where ak.to_numpy converts strings
to floats for empty arrays (scikit-hep/awkward#2364). A fix for this
is needed to generate a correct metadata-as-empty-dataframe object
when we instantiate a new dask.dataframe collection.

This also resurfaced discussion
(#208 (comment))
around the awkward-pandas work, where we would like to convert an
awkward array (or dask-awkward array) into a Pandas Series object (or
dask.dataframe Series object) backed by the AwkwardDtype extension.


JSON input with necessary columns optimization

The necessary columns optimization is currently used with
from_parquet and in uproot. We started to think about how to add
this optimization to reading JSON.

Relevant issues/PRs

Latest status

So far we've only explored how to roundtrip between awkward Forms and
JSON schema. Once we have the projected Form we should be able to
generate a schema that can be passed to the from_json nodes in a
task graph.
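
As a rough illustration of the Form-to-schema direction, here is a
minimal sketch that walks a dictionary shaped like the dict
representation of an awkward Form and emits a JSON schema fragment.
The function name form_to_json_schema, the primitive mapping, and the
choice to mark every field as required are assumptions made for
illustration. It covers only records, lists, strings, and numeric
leaves, and is not the approach taken in the actual work.

```python
import json

# Assumed mapping from awkward primitive names to JSON schema types.
_PRIMITIVE_TO_JSON = {
    "bool": "boolean",
    "int8": "integer", "int16": "integer", "int32": "integer", "int64": "integer",
    "uint8": "integer", "uint16": "integer", "uint32": "integer", "uint64": "integer",
    "float32": "number", "float64": "number",
}


def form_to_json_schema(form: dict) -> dict:
    """Convert a Form-like dict into a JSON schema fragment (hypothetical)."""
    cls = form["class"]
    # Strings are stored as lists of uint8 tagged with a parameter.
    if form.get("parameters", {}).get("__array__") == "string":
        return {"type": "string"}
    if cls == "RecordArray":
        fields = list(form["fields"])
        props = {
            name: form_to_json_schema(content)
            for name, content in zip(fields, form["contents"])
        }
        return {"type": "object", "properties": props, "required": fields}
    if cls in ("ListOffsetArray", "ListArray"):
        return {"type": "array", "items": form_to_json_schema(form["content"])}
    if cls == "NumpyArray":
        return {"type": _PRIMITIVE_TO_JSON[form["primitive"]]}
    raise NotImplementedError(f"unsupported form class: {cls}")


# A projected form that keeps only two columns: "x" and "name".
projected_form = {
    "class": "RecordArray",
    "fields": ["x", "name"],
    "contents": [
        {"class": "ListOffsetArray", "offsets": "i64",
         "content": {"class": "NumpyArray", "primitive": "float64"}},
        {"class": "ListOffsetArray", "offsets": "i64",
         "parameters": {"__array__": "string"},
         "content": {"class": "NumpyArray", "primitive": "uint8",
                     "parameters": {"__array__": "char"}}},
    ],
}

schema = form_to_json_schema(projected_form)
print(json.dumps(schema, indent=2))
```

The resulting dictionary describes a single record. How it would be
wrapped and handed to the from_json nodes is left open here.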


Slow optimization

We discovered that optimizing many repeated blockwise layers can be
slow. This is more of a general Dask issue than a dask-awkward
issue.

Relevant issues/PRs

Latest status

@martindurant started working on a general solution in #210. I
started working on a solution specific to getitem field access in a
branch on my fork (douglasdavis/dask-awkward@ef9a980).
