This is a meta issue to summarize the status of ongoing work in prep
for me being on leave from April 11 to May 18.
I'll likely be editing this issue until Monday, April 10.
In no particular order of priority, here are summaries of ongoing
things:
Form re-hydration
When we perform the necessary columns (or "column projection")
optimization, we'd like to retain the complete form.
Relevant issues/PRs
Latest status
@jpivarski is iterating on an unproject_layout routine in #203 that
will do the heavy lifting. We'll need a function, for both internal
use and library author use, that applies this to arrays that are
instantiated at input nodes in a graph, for example in the __call__
method of the dask_awkward.lib.io.parquet._FromParquetFileWiseFn
class.
Handling missing implementations
There are still parts of the awkward API that need to be covered by
dask-awkward.
Relevant issues/PRs
axis argument is equivalent to zero #215
Latest status
In recent weeks @lgray has been raising issues for missing
implementations (and including PRs for many!). Issue #214 asks for a
few more, and I created issue #215 as a general need. I know there are
many cases in the code base where we explicitly support axis >= 1
without supporting negative axes. "Fixing" #215 (probably with some
kind of utility function to be used at the top of all dask-awkward
implementations) will help in a lot of places.
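A minimal sketch of what such a utility could look like. The name
normalize_axis and its signature are hypothetical (nothing like this
is confirmed to exist in dask-awkward), and it assumes the array has a
single, known depth. Arrays whose record branches have different
depths would need more care.

```python
def normalize_axis(axis: int, depth: int) -> int:
    """Map a possibly negative axis onto the range [0, depth).

    Hypothetical helper: `depth` is assumed to be the (single, known)
    number of dimensions of the array being operated on.
    """
    if isinstance(axis, bool) or not isinstance(axis, int):
        raise TypeError(f"axis must be an integer, got {axis!r}")
    if not -depth <= axis < depth:
        raise ValueError(f"axis={axis} is out of range for depth={depth}")
    # Negative axes count from the innermost dimension.
    return axis + depth if axis < 0 else axis


# For an array of depth 3, axis=-1 refers to the innermost axis (2).
innermost = normalize_axis(-1, 3)
```

Calling something like this at the top of each implementation would
let the rest of the function keep its existing axis >= 1 logic.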
Greedy necessary columns
@lgray discovered a problem in typetracer reports where more columns
than necessary are reported as touched. @jpivarski and @agoose77 have
been digging into this and finding fixes.
Relevant issues/PRs
...
Latest status
There are no relevant issues or PRs in the dask-awkward repository, but
we should follow this closely and perhaps add some tests for
dak.necessary_columns that specifically target calls that were
originally hurt by the greedy touching bug.
DataFrame interop
There was a recent feature request to add awkward's to_dataframe to
dask-awkward.
Relevant issues/PRs
dask_awkward.to_dataframe #209
Latest status
@jpivarski found a bug in awkward where ak.to_numpy converts strings
to floats for empty arrays (scikit-hep/awkward#2364). A fix is needed
to generate a correct metadata-as-empty-dataframe object when we
instantiate a new dask.dataframe collection.
This also resurfaced discussion
(#208 (comment))
around the awkward-pandas work, where we would like to convert an
awkward array (or dask-awkward array) into a Pandas Series object (or
dask.dataframe Series object) backed by the AwkwardDtype extension.
JSON input with necessary columns optimization
The necessary columns optimization is currently used with from_parquet
and in uproot. We started to think about how to add this optimization
to reading JSON.
Relevant issues/PRs
Latest status
So far we've only explored how to roundtrip between awkward Forms and
JSON schema. Once we have the projected Form we should be able to
generate a schema that can be passed to the from_json nodes in a
task graph.
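To make the Form-to-schema direction concrete, here is a rough sketch
that walks a dict shaped like the dict representation of an awkward
Form and emits a JSON schema fragment. The function name
form_to_json_schema, the primitive mapping, and the small set of
handled node classes are all assumptions for illustration. This is not
the roundtrip code we explored, and I have not checked that from_json
accepts the result as-is.

```python
# Assumed mapping from a few awkward primitives to JSON schema types.
PRIMITIVE_TO_JSON = {
    "bool": "boolean",
    "int32": "integer",
    "int64": "integer",
    "float32": "number",
    "float64": "number",
}


def form_to_json_schema(form: dict) -> dict:
    """Convert a (simplified) Form dict into a JSON schema fragment."""
    cls = form["class"]
    params = form.get("parameters") or {}
    if cls == "NumpyArray":
        return {"type": PRIMITIVE_TO_JSON[form["primitive"]]}
    if cls in ("ListOffsetArray", "ListArray"):
        # Strings are stored as lists of chars tagged with a parameter.
        if params.get("__array__") == "string":
            return {"type": "string"}
        return {"type": "array", "items": form_to_json_schema(form["content"])}
    if cls == "RecordArray":
        properties = {
            name: form_to_json_schema(content)
            for name, content in zip(form["fields"], form["contents"])
        }
        return {"type": "object", "properties": properties}
    raise NotImplementedError(f"unhandled form class: {cls}")


# A projected form that keeps only the columns "x" and "name".
projected = {
    "class": "RecordArray",
    "fields": ["x", "name"],
    "contents": [
        {
            "class": "ListOffsetArray",
            "offsets": "i64",
            "content": {"class": "NumpyArray", "primitive": "float64"},
        },
        {
            "class": "ListOffsetArray",
            "offsets": "i64",
            "parameters": {"__array__": "string"},
            "content": {"class": "NumpyArray", "primitive": "uint8"},
        },
    ],
}
schema = form_to_json_schema(projected)
```

Because the schema is built only from the projected Form, fields that
were projected away never appear in it, which is the point of the
optimization.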
Slow optimization
It was discovered that optimizing many repeated blockwise layers can
be slow. This is more of a general Dask issue, not necessarily a
dask-awkward issue.
Relevant issues/PRs
Latest status
@martindurant started working on a general solution in #210. I
started working on a solution specific to getitem field access in a
branch on my fork (douglasdavis/dask-awkward@ef9a980).