ak.from_json with a schema= argument raises exception in ak_from_buffers due to size difference #2709
A key aspect of this bug is that only 360 of the JSON objects have a non-null pull_request. To show that this is the issue, see what happens when we look only at the JSON objects with this field:

>>> import fsspec, awkward as ak
>>> schema = {
... "title": "untitled",
... "description": "Auto generated by dask-awkward",
... "type": "object",
... "properties": {
... "payload": {
... "type": "object",
... "properties": {
... "pull_request": {
... "type": ["object", "null"],
... "properties": {"merged_at": {"type": ["string", "null"]}},
... }
... },
... }
... },
... }
>>> with fsspec.open(
... "https://data.gharchive.org/2015-01-01-10.json.gz", compression="infer", mode="rt"
... ) as f:
... subset = "".join([x for x in list(f) if "\"merged_at\":" in x])
...
>>> array = ak.from_json(subset, line_delimited=True, schema=schema)
>>> array.show(type=True)
type: 360 * {
payload: {
pull_request: ?{
merged_at: ?string
}
}
}
[{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:00:32Z'}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:01:07Z'}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:01:08Z'}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:01:08Z'}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:01:11Z'}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:01:23Z'}}},
...,
{payload: {pull_request: {merged_at: '2015-01-01T10:58:31Z'}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:59:00Z'}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: None}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:59:44Z'}}},
{payload: {pull_request: {merged_at: '2015-01-01T10:59:55Z'}}}]
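The substring filter used above can be cross-checked without fsspec or awkward. The sketch below applies the same '"merged_at":' filter to a few inline sample lines (hypothetical stand-ins for the gharchive download, which yields 360 matching lines); it only illustrates the filtering step, not the bug itself.

```python
import json

# Inline sample lines standing in for the gharchive download (an
# assumption for illustration; the real file yields 360 matching lines).
sample_lines = [
    '{"payload": {}}',
    '{"payload": {"pull_request": {"merged_at": null}}}',
    '{"payload": {"pull_request": {"merged_at": "2015-01-01T10:00:32Z"}}}',
]

# Same substring filter as the fsspec snippet above.
matching = [x for x in sample_lines if '"merged_at":' in x]
print(len(matching))  # 2 of the 3 sample lines contain the field
```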
Version of Awkward Array

2.4.2

Description and code to reproduce

Working on usage of dask-contrib/dask-awkward#94, I've come across an issue in ak.from_json which is raising from ak_from_buffers. Reproducer:

(The schema is meant to produce an array of records with a top-level field "payload", which contains a subfield "pull_request" that is either null or has a subfield "merged_at" that is either a string or null.)
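To make the schema's intent concrete, the sketch below walks three hypothetical line-delimited records of the shapes the schema is meant to admit (these examples are not taken from the gharchive data, and plain stdlib json is used instead of ak.from_json):

```python
import json

# Three record shapes the schema is meant to admit (hypothetical
# minimal examples, not taken from the gharchive data):
#   1. pull_request absent (assumed to fill as null under the option type)
#   2. pull_request explicitly null
#   3. pull_request present, with merged_at a string (or null)
lines = [
    '{"payload": {}}',
    '{"payload": {"pull_request": null}}',
    '{"payload": {"pull_request": {"merged_at": "2015-01-01T10:00:32Z"}}}',
]

for line in lines:
    obj = json.loads(line)
    pr = obj["payload"].get("pull_request")
    print(pr.get("merged_at") if pr is not None else None)
```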
More info:

My assumption is that because the field sandwiched between payload and merged_at, namely pull_request, is of type object or null, and only 360 of them are not null, something is going wrong with building the array of this nested nullable type. (I double-checked the numbers via):

More info on how I came across the issue via dask-awkward:
The schema can be created manually with dask-awkward code:

This is the same schema that gets created automatically by the column optimizer if we try to use field access on a dak collection and run compute: