Data Explorer breaks when dataframe cell has complex data in it #36

Open
jruales opened this issue Apr 7, 2021 · 8 comments · Fixed by #65
Labels
bug Something isn't working released This issue/pull request has been released.

Comments

@jruales

jruales commented Apr 7, 2021

Repro: run the following in a cell

import pandas as pd
pd.set_option("display.html.table_schema", True)

class Cmd:
    def __init__(self, name, params):
        self.name = name
        self.params = params
    def __repr__(self):
        return f'Cmd(name={self.name}, params={self.params})'

cell_payload = [
    Cmd(name='foo', params={'bar', 'baz'}),
    Cmd(name='foo', params={'bar', 'baz'})
]
pd.DataFrame({'param_session': [cell_payload]})

The following error then appears, with a link to an error page that reports: Objects are not valid as a React child (found: object with keys {name}). If you meant to render a collection of children, use an array instead.
[Image: screenshot of the error]

For reference, this is how Pandas would normally render the cell, when setting pd.set_option("display.html.table_schema", False)
[Image: screenshot of the default Pandas rendering]

Finally, here's what the output looks like in the ipynb file when the error occurs

            "application/vnd.dataresource+json": {
              "schema": {
                "fields": [
                  {
                    "name": "index",
                    "type": "integer"
                  },
                  {
                    "name": "param_session",
                    "type": "string"
                  }
                ],
                "primaryKey": [
                  "index"
                ],
                "pandas_version": "0.20.0"
              },
              "data": [
                {
                  "index": 0,
                  "param_session": [
                    {
                      "name": "foo"
                    },
                    {
                      "name": "foo"
                    }
                  ]
                }
              ]
            }
          },
@captainsafia
Member

@emeeks Is this ringing any bells for you?

@jruales
Author

jruales commented Apr 19, 2021

From what I understand, the values inside "data" in the output are inserted directly as children of the React component that renders each cell, and the problem arises when a value is a dictionary (or a list of dictionaries) rather than a primitive.

So I'm thinking that currently, the

                  [
                    {
                      "name": "foo"
                    },
                    {
                      "name": "foo"
                    }
                  ]

is just being inlined in React, but it should probably be converted to a string before inlining.

@captainsafia
Member

@jruales Are you able to repro this with the raw data explorer component? I wonder if it has something to do with the way we wrap it in the output.

cc: @willingc

@willingc willingc added the bug Something isn't working label Jun 4, 2021
@hydrosquall
Member

I was able to reproduce @jruales's issue outside of Jupyter. The issue persists regardless of whether the schema type is set to object or array.

Demo: https://codesandbox.io/s/pedantic-hodgkin-78o80?file=/src/App.js:216-221

@emeeks what do you think about changing data-explorer to accept a column of type object and stringify its cells internally, vs asking callers of data-explorer to transform object cells into strings before passing them in? We have at least 2 options:

  1. If the column is actually an array or object type per the Frictionless data spec, call JSON.stringify on each cell to avoid this React error when displaying these cells in tables. This makes the object values displayable in the table, but they won't be used in any of the actual visualizations. Somewhere in the Python binding code, the field type should be changed from string to object or array.
  2. Data explorer drops any column whose Frictionless spec type it doesn't recognize (e.g. it keeps only date/boolean/number/string columns).
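
For comparison, the caller-side alternative mentioned above (transforming object cells into strings before they reach data-explorer) could look roughly like the following. This is only a sketch: `stringify_complex_cells` is a hypothetical helper, not part of data-explorer or Pandas, and it assumes the payload has the same shape as the application/vnd.dataresource+json output shown earlier in this issue.

```python
import copy
import json


def stringify_complex_cells(payload):
    """Return a copy of a data-resource payload with nested cells as JSON strings.

    Hypothetical helper for illustration; not part of any library.
    """
    result = copy.deepcopy(payload)
    for row in result.get("data", []):
        for key, value in row.items():
            # Lists and dicts cannot be rendered as plain cell text,
            # so serialize them to a JSON string first.
            if isinstance(value, (list, dict)):
                row[key] = json.dumps(value)
    return result


# Payload shaped like the notebook output from the repro above.
payload = {
    "schema": {
        "fields": [
            {"name": "index", "type": "integer"},
            {"name": "param_session", "type": "string"},
        ],
        "primaryKey": ["index"],
    },
    "data": [{"index": 0, "param_session": [{"name": "foo"}, {"name": "foo"}]}],
}

safe_payload = stringify_complex_cells(payload)
print(safe_payload["data"][0]["param_session"])
```

The original payload is left untouched, so the same data could still be passed elsewhere in its nested form.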

@github-actions

github-actions bot commented Jul 1, 2021

🚀 Issue was released in v8.2.11 🚀

@github-actions github-actions bot added the released This issue/pull request has been released. label Jul 1, 2021
@hydrosquall
Member

Reopening: #65 fixes the issue for JavaScript consumers when the schema type for these complex columns is set to object instead of string, but a separate fix (maybe a separate issue) is needed to get the pandas code to set the column type correctly.

@hydrosquall hydrosquall reopened this Jul 1, 2021
@hydrosquall
Member

hydrosquall commented Jul 2, 2021

I tried to reproduce this issue in my local jupyterlab, but found it wasn't working with the latest version.

Image 2021-07-02 at 1 36 14 PM

I think the data-explorer package (which hasn't been updated in a year) is getting the data in from here, but I'm not sure how to track where the frictionless data spec is generated (perhaps it is coming from something in the Python code). Once we do, we'll want to find a way to get it to set the column type properly (Pandas has it correctly set as an object based on the screencap below)

Image 2021-07-02 at 1 45 48 PM

@jruales did you run into this issue while using Jupyter Lab or Jupyter Notebook?

@hydrosquall
Member

hydrosquall commented Jul 5, 2021

I decided to have a look at the Pandas documentation, and found the root of the issue.

https://pandas.pydata.org/docs/user_guide/io.html#table-schema

The column type for a Pandas object column is set to a Frictionless spec string rather than an Object.

https://sourcegraph.com/github.com/pandas-dev/pandas@dad3e7fc3a2a75ba5f330899be0639cff0f73f6c/-/blob/pandas/io/json/_table_schema.py?L62-89

I think we actually want this to be returning a Frictionless object instead.

https://sourcegraph.com/github.com/pandas-dev/pandas@dad3e7fc3a2a75ba5f330899be0639cff0f73f6c/-/blob/pandas/core/dtypes/common.py?L532-571

During serialization and deserialization on the way to Jupyter, the string contents are turned back into a JSON object, so the value is no longer a string by the time it reaches the data-explorer. There also isn't any metadata that could be used to differentiate what was originally a string from a list of Python objects. Related reading about strings and objects

import pandas as pd

df = pd.DataFrame(
    {
        "A": ["a", "b", "c"],
        "B": [{"a": 1}, {"b": 1}, {"c": 1}],
    }
)
col_types = df.dtypes
# strings and object columns are treated the same way in Pandas
col_types.iloc[0] == col_types.iloc[1]  # this returns True :(

This issue was brought up when Table Schema was implemented in Pandas, but object ultimately didn't get supported as a special data type.
pandas-dev/pandas#14904 (comment)

There might be a "sniffing heuristic" that we could apply at the Javascript or Python level, where if a column is labeled as a string at the Frictionless level, but actually contains JSON objects in each single cell, we could treat the column as Frictionless spec object instead.
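
A rough sketch of what such a heuristic might look like on the Python side, under the assumption that it runs on the data-resource payload before it is handed to data-explorer. `sniff_field_types` is a hypothetical name, and the rule used here (relabel a string field only when every non-null cell is a dict or every non-null cell is a list) is just one possible choice:

```python
def sniff_field_types(payload):
    """Return schema fields with mislabeled "string" fields relabeled.

    Hypothetical helper for illustration. A field declared as "string" is
    relabeled "object" if all of its non-null cells are dicts, or "array"
    if all of its non-null cells are lists. Other fields are unchanged.
    """
    rows = payload.get("data", [])
    fields = []
    for field in payload["schema"]["fields"]:
        field = dict(field)  # copy so the input schema is not mutated
        if field.get("type") == "string":
            values = [row.get(field["name"]) for row in rows]
            values = [v for v in values if v is not None]
            if values and all(isinstance(v, dict) for v in values):
                field["type"] = "object"
            elif values and all(isinstance(v, list) for v in values):
                field["type"] = "array"
        fields.append(field)
    return fields


# Payload shaped like the notebook output from the original report,
# plus a genuine string column for contrast.
payload = {
    "schema": {
        "fields": [
            {"name": "index", "type": "integer"},
            {"name": "label", "type": "string"},
            {"name": "param_session", "type": "string"},
        ]
    },
    "data": [
        {
            "index": 0,
            "label": "first",
            "param_session": [{"name": "foo"}, {"name": "foo"}],
        }
    ],
}

sniffed = {f["name"]: f["type"] for f in sniff_field_types(payload)}
print(sniffed)
```

Requiring every non-null cell to match keeps the check conservative: a column that mixes plain strings with nested values stays labeled as string.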
