Index error with Categorify on transform step for columns with 100% NaNs #1865

lecardozo · 2023-10-02T14:57:53Z

I was running a workflow.transform(sampled_dataset) step on a sample of my inference dataset and received the following error:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 510, in transform
    encoded = _encode(
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 1707, in _encode
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/merlin/dag/executors.py", line 237, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/databricks/python/lib/python3.8/site-packages/merlin/core/dispatch.py", line 69, in inner2
    return func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 534, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column my_categorical_column

I noticed this happens when the dataset to be transformed has a categorical column (my_categorical_column) that is 100% NaN. It looks like the failure is triggered at the line below 👇, where dropna() is followed by iloc[0]: once every value has been dropped the Series is empty, so iloc[0] raises the IndexError.

NVTabular/nvtabular/ops/categorify.py, line 1707 in ee21af0:

if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
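To make the failure mode concrete, here is a minimal sketch that reproduces the same IndexError with plain pandas, outside NVTabular, and shows one possible guard. The helper name first_non_null_is_list is hypothetical and only for illustration; this is not necessarily how the linked PR fixes it, and treating an all-NaN column as "not list-like" is an assumption on my part.

```python
import numpy as np
import pandas as pd


def first_non_null_is_list(series: pd.Series) -> bool:
    """Hypothetical helper: report whether the first non-null value is list-like.

    Returns False when the column has no non-null values at all (an assumed
    default), instead of raising IndexError.
    """
    non_null = series.dropna()
    if non_null.empty:
        # Nothing left to inspect after dropping NaNs.
        return False
    return isinstance(non_null.iloc[0], (np.ndarray, list))


# A column that is 100% NaN, like my_categorical_column in the sampled dataset.
all_nan = pd.Series([np.nan, np.nan, np.nan], name="my_categorical_column")

# The unguarded pattern from the traceback: dropna() leaves an empty Series,
# so positional access with iloc[0] is out of bounds.
try:
    all_nan.dropna().iloc[0]
except IndexError as exc:
    print(f"unguarded check fails: {exc}")

# The guarded version handles the all-NaN column and still detects list columns.
print(first_non_null_is_list(all_nan))
print(first_non_null_is_list(pd.Series([None, [1, 2], [3]])))
```

With a guard like this the all-NaN column would simply go down the scalar encoding path; whether that is the right behavior for Categorify is exactly the open question here.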

It's not a huge blocker for me right now, as this mostly happens on dataset samples, but I'm wondering whether that behavior is expected. Any thoughts? 😃


lecardozo linked a pull request on Oct 18, 2023 that will close this issue: fix: IndexError with columns full of NaNs #1869 (Open)
