Slide ETL fails writing to parquet file #407

Open
darinmoore opened this issue Nov 2, 2023 · 0 comments

When calling the slide_etl CLI function as a Python function from src/luna/pathology/cli/slide_etl.py, I am unable to write the ETL dataframe to a parquet file. The data appears to be generated correctly, but the write step itself raises an error. As a workaround, I was able to write the dataframe to a .csv file instead.

Code snippet:

from luna.pathology.cli.slide_etl import cli as slide_etl
from dask.distributed import Client

ETL_OUTPUT_PATH = "./slide_etls/"

def create_slide_etl(path, project):
    with Client() as client:
        slide_etl(
            slide_urlpath=path,
            output_urlpath=ETL_OUTPUT_PATH,
            project_name=project,
            comment="Automated slide etl",
            no_copy=True,
        )

if __name__ == "__main__":
    create_slide_etl("/gpfs/mskmind_ess/pathology_images/spectrum_hnes", "SPECTRUM")

Error traceback:

2023-11-02 13:33:48.758 | INFO     | luna.pathology.cli.slide_etl:cli:95 -            id project_name              comment  slide_size                                  uuid  ... properties.aperio.Parmset  properties.aperio.Filtered  properties.aperio.Gamma  properties.aperio.Rack  properties.aperio.Slide
0     1054708     SPECTRUM  Automated slide etl   150163869  1b8e0567-4fab-3297-b1c8-7ab9737b8448  ...                       NaN                         NaN                      NaN                     NaN                      NaN
1     1054710     SPECTRUM  Automated slide etl   165100695  02d61566-1a26-307a-a079-ac96d7973afc  ...                       NaN                         NaN                      NaN                     NaN                      NaN
2     1148882     SPECTRUM  Automated slide etl   254362059  1493d7fe-d8ae-3ba9-8a89-7b1e9d082fec  ...                       NaN                         NaN                      NaN                     NaN                      NaN
3     1448993     SPECTRUM  Automated slide etl   203616883  010f15aa-a8e5-3e5b-907e-be5d3480aba3  ...                       NaN                         NaN                      NaN                     NaN                      NaN
4     1465759     SPECTRUM  Automated slide etl    88301569  e25c7caf-9ced-3cae-9587-70cfc436b2bf  ...                       NaN                         NaN                      NaN                     NaN                      NaN
...       ...          ...                  ...         ...                                   ...  ...                       ...                         ...                      ...                     ...                      ...
1297   936675     SPECTRUM  Automated slide etl   501182037  53780d71-5a8e-3741-8d62-9491bb3ffc9f  ...                       NaN                         NaN                      NaN                     NaN                      NaN
1298   955274     SPECTRUM  Automated slide etl   566795321  ecb9deac-f076-37eb-b038-cdec4df4823c  ...                       NaN                         NaN                      NaN                     NaN                      NaN
1299   955292     SPECTRUM  Automated slide etl   486125355  8767d44e-cd48-3340-b7c4-f1e40c975e59  ...                       NaN                         NaN                      NaN                     NaN                      NaN
1300   957788     SPECTRUM  Automated slide etl   479193387  6b871890-3a2d-3d79-afaa-02ce7423a2b9  ...                       NaN                         NaN                      NaN                     NaN                      NaN
1301   957796     SPECTRUM  Automated slide etl   361443771  cbf3b996-69a7-3120-bbc2-6f42f4f027e5  ...                       NaN                         NaN                      NaN                     NaN                      NaN

[1302 rows x 76 columns]
2023-11-02 13:33:48.838 | INFO     | luna.pathology.cli.slide_etl:cli:107 - Writing to parquet file
Traceback (most recent call last):
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/slide_inventory/create_slide_etl.py", line 17, in <module>
    create_slide_etl("/gpfs/mskmind_ess/pathology_images/spectrum_hnes", "SPECTRUM")
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/slide_inventory/create_slide_etl.py", line 8, in create_slide_etl
    slide_etl(
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/src/luna/common/utils.py", line 143, in wrapper
    result = func(*args, **kwargs)
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/src/luna/pathology/cli/slide_etl.py", line 108, in cli
    df.to_parquet(of)
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pandas/core/frame.py", line 2976, in to_parquet
    return to_parquet(
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pandas/io/parquet.py", line 430, in to_parquet
    impl.write(
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pandas/io/parquet.py", line 174, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3475, in pyarrow.lib.Table.from_pandas
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 611, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 611, in <listcomp>
    arrays = [convert_column(c, f)
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "/gpfs/mskmind_emc/data_user/shared_data_folder/moored2/luna/.venv/luna/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'int' object", 'Conversion failed for column properties.aperio.DSR ID with type object')
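
The last line of the traceback says that pyarrow was converting the object-dtype column `properties.aperio.DSR ID` as strings and hit an `int`, which suggests the column holds a mix of Python types. One possible workaround, sketched below, is to cast such mixed object columns to strings before calling `to_parquet`. This is only a sketch under assumptions: `stringify_mixed_object_columns` is a hypothetical helper, not part of luna, and the sample values are made up to mimic the failure rather than taken from the real slides.

```python
import pandas as pd


def stringify_mixed_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where object columns holding several Python types are cast to str.

    Hypothetical helper for illustration; assumes the cells are scalars.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue
        # Collect the Python types of the non-null values in this column.
        kinds = {type(v) for v in col.dropna()}
        if len(kinds) > 1:
            # Cast non-null values to str and leave missing values missing.
            out[name] = col.map(lambda v: v if pd.isna(v) else str(v))
    return out


# Made-up miniature of the failing column: a slide property mixing str and int.
frame = pd.DataFrame(
    {
        "id": [1054708, 1054710, 1148882],
        "properties.aperio.DSR ID": ["msk-scanner", 12, None],
    }
)
clean = stringify_mixed_object_columns(frame)
```

Calling `clean.to_parquet(...)` on the result should then avoid this particular conversion error, at the cost of storing the numeric values in that column as text. Whether that is acceptable for downstream consumers of the slide table is a separate decision.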