pyarrow.lib.ArrowInvalid: Failed casting from large_string to string: input array too large #48

Closed
rvandewater opened this issue Dec 13, 2024 · 7 comments
Labels: bug, Needs Clarification, OMOP ETL, priority:high

Comments

@rvandewater
Contributor

I am getting this error after my adjustments at https://github.com/rvandewater/meds_etl:

Generating metadata from OMOP `concept` table: 100%|██████████| 1/1 [00:27<00:00, 27.98s/it]
Generating metadata from OMOP `concept_relationship` table: 100%|██████████| 1/1 [00:02<00:00,  2.53s/it]
Extracting dataset metadata: 100%|██████████| 1/1 [00:00<00:00, 49.00it/s]
Decompressing OMOP tables, mapping to MEDS Unsorted format, writing to disk...
0it [00:00, ?it/s]
100%|██████████| 8/8 [03:00<00:00, 22.52s/it]
Finished converting dataset to MEDS Unsorted.
Converting from MEDS Unsorted to MEDS...
100%|██████████| 12/12 [00:40<00:00,  3.37s/it]
Collating source table data, shard by shard, to create subject timelines...
(Gathering events into timelines)
  0%|          | 0/1 [00:27<?, ?it/s]
Traceback (most recent call last):
  File "/hpc/users/vander09/.conda/envs/meds_etl_033/bin/meds_etl_omop", line 8, in <module>
    sys.exit(main())
  File "/hpc/users/vander09/.conda/envs/meds_etl_033/lib/python3.10/site-packages/meds_etl/omop.py", line 798, in main
    meds_etl.unsorted.sort(
  File "/hpc/users/vander09/.conda/envs/meds_etl_033/lib/python3.10/site-packages/meds_etl/unsorted.py", line 303, in sort
    sort_polars(source_unsorted_path, target_meds_path, num_shards, num_proc)
  File "/hpc/users/vander09/.conda/envs/meds_etl_033/lib/python3.10/site-packages/meds_etl/unsorted.py", line 256, in sort_polars
    casted = converted.cast(desired_schema)
  File "pyarrow/table.pxi", line 4683, in pyarrow.lib.Table.cast
  File "pyarrow/table.pxi", line 593, in pyarrow.lib.ChunkedArray.cast
  File "/hpc/users/vander09/.conda/envs/meds_etl_033/lib/python3.10/site-packages/pyarrow/compute.py", line 405, in cast
    return call_function("cast", [arr], options, memory_pool)
  File "pyarrow/_compute.pyx", line 598, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed casting from large_string to string: input array too large
@rvandewater
Contributor Author

The problem was that I accidentally tried to create a 1-shard dataset. Perhaps we should put a warning somewhere to prevent it.
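
For context, a likely explanation (an assumption, not confirmed in this thread): Arrow's `string` type stores 32-bit offsets, so a single array can hold at most 2**31 - 1 bytes of string data, while `large_string` uses 64-bit offsets. With one shard, all values can end up in one array that exceeds that limit, and the cast to `string` fails. The sketch below is plain Python rather than the pyarrow API, and the helper `plan_chunks` is hypothetical; it only illustrates how values could be grouped so that each group stays under the limit.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit-offset `string` array can store


def plan_chunks(value_byte_lengths, limit=INT32_MAX):
    """Group consecutive values into chunks whose total byte size fits `limit`.

    `value_byte_lengths` is a list with the UTF-8 byte length of each value.
    Returns a list of (start, stop) index ranges, stop exclusive.
    Raises ValueError if a single value exceeds `limit`, because no
    chunking can make that value fit.
    """
    ranges = []
    start = 0
    running = 0  # bytes accumulated in the current chunk
    for i, n in enumerate(value_byte_lengths):
        if n > limit:
            raise ValueError(f"value {i} has {n} bytes, more than {limit}")
        if running + n > limit:
            # Close the current chunk before it would overflow.
            ranges.append((start, i))
            start = i
            running = 0
        running += n
    if start < len(value_byte_lengths):
        ranges.append((start, len(value_byte_lengths)))
    return ranges
```

With a small illustrative limit, `plan_chunks([4, 4, 4], limit=8)` splits the three values into two ranges. Using more shards has a similar effect, since each shard then holds less string data.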

@EthanSteinberg
Collaborator

@rvandewater Can you provide a reproducible example using a dataset I could get access to? This is a weird error message that should get fixed.

@rvandewater
Contributor Author

I think that if you tried to map the entirety of MIMIC-IV to one shard, you would get the same error. I will see if I have the time to try.

@EthanSteinberg
Collaborator

@rvandewater I'll try that. That's a good suggestion.

@mmcdermott

@EthanSteinberg and @rvandewater, is this fixed, or still an outstanding issue?

@mmcdermott added the bug, priority:high, OMOP ETL, and Needs Clarification labels on Jan 7, 2025
@rvandewater
Contributor Author

IMO it is not that important a bug, as I was trying to map everything to one shard (which is highly inadvisable). Adding warnings about doing this, dependent on the dataset, could be helpful.
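
A dataset-dependent warning of this kind could look roughly like the sketch below. This is not code from meds_etl: the function names `min_shards_for` and `check_num_shards`, and the idea of using the total string bytes as the input, are assumptions made for illustration. It also assumes data is spread evenly across shards, which real data may not be.

```python
import warnings

INT32_MAX = 2**31 - 1  # byte limit for one 32-bit-offset string array


def min_shards_for(total_string_bytes, max_bytes_per_shard=INT32_MAX):
    """Smallest shard count that keeps each shard under the byte limit,
    assuming an even split (hypothetical helper)."""
    # Integer ceiling division avoids float rounding on large sizes.
    needed = -(-total_string_bytes // max_bytes_per_shard)
    return max(1, needed)


def check_num_shards(num_shards, total_string_bytes,
                     max_bytes_per_shard=INT32_MAX):
    """Warn and return False if `num_shards` looks too small for the data."""
    needed = min_shards_for(total_string_bytes, max_bytes_per_shard)
    if num_shards < needed:
        warnings.warn(
            f"num_shards={num_shards} may be too small for "
            f"{total_string_bytes} bytes of string data; "
            f"consider at least {needed} shards."
        )
        return False
    return True
```

For example, `check_num_shards(1, 5 * 10**9)` would emit a warning and suggest at least 3 shards, while `check_num_shards(8, 5 * 10**9)` would pass silently.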

@mmcdermott

Ok, then I'm going to close this issue for now, and we can re-open it or create a new one more targeted to documentation as needed pending the resolution plan for #51

@mmcdermott closed this as not planned on Jan 8, 2025
3 participants