I'll soon be reworking VisionEval's export facilities to make it easy to dump data out in a variety of file system formats (including any SQL database supported by R's DBI package, including ODBC connections, and direct-to-Excel where the table size is compatible).
Among the formats I've been investigating is Apache Arrow. Beyond the "feather" and "parquet" export formats, Arrow also provides CSV and JSON reading/writing, and its native formats support hierarchical tables and larger-than-memory datasets.
So I'm considering implementing the VisionEval Datastore as an Arrow structure (and manipulating it that way in memory). There are many points of contact: column-wise data structure, hierarchy, working on subsets at a time. That would mean defining a new DatastoreType (and the corresponding access functions).
So here's the discussion point: is Arrow something that would appeal to the user base? It would certainly make data transfer to Python, cloud analysis frameworks (e.g., Hadoop), and other platforms much simpler (though we can always walk the existing data structures over to Arrow with little effort). And I like the idea of implementing the VisionEval Datastore in a format that can simply be opened in another system using Arrow, without going through any time- or memory-intensive extraction/conversion operation. That remains one benefit of using the HDF5 format, but Arrow's virtue is that it is a higher-level approach that offers "native" interfaces.
Drop your thoughts into this discussion. I'll post updates as the I/O work proceeds.