I'll soon be reworking VisionEval's export facilities to make it easy to dump data out in a variety of file system formats (including any SQL database supported by R's DBI package, including ODBC connections, and direct-to-Excel where the table size is compatible).
Among the formats I've been investigating is Apache Arrow. Beyond the "feather" and "parquet" export formats, Arrow also provides CSV and JSON reading/writing, and its native formats support hierarchical tables and larger-than-memory datasets.
So I'm considering implementing the VisionEval Datastore as an Arrow structure (and manipulating it that way in memory). There are many points of contact: column-wise data structure, hierarchy, working on subsets at a time. That would mean defining a new DatastoreType (and the corresponding access functions).
So here's the discussion point: is Arrow something that would appeal to the user base? It would certainly make data transfer to Python, cloud analysis frameworks (e.g., Hadoop), and other platforms much simpler (though we can always walk the existing data structures over to Arrow with little effort). And I like the idea of implementing the VisionEval Datastore in a format that can simply be opened in another system using Arrow, without going through any time- or memory-intensive extraction/conversion operation. That remains one benefit of using the HDF5 format, but Arrow's virtue is that it is a higher-level approach that offers "native" interfaces.
Drop your thoughts into this discussion. I'll post updates as the I/O work proceeds.