Usecases question #48
Hi Alexander, thanks for your interest 🙂. By the sounds of it, DataToolkit should be able to help with some of the use cases you describe, and I'd be very happy to explore how it could do so better :)
Without more details, it sounds like it would be hard to make much of an improvement over just calling your existing read function.
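For concreteness, a minimal sketch of that setup (the names `load_table` and `table.csv` are placeholders, not from the original discussion):

```julia
using CSV, DataFrames

# Hypothetical version of the current approach: a small CSV shipped in the
# package's data/ folder, read into a table with some minor cleanup.
function load_table()
    path = joinpath(pkgdir(@__MODULE__), "data", "table.csv")
    df = CSV.read(path, DataFrame)
    dropmissing!(df)  # example cleanup step
    return df
end
```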
This sounds rather applicable. I'd recommend holding off till the 0.10 release, though. That said, if you've got 1000s of entries, asking for them all to be put in a `Data.toml` seems impractical. Something else you could consider doing is adding an API for adding a particular catalog to a user's project `Data.toml`.
How large is "large"? I've only tested with more modest amounts of data so far.

As for putting this collection in a `Data.toml`: I think the main benefit you'd get from this is data existence and integrity checks.

The per-machine config makes things a little trickier. Since the "data store" is content-addressed, if you have the checksum for a dataset and it exists in the store, it doesn't need to be told where to find it. That said, currently no work has been done for per-machine config files, beyond cache settings for the "data store".
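To illustrate what "content-addressed" buys you here, a minimal concept sketch (this is not DataToolkit's actual store code; the names are made up):

```julia
using SHA  # Julia standard library

# Concept sketch: a content-addressed store maps checksum => path, so any
# project that knows a file's checksum can locate it without being told where
# it lives, and identical content is only ever stored once.
const STORE = Dict{String,String}()

function store_put!(path::AbstractString)
    key = bytes2hex(open(sha256, path))
    get!(STORE, key, path)  # no-op if this content is already stored
    return key
end

store_get(key::AbstractString) = get(STORE, key, nothing)
```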
Thanks for the detailed response!
It's generally fine, but could use features like accessing older versions of the same dataset. Is that in scope for this package?
Oh, there are definitely thousands of entries and more! New ones are regularly being added, too.
Interesting approach... Does it work nicely with Pluto, and temp envs more generally? For context, the current usage looks like:

```julia
vizcat = VizierCatalog("J/ApJ/923/67/table2")  # metadata only, no actual data loaded
tbl = table(vizcat)  # downloads and reads the actual table
```

How do you think it would look in this scenario from the user side? I wonder what exactly an "API for adding a particular catalog to a user's project Data.toml" should entail...
I'm thinking a few × 10^5 files, some hundreds of GB. I don't think it's reasonable to put each file individually into the TOML list.

What do you mean by "exists in the store"? The data is just available on each machine, either locally at some path or on a remotely-mounted disk. And it should surely stay that way, since some other tools also access it. I'm just looking for a nice approach to access it from Julia without hardcoding the path in the code, and potentially with some consistency checks.
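A hand-rolled sketch of what that could look like, just to make the ask concrete (the `MYDATASET_PATH` variable and config file location are invented for illustration):

```julia
# Hypothetical per-machine resolution for MyDataset(): check an environment
# variable first, then a per-user config file, instead of hardcoding the path.
function dataset_root()
    haskey(ENV, "MYDATASET_PATH") && return ENV["MYDATASET_PATH"]
    cfg = joinpath(homedir(), ".config", "mydataset", "path.txt")
    isfile(cfg) && return String(strip(read(cfg, String)))
    error("dataset location not configured on this machine")
end

MyDataset() = MyDataset(dataset_root())
```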
Well, there is the
My gut feeling is it should be fine, but I do recall some issues with Pluto specifically when relying on packages made available in the notebook: it's to do with the way Pluto creates temporary modules for each cell.
This is currently being reworked in v0.10. Taking your mention of "a particular catalogue" as a `DataSet`, the new API would let you do something like:

```julia
using DataToolkitCore

usrcol = getlayer()  # the currently active data collection
ds = create!(usrcol, DataSet, "J/ApJ/923/67/table2", "description" => "...", ...)
storage!(ds, :web, "url" => "...")  # where the data comes from
loader!(ds, :something, params...)  # how it should be read
```
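Presumably the catalog could then be loaded like any other dataset, roughly along these lines (hedged, since the surrounding API is still settling in v0.10):

```julia
using DataToolkit, DataFrames

# Assumption: the dataset was registered with a table-like loader.
tbl = read(dataset("J/ApJ/923/67/table2"), DataFrame)
```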
Yea, some new tooling/capabilities will be needed to handle something like that.
"The store" is a directory managed by You can set a path and hard-code DataToolkit and other tools to use it, or use the store and get the path from |
Hm, that does sound reasonable! I wonder how caching would work when I have the same "catalogue" added independently in several different Julia environments: will it only download and store it once? Would it need internet access at all to add the same catalogue in a new env?
The whole point of a central store is that multiple projects can all reference the same data, and it will only be downloaded/stored once 🙂 (with no internet access needed so long as different projects look to be accessing the same file: by checksum or dataset attributes).
Thanks, I'll probably try DataToolkit in this scenario in the near future! I have a couple of private packages that serve the purpose of convenient access to specific datasets, and they are quite isolated, so they're a nice playground to see how it works :)
Nice! I suspect this will work a bit more pleasantly after v0.10 is out (the example I gave uses the new API), but if you're interested in taking it for a test run I'd be happy to help you work out details/fix bugs you may run into (and maybe even improve the API).
Nice to see a modern take on dataset handling in Julia!
I've been looking at DataToolkit, trying to understand how to apply it and what specific advantages it would bring. I have three different use cases in mind, and cannot really understand how to plug DataToolkit into any of them.
Briefly outlining them below; any suggestions are welcome!
1. A small, sporadically-updated table, like 500 rows. Currently, I just put a CSV file into the `data` folder of the Julia package, and provide a function that reads it into a Julia table with some minor cleanup. What can DataToolkit improve here?
2. An online collection of publicly-available tables; for a specific example, astronomical catalogs at https://vizier.cds.unistra.fr. Currently (in VirtualObservatory.jl) I provide a function that's basically `download_and_read_table(catalog_id::String)`, with some conveniences. There are obvious issues with that: the dataset is downloaded anew every time, and one cannot access a dataset without internet / when the archive is down, even if it was downloaded previously. Some transparent caching would be nice (a sketch of this follows after the list).
3. A large, well-structured collection of files (tables, images, ...), think hundreds of GBs. Currently, I manually ensure that the collection is available on the machine I need to work on, and have an interface like `MyDataset("path-to-the-directory")`. It would be nice to have a per-machine config file where the path is defined, so that `MyDataset()` automatically finds it. Also, maybe some basic presence/sanity checks...
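To make the "transparent caching" in use case 2 concrete, here is a minimal hypothetical helper (not VirtualObservatory.jl or DataToolkit API; the cache directory is an assumption):

```julia
using Downloads  # Julia standard library

# Hypothetical caching wrapper: reuse a previously downloaded file when the
# network or the archive is unavailable; download only on a cache miss.
function cached_download(url::AbstractString;
                         cachedir = joinpath(homedir(), ".cache", "catalogs"))
    mkpath(cachedir)
    dest = joinpath(cachedir, string(hash(url), base = 16))
    isfile(dest) && return dest  # cache hit: no internet needed
    return Downloads.download(url, dest)
end
```

Keying the cache on a hash of the URL means repeat requests are free and offline-safe once a catalog has been fetched, at the cost of never noticing upstream updates without an explicit refresh.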