Usecases question #48
Hi Alexander, thanks for your interest 🙂. By the sounds of it, DataToolkit should be able to help with some of the use cases you describe, and I'd be very happy to explore how it could do so better :)
Without more details, it sounds like it would be hard to make much of an improvement over just calling your existing read function.
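For concreteness, a minimal sketch of that setup (the names `load_table` and `table.csv` are placeholders, not from the original discussion):

```julia
using CSV, DataFrames

# Hypothetical version of the current approach: a small CSV shipped in the
# package's data/ folder, read into a table with some minor cleanup.
function load_table()
    path = joinpath(pkgdir(@__MODULE__), "data", "table.csv")
    df = CSV.read(path, DataFrame)
    dropmissing!(df)  # example cleanup step
    return df
end
```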
This sounds rather applicable. I'd recommend holding off till the 0.10 release, though. That said, if you've got 1000s of entries, asking for them all to be put in a `Data.toml` seems impractical. Something else you could consider doing is adding an API for adding a particular catalog to a user's project `Data.toml`.
How large is "large"? I've only tested with more modest amounts of data so far.

As for putting this collection in a `Data.toml`: I think the main benefit you'd get from this is data existence and integrity checks.

The per-machine config makes things a little trickier. Since the "data store" is content-addressed, if you have the checksum for a dataset and it exists in the store, it doesn't need to be told where to find it. That said, currently no work has been done for per-machine config files, beyond cache settings for the "data store".
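To illustrate what "content-addressed" buys you here, a minimal concept sketch (this is not DataToolkit's actual store code; the names are made up):

```julia
using SHA  # Julia standard library

# Concept sketch: a content-addressed store maps checksum => path, so any
# project that knows a file's checksum can locate it without being told where
# it lives, and identical content is only ever stored once.
const STORE = Dict{String,String}()

function store_put!(path::AbstractString)
    key = bytes2hex(open(sha256, path))
    get!(STORE, key, path)  # no-op if this content is already stored
    return key
end

store_get(key::AbstractString) = get(STORE, key, nothing)
```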
Thanks for the detailed response!
It's generally fine, but could use features like accessing older versions of the same dataset. Is that in scope for this package?
Oh, there are definitely thousands of entries and more! New ones are regularly being added, too.
Interesting approach... Does it work nicely with Pluto, and temp envs more generally? For context, the current usage looks like:

```julia
vizcat = VizierCatalog("J/ApJ/923/67/table2")  # metadata only, no actual data loaded
tbl = table(vizcat)  # downloads and reads the actual table
```

How do you think it would look in this scenario from the user side? I wonder what exactly an "API for adding a particular catalog to a user's project Data.toml" should entail...
I'm thinking a few × 10^5 files, some hundreds of GB. I don't think it's reasonable to put each file individually into the TOML list.

What do you mean by "exists in the store"? The data is just available on each machine, either locally at some path or on a remotely-mounted disk. And it should surely stay that way, since some other tools also access it. I'm just looking for a nice approach to access it from Julia without hardcoding the path in the code, and potentially with some consistency checks.
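A hand-rolled sketch of what that could look like, just to make the ask concrete (the `MYDATASET_PATH` variable and config file location are invented for illustration):

```julia
# Hypothetical per-machine resolution for MyDataset(): check an environment
# variable first, then a per-user config file, instead of hardcoding the path.
function dataset_root()
    haskey(ENV, "MYDATASET_PATH") && return ENV["MYDATASET_PATH"]
    cfg = joinpath(homedir(), ".config", "mydataset", "path.txt")
    isfile(cfg) && return String(strip(read(cfg, String)))
    error("dataset location not configured on this machine")
end

MyDataset() = MyDataset(dataset_root())
```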
Well, there is the
My gut feeling is it should be fine, but I do recall some issues with Pluto specifically when relying on packages made available in the notebook: it's to do with the way Pluto creates temporary modules for each cell.
This is currently being reworked in v0.10. Taking your mention of "a particular catalogue" as a `DataSet`, the new API would let you do something like:

```julia
using DataToolkitCore

usrcol = getlayer()  # the currently active data collection
ds = create!(usrcol, DataSet, "J/ApJ/923/67/table2", "description" => "...", ...)
storage!(ds, :web, "url" => "...")  # where the data comes from
loader!(ds, :something, params...)  # how it should be read
```
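Presumably the catalog could then be loaded like any other dataset, roughly along these lines (hedged, since the surrounding API is still settling in v0.10):

```julia
using DataToolkit, DataFrames

# Assumption: the dataset was registered with a table-like loader.
tbl = read(dataset("J/ApJ/923/67/table2"), DataFrame)
```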
Yea, some new tooling/capabilities will be needed to handle something like that.
"The store" is a directory managed by You can set a path and hard-code DataToolkit and other tools to use it, or use the store and get the path from |
Hm, that does sound reasonable! I wonder how caching would work when I have the same "catalogue" added independently in several different Julia environments: will it only download and store it once? Would it need internet access at all to add the same catalogue in a new env?
The whole point of a central store is that multiple projects can all reference the same data, and it will only be downloaded/stored once 🙂 (with no internet access needed so long as different projects look to be accessing the same file: by checksum or dataset attributes).
Thanks, I'll probably try DataToolkit in this scenario in the near future! I have a couple of private packages that serve the purpose of convenient access to specific datasets, and they are quite isolated, so they're a nice playground to see how it works :)
Nice! I suspect this will work a bit more pleasantly after v0.10 is out (the example I gave uses the new API), but if you're interested in taking it for a test run I'd be happy to help you work out details/fix bugs you may run into (and maybe even improve the API).
Nice to see a modern take on dataset handling in Julia!
I've been looking at DataToolkit, trying to understand how to apply it and what specific advantages it would bring. I have three different use cases in mind, and cannot really understand how to plug DataToolkit into any of them.
Briefly outlining them below; any suggestions are welcome!
1. A small, sporadically-updated table, like 500 rows. Currently, I just put a CSV file into the `data` folder of the Julia package, and provide a function that reads it into a Julia table with some minor cleanup. What can DataToolkit improve here?
2. An online collection of publicly-available tables; for a specific example, astronomical catalogs at https://vizier.cds.unistra.fr. Currently (in VirtualObservatory.jl) I provide a function that's basically `download_and_read_table(catalog_id::String)`, with some conveniences. There are obvious issues with that: the dataset is downloaded anew every time, and one cannot access a dataset without internet / when the archive is down, even if it was downloaded previously. Some transparent caching would be nice (a sketch of this follows after the list).
3. A large, well-structured collection of files (tables, images, ...), think hundreds of GBs. Currently, I manually ensure that the collection is available on the machine I need to work on, and have an interface like `MyDataset("path-to-the-directory")`. It would be nice to have a per-machine config file where the path is defined, so that `MyDataset()` automatically finds it. Also, maybe some basic presence/sanity checks...
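To make the "transparent caching" in use case 2 concrete, here is a minimal hypothetical helper (not VirtualObservatory.jl or DataToolkit API; the cache directory is an assumption):

```julia
using Downloads  # Julia standard library

# Hypothetical caching wrapper: reuse a previously downloaded file when the
# network or the archive is unavailable; download only on a cache miss.
function cached_download(url::AbstractString;
                         cachedir = joinpath(homedir(), ".cache", "catalogs"))
    mkpath(cachedir)
    dest = joinpath(cachedir, string(hash(url), base = 16))
    isfile(dest) && return dest  # cache hit: no internet needed
    return Downloads.download(url, dest)
end
```

Keying the cache on a hash of the URL means repeat requests are free and offline-safe once a catalog has been fetched, at the cost of never noticing upstream updates without an explicit refresh.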