Create a Dex datasets library #458

Open
4 tasks
dan-zheng opened this issue Jan 14, 2021 · 4 comments
Labels
libraries Libraries written in Dex

Comments

@dan-zheng
Collaborator

dan-zheng commented Jan 14, 2021

Motivation

Create a structured datasets library within Dex: lib/datasets.dx.

The library should enable straightforward usage of machine learning datasets, including the following:

  • Downloading datasets to a shared machine location (in /tmp or ~/.dex/datasets/...)
  • Unzipping datasets, handling various compression formats.
  • Parsing: loading the dataset as a Dex data structure.
    • Example: input-output pairs. List (inputSize => Float & labelSize => Int)
  • Transforms: data transformations - batching, shuffling (nondeterminism), concatenation, filtering, mapping, and augmentation.
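To make the transform list concrete, here is a minimal sketch of filtering, mapping, shuffling, and batching over input-output pairs. It is written in Python purely for illustration, not in Dex; the helper names `shuffled` and `batched` and the toy dataset are hypothetical and not part of any existing library. Shuffling takes an explicit seed so that the nondeterminism stays reproducible.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffled(examples: Iterable[T], seed: int) -> List[T]:
    """Return a shuffled copy; the explicit seed makes the order reproducible."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    return items


def batched(examples: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches; the final batch may be smaller."""
    batch: List[T] = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Toy dataset of (input features, label) pairs, invented for this sketch.
dataset = [([float(i), float(i) + 0.5], i % 2) for i in range(10)]

# Filtering: keep only the examples labelled 0.
evens = filter(lambda pair: pair[1] == 0, dataset)
# Mapping: rescale every input feature.
scaled = map(lambda pair: ([x / 10.0 for x in pair[0]], pair[1]), evens)
# Shuffling, then batching.
batches = list(batched(shuffled(scaled, seed=0), batch_size=2))
```

Each transform consumes and produces a plain sequence of examples, so transforms compose by nesting. A Dex version would presumably express the same pipeline over tables with explicit index sets.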

Implementation ideas

  • Dataset downloading and unzipping could be implemented via shell command support in Dex using IO effect.
    • Example: wget a named dataset with library-hardcoded URL to ~/.dex/datasets/... if it doesn't already exist.
  • Parsing could be implemented using parser combinators, or ad-hoc string processing logic.
  • Transforms could be implemented using Accum effect for MapReduce-like functionality and potential for parallelism.
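The download-if-missing and unzip steps could look roughly like the sketch below. It is Python rather than Dex, and it uses the standard library instead of shelling out to wget; the function names `fetch` and `unpack`, the `.part` suffix, and the cache layout are assumptions made for illustration only.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(name: str, url: str, cache_dir: Path) -> Path:
    """Download `url` to cache_dir/name unless that file already exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Download to a temporary name first so that an interrupted
        # download never leaves a truncated file in the cache.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        partial.replace(target)
    return target


def unpack(archive: Path, dest: Path) -> Path:
    """Extract `archive` into `dest` unless `dest` already exists.

    shutil.unpack_archive infers the format (zip, tar, gztar, ...) from
    the file extension.
    """
    if not dest.exists():
        shutil.unpack_archive(str(archive), str(dest))
    return dest


# Hypothetical usage (the URL is a placeholder, not a real dataset location):
#   cache = Path.home() / ".dex" / "datasets"
#   archive = fetch("toy.zip", "https://example.com/toy.zip", cache)
#   data_dir = unpack(archive, cache / "toy")
```

Both functions are idempotent, so a shared cache directory can be reused across runs without downloading or extracting again.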

Prior work

dan-zheng added the libraries label Jan 14, 2021
dan-zheng mentioned this issue Jan 14, 2021
@oxinabox
Contributor

oxinabox commented Jan 14, 2021

Prior Work:

I suggest splitting out transforms into a separate issue.
Possibly also separating out parsing into a separate issue.
Both are huge.

@apaszke
Collaborator

apaszke commented Jan 14, 2021

Other prior work: torchvision. Still, I would say that this is somewhat low priority, because I don't expect we'll be able to make a big splash in the hyper-optimized space of standard ML models.

@dan-zheng
Collaborator Author

That makes sense, thanks!

@srush
Contributor

srush commented Jan 14, 2021

@dan-zheng I think a nice option here would be to write bindings to https://en.wikipedia.org/wiki/Apache_Arrow

https://github.com/huggingface/datasets has a ton of datasets in this form. It seems a bit crazy to rewrite this sort of infrastructure for each language.


4 participants