- open.quiltdata.com is a petabyte-scale open data portal that runs on Quilt
- quiltdata.com includes case studies, use cases, videos, and instructions on how to run a private Quilt instance
- "Versioning data and models for rapid experimentation in machine learning" shows how to use Quilt for real-world projects
Quilt is for data-driven teams and offers features for coders (data scientists, data engineers, developers) and business users alike.
Quilt manages data like code so that teams in machine learning, biotech, and analytics can experiment faster, build smarter models, and recover from errors.
Quilt consists of a Python client, a web catalog, and Lambda functions (all of which are open source), plus a suite of backend services and Docker containers orchestrated by CloudFormation.
The backend services and containers are available for private use under a paid license on quiltdata.com.
- Share data at scale. Quilt wraps AWS S3 to add simple URLs, web preview for large files, and sharing via email address (no need to create an IAM role).
- Understand data better through inline documentation (Jupyter notebooks, Markdown) and visualizations (Vega, Vega-Lite)
- Discover related data by indexing objects in Elasticsearch
- Model data by providing a home for large data and models that don't fit in git, and by providing immutable versions for objects and data sets (a.k.a. "Quilt Packages")
- Decide by broadening data access within the organization and supporting the documentation of decision processes through auditable versioning and inline documentation
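
The idea of immutable versions for data sets can be illustrated with a small content-addressed manifest. This is a minimal sketch using only the Python standard library, not Quilt's actual manifest format or client API: the helper names (`build_manifest`, `top_hash`) and the JSON layout are assumptions made for illustration. Each logical key maps to the hash of its contents, and the version identifier is a hash of the whole manifest, so any change to any object yields a new version.

```python
import hashlib
import json


def object_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of an object's bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict, message: str) -> dict:
    """Map each logical key to the hash and size of its contents (hypothetical layout)."""
    entries = {
        key: {"hash": object_hash(data), "size": len(data)}
        for key, data in sorted(files.items())
    }
    return {"message": message, "entries": entries}


def top_hash(manifest: dict) -> str:
    """Hash the canonical JSON form of the manifest to get a version identifier."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two versions of the same package: only data/train.csv differs.
v1 = build_manifest({"data/train.csv": b"a,b\n1,2\n", "README.md": b"# demo\n"}, "initial")
v2 = build_manifest({"data/train.csv": b"a,b\n1,3\n", "README.md": b"# demo\n"}, "fix label")

print(top_hash(v1))
print(top_hash(v2))
```

Because the identifier is derived from the contents, an existing version cannot be altered without changing its hash, and unchanged objects (here `README.md`) keep the same entry hash across versions.
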
- Address performance issues with push (e.g. re-hash)
- Provide Presto-DB-powered services for filtering package repos with SQL
- Investigate and implement more efficient manifest formats (e.g. Parquet) that scale to 10M keys; consider abbreviated "fast manifests" for lazy browsing
- Refactor `s3://bucket/.quilt` for improved listing and delete performance
- Ability to fork/merge packages
- Data quality monitoring
- Evaluate min.io and ceph.io as shims
- Evaluate feasibility of on-prem local storage as a repo
- Evaluate K8s and Terraform to replace CloudFormation
- Shim lambdas (consider serverless.com)
- Shim Elasticsearch (consider Solr)
- Shim IAM via RBAC
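
The roadmap item on filtering package repos with SQL can be pictured with a small stand-in. The sketch below uses Python's built-in `sqlite3` module in place of Presto, and the table layout (`entries` with `package`, `logical_key`, `size_bytes`), the sample rows, and the helper names are all hypothetical. It only shows the kind of query such a service might answer, not how Quilt implements or plans to implement it.

```python
import sqlite3

# Hypothetical manifest rows: (package, logical_key, size_bytes)
ROWS = [
    ("team/sales", "2020/q1.csv", 1200),
    ("team/sales", "README.md", 300),
    ("team/images", "cells/001.tiff", 5_000_000),
    ("team/images", "labels.csv", 800),
]


def load_entries(rows):
    """Load manifest rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (package TEXT, logical_key TEXT, size_bytes INTEGER)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    return conn


def packages_with(conn, suffix, min_size=0):
    """Return packages holding a key that ends in `suffix` and is at least `min_size` bytes."""
    cur = conn.execute(
        "SELECT DISTINCT package FROM entries "
        "WHERE logical_key LIKE ? AND size_bytes >= ? "
        "ORDER BY package",
        ("%" + suffix, min_size),
    )
    return [row[0] for row in cur.fetchall()]


conn = load_entries(ROWS)
print(packages_with(conn, ".csv"))
print(packages_with(conn, ".csv", min_size=1000))
```

Treating manifest entries as rows is what makes a columnar format such as Parquet attractive for large manifests: a query engine can scan only the columns a filter needs.
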