These are my hopes and dreams when it comes to an s3-compat proxy/cache/etc.
The goal of this phase is simple:
Provide an easy way to minimize the downsides of querying data straight from s3 using embedded olap systems such as DataFusion, DuckDB, chDB, Velox, etc.
- Configurable cache backends
  - in-memory
  - local filesystem
  - s3 express
  - dynamodb? something else?
- Lightweight aliasing (see the alias-resolution sketch after this list)
  - Make `s3://tblA/$YEAR/$MONTH/$DAY` an alias of `s3://somebucket/somepath/$YEAR/$MONTH/$DAY`
  - Make `s3://tblB/` an alias of `s3://somebucket/anotherpath/`
- Multiple (x-object store) backends
  - Alias `s3://tblC` to `gcs://somebucket/somepath`
  - Alias `s3://tblD` to `s3://yeesh/`
- Alias and object-level caching configuration (see the cache-policy sketch after this list)
  - Cache all objects from `s3://pathA` for 30 minutes
  - LRU-cache `s3://pathB` to a maximum of 20 GB
  - Cache and reload all objects from `s3://pathC` every hour
- Auth
  - Passthrough to aws/gcs iam (for now)
- Object-level purge/warm endpoint(s)
  - Purge and optionally reload individual objects in the cache on demand
- Alias-level purge/warm endpoint(s)
  - Ditto to object-level purging, but incorporate it for aliases
- Lightweight object cataloging
  - Speed up `list` operations such as `aws s3 ls`
  - Incremental additions using object/bucket notifications
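
A std-only Rust sketch of how the aliasing (and the cross-object-store mapping) above could resolve: a lookup table maps the requested bucket to a backend URL prefix, and the object key is passed through unchanged, so partition-style keys like `$YEAR/$MONTH/$DAY` need no special handling. The table contents, function names, and the assumption that resolution is a pure prefix rewrite are all illustrative, not a settled design.

```rust
use std::collections::HashMap;

/// Minimal alias table: requested bucket -> backend URL prefix.
/// The backend can live in a different object store entirely (gcs://, s3://, ...).
fn alias_table() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("tblA", "s3://somebucket/somepath"),
        ("tblB", "s3://somebucket/anotherpath"),
        ("tblC", "gcs://somebucket/somepath"),
    ])
}

/// Rewrite an incoming request like `GET s3://tblA/2024/05/01/part-0.parquet`
/// to the real backend location. Path segments (e.g. $YEAR/$MONTH/$DAY) are
/// passed through untouched, so partition layouts survive the rewrite.
fn resolve(aliases: &HashMap<&str, &str>, bucket: &str, key: &str) -> Option<String> {
    aliases.get(bucket).map(|prefix| {
        format!(
            "{}/{}",
            prefix.trim_end_matches('/'),
            key.trim_start_matches('/')
        )
    })
}

fn main() {
    let aliases = alias_table();
    assert_eq!(
        resolve(&aliases, "tblA", "2024/05/01/part-0.parquet").as_deref(),
        Some("s3://somebucket/somepath/2024/05/01/part-0.parquet")
    );
    assert_eq!(
        resolve(&aliases, "tblC", "data.parquet").as_deref(),
        Some("gcs://somebucket/somepath/data.parquet")
    );
    println!("alias resolution ok");
}
```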
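In the same spirit, the alias/object-level caching examples above could be expressed as data. The three policy shapes (TTL, LRU byte budget, periodic reload) come straight from those bullets; everything else, including the `CacheRule` shape and field names, is a hypothetical schema rather than a final config format.

```rust
use std::time::Duration;

/// Per-alias (or per-prefix) cache behavior, mirroring the examples above.
/// Variant and field names are illustrative, not a settled config schema.
#[derive(Debug)]
enum CachePolicy {
    /// Keep objects for a fixed TTL, then refetch on next access.
    Ttl { max_age: Duration },
    /// Evict least-recently-used objects once the cache exceeds a byte budget.
    LruBytes { max_bytes: u64 },
    /// Keep objects warm by re-pulling them from the backend on a schedule.
    PeriodicReload { every: Duration },
}

#[derive(Debug)]
struct CacheRule {
    prefix: &'static str,
    policy: CachePolicy,
}

fn main() {
    // "Cache s3://pathA for 30 minutes, LRU-cap s3://pathB at 20 GB,
    //  reload s3://pathC every hour" expressed as data.
    let rules = [
        CacheRule {
            prefix: "s3://pathA",
            policy: CachePolicy::Ttl { max_age: Duration::from_secs(30 * 60) },
        },
        CacheRule {
            prefix: "s3://pathB",
            policy: CachePolicy::LruBytes { max_bytes: 20 * 1024 * 1024 * 1024 },
        },
        CacheRule {
            prefix: "s3://pathC",
            policy: CachePolicy::PeriodicReload { every: Duration::from_secs(3600) },
        },
    ];
    for rule in &rules {
        println!("{} -> {:?}", rule.prefix, rule.policy);
    }
}
```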
The goal of this phase is to start leveraging the benefits of having a sql engine (datafusion) sitting between an s3-compatible shim (sprox) and s3/gcs/azure blob/minio/etc itself.
- Read
  - Column/row-level obfuscation
  - Lightweight read-time data transformation (see the DataFusion sketch after this list)
  - Enrichment-on-read using external sources
  - X-backend query routing to leverage hybrid storage
- Cataloging
  - Optimized onboard catalog updates (subscribe to bucket event notifications, etc)
- IAM
  - Fine-grained authorization
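
To make the read-time transformation idea concrete, here is a rough DataFusion sketch that masks a column before results would leave the shim, operating on a parquet file the proxy is assumed to already hold in its local cache. The file path, table name, and column names (`id`, `email`, `amount`) are placeholders for the example; a real implementation would build the query from per-alias rules rather than hard-coding SQL.

```rust
// A read-path sketch, not sprox itself. Assumed Cargo.toml deps:
// datafusion and tokio (features = ["rt-multi-thread", "macros"]).
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register an object that the proxy has already pulled into its local
    // cache. The path and table name are placeholders for this example.
    ctx.register_parquet(
        "tbl_a",
        "./cache/tblA/2024/05/01/part-0.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // Column-level obfuscation on read: mask `email` before the result is
    // handed back to the client.
    let df = ctx
        .sql("SELECT id, substr(email, 1, 3) || '***' AS email, amount FROM tbl_a")
        .await?;
    df.show().await?;
    Ok(())
}
```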
The goal of this phase is to make s3-based data management easier.
- Compaction
- Configurable cataloging
  - Delta table?
  - Iceberg?
  - Something else?
- Table versioning