Apache Hudi (Incubating) (pronounced "Hoodie") stands for Hadoop Upserts Deletes and Incrementals.
Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage).
- Upsert support with fast, pluggable indexing
- Atomically publish data with rollback support
- Snapshot isolation between writer & queries
- Savepoints for data recovery
- Manages file sizes and layout using statistics
- Async compaction of row & columnar data
- Timeline metadata to track lineage
Hudi provides the ability to query via three types of views:
- Read Optimized View - Provides excellent snapshot query performance via purely columnar storage (e.g. Parquet).
- Incremental View - Provides a change stream with records inserted or updated after a point in time.
- Real-time View - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g. Parquet + Avro).
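As a sketch of how these views are typically selected from spark-shell (the table path, commit time, and exact option keys below are illustrative assumptions for the incubating-era Spark datasource, not taken from this README):

```scala
// Illustrative sketch: assumes a Hudi table was already written under basePath.
// Option keys follow the Hudi Spark datasource; exact constants may differ by release.
val basePath = "file:///tmp/hudi_trips"  // hypothetical table location

// Read Optimized view: scans only the columnar (Parquet) base files.
val roDF = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.view.type", "read_optimized")
  .load(basePath + "/*/*")

// Incremental view: only records inserted or updated after a given commit time.
val incDF = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.view.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20190101000000")  // hypothetical commit
  .load(basePath)

// Real-time view: merges columnar base files with row-based (Avro) log files on read.
val rtDF = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.view.type", "realtime")
  .load(basePath + "/*/*")
```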
Learn more about Hudi at https://hudi.apache.org
Prerequisites for building Apache Hudi:
- Unix-like system (like Linux, Mac OS X)
- Java 8 (Java 9 or 10 may work)
- Git
- Maven
# Checkout code and build
git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
mvn clean package -DskipTests -DskipITs
Please visit https://hudi.apache.org/quickstart.html to quickly explore Hudi's capabilities using spark-shell.
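For example, you can launch spark-shell against the bundle built above (the jar path, Scala version, and wildcard here are assumptions; adjust them to match your build output):

```
# Hypothetical invocation from the Spark home directory
bin/spark-shell \
  --jars /path/to/incubator-hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.jar \
  --conf 'spark.serializer=org.apache.spark.sql.KryoSerializer'
```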