[SITE][MINOR] Add Matomo for site traffic and fix links in blogs (#12018)
bhasudha authored Sep 27, 2024
1 parent 3b5a389 commit e60604f
Showing 18 changed files with 22 additions and 21 deletions.
2 changes: 1 addition & 1 deletion website/blog/2021-07-21-streaming-data-lake-platform.md
@@ -45,7 +45,7 @@ Thus, the best way to describe Apache Hudi is as a **Streaming Data Lake Platfor

**Streaming**: At its core, by optimizing for fast upserts & change streams, Hudi provides the primitives to data lake workloads that are comparable to what [Apache Kafka](https://kafka.apache.org/) does for event-streaming (namely, incremental produce/consume of events and a state-store for interactive querying).

**Data Lake**: Nonetheless, Hudi provides an optimized, self-managing data plane for large-scale data processing on the lake (ad-hoc queries, ML pipelines, batch pipelines), powering arguably the [largest transactional lake](https://eng.uber.com/apache-hudi-graduation/) in the world. While Hudi can be used to build a [lakehouse](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html), given its transactional capabilities, Hudi goes beyond and unlocks an end-to-end streaming architecture. In contrast, the word “streaming” appears just 3 times in the lakehouse [paper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), and one of them is talking about Hudi.
**Data Lake**: Nonetheless, Hudi provides an optimized, self-managing data plane for large-scale data processing on the lake (ad-hoc queries, ML pipelines, batch pipelines), powering arguably the [largest transactional lake](https://eng.uber.com/apache-hudi-graduation/) in the world. While Hudi can be used to build a [lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/), given its transactional capabilities, Hudi goes beyond and unlocks an end-to-end streaming architecture. In contrast, the word “streaming” appears just 3 times in the lakehouse [paper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), and one of them is talking about Hudi.

**Platform**: Oftentimes in open source, there is great tech, but there are just too many of them - all differing ever so slightly in their opinionated ways, ultimately making the integration task onerous on the end user. Lake users deserve the same great usability that cloud warehouses provide, with the additional freedom and transparency of a true open source community. Hudi’s data and table services, tightly integrated with the Hudi “kernel”, give us the ability to deliver cross-layer optimizations with reliability and ease of use.
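
To make the streaming primitives above concrete, here is a minimal, hypothetical sketch of an upsert followed by an incremental read, assuming PySpark with the Hudi Spark bundle on the classpath; the table name, fields, and path are purely illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-streaming-primitives")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",  # mutate matching records in place
}

updates = spark.createDataFrame(
    [("row-1", "2021-07-21 00:00:00", 42.0)], ["uuid", "ts", "fare"])

# Fast upsert: only the affected file groups are rewritten, not whole partitions.
updates.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/trips")

# Incremental consume: read only records committed after the given instant time,
# analogous to consuming a Kafka topic from an offset.
changes = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", "20210721000000")
           .load("/tmp/trips"))
changes.show()
```

The incremental read is what lets downstream pipelines process only the records that changed since their last run, rather than rescanning the table.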

@@ -10,7 +10,7 @@ tags:
- apache hudi
---

Transactions on data lakes are now considered a key characteristic of a Lakehouse. But what has actually been accomplished so far? What are the current approaches? How do they fare in real-world scenarios? These questions are the focus of this blog.
Transactions on data lakes are now considered a key characteristic of a [Lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/). But what has actually been accomplished so far? What are the current approaches? How do they fare in real-world scenarios? These questions are the focus of this blog.

<!--truncate-->

@@ -54,4 +54,4 @@ All this said, there are still many ways we can improve upon this foundation.
* While optimistic concurrency control is attractive when serializable snapshot isolation is desired, it's neither optimal nor the only method for dealing with concurrency between writers (today's optimistic mode is enabled as in the config sketch after this list). We plan to implement fully lock-free concurrency control using CRDTs and widely adopted stream processing concepts, over our log [merge API](https://github.com/apache/hudi/blob/bc8bf043d5512f7afbb9d94882c4e43ee61d6f06/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java#L38), which has already been [proven](https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/#functionality-support) to sustain enormous continuous write volumes for the data lake.
* Touching upon key constraints, Hudi is the only lake transactional layer that ensures unique [key](https://hudi.apache.org/docs/key_generation) constraints today, though limited to the record key of the table. We will be looking to expand this capability in a more general form to non-primary key fields, alongside the newer concurrency models mentioned above.
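
For reference, a hedged sketch of the writer-side configuration a multi-writer deployment would typically set to enable today's optimistic concurrency control, assuming PySpark with the Hudi bundle and a reachable ZooKeeper; the hosts, table name, and path are illustrative.

```python
# Hypothetical multi-writer setup with optimistic concurrency control (OCC).
occ_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Turn on optimistic concurrency control between concurrent writers.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # Clean failed writes lazily so concurrent writers don't clash on rollback.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # Lock provider used to serialize the commit phase (ZooKeeper shown; others exist).
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "orders",
    "hoodie.write.lock.zookeeper.base_path": "/hudi-locks",
}

# Each concurrent writer then appends with these options, e.g.:
# df.write.format("hudi").options(**occ_options).mode("append").save("/tmp/orders")
```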

Finally, for data lakes to transform successfully into lakehouses, we must learn from the failings of the "hadoop warehouse" vision, which shared similar goals with the new "lakehouse" vision. Designers did not pay close enough attention to the technology gaps against warehouses and created unrealistic expectations from the actual software. As transactions and database functionality finally go mainstream on data lakes, we must apply these lessons and remain candid about the current shortcomings. If you are building a lakehouse, I hope this post encourages you to closely consider various operational and efficiency aspects around concurrency control. Join our fast-growing community by trying out [Apache Hudi](https://hudi.apache.org/docs/overview) or join us in conversations on [Slack](https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g).
Finally, for data lakes to transform successfully into lakehouses, we must learn from the failings of the "hadoop warehouse" vision, which shared similar goals with the new "[lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/)" vision. Designers did not pay close enough attention to the technology gaps against warehouses and created unrealistic expectations from the actual software. As transactions and database functionality finally go mainstream on data lakes, we must apply these lessons and remain candid about the current shortcomings. If you are building a lakehouse, I hope this post encourages you to closely consider various operational and efficiency aspects around concurrency control. Join our fast-growing community by trying out [Apache Hudi](https://hudi.apache.org/docs/overview) or join us in conversations on [Slack](https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g).
@@ -11,7 +11,7 @@ tags:
- apache hudi
---

The focus of this blog is to show you how to build an open lakehouse leveraging incremental data processing and performing field-level updates. We are excited to announce that you can now use Apache Hudi + dbt for building open data lakehouses.
The focus of this blog is to show you how to build an open lakehouse leveraging incremental data processing and performing field-level updates. We are excited to announce that you can now use Apache Hudi + dbt for building open [data lakehouses](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/).

![/assets/images/blog/hudi_dbt_lakehouse.png](/assets/images/blog/hudi_dbt_lakehouse.png)

@@ -20,7 +20,7 @@ Let's first clarify a few terminologies used in this blog before we dive into th

## What is Apache Hudi?

Apache Hudi brings ACID transactions, record-level updates/deletes, and change streams to data lakehouses.
Apache Hudi brings ACID transactions, record-level updates/deletes, and change streams to [data lakehouses](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/).

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. This framework more efficiently manages business requirements like data lifecycle and improves data quality.

@@ -17,7 +17,7 @@ tags:

# Build Your First Hudi Lakehouse with AWS S3 and AWS Glue

Soumil Shah is a Hudi community champion building [YouTube content](https://www.youtube.com/@SoumilShah/playlists) so developers can easily get started incorporating a lakehouse into their data infrastructure. In this [video](https://www.youtube.com/watch?v=5zF4jc_3rFs&list=PLL2hlSFBmWwwbMpcyMjYuRn8cN99gFSY6), Soumil shows you how to get started with AWS Glue, AWS S3, Hudi and Athena.
Soumil Shah is a Hudi community champion building [YouTube content](https://www.youtube.com/@SoumilShah/playlists) so developers can easily get started incorporating a [lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) into their data infrastructure. In this [video](https://www.youtube.com/watch?v=5zF4jc_3rFs&list=PLL2hlSFBmWwwbMpcyMjYuRn8cN99gFSY6), Soumil shows you how to get started with AWS Glue, AWS S3, Hudi and Athena.

In this tutorial, you’ll learn how to:
- Create and configure AWS Glue
@@ -28,7 +28,7 @@ where people are sharing and helping each other!

While there are too many features added in 2022 to list them all, take a look at some of the exciting highlights:

- [Multi-Modal Index](https://hudi.apache.org/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) is a first-of-its-kind high-performance indexing subsystem for the Lakehouse. It improves metadata lookup performance by up to 100x and reduces overall query latency by up to 30x. Two new indices were added to the metadata table - a Bloom filter index that enables faster upsert performance, and a [column stats index that, together with data skipping](https://hudi.apache.org/blog/2022/06/09/Singificant-queries-speedup-from-Hudi-Column-Stats-Index-and-Data-Skipping-features), helps speed up queries dramatically (see the config sketch after this list).
- [Multi-Modal Index](https://hudi.apache.org/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) is a first-of-its-kind high-performance indexing subsystem for the [Lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/). It improves metadata lookup performance by up to 100x and reduces overall query latency by up to 30x. Two new indices were added to the metadata table - a Bloom filter index that enables faster upsert performance, and a [column stats index that, together with data skipping](https://hudi.apache.org/blog/2022/06/09/Singificant-queries-speedup-from-Hudi-Column-Stats-Index-and-Data-Skipping-features), helps speed up queries dramatically (see the config sketch after this list).
- Hudi added support for [asynchronous indexing](https://hudi.apache.org/releases/release-0.11.0/#async-indexer) to assist in building such indices without blocking ingestion, so that regular writers don't need to scale up resources for such one-off spikes.
- A new type of index called Bucket Index was introduced this year. This could be game-changing for deterministic workloads with partitioned datasets. It is very lightweight and allows the distribution of records to buckets using a hash function.
- Filesystem-based Lock Provider - This implementation avoids the need for external systems and leverages the capabilities of the underlying filesystem to provide the lock provider needed for optimistic concurrency control with multiple writers. Please check the [lock configuration](https://hudi.apache.org/docs/configurations#Locks-Configurations) for details.
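
As a rough sketch of the multi-modal index and data-skipping configurations from the first item above (assuming PySpark with the Hudi bundle; the path, column, and filter are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Hudi bundle is on the classpath

# Writer side: maintain bloom filter and column stats indexes in the metadata table.
index_write_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}
# df.write.format("hudi").options(**index_write_options).mode("append").save("/tmp/trips")

# Reader side: prune files using column stats before scanning (data skipping).
pruned = (spark.read.format("hudi")
          .option("hoodie.metadata.enable", "true")
          .option("hoodie.enable.data.skipping", "true")
          .load("/tmp/trips")
          .where("fare > 100"))
```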
@@ -115,7 +115,7 @@ as well as Flink 1.16, 1.17, and 1.18.
While Apache Hudi continues its strong growth momentum, some members of the community also decided it is time to
start building interoperability bridges across Lakehouse table formats with Delta Lake and Iceberg. The
[recent announcement about OneTable becoming open source](https://www.onehouse.ai/blog/onetable-is-now-open-source)
marks a big leap forward for all developers looking to build a data lakehouse architecture. This development not
marks a big leap forward for all developers looking to build a [data lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) architecture. This development not
only emphasizes Hudi's commitment to openness but also enables a wider range of users to experience the
technological advantages offered by Hudi.

2 changes: 1 addition & 1 deletion website/blog/2024-07-31-hudi-file-formats.md
@@ -40,7 +40,7 @@ Cons of Parquet:
* Small Data Sets: Parquet may not be the best choice for small datasets because the advantages of its columnar storage model aren’t as pronounced.

Use Cases for Parquet:
* Parquet is an excellent choice when dealing with large, complex, and nested data structures, especially for read-heavy workloads. Its columnar storage approach makes it particularly well suited to data lakehouse solutions where aggregation queries are common.
* Parquet is an excellent choice when dealing with large, complex, and nested data structures, especially for read-heavy workloads. Its columnar storage approach makes it particularly well suited to [data lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) solutions where aggregation queries are common.

### Optimized Row Columnar (ORC)
[Apache ORC](https://orc.apache.org/) is another popular, self-describing, type-aware columnar file format.
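
For context, a small hypothetical sketch of choosing the base file format when a Hudi table is first created, assuming PySpark with the Hudi bundle; Parquet is the default and ORC is shown for contrast, with illustrative names and paths.

```python
# Table-level choice of base file format, fixed when the table is first created.
orc_table_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.table.base.file.format": "ORC",  # default is PARQUET
}
# df.write.format("hudi").options(**orc_table_options).mode("overwrite").save("/tmp/events")
```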
4 changes: 2 additions & 2 deletions website/docs/hudi_stack.md
@@ -7,7 +7,7 @@ toc_max_heading_level: 3
last_modified_at:
---

Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
Apache Hudi is a Transactional [Data Lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust [data lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) platform.

In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project.

@@ -24,7 +24,7 @@ The storage layer is where the data files (such as Parquet) are stored. Hudi int
File formats hold the raw data and are physically stored on the lake storage. Hudi operates on logical structures of File Groups and File Slices, which consist of a Base File and Log Files. Base Files are compacted and optimized for reads and are augmented with Log Files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images) and to add compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a Log File as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
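
As a hedged illustration of how Base Files and Log Files come together, here is a minimal sketch of creating a Merge-On-Read table with inline compaction, assuming PySpark with the Hudi bundle; names and paths are illustrative.

```python
mor_options = {
    "hoodie.table.name": "clicks",
    "hoodie.datasource.write.recordkey.field": "click_id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Merge-On-Read: updates land in log files appended to the file group.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Compaction periodically folds log files back into optimized base files.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
# df.write.format("hudi").options(**mor_options).mode("append").save("/tmp/clicks")
```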

## Transactional Database Layer
The transactional database layer of Hudi comprises the core components that are responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on data lakehouse storage.
The transactional database layer of Hudi comprises the core components that are responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on [data lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) storage.

### Table Format
![Table Format](/assets/images/blog/hudistack/table_format_1.png)
2 changes: 1 addition & 1 deletion website/docs/metadata.md
@@ -9,7 +9,7 @@ Database indices contain auxiliary data structures to quickly locate records nee
from storage. Given that Hudi’s design has been heavily optimized for handling mutable change streams, with different
write patterns, Hudi considers [indexing](#indexing) as an integral part of its design and has uniquely supported
[indexing capabilities](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) from its inception, to speed
up upserts on the Data Lakehouse. While Hudi's indices have benefited writers with fast upserts and deletes, Hudi's metadata table
up upserts on the [Data Lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/). While Hudi's indices have benefited writers with fast upserts and deletes, Hudi's metadata table
aims to tap these benefits more generally for both the readers and writers. The metadata table implemented as a single
internal Hudi Merge-On-Read table hosts different types of indices containing table metadata and is designed to be
serverless and independent of compute and query engines. This is similar to common practices in databases where metadata
2 changes: 1 addition & 1 deletion website/docs/rollbacks.md
@@ -18,7 +18,7 @@ page presents insights on how "rollback" in Hudi can automatically clean up hand
manual input from users.

### Handling partially failed commits
Hudi has a lot of platformization built in to ease the operationalization of lakehouse tables. One such feature
Hudi has a lot of platformization built in to ease the operationalization of [lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/) tables. One such feature
is the automatic cleanup of partially failed commits. Users don’t need to run any additional commands to clean up dirty
data or the data produced by failed commits. If you continue to write to Hudi tables, one of your future commits will
take care of cleaning up older data that failed midway during a write/commit. We call this cleanup of a failed commit a
1 change: 1 addition & 0 deletions website/src/theme/Navbar/Content/index.js
@@ -41,6 +41,7 @@ function NavbarContentLayout({left, right}) {
return (
<div className={clsx("navbar__inner", [styles.navbarInnerStyle])}>
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=8f594acf-9b77-44fb-9475-3e82ead1910c" width={0} height={0} alt=""/>
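{/* Matomo image tracker: reports a page view to the ASF-hosted analytics endpoint (analytics.apache.org, idsite=47) */}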
<img referrerpolicy="no-referrer-when-downgrade" src="https://analytics.apache.org/matomo.php?idsite=47&amp;rec=1" width={0} height={0} alt="" />
<div className="navbar__items">{left}</div>
<div className="navbar__items navbar__items--right">{right}</div>
</div>
2 changes: 1 addition & 1 deletion website/versioned_docs/version-0.14.0/metadata.md
@@ -9,7 +9,7 @@ Database indices contain auxiliary data structures to quickly locate records nee
from storage. Given that Hudi’s design has been heavily optimized for handling mutable change streams, with different
write patterns, Hudi considers [indexing](#indexing) as an integral part of its design and has uniquely supported
[indexing capabilities](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) from its inception, to speed
up upserts on the Data Lakehouse. While Hudi's indices have benefited writers with fast upserts and deletes, Hudi's metadata table
up upserts on the [Data Lakehouse](https://hudi.apache.org/blog/2024/07/11/what-is-a-data-lakehouse/). While Hudi's indices have benefited writers with fast upserts and deletes, Hudi's metadata table
aims to tap these benefits more generally for both the readers and writers. The metadata table implemented as a single
internal Hudi Merge-On-Read table hosts different types of indices containing table metadata and is designed to be
serverless and independent of compute and query engines. This is similar to common practices in databases where metadata