[DOCS] Release notes 1.0.0-beta2 (#11618)
* [DOCS] Release notes for 1.0.0-beta2

* add sql with limitations

* Fix build

* Update sidebars and some more items in release notes

* Fix sidebars, links and address other comments
codope authored Jul 16, 2024
1 parent 9a6758a commit 7ec2812
Showing 16 changed files with 238 additions and 13 deletions.
26 changes: 26 additions & 0 deletions website/docs/metadata.md
Expand Up @@ -90,6 +90,32 @@ Following are the different indices currently available under the metadata table
Hudi release, this index aids in locating records faster than other existing indices and can provide a speedup orders of magnitude
faster in large deployments where index lookup dominates write latencies.

#### New Indexes in 1.0.0

- ***Functional Index***:
A [functional index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md)
is an index on a function of a column. If a query has a predicate on a function of a column, the functional index can
be used to speed up the query. A functional index is stored in *func_index_*-prefixed partitions (one for each
function) under the metadata table. A functional index can be created using SQL syntax. See the SQL DDL
docs [here](/docs/next/sql_ddl#create-functional-index-experimental) for more details.

- ***Partition Stats Index***:
The partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This enables
efficient partition pruning even for predicates on non-partition fields. The index is stored in the *partition_stats*
partition under the metadata table. It can be enabled using the following configs (note that you must
specify the columns for which stats should be aggregated):
```properties
hoodie.metadata.index.partition.stats.enable=true
hoodie.metadata.index.column.stats.column.list=<comma-separated-column-names>
```

- ***Secondary Index***:
Secondary indexes allow users to create indexes on columns that are not part of the record key columns in Hudi tables (for
record key fields, Hudi supports the [Record-level Index](/blog/2023/11/01/record-level-index)). Secondary indexes
can be used to speed up queries with predicates on columns other than the record key columns.

To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental).
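
As a rough illustration of how partition-level statistics help with pruning, here is a minimal Python sketch. It assumes the index keeps a min and max value per partition for each indexed column; the `ColumnRange` class, the `partitions_to_scan` helper, and the sample values are hypothetical and do not reflect Hudi's actual storage layout or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRange:
    """Min and max of one column across all files in a partition (assumed layout)."""
    min_value: str
    max_value: str


# Hypothetical per-partition stats for a `rider` column.
partition_stats = {
    "city=san_francisco/state=california": ColumnRange("rider-A", "rider-A"),
    "city=sunnyvale/state=california": ColumnRange("rider-C", "rider-C"),
    "city=austin/state=texas": ColumnRange("rider-E", "rider-E"),
    "city=houston/state=texas": ColumnRange("rider-F", "rider-F"),
}


def partitions_to_scan(stats, lower_bound):
    """Keep only partitions that may hold a value strictly greater than lower_bound."""
    return sorted(p for p, r in stats.items() if r.max_value > lower_bound)


# Predicate `rider > 'rider-D'`: partitions whose max is <= 'rider-D' are skipped.
print(partitions_to_scan(partition_stats, "rider-D"))
```

With these sample values, only the two Texas partitions survive the predicate, even though `rider` is not a partition field.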

## Enable Hudi Metadata Table and Multi-Modal Index in write side

Following are the Spark based basic configs that are needed to enable metadata and multi-modal indices. For advanced configs please refer
88 changes: 87 additions & 1 deletion website/docs/sql_ddl.md
Expand Up @@ -217,7 +217,13 @@ DROP INDEX [IF EXISTS] index_name ON [TABLE] table_name
- Both index and column on which the index is created can be qualified with some options in the form of key-value pairs.
We will see this with an example of functional index below.

#### Create Functional Index
:::note
Except for `files`, `column_stats`, `bloom_filters` and `record_index`, all indexes are experimental. We
encourage users to try out these features on new tables and provide feedback. The current
limitations of these indexes are listed below.
:::

#### Create Functional Index (Experimental)

A [functional index](https://github.com/apache/hudi/blob/00ece7bce0a4a8d0019721a28049723821e01842/rfc/rfc-63/rfc-63.md)
is an index on a function of a column. It is a new addition to Hudi's [multi-modal indexing](https://hudi.apache.org/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi)
Expand Down Expand Up @@ -328,6 +334,86 @@ Project [city#2970, fare#2969, rider#2967, driver#2968], Statistics(sizeInBytes=
```
</details>

#### Create Partition Stats and Secondary Index (Experimental)

Hudi supports various [indexes](/docs/next/metadata#metadata-table-indices). Let us see how we can use them in the following example.

```sql
DROP TABLE IF EXISTS hudi_table;
-- Let us create a table with multiple partition fields, and enable record index and partition stats index
CREATE TABLE hudi_table (
ts BIGINT,
id STRING,
rider STRING,
driver STRING,
fare DOUBLE,
city STRING,
state STRING
) USING hudi
OPTIONS(
primaryKey ='id',
hoodie.metadata.record.index.enable = 'true', -- enable record index
hoodie.metadata.index.partition.stats.enable = 'true', -- enable partition stats index
hoodie.metadata.index.column.stats.column.list = 'rider' -- create partition stats index on rider column
)
PARTITIONED BY (city, state)
LOCATION 'file:///tmp/hudi_test_table';

INSERT INTO hudi_table VALUES (1695159649,'trip1','rider-A','driver-K',19.10,'san_francisco','california');
INSERT INTO hudi_table VALUES (1695091554,'trip2','rider-C','driver-M',27.70,'sunnyvale','california');
INSERT INTO hudi_table VALUES (1695332066,'trip3','rider-E','driver-O',93.50,'austin','texas');
INSERT INTO hudi_table VALUES (1695516137,'trip4','rider-F','driver-P',34.15,'houston','texas');

-- Enable data skipping for the reader
set hoodie.metadata.enable=true;
set hoodie.enable.data.skipping=true;

-- simple partition predicate --
select * from hudi_table where city = 'sunnyvale';
20240710215107477 20240710215107477_0_0 trip2 city=sunnyvale/state=california 1dcb14a9-bc4a-4eac-aab5-015f2254b7ec-0_0-40-75_20240710215107477.parquet 1695091554 trip2 rider-C driver-M 27.7 sunnyvale california
Time taken: 0.58 seconds, Fetched 1 row(s)

-- simple partition predicate on other partition field --
select * from hudi_table where state = 'texas';
20240710215119846 20240710215119846_0_0 trip4 city=houston/state=texas 08c6ed2c-a87b-4798-8f70-6d8b16cb1932-0_0-74-133_20240710215119846.parquet 1695516137 trip4 rider-F driver-P 34.15 houston texas
20240710215110584 20240710215110584_0_0 trip3 city=austin/state=texas 0ab2243c-cc08-4da3-8302-4ce0b4c47a08-0_0-57-104_20240710215110584.parquet 1695332066 trip3 rider-E driver-O 93.5 austin texas
Time taken: 0.124 seconds, Fetched 2 row(s)

-- predicate on a column for which partition stats are present --
select id, rider, city, state from hudi_table where rider > 'rider-D';
trip4 rider-F houston texas
trip3 rider-E austin texas
Time taken: 0.703 seconds, Fetched 2 row(s)

-- record key predicate --
SELECT id, rider, driver FROM hudi_table WHERE id = 'trip1';
trip1 rider-A driver-K
Time taken: 0.368 seconds, Fetched 1 row(s)

-- create secondary index on driver --
CREATE INDEX driver_idx ON hudi_table USING secondary_index(driver);

-- secondary key predicate --
SELECT id, driver, city, state FROM hudi_table WHERE driver IN ('driver-K', 'driver-M');
trip1 driver-K san_francisco california
trip2 driver-M sunnyvale california
Time taken: 0.83 seconds, Fetched 2 row(s)
```

**Limitations of using these indexes:**

- Unlike column stats, the partition stats index is not created automatically for all columns. Users must specify the list of
columns for which they want to create the partition stats index.
- Predicates on internal meta fields such as `_hoodie_record_key` or `_hoodie_partition_path` cannot be used for data
skipping. Queries with such predicates cannot leverage the indexes.
- Secondary indexes are not supported on nested fields.
- Index updates can fail with schema evolution.
- If multiple indexes are present, secondary index and functional index updates can fail.
- Only one index can be created at a time using the [async indexer](/docs/next/metadata_indexing).
- Ensure the native HFile reader is disabled (`_hoodie.hfile.use.native.reader`) to leverage the secondary index. Default value for this config is `false`.

These limitations will be addressed before 1.0.0 is made generally available.

### Setting Hudi configs

There are different ways you can pass the configs for a given hudi table.
24 changes: 24 additions & 0 deletions website/docs/sql_dml.md
Expand Up @@ -266,6 +266,30 @@ DELETE FROM hudi_table WHERE price < 100;
Delete queries only work in batch execution mode.
:::

### Lookup Joins

A lookup join is typically used to enrich a table with data that is queried from an external system. The join requires
one table to have a processing time attribute and the other table to be backed by a lookup source connector.

```sql
CREATE TABLE datagen_source(
id int,
name STRING,
proctime as PROCTIME()
) WITH (
'connector' = 'datagen',
'rows-per-second'='1',
'number-of-rows' = '2',
'fields.id.kind'='sequence',
'fields.id.start'='1',
'fields.id.end'='2'
);

SELECT o.id,o.name,b.id as id2
FROM datagen_source AS o
JOIN hudi_table/*+ OPTIONS('lookup.join.cache.ttl'= '2 day') */ FOR SYSTEM_TIME AS OF o.proctime AS b on o.id = b.id;
```
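
To make the join semantics concrete, the following Python sketch models a processing-time lookup join against a cached copy of the lookup table. It is an illustration under assumptions, not Flink's or Hudi's implementation: the `CachedLookupTable` class, the `lookup_join` function, and the "reload everything once the TTL has elapsed" policy are hypothetical stand-ins for what `lookup.join.cache.ttl` controls.

```python
class CachedLookupTable:
    """Toy lookup source that reloads all rows once the cache is older than the TTL."""

    def __init__(self, load_rows, ttl_seconds):
        self._load_rows = load_rows      # hypothetical callable returning {id: row}
        self._ttl_seconds = ttl_seconds
        self._cache = None
        self._loaded_at = None
        self.loads = 0                   # counts reloads, for demonstration only

    def lookup(self, key, now):
        expired = self._cache is None or now - self._loaded_at >= self._ttl_seconds
        if expired:
            self._cache = dict(self._load_rows())
            self._loaded_at = now
            self.loads += 1
        return self._cache.get(key)


def lookup_join(stream_rows, table, now):
    """Inner join: emit (o.id, o.name, b.id) for stream rows with a match at time `now`."""
    joined = []
    for row in stream_rows:
        match = table.lookup(row["id"], now)
        if match is not None:
            joined.append((row["id"], row["name"], match["id"]))
    return joined


hudi_rows = {1: {"id": 1}, 2: {"id": 2}}
table = CachedLookupTable(lambda: hudi_rows, ttl_seconds=2 * 24 * 3600)  # '2 day'
stream = [{"id": 1, "name": "a"}, {"id": 3, "name": "c"}]
result = lookup_join(stream, table, now=0)
print(result)
```

Each stream row is enriched with whatever the lookup table holds at the moment the row is processed, so a longer TTL trades freshness for fewer reloads.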

### Setting Writer/Reader Configs
With Flink SQL, you can additionally set the writer/reader configs along with the query.

6 changes: 5 additions & 1 deletion website/releases/download.md
Expand Up @@ -6,6 +6,10 @@ toc: true
last_modified_at: 2022-12-27T15:59:57-04:00
---

### Release 1.0.0-beta2
* Source Release : [Apache Hudi 1.0.0-beta2 Source Release](https://downloads.apache.org/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz) ([asc](https://downloads.apache.org/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.sha512))
* Release Note : ([Release Note for Apache Hudi 1.0.0-beta2](/releases/release-1.0.0-beta2))

### Release 0.15.0
* Source Release : [Apache Hudi 0.15.0 Source Release](https://downloads.apache.org/hudi/0.15.0/hudi-0.15.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.15.0/hudi-0.15.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.15.0/hudi-0.15.0.src.tgz.sha512))
* Release Note : ([Release Note for Apache Hudi 0.15.0](/releases/release-0.15.0))
Expand All @@ -16,7 +20,7 @@ last_modified_at: 2022-12-27T15:59:57-04:00

### Release 1.0.0-beta1
* Source Release : [Apache Hudi 1.0.0-beta1 Source Release](https://www.apache.org/dyn/closer.lua/hudi/1.0.0-beta1/hudi-1.0.0-beta1.src.tgz) ([asc](https://downloads.apache.org/hudi/1.0.0-beta1/hudi-1.0.0-beta1.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/1.0.0-beta1/hudi-1.0.0-beta1.src.tgz.sha512))
* Release Note : ([Release Note for Apache Hudi 0.14.0](/releases/release-1.0.0-beta1))
* Release Note : ([Release Note for Apache Hudi 1.0.0-beta1](/releases/release-1.0.0-beta1))

### Release 0.12.3
[Long Term Support](/releases/release-0.12.3#long-term-support): this is the latest stable release
2 changes: 1 addition & 1 deletion website/releases/older-releases.md
@@ -1,6 +1,6 @@
---
title: "Older Releases"
sidebar_position: 19
sidebar_position: 20
layout: releases
toc: true
last_modified_at: 2020-05-28T08:40:00-07:00
2 changes: 1 addition & 1 deletion website/releases/release-0.10.0.md
@@ -1,6 +1,6 @@
---
title: "Release 0.10.0"
sidebar_position: 14
sidebar_position: 15
layout: releases
toc: true
---
2 changes: 1 addition & 1 deletion website/releases/release-0.10.1.md
@@ -1,6 +1,6 @@
---
title: "Release 0.10.1"
sidebar_position: 13
sidebar_position: 14
layout: releases
toc: true
---
2 changes: 1 addition & 1 deletion website/releases/release-0.11.0.md
@@ -1,6 +1,6 @@
---
title: "Release 0.11.0"
sidebar_position: 12
sidebar_position: 13
layout: releases
toc: true
last_modified_at: 2022-01-27T22:07:00+08:00
2 changes: 1 addition & 1 deletion website/releases/release-0.11.1.md
@@ -1,6 +1,6 @@
---
title: "Release 0.11.1"
sidebar_position: 11
sidebar_position: 12
layout: releases
toc: true
last_modified_at: 2022-06-19T23:30:00-07:00
2 changes: 1 addition & 1 deletion website/releases/release-0.12.0.md
@@ -1,6 +1,6 @@
---
title: "Release 0.12.0"
sidebar_position: 10
sidebar_position: 11
layout: releases
toc: true
---
2 changes: 1 addition & 1 deletion website/releases/release-0.12.1.md
@@ -1,6 +1,6 @@
---
title: "Release 0.12.1"
sidebar_position: 9
sidebar_position: 10
layout: releases
toc: true
---
2 changes: 1 addition & 1 deletion website/releases/release-0.12.2.md
@@ -1,6 +1,6 @@
---
title: "Release 0.12.2"
sidebar_position: 8
sidebar_position: 9
layout: releases
toc: true
---
2 changes: 1 addition & 1 deletion website/releases/release-0.12.3.md
@@ -1,6 +1,6 @@
---
title: "Release 0.12.3"
sidebar_position: 6
sidebar_position: 7
layout: releases
toc: true
last_modified_at: 2023-04-23T10:30:00+05:30
2 changes: 1 addition & 1 deletion website/releases/release-0.13.0.md
@@ -1,6 +1,6 @@
---
title: "Release 0.13.0"
sidebar_position: 7
sidebar_position: 8
layout: releases
toc: true
---
2 changes: 1 addition & 1 deletion website/releases/release-0.13.1.md
@@ -1,6 +1,6 @@
---
title: "Release 0.13.1"
sidebar_position: 5
sidebar_position: 6
layout: releases
toc: true
last_modified_at: 2023-05-25T13:00:00-08:00
85 changes: 85 additions & 0 deletions website/releases/release-1.0.0-beta2.md
@@ -0,0 +1,85 @@
---
title: "Release 1.0.0-beta2"
sidebar_position: 1
layout: releases
toc: true
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## [Release 1.0.0-beta2](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta2) ([docs](/docs/next/quick-start-guide))

Apache Hudi 1.0.0-beta2 is the second beta release of Apache Hudi. This release is meant for early adopters to try
out the new features and provide feedback. The release is not meant for production use.

## Migration Guide

This release contains major format changes, as described in the highlights below. We encourage users to try out the
**1.0.0-beta2** features on new tables. The 1.0 general availability (GA) release will support automatic table upgrades
from 0.x versions and full backward compatibility when reading 0.x Hudi tables using 1.0, for a
seamless migration experience.

:::caution
Given that the timeline format and log file format have changed in this **beta release**, it is recommended not to attempt
rolling upgrades from older versions to this release.
:::

## Highlights

### Format changes

[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic covering all the format change proposals,
which are also partly covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The following are the main
changes in this release:

#### Timeline

No major changes in this release. Refer to [1.0.0-beta1#timeline](release-1.0.0-beta1.md#timeline) for more details.

#### Log File Format

In addition to the fields in the log file header added in [1.0.0-beta1](release-1.0.0-beta1.md#log-file-format), we also
store a flag, `IS_PARTIAL`, to indicate whether the log block contains partial updates.

### Metadata indexes

In 1.0.0-beta1, we added support for functional index. In 1.0.0-beta2, we have added support for secondary indexes and
partition stats index to the [multi-modal indexing](/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) subsystem.

#### Secondary Index

Secondary indexes allow users to create indexes on columns that are not part of the record key columns in Hudi tables (for
record key fields, Hudi supports the [Record-level Index](/blog/2023/11/01/record-level-index)). Secondary indexes can be used to speed up
queries with predicates on columns other than the record key columns.

#### Partition Stats Index

Partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This helps
in efficient partition pruning even for non-partition fields.

To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental).
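
The sketch below shows, in plain Python, the kind of lookup a secondary index makes possible. It assumes a two-step resolution (secondary key to record keys, then record key to file through a record-level index); the dictionaries, file names, and the `files_for_drivers` helper are hypothetical and only illustrate the idea, not Hudi's on-disk format or API.

```python
# Hypothetical index contents, modelled on the sample trips in the SQL guide.
secondary_index = {          # driver -> record keys
    "driver-K": ["trip1"],
    "driver-M": ["trip2"],
    "driver-O": ["trip3"],
    "driver-P": ["trip4"],
}
record_index = {             # record key -> file that holds the record
    "trip1": "file-1.parquet",
    "trip2": "file-2.parquet",
    "trip3": "file-3.parquet",
    "trip4": "file-4.parquet",
}


def files_for_drivers(drivers):
    """Resolve a predicate like `driver IN (...)` to the files that must be read."""
    record_keys = [key for d in drivers for key in secondary_index.get(d, [])]
    return sorted({record_index[key] for key in record_keys})


print(files_for_drivers(["driver-K", "driver-M"]))
```

Without such an index, a predicate on `driver` would have to consider every file; with it, only the files holding matching records are read.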

### API Changes

#### Positional Merging with Filegroup Reader

In 1.0.0-beta1, we added a new [filegroup reader](/releases/release-1.0.0-beta1#new-filegroup-reader), which provides
5.7x performance benefits for snapshot queries on Merge-on-Read tables with updates. The reader now
provides position-based merging, as an alternative to existing key-based merging, and skipping pages based on record
positions. The new filegroup reader is integrated with Spark and Hive, and enabled by default. To enable positional
merging, set the following config:

```properties
hoodie.merge.use.record.positions=true
```
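
The difference between the two merge strategies can be sketched in a few lines of Python. This is a simplified model, not the filegroup reader's code: it assumes the base file is an ordered list of records and that each logged update carries either the record key or the row position in the base file. The function names and sample data are hypothetical.

```python
# Base file rows as (record_key, fare), in file order.
base_file = [("trip1", 19.10), ("trip2", 27.70), ("trip3", 93.50)]


def merge_by_key(base, updates_by_key):
    """Key-based merging: match every base record against updates by record key."""
    return [(key, updates_by_key.get(key, fare)) for key, fare in base]


def merge_by_position(base, updates_by_position):
    """Position-based merging: apply each update directly at its row position."""
    merged = list(base)
    for position, new_fare in updates_by_position.items():
        key, _ = merged[position]
        merged[position] = (key, new_fare)
    return merged


# The same update to trip2, expressed by key and by row position.
by_key = merge_by_key(base_file, {"trip2": 30.0})
by_position = merge_by_position(base_file, {1: 30.0})
print(by_key == by_position)
```

Both strategies produce the same merged rows; the positional variant only touches the rows named in the log, which is also what makes skipping untouched pages possible.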

### Hudi-Flink Enhancements

This release adds support for [lookup joins](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join).
A lookup join is typically used to enrich a table with data that is queried from an external system. The join requires
one table to have a processing time attribute and the other table to be backed by a lookup source connector. Head over
to the [Flink SQL guide](/docs/next/sql_dml#lookup-joins) to try out this feature.

## Raw Release Notes

The raw release notes are available [here](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12354810).
