[DOCS] Release notes 1.0.0-beta2 (#11618)

* [DOCS] Release notes for 1.0.0-beta2
* add sql with limitations
* Fix build
* Update sidebars and some more items in release notes
* Fix sidebars, links and address other comments
apache · Jul 16, 2024 · 7ec2812
1 parent 9a6758a
commit 7ec2812

Showing 16 changed files with 238 additions and 13 deletions.
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
@@ -90,6 +90,32 @@ Following are the different indices currently available under the metadata table
  Hudi release, this index aids in locating records faster than other existing indices and can provide a speedup orders of magnitude 
  faster in large deployments where index lookup dominates write latencies.
 
+#### New Indexes in 1.0.0
+
+- ***Functional Index***:
+ A [functional index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md)
+ is an index on a function of a column. If a query has a predicate on a function of a column, the functional index can
+ be used to speed up the query. The functional index is stored in *func_index_*-prefixed partitions (one for each
+ function) under the metadata table and can be created using SQL syntax. See the
+ [SQL DDL docs](/docs/next/sql_ddl#create-functional-index-experimental) for more details.
+
+- ***Partition Stats Index***:
+ The partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This
+ enables efficient partition pruning even for predicates on non-partition fields. The index is stored in the
+ *partition_stats* partition under the metadata table. It can be enabled using the following configs (the columns for
+ which stats should be aggregated must be specified):
+ ```properties
+ hoodie.metadata.index.partition.stats.enable=true
+ hoodie.metadata.index.column.stats.column.list=<comma-separated-column-names>
+ ```
+
+- ***Secondary Index***:
+ Secondary indexes allow users to create indexes on columns that are not part of record key columns in Hudi tables (for
+ record key fields, Hudi supports the [Record-level Index](/blog/2023/11/01/record-level-index)). Secondary indexes
+ can be used to speed up queries with predicates on columns other than record key columns.
+
+To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental).
+
 ## Enable Hudi Metadata Table and Multi-Modal Index in write side
 
 Following are the Spark based basic configs that are needed to enable metadata and multi-modal indices. For advanced configs please refer 
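
To make the partition stats index in the `metadata.md` change above more concrete, here is a minimal Python sketch of the idea. It is not Hudi's implementation: the row data, the function names, and the choice of min/max as the aggregated statistics are assumptions made for illustration. It only shows how per-partition statistics on a non-partition column can rule out whole partitions for a range predicate.

```python
# Hypothetical rows (partition_path, rider); the values mirror the sample data
# used in the SQL guide, but the layout here is an assumption for illustration.
ROWS = [
    ("city=san_francisco/state=california", "rider-A"),
    ("city=sunnyvale/state=california", "rider-C"),
    ("city=austin/state=texas", "rider-E"),
    ("city=houston/state=texas", "rider-F"),
]


def build_partition_stats(rows):
    """Aggregate the min and max of the indexed column for each partition."""
    stats = {}
    for partition, value in rows:
        lo, hi = stats.get(partition, (value, value))
        stats[partition] = (min(lo, value), max(hi, value))
    return stats


def prune_greater_than(stats, literal):
    """Keep only partitions whose max value could satisfy `column > literal`."""
    return sorted(p for p, (_, hi) in stats.items() if hi > literal)


stats = build_partition_stats(ROWS)
candidates = prune_greater_than(stats, "rider-D")
print(candidates)
```

With the predicate `rider > 'rider-D'`, only the two Texas partitions remain as scan candidates, even though `rider` is not a partition field.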

diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
@@ -217,7 +217,13 @@ DROP INDEX [IF EXISTS] index_name ON [TABLE] table_name
 - Both index and column on which the index is created can be qualified with some options in the form of key-value pairs.
  We will see this with an example of functional index below. 
 
-#### Create Functional Index
+:::note
+All indexes other than `files`, `column_stats`, `bloom_filters` and `record_index` are experimental. We
+encourage users to try out these features on new tables and provide feedback. The current limitations of these
+indexes are listed below.
+:::
+
+#### Create Functional Index (Experimental)
 
 A [functional index](https://github.com/apache/hudi/blob/00ece7bce0a4a8d0019721a28049723821e01842/rfc/rfc-63/rfc-63.md) 
 is an index on a function of a column. It is a new addition to Hudi's [multi-modal indexing](https://hudi.apache.org/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) 
@@ -328,6 +334,86 @@ Project [city#2970, fare#2969, rider#2967, driver#2968], Statistics(sizeInBytes=
 ```
 </details>
 
+#### Create Partition Stats and Secondary Index (Experimental)
+
+Hudi supports various [indexes](/docs/next/metadata#metadata-table-indices). Let us see how we can use them in the following example.
+
+```sql
+DROP TABLE IF EXISTS hudi_table;
+-- Let us create a table with multiple partition fields, and enable record index and partition stats index 
+CREATE TABLE hudi_table (
+ ts BIGINT,
+ id STRING,
+ rider STRING,
+ driver STRING,
+ fare DOUBLE,
+ city STRING,
+ state STRING
+) USING hudi
+ OPTIONS(
+ primaryKey ='id',
+ hoodie.metadata.record.index.enable = 'true', -- enable record index
+ hoodie.metadata.index.partition.stats.enable = 'true', -- enable partition stats index
+ hoodie.metadata.index.column.stats.column.list = 'rider' -- create partition stats index on rider column
+)
+PARTITIONED BY (city, state)
+LOCATION 'file:///tmp/hudi_test_table';
+
+INSERT INTO hudi_table VALUES (1695159649,'trip1','rider-A','driver-K',19.10,'san_francisco','california');
+INSERT INTO hudi_table VALUES (1695091554,'trip2','rider-C','driver-M',27.70,'sunnyvale','california');
+INSERT INTO hudi_table VALUES (1695332066,'trip3','rider-E','driver-O',93.50,'austin','texas');
+INSERT INTO hudi_table VALUES (1695516137,'trip4','rider-F','driver-P',34.15,'houston','texas');
+
+-- Enable data skipping for the reader
+set hoodie.metadata.enable=true;
+set hoodie.enable.data.skipping=true;
+
+-- simple partition predicate --
+select * from hudi_table where city = 'sunnyvale';
+20240710215107477 20240710215107477_0_0 trip2 city=sunnyvale/state=california 1dcb14a9-bc4a-4eac-aab5-015f2254b7ec-0_0-40-75_20240710215107477.parquet 1695091554 trip2 rider-C driver-M 27.7 sunnyvale california
+Time taken: 0.58 seconds, Fetched 1 row(s)
+
+-- simple partition predicate on other partition field --
+select * from hudi_table where state = 'texas';
+20240710215119846 20240710215119846_0_0 trip4 city=houston/state=texas 08c6ed2c-a87b-4798-8f70-6d8b16cb1932-0_0-74-133_20240710215119846.parquet 1695516137 trip4 rider-F driver-P 34.15 houston texas
+20240710215110584 20240710215110584_0_0 trip3 city=austin/state=texas 0ab2243c-cc08-4da3-8302-4ce0b4c47a08-0_0-57-104_20240710215110584.parquet 1695332066 trip3 rider-E driver-O 93.5 austin texas
+Time taken: 0.124 seconds, Fetched 2 row(s)
+
+-- predicate on a column for which partition stats are present --
+select id, rider, city, state from hudi_table where rider > 'rider-D';
+trip4 rider-F houston texas
+trip3 rider-E austin texas
+Time taken: 0.703 seconds, Fetched 2 row(s)
+
+-- record key predicate --
+SELECT id, rider, driver FROM hudi_table WHERE id = 'trip1';
+trip1 rider-A driver-K
+Time taken: 0.368 seconds, Fetched 1 row(s)
+
+-- create secondary index on driver --
+CREATE INDEX driver_idx ON hudi_table USING secondary_index(driver);
+
+-- secondary key predicate --
+SELECT id, driver, city, state FROM hudi_table WHERE driver IN ('driver-K', 'driver-M');
+trip1 driver-K san_francisco california
+trip2 driver-M sunnyvale california
+Time taken: 0.83 seconds, Fetched 2 row(s)
+```
+
+**Limitations of using these indexes:**
+
+- Unlike column stats, the partition stats index is not created automatically for all columns. Users must specify the
+ list of columns for which they want to create the partition stats index.
+- Predicates on internal meta fields such as `_hoodie_record_key` or `_hoodie_partition_path` cannot be used for data
+ skipping, so queries with such predicates cannot leverage the indexes.
+- Secondary index is not supported for nested fields.
+- Index updates can fail with schema evolution.
+- If multiple indexes are present, secondary index and functional index updates can fail.
+- Only one index can be created at a time using the [async indexer](/docs/next/metadata_indexing).
+- Ensure the native HFile reader is disabled (`_hoodie.hfile.use.native.reader`) to leverage the secondary index. The default value for this config is `false`.
+
+These limitations will be addressed before 1.0.0 is made generally available.
+
 ### Setting Hudi configs 
 
 There are different ways you can pass the configs for a given hudi table. 
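
As a rough mental model for the secondary index used in the `sql_ddl.md` example above (`CREATE INDEX driver_idx ... USING secondary_index(driver)`), the following Python sketch maps values of a non-key column back to record keys. It is only an illustration under stated assumptions: the in-memory dictionaries and the function names are hypothetical, and how Hudi stores and resolves the index inside the metadata table is not covered by this change.

```python
# Hypothetical table contents keyed by record key (`id`), mirroring the sample rows.
TABLE = {
    "trip1": {"driver": "driver-K", "city": "san_francisco"},
    "trip2": {"driver": "driver-M", "city": "sunnyvale"},
    "trip3": {"driver": "driver-O", "city": "austin"},
    "trip4": {"driver": "driver-P", "city": "houston"},
}


def build_secondary_index(table, column):
    """Map each value of a non-key column to the record keys that contain it."""
    index = {}
    for record_key, row in table.items():
        index.setdefault(row[column], []).append(record_key)
    return index


def lookup(table, index, column, values):
    """Resolve an IN-list predicate through the index instead of scanning every row."""
    keys = sorted(k for v in values for k in index.get(v, []))
    return [(k, table[k][column], table[k]["city"]) for k in keys]


driver_idx = build_secondary_index(TABLE, "driver")
matches = lookup(TABLE, driver_idx, "driver", ["driver-K", "driver-M"])
print(matches)
```

Only the rows whose keys appear in the index entries for `driver-K` and `driver-M` are touched, which is the effect the secondary key predicate in the SQL example relies on.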

diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
@@ -266,6 +266,30 @@ DELETE FROM hudi_table WHERE price < 100;
 DELETE queries only work in batch execution mode.
 :::
 
+### Lookup Joins
+
+A lookup join is typically used to enrich a table with data that is queried from an external system. The join requires
+one table to have a processing time attribute and the other table to be backed by a lookup source connector.
+
+```sql
+CREATE TABLE datagen_source(
+ id int,
+ name STRING,
+ proctime as PROCTIME()
+) WITH (
+'connector' = 'datagen',
+'rows-per-second'='1',
+'number-of-rows' = '2',
+'fields.id.kind'='sequence',
+'fields.id.start'='1',
+'fields.id.end'='2'
+);
+
+SELECT o.id,o.name,b.id as id2
+FROM datagen_source AS o
+JOIN hudi_table/*+ OPTIONS('lookup.join.cache.ttl'= '2 day') */ FOR SYSTEM_TIME AS OF o.proctime AS b on o.id = b.id; 
+```
+
 ### Setting Writer/Reader Configs
 With Flink SQL, you can additionally set the writer/reader configs along with the query.
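
The lookup join added in the `sql_dml.md` change above can be pictured with a small Python sketch. This is a generic illustration, not the Flink or Hudi connector logic: the `TtlLookupCache` class, the `lookup_join` function, and the per-key refresh policy are all assumptions, and the connector's actual caching behavior behind `lookup.join.cache.ttl` is not described in these docs. The sketch shows a stream of rows being enriched from a lookup table whose cached entries expire after a time-to-live.

```python
class TtlLookupCache:
    """Caches lookup results and refetches an entry once it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock):
        self._fetch = fetch          # function: key -> row or None
        self._ttl = ttl_seconds
        self._clock = clock          # function returning the current time in seconds
        self._entries = {}           # key -> (fetched_at, value)
        self.fetch_count = 0

    def get(self, key):
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = self._fetch(key)
        self.fetch_count += 1
        self._entries[key] = (now, value)
        return value


def lookup_join(stream_rows, cache):
    """Inner-join each incoming row with the lookup side on `id`."""
    joined = []
    for row in stream_rows:
        match = cache.get(row["id"])
        if match is not None:
            joined.append((row["id"], row["name"], match["id"]))
    return joined


# Hypothetical lookup-side table and a fake clock so the example is deterministic.
DIMENSION = {1: {"id": 1}, 2: {"id": 2}}
now = [0.0]
cache = TtlLookupCache(DIMENSION.get, ttl_seconds=2 * 24 * 3600, clock=lambda: now[0])

stream = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 1, "name": "c"}]
result = lookup_join(stream, cache)
print(result)
print(cache.fetch_count)
```

The third stream row reuses the cached entry for `id = 1`, so the lookup side is read only twice for three incoming rows.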
 

diff --git a/website/releases/download.md b/website/releases/download.md
@@ -6,6 +6,10 @@ toc: true
 last_modified_at: 2022-12-27T15:59:57-04:00
 ---
 
+### Release 1.0.0-beta2
+* Source Release : [Apache Hudi 1.0.0-beta2 Source Release](https://downloads.apache.org/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz) ([asc](https://downloads.apache.org/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.sha512))
+* Release Note : ([Release Note for Apache Hudi 1.0.0-beta2](/releases/release-1.0.0-beta2))
+
 ### Release 0.15.0
 * Source Release : [Apache Hudi 0.15.0 Source Release](https://downloads.apache.org/hudi/0.15.0/hudi-0.15.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.15.0/hudi-0.15.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.15.0/hudi-0.15.0.src.tgz.sha512))
 * Release Note : ([Release Note for Apache Hudi 0.15.0](/releases/release-0.15.0))
@@ -16,7 +20,7 @@ last_modified_at: 2022-12-27T15:59:57-04:00
 
 ### Release 1.0.0-beta1
 * Source Release : [Apache Hudi 1.0.0-beta1 Source Release](https://www.apache.org/dyn/closer.lua/hudi/1.0.0-beta1/hudi-1.0.0-beta1.src.tgz) ([asc](https://downloads.apache.org/hudi/1.0.0-beta1/hudi-1.0.0-beta1.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/1.0.0-beta1/hudi-1.0.0-beta1.src.tgz.sha512))
-* Release Note : ([Release Note for Apache Hudi 0.14.0](/releases/release-1.0.0-beta1))
+* Release Note : ([Release Note for Apache Hudi 1.0.0-beta1](/releases/release-1.0.0-beta1))
 
 ### Release 0.12.3
 [Long Term Support](/releases/release-0.12.3#long-term-support): this is the latest stable release

diff --git a/website/releases/older-releases.md b/website/releases/older-releases.md
@@ -1,6 +1,6 @@
 ---
 title: "Older Releases"
-sidebar_position: 19
+sidebar_position: 20
 layout: releases
 toc: true
 last_modified_at: 2020-05-28T08:40:00-07:00

diff --git a/website/releases/release-0.10.0.md b/website/releases/release-0.10.0.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.10.0"
-sidebar_position: 14
+sidebar_position: 15
 layout: releases
 toc: true
 ---

diff --git a/website/releases/release-0.10.1.md b/website/releases/release-0.10.1.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.10.1"
-sidebar_position: 13
+sidebar_position: 14
 layout: releases
 toc: true
 ---

diff --git a/website/releases/release-0.11.0.md b/website/releases/release-0.11.0.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.11.0"
-sidebar_position: 12
+sidebar_position: 13
 layout: releases
 toc: true
 last_modified_at: 2022-01-27T22:07:00+08:00

diff --git a/website/releases/release-0.11.1.md b/website/releases/release-0.11.1.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.11.1"
-sidebar_position: 11
+sidebar_position: 12
 layout: releases
 toc: true
 last_modified_at: 2022-06-19T23:30:00-07:00

diff --git a/website/releases/release-0.12.0.md b/website/releases/release-0.12.0.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.12.0"
-sidebar_position: 10
+sidebar_position: 11
 layout: releases
 toc: true
 ---

diff --git a/website/releases/release-0.12.1.md b/website/releases/release-0.12.1.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.12.1"
-sidebar_position: 9
+sidebar_position: 10
 layout: releases
 toc: true
 ---

diff --git a/website/releases/release-0.12.2.md b/website/releases/release-0.12.2.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.12.2"
-sidebar_position: 8
+sidebar_position: 9
 layout: releases
 toc: true
 ---

diff --git a/website/releases/release-0.12.3.md b/website/releases/release-0.12.3.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.12.3"
-sidebar_position: 6
+sidebar_position: 7
 layout: releases
 toc: true
 last_modified_at: 2023-04-23T10:30:00+05:30

diff --git a/website/releases/release-0.13.0.md b/website/releases/release-0.13.0.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.13.0"
-sidebar_position: 7
+sidebar_position: 8
 layout: releases
 toc: true
 ---

diff --git a/website/releases/release-0.13.1.md b/website/releases/release-0.13.1.md
@@ -1,6 +1,6 @@
 ---
 title: "Release 0.13.1"
-sidebar_position: 5
+sidebar_position: 6
 layout: releases
 toc: true
 last_modified_at: 2023-05-25T13:00:00-08:00

diff --git a/website/releases/release-1.0.0-beta2.md b/website/releases/release-1.0.0-beta2.md
@@ -0,0 +1,85 @@
+---
+title: "Release 1.0.0-beta2"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 1.0.0-beta2](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta2) ([docs](/docs/next/quick-start-guide))
+
+Apache Hudi 1.0.0-beta2 is the second beta release of Apache Hudi. This release is meant for early adopters to try
+out the new features and provide feedback. The release is not meant for production use.
+
+## Migration Guide
+
+This release contains major format changes, as described in the highlights below. We encourage users to try out the
+**1.0.0-beta2** features on new tables. The 1.0 general availability (GA) release will support automatic table upgrades
+from 0.x versions and full backward compatibility when reading 0.x Hudi tables using 1.0, for a seamless migration
+experience.
+
+:::caution
+Given that the timeline format and log file format have changed in this **beta release**, rolling upgrades from older
+versions to this release are not recommended.
+:::
+
+## Highlights
+
+### Format changes
+
+[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic covering all the format change proposals,
+which are also partly covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The following are the main
+changes in this release:
+
+#### Timeline
+
+No major changes in this release. Refer to [1.0.0-beta1#timeline](release-1.0.0-beta1.md#timeline) for more details.
+
+#### Log File Format
+
+In addition to the fields in the log file header added in [1.0.0-beta1](release-1.0.0-beta1.md#log-file-format), we also
+store a flag, `IS_PARTIAL`, to indicate whether the log block contains partial updates.
+
+### Metadata indexes
+
+In 1.0.0-beta1, we added support for functional index. In 1.0.0-beta2, we have added support for secondary indexes and
+partition stats index to the [multi-modal indexing](/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) subsystem.
+
+#### Secondary Index
+
+Secondary indexes allow users to create indexes on columns that are not part of record key columns in Hudi tables (for 
+record key fields, Hudi supports the [Record-level Index](/blog/2023/11/01/record-level-index)). Secondary indexes can be used to speed up
+queries with predicates on columns other than record key columns.
+
+#### Partition Stats Index
+
+Partition stats index aggregates statistics at the partition level for the columns for which it is enabled. This helps
+in efficient partition pruning even for non-partition fields.
+
+To try out these features, refer to the [SQL guide](/docs/next/sql_ddl#create-partition-stats-and-secondary-index-experimental).
+
+### API Changes
+
+#### Positional Merging with Filegroup Reader
+
+In 1.0.0-beta1, we added a new [filegroup reader](/releases/release-1.0.0-beta1#new-filegroup-reader), which provides
+5.7x performance benefits for snapshot queries on Merge-on-Read tables with updates. The reader now
+provides position-based merging, as an alternative to existing key-based merging, and skipping pages based on record
+positions. The new filegroup reader is integrated with Spark and Hive, and is enabled by default. To enable positional
+merging, set the following config:
+
+```properties
+hoodie.merge.use.record.positions=true
+```
+
+### Hudi-Flink Enhancements
+
+This release adds support for [lookup joins](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join).
+A lookup join is typically used to enrich a table with data that is queried from an external system. The join requires
+one table to have a processing time attribute and the other table to be backed by a lookup source connector. Head over 
+to the [Flink SQL guide](/docs/next/sql_dml#lookup-joins) to try out this feature.
+
+## Raw Release Notes
+
+The raw release notes are available [here](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12354810).
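
The difference between key-based and position-based merging, described under "Positional Merging with Filegroup Reader" in the release notes above, can be sketched in a few lines of Python. This is a simplified illustration rather than the filegroup reader's actual logic: the record layout, the function names, and the representation of log updates as dictionaries are assumptions.

```python
# Hypothetical base file records, listed in their on-disk row order.
BASE_FILE = [
    {"id": "trip1", "fare": 19.10},
    {"id": "trip2", "fare": 27.70},
    {"id": "trip3", "fare": 93.50},
]


def merge_by_key(base, updates_by_key):
    """Key-based merging: look up every base record's key among the log updates."""
    return [updates_by_key.get(rec["id"], rec) for rec in base]


def merge_by_position(base, updates_by_position):
    """Position-based merging: each log update addresses a base record by row position."""
    merged = list(base)
    for position, rec in updates_by_position.items():
        merged[position] = rec
    return merged


update = {"id": "trip2", "fare": 30.00}
by_key = merge_by_key(BASE_FILE, {"trip2": update})
by_position = merge_by_position(BASE_FILE, {1: update})
print(by_key == by_position)
```

Both strategies produce the same merged view. The positional variant touches only the updated row positions instead of comparing keys for every base record, which is the intuition behind skipping pages based on record positions.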