[Docs] Add docs and recent blogs (lakesoul-io#423)
* add docs and recent blogs

Signed-off-by: chenxu <[email protected]>

* update package lock

Signed-off-by: chenxu <[email protected]>

---------

Signed-off-by: chenxu <[email protected]>
Co-authored-by: chenxu <[email protected]>
xuchen-plus and dmetasoul01 authored Jan 17, 2024
1 parent cc05f46 commit 8ff6f19
Showing 21 changed files with 9,557 additions and 15,169 deletions.
3 changes: 3 additions & 0 deletions website/babel.config.js
@@ -4,4 +4,7 @@

module.exports = {
presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
plugins: [
`@babel/plugin-syntax-dynamic-import`,
],
};
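For context, `@babel/plugin-syntax-dynamic-import` only teaches Babel's parser to accept the `import()` expression; the bundler (webpack, via Docusaurus) performs the actual code splitting. A minimal sketch of the syntax, using Node's built-in `path` module as a stand-in for a lazily loaded component such as a PDF viewer:

```javascript
// Dynamic import() returns a promise for the module namespace, which lets a
// heavy component load only when it is first needed.
async function loadLazily(moduleName) {
  const mod = await import(moduleName); // the syntax this Babel plugin enables
  return mod;
}

loadLazily('path').then((mod) => {
  console.log(typeof mod.join); // prints "function"
});
```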
5 changes: 5 additions & 0 deletions website/blog/2023-12-01-lakesoul-introduction/index.mdx
@@ -0,0 +1,5 @@
import LakeSoulIntroPdfViewer from '@site/src/components/LakeSoulIntroductionPdfView';

# LakeSoul Opensource Project Introduction

<LakeSoulIntroPdfViewer />
5 changes: 5 additions & 0 deletions website/blog/2024-01-10-lakesoul-native-io/index.mdx
@@ -0,0 +1,5 @@
import NativeIOPdfViewer from '@site/src/components/NativeIOPdfViewer';

# LakeSoul NativeIO Introduction

<NativeIOPdfViewer />
11 changes: 11 additions & 0 deletions website/canvas-loader.js
@@ -0,0 +1,11 @@
module.exports = () => ({
name: 'canvas-loader',
configureWebpack() {
return {
// It's required by pdfjs-dist
externals: [{
canvas: 'canvas',
}],
};
},
});
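What the `externals` entry does, sketched below: a package listed as external is left out of the bundle and resolved from the runtime environment instead, which is why `pdfjs-dist`'s optional native `canvas` dependency does not have to be installed at build time. This is an illustration of webpack's externals concept (a hypothetical mini-resolver), not webpack itself:

```javascript
// Names listed in `externals` are looked up at runtime rather than bundled.
const externals = { canvas: 'canvas' };

function resolveModule(name, runtimeEnv) {
  if (name in externals) {
    return runtimeEnv[externals[name]]; // deferred to the runtime environment
  }
  throw new Error(`${name} would be bundled normally`);
}

// At runtime, the host environment (or a stub) supplies the module.
const runtimeEnv = { canvas: { createCanvas: (w, h) => ({ width: w, height: h }) } };
console.log(resolveModule('canvas', runtimeEnv).createCanvas(2, 2).width); // prints 2
```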
4 changes: 2 additions & 2 deletions website/docs/03-Usage Docs/05-flink-cdc-sync.md
@@ -53,7 +53,7 @@ export LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver
export LAKESOUL_PG_URL=jdbc:postgresql://localhost:5432/test_lakesoul_meta?stringtype=unspecified
export LAKESOUL_PG_USERNAME=root
export LAKESOUL_PG_PASSWORD=root
````
```
:::

Description of required parameters:
@@ -69,7 +69,7 @@ Description of required parameters:
| --source_db.password | Password for the source database | |
| --source.parallelism | Read parallelism of a single table; it affects data reading speed. The larger the value, the greater the pressure on the source database. | Tune the parallelism according to the write QPS of the source database |
| --sink.parallelism | Write parallelism of a single table, which is also the number of primary-key shards of the LakeSoul table; it determines how fast data lands in the lake. Larger values produce more small files, hurting later read performance; smaller values put more pressure on the write task and increase the risk of data skew. | Tune according to the data volume of the largest table; one parallel task (primary-key shard) should generally manage no more than 10 million rows |
| --warehouse_path | Data storage path prefix (cluster prefix is required for hdfs) | LakeSoul will write the corresponding table data to the ${warehouse_path}/database_name/table_name/ directory |
| --warehouse_path | Data storage path prefix (cluster prefix is required for hdfs) | LakeSoul will write the corresponding table data to the warehouse_path/database_name/table_name/ directory |
| --flink.savepoint | Flink savepoint path (cluster prefix is required for hdfs) | |
| --flink.checkpoint | Flink checkpoint path (cluster prefix is required for hdfs) | |
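To make the two sizing rules above concrete, here is a small illustrative sketch (not part of the tool itself) of the table path layout and the "at most 10 million rows per shard" rule of thumb:

```python
import math

def lakesoul_table_path(warehouse_path: str, database: str, table: str) -> str:
    """Directory a table's data lands in: <warehouse_path>/database_name/table_name/."""
    return f"{warehouse_path.rstrip('/')}/{database}/{table}/"

def suggested_sink_parallelism(max_table_rows: int, rows_per_shard: int = 10_000_000) -> int:
    """One primary-key shard per <= 10 million rows, per the guideline above."""
    return max(1, math.ceil(max_table_rows / rows_per_shard))

# An hdfs warehouse path carries the cluster prefix:
print(lakesoul_table_path("hdfs://mycluster/lakesoul", "shop_db", "orders"))
# hdfs://mycluster/lakesoul/shop_db/orders/
print(suggested_sink_parallelism(35_000_000))  # 4
```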

35 changes: 35 additions & 0 deletions website/docs/03-Usage Docs/15-spark-gluten/index.md
@@ -0,0 +1,35 @@
# Read and write LakeSoul in Spark Gluten

:::tip
Since 2.5.0
:::

Spark Gluten (https://github.com/oap-project/gluten) is an open-source project built on Spark's plugin interface. It injects native, vectorized execution into Apache Spark to substantially improve Spark's execution efficiency. The project has been jointly developed by Intel and Kyligence since 2021, and uses Meta's open-source Velox framework as its physical execution layer, focusing on executing Spark's physical plans with more efficient native instructions.

With Spark Gluten, developers do not need to modify the Spark code base; instead, Spark's extension mechanism is used to replace the physical execution layer. All steps before physical planning reuse Spark's existing code, combining Spark's framework capabilities with a faster executor.

Gluten can already accept batch data in Arrow format as input, but it does not know that the LakeSoul data source produces Arrow. Therefore, when LakeSoul detects that the Gluten plugin is enabled, it inserts a new physical-plan rewrite rule that removes the redundant columnar-row-columnar conversions and connects LakeSoul's Scan physical plan directly to the downstream Gluten compute plan. As shown below:

![lakesoul-gluten](lakesoul-gluten.png)
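The rewrite idea can be sketched as a simple plan transformation. This is a hypothetical Python model of a linear plan chain, not the actual Scala implementation inside LakeSoul: whenever a row-to-columnar transition sits directly on top of a columnar-to-row transition, the pair is redundant and can be dropped, letting the Arrow-producing scan feed Gluten's columnar operators directly.

```python
def remove_redundant_conversions(plan):
    """Collapse RowToColumnar -> ColumnarToRow pairs in a top-down plan chain."""
    if not plan:
        return []
    node, rest = plan[0], plan[1:]
    if node == "RowToColumnar" and rest and rest[0] == "ColumnarToRow":
        # Both transitions cancel out: the child already produces columnar data.
        return remove_redundant_conversions(rest[1:])
    return [node] + remove_redundant_conversions(rest)

plan = ["GlutenFilter", "RowToColumnar", "ColumnarToRow", "LakeSoulScan"]
print(remove_redundant_conversions(plan))
# ['GlutenFilter', 'LakeSoulScan']
```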

## Spark task configuration
When the Spark job starts, configure the Gluten plugin and LakeSoul as follows:
```shell
# The first six --conf entries are required by the Gluten plugin, the next
# three are required by LakeSoul, and --jars pulls in both projects' jars.
$SPARK_HOME/bin/spark-shell --master 'local[1]' --driver-memory 4g \
  --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.size=1g \
  --conf spark.sql.codegen.wholeStage=false \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul \
  --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar
```
Starting the Spark job this way enables Gluten and LakeSoul together, accelerating both IO and compute.

Gluten's jar can be downloaded from https://github.com/oap-project/gluten/releases; choose the build for Spark 3.3.
*(binary image file not displayed)*
5 changes: 3 additions & 2 deletions website/docusaurus.config.js
@@ -5,8 +5,8 @@
// @ts-check
// Note: type annotations allow type checking and IDEs autocompletion

const lightCodeTheme = require('prism-react-renderer/themes/github');
const darkCodeTheme = require('prism-react-renderer/themes/dracula');
const lightCodeTheme = require('prism-react-renderer').themes.github;
const darkCodeTheme = require('prism-react-renderer').themes.dracula;

/** @type {import('@docusaurus/types').Config} */
const config = {
@@ -168,6 +168,7 @@ const config = {
],
},
}),
plugins: ['./canvas-loader.js'],
};

module.exports = config;
@@ -0,0 +1,5 @@
import LakeSoulIntroPdfViewer from '@site/src/components/LakeSoulIntroductionPdfView';

# Introduction to the LakeSoul Open Source Project

<LakeSoulIntroPdfViewer />
@@ -0,0 +1,5 @@
import NativeIOPdfViewer from '@site/src/components/NativeIOPdfViewer';

# Introduction to the Implementation of LakeSoul's NativeIO Layer

<NativeIOPdfViewer />
@@ -71,7 +71,7 @@ export LAKESOUL_PG_PASSWORD=root
| --source_db.password | Password of the source database | |
| --source.parallelism | Read parallelism of a single table; it affects data reading speed. The larger the value, the greater the pressure on MySQL | Tune the parallelism according to MySQL's write QPS |
| --sink.parallelism | Write parallelism of a single table, which is also the number of primary-key shards of the LakeSoul table; it determines how fast data lands in the lake. Larger values produce more small files, hurting later read performance; smaller values put more pressure on the write task and increase the risk of data skew | Tune according to the data volume of the largest table; one parallel task (primary-key shard) should generally manage no more than 10 million rows |
| --warehouse_path | Data storage path prefix (an hdfs path must include the cluster prefix) | LakeSoul writes each table's data to the ${warehouse_path}/database_name/table_name/ directory |
| --warehouse_path | Data storage path prefix (an hdfs path must include the cluster prefix) | LakeSoul writes each table's data to the warehouse_path/database_name/table_name/ directory |
| --flink.savepoint | Flink savepoint path (an hdfs path must include the cluster prefix) | |
| --flink.checkpoint | Flink checkpoint path (an hdfs path must include the cluster prefix) | |

@@ -0,0 +1,34 @@
# Read and Write LakeSoul with Spark Gluten

:::tip
Supported since LakeSoul 2.5.0
:::

Spark Gluten (https://github.com/oap-project/gluten) is an open-source project built on Spark's plugin interface. It injects native, vectorized execution into Apache Spark to greatly improve Spark's execution efficiency and cost. The project has been jointly developed by Intel and Kyligence since 2021, and uses Meta's open-source Velox framework as its physical execution layer, focusing on executing Spark's physical plans with more efficient native instructions.

With Spark Gluten, developers do not need to modify the Spark code base; instead, Spark's extension mechanism is used to replace the physical execution layer. All steps before physical planning reuse Spark's existing code, combining Spark's framework capabilities with a faster executor.

Gluten can already accept batch data in Arrow format as input, but it does not know that the LakeSoul data source produces Arrow. Therefore, when LakeSoul detects that the Gluten plugin is enabled, it inserts a new physical-plan rewrite rule that removes the redundant columnar-row-columnar conversions and connects LakeSoul's Scan physical plan directly to the downstream Gluten compute plan, as shown below:
![lakesoul-gluten](lakesoul-gluten.png)

## Spark Job Configuration
When the Spark job starts, configure the Gluten plugin and LakeSoul as follows:
```shell
# The first six --conf entries are required by the Gluten plugin, the next
# three are required by LakeSoul, and --jars pulls in both projects' jars.
$SPARK_HOME/bin/spark-shell --master 'local[1]' --driver-memory 4g \
  --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.size=1g \
  --conf spark.sql.codegen.wholeStage=false \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul \
  --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar
```
Starting the Spark job this way enables Gluten and LakeSoul together, accelerating both IO and compute.

Gluten's jar can be downloaded from https://github.com/oap-project/gluten/releases; choose the build for Spark 3.3.
*(binary image file not displayed)*
