[Docs] Add docs and recent blogs (lakesoul-io#423)
* add docs and recent blogs

Signed-off-by: chenxu <[email protected]>

* update package lock

Signed-off-by: chenxu <[email protected]>

---------

Signed-off-by: chenxu <[email protected]>
Co-authored-by: chenxu <[email protected]>
xuchen-plus and dmetasoul01 authored Jan 17, 2024
1 parent cc05f46 commit 8ff6f19
Showing 21 changed files with 9,557 additions and 15,169 deletions.
3 changes: 3 additions & 0 deletions website/babel.config.js
@@ -4,4 +4,7 @@

module.exports = {
presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
plugins: [
`@babel/plugin-syntax-dynamic-import`,
],
};
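For context, `@babel/plugin-syntax-dynamic-import` only teaches Babel's parser to accept the `import()` expression; the bundler (webpack, via Docusaurus) performs the actual code splitting. A minimal sketch of the syntax, using Node's built-in `path` module as a stand-in for a lazily loaded component such as a PDF viewer:

```javascript
// Dynamic import() returns a promise for the module namespace, which lets a
// heavy component load only when it is first needed.
async function loadLazily(moduleName) {
  const mod = await import(moduleName); // the syntax this Babel plugin enables
  return mod;
}

loadLazily('path').then((mod) => {
  console.log(typeof mod.join); // prints "function"
});
```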
5 changes: 5 additions & 0 deletions website/blog/2023-12-01-lakesoul-introduction/index.mdx
@@ -0,0 +1,5 @@
import LakeSoulIntroPdfViewer from '@site/src/components/LakeSoulIntroductionPdfView';

# LakeSoul Opensource Project Introduction

<LakeSoulIntroPdfViewer />
5 changes: 5 additions & 0 deletions website/blog/2024-01-10-lakesoul-native-io/index.mdx
@@ -0,0 +1,5 @@
import NativeIOPdfViewer from '@site/src/components/NativeIOPdfViewer';

# LakeSoul NativeIO Introduction

<NativeIOPdfViewer />
11 changes: 11 additions & 0 deletions website/canvas-loader.js
@@ -0,0 +1,11 @@
module.exports = () => ({
name: 'canvas-loader',
configureWebpack() {
return {
// It's required by pdfjs-dist
externals: [{
canvas: 'canvas',
}],
};
},
});
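What the `externals` entry does, sketched below: a package listed as external is left out of the bundle and resolved from the runtime environment instead, which is why `pdfjs-dist`'s optional native `canvas` dependency does not have to be installed at build time. This is an illustration of webpack's externals concept (a hypothetical mini-resolver), not webpack itself:

```javascript
// Names listed in `externals` are looked up at runtime rather than bundled.
const externals = { canvas: 'canvas' };

function resolveModule(name, runtimeEnv) {
  if (name in externals) {
    return runtimeEnv[externals[name]]; // deferred to the runtime environment
  }
  throw new Error(`${name} would be bundled normally`);
}

// At runtime, the host environment (or a stub) supplies the module.
const runtimeEnv = { canvas: { createCanvas: (w, h) => ({ width: w, height: h }) } };
console.log(resolveModule('canvas', runtimeEnv).createCanvas(2, 2).width); // prints 2
```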
4 changes: 2 additions & 2 deletions website/docs/03-Usage Docs/05-flink-cdc-sync.md
@@ -53,7 +53,7 @@ export LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver
export LAKESOUL_PG_URL=jdbc:postgresql://localhost:5432/test_lakesoul_meta?stringtype=unspecified
export LAKESOUL_PG_USERNAME=root
export LAKESOUL_PG_PASSWORD=root
````
```
:::

Description of required parameters:
@@ -69,7 +69,7 @@ Description of required parameters:
| --source_db.password | Password for the source database | |
| --source.parallelism | Read parallelism of a single table; it affects data reading speed. The larger the value, the greater the pressure on the source database. | Tune the parallelism according to the write QPS of the source database |
| --sink.parallelism | Write parallelism of a single table, which is also the number of primary-key shards of the LakeSoul table; it determines how fast data lands in the lake. Larger values produce more small files, hurting later read performance; smaller values put more pressure on the write task and increase the risk of data skew. | Tune according to the data volume of the largest table; one parallel task (primary-key shard) should generally manage no more than 10 million rows |
| --warehouse_path | Data storage path prefix (cluster prefix is required for hdfs) | LakeSoul will write the corresponding table data to the ${warehouse_path}/database_name/table_name/ directory |
| --warehouse_path | Data storage path prefix (cluster prefix is required for hdfs) | LakeSoul will write the corresponding table data to the warehouse_path/database_name/table_name/ directory |
| --flink.savepoint | Flink savepoint path (cluster prefix is required for hdfs) | |
| --flink.checkpoint | Flink checkpoint path (cluster prefix is required for hdfs) | |
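To make the two sizing rules above concrete, here is a small illustrative sketch (not part of the tool itself) of the table path layout and the "at most 10 million rows per shard" rule of thumb:

```python
import math

def lakesoul_table_path(warehouse_path: str, database: str, table: str) -> str:
    """Directory a table's data lands in: <warehouse_path>/database_name/table_name/."""
    return f"{warehouse_path.rstrip('/')}/{database}/{table}/"

def suggested_sink_parallelism(max_table_rows: int, rows_per_shard: int = 10_000_000) -> int:
    """One primary-key shard per <= 10 million rows, per the guideline above."""
    return max(1, math.ceil(max_table_rows / rows_per_shard))

# An hdfs warehouse path carries the cluster prefix:
print(lakesoul_table_path("hdfs://mycluster/lakesoul", "shop_db", "orders"))
# hdfs://mycluster/lakesoul/shop_db/orders/
print(suggested_sink_parallelism(35_000_000))  # 4
```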

35 changes: 35 additions & 0 deletions website/docs/03-Usage Docs/15-spark-gluten/index.md
@@ -0,0 +1,35 @@
# Read and write LakeSoul in Spark Gluten

:::tip
Since 2.5.0
:::

Spark Gluten (https://github.com/oap-project/gluten) is an open-source project built on Spark's plugin interface. It injects native, vectorized execution into Apache Spark to substantially improve Spark's execution efficiency. The project has been jointly developed by Intel and Kyligence since 2021, and uses Meta's open-source Velox framework as its physical execution layer, focusing on executing Spark's physical plans with more efficient native instructions.

With Spark Gluten, developers do not need to modify the Spark code base; instead, Spark's extension mechanism is used to replace the physical execution layer. All steps before physical planning reuse Spark's existing code, combining Spark's framework capabilities with a faster executor.

Gluten can already accept batch data in Arrow format as input, but it does not know that the LakeSoul data source produces Arrow. Therefore, when LakeSoul detects that the Gluten plugin is enabled, it inserts a new physical-plan rewrite rule that removes the redundant columnar-row-columnar conversions and connects LakeSoul's Scan physical plan directly to the downstream Gluten compute plan. As shown below:

![lakesoul-gluten](lakesoul-gluten.png)
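The rewrite idea can be sketched as a simple plan transformation. This is a hypothetical Python model of a linear plan chain, not the actual Scala implementation inside LakeSoul: whenever a row-to-columnar transition sits directly on top of a columnar-to-row transition, the pair is redundant and can be dropped, letting the Arrow-producing scan feed Gluten's columnar operators directly.

```python
def remove_redundant_conversions(plan):
    """Collapse RowToColumnar -> ColumnarToRow pairs in a top-down plan chain."""
    if not plan:
        return []
    node, rest = plan[0], plan[1:]
    if node == "RowToColumnar" and rest and rest[0] == "ColumnarToRow":
        # Both transitions cancel out: the child already produces columnar data.
        return remove_redundant_conversions(rest[1:])
    return [node] + remove_redundant_conversions(rest)

plan = ["GlutenFilter", "RowToColumnar", "ColumnarToRow", "LakeSoulScan"]
print(remove_redundant_conversions(plan))
# ['GlutenFilter', 'LakeSoulScan']
```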

## Spark task configuration
When the Spark job starts, configure the Gluten plugin and LakeSoul as follows:
```shell
# The first six --conf entries are required by the Gluten plugin, the next
# three are required by LakeSoul, and --jars pulls in both projects' jars.
$SPARK_HOME/bin/spark-shell --master 'local[1]' --driver-memory 4g \
  --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.size=1g \
  --conf spark.sql.codegen.wholeStage=false \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul \
  --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar
```
Starting the Spark job this way enables Gluten and LakeSoul together, accelerating both IO and compute.

Gluten's jar can be downloaded from https://github.com/oap-project/gluten/releases; choose the build for Spark 3.3.
*(binary image file not displayed)*
5 changes: 3 additions & 2 deletions website/docusaurus.config.js
@@ -5,8 +5,8 @@
// @ts-check
// Note: type annotations allow type checking and IDEs autocompletion

const lightCodeTheme = require('prism-react-renderer/themes/github');
const darkCodeTheme = require('prism-react-renderer/themes/dracula');
const lightCodeTheme = require('prism-react-renderer').themes.github;
const darkCodeTheme = require('prism-react-renderer').themes.dracula;

/** @type {import('@docusaurus/types').Config} */
const config = {
@@ -168,6 +168,7 @@ const config = {
],
},
}),
plugins: ['./canvas-loader.js'],
};

module.exports = config;
@@ -0,0 +1,5 @@
import LakeSoulIntroPdfViewer from '@site/src/components/LakeSoulIntroductionPdfView';

# Introduction to the LakeSoul Open Source Project

<LakeSoulIntroPdfViewer />
@@ -0,0 +1,5 @@
import NativeIOPdfViewer from '@site/src/components/NativeIOPdfViewer';

# Introduction to the Implementation of LakeSoul's NativeIO Layer

<NativeIOPdfViewer />
@@ -71,7 +71,7 @@ export LAKESOUL_PG_PASSWORD=root
| --source_db.password | Password of the source database | |
| --source.parallelism | Read parallelism of a single table; it affects data reading speed. The larger the value, the greater the pressure on MySQL | Tune the parallelism according to MySQL's write QPS |
| --sink.parallelism | Write parallelism of a single table, which is also the number of primary-key shards of the LakeSoul table; it determines how fast data lands in the lake. Larger values produce more small files, hurting later read performance; smaller values put more pressure on the write task and increase the risk of data skew | Tune according to the data volume of the largest table; one parallel task (primary-key shard) should generally manage no more than 10 million rows |
| --warehouse_path | Data storage path prefix (an hdfs path must include the cluster prefix) | LakeSoul writes each table's data to the ${warehouse_path}/database_name/table_name/ directory |
| --warehouse_path | Data storage path prefix (an hdfs path must include the cluster prefix) | LakeSoul writes each table's data to the warehouse_path/database_name/table_name/ directory |
| --flink.savepoint | Flink savepoint path (an hdfs path must include the cluster prefix) | |
| --flink.checkpoint | Flink checkpoint path (an hdfs path must include the cluster prefix) | |

@@ -0,0 +1,34 @@
# Read and Write LakeSoul with Spark Gluten

:::tip
Supported since LakeSoul 2.5.0
:::

Spark Gluten (https://github.com/oap-project/gluten) is an open-source project built on Spark's plugin interface. It injects native, vectorized execution into Apache Spark to greatly improve Spark's execution efficiency and cost. The project has been jointly developed by Intel and Kyligence since 2021, and uses Meta's open-source Velox framework as its physical execution layer, focusing on executing Spark's physical plans with more efficient native instructions.

With Spark Gluten, developers do not need to modify the Spark code base; instead, Spark's extension mechanism is used to replace the physical execution layer. All steps before physical planning reuse Spark's existing code, combining Spark's framework capabilities with a faster executor.

Gluten can already accept batch data in Arrow format as input, but it does not know that the LakeSoul data source produces Arrow. Therefore, when LakeSoul detects that the Gluten plugin is enabled, it inserts a new physical-plan rewrite rule that removes the redundant columnar-row-columnar conversions and connects LakeSoul's Scan physical plan directly to the downstream Gluten compute plan, as shown below:
![lakesoul-gluten](lakesoul-gluten.png)

## Spark Job Configuration
When the Spark job starts, configure the Gluten plugin and LakeSoul as follows:
```shell
# The first six --conf entries are required by the Gluten plugin, the next
# three are required by LakeSoul, and --jars pulls in both projects' jars.
$SPARK_HOME/bin/spark-shell --master 'local[1]' --driver-memory 4g \
  --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.size=1g \
  --conf spark.sql.codegen.wholeStage=false \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul \
  --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar
```
Starting the Spark job this way enables Gluten and LakeSoul together, accelerating both IO and compute.

Gluten's jar can be downloaded from https://github.com/oap-project/gluten/releases; choose the build for Spark 3.3.
*(binary image file not displayed)*
