forked from lakesoul-io/LakeSoul
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Docs] Add docs and recent blogs (lakesoul-io#423)
* add docs and recent blogs
* update package lock

Signed-off-by: chenxu <[email protected]>
Co-authored-by: chenxu <[email protected]>
1 parent
cc05f46
commit 8ff6f19
Showing 21 changed files with 9,557 additions and 15,169 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
import LakeSoulIntroPdfViewer from '@site/src/components/LakeSoulIntroductionPdfView'; | ||
|
||
# LakeSoul Open Source Project Introduction | ||
|
||
<LakeSoulIntroPdfViewer /> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
import NativeIOPdfViewer from '@site/src/components/NativeIOPdfViewer'; | ||
|
||
# LakeSoul NativeIO Introduction | ||
|
||
<NativeIOPdfViewer /> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
module.exports = () => ({ | ||
name: 'canvas-loader', | ||
configureWebpack() { | ||
return { | ||
// It's required by pdfjs-dist | ||
externals: [{ | ||
canvas: 'canvas', | ||
}], | ||
}; | ||
}, | ||
}); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# Read and write LakeSoul in Spark Gluten | ||
|
||
:::tip | ||
Since 2.5.0 | ||
::: | ||
|
||
Spark Gluten (https://github.com/oap-project/gluten) is an open source project built on Spark's plug-in interface. It aims to bring native, vectorized execution to Apache Spark and thereby substantially improve Spark's execution efficiency. Intel and Kyligence have developed the project jointly since 2021. Underneath, it uses Velox, Meta's open source physical execution framework, and focuses on executing Spark physical plans with more efficient instructions. | ||
|
||
Gluten does not require any changes to the Spark code base. Instead, it uses Spark's extension mechanism to replace the implementation of the physical execution layer. The steps before physical planning reuse Spark's existing code, so Gluten keeps Spark's framework capabilities while improving executor performance. | ||
|
||
Gluten can already accept batches in Arrow format as input, but it does not know that the LakeSoul data source supports Arrow. Therefore, when LakeSoul detects that the Gluten plug-in is enabled, it inserts a new physical plan rewrite rule that removes the redundant column-row-column conversion and connects LakeSoul's Scan physical plan directly to the subsequent Gluten physical plan, as shown below: | ||
|
||
![lakesoul-gluten](lakesoul-gluten.png) | ||
|
||
## Spark task configuration | ||
When starting the Spark job, configure the Gluten plug-in and LakeSoul as follows: | ||
```shell | ||
# Comments cannot sit between backslash-continued lines (the shell would end the command there), | ||
# so the three option groups are described here instead: | ||
#   1. Configuration items required by the Gluten plug-in (spark.driver.extraJavaOptions through spark.sql.codegen.wholeStage) | ||
#   2. Configuration items required by LakeSoul (spark.sql.extensions through spark.sql.defaultCatalog) | ||
#   3. The jars of LakeSoul and Gluten (--jars) | ||
$SPARK_HOME/bin/spark-shell --master local\[1\] --driver-memory 4g \ | ||
  --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \ | ||
  --conf spark.plugins=io.glutenproject.GlutenPlugin \ | ||
  --conf spark.memory.offHeap.enabled=true \ | ||
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \ | ||
  --conf spark.memory.offHeap.size=1g \ | ||
  --conf spark.sql.codegen.wholeStage=false \ | ||
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \ | ||
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \ | ||
  --conf spark.sql.defaultCatalog=lakesoul \ | ||
  --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar | ||
``` | ||
Starting the Spark job this way enables Gluten and LakeSoul together, accelerating both IO and computation. | ||
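A shell comment ends a backslash-continued command, so per-group comments cannot be interleaved with the options of a single `spark-shell` invocation. One way around this, shown here only as a minimal sketch, is to append the options group by group to the positional parameters. The helper name `with_lakesoul_gluten_args` is hypothetical; the option values and jar names are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: assemble the options step by step so that comments can sit
# between option groups. `with_lakesoul_gluten_args` is a hypothetical helper,
# not part of LakeSoul, Gluten, or Spark.

with_lakesoul_gluten_args() {
  # "$@" initially holds the command to run (for example the spark-shell path).
  set -- "$@" --master 'local[1]' --driver-memory 4g

  # Configuration items required by the Gluten plug-in
  set -- "$@" \
    --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.plugins=io.glutenproject.GlutenPlugin \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
    --conf spark.memory.offHeap.size=1g \
    --conf spark.sql.codegen.wholeStage=false

  # Configuration items required by LakeSoul
  set -- "$@" \
    --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
    --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
    --conf spark.sql.defaultCatalog=lakesoul

  # Jars of LakeSoul and Gluten
  set -- "$@" \
    --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar

  # Run the command with the assembled options.
  "$@"
}

# Dry run: print one argument per line instead of launching Spark.
with_lakesoul_gluten_args printf '%s\n'
```

To launch for real under these assumptions, pass the Spark launcher instead of `printf`, for example `with_lakesoul_gluten_args "$SPARK_HOME/bin/spark-shell"`. Keeping the command as the first argument makes the same helper usable for both inspection and launch.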
|
||
Gluten's jar can be downloaded from https://github.com/oap-project/gluten/releases. Choose the jar built for Spark 3.3. |
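Since the jars must match Spark 3.3, a quick name check before launching can catch a wrong download. The following is a rough sketch that assumes the file naming seen in the command above (`spark-3.3` in the LakeSoul jar name, `spark3.3` in the Gluten jar name); other releases may be named differently, and `check_spark33_jars` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: flag jars whose file name does not mention Spark 3.3.
# Assumes names contain "spark-3.3" or "spark3.3", as in the example above.
# `check_spark33_jars` is a hypothetical helper, not a LakeSoul or Gluten tool.

check_spark33_jars() {
  list=$1      # comma-separated value as passed to --jars
  status=0
  while [ -n "$list" ]; do
    jar=${list%%,*}                      # first entry of the remaining list
    case "$jar" in
      *spark-3.3*|*spark3.3*) ;;         # name looks like a Spark 3.3 build
      *) echo "possibly not a Spark 3.3 jar: $jar" >&2; status=1 ;;
    esac
    case "$list" in
      *,*) list=${list#*,} ;;            # drop the entry just checked
      *)   list= ;;                      # that was the last entry
    esac
  done
  return "$status"
}

check_spark33_jars "lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar" \
  && echo "jar names look consistent with Spark 3.3"
```

The check only inspects file names, so it is a hint rather than a guarantee that a jar was built against Spark 3.3.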
5 changes: 5 additions & 0 deletions
...-Hans/docusaurus-plugin-content-blog/2023-12-01-lakesoul-introduction/index.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
import LakeSoulIntroPdfViewer from '@site/src/components/LakeSoulIntroductionPdfView'; | ||
|
||
# LakeSoul Open Source Project Introduction | ||
|
||
<LakeSoulIntroPdfViewer /> |
5 changes: 5 additions & 0 deletions
.../zh-Hans/docusaurus-plugin-content-blog/2024-01-10-lakesoul-native-io/index.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
import NativeIOPdfViewer from '@site/src/components/NativeIOPdfViewer'; | ||
|
||
# Introduction to the Implementation Principles of LakeSoul's NativeIO Layer | ||
|
||
<NativeIOPdfViewer /> |
34 changes: 34 additions & 0 deletions
...s/docusaurus-plugin-content-docs/current/03-Usage Docs/15-spark-gluten/index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Read and Write LakeSoul in Spark Gluten | ||
|
||
:::tip | ||
Supported since LakeSoul 2.5.0 | ||
::: | ||
|
||
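Spark Gluten (https://github.com/oap-project/gluten) is an open source project built on Spark's plug-in interface. It aims to bring native, vectorized execution to Apache Spark and thereby substantially improve Spark's execution efficiency and cost. Intel and Kyligence have developed the project jointly since 2021. Underneath, it uses Velox, Meta's open source physical execution framework, and focuses on executing Spark physical plans with more efficient instructions. | ||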
|
||
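In the Spark Gluten project, developers do not need to modify the Spark code base. Instead, they use Spark's extension mechanism to replace the implementation of the physical execution layer. The steps before physical planning reuse Spark's existing code, which keeps Spark's framework capabilities while improving executor performance. | ||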
|
||
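Gluten can already accept batches in Arrow format as input, but it does not know that the LakeSoul data source supports Arrow. Therefore, when LakeSoul detects that the Gluten plug-in is enabled, it inserts a new physical plan rewrite rule that removes the redundant column-row-column conversion and connects LakeSoul's Scan physical plan directly to the subsequent Gluten physical plan, as shown in the figure below: | ||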
![lakesoul-gluten](lakesoul-gluten.png) | ||
|
||
## Spark task configuration | ||
When starting the Spark job, configure the Gluten plug-in and LakeSoul as follows: | ||
```shell | ||
# Comments cannot sit between backslash-continued lines (the shell would end the command there), | ||
# so the three option groups are described here instead: | ||
#   1. Configuration items required by the Gluten plug-in (spark.driver.extraJavaOptions through spark.sql.codegen.wholeStage) | ||
#   2. Configuration items required by LakeSoul (spark.sql.extensions through spark.sql.defaultCatalog) | ||
#   3. The jars of LakeSoul and Gluten (--jars) | ||
$SPARK_HOME/bin/spark-shell --master local\[1\] --driver-memory 4g \ | ||
  --conf "spark.driver.extraJavaOptions=--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true" \ | ||
  --conf spark.plugins=io.glutenproject.GlutenPlugin \ | ||
  --conf spark.memory.offHeap.enabled=true \ | ||
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \ | ||
  --conf spark.memory.offHeap.size=1g \ | ||
  --conf spark.sql.codegen.wholeStage=false \ | ||
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \ | ||
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \ | ||
  --conf spark.sql.defaultCatalog=lakesoul \ | ||
  --jars lakesoul-spark-2.5.1-spark-3.3.jar,gluten-velox-bundle-spark3.3_2.12-1.1.0.jar | ||
``` | ||
Starting the Spark job this way enables Gluten and LakeSoul together, accelerating both IO and computation. | ||
|
||
Gluten's jar can be downloaded from https://github.com/oap-project/gluten/releases. Choose the version built for Spark 3.3. |
Binary file added
BIN
+209 KB
...s-plugin-content-docs/current/03-Usage Docs/15-spark-gluten/lakesoul-gluten.png