diff --git a/CHANGELOG.md b/CHANGELOG.md
index 86b48d454..60fd14471 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,13 +1,151 @@
 # Change log
-Generated on 2021-04-29
+Generated on 2021-06-02
+
+## Release 1.1.1
+
+### Native SQL Engine
+
+#### Features
+|||
+|:---|:---|
+|[#304](https://github.com/oap-project/native-sql-engine/issues/304)|Upgrade to Arrow 4.0.0|
+|[#285](https://github.com/oap-project/native-sql-engine/issues/285)|ColumnarWindow: Support Date/Timestamp input in MAX/MIN|
+|[#297](https://github.com/oap-project/native-sql-engine/issues/297)|Disable incremental compiler in CI|
+|[#245](https://github.com/oap-project/native-sql-engine/issues/245)|Support columnar rdd cache|
+|[#276](https://github.com/oap-project/native-sql-engine/issues/276)|Add option to switch Hadoop version|
+|[#274](https://github.com/oap-project/native-sql-engine/issues/274)|Comment to trigger tpc-h RAM test|
+|[#256](https://github.com/oap-project/native-sql-engine/issues/256)|CI: do not run ram report for each PR|
+
+#### Bugs Fixed
+|||
+|:---|:---|
+|[#325](https://github.com/oap-project/native-sql-engine/issues/325)|java.util.ConcurrentModificationException: mutation occurred during iteration|
+|[#329](https://github.com/oap-project/native-sql-engine/issues/329)|numPartitions are not the same|
+|[#318](https://github.com/oap-project/native-sql-engine/issues/318)|fix Spark 311 on data source v2|
+|[#311](https://github.com/oap-project/native-sql-engine/issues/311)|Build reports errors|
+|[#302](https://github.com/oap-project/native-sql-engine/issues/302)|test on v2 failed due to an exception|
+|[#257](https://github.com/oap-project/native-sql-engine/issues/257)|different version of slf4j-log4j|
+|[#293](https://github.com/oap-project/native-sql-engine/issues/293)|Fix BHJ loss if key = 0|
+|[#248](https://github.com/oap-project/native-sql-engine/issues/248)|arrow dependency must put after arrow installation|
+
+#### PRs
+|||
+|:---|:---|
+|[#332](https://github.com/oap-project/native-sql-engine/pull/332)|[NSE-325] fix incremental compile issue with 4.5.x scala-maven-plugin|
+|[#335](https://github.com/oap-project/native-sql-engine/pull/335)|[NSE-329] fix out partitioning in BHJ and SHJ|
+|[#328](https://github.com/oap-project/native-sql-engine/pull/328)|[NSE-318]check schema before reuse exchange|
+|[#307](https://github.com/oap-project/native-sql-engine/pull/307)|[NSE-304] Upgrade to Arrow 4.0.0|
+|[#312](https://github.com/oap-project/native-sql-engine/pull/312)|[NSE-311] Build reports errors|
+|[#272](https://github.com/oap-project/native-sql-engine/pull/272)|[NSE-273] support spark311|
+|[#303](https://github.com/oap-project/native-sql-engine/pull/303)|[NSE-302] fix v2 test|
+|[#306](https://github.com/oap-project/native-sql-engine/pull/306)|[NSE-304] Upgrade to Arrow 4.0.0: Change basic GHA TPC-H test target …|
+|[#286](https://github.com/oap-project/native-sql-engine/pull/286)|[NSE-285] ColumnarWindow: Support Date input in MAX/MIN|
+|[#298](https://github.com/oap-project/native-sql-engine/pull/298)|[NSE-297] Disable incremental compiler in GHA CI|
+|[#291](https://github.com/oap-project/native-sql-engine/pull/291)|[NSE-257] fix multiple slf4j bindings|
+|[#294](https://github.com/oap-project/native-sql-engine/pull/294)|[NSE-293] fix unsafemap with key = '0'|
+|[#233](https://github.com/oap-project/native-sql-engine/pull/233)|[NSE-207] fix issues found from aggregate unit tests|
+|[#246](https://github.com/oap-project/native-sql-engine/pull/246)|[NSE-245]Adding columnar RDD cache support|
+|[#289](https://github.com/oap-project/native-sql-engine/pull/289)|[NSE-206]Update installation guide and configuration guide.|
+|[#277](https://github.com/oap-project/native-sql-engine/pull/277)|[NSE-276] Add option to switch Hadoop version|
+|[#275](https://github.com/oap-project/native-sql-engine/pull/275)|[NSE-274] Comment to trigger tpc-h RAM test|
+|[#271](https://github.com/oap-project/native-sql-engine/pull/271)|[NSE-196] clean up configs in unit tests|
+|[#258](https://github.com/oap-project/native-sql-engine/pull/258)|[NSE-257] fix different versions of slf4j-log4j12|
+|[#259](https://github.com/oap-project/native-sql-engine/pull/259)|[NSE-248] fix arrow dependency order|
+|[#249](https://github.com/oap-project/native-sql-engine/pull/249)|[NSE-241] fix hashagg result length|
+|[#255](https://github.com/oap-project/native-sql-engine/pull/255)|[NSE-256] do not run ram report test on each PR|
+
+
+### SQL DS Cache
+
+#### Features
+|||
+|:---|:---|
+|[#118](https://github.com/oap-project/sql-ds-cache/issues/118)|port to Spark 3.1.1|
+
+#### Bugs Fixed
+|||
+|:---|:---|
+|[#121](https://github.com/oap-project/sql-ds-cache/issues/121)|OAP Index creation stuck issue|
+
+#### PRs
+|||
+|:---|:---|
+|[#132](https://github.com/oap-project/sql-ds-cache/pull/132)|Fix SampleBasedStatisticsSuite UnitTest case|
+|[#122](https://github.com/oap-project/sql-ds-cache/pull/122)|[ sql-ds-cache-121] Fix Index stuck issues|
+|[#119](https://github.com/oap-project/sql-ds-cache/pull/119)|[SQL-DS-CACHE-118][POAE7-1130] port sql-ds-cache to Spark3.1.1|
+
+
+### OAP MLlib
+
+#### Features
+|||
+|:---|:---|
+|[#26](https://github.com/oap-project/oap-mllib/issues/26)|[PIP] Support Spark 3.0.1 / 3.0.2 and upcoming 3.1.1|
+
+#### PRs
+|||
+|:---|:---|
+|[#39](https://github.com/oap-project/oap-mllib/pull/39)|[ML-26] Build for different spark version by -Pprofile|
+
+
+### PMEM Spill
+
+#### Features
+|||
+|:---|:---|
+|[#34](https://github.com/oap-project/pmem-spill/issues/34)|Support vanilla spark 3.1.1|
+
+#### PRs
+|||
+|:---|:---|
+|[#41](https://github.com/oap-project/pmem-spill/pull/41)|[PMEM-SPILL-34][POAE7-1119]Port RDD cache to Spark 3.1.1 as separate module|
+
+
+### PMEM Common
+
+#### Features
+|||
+|:---|:---|
+|[#10](https://github.com/oap-project/pmem-common/issues/10)|add -mclflushopt flag to enable clflushopt for gcc|
+|[#8](https://github.com/oap-project/pmem-common/issues/8)|use clflushopt instead of clflush |
+
+#### PRs
+|||
+|:---|:---|
+|[#11](https://github.com/oap-project/pmem-common/pull/11)|[PMEM-COMMON-10][POAE7-1010]Add -mclflushopt flag to enable clflushop…|
+|[#9](https://github.com/oap-project/pmem-common/pull/9)|[PMEM-COMMON-8][POAE7-896]use clflush optimize version for clflush|
+
+
+### PMEM Shuffle
+
+#### Features
+|||
+|:---|:---|
+|[#15](https://github.com/oap-project/pmem-shuffle/issues/15)|Doesn't work with Spark3.1.1|
+
+#### PRs
+|||
+|:---|:---|
+|[#16](https://github.com/oap-project/pmem-shuffle/pull/16)|[pmem-shuffle-15] Make pmem-shuffle support Spark3.1.1|
+
+
+### Remote Shuffle
+
+#### Features
+|||
+|:---|:---|
+|[#18](https://github.com/oap-project/remote-shuffle/issues/18)|upgrade to Spark-3.1.1|
+|[#11](https://github.com/oap-project/remote-shuffle/issues/11)|Support DAOS Object Async API|
+
+#### PRs
+|||
+|:---|:---|
+|[#19](https://github.com/oap-project/remote-shuffle/pull/19)|[REMOTE-SHUFFLE-18] upgrade to Spark-3.1.1|
+|[#14](https://github.com/oap-project/remote-shuffle/pull/14)|[REMOTE-SHUFFLE-11] Support DAOS Object Async API|
+
+
 ## Release 1.1.0
-* [Native SQL Engine](#native-sql-engine)
-* [SQL DS Cache](#sql-ds-cache)
-* [OAP MLlib](#oap-mllib)
-* [PMEM Spill](#pmem-spill)
-* [PMEM Shuffle](#pmem-shuffle)
-* [Remote Shuffle](#remote-shuffle)
 ### Native SQL Engine
@@ -264,7 +402,7 @@ Generated on 2021-04-29
 |[#6](https://github.com/oap-project/pmem-shuffle/pull/6)|[PMEM-SHUFFLE-7] enable fsdax mode in pmem-shuffle|
-### Remote-Shuffle
+### Remote Shuffle
 #### Features
 |||
diff --git a/arrow-data-source/README.md b/arrow-data-source/README.md
index b9781ba89..3d4650add 100644
--- a/arrow-data-source/README.md
+++ b/arrow-data-source/README.md
@@ -18,8 +18,8 @@ Please make sure you have already installed the software in your system.
 3. cmake 3.16 or higher version
 4. maven 3.6 or higher version
 5. Hadoop 2.7.5 or higher version
-6. Spark 3.0.0 or higher version
-7. Intel Optimized Arrow 3.0.0
+6. Spark 3.1.1 or higher version
+7. Intel Optimized Arrow 4.0.0
 ### Building by Conda
@@ -145,14 +145,14 @@ mvn clean -DskipTests package
 readlink -f standard/target/spark-arrow-datasource-standard--jar-with-dependencies.jar
 ```
-### Download Spark 3.0.0
+### Download Spark 3.1.1
-Currently ArrowDataSource works on the Spark 3.0.0 version.
+Currently ArrowDataSource works with Spark 3.1.1.
 ```
-wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
-tar -xf ./spark-3.0.0-bin-hadoop2.7.tgz
-export SPARK_HOME=`pwd`/spark-3.0.0-bin-hadoop2.7
+wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
+tar -xf ./spark-3.1.1-bin-hadoop2.7.tgz
+export SPARK_HOME=`pwd`/spark-3.1.1-bin-hadoop2.7
 ```
 If you are new to Apache Spark, please go though [Spark's official deploying guide](https://spark.apache.org/docs/latest/cluster-overview.html) before getting started with ArrowDataSource.
diff --git a/docs/Installation.md b/docs/Installation.md
index 6d63c13d2..01ad1bd79 100644
--- a/docs/Installation.md
+++ b/docs/Installation.md
@@ -26,8 +26,8 @@ Based on the different environment, there are some parameters can be set via -D
 | arrow_root | When build_arrow set to False, arrow_root will be enabled to find the location of your existing arrow library. | /usr/local |
 | build_protobuf | Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. | True |
-When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from [OAP Arrow](https://github.com/oap-project/arrow)
-If you wish to change any parameters from Arrow, you can change it from the build_arrow.sh script under native-sql-enge/arrow-data-source/script/.
+When build_arrow is set to True, the build_arrow.sh script will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap-1.1.1).
+If you wish to change any Arrow parameters, you can do so in the `build_arrow.sh` script under `native-sql-engine/arrow-data-source/script/`.
 ### Additional Notes
 [Notes for Installation Issues](./InstallationNotes.md)
diff --git a/docs/OAP-Developer-Guide.md b/docs/OAP-Developer-Guide.md
index 6525bb985..ff173c6f7 100644
--- a/docs/OAP-Developer-Guide.md
+++ b/docs/OAP-Developer-Guide.md
@@ -3,13 +3,13 @@
 This document contains the instructions & scripts on installing necessary dependencies and building OAP modules.
 You can get more detailed information from OAP each module below.
-* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/v1.1.0-spark-3.0.0/docs/Developer-Guide.md)
-* [PMem Common](https://github.com/oap-project/pmem-common/tree/v1.1.0-spark-3.0.0)
-* [PMem Spill](https://github.com/oap-project/pmem-spill/tree/v1.1.0-spark-3.0.0)
-* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle/tree/v1.1.0-spark-3.0.0#5-install-dependencies-for-pmem-shuffle)
-* [Remote Shuffle](https://github.com/oap-project/remote-shuffle/tree/v1.1.0-spark-3.0.0)
-* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.1.0-spark-3.0.0)
-* [Native SQL Engine](https://github.com/oap-project/native-sql-engine/tree/v1.1.0-spark-3.0.0)
+* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/v1.1.1-spark-3.1.1/docs/Developer-Guide.md)
+* [PMem Common](https://github.com/oap-project/pmem-common/tree/v1.1.1-spark-3.1.1)
+* [PMem Spill](https://github.com/oap-project/pmem-spill/tree/v1.1.1-spark-3.1.1)
+* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle/tree/v1.1.1-spark-3.1.1#5-install-dependencies-for-pmem-shuffle)
+* [Remote Shuffle](https://github.com/oap-project/remote-shuffle/tree/v1.1.1-spark-3.1.1)
+* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.1.1-spark-3.1.1)
+* [Native SQL Engine](https://github.com/oap-project/native-sql-engine/tree/v1.1.1-spark-3.1.1)
 ## Building OAP
@@ -22,7 +22,7 @@ We provide scripts to help automatically install dependencies required, please c
 # cd oap-tools
 # sh dev/install-compile-time-dependencies.sh
 ```
-*Note*: oap-tools tag version `v1.1.0-spark-3.0.0` corresponds to all OAP modules' tag version `v1.1.0-spark-3.0.0`.
+*Note*: oap-tools tag version `v1.1.1-spark-3.1.1` corresponds to all OAP modules' tag version `v1.1.1-spark-3.1.1`.
 Then the dependencies below will be installed:
diff --git a/docs/OAP-Installation-Guide.md b/docs/OAP-Installation-Guide.md
index d98fe7ef8..bdf10b745 100644
--- a/docs/OAP-Installation-Guide.md
+++ b/docs/OAP-Installation-Guide.md
@@ -20,7 +20,7 @@ $ wget -c https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
 $ chmod +x Miniconda2-latest-Linux-x86_64.sh
 $ bash Miniconda2-latest-Linux-x86_64.sh
 ```
-For changes to take effect, ***reload*** your current shell.
+For changes to take effect, ***close and re-open*** your current shell.
 To test your installation, run the command `conda list` in your terminal window. A list of installed packages appears if it has been installed correctly.
 ### Installing OAP
@@ -29,7 +29,7 @@ Create a Conda environment and install OAP Conda package.
 ```bash
 $ conda create -n oapenv -y python=3.7
 $ conda activate oapenv
-$ conda install -c conda-forge -c intel -y oap=1.1.0
+$ conda install -c conda-forge -c intel -y oap=1.1.1
 ```
 Once finished steps above, you have completed OAP dependencies installation and OAP building, and will find built OAP jars under `$HOME/miniconda2/envs/oapenv/oap_jars`
@@ -38,8 +38,8 @@ Dependencies below are required by OAP and all of them are included in OAP Conda
 - [Arrow](https://github.com/oap-project/arrow/tree/arrow-4.0.0-oap-1.1.1)
 - [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
-- [Memkind](https://anaconda.org/intel/memkind)
-- [Vmemcache](https://anaconda.org/intel/vmemcache)
+- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1)
+- [Vmemcache](https://github.com/pmem/vmemcache.git)
 - [HPNL](https://anaconda.org/intel/hpnl)
 - [PMDK](https://github.com/pmem/pmdk)
 - [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
diff --git a/docs/Prerequisite.md b/docs/Prerequisite.md
index b0bf543e9..3c29c492f 100644
--- a/docs/Prerequisite.md
+++ b/docs/Prerequisite.md
@@ -9,8 +9,8 @@ Please make sure you have already installed the software in your system.
 4. cmake 3.16 or higher version
 5. Maven 3.6.3 or higher version
 6. Hadoop 2.7.5 or higher version
-7. Spark 3.0.0 or higher version
-8. Intel Optimized Arrow 3.0.0
+7. Spark 3.1.1 or higher version
+8. Intel Optimized Arrow 4.0.0
 ## gcc installation
diff --git a/docs/SparkInstallation.md b/docs/SparkInstallation.md
index 9d2a864ae..d9fd4a65f 100644
--- a/docs/SparkInstallation.md
+++ b/docs/SparkInstallation.md
@@ -1,12 +1,12 @@
-### Download Spark 3.0.1
+### Download Spark 3.1.1
-Currently Native SQL Engine works on the Spark 3.0.1 version.
+Currently Native SQL Engine works with Spark 3.1.1.
 ```
-wget http://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
-sudo mkdir -p /opt/spark && sudo mv spark-3.0.1-bin-hadoop3.2.tgz /opt/spark
-sudo cd /opt/spark && sudo tar -xf spark-3.0.1-bin-hadoop3.2.tgz
-export SPARK_HOME=/opt/spark/spark-3.0.1-bin-hadoop3.2/
+wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
+sudo mkdir -p /opt/spark && sudo mv spark-3.1.1-bin-hadoop3.2.tgz /opt/spark
+cd /opt/spark && sudo tar -xf spark-3.1.1-bin-hadoop3.2.tgz
+export SPARK_HOME=/opt/spark/spark-3.1.1-bin-hadoop3.2/
 ```
 ### [Or building Spark from source](https://spark.apache.org/docs/latest/building-spark.html)
diff --git a/docs/User-Guide.md b/docs/User-Guide.md
index 725d30c9f..c4d904739 100644
--- a/docs/User-Guide.md
+++ b/docs/User-Guide.md
@@ -6,7 +6,11 @@ A Native Engine for Spark SQL with vectorized SIMD optimizations
 ![Overview](./image/nativesql_arch.png)
-Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technologies and brought better performance to Spark SQL.
+Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance by generating Java JIT code. However, Java JIT usually does not make good use of the latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provides a CPU-cache friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine Gandiva are also very efficient.
+
+Native SQL Engine reimplements the Spark SQL execution layer with SIMD-friendly columnar data processing based on Apache Arrow,
+and leverages Arrow's CPU-cache friendly columnar in-memory layout, SIMD-optimized kernels and LLVM-based expression engine to bring better performance to Spark SQL.
+
 ## Key Features
@@ -36,7 +40,20 @@ We implemented columnar shuffle to improve the shuffle performance. With the col
 Please check the operator supporting details [here](./operators.md)
-## Build the Plugin
+## How to use OAP: Native SQL Engine
+
+You can use OAP: Native SQL Engine in three ways:
+1. Use precompiled jars
+2. Building by Conda Environment
+3. Building by Yourself
+
+### Use precompiled jars
+
+Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find Native SQL Engine jars.
+For usage, you will need the two jar files below:
+1. spark-arrow-datasource-standard--jar-with-dependencies.jar is located in com/intel/oap/spark-arrow-datasource-standard//
+2. spark-columnar-core--jar-with-dependencies.jar is located in com/intel/oap/spark-columnar-core//
+Please note that these files are fat jars shipped with our custom Arrow library and pre-compiled on our server (using GCC 9.3.0 and LLVM 7.0.1), which means you will need to pre-install GCC 9.3.0 and LLVM 7.0.1 on your system for normal usage.
 ### Building by Conda
@@ -47,18 +64,18 @@ Then you can just skip below steps and jump to [Get Started](#get-started).
 If you prefer to build from the source code on your hand, please follow below steps to set up your environment.
-### Prerequisite
+#### Prerequisite
+
 There are some requirements before you build the project.
 Please check the document [Prerequisite](./Prerequisite.md) and make sure you have already installed the software in your system.
 If you are running a SPARK Cluster, please make sure all the software are installed in every single node.
-### Installation
-Please check the document [Installation Guide](./Installation.md)
+#### Installation
-### Configuration & Testing
-Please check the document [Configuration Guide](./Configuration.md)
+Please check the document [Installation Guide](./Installation.md)
 ## Get started
+
 To enable OAP NativeSQL Engine, the previous built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to Spark configuration.
 We also recommend to use `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example by using both jar files.
 SPARK related options are:
@@ -71,6 +88,8 @@ SPARK related options are:
 For Spark Standalone Mode, please set the above value as relative path to the jar file.
 For Spark Yarn Cluster Mode, please set the above value as absolute path to the jar file.
+For more configuration options, please check the document [Configuration Guide](./Configuration.md).
+
 Example to run Spark Shell with ArrowDataSource jar file
 ```
 ${SPARK_HOME}/bin/spark-shell \
@@ -99,7 +118,7 @@ orders.createOrReplaceTempView("orders")
 spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
 ```
-The result should show up on Spark console and you can check the DAG diagram with some Columnar Processing stage. Native SQL engine still lacks some features, please check out the [limitations](./limitations.md).
+The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages. Native SQL Engine still lacks some features; please check out the [limitations](./limitations.md).
 ## Performance data
diff --git a/docs/index.md b/docs/index.md
index 725d30c9f..c4d904739 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,11 @@ A Native Engine for Spark SQL with vectorized SIMD optimizations
 ![Overview](./image/nativesql_arch.png)
-Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technologies and brought better performance to Spark SQL.
+Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance by generating Java JIT code. However, Java JIT usually does not make good use of the latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provides a CPU-cache friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine Gandiva are also very efficient.
+
+Native SQL Engine reimplements the Spark SQL execution layer with SIMD-friendly columnar data processing based on Apache Arrow,
+and leverages Arrow's CPU-cache friendly columnar in-memory layout, SIMD-optimized kernels and LLVM-based expression engine to bring better performance to Spark SQL.
+
 ## Key Features
@@ -36,7 +40,20 @@ We implemented columnar shuffle to improve the shuffle performance. With the col
 Please check the operator supporting details [here](./operators.md)
-## Build the Plugin
+## How to use OAP: Native SQL Engine
+
+You can use OAP: Native SQL Engine in three ways:
+1. Use precompiled jars
+2. Building by Conda Environment
+3. Building by Yourself
+
+### Use precompiled jars
+
+Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find Native SQL Engine jars.
+For usage, you will need the two jar files below:
+1. spark-arrow-datasource-standard--jar-with-dependencies.jar is located in com/intel/oap/spark-arrow-datasource-standard//
+2. spark-columnar-core--jar-with-dependencies.jar is located in com/intel/oap/spark-columnar-core//
+Please note that these files are fat jars shipped with our custom Arrow library and pre-compiled on our server (using GCC 9.3.0 and LLVM 7.0.1), which means you will need to pre-install GCC 9.3.0 and LLVM 7.0.1 on your system for normal usage.
 ### Building by Conda
@@ -47,18 +64,18 @@ Then you can just skip below steps and jump to [Get Started](#get-started).
 If you prefer to build from the source code on your hand, please follow below steps to set up your environment.
-### Prerequisite
+#### Prerequisite
+
 There are some requirements before you build the project.
 Please check the document [Prerequisite](./Prerequisite.md) and make sure you have already installed the software in your system.
 If you are running a SPARK Cluster, please make sure all the software are installed in every single node.
-### Installation
-Please check the document [Installation Guide](./Installation.md)
+#### Installation
-### Configuration & Testing
-Please check the document [Configuration Guide](./Configuration.md)
+Please check the document [Installation Guide](./Installation.md)
 ## Get started
+
 To enable OAP NativeSQL Engine, the previous built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to Spark configuration.
 We also recommend to use `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example by using both jar files.
 SPARK related options are:
@@ -71,6 +88,8 @@ SPARK related options are:
 For Spark Standalone Mode, please set the above value as relative path to the jar file.
 For Spark Yarn Cluster Mode, please set the above value as absolute path to the jar file.
+For more configuration options, please check the document [Configuration Guide](./Configuration.md).
+
 Example to run Spark Shell with ArrowDataSource jar file
 ```
 ${SPARK_HOME}/bin/spark-shell \
@@ -99,7 +118,7 @@ orders.createOrReplaceTempView("orders")
 spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
 ```
-The result should show up on Spark console and you can check the DAG diagram with some Columnar Processing stage. Native SQL engine still lacks some features, please check out the [limitations](./limitations.md).
+The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages. Native SQL Engine still lacks some features; please check out the [limitations](./limitations.md).
 ## Performance data
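
As a usage illustration of the "Get started" guidance added in the User-Guide and index.md changes above, here is a minimal sketch of launching spark-shell with both fat jars on the driver and executor classpaths. The `/path/to` locations and the `<version>` placeholder are hypothetical stand-ins (the docs deliberately omit the concrete artifact version), so substitute the jars produced by your build or downloaded from Maven Central.

```bash
# Minimal sketch, assuming hypothetical jar paths and a <version> placeholder
# (not taken from the diff): add both OAP fat jars to the driver and executor
# classpaths, as the updated Get started section describes.
DATASOURCE_JAR="/path/to/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar"
COLUMNAR_JAR="/path/to/spark-columnar-core-<version>-jar-with-dependencies.jar"

${SPARK_HOME}/bin/spark-shell \
  --jars ${DATASOURCE_JAR},${COLUMNAR_JAR} \
  --conf spark.driver.extraClassPath=${DATASOURCE_JAR}:${COLUMNAR_JAR} \
  --conf spark.executor.extraClassPath=${DATASOURCE_JAR}:${COLUMNAR_JAR}
```

Per the docs' note on path handling, the classpath values should be relative paths for Spark Standalone Mode and absolute paths for Spark Yarn Cluster Mode.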