From bf721f363d287ec2ce59fe575643b61e2290e896 Mon Sep 17 00:00:00 2001 From: aiceflower Date: Wed, 19 Jul 2023 22:30:59 +0800 Subject: [PATCH] 1.4.0 feature update --- docs/deployment/deploy-quick.md | 4 +- docs/engine-usage/impala.md | 76 +-- docs/feature/base-engine-compatibilty.md | 72 ++- docs/feature/datasource-generate-sql.md | 182 ++++--- docs/feature/ecm-takes-over-ec.md | 12 - .../feature/hive-engine-support-concurrent.md | 25 +- docs/feature/other.md | 30 ++ docs/feature/overview.md | 24 +- docs/feature/remove-json4s-from-linkis.md | 11 - .../remove-underlying-engine-depdency.md | 17 - docs/feature/spark-etl.md | 499 +++++++++++------- docs/feature/storage-add-support-oss.md | 36 -- docs/feature/upgrade-base-engine-version.md | 20 - docs/feature/version-and-branch-intro.md | 13 - .../current/deployment/deploy-quick.md | 6 +- .../current/engine-usage/impala.md | 74 +-- .../feature/base-engine-compatibilty.md | 62 ++- .../feature/datasource-generate-sql.md | 144 ++--- .../current/feature/ecm-takes-over-ec.md | 10 - .../feature/hive-engine-support-concurrent.md | 14 +- .../current/feature/other.md | 31 ++ .../current/feature/overview.md | 15 +- .../feature/remove-json4s-from-linkis.md | 11 - .../remove-underlying-engine-depdency.md | 17 - .../current/feature/spark-etl.md | 441 ++++++++++------ .../feature/storage-add-support-oss.md | 32 -- .../feature/upgrade-base-engine-version.md | 17 - .../feature/version-and-branch-intro.md | 13 - 28 files changed, 1036 insertions(+), 872 deletions(-) delete mode 100644 docs/feature/ecm-takes-over-ec.md create mode 100644 docs/feature/other.md delete mode 100644 docs/feature/remove-json4s-from-linkis.md delete mode 100644 docs/feature/remove-underlying-engine-depdency.md delete mode 100644 docs/feature/storage-add-support-oss.md delete mode 100644 docs/feature/upgrade-base-engine-version.md delete mode 100644 docs/feature/version-and-branch-intro.md delete mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/ecm-takes-over-ec.md create mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/other.md delete mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-json4s-from-linkis.md delete mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-underlying-engine-depdency.md delete mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/storage-add-support-oss.md delete mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/upgrade-base-engine-version.md delete mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/version-and-branch-intro.md diff --git a/docs/deployment/deploy-quick.md b/docs/deployment/deploy-quick.md index 19e2f036da9..73ecd2e83e1 100644 --- a/docs/deployment/deploy-quick.md +++ b/docs/deployment/deploy-quick.md @@ -235,7 +235,7 @@ HADOOP_KEYTAB_PATH=/appcom/keytab/ > > Note: Linkis has not adapted permissions for S3, so it is not possible to grant authorization for it. 
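+
+Before filling in the properties below, it can help to verify outside of Linkis that the access key really reaches the bucket. A minimal sketch, assuming the AWS CLI is installed (any S3-compatible client works); the key, bucket and region values are placeholders and must match what is configured later:
+
+```shell script
+# hypothetical values: reuse the same key and bucket that will go into linkis.properties
+export AWS_ACCESS_KEY_ID=xxx
+export AWS_SECRET_ACCESS_KEY=xxx
+aws s3 ls s3://linkis --region us-east-1
+```
+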
-`vim linkis.properties` +`vim $LINKIS_HOME/conf/linkis.properties` ```shell script # s3 file system linkis.storage.s3.access.key=xxx @@ -245,7 +245,7 @@ linkis.storage.s3.region=xxx linkis.storage.s3.bucket=xxx ``` -`vim linkis-cg-entrance.properties` +`vim $LINKIS_HOME/conf/linkis-cg-entrance.properties` ```shell script wds.linkis.entrance.config.log.path=s3:///linkis/logs wds.linkis.resultSet.store.path=s3:///linkis/results diff --git a/docs/engine-usage/impala.md b/docs/engine-usage/impala.md index 282dd8073d3..8fa24bf0b03 100644 --- a/docs/engine-usage/impala.md +++ b/docs/engine-usage/impala.md @@ -1,42 +1,32 @@ --- title: Impala -sidebar_position: 15 +sidebar_position: 12 --- This article mainly introduces the installation, usage and configuration of the `Impala` engine plugin in `Linkis`. - ## 1. Pre-work -### 1.1 Engine installation +### 1.1 Environment installation -If you want to use the `Impala` engine on your `Linkis` service, you need to prepare the Impala service and provide connection information, such as the connection address of the Impala cluster, SASL username and password, etc. +If you want to use the Impala engine on your server, you need to prepare the Impala service and provide connection information, such as the connection address of the Impala cluster, SASL user name and password, etc. -### 1.2 Service Verification +### 1.2 Environment verification -```shell -# prepare trino-cli -wget https://repo1.maven.org/maven2/io/trino/trino-cli/374/trino-cli-374-executable.jar -mv trill-cli-374-executable.jar trill-cli -chmod +x trino-cli - -# Execute the task -./trino-cli --server localhost:8080 --execute 'show tables from system.jdbc' - -# Get the following output to indicate that the service is available -"attributes" -"catalogs" -"columns" -"procedure_columns" -"procedures" -"pseudo_columns" -"schemas" -"super_tables" -"super_types" -"table_types" -"tables" -"types" -"udts" +Execute the impala-shell command to get the following output, indicating that the impala service is available. +``` +[root@8f43473645b1 /]# impala-shell +Starting Impala Shell without Kerberos authentication +Connected to 8f43473645b1:21000 +Server version: impalad version 2.12.0-cdh5.15.0 RELEASE (build 23f574543323301846b41fa5433690df32efe085) +***************************************************** ********************************* +Welcome to the Impala shell. +(Impala Shell v2.12.0-cdh5.15.0 (23f5745) built on Thu May 24 04:07:31 PDT 2018) + +When pretty-printing is disabled, you can use the '--output_delimiter' flag to set +the delimiter for fields in the same row. The default is ','. +***************************************************** ********************************* +[8f43473645b1:21000] > ``` ## 2. 
Engine plugin deployment @@ -101,7 +91,7 @@ select * from linkis_cg_engine_conn_plugin_bml_resources; ```shell sh ./bin/linkis-cli -submitUser impala \ --engineType impala-3.4.0 -code 'select * from default.test limit 10' \ +-engineType impala-3.4.0 -code 'show databases;' \ -runtimeMap linkis.es.http.method=GET \ -runtimeMap linkis.impala.servers=127.0.0.1:21050 ``` @@ -143,37 +133,23 @@ More `Linkis-Cli` command parameter reference: [Linkis-Cli usage](../user-guide/ If the default parameters are not satisfied, there are the following ways to configure some basic parameters -#### 4.2.1 Management console configuration - -![](./images/trino-config.png) - -Note: After modifying the configuration under the `IDE` tag, you need to specify `-creator IDE` to take effect (other tags are similar), such as: - -```shell -sh ./bin/linkis-cli -creator IDE -submitUser hadoop \ - -engineType impala-3.4.0 -codeType sql \ - -code 'select * from system.jdbc.schemas limit 10' -``` - -#### 4.2.2 Task interface configuration +#### 4.2.1 Task interface configuration Submit the task interface and configure it through the parameter `params.configuration.runtime` ```shell Example of http request parameters { - "executionContent": {"code": "select * from system.jdbc.schemas limit 10;", "runType": "sql"}, + "executionContent": {"code": "show databases;", "runType": "sql"}, "params": { "variable": {}, "configuration": { "runtime": { - "linkis.trino.url":"http://127.0.0.1:8080", - "linkis.trino.catalog ":"hive", - "linkis.trino.schema ":"default" - } + "linkis.impala.servers"="127.0.0.1:21050" } - }, + } + }, "labels": { - "engineType": "trino-371", + "engineType": "impala-3.4.0", "userCreator": "hadoop-IDE" } } @@ -185,7 +161,7 @@ Example of http request parameters ``` linkis_ps_configuration_config_key: Insert the key and default values ​​​​of the configuration parameters of the engine -linkis_cg_manager_label: insert engine label such as: trino-375 +linkis_cg_manager_label: insert engine label such as: impala-3.4.0 linkis_ps_configuration_category: Insert the directory association of the engine linkis_ps_configuration_config_value: Insert the configuration that the engine needs to display linkis_ps_configuration_key_engine_relation: the relationship between configuration items and engines diff --git a/docs/feature/base-engine-compatibilty.md b/docs/feature/base-engine-compatibilty.md index e0c615f40a1..81062d30c43 100644 --- a/docs/feature/base-engine-compatibilty.md +++ b/docs/feature/base-engine-compatibilty.md @@ -1,36 +1,74 @@ --- -title: reduce base engine compatibility issues +title: Base Engine Dependency, Compatibility, Default Version Optimization sidebar_position: 0.2 --- -## 1. Requirement Background -before we may need to modify linkis source code to fit different hive and spark version and compilation may fail for some certain versions, we need to reduce compilation and installation problems caused by base engine versions +## 1. Requirement background +1. The lower version of linkis needs to modify the code to adapt to different versions of Hive, Spark, etc. Because of compatibility issues, the compilation may fail, which can reduce the compatibility issues of these basic engines. +2. Hadoop, Hive, and Spark 3.x are very mature, and lower versions of the engine may have potential risks. Many users in the community use the 3.x version by default, so consider changing the default compiled version of Linkis to 3.x. ## 2. 
Instructions for use -for different hive compilation, we only to compile with target hive versions, such as -``` -mvn clean install package -Dhive.version=3.1.3 -``` +## 2.1 Default version adjustment instructions + +Linkis 1.4.0 changes the default versions of Hadoop, Hive, and Spark to 3.x, and the specific versions are Hadoop 3.3.4, Hive 3.1.3, and Spark 3.2.1. -for different spark compilation, we only to compile with target spark versions, here are normal scenes for usage. +## 2.2 Different version adaptation + +To compile different hive versions, we only need to specify `-D=xxx`, for example: +``` +mvn clean install package -Dhive.version=2.3.3 ``` -spark3+hadoop3 +To compile different versions of spark, we only need to specify `-D=xxx`. Common usage scenarios are as follows: +``` +#spark3+hadoop3 mvn install package -spark3+hadoop2 -mvn install package -Phadoop-2.7 +#spark3+hadoop2 +mvn install package -Phadoop-2.7 -spark2+hadoop2 +#spark2+hadoop2 mvn install package -Pspark-2.4 -Phadoop-2.7 -spark2+ hadoop3 +#spark2+ hadoop3 mvn install package -Pspark-2.4 - ``` ## 3. Precautions -spark subversion can be specified by -Dspark.version=xxx -hadoop subversion can be specified by -Dhadoop.version=xxx +1. When the default version is compiled, the basic version is: hadoop3.3.4 + hive3.1.3 + spark3.2.1 +``` +mvn install package +``` +Due to the default version upgrade of the default base engine, `spark-3.2`, `hadoop-3.3` and `spark-2.4-hadoop-3.3` profiles were removed, and profiles `hadoop-2.7` and `spark-2.4` were added. + +2. The sub-version of spark can be specified by `-Dspark.version=xxx`. The default scala version used by the system is 2.12.17, which can be adapted to spark 3.x version. To compile spark 2.x, you need to use scala 2.11 version. Can be compiled with -Pspark-2.4 parameter, or -Dspark.version=2.xx -Dscala.version=2.11.12 -Dscala.binary.version=2.11. + +3. The subversion of hadoop can be specified by `-Dhadoop.version=xxx` for example : -mvn install package -Pspark-3.2 -Phadoop-3.3 -Dspark.version=3.1.3 \ No newline at end of file +``` +mvn install package -Pspark-3.2 -Phadoop-3.3 -Dspark.version=3.1.3 +``` + +4. Version 2.x of hive needs to rely on jersey. Hive EC does not add jersey dependency when compiling by default. You can compile it through the following guidelines. + +**Compile hive version 2.3.3** + +When compiling hive EC, the profile that activates adding jersey dependencies when specifying version 2.3.3 is added by default. Users can compile by specifying the -Dhive.version=2.3.3 parameter + +**Compile other hive 2.x versions** + +Modify the linkis-engineconn-plugins/hive/pom.xml file, modify 2.3.3 to the user-compiled version, such as 2.1.0 +```xml + + hive-jersey-dependencies + + + hive.version + + 2.1.0 + + + ... + +``` +Add -Dhive.version=2.1.0 parameter when compiling. diff --git a/docs/feature/datasource-generate-sql.md b/docs/feature/datasource-generate-sql.md index 32178e36b55..e9b0ec5a341 100644 --- a/docs/feature/datasource-generate-sql.md +++ b/docs/feature/datasource-generate-sql.md @@ -1,28 +1,31 @@ --- -title: Generate SQL from the data source -sidebar_position: 0.2 ---- +title: Generate SQL according to the data source +sidebar_position: 0.5 +--- ## 1. Background -SparkSQL and JdbcSQL are generated based on data source information, including DDL, DML, and DQL +Generate SparkSQL and JdbcSQL based on data source information, including DDL, DML, and DQL. ## 2. 
Instructions for use
-### Generate SparkSQL
-Parameter Description:
+### Generate SparkSQL
-| parameter name | description | default value |
-|------------------------------|-------|-----|
-| `dataSourceName` | Data source name | - |
-| `system` | System name | - |
-| `database` | Database name | - |
-| `table` | Table name | - |
+Interface address: /api/rest_j/v1/metadataQuery/getSparkSql
-Submit the task through RestFul, the request example is as follows.
-```json
-GET /api/rest_j/v1/metadataQuery/getSparkSql?dataSourceName=mysql&system=system&database=test&table=test
-```
+Request method: GET
+
+Request data type: application/x-www-form-urlencoded
+
+Request parameters:
+
+| Parameter name | Description | Required | Data type |
+|-------------------------------|-------|-----|--|
+| `dataSourceName` | data source name | Yes | String |
+| `system` | system name | Yes | String |
+| `database` | database name | Yes | String |
+| `table` | table name | Yes | String |
+
+Example response:
-The following is an example of the response.
```json { - "method": null, - "status": 0, - "message": "OK", - "data": { - "jdbcSql": { - "ddl": "CREATE TABLE `test` (\n\t `id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT '列名是id',\n\t `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT '列名是name',\n\t PRIMARY KEY (`id`)\n\t) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci", - "dml": "INSERT INTO test SELECT * FROM ${resultTable}", - "dql": "SELECT id,name FROM test" + "method": null, + "status": 0, + "message": "OK", + "data": { + "jdbcSql": { + "ddl": "CREATE TABLE `test` (\n\t `id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT 'The column name is id',\n\t `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT 'The column name is name',\n\t PRIMARY KEY (`id`)\n\t) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci", + "dml": "INSERT INTO test SELECT * FROM ${resultTable}", + "dql": "SELECT id,name FROM test" + } } - } } ``` -Currently, jdbc data sources are supported, such as mysql, oracle, and postgres. JdbcSQLDdl can be used for front-end display + +Currently supports JDBC data sources, such as: mysql, oracle, postgres, etc. JdbcSQLDDL can be used for front-end display. ## 3. Precautions 1. You need to register the data source first ## 4. Implementation principle -### Generate SparkSQL implementation principles -Define DDL_SQL_TEMPLATE to retrieve data source information for replacement +### Generate SparkSQL implementation principle +Define DDL_SQL_TEMPLATE to obtain data source information for replacement ```java public static final String JDBC_DDL_SQL_TEMPLATE = - "CREATE TEMPORARY TABLE %s " - + "USING org.apache.spark.sql.jdbc " + "CREATE TEMPORARY TABLE %s" + + "USING org.apache.spark.sql.jdbc" + "OPTIONS (" - + " url '%s'," - + " dbtable '%s'," - + " user '%s'," - + " password '%s'" + + "url '%s'," + + "dbtable '%s'," + + " user '%s'," + + "password '%s'" + ")"; ``` -### Generate JdbcSQL implementation principles -Concatenate DDL based on the table schema information +### Generate JdbcSQL implementation principle +Splicing DDL according to table schema information ```java - public String generateJdbcDdlSql(String database, String table) { - StringBuilder ddl = new StringBuilder(); - ddl.append("CREATE TABLE ").append(String.format("%s.%s", database, table)).append(" ("); - - try { - List columns = getColumns(database, table); - if (CollectionUtils.isNotEmpty(columns)) { - for (MetaColumnInfo column : columns) { - ddl.append("\n\t").append(column.getName()).append(" ").append(column.getType()); - if (column.getLength() > 0) { - ddl.append("(").append(column.getLength()).append(")"); - } - if (!column.isNullable()) { - ddl.append(" NOT NULL"); - } - ddl.append(","); - } - String primaryKeys = - columns.stream() - .filter(MetaColumnInfo::isPrimaryKey) - .map(MetaColumnInfo::getName) - .collect(Collectors.joining(", ")); - if (StringUtils.isNotBlank(primaryKeys)) { - ddl.append(String.format("\n\tPRIMARY KEY (%s),", primaryKeys)); - } - ddl.deleteCharAt(ddl.length() - 1); - } - } catch (Exception e) { - LOG.warn("Fail to get Sql columns(获取字段列表失败)"); +public String generateJdbcDdlSql(String database, String table) { + StringBuilder ddl = new StringBuilder(); + ddl.append("CREATE TABLE ").append(String.format("%s.%s", database, table)).append(" ("); + + try { + List < MetaColumnInfo > columns = getColumns(database, table); + if (CollectionUtils. 
isNotEmpty(columns)) { + for (MetaColumnInfo column: columns) { + ddl.append("\n\t").append(column.getName()).append(" ").append(column.getType()); + if (column. getLength() > 0) { + ddl.append("(").append(column.getLength()).append(")"); + } + if (!column. isNullable()) { + ddl.append("NOT NULL"); + } + ddl.append(","); + } + String primaryKeys = + columns. stream() + .filter(MetaColumnInfo::isPrimaryKey) + .map(MetaColumnInfo::getName) + .collect(Collectors.joining(", ")); + if (StringUtils. isNotBlank(primaryKeys)) { + ddl.append(String.format("\n\tPRIMARY KEY (%s),", primaryKeys)); + } + ddl. deleteCharAt(ddl. length() - 1); } + } catch (Exception e) { + LOG.warn("Fail to get Sql columns(Failed to get the field list)"); + } - ddl.append("\n)"); + ddl.append("\n)"); - return ddl.toString(); - } + return ddl. toString(); +} ``` -Some data sources support fetching DDL directly -mysql +Some data sources support direct access to DDL + +**mysql** ```sql SHOW CREATE TABLE 'table' ``` -oracle +**oracle** ```sql -SELECT DBMS_METADATA.GET_DDL('TABLE', 'table', 'database') AS DDL FROM DUAL +SELECT DBMS_METADATA.GET_DDL('TABLE', 'table', 'database') AS DDL FROM DUAL ``` \ No newline at end of file diff --git a/docs/feature/ecm-takes-over-ec.md b/docs/feature/ecm-takes-over-ec.md deleted file mode 100644 index d2714e94882..00000000000 --- a/docs/feature/ecm-takes-over-ec.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -title: ECM takes over EC -sidebar_position: 0.2 ---- - -## 1. Requirement Background -When the ECM restarts, can choose not to kill the engine, and it can take over the existing surviving engine. -Make engine conn manager(ECM) service stateless . - - -## 2. Instructions for use -this feature is enabled by default \ No newline at end of file diff --git a/docs/feature/hive-engine-support-concurrent.md b/docs/feature/hive-engine-support-concurrent.md index 53b0e7c0e3d..fc19a66f916 100644 --- a/docs/feature/hive-engine-support-concurrent.md +++ b/docs/feature/hive-engine-support-concurrent.md @@ -1,21 +1,24 @@ --- -title: hive engine support concurrent -sidebar_position: 0.2 +title: hive engine supports concurrency and multiplexing +sidebar_position: 0.3 --- -## 1. Requirement Background -hiveEngineConn supports concurrency, reducing the resource consumption of starting hive engine. +## 1. Requirement background +hiveEngineConn supports concurrency, reduces the resource consumption of starting the hive engine, and improves the engine reuse rate. ## 2. Instructions for use -First, modify linkis-engineconn.properties file in linkis-engineconn-plugins/hive/src/main/resources directory, -and set linkis.hive.engineconn.concurrent.support to true. -``` -# support parallelism execution -linkis.hive.engineconn.concurrent.support=true +First, modify the linkis-engineconn.properties file in the linkis-engineconn-plugins/hive/src/main/resources directory, +And set linkis.hive.engineconn.concurrent.support to true. ``` +# Support parallel execution +wds.linkis.engineconn.support.parallelism=true -Second, submit a hive task ,when first task finished ,submit another task. Your could see hive engine has been reused. +# Concurrency limit, the default is 10 +linkis.hive.engineconn.concurrent.limit=10 +``` +Submit a hive job, and when the first job is complete, submit another job. You can see that the hive engine has been reused. 
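+
+For example, a minimal check with `linkis-cli` (a sketch only; it assumes the default hive-3.1.3 engine of Linkis 1.4.0 and the hadoop user, and the queries are placeholders):
+
+```shell
+# first submission starts a hive EC
+sh ./bin/linkis-cli -engineType hive-3.1.3 -codeType hql -code "show databases" -submitUser hadoop -proxyUser hadoop
+
+# second submission, sent after the first finishes, is executed by the same EC
+sh ./bin/linkis-cli -engineType hive-3.1.3 -codeType hql -code "show tables" -submitUser hadoop -proxyUser hadoop
+```
+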
+Restart the cg-linkismanager service after modifying the configuration, or make the configuration take effect through the [Engine Refresh API](../api/http/linkis-cg-engineplugin-api/engineconn-plugin-refresh.md).
## 3. Precautions
-1, Submit second hive task when first task has been finished.
+1. Wait for the first hive task to execute successfully before submitting the second hive task. Submitting multiple tasks at the same time on first use may start multiple ECs, because no idle EC is available yet.
\ No newline at end of file
diff --git a/docs/feature/other.md b/docs/feature/other.md
new file mode 100644
index 00000000000..293c79620cc
--- /dev/null
+++ b/docs/feature/other.md
@@ -0,0 +1,30 @@
+---
+title: Description of other features
+sidebar_position: 0.6
+---
+
+## 1. Linkis 1.4.0 other feature upgrade instructions
+
+### 1.1 Do not kill EC when ECM restarts
+When the ECM restarts, it can choose not to kill its engines and instead take over the existing surviving engines, which makes the engine conn manager (ECM) service stateless.
+
+### 1.2 Remove json4s dependency
+Different versions of spark depend on different json4s versions, which makes it hard to support multiple spark versions, so the json4s dependency has been removed from linkis.
+For example: spark2.4 needs json4s v3.5.3, spark3.2 needs json4s v3.7.0-M11.
+
+### 1.3 EngineConn module definition depends on engine version
+The version definition of the engine used to live in `EngineConn` by default, so any version change had to be made in many places. The version definitions have therefore been moved to the top-level pom file. When compiling a specified engine module, it needs to be compiled in the project root directory, using `-pl` to select the specific engine module, for example:
+```
+mvn install package -pl linkis-engineconn-plugins/spark -Dspark.version=3.2.1
+```
+The engine version can be specified by the `-D` parameter of the mvn compilation, such as `-Dspark.version=xxx` or `-Dpresto.version=0.235`.
+At present, all underlying engine versions have been moved to the top-level pom file.
+
+### 1.4 Linkis main version number modification instructions
+
+Linkis will no longer be upgraded by minor version after version 1.3.2. The next version will be 1.4.0, and later versions will be 1.5.0, 1.6.0 and so on. When a released version has a major defect that needs to be fixed, a patch version will be pulled to fix it, such as 1.4.1.
+
+
+### 1.5 Linkis code submission main branch description
+
+The code changes of Linkis 1.3.2 and earlier versions were merged into the dev branch by default. The Apache Linkis development community is in fact very active, and new features and fixes are continuously submitted to the dev branch, but when users visit the Linkis code base the master branch is displayed by default. Since a new version is only released every quarter, the community looks less active than it is when viewed from the master branch. Therefore, starting from version 1.4.0, the code submitted by developers is merged into the master branch by default.
diff --git a/docs/feature/overview.md b/docs/feature/overview.md index fdd8a887191..6937df5ecbe 100644 --- a/docs/feature/overview.md +++ b/docs/feature/overview.md @@ -1,17 +1,16 @@ ---- -title: Version Feature -sidebar_position: 0.1 ---- +--- +title: version overview +sidebar_position: 0.1 +--- -- [hadoop, spark, hive default version upgraded to 3.x](./upgrade-base-engine-version.md) -- [Reduce compatibility issues of different versions of the base engine](./base-engine-compatibilty.md) +- [Base engine dependencies, compatibility, default version optimization](./base-engine-compatibilty.md) - [Hive engine connector supports concurrent tasks](./hive-engine-support-concurrent.md) -- [linkis-storage supports S3 and OSS file systems](./storage-add-support-oss.md) -- [Support more data sources](./spark-etl.md) -- [Add postgresql database support](../deployment/deploy-quick.md) -- [Do not kill EC when ECM restarts](./ecm-takes-over-ec.md) +- [add Impala plugin support](../engine-usage/impala.md) +- [linkis-storage supports S3 file system](../deployment/deploy-quick#s3-mode-optional) +- [Add postgresql database support](../deployment/deploy-quick#33-add-postgresql-driver-package-optional) - [Spark ETL enhancements](./spark-etl.md) -- [version number and branch modification instructions](./version-and-branch-intro.md) +- [Generate SQL from data source](./datasource-generate-sql.md) +- [Other feature description](./other.md) - [version of Release-Notes](/download/release-notes-1.4.0) ## Parameter changes @@ -22,9 +21,10 @@ sidebar_position: 0.1 | mg-eureka | Add | eureka.instance.metadata-map.linkis.conf.version | None | Eureka metadata report Linkis service version information | | mg-eureka | Modify | eureka.client.registry-fetch-interval-seconds | 8 | Eureka Client pull service registration information interval (seconds) | | mg-eureka | New | eureka.instance.lease-renewal-interval-in-seconds | 4 | The frequency (seconds) at which the eureka client sends heartbeats to the server | -| mg-eureka | new | eureka.instance.lease-expiration-duration-in-seconds | 12 | eureka waits for the next heartbeat timeout (seconds) | +| mg-eureka | new | eureka.instance.lease-expiration-duration-in-seconds | 12 | eureka waits for the next heartbeat timeout (seconds)| | EC-shell | Modify | wds.linkis.engineconn.support.parallelism | true | Whether to enable parallel execution of shell tasks | | EC-shell | Modify | linkis.engineconn.shell.concurrent.limit | 15 | Concurrent number of shell tasks | +| Entrance | Modify | linkis.entrance.auto.clean.dirty.data.enable | true | Whether to clean dirty data during startup | ## Database table changes diff --git a/docs/feature/remove-json4s-from-linkis.md b/docs/feature/remove-json4s-from-linkis.md deleted file mode 100644 index 69a7f5d04f5..00000000000 --- a/docs/feature/remove-json4s-from-linkis.md +++ /dev/null @@ -1,11 +0,0 @@ ---- -title: remove json4s from linkis -sidebar_position: 0.2 ---- - -## 1. Requirement Background -we need to bind specific json4s version and spark version, we need to reduce the dependency. -for example: spark2.4 use json4s v3.5.3, spark3.2 use json4s v3.7.0-M11 - -## 2. 
Instructions for use
-no need to specific json4s versions when we specify different spark versions
\ No newline at end of file
diff --git a/docs/feature/remove-underlying-engine-depdency.md b/docs/feature/remove-underlying-engine-depdency.md
deleted file mode 100644
index 8dc01599989..00000000000
--- a/docs/feature/remove-underlying-engine-depdency.md
+++ /dev/null
@@ -1,17 +0,0 @@
----
-title: EngineConn no longer imports the dependencies of the underlying engine
-sidebar_position: 0.2
----
-
-## 1. Requirement Background
-engine version hides in EngineConn, we may need to change several modules pom files if some engine version changes, we should move it to project pom file.
-
-## 2. Instructions for use
-if we need to compile only one engineconn module, we will need to compile with `-pl` to specific modules
-```
-mvn install package -pl linkis-engineconn-plugins/spark -Dspark.version=3.2.2
-
-```
-## 3. Precautions
-engine version can also be specified like -Dspark.version=xxx 、 -Dpresto.version=0.235
-now all our supported engine version has been moved to project pom file. please compile with the `-pl` command
\ No newline at end of file
diff --git a/docs/feature/spark-etl.md b/docs/feature/spark-etl.md
index e21f3b6c2c6..965f0509752 100644
--- a/docs/feature/spark-etl.md
+++ b/docs/feature/spark-etl.md
@@ -1,42 +1,77 @@
---
-title: Supports spark etl
-sidebar_position: 0.2
----
+title: Support spark ETL data synchronization
+sidebar_position: 0.4
+---
## 1. Background
-You can configure json for spark etl
+With the Spark ETL function, users can synchronize data through Spark by writing a json configuration.
## 2. Supported types
-Currently supported types
+Currently supported types:
```text
-jdbc、file、redis、kafka、elasticsearch、mongo、solr、rocketmq、datalake(hudi、delta)
+jdbc, file, redis, kafka, elasticsearch, mongo, datalake (hudi, delta)
```
## 3. General configuration instructions
```text
-name: Data source name
-type: Contain `source`、`transformation`、`sink`,Corresponding to input, transformation and output respectively
-options: Configuration parameter
-saveMode: Save mode, currently supported: `overwrite` and `append`
-path: File path,The value can be: 'file://' or 'hdfs://'(default)
-'resultTable' needs to correspond to 'sourceTable'
+name: data source name
+type: contains `source`, `transformation`, `sink`, corresponding to input, transformation, and output respectively
+options: configuration parameters
+saveMode: save mode, currently supports: `overwrite` and `append`
+path: file path, can be: 'file://' or 'hdfs://'(default)
+`resultTable` needs to correspond to `sourceTable`
```
-
## 4. Instructions for use
-Common etl examples are as follows:
-### jdbc
+### 4.1 Add the required jar package
+When using a data source, you need to upload the corresponding spark connector jar to the spark/jars directory, which is located at $SPARK_HOME/jars.
+
+The spark connector jars can be built with the following commands:
+
+```text
+git clone https://github.com/apache/linkis.git
+
+cd linkis
+
+git checkout master
+
+cd linkis-engineconn-plugins/spark/scala-2.12
+
+mvn clean install -Dmaven.test.skip=true
+```
+
+The compiled spark connector jars are located in the following directory:
+```text
+linkis/linkis-engineconn-plugins/spark/scala-2.12/target/out/spark/dist/3.2.1/lib
+```
+
+### 4.2 linkis-cli submit task example
+
+Just pass the specific json into the `code` parameter, paying attention to the escaping of quotation marks.
+ +```shell +sh /appcom/Install/linkis/bin/linkis-cli -engineType spark-3.2.1 -codeType data_calc -code "" -submitUser hadoop -proxyUser hadoop +``` + +Linkis-cli submits redis data synchronization task example +```shell +sh ./bin/linkis-cli -engineType spark-3.2.1 -codeType data_calc -code "{\"plugins\":[{\"name\":\"file\",\"type\":\" source\",\"config\":{\"resultTable\":\"test\",\"path\":\"hdfs://linkishdfs/tmp/linkis/spark_etl_test/etltest.dolphin\",\ "serializer\":\"csv\",\"options\":{\"header\":\"true\",\"delimiter\":\";\"},\"columnNames\":[ \"name\",\"age\"]}},{\"name\":\"redis\",\"type\":\"sink\",\"config\":{\"sourceTable \":\"test\",\"host\":\"wds07\",\"port\":\"6679\",\"auth\":\"password\",\"targetTable\" :\"spark_etl_test\",\"saveMode\":\"append\"}}]}" -submitUser hadoop -proxyUser hadoop +``` +### 4.3 Synchronization json script description of each data source + +### 4.3.1 jdbc -Configuration description +Configuration instructions ```text -url: jdbc连接信息 -user: User name +url: jdbc connection information +user: user name password: password query: sql query statement ``` +json code ```json { @@ -85,15 +120,25 @@ query: sql query statement } ``` -### file +A new jar needs to be added, and the corresponding jar should be selected according to the specific data source used +```text +DmJdbcDriver18.jar +kingbase8-8.6.0.jar +postgresql-42.3.8.jar +``` + +### 4.3.2 file + +Configuration instructions -Configuration description ```text -serializer: File format, can be 'csv', 'parquet', etc -columnNames: Column name +serializer: file format, can be `csv`, `parquet`, etc. +columnNames: column names ``` +json code + ```json { "sources": [ @@ -123,128 +168,49 @@ columnNames: Column name } ``` - -### delta - -Configuration description -```text -tableFormat: `hudi` and `delta` are currently supported +Need to add new jar ``` - - -Data write -```json -{ - "sources": [ - { - "name": "file", - "type": "source", - "config": { - "resultTable": "T1654611700631", - "path": "file://{filePath}/etltest.dolphin", - "serializer": "csv", - "options": { - "header":"true", - "delimiter":";" - }, - "columnNames": ["name", "age"] - } - } - ], - "sinks": [ - { - "name": "datalake", - "config": { - "sourceTable": "T1654611700631", - "tableFormat": "delta", - "path": "file://{filePath}/delta", - "saveMode": "overwrite" - } - } - ] -} -``` - -Data read -```json -{ - "sources": [ - { - "name": "datalake", - "type": "source", - "config": { - "resultTable": "T1654611700631", - "tableFormat": "delta", - "path": "file://{filePath}/delta", - } - } - ], - "sinks": [ - { - "name": "file", - "config": { - "sourceTable": "T1654611700631", - "path": "file://{filePath}/csv", - "saveMode": "overwrite", - "options": { - "header":"true" - }, - "serializer": "csv" - } - } - ] -} +spark-excel-2.12.17-3.2.2_2.12-3.2.2_0.18.1.jar ``` -### hudi +### 4.3.3 redis -Configuration description ```text -tableFormat: `hudi` and `delta` are currently supported +sourceTable: source table, +host: ip address, +port": port, +auth": password, +targetTable: target table, +saveMode: support append ``` - -Data write +json code ```json { - "sources": [ + "plugins":[ { "name": "file", "type": "source", "config": { - "resultTable": "T1654611700631", - "path": "file://{filePath}/etltest.dolphin", + "resultTable": "test", + "path": "hdfs://linkishdfs/tmp/linkis/spark_etl_test/etltest.dolphin", "serializer": "csv", "options": { - "header":"true", - "delimiter":";" + "header": "true", + "delimiter": ";" }, "columnNames": ["name", "age"] } - } - ], - 
"transformations": [ - { - "name": "sql", - "type": "transformation", - "config": { - "resultTable": "T111", - "sql": "select * from T1654611700631" - } - } - ], - "sinks": [ + }, { - "name": "datalake", + "name": "redis", + "type": "sink", "config": { - "sourceTable": "T1654611700631", - "tableFormat": "hudi", - "options": { - "hoodie.table.name":"huditest", - "hoodie.datasource.write.recordkey.field":"age", - "hoodie.datasource.write.precombine.field":"age" - }, - "path": "file://{filePath}/hudi", + "sourceTable": "test", + "host": "wds07", + "port": "6679", + "auth": "password", + "targetTable": "spark_etl_test", "saveMode": "append" } } @@ -252,59 +218,23 @@ Data write } ``` -Data read -```json -{ - "sources": [ - { - "name": "datalake", - "type": "source", - "config": { - "resultTable": "T1654611700631", - "tableFormat": "hudi", - "path": "file://{filePath}/hudi", - } - } - ], - "transformations": [ - { - "name": "sql", - "type": "transformation", - "config": { - "resultTable": "T111", - "sql": "select * from T1654611700631" - } - } - ], - "sinks": [ - { - "name": "file", - "config": { - "sourceTable": "T1654611700631", - "path": "file://{filePath}/csv", - "saveMode": "overwrite", - "options": { - "header":"true" - }, - "serializer": "csv" - } - } - ] -} +Need to add new jar +```text +jedis-3.2.0.jar +commons-pool2-2.8.1.jar +spark-redis_2.12-2.6.0.jar ``` +### 4.3.4 kafka -### kafka - -Configuration description +Configuration instructions ```text servers: kafka connection information -mode: Currently `batch` and `stream` are supported -topic: Name of the kafka topic +mode: currently supports `batch` and `stream` +topic: kafka topic name ``` - -Data write +Data written to json code ```json { "sources": [ @@ -316,8 +246,8 @@ Data write "path": "file://{filePath}/etltest.dolphin", "serializer": "csv", "options": { - "header":"true", - "delimiter":";" + "header": "true", + "delimiter": ";" }, "columnNames": ["name", "age"] } @@ -337,7 +267,7 @@ Data write } ``` -Data read +Data read json code ```json { "sources": [ @@ -365,9 +295,16 @@ Data read } ``` -### elasticsearch +Need to add new jar +``` +kafka-clients-2.8.0.jar +spark-sql-kafka-0-10_2.12-3.2.1.jar +spark-token-provider-kafka-0-10_2.12-3.2.1.jar +``` + +###elasticsearch -Configuration description +Configuration instructions ```text node: elasticsearch ip port: elasticsearch port @@ -375,7 +312,7 @@ index: elasticsearch index name ``` -Data write +Data written to json code ```json { "sources": [ @@ -387,8 +324,8 @@ Data write "path": "file://{filePath}/etltest.dolphin", "serializer": "csv", "options": { - "header":"true", - "delimiter":";" + "header": "true", + "delimiter": ";" }, "columnNames": ["name", "age"] } @@ -409,7 +346,7 @@ Data write } ``` -Data read +Data read json code ```json { "sources": [ @@ -438,10 +375,14 @@ Data read } ``` +Need to add new jar +``` +elasticsearch-spark-30_2.12-7.17.7.jar +``` -### mongo +###mongo -Configuration description +Configuration instructions ```text uri: mongo connection information database: mongo database @@ -449,7 +390,7 @@ collection: mongo collection ``` -Data write +Data written to json code ```json { "sources": [ @@ -461,8 +402,8 @@ Data write "path": "file://{filePath}/etltest.dolphin", "serializer": "csv", "options": { - "header":"true", - "delimiter":";" + "header": "true", + "delimiter": ";" }, "columnNames": ["name", "age"] } @@ -483,7 +424,7 @@ Data write } ``` -Data read +Data read json code ```json { "sources": [ @@ -511,3 +452,191 @@ Data read ] } ``` + +Need to add new 
jar +``` +bson-3.12.8.jar +mongo-spark-connector_2.12-3.0.1.jar +mongodb-driver-core-3.12.8.jar +mongodb-driver-sync-3.12.8.jar +``` + +###delta + +Configuration instructions +```text +tableFormat: currently supports `hudi` and `delta` +``` + + +Data written to json code +```json +{ + "sources": [ + { + "name": "file", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "path": "file://{filePath}/etltest.dolphin", + "serializer": "csv", + "options": { + "header": "true", + "delimiter": ";" + }, + "columnNames": ["name", "age"] + } + } + ], + "sinks": [ + { + "name": "datalake", + "config": { + "sourceTable": "T1654611700631", + "tableFormat": "delta", + "path": "file://{filePath}/delta", + "saveMode": "overwrite" + } + } + ] +} +``` + +Data read json code +```json +{ + "sources": [ + { + "name": "datalake", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "tableFormat": "delta", + "path": "file://{filePath}/delta", + } + } + ], + "sinks": [ + { + "name": "file", + "config": { + "sourceTable": "T1654611700631", + "path": "file://{filePath}/csv", + "saveMode": "overwrite", + "options": { + "header": "true" + }, + "serializer": "csv" + } + } + ] +} +``` + +Need to add new jar +``` +delta-core_2.12-2.0.2.jar +delta-storage-2.0.2.jar +``` + +###hudi + +Configuration instructions +```text +tableFormat: currently supports `hudi` and `delta` +``` + + +Data written to json code +```json +{ + "sources": [ + { + "name": "file", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "path": "file://{filePath}/etltest.dolphin", + "serializer": "csv", + "options": { + "header": "true", + "delimiter": ";" + }, + "columnNames": ["name", "age"] + } + } + ], + "transformations": [ + { + "name": "sql", + "type": "transformation", + "config": { + "resultTable": "T111", + "sql": "select * from T1654611700631" + } + } + ], + "sinks": [ + { + "name": "datalake", + "config": { + "sourceTable": "T1654611700631", + "tableFormat": "hudi", + "options": { + "hoodie.table.name": "huditest", + "hoodie.datasource.write.recordkey.field": "age", + "hoodie.datasource.write.precombine.field":"age" + }, + "path": "file://{filePath}/hudi", + "saveMode": "append" + } + } + ] +} +``` + +Data read json code +```json +{ + "sources": [ + { + "name": "datalake", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "tableFormat": "hudi", + "path": "file://{filePath}/hudi", + } + } + ], + "transformations": [ + { + "name": "sql", + "type": "transformation", + "config": { + "resultTable": "T111", + "sql": "select * from T1654611700631" + } + } + ], + "sinks": [ + { + "name": "file", + "config": { + "sourceTable": "T1654611700631", + "path": "file://{filePath}/csv", + "saveMode": "overwrite", + "options": { + "header": "true" + }, + "serializer": "csv" + } + } + ] +} +``` + +Need to add new jar +``` +hudi-spark3.2-bundle_2.12-0.13.0.jar +``` \ No newline at end of file diff --git a/docs/feature/storage-add-support-oss.md b/docs/feature/storage-add-support-oss.md deleted file mode 100644 index 83e3c8912e6..00000000000 --- a/docs/feature/storage-add-support-oss.md +++ /dev/null @@ -1,36 +0,0 @@ ---- -title: Extend linkis-storage add support OSS filesystem -sidebar_position: 0.2 ---- - -## 1. Requirement Background -Extend linkis-storage add support OSS filesystem - -## 2. Instructions for use -To store log and resultSet in OSS, add the following configs in conf/linkis-cg-entrance.properties. 
-``` -#eg: -wds.linkis.entrance.config.log.path=oss://linkis/tmp/ -wds.linkis.resultSet.store.path=oss://linkis/tmp/ -wds.linkis.filesystem.hdfs.root.path=oss://taihao-linkis/tmp/ -wds.linkis.fs.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com -wds.linkis.fs.oss.bucket.name=linkis -wds.linkis.fs.oss.accessKeyId=your accessKeyId -wds.linkis.fs.oss.accessKeySecret=your accessKeySecret -``` - -Add the following configs in engine engineconn plugins conf. Let me use hive conf for example: -modify linkis-engineconn-plugins/hive/src/main/resources/linkis-engineconn.properties and -add the following configs. -``` -#eg: -wds.linkis.fs.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com -wds.linkis.fs.oss.bucket.name=linkis -wds.linkis.fs.oss.accessKeyId=your accessKeyId -wds.linkis.fs.oss.accessKeySecret=your accessKeySecret -``` - - -## 3. Precautions -1, you have an OSS bucket. -2, you have accessKeyId, accessKeySecret to access the above OSS bucket. \ No newline at end of file diff --git a/docs/feature/upgrade-base-engine-version.md b/docs/feature/upgrade-base-engine-version.md deleted file mode 100644 index 164781e3929..00000000000 --- a/docs/feature/upgrade-base-engine-version.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -title: upgrade hadoop\spark\hive default version to 3.x -sidebar_position: 0.2 ---- - -## 1. Requirement Background -fow now we support different hadoop, hive ,spark version compile, and lower engine version may have potential CVE risk - -## 2. Instructions for use -default hadoop version changes from 2.7.2 to 3.3.4 -default hive version changes from 2.3.3 to 3.1.3 -default spark version changes from 2.4.3 to 3.2.1 - - -## 3. Precautions -for the default compilation, versions will be spark3.2.1+hadoop3.3.4+hive3.1.3. -``` -mvn install package -``` -profile spark-3.2 、 hadoop-3.3 、spark-2.4-hadoop-3.3 profiles have been removed and profile hadoop-2.7 and spark-2.4 have been added instead. \ No newline at end of file diff --git a/docs/feature/version-and-branch-intro.md b/docs/feature/version-and-branch-intro.md deleted file mode 100644 index b4ff4c3af7d..00000000000 --- a/docs/feature/version-and-branch-intro.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: version number and branch modification instructions -sidebar_position: 0.4 ---- - -## 1. Linkis main version number modification instructions - -Linkis will no longer be upgraded by minor version after version 1.3.2. The next version will be 1.4.0, and the version number will be 1.5.0, 1.6.0 and so on. When encountering a major defect in a released version that needs to be fixed, it will pull a minor version to fix the defect, such as 1.4.1. - - -## 2. Linkis code submission master branch instructions - -The modified code of Linkis 1.3.2 and earlier versions is merged into the dev branch by default. In fact, the development community of Apache Linkis is very active, and new development requirements or repair functions will be submitted to the dev branch, but when users visit the Linkis code base, the master branch is displayed by default. Since we only release a new version every quarter, it seems that the community is not very active from the perspective of the master branch. Therefore, we decided to merge the code submitted by developers into the master branch by default starting from version 1.4.0. 
\ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/deployment/deploy-quick.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/deployment/deploy-quick.md index 383d805cb8e..25d47036056 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/deployment/deploy-quick.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/deployment/deploy-quick.md @@ -235,7 +235,7 @@ HADOOP_KEYTAB_PATH=/appcom/keytab/ > > 注意: linkis没有对S3做权限适配,所以无法对其做赋权操作 -`vim linkis.properties` +`vim $LINKIS_HOME/conf/linkis.properties` ```shell script # s3 file system linkis.storage.s3.access.key=xxx @@ -245,7 +245,7 @@ linkis.storage.s3.region=xxx linkis.storage.s3.bucket=xxx ``` -`vim linkis-cg-entrance.properties` +`vim $LINKIS_HOME/conf/linkis-cg-entrance.properties` ```shell script wds.linkis.entrance.config.log.path=s3:///linkis/logs wds.linkis.resultSet.store.path=s3:///linkis/results @@ -273,7 +273,7 @@ select * from linkis_mg_gateway_auth_token; Linkis 服务本身使用 Token 时,配置文件中 Token 需与数据库中 Token 一致。通过应用简称前缀匹配。 -$LINKIS_HOME/conf/linkis.properites文件 Token 配置 +$LINKIS_HOME/conf/linkis.properties文件 Token 配置 ``` linkis.configuration.linkisclient.auth.token.value=BML-928a721518014ba4a28735ec2a0da799 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/engine-usage/impala.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/engine-usage/impala.md index b91f93bed33..4ba63a71f51 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/engine-usage/impala.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/engine-usage/impala.md @@ -5,38 +5,28 @@ sidebar_position: 12 本文主要介绍在 `Linkis` 中,`Impala` 引擎插件的安装、使用和配置。 - ## 1. 前置工作 -### 1.1 引擎安装 +### 1.1 环境安装 -如果您希望在您的 `Linkis` 服务上使用 `Impala` 引擎,您需要准备 Impala 服务并提供连接信息,如 Impala 集群的连接地址、SASL用户名和密码等 +如果您希望在您的服务器上使用 Impala 引擎,您需要准备 Impala 服务并提供连接信息,如 Impala 集群的连接地址、SASL用户名和密码等 -### 1.2 服务验证 +### 1.2 环境验证 -```shell -# 准备 trino-cli -wget https://repo1.maven.org/maven2/io/trino/trino-cli/374/trino-cli-374-executable.jar -mv trino-cli-374-executable.jar trino-cli -chmod +x trino-cli - -# 执行任务 -./trino-cli --server localhost:8080 --execute 'show tables from system.jdbc' - -# 得到如下输出代表服务可用 -"attributes" -"catalogs" -"columns" -"procedure_columns" -"procedures" -"pseudo_columns" -"schemas" -"super_tables" -"super_types" -"table_types" -"tables" -"types" -"udts" +执行 impala-shell 命令得到如下输出代表 impala 服务可用。 +``` +[root@8f43473645b1 /]# impala-shell +Starting Impala Shell without Kerberos authentication +Connected to 8f43473645b1:21000 +Server version: impalad version 2.12.0-cdh5.15.0 RELEASE (build 23f574543323301846b41fa5433690df32efe085) +*********************************************************************************** +Welcome to the Impala shell. +(Impala Shell v2.12.0-cdh5.15.0 (23f5745) built on Thu May 24 04:07:31 PDT 2018) + +When pretty-printing is disabled, you can use the '--output_delimiter' flag to set +the delimiter for fields in the same row. The default is ','. +*********************************************************************************** +[8f43473645b1:21000] > ``` ## 2. 
引擎插件部署 @@ -101,7 +91,7 @@ select * from linkis_cg_engine_conn_plugin_bml_resources; ```shell sh ./bin/linkis-cli -submitUser impala \ --engineType impala-3.4.0 -code 'select * from default.test limit 10' \ +-engineType impala-3.4.0 -code 'show databases;' \ -runtimeMap linkis.es.http.method=GET \ -runtimeMap linkis.impala.servers=127.0.0.1:21050 ``` @@ -143,37 +133,23 @@ sh ./bin/linkis-cli -submitUser impala \ 如果默认参数不满足时,有如下几中方式可以进行一些基础参数配置 -#### 4.2.1 管理台配置 - -![](./images/trino-config.png) - -注意: 修改 `IDE` 标签下的配置后需要指定 `-creator IDE` 才会生效(其它标签类似),如: - -```shell -sh ./bin/linkis-cli -creator IDE -submitUser hadoop \ - -engineType impala-3.4.0 -codeType sql \ - -code 'select * from system.jdbc.schemas limit 10' -``` - -#### 4.2.2 任务接口配置 +#### 4.2.1 任务接口配置 提交任务接口,通过参数 `params.configuration.runtime` 进行配置 ```shell http 请求参数示例 { - "executionContent": {"code": "select * from system.jdbc.schemas limit 10;", "runType": "sql"}, + "executionContent": {"code": "show databases;", "runType": "sql"}, "params": { "variable": {}, "configuration": { "runtime": { - "linkis.trino.url":"http://127.0.0.1:8080", - "linkis.trino.catalog ":"hive", - "linkis.trino.schema ":"default" - } + "linkis.impala.servers"="127.0.0.1:21050" } - }, + } + }, "labels": { - "engineType": "trino-371", + "engineType": "impala-3.4.0", "userCreator": "hadoop-IDE" } } @@ -185,7 +161,7 @@ http 请求参数示例 ``` linkis_ps_configuration_config_key: 插入引擎的配置参数的key和默认values -linkis_cg_manager_label:插入引擎label如:trino-375 +linkis_cg_manager_label:插入引擎label如:impala-3.4.0 linkis_ps_configuration_category: 插入引擎的目录关联关系 linkis_ps_configuration_config_value: 插入引擎需要展示的配置 linkis_ps_configuration_key_engine_relation:配置项和引擎的关联关系 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/base-engine-compatibilty.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/base-engine-compatibilty.md index 65fb8ece362..83229cbe6b4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/base-engine-compatibilty.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/base-engine-compatibilty.md @@ -1,37 +1,75 @@ --- -title: 减少基础引擎不同版本兼容性问题 +title: 基础引擎依赖性、兼容性、默认版本优化 sidebar_position: 0.2 --- ## 1. 需求背景 -之前我们需要修改linkis代码来适配不同的hive版本、spark版本,因为兼容性问题,编译可能会失败,我们可以减少这些基础引擎的兼容性问题。 +1. 低版本 linkis 需要通过修改代码来适配不同的 Hive、Spark 等版本,因为兼容性问题,编译可能会失败,可以减少这些基础引擎的兼容性问题。 +2. Hadoop、Hive、Spark 3.x 已经很成熟,并且低版本的引擎可能有潜在的风险点,社区很多用户默认使用 3.x 版本,因此考虑将 Linkis 默认编译的版本修改为 3.x 。 ## 2. 使用说明 + +## 2.1 默认版本调整说明 + +Linkis 1.4.0 将 Hadoop、Hive、Spark 默认版本修改为 3.x,具体版本分别为 Hadoop 3.3.4、Hive 3.1.3、Spark 3.2.1 。 + +## 2.2 不同版本适配 + 不同的hive版本的编译,我们只需要指定`-D=xxx`就可以了,比如: ``` -mvn clean install package -Dhive.version=3.1.3 - +mvn clean install package -Dhive.version=2.3.3 ``` 不同版本的spark编译,我们也只需要指定`-D=xxx`就可以了,常用的使用场景如下: ``` -spark3+hadoop3 +#spark3+hadoop3 mvn install package -spark3+hadoop2 +#spark3+hadoop2 mvn install package -Phadoop-2.7 -spark2+hadoop2 +#spark2+hadoop2 mvn install package -Pspark-2.4 -Phadoop-2.7 -spark2+ hadoop3 +#spark2+ hadoop3 mvn install package -Pspark-2.4 - ``` ## 3. 注意事项 -spark的子版本可以通过`-Dspark.version=xxx` 来指定 -hadoop的子版本可以通过`-Dhadoop.version=xxx` 来指定 +1. 默认版本编译时,基础版本为:hadoop3.3.4 + hive3.1.3 + spark3.2.1 +``` +mvn install package +``` +由于默认基础引擎的默认版本升级,`spark-3.2`、`hadoop-3.3`和`spark-2.4-hadoop-3.3` profile被移除,新增profile `hadoop-2.7` and `spark-2.4`。 + +2. 
spark的子版本可以通过`-Dspark.version=xxx` 来指定,系统默认使用的 scala 版本为 2.12.17,可适配 spark 3.x 版本 。如需编译 spark 2.x,需要使用 scala 2.11 版本。可通过 -Pspark-2.4 参数,或者 -Dspark.version=2.xx -Dscala.version=2.11.12 -Dscala.binary.version=2.11 编译。 + +3. hadoop的子版本可以通过`-Dhadoop.version=xxx` 来指定 举个例子 : ``` mvn install package -Pspark-3.2 -Phadoop-3.3 -Dspark.version=3.1.3 -``` \ No newline at end of file +``` + +4. hive 2.x 版本需要依赖 jersey,hive EC 默认编译时未添加 jersey依赖,可通过如下指引编译。 + +**编译 hive 2.3.3 版本** + +编译 hive EC 时默认添加了指定 2.3.3 版本时激活添加 jersey 依赖的 profile,用户可通过指定 -Dhive.version=2.3.3 参数编译 + +**编译其它 hive 2.x 版本** + +修改 linkis-engineconn-plugins/hive/pom.xml 文件,将 2.3.3 修改为用户编译版本,如 2.1.0 +```xml + + hive-jersey-dependencies + + + hive.version + + 2.1.0 + + + ... + +``` +编译时添加 -Dhive.version=2.1.0 参数。 + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/datasource-generate-sql.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/datasource-generate-sql.md index 31ba0a2c375..1750237fdee 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/datasource-generate-sql.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/datasource-generate-sql.md @@ -1,28 +1,31 @@ --- title: 根据数据源生成SQL -sidebar_position: 0.2 +sidebar_position: 0.5 --- ## 1. 背景 -根据数据源信息生成SparkSQL和JdbcSQL,包含DDL、DML、DQL +根据数据源信息生成 SparkSQL 和 JdbcSQL,包含DDL、DML、DQL。 ## 2. 使用说明 ### 生成SparkSQL -参数说明: -| 参数名 | 说明 | 默认值 | -|------------------------------|-------|-----| -| `dataSourceName` | 数据源名称 | - | -| `system` | 系统名称 | - | -| `database` | 数据库名称 | - | -| `table` | 表名称 | - | +接口地址:/api/rest_j/v1/metadataQuery/getSparkSql -通过 RestFul 的方式提交任务,请求示例如下。 -```json -GET /api/rest_j/v1/metadataQuery/getSparkSql?dataSourceName=mysql&system=system&database=test&table=test -``` +请求方式:GET + +请求数据类型:application/x-www-form-urlencoded + +请求参数: + +| 参数名 | 说明 | 是否必须 | 数据类型 | +|------------------------------|-------|-----|--| +| `dataSourceName` | 数据源名称 | 是 | String | +| `system` | 系统名称 | 是 | String | +| `database` | 数据库名称 | 是 | String | +| `table` | 表名称 | 是 | String | + +响应示例: -响应示例如下。 ```json { "method": null, @@ -37,39 +40,43 @@ GET /api/rest_j/v1/metadataQuery/getSparkSql?dataSourceName=mysql&system=system& } } ``` -目前支持jdbc、kafka、elasticsearch、mongo数据源,可以根据SparkSQLDdl注册spark table进行查询 +目前支持jdbc、kafka、elasticsearch、mongo 数据源,可以根据SparkSQLDDL注册 spark table 进行查询 ### 生成JdbcSQL -参数说明: -| 参数名 | 说明 | 默认值 | -|------------------------------|-------|-----| -| `dataSourceName` | 数据源名称 | - | -| `system` | 系统名称 | - | -| `database` | 数据库名称 | - | -| `table` | 表名称 | - | +接口地址:/api/rest_j/v1/metadataQuery/getJdbcSql -通过 RestFul 的方式提交任务,请求示例如下。 -```json -GET /api/rest_j/v1/metadataQuery/getJdbcSql?dataSourceName=mysql&system=system&database=test&table=test -``` +请求方式:GET + +请求数据类型:application/x-www-form-urlencoded + +请求参数: + +| 参数名 | 说明 | 是否必须 | 数据类型 | +|------------------------------|-------|-----|--| +| `dataSourceName` | 数据源名称 | 是 | String | +| `system` | 系统名称 | 是 | String | +| `database` | 数据库名称 | 是 | String | +| `table` | 表名称 | 是 | String | + +响应示例: -响应示例如下。 ```json { - "method": null, - "status": 0, - "message": "OK", - "data": { - "jdbcSql": { - "ddl": "CREATE TABLE `test` (\n\t `id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT '列名是id',\n\t `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT '列名是name',\n\t PRIMARY KEY (`id`)\n\t) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci", - "dml": "INSERT INTO test SELECT * FROM ${resultTable}", - "dql": "SELECT 
id,name FROM test" + "method": null, + "status": 0, + "message": "OK", + "data": { + "jdbcSql": { + "ddl": "CREATE TABLE `test` (\n\t `id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT '列名是id',\n\t `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT '列名是name',\n\t PRIMARY KEY (`id`)\n\t) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci", + "dml": "INSERT INTO test SELECT * FROM ${resultTable}", + "dql": "SELECT id,name FROM test" + } } - } } ``` -目前支持jdbc数据源,如:mysql、oracle、postgres等,JdbcSQLDdl可以用于前端展示 + +目前支持 JDBC 数据源,如:mysql、oracle、postgres等,JdbcSQLDDL可以用于前端展示。 ## 3. 注意事项 1. 需要先注册数据源 @@ -92,50 +99,51 @@ GET /api/rest_j/v1/metadataQuery/getJdbcSql?dataSourceName=mysql&system=system&d ### 生成JdbcSQL实现原理 根据表schema信息拼接DDL ```java - public String generateJdbcDdlSql(String database, String table) { - StringBuilder ddl = new StringBuilder(); - ddl.append("CREATE TABLE ").append(String.format("%s.%s", database, table)).append(" ("); +public String generateJdbcDdlSql(String database, String table) { + StringBuilder ddl = new StringBuilder(); + ddl.append("CREATE TABLE ").append(String.format("%s.%s", database, table)).append(" ("); - try { - List columns = getColumns(database, table); + try { + List < MetaColumnInfo > columns = getColumns(database, table); if (CollectionUtils.isNotEmpty(columns)) { - for (MetaColumnInfo column : columns) { - ddl.append("\n\t").append(column.getName()).append(" ").append(column.getType()); - if (column.getLength() > 0) { - ddl.append("(").append(column.getLength()).append(")"); - } - if (!column.isNullable()) { - ddl.append(" NOT NULL"); - } - ddl.append(","); + for (MetaColumnInfo column: columns) { + ddl.append("\n\t").append(column.getName()).append(" ").append(column.getType()); + if (column.getLength() > 0) { + ddl.append("(").append(column.getLength()).append(")"); + } + if (!column.isNullable()) { + ddl.append(" NOT NULL"); + } + ddl.append(","); + } + String primaryKeys = + columns.stream() + .filter(MetaColumnInfo::isPrimaryKey) + .map(MetaColumnInfo::getName) + .collect(Collectors.joining(", ")); + if (StringUtils.isNotBlank(primaryKeys)) { + ddl.append(String.format("\n\tPRIMARY KEY (%s),", primaryKeys)); + } + ddl.deleteCharAt(ddl.length() - 1); } - String primaryKeys = - columns.stream() - .filter(MetaColumnInfo::isPrimaryKey) - .map(MetaColumnInfo::getName) - .collect(Collectors.joining(", ")); - if (StringUtils.isNotBlank(primaryKeys)) { - ddl.append(String.format("\n\tPRIMARY KEY (%s),", primaryKeys)); - } - ddl.deleteCharAt(ddl.length() - 1); - } - } catch (Exception e) { + } catch (Exception e) { LOG.warn("Fail to get Sql columns(获取字段列表失败)"); - } + } - ddl.append("\n)"); + ddl.append("\n)"); - return ddl.toString(); - } + return ddl.toString(); +} ``` + 部分数据源支持直接获取DDL -mysql +**mysql** ```sql SHOW CREATE TABLE 'table' ``` -oracle +**oracle** ```sql SELECT DBMS_METADATA.GET_DDL('TABLE', 'table', 'database') AS DDL FROM DUAL ``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/ecm-takes-over-ec.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/ecm-takes-over-ec.md deleted file mode 100644 index 8ee92a55071..00000000000 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/ecm-takes-over-ec.md +++ /dev/null @@ -1,10 +0,0 @@ ---- -title: 当ECM重新启动时,可以选择不杀死引擎,而是可以接管现有的存活引擎 -sidebar_position: 0.2 ---- - -## 需求背景 -当ECM重新启动时,可以选择不杀死引擎,而是可以接管现有的存活引擎。使引擎连接管理器(ECM)服务无状态。 - -## 使用说明 -此功能默认已启用。 \ No newline at 
end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/hive-engine-support-concurrent.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/hive-engine-support-concurrent.md index 086a2187df0..a07ebba71c9 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/hive-engine-support-concurrent.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/hive-engine-support-concurrent.md @@ -1,20 +1,24 @@ --- title: hive engine支持并发,支持复用 -sidebar_position: 0.2 +sidebar_position: 0.3 --- ## 1. 需求背景 -hiveEngineConn支持并发,减少启动hive引擎的资源消耗。 +hiveEngineConn支持并发,减少启动hive引擎的资源消耗,提高引擎复用率。 ## 2. 使用说明 首先,在linkis-engineconn-plugins/hive/src/main/resources目录下修改linkis-engineconn.properties文件, 并将linkis.hive.engineconn.concurrent.support设置为true。 ``` # 支持并行执行 -linkis.hive.engineconn.concurrent.support=true +wds.linkis.engineconn.support.parallelism=true + +# 并发数限制,默认为 10 +linkis.hive.engineconn.concurrent.limit=10 ``` -第二,提交一个hive任务,当第一个任务完成后,再提交另一个任务。您可以看到hive引擎已被重用。 +提交一个hive任务,当第一个任务完成后,再提交另一个任务。您可以看到hive引擎已被重用。 +配置修改后重启 cg-linkismanager 服务,或通过 [引擎刷新接口](../api/http/linkis-cg-engineplugin-api/engineconn-plugin-refresh.md) 使配置生效。 ## 3. 注意事项 -1、等第一个hive任务执行成功后,再提交第二个hive任务。 \ No newline at end of file +1、等待第一个hive任务执行成功后,再提交第二个hive任务。初次同时提交多个任务可能由于暂无可用的 EC 导致启动多个 EC。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/other.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/other.md new file mode 100644 index 00000000000..9dafd70c4f1 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/other.md @@ -0,0 +1,31 @@ +--- +title: 其它特性说明 +sidebar_position: 0.6 +--- + +## 1. Linkis 1.4.0 其它特性升级说明 + +### 1.1 ECM 重启时不 kill EC +当ECM重新启动时,可以选择不杀死引擎,而是可以接管现有的存活引擎。使引擎连接管理器 (ECM) 服务无状态。 + +### 1.2 移除 json4s 依赖 +spark 不同版本依赖不同的json4s版本,不利于spark多版本的支持,我们需要减少这个json4s依赖,从linkis中移除了json4s. 
+比如: spark2.4 需要json4s v3.5.3, spark3.2需要json4s v3.7.0-M11。
+
+### 1.3 EngineConn模块定义依赖引擎版本
+引擎的版本定义默认在 `EngineConn`中，一旦相关版本变更，需要修改多处，我们可以把相关的版本定义统一放到顶层pom文件中。编译指定引擎模块时，需要在项目根目录编译，并使用`-pl`来编译具体的引擎模块，比如:
+```
+mvn install package -pl linkis-engineconn-plugins/spark -Dspark.version=3.2.1
+```
+引擎的版本可以通过 mvn 编译的 -D 参数来指定，比如 -Dspark.version=xxx、-Dpresto.version=0.235。
+目前所有底层引擎的版本定义都已经移到顶层 pom 文件中。
+
+### 1.4 Linkis 主版本号修改说明
+
+Linkis 从 1.3.2 版本后将不再按小版本升级，下一个版本为 1.4.0，再往后升级时版本号为 1.5.0、1.6.0，以此类推。当遇到某个发布版本有重大缺陷需要修复时，会拉取小版本修复缺陷，如 1.4.1。
+
+### 1.5 Linkis 代码提交主分支说明
+
+Linkis 1.3.2 及之前版本修改代码默认是合并到 dev 分支。实际上 Apache Linkis 的开发社区很活跃，对于新开发的需求或修复功能都会提交到 dev 分支，但是用户访问 Linkis 代码库的时候默认显示的是 master 分支。由于我们一个季度才会发布一个新版本，从 master 分支来看显得社区活跃度不高。因此我们决定从 1.4.0 版本开始，将开发者提交的代码默认合并到 master 分支。
+
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/overview.md
index 9885a468055..ba393a2fdc6 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/overview.md
@@ -3,15 +3,14 @@ title: 版本总览
 sidebar_position: 0.1
 ---
 
-- [hadoop、spark、hive 默认版本升级为3.x](./upgrade-base-engine-version.md)
-- [减少基础引擎不同版本兼容性问题](./base-engine-compatibilty.md)
+- [基础引擎依赖性、兼容性、默认版本优化](./base-engine-compatibilty.md)
 - [Hive 引擎连接器支持并发任务](./hive-engine-support-concurrent.md)
-- [linkis-storage 支持 S3 和 OSS 文件系统](./storage-add-support-oss.md)
-- [支持更多的数据源](./spark-etl.md)
-- [增加 postgresql 数据库支持](../deployment/deploy-quick.md)
-- [ECM重启时不kill EC](./ecm-takes-over-ec.md)
+- [新增 Impala 引擎支持](../engine-usage/impala.md)
+- [linkis-storage 支持 S3 文件系统](../deployment/deploy-quick#s3模式可选)
+- [增加 postgresql 数据库支持](../deployment/deploy-quick.md#33-添加postgresql驱动包-可选)
 - [Spark ETL 功能增强](./spark-etl.md)
-- [版本号及分支修改说明](./version-and-branch-intro.md)
+- [根据数据源生成SQL](./datasource-generate-sql.md)
+- [其它特性说明](./other.md)
 - [版本的 Release-Notes](/download/release-notes-1.4.0)
 
 ## 参数变化
@@ -25,6 +24,8 @@ sidebar_position: 0.1
 | mg-eureka | 新增 | eureka.instance.lease-expiration-duration-in-seconds | 12 | eureka 等待下一次心跳的超时时间（秒）|
 | EC-shell | 修改 | wds.linkis.engineconn.support.parallelism | true | 是否开启 shell 任务并行执行|
 | EC-shell | 修改 | linkis.engineconn.shell.concurrent.limit | 15 | shell 任务并发数 |
+| Entrance | 修改 | linkis.entrance.auto.clean.dirty.data.enable | true | 启动时是否清理脏数据 |
+
 
 ## 数据库表变化
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-json4s-from-linkis.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-json4s-from-linkis.md
deleted file mode 100644
index 02ed6b76a2d..00000000000
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-json4s-from-linkis.md
+++ /dev/null
@@ -1,11 +0,0 @@
----
-title: linkis中移除json4s依赖
-sidebar_position: 0.2
----
-
-## 1. 需求背景
-spark 不同版本依赖不同的json4s版本，不利于spark多版本的支持，我们需要减少这个json4s依赖，从linkis中移除json4s.
-比如: spark2.4 需要json4s v3.5.3, spark3.2需要json4s v3.7.0-M11
-
-## 2. 
使用说明 -spark自定义版本源码编译时不需要修改json4s的依赖 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-underlying-engine-depdency.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-underlying-engine-depdency.md deleted file mode 100644 index 81061b048b7..00000000000 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/remove-underlying-engine-depdency.md +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: EngineConn模块定义依赖引擎版本 -sidebar_position: 0.2 ---- - -## 1. 需求背景 -引擎的版本定义默认在 `EngineConn`中,一旦相关版本变更,需要修改多处,我们可以把相关的版本定义统一放到顶层pom文件中 - -## 2. 使用说明 -编译指定引擎模块时,需要在项目根目录编译,并使用`-pl`来编译具体的引擎模块,比如: -``` -mvn install package -pl linkis-engineconn-plugins/spark -Dspark.version=3.2.2 - -``` -## 3. 注意事项 -引擎的版本可以通过mvn编译-D参数来指定,比如 -Dspark.version=xxx 、 -Dpresto.version=0.235 -目前所有的底层引擎版本新都已经移到顶层pom文件中,编译指定引擎模块时,需要在项目根目录编译,并使用`-pl`来编译具体的引擎模块 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/spark-etl.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/spark-etl.md index a19702ac34a..40c2a646851 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/spark-etl.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/spark-etl.md @@ -1,16 +1,16 @@ --- -title: 支持spark数据同步 -sidebar_position: 0.2 +title: 支持 spark ETL 数据同步 +sidebar_position: 0.4 --- ## 1. 背景 -用户可以通过配置json的方式进行spark数据同步 +使用 Spark ETL 功能,用户可以通过配置 json 的方式进行 Spark 数据同步。 ## 2. 支持的类型 目前支持的类型 ```text -jdbc、file、redis、kafka、elasticsearch、mongo、solr、rocketmq、datalake(hudi、delta) +jdbc、file、redis、kafka、elasticsearch、mongo、datalake(hudi、delta) ``` ## 3. 通用配置说明 @@ -24,9 +24,44 @@ path: 文件路径,可以是: 'file://' or 'hdfs://'(default) ``` ## 4. 使用说明 -常见的数据同步示例如下: -### jdbc +### 4.1 添加所需的 jar 包 +使用数据源时需要将对应的 spark connector jar 上传至 spark/jars目录,目录位置 $SPARK_HOME/jars + +spark connector jar 可以通过以下命令获取 + +```text +git clone https://github.com/apache/linkis.git + +cd linkis + +git checkout master + +cd linkis-engineconn-plugins/spark/scala-2.12 + +mvn clean install -Dmaven.test.skip=true +``` + +编译完成的spark connector jar位于以下目录中 +```text +linkis/linkis-engineconn-plugins/spark/scala-2.12/target/out/spark/dist/3.2.1/lib +``` + +### 4.2 linkis-cli 提交任务示例 + +在 code 传入具体的 json 代码即可,注意引号格式转换。 + +```shell +sh /appcom/Install/linkis/bin/linkis-cli -engineType spark-3.2.1 -codeType data_calc -code "" -submitUser hadoop -proxyUser hadoop +``` + +linkis-cli 提交 redis 数据同步任务示例 +```shell +sh ./bin/linkis-cli -engineType spark-3.2.1 -codeType data_calc -code "{\"plugins\":[{\"name\":\"file\",\"type\":\"source\",\"config\":{\"resultTable\":\"test\",\"path\":\"hdfs://linkishdfs/tmp/linkis/spark_etl_test/etltest.dolphin\",\"serializer\":\"csv\",\"options\":{\"header\":\"true\",\"delimiter\":\";\"},\"columnNames\":[\"name\",\"age\"]}},{\"name\":\"redis\",\"type\":\"sink\",\"config\":{\"sourceTable\":\"test\",\"host\":\"wds07\",\"port\":\"6679\",\"auth\":\"password\",\"targetTable\":\"spark_etl_test\",\"saveMode\":\"append\"}}]}" -submitUser hadoop -proxyUser hadoop +``` +### 4.3 各数据源同步 json 脚本说明 + +### 4.3.1 jdbc 配置说明 ```text @@ -36,6 +71,8 @@ password: 密码 query: sql查询语句 ``` +json code + ```json { "sources": [ @@ -83,15 +120,25 @@ query: sql查询语句 } ``` -### file +需要新增的jar,根据具体使用的数据源选择对应的 jar +```text +DmJdbcDriver18.jar +kingbase8-8.6.0.jar +postgresql-42.3.8.jar +``` + +### 4.3.2 file 配置说明 + ```text serializer: 文件格式,可以是`csv`、`parquet`等 columnNames: 列名 ``` +json code + ```json { "sources": [ @@ -121,97 +168,32 @@ columnNames: 列名 } ``` - -### 
delta - -配置说明 -```text -tableFormat: 目前支持`hudi`和`delta` +需要新增的 jar ``` - - -数据写入 -```json -{ - "sources": [ - { - "name": "file", - "type": "source", - "config": { - "resultTable": "T1654611700631", - "path": "file://{filePath}/etltest.dolphin", - "serializer": "csv", - "options": { - "header":"true", - "delimiter":";" - }, - "columnNames": ["name", "age"] - } - } - ], - "sinks": [ - { - "name": "datalake", - "config": { - "sourceTable": "T1654611700631", - "tableFormat": "delta", - "path": "file://{filePath}/delta", - "saveMode": "overwrite" - } - } - ] -} +spark-excel-2.12.17-3.2.2_2.12-3.2.2_0.18.1.jar ``` -数据读取 -```json -{ - "sources": [ - { - "name": "datalake", - "type": "source", - "config": { - "resultTable": "T1654611700631", - "tableFormat": "delta", - "path": "file://{filePath}/delta", - } - } - ], - "sinks": [ - { - "name": "file", - "config": { - "sourceTable": "T1654611700631", - "path": "file://{filePath}/csv", - "saveMode": "overwrite", - "options": { - "header":"true" - }, - "serializer": "csv" - } - } - ] -} -``` +### 4.3.3 redis -### hudi - -配置说明 ```text -tableFormat: 目前支持`hudi`和`delta` +sourceTable: 源表, +host: ip地址, +port": 端口, +auth": 密码, +targetTable: 目标表, +saveMode: 支持 append ``` - -数据写入 +json code ```json { - "sources": [ + "plugins":[ { "name": "file", "type": "source", "config": { - "resultTable": "T1654611700631", - "path": "file://{filePath}/etltest.dolphin", + "resultTable": "test", + "path": "hdfs://linkishdfs/tmp/linkis/spark_etl_test/etltest.dolphin", "serializer": "csv", "options": { "header":"true", @@ -219,30 +201,16 @@ tableFormat: 目前支持`hudi`和`delta` }, "columnNames": ["name", "age"] } - } - ], - "transformations": [ + }, { - "name": "sql", - "type": "transformation", + "name": "redis", + "type":"sink", "config": { - "resultTable": "T111", - "sql": "select * from T1654611700631" - } - } - ], - "sinks": [ - { - "name": "datalake", - "config": { - "sourceTable": "T1654611700631", - "tableFormat": "hudi", - "options": { - "hoodie.table.name":"huditest", - "hoodie.datasource.write.recordkey.field":"age", - "hoodie.datasource.write.precombine.field":"age" - }, - "path": "file://{filePath}/hudi", + "sourceTable": "test", + "host": "wds07", + "port": "6679", + "auth":"password", + "targetTable":"spark_etl_test", "saveMode": "append" } } @@ -250,49 +218,14 @@ tableFormat: 目前支持`hudi`和`delta` } ``` -数据读取 -```json -{ - "sources": [ - { - "name": "datalake", - "type": "source", - "config": { - "resultTable": "T1654611700631", - "tableFormat": "hudi", - "path": "file://{filePath}/hudi", - } - } - ], - "transformations": [ - { - "name": "sql", - "type": "transformation", - "config": { - "resultTable": "T111", - "sql": "select * from T1654611700631" - } - } - ], - "sinks": [ - { - "name": "file", - "config": { - "sourceTable": "T1654611700631", - "path": "file://{filePath}/csv", - "saveMode": "overwrite", - "options": { - "header":"true" - }, - "serializer": "csv" - } - } - ] -} +需要新增的jar +```text +jedis-3.2.0.jar +commons-pool2-2.8.1.jar +spark-redis_2.12-2.6.0.jar ``` - -### kafka +### 4.3.4 kafka 配置说明 ```text @@ -301,8 +234,7 @@ mode: 目前支持`batch`和`stream` topic: kafka topic名称 ``` - -数据写入 +数据写入 json code ```json { "sources": [ @@ -335,7 +267,7 @@ topic: kafka topic名称 } ``` -数据读取 +数据读取 json code ```json { "sources": [ @@ -363,6 +295,13 @@ topic: kafka topic名称 } ``` +需要新增的 jar +``` +kafka-clients-2.8.0.jar +spark-sql-kafka-0-10_2.12-3.2.1.jar +spark-token-provider-kafka-0-10_2.12-3.2.1.jar +``` + ### elasticsearch 配置说明 @@ -373,7 +312,7 @@ index: elasticsearch索引名称 ``` 
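+
+与 4.2 节的 redis 示例类似，下面的 数据写入/数据读取 json 也可以通过 linkis-cli 以 data_calc 类型提交。以下命令仅为示意（节点地址、端口、索引名以及文件路径均为假设值，plugin 名称与字段以本节配置说明及下文示例为准），请按实际环境调整：
+
+```shell
+sh ./bin/linkis-cli -engineType spark-3.2.1 -codeType data_calc -code "{\"sources\":[{\"name\":\"file\",\"type\":\"source\",\"config\":{\"resultTable\":\"test\",\"path\":\"hdfs:///tmp/linkis/spark_etl_test/etltest.dolphin\",\"serializer\":\"csv\",\"options\":{\"header\":\"true\",\"delimiter\":\";\"},\"columnNames\":[\"name\",\"age\"]}}],\"sinks\":[{\"name\":\"elasticsearch\",\"config\":{\"sourceTable\":\"test\",\"node\":\"127.0.0.1\",\"port\":\"9200\",\"index\":\"estest\",\"saveMode\":\"overwrite\"}}]}" -submitUser hadoop -proxyUser hadoop
+```
+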
-数据写入 +数据写入 json code ```json { "sources": [ @@ -407,7 +346,7 @@ index: elasticsearch索引名称 } ``` -数据读取 +数据读取 json code ```json { "sources": [ @@ -436,6 +375,10 @@ index: elasticsearch索引名称 } ``` +需要新增的jar +``` +elasticsearch-spark-30_2.12-7.17.7.jar +``` ### mongo @@ -447,7 +390,7 @@ collection: mongo collection ``` -数据写入 +数据写入 json code ```json { "sources": [ @@ -481,7 +424,7 @@ collection: mongo collection } ``` -数据读取 +数据读取 json code ```json { "sources": [ @@ -509,3 +452,191 @@ collection: mongo collection ] } ``` + +需要新增的 jar +``` +bson-3.12.8.jar +mongo-spark-connector_2.12-3.0.1.jar +mongodb-driver-core-3.12.8.jar +mongodb-driver-sync-3.12.8.jar +``` + +### delta + +配置说明 +```text +tableFormat: 目前支持`hudi`和`delta` +``` + + +数据写入 json code +```json +{ + "sources": [ + { + "name": "file", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "path": "file://{filePath}/etltest.dolphin", + "serializer": "csv", + "options": { + "header":"true", + "delimiter":";" + }, + "columnNames": ["name", "age"] + } + } + ], + "sinks": [ + { + "name": "datalake", + "config": { + "sourceTable": "T1654611700631", + "tableFormat": "delta", + "path": "file://{filePath}/delta", + "saveMode": "overwrite" + } + } + ] +} +``` + +数据读取 json code +```json +{ + "sources": [ + { + "name": "datalake", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "tableFormat": "delta", + "path": "file://{filePath}/delta", + } + } + ], + "sinks": [ + { + "name": "file", + "config": { + "sourceTable": "T1654611700631", + "path": "file://{filePath}/csv", + "saveMode": "overwrite", + "options": { + "header":"true" + }, + "serializer": "csv" + } + } + ] +} +``` + +需要新增的 jar +``` +delta-core_2.12-2.0.2.jar +delta-storage-2.0.2.jar +``` + +### hudi + +配置说明 +```text +tableFormat: 目前支持`hudi`和`delta` +``` + + +数据写入 json code +```json +{ + "sources": [ + { + "name": "file", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "path": "file://{filePath}/etltest.dolphin", + "serializer": "csv", + "options": { + "header":"true", + "delimiter":";" + }, + "columnNames": ["name", "age"] + } + } + ], + "transformations": [ + { + "name": "sql", + "type": "transformation", + "config": { + "resultTable": "T111", + "sql": "select * from T1654611700631" + } + } + ], + "sinks": [ + { + "name": "datalake", + "config": { + "sourceTable": "T1654611700631", + "tableFormat": "hudi", + "options": { + "hoodie.table.name":"huditest", + "hoodie.datasource.write.recordkey.field":"age", + "hoodie.datasource.write.precombine.field":"age" + }, + "path": "file://{filePath}/hudi", + "saveMode": "append" + } + } + ] +} +``` + +数据读取 json code +```json +{ + "sources": [ + { + "name": "datalake", + "type": "source", + "config": { + "resultTable": "T1654611700631", + "tableFormat": "hudi", + "path": "file://{filePath}/hudi", + } + } + ], + "transformations": [ + { + "name": "sql", + "type": "transformation", + "config": { + "resultTable": "T111", + "sql": "select * from T1654611700631" + } + } + ], + "sinks": [ + { + "name": "file", + "config": { + "sourceTable": "T1654611700631", + "path": "file://{filePath}/csv", + "saveMode": "overwrite", + "options": { + "header":"true" + }, + "serializer": "csv" + } + } + ] +} +``` + +需要新增的 jar +``` +hudi-spark3.2-bundle_2.12-0.13.0.jar +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/storage-add-support-oss.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/storage-add-support-oss.md deleted file mode 100644 index a279521c98c..00000000000 --- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/storage-add-support-oss.md +++ /dev/null @@ -1,32 +0,0 @@ ---- -title: 扩展linkis-storage以支持OSS文件系统 -sidebar_position: 0.2 ---- - -## 1. 需求背景 -扩展linkis-storage以支持OSS文件系统。 - -## 2. 使用说明 -为了在OSS中存储日志和resultSet,请在conf/linkis-cg-entrance.properties中添加以下配置。示例: -``` -wds.linkis.entrance.config.log.path=oss://linkis/tmp/ -wds.linkis.resultSet.store.path=oss://linkis/tmp/ -wds.linkis.filesystem.hdfs.root.path=oss://taihao-linkis/tmp/ -wds.linkis.fs.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com -wds.linkis.fs.oss.bucket.name=linkis -wds.linkis.fs.oss.accessKeyId=your accessKeyId -wds.linkis.fs.oss.accessKeySecret=your accessKeySecret -``` - -在engine engineconn插件conf中添加以下配置。以hive conf为例:修改linkis-engineconn-plugins/hive/src/main/resources/linkis-engineconn.properties, -并添加以下配置。示例: -``` -wds.linkis.fs.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com -wds.linkis.fs.oss.bucket.name=linkis -wds.linkis.fs.oss.accessKeyId=your accessKeyId -wds.linkis.fs.oss.accessKeySecret=your accessKeySecret -``` - -## 3. 注意事项 -1、您需要拥有一个OSS存储桶。 -2、您需要accessKeyId和accessKeySecret以访问上述OSS存储桶。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/upgrade-base-engine-version.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/upgrade-base-engine-version.md deleted file mode 100644 index 42f1c9277ca..00000000000 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/upgrade-base-engine-version.md +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: 升级基础引擎版本到较新版本 -sidebar_position: 0.2 ---- - -## 1. 需求背景 -目前我们已经支持不同版本的hadoop,hive,spark进行编译,并且低版本的引擎可能有潜在的风险点,我们可以升级默认的基础引擎版本到较新版本 - -## 2. 使用说明 -默认hadoop版本从2.7.2升级到3.3.4,默认hive版本从2.3.3升级到3.1.3,默认spark版本从2.4.3升级到3.2.1 - -## 3. 注意事项 -默认版本编译时,基础版本为:spark3.2.1+hadoop3.3.4+hive3.1.3 -``` -mvn install package -``` -由于默认基础引擎的默认版本升级,`spark-3.2`、`hadoop-3.3`和`spark-2.4-hadoop-3.3` profile被移除,新增profile `hadoop-2.7` and `spark-2.4`. \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/version-and-branch-intro.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/version-and-branch-intro.md deleted file mode 100644 index 0ec0391b69f..00000000000 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/feature/version-and-branch-intro.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: 版本号及分支修改说明 -sidebar_position: 0.4 ---- - -## 1. Linkis 主版本号修改说明 - -Linkis 从 1.3.2 版本后将不再按小版本升级,下一个版本为 1.4.0,再往后升级时版本号为1.5.0,1.6.0 以此类推。当遇到某个发布版本有重大缺陷需要修复时会拉取小版本修复缺陷,如 1.4.1 。 - - -## 2. LInkis 代码提交主分支说明 - -Linkis 1.3.2 及之前版本修改代码默认是合并到 dev 分支。实际上 Apache Linkis 的开发社区很活跃,对于新开发的需求或修复功能都会提交到 dev 分支,但是用户访问 Linkis 代码库的时候默认显示的是 master 分支。由于我们一个季度才会发布一个新版本,从 master 分支来看显得社区活跃的不高。因此我们决定从 1.4.0 版本开始,将开发者提交的代码默认合并到 master 分支。 \ No newline at end of file
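
补充：上文提到默认编译基础版本为 spark3.2.1 + hadoop3.3.4 + hive3.1.3，并新增了 `hadoop-2.7` 和 `spark-2.4` profile。下面给出一个编译命令示意（profile 名称取自上文说明，具体生效的引擎版本以顶层 pom 中的定义为准）：

```shell
# 按默认基础版本编译
mvn install package -Dmaven.test.skip=true

# 如仍需基于旧版本（hadoop 2.7 / spark 2.4）编译，可尝试启用新增的 profile
mvn install package -Dmaven.test.skip=true -Phadoop-2.7 -Pspark-2.4
```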