Add Spark connector settings #4577

Merged
merged 5 commits into from
Jul 18, 2023
38 changes: 37 additions & 1 deletion _search-plugins/sql/settings.md
@@ -68,7 +68,7 @@ PUT _plugins/_query/settings
Requests to the `_plugins/_ppl` and `_plugins/_sql` endpoints include index names in the request body, so they have the same access policy considerations as the `bulk`, `mget`, and `msearch` operations. Setting the `rest.action.multi.allow_explicit_index` parameter to `false` disables both the `SQL` and `PPL` endpoints.
{: .note}

# Available settings
## Available settings

Setting | Default | Description
:--- | :--- | :---
@@ -78,3 +78,39 @@ Setting | Default | Description
`plugins.sql.cursor.keep_alive` | 1 minute | Configures how long the cursor context is kept open. Cursor contexts are resource-intensive, so we recommend a low value.
`plugins.query.memory_limit` | 85% | Configures the heap memory usage limit for the circuit breaker of the query engine.
`plugins.query.size_limit` | 200 | Sets the default size of index that the query engine fetches from OpenSearch.

## Spark connector settings

The SQL plugin supports [Apache Spark](https://spark.apache.org/) as an augmented compute source. When data sources are defined as tables in Apache Spark, OpenSearch can consume those tables. This allows you to run SQL queries against external sources inside OpenSearch Dashboards' [Discover]({{site.url}}{{site.baseurl}}/dashboards/discover/index-discover/) and observability logs.

To get started, configure the following settings to add Spark as a data source and set the correct permissions.

Setting | Description
:--- | :---
`spark.uri` | The identifier for your Spark data source.
`spark.auth.type` | The authorization type used to authenticate into Spark.
`spark.auth.username` | The username for your Spark data source.
`spark.auth.password` | The password for your Spark data source.
`spark.datasource.flint.host` | The host of the Spark data source. Default is `localhost`.
`spark.datasource.flint.port` | The port number for Spark. Default is `9200`.
`spark.datasource.flint.scheme` | The data scheme used in your Spark queries. Valid values are `http` and `https`.
`spark.datasource.flint.auth` | The authorization required to access the Spark data source. Valid values are `false` and `sigv4`.
`spark.datasource.flint.region` | The AWS Region in which your OpenSearch cluster is located. Only use when `auth` is set to `sigv4`. Default is `us-west-2`.
When referring to an AWS Region, always use "AWS Region" on first appearance. Capitalized "Region" may be used for subsequent appearances.

`spark.datasource.flint.write.id_name` | The name of the index to which the Spark connector writes.
`spark.datasource.flint.ignore.id_column` | Excludes the `id` column when exporting data in a query. Default is `true`.
`spark.datasource.flint.write.batch_size` | Sets the batch size when writing to a Spark-connected index. Default is `1000`.
`spark.datasource.flint.write.refresh_policy` | Sets the refresh policy applied when the Spark connector fails to write data to OpenSearch. Valid values are `false` (no refresh), `true` (an immediate refresh), and `wait_for: X` (a set time to wait). Default is `false`.
`spark.datasource.flint.read.scroll_size` | Sets the number of results returned by queries run with Spark. Default is `100`.
`spark.flint.optimizer.enabled` | Enables OpenSearch optimization for the Spark connection. Default is `true`.
`spark.flint.index.hybridscan.enabled` | Enables OpenSearch to scan for write data on non-partitioned devices from the data source. Default is `false`.
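
As a rough illustration of how the defaults and valid values in the preceding table fit together, the following Python sketch merges user overrides with the documented defaults and rejects values the table does not list. The helper name `resolve_flint_settings` and the choice to represent every value as a string are assumptions made for this example. They are not part of the SQL plugin or the Spark connector.

```python
# Hypothetical helper for illustration only: it mirrors the defaults and
# valid values from the settings table and is not part of any plugin API.

# Defaults as stated in the table (values kept as strings by assumption).
DEFAULTS = {
    "spark.datasource.flint.host": "localhost",
    "spark.datasource.flint.port": "9200",
    "spark.datasource.flint.region": "us-west-2",
    "spark.datasource.flint.ignore.id_column": "true",
    "spark.datasource.flint.write.batch_size": "1000",
    "spark.datasource.flint.write.refresh_policy": "false",
    "spark.datasource.flint.read.scroll_size": "100",
    "spark.flint.optimizer.enabled": "true",
    "spark.flint.index.hybridscan.enabled": "false",
}

# Settings for which the table lists an explicit set of valid values.
VALID_VALUES = {
    "spark.datasource.flint.scheme": {"http", "https"},
    "spark.datasource.flint.auth": {"false", "sigv4"},
}


def resolve_flint_settings(overrides):
    """Merge overrides onto the documented defaults and validate them."""
    settings = {**DEFAULTS, **overrides}
    for key, allowed in VALID_VALUES.items():
        if key in settings and settings[key] not in allowed:
            raise ValueError(
                f"{key} must be one of {sorted(allowed)}, got {settings[key]!r}"
            )
    # The table says the region only applies when auth is set to sigv4.
    if settings.get("spark.datasource.flint.auth") != "sigv4":
        settings.pop("spark.datasource.flint.region", None)
    return settings
```

For example, calling `resolve_flint_settings({"spark.datasource.flint.auth": "sigv4"})` keeps the default `us-west-2` Region, while calling it with an empty dictionary drops the Region entry because `sigv4` is not in use.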

Once configured, you can test your Spark connection using the following API call:

```json
POST /_plugins/_ppl
content-type: application/json

{
  "query": "source = my_spark.sql('select * from alb_logs')"
}
```
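
The same request can also be built from a script. The following is a minimal sketch using only the Python standard library. The function name `build_ppl_request` and the base URL `http://localhost:9200` are assumptions for illustration, and `my_spark` and `alb_logs` are the placeholder names from the preceding example. Authentication is omitted and depends on how your cluster is secured.

```python
import json
import urllib.request


def build_ppl_request(base_url, query):
    """Build (but do not send) a POST request to the _plugins/_ppl endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_plugins/_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local cluster address; replace with your own endpoint.
request = build_ppl_request(
    "http://localhost:9200",
    "source = my_spark.sql('select * from alb_logs')",
)

# To send the request against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it possible to inspect the URL, headers, and body before anything is sent to the cluster.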