
[BUG] Add circuit breaker for FlintREPL #109

Closed
YANG-DB opened this issue Oct 27, 2023 · 1 comment
Labels
bug Something isn't working

Comments

YANG-DB (Member) commented Oct 27, 2023

What is the bug?
In some cases Flint queries keep running without ever being timed out or circuit-broken.

How can one reproduce the bug?
From the workbench, run:

source = mys3.default.http_logs head 5

What is the expected behavior?
Jobs should be timed out / circuit-broken after a maximum running time.
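The requested behavior — a hard cap on how long a job may run — can be sketched as a wall-clock deadline check inside the REPL's polling loop. This is an illustrative, language-agnostic sketch (Python for brevity), not FlintREPL code; `MAX_RUNTIME_SECONDS`, `repl_poll_loop`, and `read_request` are hypothetical names, and the conf key in the comment is made up.

```python
import time

# Hypothetical cap; in a real fix this would come from a Spark conf
# (a key along the lines of spark.flint.job.maxRuntime -- illustrative only).
MAX_RUNTIME_SECONDS = 30 * 60

def repl_poll_loop(read_request, deadline_seconds=MAX_RUNTIME_SECONDS,
                   clock=time.monotonic):
    """Poll for query requests, but circuit-break once a wall-clock
    deadline passes instead of looping forever."""
    start = clock()
    while True:
        if clock() - start >= deadline_seconds:
            return "timed_out"   # stop polling; the job can now exit cleanly
        if read_request() == "done":
            return "completed"
```

The `clock` parameter is injected only so the deadline logic is testable without real waiting; the point is that every iteration of the read loop re-checks elapsed time, which is exactly what the repeating `read from .query_execution_request_mys3` loop in the log below never does.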

Do you have any screenshots?

The following job doesn't stop running:

Spark config params:

--class org.apache.spark.sql.FlintREPL --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.emr.AssumeRoleAWSCredentialsProvider --conf spark.hadoop.aws.catalog.credentials.provider.factory.class=com.amazonaws.glue.catalog.metastore.STSAssumeRoleSessionCredentialsProviderFactory --conf spark.jars.packages=org.opensearch:opensearch-spark-standalone_2.12:0.1.0-SNAPSHOT,org.opensearch:opensearch-spark-sql-application_2.12:0.1.0-SNAPSHOT,org.opensearch:opensearch-spark-ppl_2.12:0.1.0-SNAPSHOT --conf spark.jars.repositories=https://aws.oss.sonatype.org/content/repositories/snapshots --conf spark.datasource.flint.host=ec2-18-237-133-156.us-west-2.compute.amazonaws.com --conf spark.datasource.flint.port=9200 --conf spark.datasource.flint.scheme=http --conf spark.datasource.flint.auth=noauth --conf spark.datasource.flint.customAWSCredentialsProvider=com.amazonaws.emr.AssumeRoleAWSCredentialsProvider --conf spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.opensearch.flint.spark.FlintSparkExtensions --conf spark.emr-serverless.driverEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN=arn:aws:iam::924196221507:role/FlintOpensearchServiceRole --conf spark.executorEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN=arn:aws:iam::924196221507:role/FlintOpensearchServiceRole --conf spark.hive.metastore.glue.role.arn=arn:aws:iam::924196221507:role/FlintOpensearchServiceRole --conf spark.sql.catalog.mys3=org.opensearch.sql.FlintDelegatingSessionCatalog --conf spark.flint.datasource.name=mys3 --conf spark.flint.job.requestIndex=.query_execution_request_mys3 --conf spark.flint.job.sessionId=NWpKUFBobmtiZ215czM= --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=10 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64/ --conf 
spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64/

Job error Log:

https://aws.oss.sonatype.org/content/repositories/snapshots added as a remote repository with the name: repo-1
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
org.opensearch#opensearch-spark-standalone_2.12 added as a dependency
org.opensearch#opensearch-spark-sql-application_2.12 added as a dependency
org.opensearch#opensearch-spark-ppl_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e89ff3b7-170e-4545-bd38-2304496ffbe0;1.0
	confs: [default]
	found org.opensearch#opensearch-spark-standalone_2.12;0.1.0-SNAPSHOT in repo-1
	found org.opensearch#opensearch-spark-sql-application_2.12;0.1.0-SNAPSHOT in repo-1
	found org.opensearch#opensearch-spark-ppl_2.12;0.1.0-SNAPSHOT in repo-1
downloading https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/opensearch-spark-standalone_2.12/0.1.0-SNAPSHOT/opensearch-spark-standalone_2.12-0.1.0-20231025.180019-42.jar ...
	[SUCCESSFUL ] org.opensearch#opensearch-spark-standalone_2.12;0.1.0-SNAPSHOT!opensearch-spark-standalone_2.12.jar (1796ms)
downloading https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/opensearch-spark-sql-application_2.12/0.1.0-SNAPSHOT/opensearch-spark-sql-application_2.12-0.1.0-20231025.180031-35.jar ...
	[SUCCESSFUL ] org.opensearch#opensearch-spark-sql-application_2.12;0.1.0-SNAPSHOT!opensearch-spark-sql-application_2.12.jar (292ms)
downloading https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/opensearch-spark-ppl_2.12/0.1.0-SNAPSHOT/opensearch-spark-ppl_2.12-0.1.0-20231025.180026-19.jar ...
	[SUCCESSFUL ] org.opensearch#opensearch-spark-ppl_2.12;0.1.0-SNAPSHOT!opensearch-spark-ppl_2.12.jar (229ms)
:: resolution report :: resolve 7942ms :: artifacts dl 2324ms
	:: modules in use:
	org.opensearch#opensearch-spark-ppl_2.12;0.1.0-SNAPSHOT from repo-1 in [default]
	org.opensearch#opensearch-spark-sql-application_2.12;0.1.0-SNAPSHOT from repo-1 in [default]
	org.opensearch#opensearch-spark-standalone_2.12;0.1.0-SNAPSHOT from repo-1 in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e89ff3b7-170e-4545-bd38-2304496ffbe0
	confs: [default]
	3 artifacts copied, 0 already retrieved (37949kB/39ms)
23/10/27 06:12:44 INFO HiveConf: Found configuration file file:/etc/spark/conf/hive-site.xml
23/10/27 06:12:44 INFO SparkContext: Running Spark version 3.3.2-amzn-0
23/10/27 06:12:44 INFO ResourceUtils: ==============================================================
23/10/27 06:12:44 INFO ResourceUtils: No custom resources configured for spark.driver.
23/10/27 06:12:44 INFO ResourceUtils: ==============================================================
23/10/27 06:12:44 INFO SparkContext: Submitted application: FlintREPL$
23/10/27 06:12:44 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 14336, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/10/27 06:12:44 INFO ResourceProfile: Limiting resource is cpus at 4 tasks per executor
23/10/27 06:12:44 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/10/27 06:12:44 INFO SecurityManager: Changing view acls to: hadoop
23/10/27 06:12:44 INFO SecurityManager: Changing modify acls to: hadoop
23/10/27 06:12:44 INFO SecurityManager: Changing view acls groups to: 
23/10/27 06:12:44 INFO SecurityManager: Changing modify acls groups to: 
23/10/27 06:12:44 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
23/10/27 06:12:44 INFO Utils: Successfully started service 'sparkDriver' on port 42815.
23/10/27 06:12:44 INFO SparkEnv: Registering MapOutputTracker
23/10/27 06:12:44 INFO SparkEnv: Registering BlockManagerMaster
23/10/27 06:12:44 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/10/27 06:12:44 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/10/27 06:12:44 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/10/27 06:12:44 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-0cc5defb-dc15-4c7a-b4dd-392a8b830b7d
23/10/27 06:12:44 INFO MemoryStore: MemoryStore started with capacity 8.2 GiB
23/10/27 06:12:44 INFO SparkEnv: Registering OutputCommitCoordinator
23/10/27 06:12:44 INFO SubResultCacheManager: Sub-result caches are disabled.
23/10/27 06:12:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/10/27 06:12:45 INFO SparkContext: Added JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-standalone_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-standalone_2.12-0.1.0-SNAPSHOT.jar with timestamp 1698387164418
23/10/27 06:12:45 INFO SparkContext: Added JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar with timestamp 1698387164418
23/10/27 06:12:45 INFO SparkContext: Added JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-ppl_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-ppl_2.12-0.1.0-SNAPSHOT.jar with timestamp 1698387164418
23/10/27 06:12:45 INFO SparkContext: The JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar has been added already. Overwriting of added jar is not supported in the current version.
23/10/27 06:12:45 INFO ExecutorContainerAllocator: Set total expected execs to {0=10}
23/10/27 06:12:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34899.
23/10/27 06:12:45 INFO NettyBlockTransferService: Server created on [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:34899
23/10/27 06:12:45 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/10/27 06:12:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:34899 with 8.2 GiB RAM, BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO ExecutorContainerAllocator: Going to request 10 executors for ResourceProfile Id: 0, target: 10 already provisioned: 0.
23/10/27 06:12:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO TimeBasedRotatingEventLogFilesWriter: rotationIntervalInSeconds = 300, eventFileMinSize = 1048576, maxFilesToRetain = 2
23/10/27 06:12:45 INFO DefaultEmrServerlessRMClient: Creating containers with container role SPARK_EXECUTOR and keys: Set(8, 4, 9, 5, 10, 6, 1, 2, 7, 3)
23/10/27 06:12:45 INFO TimeBasedRotatingEventLogFilesWriter: Logging events to file:/var/log/spark/apps/eventlog_v2_00fe9epc1cjpq00q/00fe9epc1cjpq00q.inprogress
23/10/27 06:12:46 INFO DefaultEmrServerlessRMClient: Containers created with container role SPARK_EXECUTOR. key to container id map: Map(2 -> 08c5b7e1-1b1b-dd4b-62c2-3a7ab4c18ac8, 5 -> 26c5b7e1-1b13-e4bb-8cf6-0a60d7c7f5cc, 8 -> 84c5b7e1-1b22-e778-85e7-3b8e950e0a21, 7 -> 62c5b7e1-1b45-344a-45cf-d3deb353616f, 1 -> 78c5b7e1-1b2f-5103-7308-a83aefb136f8, 4 -> a6c5b7e1-1b3d-23b5-549e-bec6bb731d45, 6 -> 86c5b7e1-1b0c-2433-e7f0-24c75089ba24, 9 -> 12c5b7e1-1b35-dcd6-7cc4-10346fd83dfc, 3 -> 18c5b7e1-1b4c-740e-8965-7a9b4915f5e6, 10 -> 8cc5b7e1-1b05-49e2-c3a3-c6870e7b2340)
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:69d4:5763:1571:e8f5:33610) with ID 7,  ResourceProfileId 0
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:69d4:5763:1571:e8f5]:41483 with 7.9 GiB RAM, BlockManagerId(7, [2a05:d018:107:5a00:69d4:5763:1571:e8f5], 41483, None)
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:3855:6e2e:5ec2:9670:56494) with ID 2,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:dc42:d9ab:6f13:a7f4:50894) with ID 8,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:7ccc:70a0:84d8:7e13:42610) with ID 5,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:b57:4798:8281:3de4:48974) with ID 6,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:ee9a:1854:d167:471b:32842) with ID 10,  ResourceProfileId 0
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:3855:6e2e:5ec2:9670]:46183 with 7.9 GiB RAM, BlockManagerId(2, [2a05:d018:107:5a00:3855:6e2e:5ec2:9670], 46183, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:dc42:d9ab:6f13:a7f4]:33559 with 7.9 GiB RAM, BlockManagerId(8, [2a05:d018:107:5a00:dc42:d9ab:6f13:a7f4], 33559, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:7ccc:70a0:84d8:7e13]:35807 with 7.9 GiB RAM, BlockManagerId(5, [2a05:d018:107:5a00:7ccc:70a0:84d8:7e13], 35807, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:b57:4798:8281:3de4]:37141 with 7.9 GiB RAM, BlockManagerId(6, [2a05:d018:107:5a00:b57:4798:8281:3de4], 37141, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:ee9a:1854:d167:471b]:38219 with 7.9 GiB RAM, BlockManagerId(10, [2a05:d018:107:5a00:ee9a:1854:d167:471b], 38219, None)
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:d2d1:d67e:bfaf:4be6:60494) with ID 4,  ResourceProfileId 0
23/10/27 06:12:51 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:d2d1:d67e:bfaf:4be6]:43541 with 7.9 GiB RAM, BlockManagerId(4, [2a05:d018:107:5a00:d2d1:d67e:bfaf:4be6], 43541, None)
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:836b:42d:bf25:96c4:53088) with ID 9,  ResourceProfileId 0
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:ce83:e099:9129:7f2a:56734) with ID 3,  ResourceProfileId 0
23/10/27 06:12:51 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/10/27 06:12:51 INFO SharedState: Warehouse path is 'file:/home/hadoop/spark-warehouse'.
23/10/27 06:12:51 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:836b:42d:bf25:96c4]:44279 with 7.9 GiB RAM, BlockManagerId(9, [2a05:d018:107:5a00:836b:42d:bf25:96c4], 44279, None)
23/10/27 06:12:51 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:ce83:e099:9129:7f2a]:43411 with 7.9 GiB RAM, BlockManagerId(3, [2a05:d018:107:5a00:ce83:e099:9129:7f2a], 43411, None)
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:d9d9:ced2:7d72:1aed:37034) with ID 1,  ResourceProfileId 0
23/10/27 06:12:52 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:d9d9:ced2:7d72:1aed]:42027 with 7.9 GiB RAM, BlockManagerId(1, [2a05:d018:107:5a00:d9d9:ced2:7d72:1aed], 42027, None)
23/10/27 06:12:53 INFO FlintREPL: Updated job: {"jobid": 00fe9epc1cjpq00q, "sessionId": NWpKUFBobmtiZ215czM=} from .query_execution_request_mys3
23/10/27 06:12:53 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:12:53 INFO FlintREPL: read from .query_execution_request_mys3
... (same line repeated every second) ...
23/10/27 06:27:48 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:48 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:49 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:49 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:50 INFO FlintREPL: read from .query_execution_request_mys3
YANG-DB added the bug (Something isn't working) and untriaged labels on Oct 27, 2023
dai-chen (Collaborator) commented:

I think @kaituo already added a default timeout for the REPL job. Feel free to reopen if you still see the issue. Thanks!
