[BUG] Class Cast Exception when invoking Register API through MLClient #1553

Closed
owaiskazi19 opened this issue Oct 26, 2023 · 11 comments · Fixed by #1560
Labels
bug Something isn't working untriaged

Comments

@owaiskazi19
Member

owaiskazi19 commented Oct 26, 2023

What is the bug?
Calling the register API of MLClient from our plugin throws the ClassCastException shown below. The same class appears to be loaded by two different class loaders in the cluster:

java.net.FactoryURLClassLoader @475f5672
java.net.FactoryURLClassLoader @320be73

whereas our plugin is importing MLRegisterModelResponse just once from the location

jar:file:/home/ubuntu/OpenSearch/opensearch-3.0.0-SNAPSHOT/plugins/opensearch-flow-framework/opensearch-ml-client-3.0.0.0-SNAPSHOT.jar!/org/opensearch/ml/common/transport/register/MLRegisterModelResponse.class

Code:

MLRegisterModelInput mlInput = MLRegisterModelInput.builder()
                .functionName(functionName)
                .modelName(modelName)
                .description(description)
                .connectorId(connectorId)
                .build();

mlClient.register(mlInput, actionListener);

COMPLETE ERROR:

[2023-10-25T20:13:30,038][ERROR][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] Provisioning failed for workflow rlt4aIsB-BC8dYg-Eww9 : java.util.concurrent.ExecutionException: org.opensearch.flowframework.exception.FlowFrameworkException: class org.opensearch.ml.common.transport.register.MLRegisterModelResponse cannot be cast to class org.opensearch.ml.common.transport.register.MLRegisterModelResponse (org.opensearch.ml.common.transport.register.MLRegisterModelResponse is in unnamed module of loader java.net.FactoryURLClassLoader @475f5672; org.opensearch.ml.common.transport.register.MLRegisterModelResponse is in unnamed module of loader java.net.FactoryURLClassLoader @320be73)

A similar ClassCastException was already addressed in an earlier PR: #127
How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a MachineLearningNodeClient Object and invoke register API with the required input.

What is the expected behavior?
Register API should return MLRegisterModelResponse without any exception.

What is your host/environment?

  • OS: ubuntu
  • Plugins: ml-commons, opensearch-flow-framework

Do you have any additional context?
opensearch-project/flow-framework#108

@owaiskazi19 owaiskazi19 added bug Something isn't working untriaged labels Oct 26, 2023
@dbwiddis
Member

I wonder if this code has any relevance here.

@austintlee
Collaborator

ml-commons might not be the right place for this issue.

I think you need to find exactly what those two classloaders are to get to the bottom of this.
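One way to identify the two loaders is to log, for the class in question, where it was loaded from and the chain of class loaders above it. The helper below is only a sketch: `ClassOrigin` and `describe` are hypothetical names, not part of ml-commons or OpenSearch, and inside a plugin the `getProtectionDomain()` call may need extra permissions if a security manager is active.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Describes where a class came from and which loaders sit above it. */
    static String describe(Class<?> type) {
        StringBuilder sb = new StringBuilder(type.getName());

        // Code source is null for classes defined by the bootstrap loader.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        Object location = (source == null || source.getLocation() == null)
            ? "<none>"
            : source.getLocation();
        sb.append("\n  location: ").append(location);

        // Walk from the defining loader up to the bootstrap loader (represented by null).
        for (ClassLoader l = type.getClassLoader(); l != null; l = l.getParent()) {
            sb.append("\n  loader: ")
              .append(l.getClass().getName())
              .append(" @")
              .append(Integer.toHexString(System.identityHashCode(l)));
        }
        sb.append("\n  loader: <bootstrap>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(ClassOrigin.class));
        System.out.println(describe(String.class));
    }
}
```

Calling `describe(response.getClass())` on the object received in the listener, and `describe(MLRegisterModelResponse.class)` from the plugin's own code, would show whether the two copies come from the same jar and which loader defined each.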

Does this happen if you use 2.x?

@owaiskazi19
Member Author

Just tested on 2.x as well. It fails with the same error:

[2023-10-26T19:49:15,284][ERROR][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] Provisioning failed for workflow eB6HbYsBY7HioFWUywmC : org.opensearch.flowframework.exception.FlowFrameworkException: java.util.concurrent.ExecutionException: org.opensearch.flowframework.exception.FlowFrameworkException: class org.opensearch.ml.common.transport.register.MLRegisterModelResponse cannot be cast to class org.opensearch.ml.common.transport.register.MLRegisterModelResponse (org.opensearch.ml.common.transport.register.MLRegisterModelResponse is in unnamed module of loader java.net.FactoryURLClassLoader @bfc14b9; org.opensearch.ml.common.transport.register.MLRegisterModelResponse is in unnamed module of loader java.net.FactoryURLClassLoader @3ad4a7d6)

@owaiskazi19
Member Author

owaiskazi19 commented Oct 26, 2023

Tried invoking the registerModelGroup API, which doesn't wrap the actionListener when invoking client.execute; it failed with the error below. The issue arises in TransportRegisterModelGroupAction.
It looks like the class MLRegisterModelGroupResponse is being loaded in the cluster from both ml-commons and our plugin.

Logs:

[2023-10-26T22:15:30,043][ERROR][o.o.m.a.m.TransportRegisterModelGroupAction] [ip-172-31-56-214] Failed to init model group index
java.lang.ClassCastException: class org.opensearch.ml.common.transport.model_group.MLRegisterModelGroupResponse cannot be cast to class org.opensearch.ml.common.transport.model_group.MLRegisterModelGroupResponse (org.opensearch.ml.common.transport.model_group.MLRegisterModelGroupResponse is in unnamed module of loader java.net.FactoryURLClassLoader @cf67838; org.opensearch.ml.common.transport.model_group.MLRegisterModelGroupResponse is in unnamed module of loader java.net.FactoryURLClassLoader @459b187a)
        at org.opensearch.flowframework.workflow.ModelGroupStep$1.onResponse(ModelGroupStep.java:55) ~[?:?]
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.ml.action.model_group.TransportRegisterModelGroupAction.lambda$doExecute$0(TransportRegisterModelGroupAction.java:68) ~[?:?]
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.ml.model.MLModelGroupManager.lambda$createModelGroup$1(MLModelGroupManager.java:121) [opensearch-ml-3.0.0.0-SNAPSHOT.jar:3.0.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportSingleItemBulkWriteAction.lambda$wrapBulkResponse$0(TransportSingleItemBulkWriteAction.java:84) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:701) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportBulkAction$BulkOperation$1.onResponse(TransportBulkAction.java:674) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportBulkAction$BulkOperation$1.onResponse(TransportBulkAction.java:660) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishOnSuccess(TransportReplicationAction.java:1175) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleResponse(TransportReplicationAction.java:1083) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleResponse(TransportReplicationAction.java:1069) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:886) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1493) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1576) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1556) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$runWithPrimaryShardReference$2(TransportReplicationAction.java:566) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.ReplicationOperation.finish(ReplicationOperation.java:439) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.ReplicationOperation.decPendingAndFinishIfNeeded(ReplicationOperation.java:425) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:194) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:186) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportWriteAction$WritePrimaryResult$1.onSuccess(TransportWriteAction.java:332) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.maybeFinish(TransportWriteAction.java:478) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.lambda$run$1(TransportWriteAction.java:507) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.notifyList(AsyncIOProcessor.java:143) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:121) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:95) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.sync(IndexShard.java:4249) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:505) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportWriteAction$WritePrimaryResult.runPostReplicationActions(TransportWriteAction.java:339) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:186) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:355) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportShardBulkAction$2.finishRequest(TransportShardBulkAction.java:521) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:484) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:533) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:414) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:124) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:235) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:833) [?:?]

@dbwiddis
Copy link
Member

So the exception is thrown on this instantiation:

listener.onResponse(new MLRegisterModelGroupResponse(modelGroupId, MLTaskState.CREATED.name()));

The context for the Action Listener is set (or, rather, restored) here:

try (ThreadContext.StoredContext context = client.threadPool().getThreadContext().stashContext()) {
ActionListener<String> wrappedListener = ActionListener.runBefore(listener, () -> context.restore());

@austintlee
Collaborator

@owaiskazi19 I see that you were the one who added this API to MLClient (#1493). Is it possible that you have some old jar sitting on your machine? Does this happen on anyone else's machine or is this happening in a "fresh" environment?

@owaiskazi19
Member Author

owaiskazi19 commented Oct 27, 2023

@owaiskazi19 I see that you were the one who added this API to MLClient (#1493). Is it possible that you have some old jar sitting on your machine? Does this happen on anyone else's machine or is this happening in a "fresh" environment?

@joshpalis faced the same issue with the register API and created opensearch-project/flow-framework#108. I also cleaned everything, spun up a new OpenSearch 3.0 cluster, and installed the ml-commons and ai-flow-framework plugins. It still failed with the same error.

I think what @dbwiddis pointed out above about the context of the ActionListener could be the reason for the ClassCastException.

@ylwu-amzn ylwu-amzn mentioned this issue Oct 27, 2023
@owaiskazi19
Member Author

owaiskazi19 commented Oct 27, 2023

#1560 solved the Class Cast Exception for register API! Thanks @ylwu-amzn

@arjunkumargiri
Contributor

@owaiskazi19 Do we know why MLRegisterModelInput is loaded from two different jars?

@dbwiddis
Member

dbwiddis commented Nov 16, 2023

@owaiskazi19 Do we know why MLRegisterModelInput is loaded from two different jars?

Same JAR, different class loaders.

Also for clarity the conflict is in the Response subclass, in this issue MLRegisterModelResponse.

My suspicion is that plugin class loaders are children of the OpenSearch parent; so they share classes with OpenSearch but don't share classes with each other.
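That suspicion can be illustrated with a small standalone simulation. This is a sketch, not OpenSearch's actual plugin loader: `PluginLoader`, `CoreType`, and `BundledResponse` are hypothetical stand-ins, and the loader deliberately defines `BundledResponse` itself to mimic a class shipped inside each plugin's own jar. It assumes the compiled classes are readable as resources from the application class loader.

```java
import java.io.IOException;
import java.io.InputStream;

public class SiblingLoaderDemo {

    /** Stand-in for a core class that only the shared parent loader defines. */
    public static class CoreType {}

    /** Stand-in for a response class bundled into each plugin's own jar. */
    public static class BundledResponse {}

    /** Hypothetical plugin loader: defines BundledResponse itself, delegates everything else. */
    static class PluginLoader extends ClassLoader {
        PluginLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(BundledResponse.class.getName())) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    String path = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(path)) {
                        if (in == null) throw new ClassNotFoundException(name);
                        byte[] bytes = in.readAllBytes();
                        // Same bytes, but this loader becomes the defining loader.
                        c = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                return c;
            }
        }
    }

    static Class<?> load(ClassLoader loader, Class<?> type) {
        try {
            return loader.loadClass(type.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if two sibling loaders resolve the given type to the same Class object. */
    static boolean siblingsShare(Class<?> type) {
        ClassLoader parent = SiblingLoaderDemo.class.getClassLoader();
        return load(new PluginLoader(parent), type) == load(new PluginLoader(parent), type);
    }

    /** True if an instance created by one sibling cannot be cast to the other sibling's class. */
    static boolean castAcrossSiblingsFails() {
        ClassLoader parent = SiblingLoaderDemo.class.getClassLoader();
        Class<?> inPluginA = load(new PluginLoader(parent), BundledResponse.class);
        Class<?> inPluginB = load(new PluginLoader(parent), BundledResponse.class);
        try {
            Object fromA = inPluginA.getDeclaredConstructor().newInstance();
            inPluginB.cast(fromA);
            return false;
        } catch (ClassCastException e) {
            return true;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("core type shared:    " + siblingsShare(CoreType.class));
        System.out.println("bundled type shared: " + siblingsShare(BundledResponse.class));
        System.out.println("cast fails:          " + castAcrossSiblingsFails());
    }
}
```

The point of the sketch is that class identity in the JVM is the pair of class name and defining loader, so identical bytes from the same jar still yield two incompatible types when two sibling loaders each define them.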

I have not delved into the specific details, but the linked issues and PRs demonstrate an effective fix.

@owaiskazi19
Member Author

@owaiskazi19 Do we know why MLRegisterModelInput is loaded from two different jars?

As @dbwiddis pointed out: in flow framework we were pulling MLRegisterModelInput in from the ml-commons dependency in our build.gradle, while the ml-commons plugin has its own copy of the MLRegisterModelInput class. With both plugins installed in the cluster, the class was loaded by two different classloaders, hence the ClassCastException.
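The thread does not describe how #1560 resolves this, so the following is not a description of that PR. It is a sketch of one general technique for crossing a class-loader boundary: instead of casting the foreign object, write it out through an interface both sides share and read it back into the local copy of the class. All names here (`Writeable`, `LocalResponse`, `ForeignResponse`, `toLocal`) are hypothetical, and `ForeignResponse` merely stands in for the same class as defined by another plugin's loader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CopyAcrossLoaders {

    /** Hypothetical wire contract; it must be defined by a loader both plugins share. */
    interface Writeable {
        void writeTo(DataOutputStream out) throws IOException;
    }

    /** The copy of the response type that the calling plugin's loader defines. */
    static final class LocalResponse implements Writeable {
        final String modelId;
        final String status;

        LocalResponse(String modelId, String status) {
            this.modelId = modelId;
            this.status = status;
        }

        /** Reads the fields in the same order writeTo emits them. */
        LocalResponse(DataInputStream in) throws IOException {
            this(in.readUTF(), in.readUTF());
        }

        @Override
        public void writeTo(DataOutputStream out) throws IOException {
            out.writeUTF(modelId);
            out.writeUTF(status);
        }
    }

    /** Stand-in for the "same" type as defined by a different class loader. */
    static final class ForeignResponse implements Writeable {
        final String modelId;
        final String status;

        ForeignResponse(String modelId, String status) {
            this.modelId = modelId;
            this.status = status;
        }

        @Override
        public void writeTo(DataOutputStream out) throws IOException {
            out.writeUTF(modelId);
            out.writeUTF(status);
        }
    }

    /** Returns the response as a LocalResponse, copying through bytes when a cast is impossible. */
    static LocalResponse toLocal(Writeable response) {
        if (response instanceof LocalResponse) {
            return (LocalResponse) response; // same loader, plain cast is safe
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                response.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return new LocalResponse(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LocalResponse local = toLocal(new ForeignResponse("model-1", "CREATED"));
        System.out.println(local.modelId + " " + local.status);
    }
}
```

The copy costs one extra serialization round trip per response, which is usually acceptable for small control-plane responses like these; the alternative is to make sure only one loader ever defines the class.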
