[BUG] Model content hash can't match original hash value #2819
Comments
When an error occurs, should the zip file be retained for easy comparison?
private void retrieveModelChunks(MLModel mlModelMeta, ActionListener<File> listener) throws InterruptedException {
    String modelId = mlModelMeta.getModelId();
    String modelName = mlModelMeta.getName();
    Integer totalChunks = mlModelMeta.getTotalChunks();
    GetRequest getRequest = new GetRequest();
    getRequest.index(ML_MODEL_INDEX);
    getRequest.id();
    Semaphore semaphore = new Semaphore(1);
    AtomicBoolean stopNow = new AtomicBoolean(false);
    String modelZip = mlEngine.getDeployModelZipPath(modelId, modelName);
    ConcurrentLinkedDeque<File> chunkFiles = new ConcurrentLinkedDeque();
    AtomicInteger retrievedChunks = new AtomicInteger(0);
    for (int i = 0; i < totalChunks; i++) {
        semaphore.tryAcquire(10, TimeUnit.SECONDS);
        if (stopNow.get()) {
            throw new MLException("Failed to deploy model");
        }
        String modelChunkId = this.getModelChunkId(modelId, i);
        int currentChunk = i;
        this.getModel(modelChunkId, threadedActionListener(DEPLOY_THREAD_POOL, ActionListener.wrap(model -> {
            Path chunkPath = mlEngine.getDeployModelChunkPath(modelId, currentChunk);
            FileUtils.write(Base64.getDecoder().decode(model.getContent()), chunkPath.toString());
            chunkFiles.add(new File(chunkPath.toUri()));
            retrievedChunks.getAndIncrement();
            if (retrievedChunks.get() == totalChunks) {
                File modelZipFile = new File(modelZip);
                FileUtils.mergeFiles(chunkFiles, modelZipFile);
                listener.onResponse(modelZipFile);
            }
            semaphore.release();
        }, e -> {
            stopNow.set(true);
            semaphore.release();
            log.error("Failed to retrieve model chunk " + modelChunkId, e);
            if (retrievedChunks.get() == totalChunks - 1) {
                listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
            }
        })));
    }
}
What operating system are you running this on?
Can you share your exact operating system name? @libxj
CentOS Linux release 7.9.2009 (Core); Server: Docker Engine - Community
Are you not going to fix this bug? Model deployment now depends entirely on luck. Or should I submit a PR?
So, each of your nodes runs in a Docker container and I am guessing from your comment that maybe you have a separate ML node that runs on a host with a GPU and your data node runs on a different host?
If you have a fix that works, of course, please submit a PR.
Yes, my data nodes are on other hosts, and the ML node is on a new GPU machine. They are connected through a VPN tunnel, so the probability of a successful deployment is quite low.
Thanks @jlibx for fixing this issue. Have you tested that the issue is gone with the fix?
@ylwu-amzn I looked at his PR, but it does not explain what we are seeing above. In the screenshot, you can see that the model cache is missing a chunk (chunk id 18), which explains why the hash did not match. I think the real problem might be in this section:
ml-commons/plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java, lines 1710 to 1717 at commit a4dff63
More specifically, this line:
I think what we need is:
When chunk 18 was not found, we sort of swallowed the exception and did not invoke listener.onFailure. We will need a setup like what OP has, with bad connectivity between the ML node and the data nodes, to be able to reliably reproduce this problem and verify the fix, but to me, that looks very suspicious. If I am correct, when this problem happens, you should see this line in the log:
Failed to retrieve model chunk <modelChunkId>
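As a rough, hypothetical illustration of that idea (this is not the actual ml-commons code or the commenter's PR, and the names below are made up): the error handler could report the failure to the caller as soon as any chunk is missing, guarded so that onFailure fires exactly once, instead of only when the failing chunk happens to be the last one. A minimal, self-contained sketch:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, self-contained sketch only; none of these names come from ml-commons.
public class ChunkErrorHandlingSketch {

    interface Listener {
        void onResponse(String result);
        void onFailure(Exception e);
    }

    public static void main(String[] args) {
        int totalChunks = 61;
        AtomicInteger retrievedChunks = new AtomicInteger(0);
        AtomicBoolean stopNow = new AtomicBoolean(false);
        AtomicBoolean failureReported = new AtomicBoolean(false);

        Listener listener = new Listener() {
            public void onResponse(String result) {
                System.out.println("deployed " + result);
            }
            public void onFailure(Exception e) {
                System.out.println("deploy failed: " + e.getMessage());
            }
        };

        // Simulate chunk 18 going missing while the other chunks arrive fine.
        for (int i = 0; i < totalChunks; i++) {
            boolean chunkMissing = (i == 18);
            if (chunkMissing) {
                stopNow.set(true);
                // Key difference from the snippet above: fail the listener the first time
                // ANY chunk is missing, not only when retrievedChunks == totalChunks - 1.
                if (failureReported.compareAndSet(false, true)) {
                    listener.onFailure(new RuntimeException("Fail to find model chunk " + i));
                }
            } else {
                retrievedChunks.getAndIncrement();
            }
        }
        System.out.println("retrieved " + retrievedChunks.get() + " of " + totalChunks + " chunks");
    }
}

The only point is the control flow: with the compareAndSet guard, a deploy with a missing chunk fails immediately with the chunk error instead of surfacing later as a hash mismatch on the merged zip.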
@jlibx ^^
What is the bug?
opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils ] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager ] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager ] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ
How can one reproduce the bug?
Steps to reproduce the behavior:
_register
POST /_plugins/_ml/models/_register
{
  "name": "cre_pt_v0_2_0_test2",
  "version": "0.2.0",
  "model_format": "TORCH_SCRIPT",
  "function_name": "TEXT_EMBEDDING",
  "description": "huggingface_cre_v0_2_0_snapshot_norm_pt model 2024.4.26",
  "url": "xxx.zip",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 1024,
    "framework_type": "SENTENCE_TRANSFORMERS"
  },
  "model_content_hash_value": "197916cdbbeb40903393a3f74c215a6c4cb7e3201a2e0e826ef2b93728e4bf6b"
}
_deploy
POST /_plugins/_ml/models/HdPgNZEBkGu7typLkQJX/_deploy
result
GET /_plugins/_ml/tasks/aYjwNZEBtLXkNmkPz_gQ
{
  "model_id": "HdPgNZEBkGu7typLkQJX",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "Tv4342EeTaydOgMRthFtrg"
  ],
  "create_time": 1723186859791,
  "last_update_time": 1723187876166,
  "error": """{"Tv4342EeTaydOgMRthFtrg":"model content changed"}""",
  "is_async": true
}
What is your host/environment?
My hash value is completely correct, computed with the official method:
shasum -a 256 sentence-transformers_paraphrase-mpnet-base-v2-1.0.0-onnx.zip
There is no problem when calling _register, but the above error occurred after _deploy.
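Assuming, as the register call above suggests, that model_content_hash_value is the SHA-256 hex digest of the model zip (the same value shasum -a 256 prints), a minimal Java sketch for computing it locally could look like this; the file name is simply the one from the command above:

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Computes the SHA-256 hex digest of a model zip, the same value that
// `shasum -a 256 <file>` prints; compare it with model_content_hash_value.
public class ModelZipHash {
    public static void main(String[] args) throws Exception {
        Path zip = Path.of("sentence-transformers_paraphrase-mpnet-base-v2-1.0.0-onnx.zip");
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(zip));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}

For very large zips, streaming the file through a java.security.DigestInputStream avoids reading it fully into memory; the resulting digest is identical.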