
[BUG] Model content hash can't match original hash value #2819

Open
jlibx opened this issue Aug 9, 2024 · 12 comments
Labels: bug Something isn't working

Comments


jlibx commented Aug 9, 2024

What is the bug?
opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils ] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager ] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager ] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. _register
     POST /_plugins/_ml/models/_register
     {
       "name": "cre_pt_v0_2_0_test2",
       "version": "0.2.0",
       "model_format": "TORCH_SCRIPT",
       "function_name": "TEXT_EMBEDDING",
       "description": "huggingface_cre_v0_2_0_snapshot_norm_pt model 2024.4.26",
       "url": "xxx.zip",
       "model_config": {
         "model_type": "bert",
         "embedding_dimension": 1024,
         "framework_type": "SENTENCE_TRANSFORMERS"
       },
       "model_content_hash_value": "197916cdbbeb40903393a3f74c215a6c4cb7e3201a2e0e826ef2b93728e4bf6b"
     }

  2. _deploy
    POST /_plugins/_ml/models/HdPgNZEBkGu7typLkQJX/_deploy

  3. result
    GET /_plugins/_ml/tasks/aYjwNZEBtLXkNmkPz_gQ

{
  "model_id": "HdPgNZEBkGu7typLkQJX",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "Tv4342EeTaydOgMRthFtrg"
  ],
  "create_time": 1723186859791,
  "last_update_time": 1723187876166,
  "error": """{"Tv4342EeTaydOgMRthFtrg":"model content changed"}""",
  "is_async": true
}

What is your host/environment?

  • OS: Linux (Docker on CentOS Linux release 7.9.2009, base image opensearchproject/opensearch:2.16.0)
  • Version: 2.16.0
  • Plugins: ml-commons

My hash value is correct: I computed it with the documented method, shasum -a 256 sentence-transformers_paraphrase-mpnet-base-v2-1.0.0-onnx.zip. The _register call works with no problem, but the error above occurs after _deploy.
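
For reference, a minimal standalone Java sketch (my own illustration, not the plugin's internal code path) that computes the same SHA-256 digest as shasum -a 256 over a model zip, so it can be compared with model_content_hash_value:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Minimal sketch: stream a model zip through SHA-256 and print the hex digest,
// equivalent to `shasum -a 256 <file>`.
public class ZipHash {
    public static void main(String[] args) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                digest.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}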

jlibx added the bug (Something isn't working) and untriaged labels on Aug 9, 2024

jlibx commented Aug 9, 2024

When an error occurs, could the zip file be retained to make comparison easier?
In addition, the network between the ML node and the data node is currently not very good. Could that affect the content of each retrieved chunk and ultimately lead to problems with the merged zip file?


jlibx commented Aug 11, 2024

private void retrieveModelChunks(MLModel mlModelMeta, ActionListener<File> listener) throws InterruptedException {
        String modelId = mlModelMeta.getModelId();
        String modelName = mlModelMeta.getName();
        Integer totalChunks = mlModelMeta.getTotalChunks();
        GetRequest getRequest = new GetRequest();
        getRequest.index(ML_MODEL_INDEX);
        getRequest.id();
        Semaphore semaphore = new Semaphore(1);
        AtomicBoolean stopNow = new AtomicBoolean(false);
        String modelZip = mlEngine.getDeployModelZipPath(modelId, modelName);
        ConcurrentLinkedDeque<File> chunkFiles = new ConcurrentLinkedDeque();
        AtomicInteger retrievedChunks = new AtomicInteger(0);
        for (int i = 0; i < totalChunks; i++) {
            semaphore.tryAcquire(10, TimeUnit.SECONDS);
            if (stopNow.get()) {
                throw new MLException("Failed to deploy model");
            }
            String modelChunkId = this.getModelChunkId(modelId, i);
            int currentChunk = i;
            this.getModel(modelChunkId, threadedActionListener(DEPLOY_THREAD_POOL, ActionListener.wrap(model -> {
                Path chunkPath = mlEngine.getDeployModelChunkPath(modelId, currentChunk);
                FileUtils.write(Base64.getDecoder().decode(model.getContent()), chunkPath.toString());
                chunkFiles.add(new File(chunkPath.toUri()));
                retrievedChunks.getAndIncrement();
                if (retrievedChunks.get() == totalChunks) {
                    File modelZipFile = new File(modelZip);
                    FileUtils.mergeFiles(chunkFiles, modelZipFile);
                    listener.onResponse(modelZipFile);
                }
                semaphore.release();
            }, e -> {
                stopNow.set(true);
                semaphore.release();
                log.error("Failed to retrieve model chunk " + modelChunkId, e);
                if (retrievedChunks.get() == totalChunks - 1) {
                    listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
                }
            })));
        }
    }

semaphore.tryAcquire(10, TimeUnit.SECONDS);

I think this line can cause chunk confusion when the network is poor: the return value of tryAcquire is ignored, so once the 10-second timeout expires the loop moves on and requests the next chunk while the previous download is still in flight, and chunkFiles then collects the chunks in completion order rather than index order.
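
As a standalone illustration of that behaviour (a sketch I wrote, not plugin code), the ignored return value means a timed-out wait does not stop the loop:

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// The single permit is held by a simulated slow download, so tryAcquire times
// out and returns false, yet the loop keeps issuing requests because the
// return value is never checked.
public class SemaphoreTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        Semaphore semaphore = new Semaphore(1);
        semaphore.acquire(); // simulate a chunk download that has not released the permit yet
        for (int i = 0; i < 3; i++) {
            boolean gotPermit = semaphore.tryAcquire(1, TimeUnit.SECONDS); // times out, returns false
            System.out.println("chunk " + i + " requested, permit acquired: " + gotPermit);
        }
    }
}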

austintlee (Collaborator) commented:

What operating system are you running this on?

Zhangxunmt (Collaborator) commented:

Can you share your exact operating system name? @jlibx


jlibx commented Aug 28, 2024

CentOS Linux release 7.9.2009 (Core)
But I run it in Docker; the base image is opensearchproject/opensearch:2.16.0.
Client: Docker Engine - Community
Version: 26.1.4
API version: 1.45
Go version: go1.21.11
Git commit: 5650f9b
Built: Wed Jun 5 11:32:04 2024
OS/Arch: linux/amd64
Context: default

Server: Docker Engine - Community
Engine:
Version: 26.1.4
API version: 1.45 (minimum version 1.24)
Go version: go1.21.11
Git commit: de5c9cf
Built: Wed Jun 5 11:31:02 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.33
GitCommit: d2d58213f83a351ca8f528a95fbd145f5654e957
nvidia:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0


jlibx commented Aug 28, 2024

[screenshot: model cache directory listing showing that chunk 18 is missing]


jlibx commented Sep 6, 2024

Are you not going to fix this bug? Model deployment now depends entirely on luck. Or should I submit a PR?

austintlee (Collaborator) commented:

In addition, the current network between the ml node and the data node is not very good

So, each of your nodes runs in a Docker container and I am guessing from your comment that maybe you have a separate ML node that runs on a host with a GPU and your data node runs on a different host?

Or should I submit a PR

If you have a fix that works, of course, please submit a PR.


jlibx commented Sep 9, 2024

Yes, my data nodes are on other hosts, and the ML node is on a new GPU machine. They are connected through a VPN tunnel, so the probability of a successful deployment is quite low.

ylwu-amzn (Collaborator) commented:

Thanks @jlibx for fixing this issue. Have you tested that the issue is gone with the fix?

austintlee (Collaborator) commented:

@ylwu-amzn I looked at his PR, but it does not explain what we are seeing above. In the screenshot, you can see that the model cache is missing a chunk (chunk id 18), which explains why the hash did not match.

I think the real problem might be in this section:

}, e -> {
    stopNow.set(true);
    semaphore.release();
    log.error("Failed to retrieve model chunk " + modelChunkId, e);
    if (retrievedChunks.get() == totalChunks - 1) {
        listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
    }
})));

More specifically, this line:

if (retrievedChunks.get() == totalChunks - 1) {

I think what we need is:

if (retrievedChunks.get() <= totalChunks - 1) {

When chunk 18 was not found, we sort of swallowed the exception and did not invoke .onFailure because, at that time, the equality condition was not met.

We will need a setup like the OP's, with bad connectivity between the ML nodes and data nodes, to reliably reproduce this problem and verify the fix, but to me that condition looks very suspicious.

If I am correct, when this problem happens, you should see this line in the log:

log.error("Failed to retrieve model chunk " + modelChunkId, e);
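
To make the reasoning concrete, here is a standalone Java sketch (my own illustration with hypothetical chunk counts; the once-only guard is also my addition, not part of the proposal) showing that the == check never fires when a chunk fails partway through, while <= does:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone illustration, not plugin code. Assume a hypothetical 20 chunks
// where chunks 0-17 succeed and chunk 18 fails: retrievedChunks is 18 when the
// error handler runs, so "== totalChunks - 1" (18 == 19) is false and
// onFailure is never called, while "<= totalChunks - 1" with a once-only
// guard reports the failure exactly once.
public class ChunkFailureDemo {
    public static void main(String[] args) {
        int totalChunks = 20;
        AtomicInteger retrievedChunks = new AtomicInteger(0);
        AtomicBoolean failureReported = new AtomicBoolean(false);

        for (int i = 0; i < totalChunks; i++) {
            if (i == 18) { // simulate the failed chunk
                boolean equalityFires = retrievedChunks.get() == totalChunks - 1;
                boolean lessOrEqualFires = retrievedChunks.get() <= totalChunks - 1
                        && failureReported.compareAndSet(false, true);
                System.out.println("== condition fires: " + equalityFires);    // false
                System.out.println("<= condition fires: " + lessOrEqualFires); // true
                break; // the remaining chunk never completes in this scenario
            }
            retrievedChunks.getAndIncrement(); // success path for chunks 0-17
        }
    }
}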

austintlee (Collaborator) commented:

@jlibx ^^
