
[BUG] Model content hash can't match original hash value #2819

Open
jlibx opened this issue Aug 9, 2024 · 12 comments
Labels: bug Something isn't working

Comments


jlibx commented Aug 9, 2024

What is the bug?
opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils ] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager ] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager ] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService ] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. _register
     POST /_plugins/_ml/models/_register
     {
       "name": "cre_pt_v0_2_0_test2",
       "version": "0.2.0",
       "model_format": "TORCH_SCRIPT",
       "function_name": "TEXT_EMBEDDING",
       "description": "huggingface_cre_v0_2_0_snapshot_norm_pt model 2024.4.26",
       "url": "xxx.zip",
       "model_config": {
         "model_type": "bert",
         "embedding_dimension": 1024,
         "framework_type": "SENTENCE_TRANSFORMERS"
       },
       "model_content_hash_value": "197916cdbbeb40903393a3f74c215a6c4cb7e3201a2e0e826ef2b93728e4bf6b"
     }

  2. _deploy
    POST /_plugins/_ml/models/HdPgNZEBkGu7typLkQJX/_deploy

  3. result
    GET /_plugins/_ml/tasks/aYjwNZEBtLXkNmkPz_gQ

{
  "model_id": "HdPgNZEBkGu7typLkQJX",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "Tv4342EeTaydOgMRthFtrg"
  ],
  "create_time": 1723186859791,
  "last_update_time": 1723187876166,
  "error": """{"Tv4342EeTaydOgMRthFtrg":"model content changed"}""",
  "is_async": true
}

What is your host/environment?

  • OS: Linux (Docker on CentOS Linux release 7.9.2009, base image opensearchproject/opensearch:2.16.0)
  • Version: 2.16.0
  • Plugins: ml-commons

My hash value is correct: I computed it with the documented method, shasum -a 256 sentence-transformers_paraphrase-mpnet-base-v2-1.0.0-onnx.zip. The _register call works with no problem, but the error above occurs after _deploy.
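
For reference, a minimal standalone Java sketch (my own illustration, not the plugin's internal code path) that computes the same SHA-256 digest as shasum -a 256 over a model zip, so it can be compared with model_content_hash_value:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Minimal sketch: stream a model zip through SHA-256 and print the hex digest,
// equivalent to `shasum -a 256 <file>`.
public class ZipHash {
    public static void main(String[] args) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                digest.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}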

jlibx added the bug (Something isn't working) and untriaged labels on Aug 9, 2024

jlibx commented Aug 9, 2024

When an error occurs, could the zip file be retained to make comparison easier?
In addition, the network between the ML node and the data node is currently not very good. Could that affect the content of each retrieved chunk and ultimately lead to problems with the merged zip file?


jlibx commented Aug 11, 2024

private void retrieveModelChunks(MLModel mlModelMeta, ActionListener<File> listener) throws InterruptedException {
        String modelId = mlModelMeta.getModelId();
        String modelName = mlModelMeta.getName();
        Integer totalChunks = mlModelMeta.getTotalChunks();
        GetRequest getRequest = new GetRequest();
        getRequest.index(ML_MODEL_INDEX);
        getRequest.id();
        Semaphore semaphore = new Semaphore(1);
        AtomicBoolean stopNow = new AtomicBoolean(false);
        String modelZip = mlEngine.getDeployModelZipPath(modelId, modelName);
        ConcurrentLinkedDeque<File> chunkFiles = new ConcurrentLinkedDeque();
        AtomicInteger retrievedChunks = new AtomicInteger(0);
        for (int i = 0; i < totalChunks; i++) {
            semaphore.tryAcquire(10, TimeUnit.SECONDS);
            if (stopNow.get()) {
                throw new MLException("Failed to deploy model");
            }
            String modelChunkId = this.getModelChunkId(modelId, i);
            int currentChunk = i;
            this.getModel(modelChunkId, threadedActionListener(DEPLOY_THREAD_POOL, ActionListener.wrap(model -> {
                Path chunkPath = mlEngine.getDeployModelChunkPath(modelId, currentChunk);
                FileUtils.write(Base64.getDecoder().decode(model.getContent()), chunkPath.toString());
                chunkFiles.add(new File(chunkPath.toUri()));
                retrievedChunks.getAndIncrement();
                if (retrievedChunks.get() == totalChunks) {
                    File modelZipFile = new File(modelZip);
                    FileUtils.mergeFiles(chunkFiles, modelZipFile);
                    listener.onResponse(modelZipFile);
                }
                semaphore.release();
            }, e -> {
                stopNow.set(true);
                semaphore.release();
                log.error("Failed to retrieve model chunk " + modelChunkId, e);
                if (retrievedChunks.get() == totalChunks - 1) {
                    listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
                }
            })));
        }
    }

semaphore.tryAcquire(10, TimeUnit.SECONDS);

I think this line can cause chunk confusion when the network is poor: the return value of tryAcquire is ignored, so once the 10-second timeout expires the loop moves on and requests the next chunk while the previous download is still in flight, and chunkFiles then collects the chunks in completion order rather than index order.
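
As a standalone illustration of that behaviour (a sketch I wrote, not plugin code), the ignored return value means a timed-out wait does not stop the loop:

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// The single permit is held by a simulated slow download, so tryAcquire times
// out and returns false, yet the loop keeps issuing requests because the
// return value is never checked.
public class SemaphoreTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        Semaphore semaphore = new Semaphore(1);
        semaphore.acquire(); // simulate a chunk download that has not released the permit yet
        for (int i = 0; i < 3; i++) {
            boolean gotPermit = semaphore.tryAcquire(1, TimeUnit.SECONDS); // times out, returns false
            System.out.println("chunk " + i + " requested, permit acquired: " + gotPermit);
        }
    }
}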

austintlee (Collaborator) commented:

What operating system are you running this on?

Zhangxunmt (Collaborator) commented:

Can you share your exact operating system name? @jlibx


jlibx commented Aug 28, 2024

CentOS Linux release 7.9.2009 (Core)
But I run it in Docker; the base image is opensearchproject/opensearch:2.16.0.
Client: Docker Engine - Community
Version: 26.1.4
API version: 1.45
Go version: go1.21.11
Git commit: 5650f9b
Built: Wed Jun 5 11:32:04 2024
OS/Arch: linux/amd64
Context: default

Server: Docker Engine - Community
Engine:
Version: 26.1.4
API version: 1.45 (minimum version 1.24)
Go version: go1.21.11
Git commit: de5c9cf
Built: Wed Jun 5 11:31:02 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.33
GitCommit: d2d58213f83a351ca8f528a95fbd145f5654e957
nvidia:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0


jlibx commented Aug 28, 2024

[screenshot: model cache directory listing showing that chunk 18 is missing]


jlibx commented Sep 6, 2024

Are you not going to fix this bug? Model deployment now depends entirely on luck. Or should I submit a PR?

austintlee (Collaborator) commented:

In addition, the current network between the ml node and the data node is not very good

So, each of your nodes runs in a Docker container and I am guessing from your comment that maybe you have a separate ML node that runs on a host with a GPU and your data node runs on a different host?

Or should I submit a PR

If you have a fix that works, of course, please submit a PR.


jlibx commented Sep 9, 2024

Yes, my data nodes are on other hosts, and the ML node is on a new GPU machine. They are connected through a VPN tunnel, so the probability of a successful deployment is quite low.

ylwu-amzn (Collaborator) commented:

Thanks @jlibx for fixing this issue. Have you tested that the issue is gone with the fix?

austintlee (Collaborator) commented:

@ylwu-amzn I looked at his PR, but it does not explain what we are seeing above. In the screenshot, you can see that the model cache is missing a chunk (chunk id 18), which explains why the hash did not match.

I think the real problem might be in this section:

}, e -> {
    stopNow.set(true);
    semaphore.release();
    log.error("Failed to retrieve model chunk " + modelChunkId, e);
    if (retrievedChunks.get() == totalChunks - 1) {
        listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
    }
})));

More specifically, this line:

if (retrievedChunks.get() == totalChunks - 1) {

I think what we need is:

if (retrievedChunks.get() <= totalChunks - 1) {

When chunk 18 was not found, we sort of swallowed the exception and did not invoke .onFailure because, at that time, the equality condition was not met.

We will need a setup like the OP's, with bad connectivity between the ML nodes and data nodes, to reliably reproduce this problem and verify the fix, but to me that condition looks very suspicious.

If I am correct, when this problem happens, you should see this line in the log:

log.error("Failed to retrieve model chunk " + modelChunkId, e);
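
To make the reasoning concrete, here is a standalone Java sketch (my own illustration with hypothetical chunk counts; the once-only guard is also my addition, not part of the proposal) showing that the == check never fires when a chunk fails partway through, while <= does:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone illustration, not plugin code. Assume a hypothetical 20 chunks
// where chunks 0-17 succeed and chunk 18 fails: retrievedChunks is 18 when the
// error handler runs, so "== totalChunks - 1" (18 == 19) is false and
// onFailure is never called, while "<= totalChunks - 1" with a once-only
// guard reports the failure exactly once.
public class ChunkFailureDemo {
    public static void main(String[] args) {
        int totalChunks = 20;
        AtomicInteger retrievedChunks = new AtomicInteger(0);
        AtomicBoolean failureReported = new AtomicBoolean(false);

        for (int i = 0; i < totalChunks; i++) {
            if (i == 18) { // simulate the failed chunk
                boolean equalityFires = retrievedChunks.get() == totalChunks - 1;
                boolean lessOrEqualFires = retrievedChunks.get() <= totalChunks - 1
                        && failureReported.compareAndSet(false, true);
                System.out.println("== condition fires: " + equalityFires);    // false
                System.out.println("<= condition fires: " + lessOrEqualFires); // true
                break; // the remaining chunk never completes in this scenario
            }
            retrievedChunks.getAndIncrement(); // success path for chunks 0-17
        }
    }
}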

austintlee (Collaborator) commented:

@jlibx ^^
