[BUG] Sync up job not working as expected when upgrading cluster from prev versions to 2.17 #3238

rbhavna · 2024-11-26T17:57:13Z

What is the bug?
In 2.18, the model auto-redeploy feature was removed for remote models in this PR, meaning that after a blue/green (b/g) deployment or a node/cluster restart, remote models won't be automatically redeployed to new nodes.

When the same change was back-ported to AOS 2.17, a bug was identified: the ML sync-up job stops running automatically when a cluster is upgraded from 2.15 or any older version to 2.17 on AOS. This condition returns directly if the model encountered by the auto redeployer is a remote model, so the sync-up job never gets started.
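
The failure mode described above can be illustrated with a minimal Python sketch. This is not the ml-commons code (the plugin is written in Java). `Model`, `redeploy_buggy`, `redeploy_skipping_remote`, and `start_sync_up_job` are hypothetical names, and the sketch assumes the sync-up job is started only once the redeploy pass has finished. The second function shows one possible correction, not necessarily the fix that was merged.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Model:
    """Hypothetical stand-in for a model record seen by the auto redeployer."""
    model_id: str
    is_remote: bool


def redeploy_buggy(models: List[Model], start_sync_up_job: Callable[[], None]) -> List[str]:
    """Bug pattern: returning on a remote model skips starting the sync-up job."""
    redeployed = []
    for model in models:
        if model.is_remote:
            return redeployed  # exits before start_sync_up_job() is reached
        redeployed.append(model.model_id)
    start_sync_up_job()
    return redeployed


def redeploy_skipping_remote(models: List[Model], start_sync_up_job: Callable[[], None]) -> List[str]:
    """Possible correction: skip remote models, then always start the sync-up job."""
    redeployed = []
    for model in models:
        if model.is_remote:
            continue  # remote models are not redeployed, but the pass goes on
        redeployed.append(model.model_id)
    start_sync_up_job()
    return redeployed
```

With one remote model in the list, the first function never invokes the callback, which matches the reported symptom of a sync-up job that does not start after the upgrade.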

The sync-up job only starts again after manual intervention: changing the sync-up interval setting to a value different from the previous one (the default is 10 seconds).

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.sync_up_job_interval_in_seconds": 5
  }
}
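
Because the report says the job restarts only when the new value differs from the previous one, a small guard can help avoid sending an unchanged value. The following Python sketch only builds the request body for `PUT /_cluster/settings`. The helper name `build_interval_update` is hypothetical, and sending the request is left to whichever HTTP client is in use.

```python
import json

# Setting name taken from the workaround request in this report.
SETTING = "plugins.ml_commons.sync_up_job_interval_in_seconds"


def build_interval_update(current_seconds: int, new_seconds: int) -> str:
    """Return the JSON body for PUT /_cluster/settings.

    Raises ValueError when the new interval equals the current one, since
    per the report an unchanged value does not restart the sync-up job.
    """
    if new_seconds == current_seconds:
        raise ValueError("new interval must differ from the current interval")
    return json.dumps({"persistent": {SETTING: new_seconds}})
```

For example, `build_interval_update(10, 5)` produces the same body as the request shown above.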

How can one reproduce the bug?
Steps to reproduce the behavior:

1. On AOS, create a cluster on version 2.15. Verify the sync-up job interval by getting the setting plugins.ml_commons.sync_up_job_interval_in_seconds.
2. Check the logs to see the ML sync-up logs at the interval specified by that setting.
3. Upgrade the cluster to version 2.17.
4. Check the logs again; the sync-up logs are now missing.
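
The log check in the steps above can be automated once the timestamps of the sync-up log entries have been extracted. The sketch below, in Python, assumes those timestamps are already parsed into `datetime` objects (the log format is not given in this report, so parsing is out of scope). `sync_up_stalled` and its `tolerance` parameter are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import List


def sync_up_stalled(
    entry_times: List[datetime],
    now: datetime,
    interval_seconds: int,
    tolerance: int = 3,
) -> bool:
    """Return True if no sync-up entry was logged within tolerance * interval.

    entry_times: timestamps of sync-up log entries (already parsed).
    tolerance: number of missed intervals allowed before reporting a stall.
    """
    if not entry_times:
        return True
    allowed_gap = timedelta(seconds=interval_seconds * tolerance)
    return now - max(entry_times) > allowed_gap
```

A tolerance of a few intervals avoids false alarms from a single delayed run, at the cost of detecting a real stall slightly later.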

What is the expected behavior?
The sync-up job should run automatically when a cluster is upgraded to version 2.17.

rbhavna added the bug and untriaged labels Nov 26, 2024

rbhavna self-assigned this Nov 26, 2024

rbhavna added the v2.19.0 label and removed the untriaged label Nov 26, 2024

rbhavna mentioned this issue Nov 27, 2024

fix for sync up job not working in 2.17 when upgraded from previous versions #3241

Merged

5 tasks

dhrubo-os added this to ml-commons projects Dec 3, 2024

dhrubo-os moved this to In Progress in ml-commons projects Dec 3, 2024

Zhangxunmt closed this as completed in #3241 Dec 12, 2024

github-project-automation bot moved this from In Progress to Done in ml-commons projects Dec 12, 2024
