Logger shows "Need to balance: True" but nothing happens #1

Closed
mattv8 opened this issue May 4, 2022 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@mattv8

mattv8 commented May 4, 2022

Is this expected behavior?

INFO | START ***Load-balancer!***
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update

I have two nodes that are already nearly balanced, so that could be the reason. See my screenshot below:
image

@cvk98
Owner

cvk98 commented May 4, 2022

It depends on many factors.
The balancer may find one migration option, but the VM may have a CD-ROM attached, or its disk may be on the node's local storage. In that case the balancer finds an option that would improve the balance but rejects it when it checks whether the migration is possible. Running in "DEBUG" mode will show more about what is happening.
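
One way to look at this check in isolation is sketched below. It assumes the dictionary shape returned by GET /nodes/{node}/qemu/{vmid}/migrate (keys local_disks and local_resources). The function name migration_blockers and the sample data are hypothetical and are not taken from plb.py.

```python
def migration_blockers(precheck):
    """Return human-readable reasons why a VM may not be migratable.

    `precheck` is assumed to be the "data" part of the response to
    GET /nodes/{node}/qemu/{vmid}/migrate.
    """
    reasons = []
    for disk in precheck.get("local_disks", []):
        # Each entry is assumed to carry a 'cdrom' flag and a 'volid'.
        kind = "CD-ROM" if disk.get("cdrom") else "disk"
        reasons.append(f"local {kind}: {disk.get('volid', '?')}")
    for resource in precheck.get("local_resources", []):
        reasons.append(f"local resource: {resource}")
    return reasons


# Hypothetical sample: one ISO attached from node-local storage.
sample = {
    "local_disks": [{"volid": "local:iso/installer.iso", "cdrom": 1}],
    "local_resources": [],
}
print(migration_blockers(sample))
```

An empty list means nothing local was reported, so a migration option found by the balancer would not be rejected for this reason.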

@cvk98
Owner

cvk98 commented May 5, 2022

I added a requirement to the README: all nodes must share common storage.

@cvk98 cvk98 closed this as completed May 11, 2022
@mattv8
Author

mattv8 commented May 16, 2022

Sorry for the delay. I do have common storage between all nodes. In fact, the nodes are all identical: same number of CPUs, same RAM and same storage. However, something strange is still happening. The algorithm sees that it needs to balance and finds an option, but the migration never happens and the algorithm gets stuck in an infinite loop:

root@PVE1:~# python3 ~/Proxmox-load-balancer/plb.py
INFO | START Load-balancer!
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 1
DEBUG | Starting vm_migration
DEBUG | VM:202 migration from PVE2 to "recipient"
DEBUG | The VM:202 has [{'is_tpmstate': 0, 'replicate': 1, 'cdrom': 0, 'volid': 'shared-zfs:vm-202-disk-1', 'drivename': 'efidisk0', 'is_unused': 0, 'is_vmstate': 0, 'size': 1048576, 'referenced_in_config': 1, 'shared': 0}, {'shared': 0, 'referenced_in_config': 1, 'size': 4194304, 'is_unused': 0, 'drivename': 'tpmstate0', 'is_vmstate': 0, 'volid': 'shared-zfs:vm-202-disk-2', 'cdrom': 0, 'is_tpmstate': 1, 'replicate': 1}]
INFO | Waiting 10 seconds for cluster information update
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True

What do you think is blocking it? This is Proxmox Virtual Environment 7.2-3 with the latest pull from this repo.

@cvk98 cvk98 reopened this May 17, 2022
@cvk98
Owner

cvk98 commented May 17, 2022

In theory:

  1. The script decides that the cluster is unbalanced.
  2. It goes through all the migration options and finds one that will improve the situation.
  3. It tries to migrate the selected VM, but cannot because of local VM resources: "The VM:202 has..."
  4. It decides again that the cluster is not balanced (for some reason VM:202 is no longer selected).
  5. BUT any other migration would increase sum_of_deviations, so sorted_variants ends up empty.

Handling this would need an additional algorithm that picks a bad (but not critical) option, after which the balancer would go back to working in its normal mode.
image
Such a cluster cannot be balanced through improvements alone. We need to make it worse first so that new options open up.
It's not difficult to implement, but I have nowhere to test it. Maybe I'll add it as an option.
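
The fallback described above could look roughly like the sketch below. It is not the code in plb.py: the load model (a dict of node loads), the function rank_variants, and the max_worsening limit are assumptions made for illustration, and sum_of_deviations is assumed here to be the sum of absolute deviations from the cluster mean.

```python
def sum_of_deviations(loads):
    """Sum of absolute deviations of node loads from the cluster mean."""
    values = list(loads.values())
    avg = sum(values) / len(values)
    return sum(abs(v - avg) for v in values)


def rank_variants(loads, vms, max_worsening=0.0):
    """Rank single-VM migrations by the deviation they would produce.

    loads: {node: load}; vms: [(vmid, node, load)].
    With max_worsening=0.0 only strict improvements are kept. A positive
    value also admits moves that make the balance slightly worse.
    """
    current = sum_of_deviations(loads)
    variants = []
    for vmid, source, vm_load in vms:
        for target in loads:
            if target == source:
                continue
            after = dict(loads)
            after[source] -= vm_load
            after[target] += vm_load
            score = sum_of_deviations(after)
            if score < current + max_worsening:
                variants.append((score, vmid, source, target))
    return sorted(variants)


# Hypothetical two-node cluster where every single move makes things worse.
loads = {"PVE1": 0.40, "PVE2": 0.50}
vms = [(101, "PVE1", 0.40), (201, "PVE2", 0.25), (203, "PVE2", 0.25)]

strict = rank_variants(loads, vms)                       # improvements only
relaxed = rank_variants(loads, vms, max_worsening=0.35)  # allow a bad step
print(len(strict), [v[1] for v in relaxed])
```

In this sample the strict ranking is empty, which matches the "Number of options = 0" loop in the log, while the relaxed ranking offers the two smaller VMs as "bad but not critical" moves. How large max_worsening should be is a tuning question that this sketch does not answer.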

@mattv8
Author

mattv8 commented May 17, 2022

Ah ha! Interesting, thanks for the explanation. I am sure this is somewhat difficult to test and implement since you must iteratively migrate and check, and migration takes time and compute resources.

I will look into the algorithm more when I have time, to see if I can contribute. For now, I need to find out why the API isn't starting the migration when it reaches the vm_migration() function. It's as if the API call isn't responding properly.

@cvk98
Owner

cvk98 commented May 18, 2022

pvesh get /nodes/PVE2/qemu/202/migrate - shows the local resources that prevent migration
pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1 - this is the CLI equivalent of the HTTP request that the script makes
If this command does not start the migration, the script will not be able to do it either.
The migration options are listed at this link, and you can change them in the script to suit your needs: https://pve.proxmox.com/pve-docs/api-viewer/#/nodes/{node}/qemu/{vmid}/migrate
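
For reference, the HTTP side of the same call can be sketched as below. This only builds the URL and form body and does not send anything. The function name build_migrate_request and the host pve1.example are hypothetical, and this is not how plb.py is necessarily written. target, online and with-local-disks are parameters listed for this endpoint in the API viewer; whether with-local-disks is appropriate depends on the storage setup and should be checked before use.

```python
from urllib.parse import urlencode


def build_migrate_request(host, node, vmid, target, online=True,
                          with_local_disks=False, port=8006):
    """Return (url, form_body) for POST /nodes/{node}/qemu/{vmid}/migrate."""
    url = f"https://{host}:{port}/api2/json/nodes/{node}/qemu/{vmid}/migrate"
    params = {"target": target, "online": int(online)}
    if with_local_disks:
        # Optional API flag for migrating disks that Proxmox reports as
        # local; verify it against your storage before enabling it.
        params["with-local-disks"] = 1
    return url, urlencode(params)


url, body = build_migrate_request("pve1.example", "PVE2", 202, "PVE1")
print(url)   # https://pve1.example:8006/api2/json/nodes/PVE2/qemu/202/migrate
print(body)  # target=PVE1&online=1
```

Sending the request additionally needs an authenticated session (ticket and CSRF token, or an API token), which is left out here.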

@cvk98
Owner

cvk98 commented May 18, 2022

Changes will need to be made in this block
image

@cvk98 cvk98 closed this as completed May 22, 2022
@cvk98
Owner

cvk98 commented May 22, 2022

I hope I was able to help you

@mattv8
Author

mattv8 commented May 23, 2022

Thank you, yes, very helpful! Fine to close this as it is not an issue. I'm still testing in my environment; I'll report back if I have any more issues.

@cvk98 cvk98 added the documentation Improvements or additions to documentation label Jan 31, 2023
@cvk98 cvk98 pinned this issue Jan 31, 2023