api,agent,server,engine-schema: scalability improvements #9840
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9840 +/- ##
============================================
- Coverage 15.78% 15.60% -0.19%
- Complexity 12564 12620 +56
============================================
Files 5627 5631 +4
Lines 492250 492861 +611
Branches 61405 62903 +1498
============================================
- Hits 77710 76911 -799
- Misses 406066 407450 +1384
- Partials 8474 8500 +26
Following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand
  1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
  2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
  3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write, which can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
  1. CapacityManager - listing service offerings - beneficial in host capacity calculation
  2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
  3. DownloadListener - hypervisors for zone - beneficial for host joins
  4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimized finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
  1. Optimized the use of the executor service
  2. Refactored Agent class to better handle connections.
  3. Do SSL handshakes within worker threads
  4. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which the SSL handshake times out at the MS end.
  5. On the agent side, backoff and SSL handshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals

Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Rohit Yadav <[email protected]>
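To make the expire-after-write behaviour tied to `dynamic.apichecker.cache.period` concrete, here is a minimal sketch that uses only the JDK instead of Caffeine. The class `ExpiringCache`, its `get` method, and the injectable clock are hypothetical names chosen for illustration; they are not taken from the PR, and the real implementation may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical expire-after-write cache; not the PR's Caffeine-based code. */
class ExpiringCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long writtenAtNanos;

        Entry(V value, long writtenAtNanos) {
            this.value = value;
            this.writtenAtNanos = writtenAtNanos;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long periodNanos;
    private final LongSupplier nanoClock; // injectable so expiry can be exercised without sleeping

    ExpiringCache(long periodSeconds, LongSupplier nanoClock) {
        this.periodNanos = periodSeconds * 1_000_000_000L;
        this.nanoClock = nanoClock;
    }

    /** Returns the cached value, or loads and stores it when missing or expired. */
    V get(K key, Function<K, V> loader) {
        if (periodNanos <= 0) {
            return loader.apply(key); // a period of 0 means no caching at all
        }
        long now = nanoClock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry != null && now - entry.writtenAtNanos < periodNanos) {
            return entry.value; // still within the period since it was written
        }
        V loaded = loader.apply(key);
        entries.put(key, new Entry<>(loaded, now));
        return loaded;
    }
}
```

In production code the clock would typically be `System::nanoTime`. Unlike Caffeine, this sketch does not bound the cache size or evict expired entries in the background.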
Honestly, I don't like PRs with thousands of lines doing thousands of things. They are hard to review and test. I encourage you to split this into several smaller PRs, each addressing one of the changes you are proposing.
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11441
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11486
@blueorangutan test
@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11738)
Signed-off-by: Abhishek Kumar <[email protected]>
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Great job @shwstppr
Overall LGTM.
It seems this will eliminate a large number of database queries and therefore improve performance considerably.
@@ -433,3 +433,9 @@ iscsi.session.cleanup.enabled=false

# Implicit host tags managed by agent.properties
# host.tags=

# Timeout(in seconds) for SSL handshake when agent connects to server
#ssl.handshake.timeout=
Since this line is commented out by default, can we add the default value?
If there is no default value, add a line explaining what happens in that case (unlimited?).
#ssl.handshake.timeout=

# Wait(in seconds) during agent reconnections
#backoff.seconds=
Same here: add the default value if one exists?
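As a sketch of what a default means for a property that is commented out, the helper below falls back to a supplied value when the key is absent, blank, or not a number. `AgentProps` and `intProperty` are hypothetical names, not the agent's real property handling. The fallback of 5 for `backoff.seconds` in the usage note is the default mentioned later in this thread; no default for `ssl.handshake.timeout` is stated here, so none is assumed.

```java
import java.util.Properties;

/** Hypothetical helper; the agent's actual property handling may differ. */
class AgentProps {
    /** Returns the integer value of key, or fallback when the key is absent, blank, or not a number. */
    static int intProperty(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return fallback; // commented-out or empty property
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback; // malformed value: keep the default rather than fail
        }
    }
}
```

For example, `AgentProps.intProperty(props, "backoff.seconds", 5)` returns 5 while the line stays commented out in agent.properties.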
InetAddress addr;
protected String retrieveHostname() {
    if (logger.isTraceEnabled()) {
        logger.trace(" Retrieving hostname " + serverResource.getClass().getSimpleName());
- logger.trace(" Retrieving hostname " + serverResource.getClass().getSimpleName());
+ logger.trace(" Retrieving hostname with resource=" + serverResource.getClass().getSimpleName());
done
if (logger.isTraceEnabled()) {
    logger.trace(" Retrieving hostname " + serverResource.getClass().getSimpleName());
}
final String result = Script.runSimpleBashScript(Script.getExecutableAbsolutePath("hostname"), 500);
`Script.getExecutableAbsolutePath` searches all locations in PATH and, if the executable is not found, returns the string as it is. It should still allow execution in those cases?
        }
    } catch (final ClassNotFoundException e) {
        logger.error("Unable to find this request ");
    } catch (final Exception e) {
        logger.error("Error parsing task", e);
    }
} else if (task.getType() == Task.Type.DISCONNECT) {
    try {
Any reason this is removed? @shwstppr
Removed this because the reconnect method has been refactored to wait for `backoff.seconds` (default = 5 sec) before connecting to the next host.
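The wait-then-try-next-host behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the refactored `Agent` code: `ReconnectSketch`, `connectWithBackoff`, and the injected `tryConnect` and `pauseMillis` callbacks are hypothetical names.

```java
import java.util.List;
import java.util.function.LongConsumer;
import java.util.function.Predicate;

/** Hypothetical reconnect loop; the PR's Agent class is not reproduced here. */
class ReconnectSketch {
    /**
     * Tries each host in order and pauses for the backoff interval after every failed attempt.
     * Returns the first host that accepted the connection, or null if none did.
     */
    static String connectWithBackoff(List<String> hosts, Predicate<String> tryConnect,
                                     int backoffSeconds, LongConsumer pauseMillis) {
        for (String host : hosts) {
            if (tryConnect.test(host)) {
                return host;
            }
            // Wait before moving on so a busy management server is not hammered.
            pauseMillis.accept(backoffSeconds * 1000L);
        }
        return null;
    }
}
```

The pause is injected so the loop can be exercised without real sleeping; a caller would normally pass a wrapper around `Thread.sleep` that handles `InterruptedException`.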
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    l.add(new Pair<Long, Integer>(rs.getLong(1), rs.getInt(2)));
    poolCount = rs.getLong(1);
The SQL statements (SHARED_STORAGE_POOL_HOST_INFO : STORAGE_POOL_HOST_INFO) return storage_pool_host_ref.id, not the count of pools or hosts. Is that intended?
Since the method is only used to find out whether there is a connected storage pool, and both SQL statements list IDs from storage_pool_host_ref, this should be okay.
Alternatively, we could add those IDs to a list and check that the list is not empty, but that would return the same result.

@Override
public void connectHostsToPool(DataStore primaryStore, List<Long> hostIds, Scope scope,
        boolean handleExceptionsPartially, boolean errorOnNoUpHost) throws CloudRuntimeException {
Something like this?
if (hostIds.size() == 0) {
    return;
}
Added check
@@ -594,7 +594,7 @@ public String getConfigComponentName() {

     @Override
     public ConfigKey<?>[] getConfigKeys() {
-        return new ConfigKey<?>[] {RoleService.EnableDynamicApiChecker};
+        return new ConfigKey<?>[] {RoleService.EnableDynamicApiChecker, RoleService.DynamicApiCheckerCachePeriod};
does this work ?
return new ConfigKey<?>[] {EnableDynamicApiChecker, DynamicApiCheckerCachePeriod};
done
tools/marvin/setup.py
@@ -27,7 +27,7 @@
     raise RuntimeError("python setuptools is required to build Marvin")


-VERSION = "4.20.0.0-SNAPSHOT"
+VERSION = "4.20.0.0"
This change seems to have been caused by the Maven build.
It may not be needed.
done
@@ -190,9 +209,25 @@ public Boolean call() throws NioConnectionException {

    abstract void unregisterLink(InetSocketAddress saddr);

    protected boolean rejectConnectionIfBusy(final SocketChannel socketChannel) throws IOException {
        if (activeAcceptConnections.get() < sslHandshakeMaxWorkers) {
Will rejecting and dropping the connection cause any issues?
Is there any wait-and-retry mechanism?
Wait and retry is present on the client side (agent).
(Although in a multi-management-server setup, the agent may end up connecting to a different MS in this case.)
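The reject-when-busy idea discussed here can be sketched as a small admission gate around a counter of in-flight handshakes. This is only an illustration: `HandshakeGate`, `tryAcquire`, and `release` are hypothetical names, and the PR's `rejectConnectionIfBusy` may track and close connections differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical admission gate; mirrors the idea of rejectConnectionIfBusy, not its code. */
class HandshakeGate {
    private final AtomicInteger active = new AtomicInteger();
    private final int maxWorkers;

    HandshakeGate(int maxWorkers) {
        this.maxWorkers = maxWorkers;
    }

    /** Reserves a handshake slot; returns false when all slots are taken. */
    boolean tryAcquire() {
        while (true) {
            int current = active.get();
            if (current >= maxWorkers) {
                return false; // caller closes the socket; the agent retries after its backoff
            }
            if (active.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Frees a slot once the handshake finishes or fails. */
    void release() {
        active.decrementAndGet();
    }
}
```

The compare-and-set loop keeps the check and the increment atomic, so two acceptor threads cannot both take the last slot. A caller would pair every successful `tryAcquire` with a `release` in a `finally` block.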
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11496
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
Thanks @weizhouapache for the review. I've made the changes as per your suggestions and responded to the queries.
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11505
@blueorangutan test
@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
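Description
Following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand
  1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
  2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
  3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write, which can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
  1. CapacityManager - listing service offerings - beneficial in host capacity calculation
  2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
  3. DownloadListener - hypervisors for zone - beneficial for host joins
  4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimized finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
  1. Optimized the use of the executor service
  2. Refactored Agent class to better handle connections.
  3. Do SSL handshakes within worker threads
  4. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which the SSL handshake times out at the MS end.
  5. On the agent side, backoff and SSL handshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals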