Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-31290 Fix Sasha Thor QMon switching issues #18302

Merged

Conversation

jakesmith
Copy link
Member

@jakesmith jakesmith commented Feb 15, 2024

Automatic and manual queue switching by the Sasha QMon service have not worked since 7.12.
Changes to the format of the queued thor job items meant it did not find any queued items to swap.
This meant that workunits submitted with 'allowedclusters' and 'allowautoqueueswitch', that should have automatically
switched to an idle Thor queue in the 'allowedclusters' set when the queue they were submitted to was busy, did not.

Also fix the qmon tracing, which was supposed to trace the current workunits in flight running on Thor instances. That appears not to have worked well before 7.12.

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copy link

@jakesmith
Copy link
Member Author

@ghalliday - even though this affects 7.12 onwards, targetting master (to go into 9.6), because it is not a trivial fix, and this only came to light, after a failed attempt to use it in cloud.
i.e. the feature would appear not to have been missed all this time. However, setups and queries continue to provide the options that are supposed to invoke it, and the expectation was it was working. That is, moving items from one queue to a less busy queue automatically.
There is a separate JIRA for the missing functionality in cloud: https://track.hpccsystems.com/browse/HPCC-31291

@jakesmith jakesmith force-pushed the HPCC-31290-sasha-qmon-baremetal branch from 120a3ab to b4e2009 Compare February 16, 2024 12:26
Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions/comments. Looks like a sensible set of fixes.

xpath.appendf("Server[@queue=\"%s.thor\"]/WorkUnit",qname);
Owned<IPropertyTreeIterator> iter = conn->queryRoot()->getElements(xpath.str());
getClusterThorQueueName(thorQName, qname);
Owned<IPropertyTreeIterator> iter = conn->queryRoot()->getElements("Server[@queue]");
Copy link
Member

@ghalliday ghalliday Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@queue looks a bit strange. Does that check that it has an entry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's a qualifier without a result. Returns all "Server" nodes that have a @Queue attribute.

queueList.appendList(queues, ",");
if (!queueList.contains(thorQName))
continue;
wuids.append(server.queryProp("WorkUnit"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial: wuids.append(wuid);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change.

@@ -56,6 +56,7 @@ bool switchWorkunitQueue(const char *wuid, const char *cluster)
Owned<IWorkUnit> wu = factory->updateWorkUnit(wuid);
if (!wu)
return false;
return wu->switchThorQueue(cluster, &switcher);
VStringBuffer item("*/%s/*", wuid);
return wu->switchThorQueue(cluster, &switcher, item);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should pass nullptr as item - otherwise it will only match 1 item (although without parallel workflow I doubt there is ever >1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's a mistake, will change.

@jakesmith jakesmith requested a review from ghalliday February 16, 2024 14:14
@jakesmith
Copy link
Member Author

@ghalliday - please see review changes.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash

@jakesmith jakesmith force-pushed the HPCC-31290-sasha-qmon-baremetal branch from 3f05fd3 to 336d003 Compare February 19, 2024 12:48
@jakesmith
Copy link
Member Author

@ghalliday - squashed.

@ghalliday
Copy link
Member

@jakesmith I am going to be annoying(!)
Please can you update the commit message so that a user will understand what has been fixed just from the log entry.

Automatic and manual queue switching by the Sasha QMon service
have not worked since 7.12.
Changes to the format of the queued thor job items meant it did
not find any queued items to swap.
This meant that workunits submitted with 'allowedclusters'
and 'allowautoqueueswitch', that should have automatically
switched to an idle Thor queue in the 'allowedclusters' set
when the queue they were submitted to was busy, did not.

Also fix the qmon tracing, which was supposed to trace
the current workunits in flight running on Thor instances.
That appears not to have worked well before 7.12.

Signed-off-by: Jake Smith <[email protected]>
@jakesmith jakesmith force-pushed the HPCC-31290-sasha-qmon-baremetal branch from 336d003 to a5baecd Compare February 19, 2024 14:57
@jakesmith jakesmith changed the title HPCC-31290 Sasha QMon fixes. HPCC-31290 Fix Sasha Thor QMon switching issues Feb 19, 2024
@jakesmith
Copy link
Member Author

@jakesmith I am going to be annoying(!) Please can you update the commit message so that a user will understand what has been fixed just from the log entry.

@ghalliday - revised commit title + message.

@ghalliday
Copy link
Member

Still not 100% convinced, but good enough to merge!

@ghalliday ghalliday merged commit 28c167f into hpcc-systems:master Feb 19, 2024
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants