
Fix: slow failover #146

Merged · 4 commits into bloomberg:main · Nov 9, 2023
Conversation

@678098 (Collaborator) commented Nov 7, 2023

No description provided.

@678098 requested a review from a team as a code owner November 7, 2023 14:42
@678098 force-pushed the 231107_slow_failover branch from 41c473c to 181c61d November 7, 2023 15:19
@678098 requested a review from dorjesinpo November 7, 2023 15:22
@@ -286,7 +286,7 @@ void RelayQueueEngine::onHandleConfiguredDispatched(
                        << downStreamParameters << ", but assuming success.";
     }

-    BALL_LOG_INFO_BLOCK
+    BALL_LOG_TRACE_BLOCK
     {
         mqbcmd::RoundRobinRouter outrr(d_allocator_p);
         context->d_routing_sp->loadInternals(&outrr);
678098 (Collaborator, Author):
This loadInternals might be called for each consumer in rapid succession and might degrade overall performance when we have thousands of consumers and are still adding new ones.
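For context, a minimal sketch of why the block's severity matters (hypothetical names, not the broker code; it assumes a ball logger manager is initialized elsewhere in the process). BALL_LOG_*_BLOCK macros evaluate their body only when the category threshold enables that severity, so with the block at TRACE the expensive dump is skipped entirely at a typical INFO threshold:

#include <ball_log.h>
#include <bsl_string.h>

// Hypothetical stand-in for the expensive mqbcmd::RoundRobinRouter /
// loadInternals work in the PR.
bsl::string buildRoutingDump()
{
    return bsl::string("...");
}

void logRoutingInternals()
{
    BALL_LOG_SET_CATEGORY("EXAMPLE.RELAYQUEUEENGINE");  // hypothetical name

    BALL_LOG_TRACE_BLOCK {
        // The block body, including the expensive dump below, executes only
        // when TRACE is enabled for this category; at a typical INFO
        // threshold nothing here runs.
        BALL_LOG_OUTPUT_STREAM << "Routing internals: " << buildRoutingDump();
    }
}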

Collaborator:

Could we make this a debug level instead?

Collaborator:

How about BALL_LOGTHROTTLE_INFO_BLOCK?

678098 (Collaborator, Author):

@hallfox changed trace->debug

678098 (Collaborator, Author):

@hallfox will use a throttle info block instead, so we still have something in the logs without changing the default severity.

static const int                k_MAX_INSTANT_MESSAGES = 10;
static const bsls::Types::Int64 k_NS_PER_MESSAGE =
    bdlt::TimeUnitRatio::k_NANOSECONDS_PER_SECOND;
BALL_LOGTHROTTLE_INFO(k_MAX_INSTANT_MESSAGES, k_NS_PER_MESSAGE)
678098 (Collaborator, Author):
Decided to use the in-place directive from ball and remove the member throttle d_throttledSubscriptionInfo to simplify the class.
This throttle is thread-safe: https://bloomberg.github.io/bde-resources/doxygen/bde_api_prod/group__ball__logthrottle.html#3.2
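As an illustration of the in-place directive (a sketch under assumptions, not the PR's final code; the function and category names are hypothetical), each BALL_LOGTHROTTLE_* macro keeps its own static, thread-safe throttle at the call site, so no bdlmt::Throttle member such as d_throttledSubscriptionInfo is needed:

#include <ball_log.h>
#include <ball_logthrottle.h>
#include <bdlt_timeunitratio.h>
#include <bsls_types.h>

void onConsumerRebuilt(int consumerCount)         // hypothetical call site
{
    BALL_LOG_SET_CATEGORY("EXAMPLE.THROTTLE");    // hypothetical category

    // Burst of at most 10 records, then on average one record per second.
    static const int                k_MAX_INSTANT_MESSAGES = 10;
    static const bsls::Types::Int64 k_NS_PER_MESSAGE =
        bdlt::TimeUnitRatio::k_NANOSECONDS_PER_SECOND;

    BALL_LOGTHROTTLE_INFO_BLOCK(k_MAX_INSTANT_MESSAGES, k_NS_PER_MESSAGE) {
        // Runs only for records that pass the category's severity threshold
        // and are admitted by the throttle; dropped records cost very little.
        BALL_LOG_OUTPUT_STREAM << "rebuilt highest priority consumers "
                               << "[count: " << consumerCount << "]";
    }
}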

678098 (Collaborator, Author):

Side effect: two places that previously shared a common throttle now have independent throttles. In my opinion this doesn't matter in this case.

@dorjesinpo (Collaborator) left a review:

I guess there was no BALL_LOGTHROTTLE_ when we needed it...
Looks good. A couple of minor things.

<< ", downstreamSubQueueId = " << downstreamSubId
<< ", upstreamId = " << upstreamId << "]";
}
static const int k_MAX_INSTANT_MESSAGES = 10;
Collaborator:
Maybe move k_MAX_INSTANT_MESSAGES and k_NS_PER_MESSAGE out into a single place? They are most likely the same for all occurrences.

// Maximum messages logged with throttling in a short period of time.

const bsls::Types::Int64 k_NS_PER_MESSAGE =
    bdlt::TimeUnitRatio::k_NANOSECONDS_PER_MINUTE / k_MAX_INSTANT_MESSAGES;
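One possible shape for this consolidation (a sketch, not the actual patch): hoist the two constants into the unnamed namespace of the .cpp so every throttled log site shares the same tuning values.

#include <bdlt_timeunitratio.h>
#include <bsls_types.h>

namespace {

// Maximum number of log records emitted in a short burst before throttling.
const int k_MAX_INSTANT_MESSAGES = 10;

// Sustained budget of one record per this many nanoseconds, i.e. roughly
// 10 throttled records per minute overall.
const bsls::Types::Int64 k_NS_PER_MESSAGE =
    bdlt::TimeUnitRatio::k_NANOSECONDS_PER_MINUTE / k_MAX_INSTANT_MESSAGES;

}  // close unnamed namespace

// Each throttled site then reuses the shared values:
//     BALL_LOGTHROTTLE_INFO(k_MAX_INSTANT_MESSAGES, k_NS_PER_MESSAGE) << ...;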
@678098 (Collaborator, Author) commented Nov 7, 2023:
Allow logging at most 10 records per minute. That should be enough for us.

Previously, each cycle of adding a new consumer took ~40 ms in total:

04NOV2023_14:09:16.926 (140284186806016) INFO mqbblp_relayqueueengine.cpp:824 For queue '...', rebuilt highest priority consumers: [count: 1598]
04NOV2023_14:09:16.885 (140284186806016) INFO mqbblp_relayqueueengine.cpp:824 For queue '...', rebuilt highest priority consumers: [count: 1597]
04NOV2023_14:09:16.842 (140284186806016) INFO mqbblp_relayqueueengine.cpp:824 For queue '...', rebuilt highest priority consumers: [count: 1596]
04NOV2023_14:09:16.802 (140284186806016) INFO mqbblp_relayqueueengine.cpp:824 For queue '...', rebuilt highest priority consumers: [count: 1595]
04NOV2023_14:09:16.761 (140284186806016) INFO mqbblp_relayqueueengine.cpp:824 For queue '...', rebuilt highest priority consumers: [count: 1594]
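A rough estimate from the timestamps above (my arithmetic, not a figure quoted in the PR): at ~40 ms per newly added consumer, reconfiguring on the order of 1,600 consumers spends about 1600 × 40 ms ≈ 64 s in this path alone, which is consistent with the slow failover the PR title describes.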

@678098 requested a review from dorjesinpo November 7, 2023 17:13
@678098 force-pushed the 231107_slow_failover branch from 3170366 to 1f4768f November 7, 2023 17:14
@dorjesinpo merged commit 2c8f8ca into bloomberg:main Nov 9, 2023
6 checks passed
@678098 deleted the 231107_slow_failover branch November 9, 2023 16:39
@678098 added the "bug (Something isn't working)" and "A-Broker (Area: C++ Broker)" labels Dec 7, 2023