redistribute RDY in high throughput, idle producer situations #277

Open
wants to merge 2 commits into
base: master

mscso commented Dec 5, 2019

When several producer nsqds are registered with an nsqlookupd but only some of them (often just one) are currently producing messages, the current flat max-in-flight distribution leaves the consumer with effectively fewer messages in flight than intended.

Consider a situation of 4 hosts / nsqds being used to produce messages on a topic - but only one of them is used at a time (various reasons). Consider a single consumer setting max-in-flight of 8. These are equally spread so each nsqd connection will have a RDY count of 2. Since at any point in time 3 of the 4 nsqds are idle / not producing on the topic, we effectively only ever have 2 messages in flight.
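To make the arithmetic concrete, here is a minimal sketch of the flat split described above. The function names `perConnRDY` and `effectiveInFlight` are made up for illustration and are not part of go-nsq; the sketch also ignores any remainder handling.

```go
package main

import "fmt"

// perConnRDY returns the RDY count each connection gets when maxInFlight
// is split evenly across conns connections (integer division; any
// remainder is ignored in this sketch).
func perConnRDY(maxInFlight, conns int) int {
	if conns <= 0 {
		return 0
	}
	return maxInFlight / conns
}

// effectiveInFlight returns the upper bound on messages in flight when
// only `producing` of the connections actually have messages to deliver.
func effectiveInFlight(maxInFlight, conns, producing int) int {
	if producing > conns {
		producing = conns
	}
	if producing < 0 {
		producing = 0
	}
	return perConnRDY(maxInFlight, conns) * producing
}

func main() {
	fmt.Println(perConnRDY(8, 4))           // 2 RDY per connection
	fmt.Println(effectiveInFlight(8, 4, 1)) // 2: only one nsqd is producing
	fmt.Println(effectiveInFlight(8, 4, 4)) // 8: all four are producing
}
```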

One workaround is to increase the max-in-flight drastically (multiply by nsqd count) but then we might have more messages in flight than our consumer wants if suddenly more than one nsqd is producing messages.

Because we constantly deal with this situation (automatically scheduled producer containers that move between hosts), we implemented a second RDY redistribution function that trades RDY count from an unused nsqd connection to a "busy" nsqd connection.

Since this might not be useful / wanted in every use case, the feature is only enabled with a config flag, RDYTrading.

The code is similar to the normal code in redistributeRDY for the max-in-flight < len(conns) situation but here it essentially deals with max-in-flight > len(producing_conns).
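As a rough illustration of the trading idea (not the code in this PR), the sketch below moves spare RDY from idle connections to busy ones while keeping the total unchanged. The `conn` type, the `busy` flag, the `keep` parameter, and the round-robin choice of target are all assumptions made for the example; it also does not cover handing RDY back when an idle connection becomes busy again.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for a consumer's connection to one nsqd.
type conn struct {
	addr string
	rdy  int
	busy bool // assumption: true if this nsqd delivered messages recently
}

// tradeRDY moves RDY from idle connections to busy ones, leaving `keep`
// RDY on every idle connection so it can still receive a message if its
// nsqd starts producing. The sum of all RDY counts stays the same.
func tradeRDY(conns []conn, keep int) {
	var busy []int
	for i, c := range conns {
		if c.busy {
			busy = append(busy, i)
		}
	}
	if len(busy) == 0 {
		return // nobody is producing, nothing to trade
	}
	next := 0
	for i := range conns {
		c := &conns[i]
		if c.busy || c.rdy <= keep {
			continue
		}
		spare := c.rdy - keep
		target := &conns[busy[next%len(busy)]] // round-robin over busy conns
		next++
		c.rdy -= spare
		target.rdy += spare
		fmt.Printf("moving %d RDY from %s to %s\n", spare, c.addr, target.addr)
	}
}

func main() {
	// Illustrative numbers only: three connections with 4 RDY each.
	conns := []conn{
		{addr: "10.13.2.51:4150", rdy: 4},
		{addr: "10.13.2.39:4150", rdy: 4},
		{addr: "10.13.2.85:4150", rdy: 4, busy: true},
	}
	tradeRDY(conns, 1)
	for _, c := range conns {
		fmt.Println(c.addr, c.rdy)
	}
}
```

With these made-up inputs each idle connection keeps 1 RDY and the busy one ends up with 10, so the consumer's total of 12 is preserved.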

Let me know what you think and whether this could be useful for others and thus whether you think it could be merged upstream.

NSQ2019/12/05 14:26:34 DBG 1 [foo/bar] looking for RDY trade possibilities...
NSQ2019/12/05 14:26:34 DBG 1 [foo/bar] - moving 3 RDY from 10.13.2.51:4150 to 10.13.2.85:4150
NSQ2019/12/05 14:26:34 DBG 1 [foo/bar] - moving 3 RDY from 10.13.2.39:4150 to 10.13.2.85:4150

[Images: before, after]

mscso (Author) commented Dec 5, 2019

And right after I open a PR I realize that I don't correctly trade back RDY when nsqd conns go from idle / unused to busy / used. Will amend.

ploxiln (Member) commented Dec 5, 2019

I think this kind of improvement would be welcome ... but historically the ready-count distribution has been the trickiest logic, when you add in backoff and such. If we can be sure that this new logic doesn't get "stuck" in an on or off state when it shouldn't, then we probably would not want a new config option. Ideally, the code works well, and there would be no need to "turn it off if it causes problems".

Although I'm currently one of the nsq maintainers, I'm not the biggest contributor to this particular repo, and don't have a lot of familiarity with this implementation of the ready-count logic 😅 so I can't promise very prompt review.

jehiah (Member) commented Jan 16, 2020

@mscso thanks for starting this discussion/effort; I appreciate many of the challenges you described around uneven message distribution as I've experienced them as well.

As @ploxiln mentioned, there is good appetite for improving this aspect of nsq / go-nsq, but an equal dose of caution because a one-size-fits-all solution has been elusive so far despite it being desired.

I have a few high level thoughts on how we might think of changing the paradigm to resolve this at a higher level:

  • In the case of an uneven distribution would pruning the presence of 'unused' topics help? I.e. when you have 3 of 4 nsqds that aren't getting messages on a topic, safely 'delete' the topic on 2 of those. Sort of a cluster "tidy" option. (it would be great if nsqd exposed how long it's been since the last message was received on a topic for this, and had a flag so that delete only applied if the topic and all channels were empty to make this easy/safe)
  • A related area that has me thinking about RDY distribution is to make it easier to influence a zone/region priority in a cloud environment; go-nsq gives you an ability to on/off that flow with nsq.DiscoveryFilter but perhaps we need to expose some interface for overriding the RDY distribution w/ other strategies. Perhaps this could be facilitated w/ a concept of priority for connections: if a connection provided a priority, nsqd could prefer higher priority connections when it has a message to send out, and fall back to lower priority connections only when needed. (i.e. when connecting to a same-zone source a consumer would set priority=20, when connecting to a same-region source it might set priority=10, and when connecting to another region it might set priority=0; then nsqd would prefer same-zone first, then same-region, then any available consumer, all w/o having specific knowledge of what might influence priority)
  • I wonder what information would make this logic easier; if nsqd had a way to signal its queue depth to consumers, the consumer could more effectively start at a max-in-flight of 1 and throttle up specific connections to a concurrency limit based on where messages are actually backed up.
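To illustrate the last idea, here is a hypothetical sketch of depth-aware allocation. It assumes the consumer somehow knows each nsqd's queue depth for the topic/channel, which nsqd does not currently signal to consumers over its connection; `allocateByDepth` is an invented name, not a go-nsq API.

```go
package main

import "fmt"

// allocateByDepth splits maxInFlight across connections. Each connection
// first gets a baseline RDY of 1 (while budget remains); the rest of the
// budget is shared in proportion to the reported queue depths, and any
// rounding leftovers go to the deepest queue. Depths are assumed to be
// non-negative. The returned counts never sum to more than maxInFlight.
func allocateByDepth(depths []int, maxInFlight int) []int {
	rdy := make([]int, len(depths))
	budget := maxInFlight
	for i := range rdy {
		if budget <= 0 {
			break
		}
		rdy[i] = 1
		budget--
	}
	total := 0
	for _, d := range depths {
		total += d
	}
	if total == 0 || budget <= 0 {
		return rdy // nothing backed up, or nothing left to hand out
	}
	remaining := budget
	deepest := 0
	for i, d := range depths {
		share := budget * d / total
		rdy[i] += share
		remaining -= share
		if d > depths[deepest] {
			deepest = i
		}
	}
	rdy[deepest] += remaining
	return rdy
}

func main() {
	// Four nsqds, only the last one has a backlog; max-in-flight of 8.
	fmt.Println(allocateByDepth([]int{0, 0, 0, 500}, 8)) // [1 1 1 5]
	// Two nsqds with uneven backlogs; max-in-flight of 10.
	fmt.Println(allocateByDepth([]int{100, 300}, 10)) // [3 7]
}
```

The baseline of 1 mirrors the "start at a max-in-flight of 1" part of the idea, so a quiet nsqd can still deliver a message when it starts producing.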

Do you feel any of these would better fit your needs and reduce complexities around max-in-flight settings?

cc: @mreiferson @ploxiln
