Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move scheduling logic to the mlpop lua command #14

Open
wants to merge 6 commits into
base: skroutz
Choose a base branch
from

Conversation

ctrochalakis
Copy link
Member

@ctrochalakis ctrochalakis commented Feb 14, 2021

Resque finds jobs by iterating on a list of queues, for each queue it
tries to lpop a job until successful.

This patch moves that logic to the server by implementing mlpop,
that approach reduces the required round-trips especially for workers
watching a large set of queues.

The API is not altered, we keep the same signature while accepting
variadic arguments.

lib/resque/data_store.rb Outdated Show resolved Hide resolved
lib/resque/data_store.rb Outdated Show resolved Hide resolved
lib/resque.rb Show resolved Hide resolved
# Pops a job off a set of queues. Queue names should be strings.
#
# Returns an array of [queue, Ruby object] or nil.
def pop_with_queue(*queues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a newly introduced method, we could use the plural in the name. We could also use from instead of with.

Suggested change
def pop_with_queue(*queues)
def pop_from_queues(*queues)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's different with pop_with_queue() is that it returns a pair of [queue, job] instead of just the job like pop() does, thus the _with_queue suffix. It's not that clear though.

end

def pop_from_queue(*queues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] We could use the plural here for queues and implement the pop_from_queue calling this one to avoid breaking the api.

Suggested change
def pop_from_queue(*queues)
def pop_from_queues(*queues)
...
end
def pop_from_queue(queue)
pop_from_queues(queue)
end

lib/resque.rb Outdated Show resolved Hide resolved
# Pop whatever is on queue
def pop_from_queue(queue)
@redis.lpop(redis_key_for_queue(queue))
# MLPOP key [key ..]
Copy link
Member

@cvrac cvrac Feb 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[super minor] missing a . in [κey ..]?

Copy link
Member

@avgerin0s avgerin0s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall nice approach, have we benchmarked this against the multiple lpops from the client?

@@ -271,11 +271,23 @@ def push(queue, item)
data_store.push_to_queue(queue,encode(item))
end

# Pops a job off a queue. Queue name should be a string.
# Pops a job off a set of queues. Queue names should be a string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Queue names should be a string. That comment seems untrue, your test case below uses symbols, so let's drop it for sanity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I have adjusted the original comment. We could change it.

# is restarted) reload it.
if e.message.include?("NOSCRIPT")
mlpop_sha!
retry
Copy link
Member

@avgerin0s avgerin0s Feb 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some backoff here, failovers take some time to be completed.

Also a timeout (and/or a fallback to the old scheduling) would be nice to avoid busy loops in case of a fatal error (e.g. all redises are dead) here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backoff logic in case of a server restart is the responsibility of the redis client.

As far as mlpop is concerned, it gets a NOSCRIPT error and tries to reload the script. If there is a connectivity issue (e.g. a server restart) the client should kick in, backoff and possibly retry the command internally.

We could add a hidden retry argument that decrements itself to guard against busy loops, by I am sure it's needed in this case.

# Pops a job off a set of queues. Queue names should be strings.
#
# Returns an array of [queue, Ruby object] or nil.
def pop_with_queue(*queues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Do we pass the queues list ordered from resquepool here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, keeping the queue list order is critical.

@ctrochalakis
Copy link
Member Author

ctrochalakis commented Feb 16, 2021

Overall nice approach, have we benchmarked this against the multiple lpops from the client?

It's not so easy to get exact metrics, it depends on which queues are non-empty at that specific moment. The expected speedup is also a function of the roundtrip between the worker and the server, for our case the gain would be around 0.7ms for each worker poll.

A test run in our own environment (switching lpop with llen)

require 'redis'
require 'benchmark'

$redis = Redis.new REDACTED
$queues = %w/interactive high medium low tmp/

class Redis
  def mlpop_sha
    @mlpop_sha ||= script(:load, <<-eos
      for i, q in ipairs(KEYS) do
        local llen = redis.call("llen", q)
        if llen ~= 0 then
            -- return the zero-index position
            return {i-1, llen}
        end
      end
      return false
      eos
    )
  end

  def mlpop(*queues)
    evalsha(mlpop_sha, keys: queues)
  end
end

$redis.mlpop_sha # load the script

n = 10000
Benchmark.bmbm do |x|
  x.report("mlpop:") {
    n.times { $redis.mlpop($queues) }
  }
  x.report("lpop:") {
    n.times {
      $queues.any? { |q| !$redis.llen(q).zero? }
    }
  }
end

% bundle exec ruby benchmark.rb
Rehearsal ------------------------------------------
mlpop:   0.523443   0.188144   0.711587 (  2.093777)
lpop:    1.654880   0.837122   2.492002 (  8.621473)
--------------------------------- total: 3.203589sec

             user     system      total        real
mlpop:   0.348903   0.148945   0.497848 (  1.844264)
lpop:    1.325281   0.745117   2.070398 (  8.331645)

ctrochalakis and others added 5 commits February 19, 2021 09:50
Resque finds jobs by iterating on a list of queues, for each queue it
tries to lpop a job until successful.

This patch moves that logic to the server by implementing mlpop,
that approach reduces the required round-trips especially for workers
watching a large set of queues.

The API is not altered, we keep the same signature while accepting
variadic arguments.
Co-authored-by: Lazarus Lazaridis <[email protected]>
@ctrochalakis
Copy link
Member Author

Travis builds are working now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants