Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dns): Eliminate asynchronous timer in syncQuery() to prevent hang risk #11900

Merged
merged 26 commits into from
Nov 15, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
121c9df
Revert "fix(conf): set default value of `dns_no_sync` to `on` (#11869)"
chobits Nov 1, 2023
dcceee7
fix(dns): introduce the synchronous query in syncQuery() to prevent h…
chobits Nov 1, 2023
bccd2b5
fix(dns): using yield() instead of ngx.sleep(0)
chobits Nov 2, 2023
90e5c1d
chore(dns): restore fix_dns_enable_dns_no_sync.yml
chobits Nov 2, 2023
95247d6
fix(dns): remove calls to yield()/ngx.sleep(0)
chobits Nov 2, 2023
607591a
fix(dns): improve the logic of handling the unyieldable situation
chobits Nov 2, 2023
0e217a8
fix(dns): eliminate the asynchronous query in syncQuery
chobits Nov 3, 2023
0c14c6a
fix(dns): improve test case
chobits Nov 3, 2023
b8152a5
fix(dns): clear comment
chobits Nov 3, 2023
7879736
fix(dns): abstract common logic for the queue and item
chobits Nov 3, 2023
f201ef2
fix(dns): fix lua runtime error found by test caes
chobits Nov 3, 2023
1a407ac
chore(dns): rename queue_get_item
chobits Nov 7, 2023
dc2f773
fix(dns): add bug-checking logic
chobits Nov 8, 2023
2b7c659
fix(dns): avoid query item from being repeatedly released
chobits Nov 8, 2023
c0483f6
fix(dns): add 1s to resolve_max_wait
chobits Nov 8, 2023
68eb5ff
fix(dns): get error message to try_list in queue_get_query()
chobits Nov 8, 2023
c31bac6
fix(dns): add test case to test stale data timer
chobits Nov 14, 2023
e4919ea
chore(dns): fix comment typo
chobits Nov 14, 2023
a20df81
chore(dns): fix comment typo
chobits Nov 14, 2023
15fbf88
Delete changelog/unreleased/kong/fix_dns_enable_dns_no_sync.yml
chobits Nov 14, 2023
c09fb47
chore(dns): improve message in chagnelog
chobits Nov 14, 2023
9029c09
chore(dns): add new changelog
chobits Nov 14, 2023
40fdfcf
chore(dns): fix typo in changelog
chobits Nov 14, 2023
ce058f4
fix(dns): round sema:wait time to the nearest 1ms
chobits Nov 14, 2023
edd877d
fix(dns): reverse modification of sema:wait()
chobits Nov 14, 2023
0882b53
Update changelog/unreleased/kong/fix_dns_disable_dns_no_sync.yml
chobits Nov 14, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions changelog/unreleased/kong/fix_dns_blocking.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
message: Eliminate asynchronous timer in syncQuery() to prevent hang risk
type: bugfix
scope: Core
3 changes: 3 additions & 0 deletions changelog/unreleased/kong/fix_dns_disable_dns_no_sync.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
message: The default value of `dns_no_sync` option has been changed to `off`
type: feature
scope: Configuration
2 changes: 1 addition & 1 deletion kong.conf.default
Original file line number Diff line number Diff line change
Expand Up @@ -1550,7 +1550,7 @@

#dns_error_ttl = 1 # TTL in seconds for error responses.

#dns_no_sync = on # If enabled, then upon a cache-miss every
#dns_no_sync = off # If enabled, then upon a cache-miss every
chobits marked this conversation as resolved.
Show resolved Hide resolved
# request will trigger its own dns query.
# When disabled multiple requests for the
# same name/type will be synchronised to a
Expand Down
146 changes: 74 additions & 72 deletions kong/resty/dns/client.lua
Original file line number Diff line number Diff line change
Expand Up @@ -31,13 +31,13 @@ local time = ngx.now
local log = ngx.log
local ERR = ngx.ERR
local WARN = ngx.WARN
local ALERT = ngx.ALERT
local DEBUG = ngx.DEBUG
--[[
DEBUG = ngx.WARN
--]]
local PREFIX = "[dns-client] "
local timer_at = ngx.timer.at
local get_phase = ngx.get_phase

local math_min = math.min
local math_max = math.max
Expand Down Expand Up @@ -651,7 +651,9 @@ _M.init = function(options)

config = options -- store it in our module level global

resolve_max_wait = options.timeout / 1000 * options.retrans -- maximum time to wait for the dns resolver to hit its timeouts
-- maximum time to wait for the dns resolver to hit its timeouts
-- + 1s to ensure some delay in timer execution and semaphore return are accounted for
resolve_max_wait = options.timeout / 1000 * options.retrans + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could happen if we didn't include this +1 ?

Copy link
Member

@dndx dndx Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the comment above, without the +1 there is a risk semaphore might timeout before resolve actually happened. Because timer are eventual consistent system and we can not expect it to always execute in a timely manner.

Copy link
Contributor Author

@chobits chobits Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two scenerio where waiting time or timeout

  1. options.timeout: a timeout used for network IO of the DNS query
  2. resolve_max_wait: waiting for the completion of the DNS query.

If there is no difference of +1s, these two timeouts will be similar. Consequently, the second scenario may conclude before the first DNS query retrieves the result.


return true
end
Expand Down Expand Up @@ -742,46 +744,61 @@ local function individualQuery(qname, r_opts, try_list)
end

local queue = setmetatable({}, {__mode = "v"})

local function enqueue_query(key, qname, r_opts, try_list)
local item = {
key = key,
semaphore = semaphore(),
qname = qname,
r_opts = cycle_aware_deep_copy(r_opts),
try_list = try_list,
expire_time = time() + resolve_max_wait,
}
queue[key] = item
return item
end


local function dequeue_query(item)
if queue[item.key] == item then
-- query done, but by now many others might be waiting for our result.
-- 1) stop new ones from adding to our lock/semaphore
queue[item.key] = nil
-- 2) release all waiting threads
item.semaphore:post(math_max(item.semaphore:count() * -1, 1))
item.semaphore = nil
end
end


local function queue_get_query(key, try_list)
local item = queue[key]

if not item then
return nil
end

-- bug checks: release it actively if the waiting query queue is blocked
if item.expire_time < time() then
local err = "stale query, key:" .. key
add_status_to_try_list(try_list, err)
log(ALERT, PREFIX, err)
dequeue_query(item)
return nil
end

return item
end


-- to be called as a timer-callback, performs a query and returns the results
-- in the `item` table.
local function executeQuery(premature, item)
if premature then return end

local r, err = resolver:new(config)
if not r then
item.result, item.err = r, "failed to create a resolver: " .. err
else
--[[
log(DEBUG, PREFIX, "Query executing: ", item.qname, ":", item.r_opts.qtype, " ", fquery(item))
--]]
add_status_to_try_list(item.try_list, "querying")
item.result, item.err = r:query(item.qname, item.r_opts)
if item.result then
--[[
log(DEBUG, PREFIX, "Query answer: ", item.qname, ":", item.r_opts.qtype, " ", fquery(item),
" ", frecord(item.result))
--]]
parseAnswer(item.qname, item.r_opts.qtype, item.result, item.try_list)
--[[
log(DEBUG, PREFIX, "Query parsed answer: ", item.qname, ":", item.r_opts.qtype, " ", fquery(item),
" ", frecord(item.result))
else
log(DEBUG, PREFIX, "Query error: ", item.qname, ":", item.r_opts.qtype, " err=", tostring(err))
--]]
end
end
item.result, item.err = individualQuery(item.qname, item.r_opts, item.try_list)

-- query done, but by now many others might be waiting for our result.
-- 1) stop new ones from adding to our lock/semaphore
queue[item.key] = nil
-- 2) release all waiting threads
item.semaphore:post(math_max(item.semaphore:count() * -1, 1))
item.semaphore = nil
ngx.sleep(0)
-- 3) destroy the resolver -- ditto in individualQuery
if r then
r:destroy()
end
chobits marked this conversation as resolved.
Show resolved Hide resolved
dequeue_query(item)
end


Expand All @@ -795,7 +812,7 @@ end
-- the `semaphore` field will be removed). Upon error it returns `nil+error`.
local function asyncQuery(qname, r_opts, try_list)
local key = qname..":"..r_opts.qtype
local item = queue[key]
local item = queue_get_query(key, try_list)
if item then
--[[
log(DEBUG, PREFIX, "Query async (exists): ", key, " ", fquery(item))
Expand All @@ -804,14 +821,7 @@ local function asyncQuery(qname, r_opts, try_list)
return item -- already in progress, return existing query
end

item = {
key = key,
semaphore = semaphore(),
qname = qname,
r_opts = cycle_aware_deep_copy(r_opts),
try_list = try_list,
}
queue[key] = item
item = enqueue_query(key, qname, r_opts, try_list)

local ok, err = timer_at(0, executeQuery, item)
dndx marked this conversation as resolved.
Show resolved Hide resolved
if not ok then
Expand All @@ -837,40 +847,24 @@ end
-- @return `result + nil + try_list`, or `nil + err + try_list` in case of errors
local function syncQuery(qname, r_opts, try_list)
local key = qname..":"..r_opts.qtype
local item = queue[key]

-- if nothing is in progress, we start a new async query
local item = queue_get_query(key, try_list)

-- If nothing is in progress, we start a new sync query
if not item then
local err
item, err = asyncQuery(qname, r_opts, try_list)
if not item then
return item, err, try_list
end
else
add_status_to_try_list(try_list, "in progress (sync)")
end
item = enqueue_query(key, qname, r_opts, try_list)

local supported_semaphore_wait_phases = {
rewrite = true,
access = true,
content = true,
timer = true,
ssl_cert = true,
ssl_session_fetch = true,
}
item.result, item.err = individualQuery(qname, item.r_opts, try_list)

local ngx_phase = get_phase()
dequeue_query(item)

if not supported_semaphore_wait_phases[ngx_phase] then
-- phase not supported by `semaphore:wait`
-- return existing query (item)
--
-- this will avoid:
-- "dns lookup pool exceeded retries" (second try and subsequent retries)
-- "API disabled in the context of init_worker_by_lua" (first try)
return item, nil, try_list
return item.result, item.err, try_list
end

-- If the query is already in progress, we wait for it.

add_status_to_try_list(try_list, "in progress (sync)")

-- block and wait for the async query to complete
local ok, err = item.semaphore:wait(resolve_max_wait)
if ok and item.result then
Expand All @@ -883,6 +877,14 @@ local function syncQuery(qname, r_opts, try_list)
return item.result, item.err, try_list
end

-- bug checks
if not ok and not item.err then
item.err = err -- only first expired wait() reports error
log(ALERT, PREFIX, "semaphore:wait(", resolve_max_wait, ") failed: ", err,
", count: ", item.semaphore and item.semaphore:count(),
", qname: ", qname)
end

err = err or item.err or "unknown"
add_status_to_try_list(try_list, "error: "..err)

Expand Down
2 changes: 1 addition & 1 deletion kong/templates/kong_defaults.lua
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ dns_stale_ttl = 4
dns_cache_size = 10000
dns_not_found_ttl = 30
dns_error_ttl = 1
dns_no_sync = on
dns_no_sync = off

dedicated_config_processing = on
worker_consistency = eventual
Expand Down
22 changes: 16 additions & 6 deletions spec/01-unit/21-dns-client/02-client_spec.lua
Original file line number Diff line number Diff line change
Expand Up @@ -584,7 +584,10 @@ describe("[DNS client]", function()
}
}))
query_func = function(self, original_query_func, name, options)
ngx.sleep(5)
-- The first request uses syncQuery not waiting on the
-- aysncQuery timer, so the low-level r:query() could not sleep(5s),
-- it can only sleep(timeout).
ngx.sleep(math.min(timeout, 5))
return nil
end
local start_time = ngx.now()
Expand Down Expand Up @@ -1742,9 +1745,12 @@ describe("[DNS client]", function()
end)

it("timeout while waiting", function()

local timeout = 500
local ip = "1.4.2.3"
-- basically the local function _synchronized_query
assert(client.init({
timeout = 500,
timeout = timeout,
retrans = 1,
resolvConf = {
-- resolv.conf without `search` and `domain` options
Expand All @@ -1755,7 +1761,7 @@ describe("[DNS client]", function()
-- insert a stub thats waits and returns a fixed record
local name = TEST_DOMAIN
query_func = function()
local ip = "1.4.2.3"
local ip = ip
local entry = {
{
type = client.TYPE_A,
Expand All @@ -1767,7 +1773,9 @@ describe("[DNS client]", function()
touch = 0,
expire = gettime() + 10,
}
sleep(0.5) -- wait before we return the results
-- wait before we return the results
-- `+ 2` s ensures that the semaphore:wait() expires
sleep(timeout/1000 + 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm - how can we be sure that adding +2 will make semaphore:wait() expire? I'm wondering whether this will not introduce flakiness 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would it be possible to use some kind of waiter function pattern that would check against a predicate every interval ?

Copy link
Contributor Author

@chobits chobits Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would it be possible to use some kind of waiter function pattern that would check against a predicate every interval ?

If the waiting of the test case happens under two endpoint of C/S model, like that a client wait for server to complete some tasks, your advice is the suitable way to do. We need to be especially care of handling the synchronization issues between two different processes to prevent the flakiness.

But here we only run the DNS query and wait the result in one process, just testing dns library, not starting the kong gateway to do C/S testing. So if this test case fails, there must be some pressing problem happening in the DNS library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: it is a unit test, the chance of flakiness is much lower because client runs much faster than the entire Gateway

return entry
end

Expand Down Expand Up @@ -1797,10 +1805,12 @@ describe("[DNS client]", function()
ngx.thread.wait(coros[i]) -- this wait will resume the scheduled ones
end

-- all results are equal, as they all will wait for the first response
for i = 1, 10 do
-- results[1~9] are equal, as they all will wait for the first response
for i = 1, 9 do
assert.equal("timeout", results[i])
end
-- results[10] comes from synchronous DNS access of the first request
assert.equal(ip, results[10][1]["address"])
end)
end)

Expand Down
7 changes: 3 additions & 4 deletions t/03-dns-client/01-phases.t
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
use Test::Nginx::Socket;

plan tests => repeat_each() * (blocks() * 5);
plan tests => repeat_each() * (blocks() * 4 + 1);

workers(6);

Expand Down Expand Up @@ -59,8 +59,7 @@ qq {
GET /t
--- response_body
answers: nil
err: dns client error: 101 empty record received
--- no_error_log
err: nil
--- error_log
Tieske marked this conversation as resolved.
Show resolved Hide resolved
[error]
dns lookup pool exceeded retries
API disabled in the context of init_worker_by_lua
Loading
Loading