Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce likelihood of race conditions on keep-alive timeout calculatio… #52653

Closed
wants to merge 1 commit into from

Conversation

zanettea
Copy link
Contributor

Added 1 seconds threshold in keepalive timeout client-side
Added 1 second threshold in keepalive timeout server-side (expire the socket timeout 1 sec after the announced timeout)

Probably better to use a configurable threshold like in undici keepAliveTimeoutThreshold (nodejs/undici#291)

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/http
  • @nodejs/net

@nodejs-github-bot nodejs-github-bot added http Issues or PRs related to the http subsystem. needs-ci PRs that need a full CI run. labels Apr 23, 2024
Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mcollina mcollina added the request-ci Add this label to start a Jenkins CI on a PR. label Apr 23, 2024
@mcollina
Copy link
Member

ping @mweberxyz

@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 23, 2024
@nodejs-github-bot
Copy link
Collaborator

// Let the timer expires before the announced timeout to reduce
// the likelihood of ECONNRESET errors
let serverHintTimeout = ( NumberParseInt(hint) * 1000 ) - 1000;
serverHintTimeout = serverHintTimeout > 0 ? serverHintTimeout : 0;

if (serverHintTimeout < agentTimeout) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional needs to be fixed - agentTimeout defaults to 0, so the serverHintTimeout is never being set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the socket shouldn't be reused at all. Is setting socket.setTimeout(0) the right way to do it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at socket docs - socket.setTimeout(0) doesn't mean immediate, it means never.

Copy link
Contributor Author

@zanettea zanettea Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have limited understanding of the network internals. In undici ( https://github.com/nodejs/undici/pull/291/files ) if the keepAliveTimeout goes down to 0 they flag the connection as reset:

if (!keepAliveTimeout || keepAliveTimeout < 1e3) {
          client[kReset] = true
        } 

I don't know how to achieve the same result here. Maybe:

socket.setKeepAlive(false);

?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have a limited understanding of network internals, but I think:

if (!agentTimeout || serverHintTimeout < agentTimeout) {

is what you want.

socket.setTimeout(server.keepAliveTimeout);
// Increase the internal timeout wrt the advertised value to reduce likeliwood of ECONNRESET errors
// due to race conditions between the client and server timeout calculation
socket.setTimeout(server.keepAliveTimeout + 1000);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this server change is necessary.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be the cause of all the test failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix protects clients that are using the hint timeout as is (current node impl) without adjusting it for network (or cpu load) delays

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If server.keepAliveTimeout is set to 0 (never time out) this will change it to 1 second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If server.keepAliveTimeout == 0 then the if condition line1012 is false. The timeout on the socket is set only if server.keepAliveTimeout != 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, sorry I missed that.

const serverHintTimeout = NumberParseInt(hint) * 1000;
// Let the timer expires before the announced timeout to reduce
// the likelihood of ECONNRESET errors
let serverHintTimeout = ( NumberParseInt(hint) * 1000 ) - 1000;
Copy link

@mweberxyz mweberxyz Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case the server responds with a 1 second keepalive, this will set the disable the timeout because serverHintTimeout will be 0 -- maybe go with - 500 in place of - 1000 ?

serverHintTimeout = serverHintTimeout > 0 ? serverHintTimeout : 0;
if (serverHintTimeout === 0) {
// cannot safely reuse the socket because the server timeout is too short
canKeepSocketAlive = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the PR to just skip keep alive if the timeout is too short. In my local build the test now pass.

lib/_http_agent.js Outdated Show resolved Hide resolved
@meyfa

This comment has been minimized.

@zanettea zanettea changed the title Reduce likeliwood of race conditions on keep-alive timeout calculatio… Reduce likelihood of race conditions on keep-alive timeout calculatio… Apr 24, 2024
@zanettea zanettea force-pushed the main branch 2 times, most recently from 8407fd7 to bc34680 Compare April 24, 2024 10:12
@mweberxyz
Copy link

Found the same issue in dotnet -- they went with a 1 second offset as well, so that was a good choice. 👍

@zanettea zanettea force-pushed the main branch 3 times, most recently from ba0d895 to 2124b55 Compare April 24, 2024 15:30
Copy link

@mweberxyz mweberxyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! I verified that the test cases in #52649 pass without error.

I don't have Approve permission - lgtm @mcollina

Copy link
Contributor

@ShogunPanda ShogunPanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

reduce likelihood of race conditions on keep-alive timeout
calculation between http1.1 servers and clients and honor server
keep-alive timeout when agentTimeout is not set

Fixes: nodejs#47130
Fixes: nodejs#52649
Copy link
Contributor

@fahrradflucht fahrradflucht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it looks like this should be a testable behavior change.

I had a short look and I guess if test-http-client-keep-alive-hint.js would be a better test it should actually fail now, because the case it tests now behaves quite different (there is no keep-alive because the hint is just 1 second).

I think a test somewhat along these lines would pass on main and fail on this branch:

'use strict';

const common = require('../common');
const assert = require('assert');
const http = require('http');

const server = http.createServer(
  { keepAliveTimeout: common.platformTimeout(60000) },
  function(req, res) {
    req.resume();
    res.writeHead(200, { 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=1' });
    res.end('FOO');
  }
);

server.listen(0, common.mustCall(() => {
  let shouldStillBeAlive = true;
  setTimeout(() => {
    shouldStillBeAlive = false;
  }, common.platformTimeout(500)); // 500ms buffer

  const req = http.get({ port: server.address().port }, (res) => {
    assert.strictEqual(res.statusCode, 200);

    res.resume();
  });

  req.on('socket', (socket) => {
    socket.on('close', common.mustCall(() => {
      if (shouldStillBeAlive) {
        assert.fail('socket prematurely closed');
      }
      server.close();
    }));
  });
}));

// This timer should never go off as the agent will parse the hint and terminate earlier
setTimeout(common.mustNotCall(), common.platformTimeout(3000)).unref();

I guess it would be good to write a similar test (hopefully a version that has to rely less on timers) that verifies that we are actually latency adjusting (currently hard-coded to 1s) the hint based keep alive?

@@ -1010,7 +1010,9 @@ function resOnFinish(req, res, socket, state, server) {
}
} else if (state.outgoing.length === 0) {
if (server.keepAliveTimeout && typeof socket.setTimeout === 'function') {
socket.setTimeout(server.keepAliveTimeout);
// Increase the internal timeout wrt the advertised value to reduce likeliwood of ECONNRESET errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Increase the internal timeout wrt the advertised value to reduce likeliwood of ECONNRESET errors
// Increase the internal timeout wrt the advertised value to reduce likelihood of ECONNRESET errors

const serverHintTimeout = NumberParseInt(hint) * 1000;

if (serverHintTimeout < agentTimeout) {
// Let the timer expires before the announced timeout to reduce
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Let the timer expires before the announced timeout to reduce
// Let the timer expire before the announced timeout to reduce

@jazelly
Copy link
Member

jazelly commented Aug 28, 2024

Hey, this valuable PR fixes many similar issues for keep-alive. Is it possible to push this through 👀

@mcollina mcollina added the request-ci Add this label to start a Jenkins CI on a PR. label Aug 28, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Aug 28, 2024
@nodejs-github-bot
Copy link
Collaborator

Copy link
Member

@jakecastelli jakecastelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test?

@jazelly
Copy link
Member

jazelly commented Sep 7, 2024

Hi, I'd like to help this land to resovle similar issues. Can I continue the work and add @zanettea as a co-author? To me, there are just test cases to be added and some commits to be rebased to dodge the flakiness on CI.

@nikwen
Copy link
Contributor

nikwen commented Sep 13, 2024

The new PR was just merged. This one can be closed now. (Thanks to @jazelly, @zanettea and everyone else who worked on it!)

@mcollina mcollina closed this Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.