Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added strict timeout (timeAllowed ) for shard tolerant requests #201

Open
wants to merge 12 commits into
base: fs/branch_9_3
Choose a base branch
from

Conversation

hiteshk25
Copy link
Collaborator

No description provided.

@hiteshk25 hiteshk25 changed the base branch from main to fs/branch_9_3 May 22, 2024 14:56
@hiteshk25
Copy link
Collaborator Author

@magibney can you see if this looks reasonable as now we are not waiting for all responses. I feel like there are some semantic issue(multiple stages of query) at query level

Copy link
Collaborator

@magibney magibney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work, with some adjustments. I wonder if we should propose this upstream (certainly would need tests added). This change seems reasonable enough that upstream would be a good option because:

  1. others might benefit from it
  2. if there are issues that we haven't considered, we'd want to know
  3. as always, avoid future potential for merge conflicts

@@ -529,7 +529,7 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw
}
} else {
// a distributed request

long maxTimeAllowed = req.getParams().getLong(CommonParams.TIME_ALLOWED, Long.MAX_VALUE);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're not actually doing anything with maxTimeAllowed here? Probably want to pass it to shardHandler around line 596?

Comment on lines 246 to 247
long waitTime = maxAllowedTime - System.nanoTime();
ShardResponse rsp = responses.poll(waitTime, TimeUnit.NANOSECONDS);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not quite doing what we want here.

maxAllowedTime is supplied in millis, so any computations need to make sure the values are compatible. At the head of this method we want e.g. to convert

long maxAllowedNanos = TimeUnit.MILLISECONDS.toNanos(maxAllowedTime);
long startTimeNanos = System.nanoTime();

then compute wait time for each poll:

long waitTimeNanos = maxAllowedNanos - (System.nanoTime() - startTimeNanos);

Copy link
Collaborator

@magibney magibney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few practical changes, and one or two stylistic.

Ideally for the test(s) we would run actual requests in a test solrcloud/cluster, though this would be quite a bit more difficult.

My main practical concern is that the failure mode I'm concerned about is: what happens to client responses that return after we've shortcircuited? That's the main change here, and I expect it should be fine, but it'd be nice to have a test that validates that assumption end-to-end (and avoiding the need for the extra package-accessible methods for manipulating internal state would also be nice).

if (maxAllowedTimeInMillis > 0) {
deadline += TimeUnit.MILLISECONDS.toNanos(maxAllowedTimeInMillis);
} else {
deadline = Long.MAX_VALUE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't do deadline = Long.MAX_VALUE -- there are no guarantees about nanoTime being non-negative, nor starting from 0, etc. So any computation we do needs to make no such assumptions, and should intentionally play well with overflow.

With the approach you're taking, deadline += Long.MAX_VALUE would work here instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magibney good catch. didn't realize can return non-positive. I have added delay of day.

@hiteshk25
Copy link
Collaborator Author

My main practical concern is that the failure mode I'm concerned about is: what happens to client responses that return after we've shortcircuited? That's the main change here, and I expect it should be fine,

That was the my initial concern as searchhandler can have multiple components and multiple stages, and its not clear to me what would happen if we timeout.

@hiteshk25
Copy link
Collaborator Author

Adding, tried ValueSourceParser.sleep method but. looks like it get called before going into HTTPShardHandler. That tells me any plugin idea for delay may not help here. CCing @magibney

Caused by: java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method) ~[?:?]
	at org.apache.solr.search.ValueSourceParser$5.parse(ValueSourceParser.java:205) ~[main/:?]
	... 48 more

@hiteshk25
Copy link
Collaborator Author

Full exception thread

99972 ERROR (qtp697120469-47) [n:127.0.0.1:63794_solr c:test_coll s:shard1 r:core_node3 x:test_coll_shard1_replica_n1] o.a.s.h.RequestHandlerBase Server exception
          => java.lang.RuntimeException: java.lang.InterruptedException: sleep interrupted
	at org.apache.solr.search.ValueSourceParser$5.parse(ValueSourceParser.java:207)
java.lang.RuntimeException: java.lang.InterruptedException: sleep interrupted
	at org.apache.solr.search.ValueSourceParser$5.parse(ValueSourceParser.java:207) ~[main/:?]
	at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:485) ~[main/:?]
	at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:100) ~[main/:?]
	at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[main/:?]
	at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:172) ~[main/:?]
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:448) ~[main/:?]
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:226) [main/:?]
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2907) [main/:?]
	at org.apache.solr.servlet.HttpSolrCall.executeCoreRequest(HttpSolrCall.java:879) [main/:?]
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:575) [main/:?]
	at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:255) [main/:?]
	at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219) [main/:?]
	at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:240) [main/:?]
	at org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:213) [main/:?]
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213) [main/:?]
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) [main/:?]
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202) [jetty-servlet-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635) [jetty-servlet-10.0.15.jar:10.0.15]
	at org.apache.solr.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:187) [solr-test-framework-9.3.0-SNAPSHOT.jar:9.3.0-SNAPSHOT 48061b2d3d3719007e657ed01cfaea8dd823db6f [snapshot build, details omitted]]
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202) [jetty-servlet-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635) [jetty-servlet-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) [jetty-servlet-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1570) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) [jetty-servlet-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1543) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:301) [jetty-rewrite-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.Server.handle(Server.java:563) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282) [jetty-server-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314) [jetty-io-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) [jetty-io-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) [jetty-io-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) [jetty-util-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) [jetty-util-10.0.15.jar:10.0.15]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) [jetty-util-10.0.15.jar:10.0.15]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method) ~[?:?]
	at org.apache.solr.search.ValueSourceParser$5.parse(ValueSourceParser.java:205) ~[main/:?]
	... 48 more

@hiteshk25
Copy link
Collaborator Author

@magibney now test looks reasonable, which shows the issue and proper timeout. Let me know what you think about it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants