-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added strict timeout (timeAllowed ) for shard tolerant requests #201
base: fs/branch_9_3
Are you sure you want to change the base?
Conversation
@magibney can you see if this looks reasonable as now we are not waiting for all responses. I feel like there are some semantic issue(multiple stages of query) at query level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work, with some adjustments. I wonder if we should propose this upstream (certainly would need tests added). This change seems reasonable enough that upstream would be a good option because:
- others might benefit from it
- if there are issues that we haven't considered, we'd want to know
- as always, avoid future potential for merge conflicts
@@ -529,7 +529,7 @@ public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throw | |||
} | |||
} else { | |||
// a distributed request | |||
|
|||
long maxTimeAllowed = req.getParams().getLong(CommonParams.TIME_ALLOWED, Long.MAX_VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're not actually doing anything with maxTimeAllowed
here? Probably want to pass it to shardHandler around line 596?
long waitTime = maxAllowedTime - System.nanoTime(); | ||
ShardResponse rsp = responses.poll(waitTime, TimeUnit.NANOSECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite doing what we want here.
maxAllowedTime
is supplied in millis, so any computations need to make sure the values are compatible. At the head of this method we want e.g. to convert
long maxAllowedNanos = TimeUnit.MILLISECONDS.toNanos(maxAllowedTime);
long startTimeNanos = System.nanoTime();
then compute wait time for each poll:
long waitTimeNanos = maxAllowedNanos - (System.nanoTime() - startTimeNanos);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few practical changes, and one or two stylistic.
Ideally for the test(s) we would run actual requests in a test solrcloud/cluster, though this would be quite a bit more difficult.
My main practical concern is that the failure mode I'm concerned about is: what happens to client responses that return after we've shortcircuited? That's the main change here, and I expect it should be fine, but it'd be nice to have a test that validates that assumption end-to-end (and avoiding the need for the extra package-accessible methods for manipulating internal state would also be nice).
if (maxAllowedTimeInMillis > 0) { | ||
deadline += TimeUnit.MILLISECONDS.toNanos(maxAllowedTimeInMillis); | ||
} else { | ||
deadline = Long.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't do deadline = Long.MAX_VALUE
-- there are no guarantees about nanoTime being non-negative, nor starting from 0, etc. So any computation we do needs to make no such assumptions, and should intentionally play well with overflow.
With the approach you're taking, deadline += Long.MAX_VALUE
would work here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@magibney good catch. didn't realize can return non-positive. I have added delay of day.
solr/core/src/test/org/apache/solr/handler/component/TestHttpShardHandlerFactory.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/handler/component/TestHttpShardHandlerFactory.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/handler/component/TestHttpShardHandlerFactory.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/handler/component/TestHttpShardHandlerFactory.java
Outdated
Show resolved
Hide resolved
That was the my initial concern as searchhandler can have multiple components and multiple stages, and its not clear to me what would happen if we timeout. |
Adding, tried ValueSourceParser.sleep method but. looks like it get called before going into HTTPShardHandler. That tells me any plugin idea for delay may not help here. CCing @magibney
|
Full exception thread
|
@magibney now test looks reasonable, which shows the issue and proper timeout. Let me know what you think about it |
No description provided.