-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve retrying downloads code and testing #1559
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1559 +/- ##
==========================================
- Coverage 88.53% 88.47% -0.07%
==========================================
Files 78 78
Lines 10735 10798 +63
==========================================
+ Hits 9504 9553 +49
- Misses 1231 1245 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -160,7 +160,7 @@ def urls(self) -> Iterator[str]: | |||
|
|||
#: HTTP response status codes that should always be retried (until we run out | |||
#: of retries) | |||
RETRY_STATUSES = (500, 502, 503, 504) | |||
RETRY_STATUSES = (429, 500, 502, 503, 504) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding 429 here will also affect requests made by dandiapi.py
and dandiarchive.py
. Unless you want to add support for Retry-After there as well, it'd be better to just add 429 next to RETRY_STATUSES
in download.py
.
@@ -3,7 +3,8 @@ | |||
from collections import Counter, deque | |||
from collections.abc import Callable, Iterable, Iterator, Sequence | |||
from dataclasses import InitVar, dataclass, field | |||
from datetime import datetime | |||
from datetime import UTC, datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datetime.UTC
requires Python 3.11. Use datetime.timezone.utc
instead.
# if is_access_denied(exc) or attempt >= 2: | ||
# raise | ||
# sleep a little and retry | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By removing this else
, the lgr.debug()
call on line 1138/1156 will be followed by the call on line 1157/1167, resulting in two similar log messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it would only be in the case of non-conformant "retry-after" we would get 3 messages "on the matter"
- "to response %d with non-integer or future date"
- "to response %d with Retry-After"
- the one below.
if there is a good Retry-after, we would get a non-negative value for sleep_amount, and log only at 1138/1156 (but we might want it into else:
for exception, to not log with /1147) as sleep_amount condition on /1164 would not trigger that other message where we would also define the random sleep.
But for above "3 msgs" I will adjust... thanks!
if retry_after.isdigit(): | ||
sleep_amount = int(retry_after) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are assuming that this client always connect to our trusted DANDI API, I think this is fine, but if we want to want to play even "safer" I think we may want to ensure the retry_after
string is really a non-negative integer in a decimal expression.
If retry_after
is "²³"
will crash this client at this point.
This safe guard can be excessive though since we have control of both the client and server.
I think use of str.isdecimal()
and ensuring the return of int()
to be non-negative is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, _check_if_more_attempts_allowed
is equally, if not more, about sleeping/waiting before sending out another requests. I think it would be beneficial to rename it, or in some other way, to indicate this nature.
If we want to improve the retrying further, I think we can consider incorporating some well-known and recommended algorithms for retrying requests to a server while avoiding overwhelming it. https://chatgpt.com/share/676f8290-86cc-800a-9aa8-1ad952282d27.
try: | ||
retry_after_date = parsedate_to_datetime(retry_after) | ||
current_date = datetime.now(UTC) | ||
difference = retry_after_date - current_date | ||
sleep_amount = int(difference.total_seconds()) | ||
if sleep_amount < 0: | ||
lgr.warning( | ||
"%s - date in Retry-After=%r is in the past (current is %r)", | ||
path, | ||
retry_after, | ||
current_date, | ||
) | ||
except ValueError as exc_ve: | ||
# our code or response is wrong, do not crash but issue warning | ||
# and continue with "default" sleep logic | ||
lgr.warning( | ||
"%s - download failed due to response %d with non-integer or future date" | ||
" Retry-After=%r: %s with %s upon converting assuming it is a date", | ||
path, | ||
exc.response.status_code, | ||
retry_after, | ||
exc, | ||
exc_ve, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try: | |
retry_after_date = parsedate_to_datetime(retry_after) | |
current_date = datetime.now(UTC) | |
difference = retry_after_date - current_date | |
sleep_amount = int(difference.total_seconds()) | |
if sleep_amount < 0: | |
lgr.warning( | |
"%s - date in Retry-After=%r is in the past (current is %r)", | |
path, | |
retry_after, | |
current_date, | |
) | |
except ValueError as exc_ve: | |
# our code or response is wrong, do not crash but issue warning | |
# and continue with "default" sleep logic | |
lgr.warning( | |
"%s - download failed due to response %d with non-integer or future date" | |
" Retry-After=%r: %s with %s upon converting assuming it is a date", | |
path, | |
exc.response.status_code, | |
retry_after, | |
exc, | |
exc_ve, | |
) | |
try: | |
retry_after_date = parsedate_to_datetime(retry_after) | |
except (ValueError, TypeError) as exc_ve: | |
# our code or response is wrong, do not crash but issue warning | |
# and continue with "default" sleep logic | |
lgr.warning( | |
"%s - download failed due to response %d with non-integer or future date" | |
" Retry-After=%r: %s with %s upon converting assuming it is a date", | |
path, | |
exc.response.status_code, | |
retry_after, | |
exc, | |
exc_ve, | |
) | |
else: | |
current_date = datetime.now(timezone.utc) | |
difference = retry_after_date - current_date | |
sleep_amount = int(difference.total_seconds()) | |
if sleep_amount < 0: # todo: `sleep_amount < 0` may not be an error. It the server set a `Retry-After` datetime extremely close and the network is slow, the client may receive the response after the `Retry-After` datetime. | |
lgr.warning( | |
"%s - date in Retry-After=%r is in the past (current is %r)", | |
path, | |
retry_after, | |
current_date, | |
) |
This suggestion make the try clause
- more targeted
- catches
TypeError
s fromparsedate_to_datetime
. (Tryparsedate_to_datetime('Wed, 21 O3t 2015 07:280 GMT')
) - adds a note with a todo comment regarding
sleep_amount < 0
retry_after, | ||
exc, | ||
exc_ve, | ||
) | ||
lgr.debug( | ||
"%s - download failed due to response %d with " | ||
"Retry-After=%d: %s, will sleep and retry", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "Retry-After=[]" in this log is now insistent and sleep_amount
may not be a meaningful number if retry_after
failed to be parsed into a datetime
object
@@ -1122,19 +1123,36 @@ def _check_if_more_attempts_allowed( | |||
return None | |||
elif retry_after := exc.response.headers.get("Retry-After"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not about the changes in this PR. However, since are trying to improve this function, we may as well consider replace elif
with if
here. The last if
clause ended with a return
statement, so the elif
here is about confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can be said about the containing elif
clause at line 1111 at which I can't place a comment.
# sleep a little and retry | ||
else: | ||
if sleep_amount < 0: | ||
# it was not Retry-after set, so we come up with random duration to sleep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current arrangement, the execution flow can reach here event if Retry-After
is set. It can be set to an extremely close datetime or some string can't be parsed properly in to a datetime
object
Follow up to
Continue retrying downloads on retriable statuses #1558
retry on 429 as well
handle Retry-After as a Date not just seconds (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Date)
adjust logging of sleeping upon retry - do in a single place
We are into holidays already, so it is understood if review does not come. I will reserve the "right" to merge if there is user demand but likely we could just wait. If anyone sees "clearly" how to improve -- feel welcome to push directly.