Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve retrying downloads code and testing #1559

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

yarikoptic
Copy link
Member

Follow up to

We are into holidays already, so it is understood if review does not come. I will reserve the "right" to merge if there is user demand but likely we could just wait. If anyone sees "clearly" how to improve -- feel welcome to push directly.

@yarikoptic yarikoptic added the patch Increment the patch version when merged label Dec 20, 2024
Copy link

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 98.59155% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.47%. Comparing base (bb05a23) to head (6b57379).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
dandi/download.py 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1559      +/-   ##
==========================================
- Coverage   88.53%   88.47%   -0.07%     
==========================================
  Files          78       78              
  Lines       10735    10798      +63     
==========================================
+ Hits         9504     9553      +49     
- Misses       1231     1245      +14     
Flag Coverage Δ
unittests 88.47% <98.59%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -160,7 +160,7 @@ def urls(self) -> Iterator[str]:

#: HTTP response status codes that should always be retried (until we run out
#: of retries)
RETRY_STATUSES = (500, 502, 503, 504)
RETRY_STATUSES = (429, 500, 502, 503, 504)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding 429 here will also affect requests made by dandiapi.py and dandiarchive.py. Unless you want to add support for Retry-After there as well, it'd be better to just add 429 next to RETRY_STATUSES in download.py.

@@ -3,7 +3,8 @@
from collections import Counter, deque
from collections.abc import Callable, Iterable, Iterator, Sequence
from dataclasses import InitVar, dataclass, field
from datetime import datetime
from datetime import UTC, datetime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime.UTC requires Python 3.11. Use datetime.timezone.utc instead.

dandi/download.py Show resolved Hide resolved
# if is_access_denied(exc) or attempt >= 2:
# raise
# sleep a little and retry
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By removing this else, the lgr.debug() call on line 1138/1156 will be followed by the call on line 1157/1167, resulting in two similar log messages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would only be in the case of non-conformant "retry-after" we would get 3 messages "on the matter"

  • "to response %d with non-integer or future date"
  • "to response %d with Retry-After"
  • the one below.

if there is a good Retry-after, we would get a non-negative value for sleep_amount, and log only at 1138/1156 (but we might want it into else: for exception, to not log with /1147) as sleep_amount condition on /1164 would not trigger that other message where we would also define the random sleep.

But for above "3 msgs" I will adjust... thanks!

Comment on lines +1127 to +1128
if retry_after.isdigit():
sleep_amount = int(retry_after)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are assuming that this client always connect to our trusted DANDI API, I think this is fine, but if we want to want to play even "safer" I think we may want to ensure the retry_after string is really a non-negative integer in a decimal expression.

If retry_after is "²³" will crash this client at this point.

This safe guard can be excessive though since we have control of both the client and server.

I think use of str.isdecimal() and ensuring the return of int() to be non-negative is better.

Copy link
Member

@candleindark candleindark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, _check_if_more_attempts_allowed is equally, if not more, about sleeping/waiting before sending out another requests. I think it would be beneficial to rename it, or in some other way, to indicate this nature.

If we want to improve the retrying further, I think we can consider incorporating some well-known and recommended algorithms for retrying requests to a server while avoiding overwhelming it. https://chatgpt.com/share/676f8290-86cc-800a-9aa8-1ad952282d27.

Comment on lines +1132 to +1155
try:
retry_after_date = parsedate_to_datetime(retry_after)
current_date = datetime.now(UTC)
difference = retry_after_date - current_date
sleep_amount = int(difference.total_seconds())
if sleep_amount < 0:
lgr.warning(
"%s - date in Retry-After=%r is in the past (current is %r)",
path,
retry_after,
current_date,
)
except ValueError as exc_ve:
# our code or response is wrong, do not crash but issue warning
# and continue with "default" sleep logic
lgr.warning(
"%s - download failed due to response %d with non-integer or future date"
" Retry-After=%r: %s with %s upon converting assuming it is a date",
path,
exc.response.status_code,
retry_after,
exc,
exc_ve,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
retry_after_date = parsedate_to_datetime(retry_after)
current_date = datetime.now(UTC)
difference = retry_after_date - current_date
sleep_amount = int(difference.total_seconds())
if sleep_amount < 0:
lgr.warning(
"%s - date in Retry-After=%r is in the past (current is %r)",
path,
retry_after,
current_date,
)
except ValueError as exc_ve:
# our code or response is wrong, do not crash but issue warning
# and continue with "default" sleep logic
lgr.warning(
"%s - download failed due to response %d with non-integer or future date"
" Retry-After=%r: %s with %s upon converting assuming it is a date",
path,
exc.response.status_code,
retry_after,
exc,
exc_ve,
)
try:
retry_after_date = parsedate_to_datetime(retry_after)
except (ValueError, TypeError) as exc_ve:
# our code or response is wrong, do not crash but issue warning
# and continue with "default" sleep logic
lgr.warning(
"%s - download failed due to response %d with non-integer or future date"
" Retry-After=%r: %s with %s upon converting assuming it is a date",
path,
exc.response.status_code,
retry_after,
exc,
exc_ve,
)
else:
current_date = datetime.now(timezone.utc)
difference = retry_after_date - current_date
sleep_amount = int(difference.total_seconds())
if sleep_amount < 0: # todo: `sleep_amount < 0` may not be an error. It the server set a `Retry-After` datetime extremely close and the network is slow, the client may receive the response after the `Retry-After` datetime.
lgr.warning(
"%s - date in Retry-After=%r is in the past (current is %r)",
path,
retry_after,
current_date,
)

This suggestion make the try clause

  1. more targeted
  2. catches TypeErrors from parsedate_to_datetime. (Try parsedate_to_datetime('Wed, 21 O3t 2015 07:280 GMT'))
  3. adds a note with a todo comment regarding sleep_amount < 0

retry_after,
exc,
exc_ve,
)
lgr.debug(
"%s - download failed due to response %d with "
"Retry-After=%d: %s, will sleep and retry",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "Retry-After=[]" in this log is now insistent and sleep_amount may not be a meaningful number if retry_after failed to be parsed into a datetime object

@@ -1122,19 +1123,36 @@ def _check_if_more_attempts_allowed(
return None
elif retry_after := exc.response.headers.get("Retry-After"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about the changes in this PR. However, since are trying to improve this function, we may as well consider replace elif with if here. The last if clause ended with a return statement, so the elif here is about confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So can be said about the containing elif clause at line 1111 at which I can't place a comment.

# sleep a little and retry
else:
if sleep_amount < 0:
# it was not Retry-after set, so we come up with random duration to sleep
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current arrangement, the execution flow can reach here event if Retry-After is set. It can be set to an extremely close datetime or some string can't be parsed properly in to a datetime object

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants