Have the same lock for upload + preprocess task #502
base: main
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.
@@ Coverage Diff @@
## main #502 +/- ##
=======================================
Coverage 97.45% 97.45%
=======================================
Files 416 416
Lines 34759 34761 +2
=======================================
+ Hits 33876 33878 +2
Misses 883 883
Flags with carried forward coverage won't be shown.
Codecov Report
All modified and coverable lines are covered by tests ✅
Changes have been made to critical files, which contain lines commonly executed in production.
✅ All tests successful. No failed tests found.
@@ Coverage Diff @@
## main #502 +/- ##
=======================================
Coverage 97.48% 97.48%
=======================================
Files 447 447
Lines 35488 35490 +2
=======================================
+ Hits 34595 34597 +2
Misses 893 893
Flags with carried forward coverage won't be shown.
Thinking through the problem again, I'm realizing that this solution can lead us into more problems than before 😅
Please read the comments for explanation
tasks/preprocess_upload.py
Outdated
@@ -48,7 +49,8 @@ def run_impl(
         "Received preprocess upload task",
         extra=dict(repoid=repoid, commit=commitid, report_code=report_code),
     )
-    lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
+    lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
+    # lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
Why keep it as a comment?
caaaause I forgot 🙃
@@ -48,7 +49,8 @@ def run_impl(
         "Received preprocess upload task",
         extra=dict(repoid=repoid, commit=commitid, report_code=report_code),
     )
-    lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
+    lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
I think it might be useful to add a comment explaining why this task needs the same lock as the upload task
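A hypothetical sketch of what such a shared helper plus explanatory comment could look like. `UPLOAD_LOCK_NAME` appears in the diff above, but its exact definition is not shown in this PR, so the name format here is an assumption:

```python
# Hypothetical sketch: one shared helper that both the upload task and the
# preprocess task import, so the lock name can never drift between them.
# The exact string format is an assumption, not the real implementation.
def UPLOAD_LOCK_NAME(repoid: int, commitid: str) -> str:
    # Deliberately omits report_code: both tasks must contend for the SAME
    # per-(repo, commit) lock, because either one may initialize and save
    # the report for that commit.
    return f"upload_lock_{repoid}_{commitid}"

# Both tasks would then compute an identical lock name for a given commit.
lock_name = UPLOAD_LOCK_NAME(123, "abcdef0")
```

Keeping the format in one function (rather than two copies of an f-string) is what makes "same lock for both tasks" a structural guarantee instead of a convention.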
tasks/preprocess_upload.py
Outdated
blocking_timeout=None,
# This is the timeout that this task will wait for the lock. This should
# be non-zero as otherwise it waits indefinitely to get the lock, and ideally smaller than
# the blocking timeout for the upload task so this one goes first, although it can go second
It's not the order that matters. The reason the blocking timeout should be smaller is to avoid the scenario in which the PreProcess tasks wait indefinitely and end up holding the locks. That can starve the Upload tasks (because they wait a limited time, and retry a limited number of times).
If only PreProcess tasks run for a commit, that's a problem. If only Upload tasks run, it's OK.
We know that the number of Upload tasks is always >= the number of PreProcess tasks for a given commit.
We know that the Upload tasks retry if they can't get the lock, but the PreProcess task doesn't.
By forcing the PreProcess tasks to collectively wait less time for the lock, we can guarantee that an Upload task will run and everything will be OK.
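A minimal sketch of that asymmetry, using a plain `threading.Lock` as a stand-in for the distributed lock (all names and timeout values here are illustrative, scaled down from minutes to fractions of a second; this is not the real config):

```python
import threading
import time

# Illustrative values only: upload may wait longer per attempt because it
# also retries; preprocess never retries, so it must give up quickly
# rather than queue on the lock and starve upload tasks.
UPLOAD_BLOCKING_TIMEOUT = 0.5
PREPROCESS_BLOCKING_TIMEOUT = 0.1

lock = threading.Lock()
lock.acquire()  # simulate: another task currently holds the per-commit lock

def try_task(blocking_timeout: float) -> bool:
    """Attempt to take the lock, waiting at most blocking_timeout seconds."""
    got = lock.acquire(timeout=blocking_timeout)
    if got:
        lock.release()
    return got

# While the lock is held, preprocess fails fast instead of waiting
# indefinitely, leaving the lock available for upload's retries.
start = time.monotonic()
preprocess_got_lock = try_task(PREPROCESS_BLOCKING_TIMEOUT)
waited = time.monotonic() - start
```

The invariant being argued for is simply `PREPROCESS_BLOCKING_TIMEOUT < UPLOAD_BLOCKING_TIMEOUT`, so a contended commit always ends up processed by an Upload task.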
This is also the reason why nuking the PreProcess task makes sense.
It avoids this problem completely
I'm kinda realizing now that the current setup could lead us into trouble...
Consider this scenario:
1 PreProcess and 1 Upload task execute at roughly the same time. PreProcess gets the lock. It can hold the lock for a maximum of 5 minutes. Let's say it keeps the lock for the full 5 minutes.
Upload waits 5s and doesn't get the lock. It retries (after 20s), then waits 5 more seconds for the lock and fails. It retries again (after 40s) and waits 5 more seconds. Then it does not retry again.
In total the task waited 15s + 20s + 40s = 1m15s.
In this scenario no Upload task would run for the commit and nothing would be processed.
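The timeline above as arithmetic (values taken from the scenario in this thread; the variable names are illustrative):

```python
# Worst-case timeline: preprocess holds the lock for its full 5-minute
# timeout while upload burns through its retries.
LOCK_HOLD = 5 * 60           # seconds preprocess may keep the lock
BLOCKING_TIMEOUT = 5         # seconds upload waits for the lock per attempt
RETRY_COUNTDOWNS = [20, 40]  # countdown before each of the 2 retries

attempts = len(RETRY_COUNTDOWNS) + 1  # initial try + retries
total_upload_wait = attempts * BLOCKING_TIMEOUT + sum(RETRY_COUNTDOWNS)
# 3 * 5 + 20 + 40 = 75 seconds = 1m15s, far less than the 300s lock hold,
# so every upload attempt can fail while preprocess still holds the lock.
```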
That makes sense, so if we just ensure the upload task waits at least 5 minutes in the worst case scenario, that would fix this, right?
Every unsuccessful run goes through blocking_timeout (5) + retry_countdown (20 * 2**self.request.retries), so it goes like:
0 retries: 5 + 20 = 25
1 retry: 5 + 40 = 45 (total 70)
2 retries: 5 + 80 = 85 (total 155)
3 retries: 5 + 160 = 165 (total 320)
So if we allowed this to retry at most 3 times, that could solve this, or we could tweak the specific retry values. That's one way.
Another is to shorten the time the lock is held during preprocess and fit the upload retries around that.
There's also the elimination of the preprocess task altogether, which I can ask other people/engs about, but I suspect that will be met with more friction + unforeseen things because we're suddenly unsupporting a command. Although if we fundamentally think this task is more harmful than not, it's worth pushing back on getting rid of it.
Let me know what you think about this
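The retry schedule above, generalized: each failed attempt costs blocking_timeout plus an exponential countdown, and we want the smallest retry count whose cumulative wait outlasts the 5-minute lock hold (a sketch using the values quoted in this thread):

```python
BLOCKING_TIMEOUT = 5   # seconds upload waits for the lock per attempt
LOCK_HOLD = 5 * 60     # seconds preprocess may keep the lock

def cumulative_wait(max_retries: int) -> int:
    # Attempt r costs blocking_timeout + retry countdown 20 * 2**r,
    # matching the schedule 25, 70, 155, 320 worked out above.
    return sum(BLOCKING_TIMEOUT + 20 * 2**r for r in range(max_retries + 1))

per_retry = [cumulative_wait(n) for n in range(4)]  # [25, 70, 155, 320]

# Smallest max_retries whose cumulative wait covers the full lock hold.
min_retries = next(n for n in range(10) if cumulative_wait(n) >= LOCK_HOLD)
```

With 3 retries the cumulative 320s exceeds the 300s hold, which is why capping retries at 3 (rather than 2) closes the starvation window.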
I would push for the elimination of the preprocess task altogether. It would definitely be a good idea to get other opinions from the team.
It is not true that "we're suddenly unsupporting a command" if we drop that task. The api would still create an empty report.
In the meantime we can go with the proposal in the other PR to alleviate customers' problems (although for the record I don't particularly like that solution, but that's just 1 opinion)
This is an alternative to the solution proposed in this ticket #497, attempting to correct the problem at the source rather than react to it.
The idea is to share the same lock between both of these processes, to prevent either one from starting without initializing and saving the report.
Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.