feat(events tracking): add abstract class and logging implementation #80117
Codecov Report
Attention: Patch coverage is ✅ All tests successful. No failed tests found.
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #80117      +/-   ##
==========================================
- Coverage   78.48%   78.48%   -0.01%
==========================================
  Files        7210     7207       -3
  Lines      319532   319607      +75
  Branches    43963    43989      +26
==========================================
+ Hits       250797   250841      +44
- Misses      62348    62371      +23
- Partials     6387     6395       +8
@@ -202,6 +203,10 @@ def process_event(
         else:
             with metrics.timer("ingest_consumer._store_event"):
                 cache_key = processing_store.store(data)
+                track_sampled_event(
+                    data["event_id"], data.get("type"), TransactionStageStatus.REDIS_PUT
I think you want the pipeline name here, not `data.get("type")`. They are not always the same.
I'm using `data.get("type")` to keep it general, so that when this is eventually extended to errors it will work too. Do you have any concerns about this?
Hardcoded it to only take transactions for now.
Yes, you cannot use `data.get("type")` for errors: there are many different types going through the errors pipeline. The pipeline name should be the generalized thing.
@fpacifici an int in logging should still be cheaper than a string in Google logs, right? Or is the difference too negligible?
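To make the int-versus-string question concrete, here is a rough sketch that compares the serialized size of one log payload under each encoding. The enum classes, the value `1`, the string `"redis_put"`, and the payload shape are all assumptions for illustration; none of them are taken from this PR, and the sketch says nothing about how any logging backend actually bills or stores fields.

```python
import json
from enum import Enum, IntEnum


class StageAsInt(IntEnum):
    # Hypothetical integer value; the PR's real enum values are not shown here.
    REDIS_PUT = 1


class StageAsStr(Enum):
    # Hypothetical string value, used only for comparison.
    REDIS_PUT = "redis_put"


def payload_size(status: Enum) -> int:
    """Return the UTF-8 byte length of a JSON log payload carrying `status`."""
    payload = {"event_id": "0" * 32, "status": status.value}
    return len(json.dumps(payload).encode("utf-8"))


if __name__ == "__main__":
    int_size = payload_size(StageAsInt.REDIS_PUT)
    str_size = payload_size(StageAsStr.REDIS_PUT)
    print(f"int payload: {int_size} bytes, str payload: {str_size} bytes")
```

Under these assumptions the saving is a handful of bytes per log line, so whether it matters depends mostly on the sampled volume.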
add updated list of enums add sampling add redis put add sampling logic add extra remove class add .value change enum value to int use IntEnum add test first pass add hash sampling update status enum docstring comment add wip wip tests pass change to should_track add TransactionStageStatus return if rate 0 add unit test remove old test add comments add TODO add event type use options automator update comment option to 0 only use options override in tests another way
if __name__ == "__main__":
    unittest.main()
Why was this needed?
[design doc](https://www.notion.so/sentry/Conversion-rate-of-ingest-transactions-to-save-trx-1298b10e4b5d801ab517c8e2218d13d5)

We need to track the completion of each stage in order to:
1) compute event conversion rates
2) enable debugging visibility into where events are being dropped

Usage will be heavily sampled so that it does not blow up traffic.

This PR only adds the REDIS_PUT stage; in subsequent PRs I will add all the other stages listed in the EventStageStatus class.

**!!!!!IMPORTANT!!!!!!**
hash based sampling

Here's a [blog post](https://www.rsyslog.com/doc/tutorials/hash_sampling.html) explaining hash-based sampling, which would provide "all or nothing" logging for the sampled events across the entire pipeline. That's the idea I want to implement.

The hashing algorithm used must be consistent and uniformly distributed in order for all-or-nothing sampling to work.

I cannot find references stating that md5 is consistent and evenly distributed other than various [stackoverflow pages](https://crypto.stackexchange.com/questions/14967/distribution-for-a-subset-of-md5). The official sources are too academic and long for me to follow.

----------

for reviewers:
please review with thoughts on how this can be generalized to other pipelines as well, such as errors
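
For readers unfamiliar with the technique, the hash-based "all or nothing" sampling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function signature, the use of the first 8 digest bytes, and the mapping to [0, 1) are assumptions. The key property is that the decision depends only on the event ID and the rate, so every stage reaches the same verdict for the same event.

```python
import hashlib


def should_track(event_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether an event is in the sample.

    Hypothetical sketch: hash the event ID, map the digest to a number in
    [0, 1), and keep the event if that number falls below the sample rate.
    """
    if sample_rate <= 0:
        return False  # sampling disabled
    if sample_rate >= 1:
        return True  # track everything

    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    event_id = "0123456789abcdef0123456789abcdef"
    # Two "stages" asking about the same event always agree.
    print(should_track(event_id, 0.5) == should_track(event_id, 0.5))
```

Because the bucket for an event is fixed, raising the rate only ever adds events to the sample; it never drops ones that were already being tracked.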