PSC-STM-A8: Monitoring of PSC Stream Ingester #247

Open
tiredpixel opened this issue Feb 28, 2024 · 3 comments

@tiredpixel

Since it should be running continuously, it is important to get notifications of when it has crashed.

It ought to recover from a brief crash, since the stream pointer is stored so it can resume from where it was. However, frequent crashes could show that something needs fixing, and a long outage could lead to the stream data no longer being available (a sign we will need to ingest the snapshots to get any missing data).

Add monitoring so that a notification is sent whenever it goes down, and so that uptime status is reported.

Estimate: 4 hours

@tiredpixel

I'm not up-to-date with Heroku monitoring extensions, so I'll evaluate the options briefly.

The Heroku Errors and Exceptions add-ons are:
https://elements.heroku.com/addons#errors-exceptions

@tiredpixel

Bugsnag

https://elements.heroku.com/addons/bugsnag

  • 7500 exceptions/month @ free
  • sets BUGSNAG_API_KEY env var
  • bugsnag Ruby library
  • doesn't work without configuration, even with RACK_ENV env var set
  • documentation has an interesting example: reporting unhandled exceptions from an on-exit hook, rather than rescuing exceptions directly
  • test exception took a couple of minutes to be received
  • basic plan ('tauron') has limit of 1 collaborator; not sure if it's possible to add additional emails in such a case
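
The on-exit idea noted above can be sketched in plain Ruby. This is a minimal sketch, not the add-on's documented example: `report_crash`, `crash_to_report`, and `REPORTED` are hypothetical names, and `report_crash` only records what it would send. With a real client, its body would presumably be the library's notify call (for Bugsnag, `Bugsnag.notify(error)`), after the library has been configured.

```ruby
# Hypothetical stand-in for a real error-reporting client. It only records
# what it would send, so the sketch runs without any add-on library.
REPORTED = []

def report_crash(error)
  REPORTED << "#{error.class}: #{error.message}"
end

# Decide whether the process is exiting because of something worth reporting.
# `error` is expected to be the value of `$!`, which Ruby sets to the
# exception that is terminating the process (nil on a clean exit).
def crash_to_report(error)
  return nil if error.nil?
  # `exit` / `exit 0` raises SystemExit with a success status: not a crash.
  return nil if error.is_a?(SystemExit) && error.success?
  error
end

# Runs once, when the interpreter shuts down, whatever the reason.
at_exit do
  error = crash_to_report($!)
  report_crash(error) if error
end
```

The appeal for a basic Ruby app is that nothing in the main loop has to rescue anything; the hook sees whatever finally killed the process.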

Raygun Crash Reporting

https://elements.heroku.com/addons/raygun

  • 5K errors/month @ free
  • sets RAYGUN_APIKEY env var
  • raygun4ruby Ruby library
  • compulsory company survey
  • doesn't work without configuration, even with RACK_ENV env var set
  • works with configuration
  • not immediately clear how to add email recipient

Sentry

https://elements.heroku.com/addons/sentry

  • 5K errors/month @ free
  • sets SENTRY_DSN env var
  • sentry-ruby Ruby library
  • compulsory newsletter choice
  • doesn't work without configuration, even with RACK_ENV env var set
  • works with configuration
  • simple account-level alerts don't work even after email confirmation

Honeybadger

https://elements.heroku.com/addons/honeybadger

  • 5K errors/month @ free
  • sets HONEYBADGER_API_KEY env var
  • honeybadger Ruby library
  • doesn't work without configuration, even with RACK_ENV env var set
  • quick-start instructions don't work in our case
  • works with configuration
  • attempting to add extra email alert fails with exception ( :| )
  • alert still went to main account email, rather than additional email

AppSignal APM

https://elements.heroku.com/addons/appsignal

  • 250K requests/month @ $15/month
  • skipping since more complex and expensive than the others for our use case (full APM rather than just exception monitoring)

Airbrake Error Monitoring

https://elements.heroku.com/addons/airbrake

  • 2K errors/month @ free
  • sets AIRBRAKE_API_KEY, AIRBRAKE_PROJECT_ID env vars
  • airbrake-ruby Ruby library
  • doesn't work without configuration, even with RACK_ENV env var set
  • works with configuration
  • simple account-level email alerts work
  • I haven't used it for many years, but it still appears to be simple to use out-the-box
  • errors take a couple of minutes to be detected ( :( )
  • the next pricing tier is more expensive than some of the others

Rollbar

https://elements.heroku.com/addons/rollbar

  • 5K events/month @ free
  • sets ROLLBAR_ACCESS_TOKEN, ROLLBAR_ENDPOINT env vars
  • rollbar Ruby library
  • doesn't work without configuration, even with RACK_ENV env var set
  • works with configuration
  • not immediately clear how to add email alerts
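
Every add-on above needed explicit configuration before it reported anything from a plain Ruby process. A minimal sketch of what "works with configuration" tends to involve is below, assuming the Airbrake env var names from the notes above; `reporter_config`, `configured?`, and `with_error_reporting` are hypothetical helpers, and the notifier is any callable. With airbrake-ruby, the config hash would presumably feed `Airbrake.configure` (`project_id`, `project_key`) and the notifier would call `Airbrake.notify_sync`.

```ruby
# Read the settings from the env vars the add-on sets on the app.
# `env` is injectable so the helper can be exercised without touching ENV.
def reporter_config(env = ENV)
  {
    project_id:  env["AIRBRAKE_PROJECT_ID"],
    project_key: env["AIRBRAKE_API_KEY"],
  }
end

# True only when every setting is present and non-blank.
def configured?(config)
  config.values.none? { |value| value.nil? || value.strip.empty? }
end

# Run the block; report any StandardError through `notifier`, then re-raise
# so the process still exits non-zero and the crash stays visible in logs.
def with_error_reporting(notifier)
  yield
rescue StandardError => e
  notifier.call(e)
  raise
end
```

Re-raising after the report is a deliberate choice here: swallowing the error would keep a broken ingester looking healthy.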

@tiredpixel

Monitoring extensions experiment conclusion

There are likely multiple options possible for us, here; I evaluated each option for only a few minutes. Had I had more experience with each option, it is likely that I would know better how to configure them for our use case. However, our use case is actually very simple: detect a failure, and send an email. I was surprised that configuring this wasn't immediately obvious in some of the options.

Nothing worked immediately out-the-box; this is typical for Ruby (or at least was some years ago), but I wondered if one would have a Ruby library which injected into the exceptions stack and worked with zero configuration, even for a basic Ruby app (not Rails, not Sinatra, not Rake, etc.). I can fully understand why they didn't opt for this approach—but it would be less work. Perhaps there are alternative configurations which support this, but I didn't spot them when glancing through the documentation for each.

From this experiment, I would say that Airbrake was the easiest to set up, understand, and configure. However, I should note that I have previous experience with Airbrake (albeit many years, perhaps closer to a decade, ago… :! )—yet I don't think this was a major influence. However, it's more expensive than the others for the next (non-free) tier. It's also not clear to me how easy it will be to configure multiple users (which themselves require a non-free event tier).

This judgement is rather arbitrary, since it's likely there are multiple good options, here—just with a little more work. However, I'm not sure where the work would be best invested (not being familiar with Heroku monitoring add-ons in recent years). So, I would suggest starting with Airbrake (which also has a free tier), and if necessary, considering whether to pay the (rather costly) non-free tier upgrade, or whether to switch to an alternative. However, from my experiments, Airbrake will easily support what we need for the time being, with minimal fuss. But as I say, this judgement is rather arbitrary; given a little more time or previous experience with some of the other options, I might well have selected one of them instead…

What's most important here is to have something that works for what we need currently—alerting us when there are crashes in the streaming app. Unfortunately, it appears that Heroku neither has this functionality natively, nor has a way of automatically restarting crashed apps. This is, quite frankly, a bitter disappointment; not only the gold-standard Kubernetes, but also other alternatives, have long supported this sort of fatal crash and restart scenario. Perhaps I missed something, but I don't see anything indicating to the contrary, at present.

Neither does there seem to be a recommended path for accomplishing this in Heroku, even with the installation of plugins. Thus, given that the primary objectives of error detection and email alerting are achieved, and given that evaluating even these Heroku add-on options has taken a fair amount of time, I recommend selecting Airbrake in the first instance, and then re-evaluating on a usage and pricing basis once those become the dominant factors.

This whole experiment puts me in mind of simply dropping exception monitoring at the Ruby level altogether, and instead dealing with it at the ops level by wrapping the Procfile script and capturing its stdout and exit status. That would also allow reporting monitoring data, such as run duration and time not running, to a solution outside of Heroku (of the kind typically used to monitor crontabs and similar processes). I wouldn't be surprised if I ended up recommending such an approach instead. However, in the first instance, I'm trying to keep within Heroku and typical Ruby solutions as much as possible, rather than removing it from that stack layer entirely and taking an 'old-school' devops/sysadmin approach (which would likely solve our use case, and cover more than we're currently able to monitor, far more simply…).
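
A rough sketch of that ops-level wrapper, kept in Ruby for consistency with the app, might look like the following. Everything named here is an assumption for illustration: `run_monitored`, `ping_monitor`, and the `MONITOR_PING_URL` env var are hypothetical, and the form fields posted are not any particular monitoring service's API.

```ruby
require "net/http"
require "uri"

# Run a command as a child process and describe how it ended.
# The child inherits stdout/stderr, so its output still reaches the logs.
def run_monitored(*command)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  pid = Process.spawn(*command)
  _, status = Process.wait2(pid)
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  {
    exit_code: status.exitstatus,  # nil if the child was killed by a signal
    signal:    status.termsig,     # nil if the child exited normally
    duration:  duration,           # seconds the child was running
    crashed:   !status.success?,   # true for non-zero exit or signal
  }
end

# Report the result to an external monitor (hypothetical endpoint).
# Returns false when no URL is configured or the request fails.
def ping_monitor(result, url = ENV["MONITOR_PING_URL"])
  return false if url.nil? || url.empty?
  Net::HTTP.post_form(
    URI(url),
    "exit_code" => result[:exit_code].to_s,
    "duration"  => result[:duration].round(1).to_s
  )
  true
rescue StandardError
  false
end
```

A wrapper script built on this would call `run_monitored(*ARGV)`, pass the result to `ping_monitor`, and then exit with the child's exit code, with the Procfile entry pointing at the wrapper rather than at the ingester directly. Since the wrapper never loads the ingester's code, it keeps working regardless of which Ruby-level library is or isn't configured.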
