
Week 07


Setting up monitoring

To set up monitoring, we extended our Docker Compose files to include Prometheus and Grafana. Configuring Prometheus also required a few changes to our codebase:

  • Adding the corresponding package to our project: prometheus-net.AspNetCore (a Prometheus client that lets us expose a /metrics endpoint and write parameterized application metrics).
  • Adding the corresponding metric implementations: these can be seen in the ApplicationMetrics.cs file.
  • Selecting when to trigger such measurements: their use can be inspected in CatchAllMiddleware.cs (a sketch of the overall wiring follows this list).
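A minimal sketch of how the /metrics endpoint and the middleware are typically wired up with prometheus-net.AspNetCore; the exact registration lives in our startup code and may differ slightly:

```csharp
// Program.cs (illustrative sketch, not the exact code in the repository)
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Default HTTP metrics (request duration, in-progress requests, ...)
// shipped with prometheus-net.AspNetCore.
app.UseHttpMetrics();

// Our custom metrics are recorded in CatchAllMiddleware (shown further below).
app.UseMiddleware<CatchAllMiddleware>();

// Exposes the /metrics endpoint that Prometheus scrapes.
app.MapMetrics();

app.Run();
```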

The Prometheus server is provisioned through the previously mentioned docker-compose.yml file, so it runs inside the same Droplet as the main server. The Prometheus server itself is configured through the prometheus.yml file.
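For reference, a trimmed-down sketch of the two files mentioned above; service names, ports and paths are illustrative, and the actual files in the repository are authoritative:

```yaml
# docker-compose.yml (excerpt, illustrative)
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```

```yaml
# prometheus.yml (sketch): scrape the application's /metrics endpoint
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "minitwit"
    static_configs:
      - targets: ["minitwit-server:80"]   # hypothetical service name/port
```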

NOTE: We are provisioning a persistent Grafana Dashboard through a Digital Ocean droplet.

Server metrics

Besides the default metrics provided by the prometheus-net.AspNetCore package, we decided to create a middleware that is triggered by all incoming requests, so that we expose some extra metrics on the /metrics endpoint to be scraped by Prometheus and visualized in Grafana. A sketch of these metrics and the middleware follows the list below.

  • minitwit_http_request_duration_seconds: by measuring the request duration and labelling it by endpoint, we intend to create a histogram that presents the average response time per endpoint. This can be used to improve poorly behaving endpoints in the future.

  • minitwit_http_requests_total: by counting the total requests received by the application and labelling them by endpoint, we gain visibility into which endpoints matter most for our application; it can also serve as a way of deriving other metrics, such as total registered users or total messages written.

  • minitwit_http_response_status_code_total: by counting the status codes returned in the application's responses, we can introduce monitoring that targets the most critical status codes, such as 401 (Unauthorized), 404 (Not Found), 429 (Too Many Requests), 500 (Internal Server Error), 503 (Service Unavailable), etc.
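A sketch of how these three metrics can be declared and recorded with prometheus-net; the real definitions live in ApplicationMetrics.cs and CatchAllMiddleware.cs, so helper names and label values here are illustrative:

```csharp
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Prometheus;

// ApplicationMetrics.cs (sketch)
public static class ApplicationMetrics
{
    public static readonly Histogram RequestDuration = Metrics.CreateHistogram(
        "minitwit_http_request_duration_seconds",
        "Duration of HTTP requests in seconds.", "endpoint");

    public static readonly Counter RequestsTotal = Metrics.CreateCounter(
        "minitwit_http_requests_total",
        "Total number of HTTP requests.", "endpoint");

    public static readonly Counter ResponseStatusCodesTotal = Metrics.CreateCounter(
        "minitwit_http_response_status_code_total",
        "Total number of HTTP responses by status code.", "status_code");
}

// CatchAllMiddleware.cs (sketch): records the metrics for every request
public class CatchAllMiddleware
{
    private readonly RequestDelegate _next;
    public CatchAllMiddleware(RequestDelegate next) => _next = next;

    public async Task Invoke(HttpContext context)
    {
        var endpoint = context.Request.Path.Value ?? "unknown";
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await _next(context);
        }
        finally
        {
            stopwatch.Stop();
            ApplicationMetrics.RequestDuration
                .WithLabels(endpoint)
                .Observe(stopwatch.Elapsed.TotalSeconds);
            ApplicationMetrics.RequestsTotal.WithLabels(endpoint).Inc();
            ApplicationMetrics.ResponseStatusCodesTotal
                .WithLabels(context.Response.StatusCode.ToString())
                .Inc();
        }
    }
}
```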

Database metrics

Ask Ellie how this was configured before completing this section.

Categories of monitoring

Monitoring the business

These metrics are registered by querying the database for some specific indicators, such as (a sketch follows the list):

  • Messages registered (application usage)
  • Users registered (conversion)
  • Follower registrations (user interaction level)
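A sketch, under assumptions, of how such business gauges could be fed by database queries with prometheus-net; the MiniTwitContext, entity names and metric names below are hypothetical stand-ins for our actual data model:

```csharp
using System.Linq;
using Prometheus;

// Sketch only: gauge names and entity/DbContext names are hypothetical.
public class BusinessMetricsCollector
{
    private static readonly Gauge MessagesRegistered =
        Metrics.CreateGauge("minitwit_messages_registered", "Messages registered (application usage).");
    private static readonly Gauge UsersRegistered =
        Metrics.CreateGauge("minitwit_users_registered", "Users registered (conversion).");
    private static readonly Gauge FollowerRegistrations =
        Metrics.CreateGauge("minitwit_follower_registrations", "Follower registrations (user interaction level).");

    private readonly MiniTwitContext _db;   // hypothetical EF Core DbContext
    public BusinessMetricsCollector(MiniTwitContext db) => _db = db;

    // Called periodically (e.g. from a hosted background service) so that
    // each Prometheus scrape of /metrics sees up-to-date values.
    public void Collect()
    {
        MessagesRegistered.Set(_db.Messages.Count());
        UsersRegistered.Set(_db.Users.Count());
        FollowerRegistrations.Set(_db.Followers.Count());
    }
}
```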

Application monitoring

  • Rate of HTTP requests received per endpoint: we can identify the most requested endpoints. In combination with the average request duration by endpoint, it can help us identify critical parts of our application to refactor in the future (illustrative queries follow this list).
  • Total number of requests (last 24 hours): overall visibility of the application load.
  • Average request duration by endpoint: helps in monitoring the performance of each endpoint and may reveal weaknesses.
  • Total count of errors per status code (last 24 hours).
  • Top 10 endpoints by unhandled exceptions.
  • Top 10 requested endpoints (API).
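Illustrative PromQL expressions behind panels like these, written against the metric and label names described above; the actual Grafana dashboard queries may differ:

```promql
# Rate of HTTP requests received per endpoint (requests per second, 5-minute window)
rate(minitwit_http_requests_total[5m])

# Average request duration by endpoint
rate(minitwit_http_request_duration_seconds_sum[5m])
  / rate(minitwit_http_request_duration_seconds_count[5m])

# Total count of error responses per status code over the last 24 hours
sum by (status_code) (
  increase(minitwit_http_response_status_code_total{status_code=~"4..|5.."}[24h])
)
```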

Infrastructure monitoring

  • Even though we haven't set up this monitoring ourselves, Digital Ocean provides some out-of-the-box monitoring for its Droplets, such as CPU usage, memory usage, disk I/O, disk usage, bandwidth, etc.

Some learnings from monitoring

By analyzing metrics, specifically "average request duration by endpoint", and incorporating research conducted by Ellie, we successfully reduced latency across our endpoints. This improvement is evident in the screenshot below: [screenshot: average request duration by endpoint]

The primary cause of the high latency was the geographical separation between our Database (hosted in the NYC region) and our application server (in the FRA region). Digital Ocean's documentation highlights the potential performance issues caused by hosting database droplets and application servers in different regions. For more details, refer to the Digital Ocean community question on managed MySQL performance.

Releases

We are using the semantic-release plugin together with the "release" workflow file. semantic-release was originally meant for publishing and working with node/npm packages, but we use it purely to automate the creation and publishing of GitHub releases. Check the following link for more examples: LINK
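A sketch of what such a release workflow can look like; the trigger branch, action versions and Node version are illustrative, and the actual "release" workflow file in the repository is authoritative:

```yaml
# .github/workflows/release.yml (sketch)
name: release
on:
  push:
    branches: [main]

jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write        # needed to create tags and GitHub releases
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0     # semantic-release inspects the full commit history
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run semantic-release
        run: npx semantic-release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```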

Making fixes:

  • feat: initial commit # => v1.0.0 on @latest
  • fix: a fix # => v1.0.1 on @latest

Adding features:

  • feat: initial commit # => v1.0.0 on @latest
  • fix: a fix # => v1.0.1 on @latest
  • feat: adding a small feature # => v1.1.0 on @latest

Releasing a breaking change

  • feat: initial commit # => v1.0.0 on @latest
  • feat: drop Node.js 6 support \n\n BREAKING CHANGE: Node.js >= 8 required # => v2.0.0 on @latest

By default, semantic-release uses the Angular Commit Message Conventions and triggers releases based on the following rules: https://github.com/angular/angular.js/blob/master/DEVELOPERS.md#-git-commit-guidelines

  • feat: A new feature
  • fix: A bug fix
  • docs: Documentation only changes
  • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
  • refactor: A code change that neither fixes a bug nor adds a feature
  • perf: A code change that improves performance
  • test: Adding missing or correcting existing tests
  • chore: Changes to the build process or auxiliary tools and libraries such as documentation generation

Copying files from local to remote server

To make sure that the files we have placed on the server are kept up to date, we added a step to the workflow that securely copies (scp) the files to the server every time the workflow runs. Therefore, if we change something in these files (in the remote_files and remote_files_preproduction folders) locally, the changes are automatically deployed to the server on the next workflow run.
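A sketch of such a workflow step; the secret names, host and target paths are illustrative:

```yaml
# Workflow step (sketch): securely copy the deployment files to the Droplet
- name: Copy remote files to server
  run: |
    echo "${{ secrets.SSH_PRIVATE_KEY }}" > key && chmod 600 key
    scp -i key -o StrictHostKeyChecking=no -r \
      remote_files remote_files_preproduction \
      ${{ secrets.SSH_USER }}@${{ secrets.SSH_HOST }}:~/
```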
