Scalyr can't keep up with our logs #525

mo-gr · 2018-11-14T10:52:24Z

One of our applications is creating a lot of logs. They are mostly access logs, so there isn't much we can do to reduce the volume. We create so many logs that the Scalyr agent has trouble processing them. We very frequently see lines like:

10:53:03.131 skipper /var/log/scalyr-agent-2/agent.log  2018-11-14 09:53:03.131Z WARNING [core] [log_processing.py:1734] [error="skipForTooFarBehind"] Skipped copying 105580980 bytes in '/var/log/application.log' due to: Too far behind end of log.  Num of bytes to end is 105580980.

The way I understand this, Scalyr is frequently dropping many MB of logs. This results in very frequent 1-2 minute gaps in the Scalyr logs. Is there anything we (or you) can do about this?
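
To put a number on how much is being dropped, the agent warnings can be summed per file. The following is a minimal sketch, assuming every warning has the same layout as the single sample line quoted above; the regular expression and the helper name `total_skipped_bytes` are illustrative and not part of the Scalyr agent.

```python
import re

# Pattern inferred from the one warning line quoted in this issue,
# not from Scalyr documentation (assumption).
SKIP_RE = re.compile(
    r'\[error="skipForTooFarBehind"\] Skipped copying (?P<bytes>\d+) bytes '
    r"in '(?P<path>[^']+)'"
)


def total_skipped_bytes(lines):
    """Sum the skipped byte counts per log file across agent log lines."""
    totals = {}
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            path = match.group("path")
            totals[path] = totals.get(path, 0) + int(match.group("bytes"))
    return totals


# The warning from this issue, without the viewer prefix.
sample = (
    '2018-11-14 09:53:03.131Z WARNING [core] [log_processing.py:1734] '
    '[error="skipForTooFarBehind"] Skipped copying 105580980 bytes in '
    "'/var/log/application.log' due to: Too far behind end of log.  "
    "Num of bytes to end is 105580980."
)

print(total_skipped_bytes([sample]))
```

For the line above this reports 105580980 bytes for `/var/log/application.log`, which is roughly 100 MiB skipped in a single warning.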

mikkeloscar · 2018-11-14T11:07:06Z

/cc @femueller @vwiessner @christianberg

mo-gr · 2018-11-14T11:08:43Z

We currently observe this on Taupage-AMI-20181101-120344 (ami-0c8c1409048d397a5)

szuecs · 2018-11-14T12:05:55Z

@mo-gr IMHO you should reduce the log volume, because you create a lot of I/O in a latency-critical application affecting the whole business. You can use https://opensource.zalando.com/skipper/reference/filters/#disableaccesslog to disable logging on some of the routes.

aryszka · 2018-11-14T13:40:40Z

I think these access logs all have to be collected for compliance or similar reasons. Sampling is not an option in this case.

szuecs · 2018-11-14T14:48:54Z

@aryszka who said that?
And "I think" is not a good base for a decision like this ;)

@ChristianLohmann @mrandi do you know whether this is true, or can you find someone responsible who can answer whether team pathfinder has to log all access logs in the main shop HTTP router?
This creates a lot of I/O and costs a lot of money. If it is required, there is also a technical challenge that has to be solved.

aermakov-zalando · 2018-11-14T16:45:55Z

@szuecs I'd argue that this is a bug or misconfiguration that should be fixed in Scalyr agent. What's the point of a logging service that can't keep up with a reasonable logging volume for one of our most important applications? And why do the users now have to babysit something as simple as logging and try to customise it on a per-route basis?

szuecs · 2018-11-14T16:59:43Z

@aermakov-zalando I'll let team logging answer your question. In the end, technically you want to sample high-volume logs if that is possible; whether compliance allows it is a different question.
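
As an illustration of what sampling high-volume logs could mean in practice, here is a minimal sketch of deterministic, hash-based line sampling. It is not a feature of skipper or the Scalyr agent; the function name `keep_line` and the `rate` parameter are hypothetical, and it only applies if dropping lines is acceptable at all.

```python
import zlib


def keep_line(line: str, rate: float) -> bool:
    """Keep roughly `rate` (0.0 to 1.0) of all lines, deterministically.

    The CRC32 of the line picks one of 10000 buckets, so the same line
    always gets the same decision. Hypothetical helper for illustration.
    """
    bucket = zlib.crc32(line.encode("utf-8")) % 10000
    return bucket < int(rate * 10000)


# Example: keep about 10% of the access log lines.
lines = [f"GET /item/{i} 200" for i in range(1000)]
sampled = [line for line in lines if keep_line(line, 0.1)]
print(len(sampled), "of", len(lines), "lines kept")
```

Because the decision depends only on the line content, identical lines are either all kept or all dropped; access log lines normally differ in timestamp, so this is rarely a problem.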

christianberg · 2018-11-14T21:01:44Z

I'd suggest not having this discussion in a public GitHub repo. I'll follow up via email.

mrandi · 2018-11-19T15:23:25Z

@szuecs plz create an internal ticket for this. Thx!
