Doesn't handle binary logfile content #63

Open
jplitza opened this issue Jan 10, 2022 · 4 comments

@jplitza

jplitza commented Jan 10, 2022

When processing a logfile that contains binary parts, the following exception gets thrown:

Traceback (most recent call last):
  File "anonip.py", line 508, in <module>
    main()
  File "anonip.py", line 491, in main
    for line in anonip.run(input_file):
  File "anonip.py", line 161, in run
    line = input_file.readline()
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 1040: invalid start byte

While obviously processing purely binary content isn't the target of this project, this issue arose while anonymizing an nginx error.log which contained the following line:

2022/01/09 05:21:49 [info] 58271#58271: *55771 client sent invalid method while reading client request line, client: 192.0.2.0, server: foo.example.org, request: "<binary rubbish>"

Note that there's even an IP address in that line that needs to be anonymized!

So maybe the file shouldn't be read as UTF-8, or as a string at all for that matter, but as bytes?
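
A minimal sketch of the bytes idea, for illustration only: this is not anonip's code, and `split_first_column` is a hypothetical helper. It assumes the IP sits in the first space-separated column (as in an access log) and decodes only that column, so binary rubbish later in the line never reaches a decoder.

```python
import io


def split_first_column(raw: bytes, delimiter: bytes = b" "):
    """Split a raw log line into (first column as text or None, remainder as bytes)."""
    head, sep, tail = raw.partition(delimiter)
    try:
        column = head.decode("ascii")
    except UnicodeDecodeError:
        column = None  # the first column itself is not plain ASCII
    return column, sep + tail


# Hypothetical access-log line with an invalid UTF-8 byte in the request.
sample = io.BytesIO(b'192.0.2.1 - - "GET /\xbf HTTP/1.1" 400\n')
for raw in sample:  # iterating a binary stream yields bytes; nothing is decoded
    column, rest = split_first_column(raw)
    print(column, rest)
```

The remainder stays as bytes and can be written back unchanged, so the output file keeps whatever the client sent.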

@elandorr

I was about to say that

Works for both access.log- and error.log files

isn't true for nginx's default error log format, at least: anonip warns because it reads the year as the IP (it expects the IP in the first column). That's why your IP wasn't anonymized; the line is simply skipped.

WARNING:__main__:'2022' does not appear to be an IPv4 or IPv6 network
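
The wording of that warning matches the `ValueError` raised by `ipaddress.ip_network` in Python's standard library. Assuming that is where it originates (not checked against anonip's source), the sketch below reproduces the message; `network_error` is a hypothetical helper.

```python
import ipaddress


def network_error(column: str):
    """Return the ValueError message for a column, or None if it parses as a network."""
    try:
        ipaddress.ip_network(column, strict=False)
    except ValueError as exc:
        return str(exc)
    return None


print(network_error("2022"))       # the year from an nginx error.log timestamp
print(network_error("192.0.2.0"))  # a real address parses without error
```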

There is no 'clean/standard' way to do this yet as far as I know. If nginx devs happen to read this: It's best to have the anonymization as a second process and not in nginx itself, so we can have x days of full logs for debugging/security/whatever and then just anonymize later for archiving.
Maybe anonip could be this 'standard' at some point if we can handle at least default formats.

  1. Accidentally catching strings that look like IPs seems unavoidable given the random nature of errors. Right now it looks for the configured column, so it misses them completely.

  2. Do we need correct error messages in archives?
    If not, it could simply regex everything. The reason I chose anonip was because it has 'proper' IPv4 and IPv6 regex so I don't have to figure out all edge cases :).
    If we regex the whole file, we just need a proven pattern that always works and a script that can handle possible binary content.
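
A rough sketch of the "regex everything" idea on raw bytes, under stated assumptions: the pattern only collects candidate tokens made of hex digits, colons and dots, and the standard `ipaddress` module decides whether a token really is an address. The names `CANDIDATE`, `KEEP_BITS`, `mask_match` and `anonymize_line` are hypothetical, the prefix lengths are illustration values, and this is not how anonip currently works.

```python
import ipaddress
import re

# Candidate tokens only; validation happens in ipaddress, not in the regex.
CANDIDATE = re.compile(rb"[0-9A-Fa-f:.]{3,}")

# Prefix bits to keep per IP version (illustrative values, not a recommendation).
KEEP_BITS = {4: 20, 6: 44}


def mask_match(match):
    """Replace a token with its masked network address if it is a valid IP."""
    token = match.group(0)
    try:
        addr = ipaddress.ip_address(token.decode("ascii"))
    except ValueError:
        return token  # timestamps, PIDs, hex-looking words: leave untouched
    net = ipaddress.ip_network(f"{addr}/{KEEP_BITS[addr.version]}", strict=False)
    return str(net.network_address).encode("ascii")


def anonymize_line(raw: bytes) -> bytes:
    """Mask every valid IP in a raw line; all other bytes pass through unchanged."""
    return CANDIDATE.sub(mask_match, raw)


line = (b'2022/03/28 09:39:29 [error] 2656#2656: *120128 access forbidden by rule, '
        b'client: 2001:db8:1234:5678::1, server: _, request: "GET /\xc0 HTTP/1.1"\n')
print(anonymize_line(line))
```

As point 1 says, this will also mask anything else in the line that happens to parse as an address, and it misses addresses glued to a trailing dot or colon.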

The way I see it, HTTP logs are only useful for rough statistics anyway, so we really don't need the errors. Dropping a few bits still gives us a little country-level accuracy while respecting privacy. If we refuse to terrorize people via Google Analytics or similar, we will only have basic stats anyhow.

For now I just delete the logs regularly.

@open-dynaMIX
Member

Hey @jplitza

Thanks for reporting this issue. Could you please provide an example file that leads to this exception?

@jplitza
Author

jplitza commented Mar 28, 2022

Well, as I said, I encountered this in nginx's error.log. Something like curl http://example.org/$'\xC0' might produce the following line in your error.log (if it's answered with 403 due to server configuration):

2022/03/28 09:39:29 [error] 2656#2656: *120128 access forbidden by rule, client: 2001:db8::, server: _, request: "GET /<C0> HTTP/1.1", host: "example.org"

Note that <C0> is just less's rendition of the actual \xC0 byte (an À in ISO-8859-1, if you care).

Here's that line as a single file: anonip_issue_63.log
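
For anyone without the attachment, a minimal sketch that rebuilds the line from the text quoted above and shows the decode failure. The bytes are a reconstruction, not the attached file itself, and `utf8_error` is a hypothetical helper.

```python
# Reconstruction of the quoted log line with a raw 0xC0 byte in the request path.
RAW_LINE = (
    b'2022/03/28 09:39:29 [error] 2656#2656: *120128 access forbidden by rule, '
    b'client: 2001:db8::, server: _, request: "GET /\xc0 HTTP/1.1", host: "example.org"\n'
)


def utf8_error(raw: bytes):
    """Return the UnicodeDecodeError raised for raw, or None if it is valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


error = utf8_error(RAW_LINE)
print(error)  # same kind of error as in the original traceback

# ISO-8859-1 maps every byte to a character, so this decode cannot fail.
text = RAW_LINE.decode("iso-8859-1")
print(ascii(text))  # ascii() keeps the output printable on any terminal
```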

@jplitza
Author

jplitza commented Jun 9, 2022

My minimal workaround for this problem is now using errors="surrogateescape" for both open() calls in main(). This still throws when you don't use --input and --output, but it works for me.
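
A self-contained sketch of why that workaround round-trips cleanly: with `errors="surrogateescape"`, undecodable bytes become lone surrogates on read and are turned back into the same bytes on write. `copy_lines` and `reconfigure_std_streams` are hypothetical helpers, not anonip functions; the second is an untested idea for the stdin/stdout case and is defined here but not called.

```python
import os
import sys
import tempfile


def copy_lines(src_path: str, dst_path: str) -> None:
    """Copy a log line by line in text mode without losing undecodable bytes."""
    # newline="" disables newline translation so the copy is byte-for-byte.
    options = {"encoding": "utf-8", "errors": "surrogateescape", "newline": ""}
    with open(src_path, "r", **options) as src, open(dst_path, "w", **options) as dst:
        for line in src:
            dst.write(line)


def reconfigure_std_streams() -> None:
    """Untested idea for piped input: TextIOWrapper.reconfigure (Python 3.7+)."""
    sys.stdin.reconfigure(errors="surrogateescape")
    sys.stdout.reconfigure(errors="surrogateescape")


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "error.log")
    dst_path = os.path.join(tmp, "error.anon.log")
    raw = b'client: 2001:db8::, request: "GET /\xc0 HTTP/1.1"\n'
    with open(src_path, "wb") as handle:
        handle.write(raw)
    copy_lines(src_path, dst_path)
    with open(dst_path, "rb") as handle:
        round_tripped = handle.read()
    print(round_tripped == raw)
```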
