Handling invalid UTF-8 bytes #38

sunfishcode · 2020-01-08T20:12:49Z

I'm looking at using vte for a use case where I want to translate invalid UTF-8 bytes into Unicode replacement characters. However, vte seems to silently swallow some invalid UTF-8 bytes. For example, if I feed it input consisting of the byte 0x90, it produces no events.

Would it make sense to add Execute rules to the Ground table for 0x90 and other formerly special C1 codes?

Would it make sense to introduce something like an InvalidUtf8 action, to fill in the Ground table in general?

chrisduerr · 2020-01-08T20:24:11Z

Non-utf8 8-bit C1 escapes should be passed to execute, so you should be able to handle C1 codes if that's your issue?

sunfishcode · 2020-01-08T20:27:34Z

Here's a more specific testcase:

$ echo -e '\x90' > test.txt
$ target/debug/examples/parselog < test.txt
[execute] 0a
$

The 0x90 byte is silently dropped with no execute or any other action.
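For comparison, the output I'm after for this input is what Rust's standard `String::from_utf8_lossy` produces. This is only a minimal sketch of the desired result, independent of vte; the helper name `to_replacement_text` is made up for illustration.

```rust
/// Decode bytes as UTF-8, replacing every invalid sequence with U+FFFD.
/// Hypothetical helper for illustration; it does not use vte at all.
fn to_replacement_text(bytes: &[u8]) -> String {
    // from_utf8_lossy never fails: invalid bytes become U+FFFD.
    String::from_utf8_lossy(bytes).into_owned()
}
```

With the test file above (`0x90` followed by a newline), this yields one replacement character followed by the newline, rather than dropping the byte.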

chrisduerr · 2020-01-08T20:40:38Z

\x90 is an escape introducer, which is stripped for security based on my understanding of the code.

So escapes like \x85 will emit an execute, but the DCS (\x90), CSI (\x9b), and OSC (\x9d) 8-bit escapes are ignored.
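To make the distinction concrete, here is a rough sketch of how those bytes could be grouped. It is not vte's actual state table; the `C1Kind` enum and `classify_c1` function are hypothetical names, and only the three introducers named above are singled out.

```rust
#[derive(Debug, PartialEq)]
enum C1Kind {
    /// One of the 8-bit introducers named in this thread (DCS, CSI, OSC).
    Introducer(&'static str),
    /// Any other byte in the C1 range 0x80..=0x9f, such as NEL (0x85).
    OtherC1,
    /// Not a C1 control byte at all, such as 0xfd.
    NotC1,
}

/// Hypothetical classifier for illustration; not taken from vte.
fn classify_c1(byte: u8) -> C1Kind {
    match byte {
        0x90 => C1Kind::Introducer("DCS"),
        0x9b => C1Kind::Introducer("CSI"),
        0x9d => C1Kind::Introducer("OSC"),
        0x80..=0x9f => C1Kind::OtherC1,
        _ => C1Kind::NotC1,
    }
}
```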

sunfishcode · 2020-01-08T20:50:24Z

I don't actually want to interpret C1 controls in my use case; I want to replace all non-UTF-8 bytes with replacement characters.

Right now, vte doesn't support that, either for bytes like 0x90 which are C1 controls, or bytes like 0xfd which are not. Is this a use case vte is interested in supporting?

chrisduerr · 2020-01-08T21:14:34Z

> Is this a use case vte is interested in supporting?

I'm not sure if it's possible to support that without removing existing functionality.

Take the NEL non-UTF-8 8-bit C1 escape \x85, for example. We trigger the execute function for it with the byte attached, so it's a valid escape that we propagate upstream for handling. It's not actually invalid at all.

You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.

sunfishcode · 2020-01-08T22:13:24Z

> You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.

Yes, that's what I want to do. It's ok if vte reports these bytes through execute or a new invalid hook or some other hook. I just want to know when these bytes happen so that I know when to emit replacement characters.

Specifically, I want to do this for both C1 codes like 0x90, and non-C1 codes like 0xfd. I can cope if these two cases are reported differently, and it's even ok if the API doesn't tell me what the actual bytes are, as long as it provides indications that such bytes were processed.
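As a sketch of the kind of reporting I mean, here is a standalone decoder built on the standard library's `std::str::from_utf8`. The `Event` enum and `decode_events` function are hypothetical and are not part of vte's API. Note that it treats C1 bytes like 0x90 as invalid rather than as controls, and it is not streaming: a sequence cut off at the end of the input is reported as invalid.

```rust
use std::str;

/// Hypothetical events for illustration; not vte's Perform trait.
#[derive(Debug, PartialEq)]
enum Event {
    /// A successfully decoded character.
    Print(char),
    /// A run of bytes that is not valid UTF-8.
    Invalid(Vec<u8>),
}

fn decode_events(mut input: &[u8]) -> Vec<Event> {
    let mut events = Vec::new();
    while !input.is_empty() {
        match str::from_utf8(input) {
            Ok(valid) => {
                // The rest of the input is valid; emit it and stop.
                events.extend(valid.chars().map(Event::Print));
                break;
            }
            Err(err) => {
                // Everything before valid_up_to() is known to be valid UTF-8.
                let (valid, rest) = input.split_at(err.valid_up_to());
                let valid = str::from_utf8(valid).expect("prefix was validated");
                events.extend(valid.chars().map(Event::Print));
                // error_len() is None when the input ends mid-sequence;
                // in that case report all remaining bytes as invalid.
                let bad_len = err.error_len().unwrap_or(rest.len());
                events.push(Event::Invalid(rest[..bad_len].to_vec()));
                input = &rest[bad_len..];
            }
        }
    }
    events
}
```

A consumer that only cares about replacement characters could map every `Event::Invalid` to U+FFFD and ignore the bytes it carries.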

chrisduerr · 2020-01-08T22:23:43Z

For actually invalid UTF-8, we already print error glyphs (see echo -e "\xc2\xc2"). So as far as I can tell we'd probably just need to make sure that bytes that are ignored right now are somehow propagated (like C1 DCS/CSI/OSC).

For these specific bytes it would be possible to propagate them to the execute function without actually handling them. I'm not sure about other things like 0xfd, though; I'd have to look into that myself.

sunfishcode changed the title ~~Handling invalid UTF_8 bytes~~ Handling invalid UTF-8 bytes Jan 8, 2020

chrisduerr added the enhancement label Jan 8, 2020

This was referenced Jun 15, 2020

Process all non-ASCII bytes with the UTF-8 parser. #58

Closed

Translate non-UTF-8 byte sequences into replacement characters #60

Closed
