Process all non-ASCII bytes with the UTF-8 parser. #58

sunfishcode · 2020-06-15T16:32:14Z

The UTF-8 parser knows how to handle invalid byte sequences, so don't
preprocess the input in the main parser; just hand any non-ASCII byte to
the UTF-8 parser to handle.

This includes what were previously interpreted as 8-bit C1 control codes;
they are now interpreted as UTF-8 continuation characters.

This is the first in a series of patches which together fix #38 and a few related issues, in total enabling vte to process non-UTF-8 bytes in a manner consistent with Rust's String::from_utf8_lossy.

Removing support for the 8-bit C1 codes isn't necessary to achieve this, and I can revise this patch to restore support if desired, but removing them does simplify the table and the API. And the 8-bit C1 codes are rarely used and not well supported anyway. vte already doesn't appear to recognize 8-bit DCS, CSI, OSC, PM, or APC, Alacritty doesn't appear to intepret 8-bit NEL, SS2, or SS3, and the remaining 8-bit C1 codes are rarely used.

The UTF-8 parser knows how to handle invalid byte sequences, so don't preprocess the input in the main parser; just hand any non-ASCII byte to the UTF-8 parser to handle. This includes what were previously interpreted as 8-bit C1 control codes; they are now interpreted as UTF-8 continuation characters.

chrisduerr

Why exactly would we do this? We want execute to be called.

This is the first in a series of patches which together fix #38 and a few related issues, in total enabling vte to process non-UTF-8 bytes in a manner consistent with Rust's String::from_utf8_lossy.

If your goal is to implement something, could you please send the entire pull request to add this so the approach you've taken can be properly evaluated as a whole?

Unless the patch itself provides a direct value, there's really no reason to split them up. As far as I can tell, this patch just breaks things.

sunfishcode · 2020-06-15T19:31:13Z

I've now posted #60 with the full series this is part of.

The motivation for this is that from a UTF-8 perspective, 8-bit C1 control codes are continuation bytes, and if they appear alone, they form invalid UTF-8 sequences. Passing them to execute allows users to emit replacement characters themselves, but it's a little awkward if vte translates some invalid sequences to replacement characters and not others. It also turns out to be tricky to emit the expected number of replacement characters from execute because vte is processing some of the bytes and not others. Since vte is emitting replacement characters for some invalid sequences, and has most of the logic needed to determine how many replacement characters to emit, the simplest approach is just to have vte emit the replacement characters.

chrisduerr · 2020-06-15T19:40:38Z

Closing in favor of #60, I'll take a look once that is in a reviewable state.

sunfishcode force-pushed the non-ascii-bytes branch from f36bebf to f75b04c Compare June 15, 2020 16:37

chrisduerr requested changes Jun 15, 2020

View reviewed changes

sunfishcode mentioned this pull request Jun 15, 2020

Translate non-UTF-8 byte sequences into replacement characters #60

Closed

chrisduerr closed this Jun 15, 2020

sunfishcode deleted the non-ascii-bytes branch June 15, 2020 19:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Process all non-ASCII bytes with the UTF-8 parser. #58

Process all non-ASCII bytes with the UTF-8 parser. #58

sunfishcode commented Jun 15, 2020

chrisduerr left a comment •

edited

Loading

sunfishcode commented Jun 15, 2020

chrisduerr commented Jun 15, 2020

Process all non-ASCII bytes with the UTF-8 parser. #58

Process all non-ASCII bytes with the UTF-8 parser. #58

Conversation

sunfishcode commented Jun 15, 2020

chrisduerr left a comment • edited Loading

Choose a reason for hiding this comment

sunfishcode commented Jun 15, 2020

chrisduerr commented Jun 15, 2020

chrisduerr left a comment •

edited

Loading