Multibyte UTF-8 characters break the line editor #18

Open
jeremy-pereira opened this issue Apr 8, 2021 · 1 comment

Description

I am writing a REPL for the Lambda Calculus and I incorporated linenoise-swift to provide line editing functionality. Unfortunately, the Greek letter lambda (λ) is encoded in UTF-8 as two bytes: CE BB. linenoise-swift handles input one byte at a time, so it splits the λ. The same problem occurs for any Unicode code point that takes more than one byte to store in UTF-8, i.e. everything except 7-bit US ASCII.
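To illustrate the byte-splitting problem, here is a minimal sketch of reading one whole character instead of one byte. The names `utf8SequenceLength` and `readCharacter` are hypothetical and are not part of linenoise-swift; the `nextByte` closure stands in for whatever the library uses to read from the terminal.

```swift
// "λ" (U+03BB) occupies two bytes in UTF-8.
let lambdaBytes = Array("λ".utf8)   // [0xCE, 0xBB]

/// Hypothetical helper: the length in bytes of the UTF-8 sequence that
/// starts with `lead`, or nil if `lead` cannot start a sequence.
func utf8SequenceLength(leadByte lead: UInt8) -> Int? {
    switch lead {
    case 0x00...0x7F: return 1    // 0xxxxxxx: 7-bit ASCII
    case 0xC2...0xDF: return 2    // 110xxxxx
    case 0xE0...0xEF: return 3    // 1110xxxx
    case 0xF0...0xF4: return 4    // 11110xxx
    default:          return nil  // continuation byte or invalid lead
    }
}

/// Hypothetical helper: pulls bytes from `nextByte` until one complete
/// UTF-8 sequence has been read, then decodes it.
func readCharacter(nextByte: () -> UInt8?) -> Character? {
    guard let lead = nextByte(),
          let length = utf8SequenceLength(leadByte: lead) else { return nil }
    var bytes = [lead]
    while bytes.count < length {
        // Every byte after the lead must look like 10xxxxxx.
        guard let next = nextByte(), next & 0xC0 == 0x80 else { return nil }
        bytes.append(next)
    }
    return String(decoding: bytes, as: UTF8.self).first
}

// Usage: feed the bytes of "aλ" through the reader.
var sampleInput = Array("aλ".utf8).makeIterator()
let firstCharacter = readCharacter(nextByte: { sampleInput.next() })   // "a"
let secondCharacter = readCharacter(nextByte: { sampleInput.next() })  // "λ"
```

This only assembles a single code point per call, so it does not address characters built from several code points.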

How to Reproduce

Run the linenoiseDemo command line app. Type in a few characters and then a λ. The cursor will be repositioned at the start of the line and garbage appended to the end of the line. Here is an example:

Type 'exit' to quit
 gdggfdsgdsλ
utput: gdggfdsgdsλ
? 

If you are having trouble producing a λ from your keyboard, the problem still manifests if you copy-paste it from the text of this issue.

Further Information

I made an attempt to fix the issue myself. You can see my attempt here. The patch is a lot bigger than you might expect because adding support for multibyte UTF-8 exposes another, more subtle bug.

Consider the following code in class EditLine:

    func insertCharacter(_ char: Character) {
        let origLoc = location
        let origEnd = buffer.endIndex
        buffer.insert(char, at: location)
        location = buffer.index(after: location)
        
        if origLoc == origEnd {
            location = buffer.endIndex
        }
    }

The Apple documentation for insert(_:at:) says:

Calling this method invalidates any existing indices for use with this string

This means that location, origLoc and origEnd are all invalid after the insert. If it's a single-byte character we get away with it. If not, location ends up as a garbage value and causes a process abort when it is next used. I ended up changing the type of buffer to [Character] and the type of location to Int as the easy way out.
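For illustration, the [Character] and Int approach could look roughly like the sketch below. `CharacterBuffer` is a hypothetical stand-in, not the actual patch, and the real EditLine class carries more state than this.

```swift
/// Hypothetical stand-in for the editing buffer.
struct CharacterBuffer {
    private(set) var characters: [Character] = []
    private(set) var location = 0   // cursor offset, 0...characters.count

    /// Inserts `char` at the cursor and moves the cursor past it.
    mutating func insertCharacter(_ char: Character) {
        characters.insert(char, at: location)
        // An Int offset stays meaningful after the array is mutated,
        // unlike a String.Index taken before the insert.
        location += 1
    }

    /// Moves the cursor one character left; returns false at the start.
    @discardableResult
    mutating func moveLeft() -> Bool {
        guard location > 0 else { return false }
        location -= 1
        return true
    }

    var text: String { String(characters) }
}

// Usage: type "aλb", step back over the "b", then insert another λ.
var editBuffer = CharacterBuffer()
for char in "aλb" { editBuffer.insertCharacter(char) }
editBuffer.moveLeft()
editBuffer.insertCharacter("λ")
// editBuffer.text is now "aλλb" and editBuffer.location is 3
```

An alternative that keeps buffer as a String would be to store the cursor as an integer offset and recompute the String.Index from it after each mutation, at the cost of a linear walk per lookup.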

NB: I can give you a pull request or a patch if it helps, but it hasn't been extensively tested and probably still breaks with composed characters, e.g. emoji.

andybest (Owner) commented Apr 9, 2021

Hey,

Unfortunately, as you found, LineNoise doesn't support UTF-8 (I don't believe that the original LineNoise library does either, but I could be wrong). I'd be happy to accept a pull request with the added functionality, but it would probably be best if UTF-8 support were enabled with a flag, as not all terminals support it.
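One possible way to pick a default for such a flag is to look at the locale environment variables, sketched below. This is an assumption rather than anything linenoise-swift currently does: `localeSuggestsUTF8` and `utf8Enabled` are hypothetical names, and the locale is only a hint about what the terminal actually supports.

```swift
import Foundation

/// Hypothetical helper: guesses whether the terminal expects UTF-8 by
/// checking the locale variables in POSIX precedence order.
func localeSuggestsUTF8(_ environment: [String: String]) -> Bool {
    for name in ["LC_ALL", "LC_CTYPE", "LANG"] {
        // An empty value counts as unset, so try the next variable.
        guard let value = environment[name], !value.isEmpty else { continue }
        let normalized = value.uppercased()
        // The first variable that is set decides the answer.
        return normalized.contains("UTF-8") || normalized.contains("UTF8")
    }
    return false
}

// Hypothetical default for the flag; a caller could still override it.
let utf8Enabled = localeSuggestsUTF8(ProcessInfo.processInfo.environment)
```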
