Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CharSourceStrscan does not work correctly with UTF-8 strings. Remove it. #144

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

outcassed
Copy link

@outcassed outcassed commented Dec 12, 2017

CharSourceStrScan, an alternate CharSource implementation that is not enabled by default, expects characters to be 1 byte. UTF-8 strings break it.

This removes it entirely.

Example:

Rendering

<p>ö <strong>a</strong></p>

In Ruby 1.9.x:

<p>ö &lt;strong&gt;a&lt;/strong&gt;</p>

In Ruby 2.1 and above:

parse_span.rb:32:in `read_span': invalid byte sequence in UTF-8 (ArgumentError)

CharSourceStrScan expects characters to be 1 byte, so strange things happen.

For example, rendering:

````
<p>ö <strong>a</strong></p>
```

In Ruby 1.9.x:

```
<p>ö &lt;strong&gt;a&lt;/strong&gt;</p>
```

In Ruby 2.1 and above:

```
maruku/lib/maruku/input/parse_span.rb:32:in `read_span': invalid byte sequence in UTF-8 (ArgumentError)
```
@coveralls
Copy link

coveralls commented Dec 12, 2017

Coverage Status

Coverage increased (+1.4%) to 78.793% when pulling d68f785 on caseyf:caseyf-remove-charsourcestrscan into ec44b27 on bhollis:master.

@distler
Copy link
Collaborator

distler commented Dec 12, 2017

Alternatively, one can fix CharSourceStrscan to be multi-byte-aware.

I would still make CharSourceManual the default, 'cuz it's faster.

@outcassed
Copy link
Author

A multi-byte aware implementation would replace these methods. Here is a stab at it:

class CharSourceStrscan
    def cur_char
      @scanner.match?(/./m) && @scanner.matched
    end

    def cur_chars(n)
      r = Regexp.new(".{0,#{n}}", Regexp::MULTILINE)
      @scanner.match?(r) && @scanner.matched
    end
    
    def next_char
      @scanner.match?(/../m) && @scanner.matched && @scanner.matched.last
    end
    
    def shift_char
      @scanner.getch
    end
    
    def ignore_char
      @scanner.getch
      nil
    end
    
    def ignore_chars(n)
      n.times { @scanner.getch }
      nil
    end
end

@distler
Copy link
Collaborator

distler commented Dec 12, 2017

If there's interest in a multi-byte-aware version, I can make a pull request out of the above-linked commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants