CharSourceStrscan does not work correctly with UTF-8 strings. Remove it. #144

outcassed · 2017-12-12T21:10:39Z

CharSourceStrscan, an alternate CharSource implementation that is not enabled by default, expects every character to be 1 byte. UTF-8 strings containing multi-byte characters break it.

This removes it entirely.

Example:

Rendering

<p>ö <strong>a</strong></p>

In Ruby 1.9.x:

<p>ö &lt;strong&gt;a&lt;/strong&gt;</p>

In Ruby 2.1 and above:

parse_span.rb:32:in `read_span': invalid byte sequence in UTF-8 (ArgumentError)
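The byte/character mismatch can be seen without Maruku. This is a minimal sketch using only the standard `strscan` library; the sample string and variable names are made up for illustration, and it does not reproduce the exact code path in `parse_span.rb`.

```ruby
require 'strscan'

text = "ö a" # "ö" takes two bytes in UTF-8

# Character count and byte count diverge once a non-ASCII character appears.
char_count = text.length   # counts characters
byte_count = text.bytesize # counts bytes

scanner = StringScanner.new(text)
first = scanner.getch         # getch consumes one whole character
pos_after_first = scanner.pos # pos is a byte offset, not a character index

# Taking a single byte out of the middle of "ö" yields an invalid UTF-8 string,
# the kind of value that later raises "invalid byte sequence in UTF-8".
broken = text.byteslice(1, 1)
broken_valid = broken.valid_encoding?
```

Any code that treats `scanner.pos` as a character index, or that slices one byte at a time, will split "ö" in this way.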


coveralls · 2017-12-12T21:16:36Z

Coverage increased (+1.4%) to 78.793% when pulling d68f785 on caseyf:caseyf-remove-charsourcestrscan into ec44b27 on bhollis:master.

distler · 2017-12-12T21:26:57Z

Alternatively, one can fix CharSourceStrscan to be multi-byte-aware.

I would still make CharSourceManual the default, 'cuz it's faster.

outcassed · 2017-12-12T21:36:08Z

A multi-byte aware implementation would replace these methods. Here is a stab at it:

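class CharSourceStrscan
  # Character at the scan pointer, without advancing (nil at end of input).
  def cur_char
    @scanner.match?(/./m) && @scanner.matched
  end

  # Up to n characters starting at the scan pointer, without advancing.
  def cur_chars(n)
    r = Regexp.new(".{0,#{n}}", Regexp::MULTILINE)
    @scanner.match?(r) && @scanner.matched
  end

  # Character after the current one, without advancing.
  # String#last is not core Ruby, so index the last character instead.
  def next_char
    @scanner.match?(/../m) && @scanner.matched[-1]
  end

  # Consume and return one character; getch respects the string's encoding.
  def shift_char
    @scanner.getch
  end

  def ignore_char
    @scanner.getch
    nil
  end

  def ignore_chars(n)
    n.times { @scanner.getch }
    nil
  end
end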
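To check how these methods behave on a multi-byte string, here is a small standalone sketch. `DemoCharSource` is a hypothetical stand-in, not Maruku's class: it assumes only that `@scanner` is a `StringScanner` over the input, and it uses `matched[-1]` because `String#last` is not part of core Ruby.

```ruby
require 'strscan'

# Hypothetical stand-in for CharSourceStrscan; only the scanner setup is assumed.
class DemoCharSource
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # Peek at the current character without advancing.
  def cur_char
    @scanner.match?(/./m) && @scanner.matched
  end

  # Peek at the character after the current one.
  def next_char
    @scanner.match?(/../m) && @scanner.matched[-1]
  end

  # Consume and return the current character.
  def shift_char
    @scanner.getch
  end
end

src = DemoCharSource.new("öa")
first = src.cur_char       # peeks at "ö" as a whole character
second = src.next_char     # peeks one character further
shifted = src.shift_char   # consumes "ö", both of its bytes
after_shift = src.cur_char # pointer now sits on "a"
```

Because both the regex match and `getch` operate on characters in the string's encoding, none of these calls splits "ö" in the middle.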

distler · 2017-12-12T21:38:16Z

If there's interest in a multi-byte-aware version, I can make a pull request out of the above-linked commits.
