Incorrect handling of tabs in link resources compared to spaces, newlines #191

wooorm · 2020-05-28T09:31:15Z

The spec currently says that whitespace can exist, and in some cases must exist, around the destination and title inside the parentheses: https://spec.commonmark.org/0.29/#inline-link
I recently clarified the wording in "Clarify wording in spec for character groups" (commonmark-spec#618).
But whitespace has been used since the start: https://github.com/commonmark/commonmark-spec/blame/858a28941d0dd17c24b7240f21372652111bd38b/spec.txt#L7495

The bug probably stems from line 651 in 8c698a2:

this.spnl() &&

which seems to include spaces and newlines but not tabs.

This bug can be reproduced with the following permalink to the dingus: https://spec.commonmark.org/dingus/?text=tab%3A%20%5Bx%5D(%09y)%0Aspace%3A%20%5Bx%5D(%20y)%0A%0Atab%3A%20%5Bx%5D(y%09)%0Aspace%3A%20%5Bx%5D(y%20)%0A%0Atab%3A%20%5Bx%5D(%09%3Cy%3E)%0Aspace%3A%20%5Bx%5D(%20%3Cy%3E)%0A%0Atab%3A%20%5Bx%5D(y%09%22z%22)%0Aspace%3A%20%5Bx%5D(y%20%22z%22)%0A.

The expected behavior is that tabs and spaces behave the same.


mikeando · 2022-06-23T13:16:43Z

It looks like all that would be required is to change one line in inlines.js: https://github.com/commonmark/commonmark.js/blob/master/lib/inlines.js#L74

var reSpnl = /^ *(?:\n *)?/;

to

var reSpnl = /^[ \t]*(?:\n[ \t]*)?/;

but I've not built commonmark.js before, so I'm not sure how to go about testing it.
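A quick way to check the proposed change without building the library is to run both regexes against the link-resource cases from the issue. This is only a sketch: the names `reSpnlOld`, `reSpnlNew`, and `consumed` are made up for illustration and are not part of commonmark.js.

```javascript
// Regex as quoted above from lib/inlines.js: spaces and up to one newline.
var reSpnlOld = /^ *(?:\n *)?/;
// Proposed replacement: spaces or tabs and up to one newline.
var reSpnlNew = /^[ \t]*(?:\n[ \t]*)?/;

// Hypothetical helper: number of leading characters the regex consumes.
function consumed(re, text) {
  var match = re.exec(text);
  return match ? match[0].length : 0;
}

console.log(consumed(reSpnlOld, " y)"));        // 1: the space is skipped
console.log(consumed(reSpnlOld, "\ty)"));       // 0: the tab is not skipped
console.log(consumed(reSpnlNew, "\ty)"));       // 1: the tab is skipped
console.log(consumed(reSpnlNew, " \t\n\t y)")); // 5: mixed run with one newline
```

Both regexes can match the empty string, so they never fail outright; the difference only shows up in how much input is consumed before the destination.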

mikeando · 2022-06-23T13:27:44Z

Actually, that new regex only covers this one case, but the 0.30 spec says we should interpret "whitespace" as:

A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab (U+0009), line feed (U+000A), form feed (U+000C), or carriage return (U+000D).

so while we're in here we should add those.

mikeando · 2022-06-23T13:40:25Z

It looks like all of the whitespace definitions in that part of the code could do with a cleanup.

var reWhitespaceChar = /^[ \t\n\x0b\x0c\x0d]/;
var reUnicodeWhitespaceChar = /^\s/;
var reFinalSpace = / *$/;
var reInitialSpace = /^ */;

reWhitespaceChar includes U+000B, which it should not.
reUnicodeWhitespaceChar (\s) includes U+000B, which it should not, and I believe the regex is not Unicode-aware.
Final space and initial space should also allow other space characters.

I'm not even sure that we need to differentiate between unicode-white-space and white-space; the spec seems to have changed between 0.29 and 0.30 to only have the definition for unicode-white-space.
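To make the differences concrete, here is a small sketch comparing the two existing regexes with one written directly from the 0.30 definition quoted earlier. `reSpecUnicodeWhitespace` is an assumption of this sketch, not code from commonmark.js, and it relies on Unicode property escapes (`\p{Zs}` with the `u` flag). Note that JavaScript's `\s` does match some non-ASCII characters even without the `u` flag, but its set is not the same as the spec's.

```javascript
// The two definitions quoted above from lib/inlines.js.
var reWhitespaceChar = /^[ \t\n\x0b\x0c\x0d]/;
var reUnicodeWhitespaceChar = /^\s/;

// Assumption: a direct transcription of the 0.30 "Unicode whitespace
// character" definition (Zs category, tab, LF, FF, CR).
var reSpecUnicodeWhitespace = /^[\p{Zs}\t\n\f\r]/u;

var samples = [
  ["U+000B vertical tab", "\x0b"],
  ["U+00A0 no-break space (Zs)", "\u00a0"],
  ["U+3000 ideographic space (Zs)", "\u3000"],
  ["U+2028 line separator (Zl)", "\u2028"]
];

samples.forEach(function (sample) {
  console.log(
    sample[0],
    "| reWhitespaceChar:", reWhitespaceChar.test(sample[1]),
    "| \\s:", reUnicodeWhitespaceChar.test(sample[1]),
    "| spec sketch:", reSpecUnicodeWhitespace.test(sample[1])
  );
});
```

The vertical tab is accepted by both existing regexes but not by the spec-based sketch, and the line separator is accepted by `\s` only.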

wooorm · 2022-06-23T13:43:01Z

Where did you see that this case should be covered by unicode whitespace?
I don’t believe it is. I believe only emphasis/strong use unicode whitespace.

mikeando · 2022-06-23T13:47:32Z

Where did you see that this case should be covered by unicode whitespace?

The spec says "white-space" and the only definition for whitespace is now "unicode whitespace".
In 0.29 there was a definition for whitespace in the same location, but it was removed.

Maybe ascii-whitespace was intended, and the definition should not have been removed...
IIRC in 0.29 whitespace did include (U+000B) so the changes would be fewer.

wooorm · 2022-06-23T13:53:48Z

In 0.29 it says whitespace. In 0.30 it doesn’t. I believe I fixed that confusion. Or, where do you see whitespace?

mikeando · 2022-06-23T13:54:36Z

You're right the spec does explicitly call it out as spaces/tabs in links.

These four components may be separated by spaces, tabs, and up to one line ending.

mikeando · 2022-06-23T13:56:33Z

In that case, my initial fix should be sufficient I think.

I'm misremembering from some other issues I've been digging into for space handling in HTML elements... sorry.

wooorm · 2022-06-23T13:58:57Z

Yeah. I also found this confusing. That’s why I fixed it :)
In markdown, only [ \t], and in some cases \r\n|\n|\r are considered as “whitespace”.
In markdown, only ASCII punctuation is considered as punctuation.
Except for emphasis/strong, which checks for unicode whitespace and unicode punctuation.
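The ASCII versus Unicode punctuation split described above can be sketched with two regexes. Both names are hypothetical, and the Unicode variant assumes the 0.30 definition (ASCII punctuation plus the Unicode `P*` general categories), written here with `\p{P}` and the `u` flag.

```javascript
// ASCII punctuation: U+0021-2F, U+003A-40, U+005B-60, U+007B-7E.
var reAsciiPunctuation = /^[!-\/:-@\[-`{-~]/;

// Assumption: 0.30-style Unicode punctuation, i.e. ASCII punctuation
// or any code point in a Unicode punctuation (P*) category.
var reUnicodePunctuation = /^(?:[!-\/:-@\[-`{-~]|\p{P})/u;

console.log(reAsciiPunctuation.test("*"));        // true
console.log(reAsciiPunctuation.test("\u00ab"));   // false: U+00AB is not ASCII
console.log(reUnicodePunctuation.test("\u00ab")); // true: U+00AB is in category Pi
console.log(reUnicodePunctuation.test("a"));      // false
```

Under this reading, only the emphasis/strong delimiter checks would use the second regex; everything else would use the first.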

mikeando · 2022-06-23T14:27:16Z

This is wandering off-topic a little - but this is enlightening for me.

Does this mean the current parsing of <T\b> as HTML is wrong?
The \b can't be part of the tag-name (since it can only be ASCII digits, letters or hyphens).
It similarly can't form an attribute-name as that must start with an ASCII letter, and it
is also not a space or tab which is allowed between the elements.

cmark, commonmark.js and comrak (rust parser) all accept it as HTML though...

wooorm · 2022-06-23T14:38:50Z

I believe that cmark, commonmark.js, and comrak, are incorrect for accepting it as HTML according to the CommonMark spec.
Note though, that Unicode is allowed in markdown. There are also cases where any character is allowed. That is to say, certain states check for X characters and do something, and then for Y characters to do something else, and finally allow any other character. The unquoted attribute value is such a case.
Though, your case can’t be an unquoted attribute value, as there is no attribute name.
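The open-tag grammar discussed above can be written out as a regex to see why a strict reading rejects a control character after the tag name. This is a sketch under assumptions: it follows my reading of the 0.30 open-tag rules (spaces, tabs, and up to one line ending between parts), every name in it is hypothetical, and `reLooseOpenTag` is only an illustration of a `\s`-based matcher, not the actual code of cmark, commonmark.js, or comrak.

```javascript
// Assumption: strict reading of the CommonMark 0.30 open-tag grammar.
var LINE_ENDING = "(?:\\r\\n|\\n|\\r)";
// Non-empty run of spaces/tabs containing at most one line ending.
var WS = "(?:[ \\t]+" + LINE_ENDING + "?[ \\t]*|" + LINE_ENDING + "[ \\t]*)";
var TAG_NAME = "[A-Za-z][A-Za-z0-9-]*";
var ATTR_NAME = "[A-Za-z_:][A-Za-z0-9_.:-]*";
var UNQUOTED = "[^ \\t\\r\\n\"'=<>`]+";
var ATTR_VALUE = "(?:" + UNQUOTED + "|'[^']*'|\"[^\"]*\")";
var ATTR_VALUE_SPEC = "(?:" + WS + "?=" + WS + "?" + ATTR_VALUE + ")";
var ATTRIBUTE = "(?:" + WS + ATTR_NAME + ATTR_VALUE_SPEC + "?)";
var reStrictOpenTag = new RegExp(
  "^<" + TAG_NAME + ATTRIBUTE + "*" + WS + "?/?>$"
);

// Hypothetical loose variant that allows any \s before the closing ">".
var reLooseOpenTag = /^<[A-Za-z][A-Za-z0-9-]*\s*\/?>$/;

console.log(reStrictOpenTag.test("<T>"));          // true
console.log(reStrictOpenTag.test("<T a=\"b\">"));  // true
console.log(reStrictOpenTag.test("<T\x0b>"));      // false: U+000B is not a space or tab
console.log(reLooseOpenTag.test("<T\x0b>"));       // true: \s includes U+000B
```

The strict variant rejects the tag because the character after the tag name is neither part of the name, nor the start of an attribute, nor one of the allowed separators.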

jgm · 2022-06-23T16:15:08Z

@mikeando want to submit a PR for your simple fix above?

mikeando · 2022-06-23T23:57:28Z

@jgm PR here: #259

mikeando linked a pull request Jun 23, 2022 that will close this issue

add support for tabs in link body #259
