mochiweb_html: parse exception #89

helllamer · 2012-11-13T07:56:24Z

Steps to reproduce:

Download and unpack

$ wget http://ompldr.org/vZzlndw/error.html.bz2
$ bunzip2 error.html.bz2

In erlang console:

{ok, Data} = file:read_file("error.html"),

mochiweb_html:parse(Data).
** exception error: no case clause matching <<"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional."...>>
     in function  mochiweb_html:find_qgt/2 
     in call from mochiweb_html:tokenize/2 
     in call from mochiweb_html:tokens/3 
     in call from mochiweb_html:parse/1

but HTML source code looks valid.

The text was updated successfully, but these errors were encountered:

doubleyou · 2012-11-13T17:45:34Z

<ul><li><a href='http://wp-skins.info/2009/06/08/oshibka-cannot-modify-header-information-headers-already-sent.html'>Warning: Cannot modify header information</a></li><li><a href='http://wp-skins.info/2009/09/04/kak-rasshifrovat-base64_decode-v-futere-wordpress-temyi.html'><? echo(base64_decode("</a></li><li><a href='http://wp-skins.info/2009/03/31/delaem-krasivyie-ssyilki-ili-horoshiy-chpu-v-wordpress-dlya-seo.html'>ЧПУ в wordpress</a></li><li><a href='http://wp-skins.info/2009/03/16/besplatnyiy-internet-magazin-s-pomoschyu-wordpress-i-quick-shop.html'>интернет магазин на wordpress</a></li><li><a href='http://wp-skins.info/2009/02/18/shablonyi-zhurnalov-ili-gazet-dlya-sayta-na-wordpress.html'>шаблоны газет</a></li><li><a href='http://wp-skins.info/2007/12/19/wordpress-dlya-novichkov-chast-2.html'>Hello Dolly wordpress</a></li><li><a href='http://wp-skins.info/category/design'>дизайн wordpress</a></li><li><a href='http://wp-skins.info/2008/07/11/kak-dobavit-vatermark-watermark-na-svoi-kartinki.html'>ватермарк</a></li></ul></li>

In particular, this part (most likely, crappy WP code chunk):

<? echo(base64_decode("

Piece of advice: next time, just find the problem line using binary search and analyze it (generally, syntax highlighting is enough for that).

etrepum · 2012-11-13T18:53:53Z

Even if it's "invalid HTML", it should still parse to something.

helllamer · 2012-11-13T19:04:37Z

So sorry about crappy html, but it is unclear for me, that exception

no case clause matching <<"<!DOCTYPE html PUBLIC \"-// ....

is somehow connected with inlined php-tags. My first idea was problems with DOCTYPE.

etrepum · 2012-11-13T19:26:19Z

Yeah, Erlang's error reporting for failed pattern matches on binaries could use some work. That's not the failure mode mochiweb_html is supposed to have, so this is a bug either way.

doubleyou · 2012-11-13T19:26:38Z

find_qgt(Bin, S=#decoder{offset=O}) ->
    case Bin of
        <<_:O/binary, "?>", _/binary>> ->
            ?ADV_COL(S, 2);
        <<_:O/binary, ">", _/binary>> ->
            ?ADV_COL(S, 1);
        <<_:O/binary, "/>", _/binary>> ->
            ?ADV_COL(S, 2);
        %% tokenize_attributes takes care of this state:
        %% <<_:O/binary, C, _/binary>> ->
        %%     find_qgt(Bin, ?INC_CHAR(S, C));
        <<_:O/binary>> ->
            S;
        _ ->
            S
    end.

(Just added the last clause actually). This handles the current case, but in case of, say, an incomplete tag, it may purge all the contents after the corrupted part.

@etrepum From my experience of sanitizing HTML, there's no ultimately good way of doing it. At Echo, we just tested several approaches and ended up with the one that was statistically the most accurate for us.

Personally, I've been always thinking of mochiweb_html as something like mochijson/mochijson2, and they fail when they meet incorrect format.

etrepum · 2012-11-13T19:30:04Z

Ideally it would follow the HTML5 recommendations for how HTML parsing should work, but I think this code predates that spec. It wasn't supposed to be a brittle parser, it was supposed to be relatively forgiving... but it was never used to parse crazy stuff in the wild so the parser was never adapted to be as robust as it was intended to be.

helllamer · 2012-11-14T14:18:46Z

So... I understood the mochihtml aims, and I will sanitize crazy-HTML and patch find_qgt/2.
Should one close this bug or ...?

etrepum · 2012-11-16T05:55:54Z

The bug shouldn't be closed until it's fixed in the repo

doubleyou closed this as completed Nov 13, 2012

etrepum reopened this Nov 13, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mochiweb_html: parse exception #89

mochiweb_html: parse exception #89

helllamer commented Nov 13, 2012

doubleyou commented Nov 13, 2012

etrepum commented Nov 13, 2012

helllamer commented Nov 13, 2012

etrepum commented Nov 13, 2012

doubleyou commented Nov 13, 2012

etrepum commented Nov 13, 2012

helllamer commented Nov 14, 2012

etrepum commented Nov 16, 2012

mochiweb_html: parse exception #89

mochiweb_html: parse exception #89

Comments

helllamer commented Nov 13, 2012

doubleyou commented Nov 13, 2012

etrepum commented Nov 13, 2012

helllamer commented Nov 13, 2012

etrepum commented Nov 13, 2012

doubleyou commented Nov 13, 2012

etrepum commented Nov 13, 2012

helllamer commented Nov 14, 2012

etrepum commented Nov 16, 2012