Make sanitycheck.py (WellFormedCheckEpub) look for both missing quotes #777

dougmassay · 2024-09-30T18:14:34Z

Any reason why sanitycheck.py shouldn't (or can't) look for both missing opening AND closing attribute quotes like xmlsanitycheck.py does?

https://www.mobileread.com/forums/showthread.php?p=4456608#post4456608

…quotes like xmlsanity.py does

kevinhendricks · 2024-10-01T13:48:50Z

Please test the cases where either a single double quote or a pair of double quotes is embedded inside a set of single quotes, and vice versa.

Both cases come up often, since they occur in img alt strings.

HTML allows some attributes without quotes, but XHTML requires both the = sign and a set of matched quotes (either single or double).
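
The XHTML rule above (an = sign plus a matched pair of single or double quotes) can be sketched as a regular expression over a single start tag. This is only a minimal illustration, not the code in sanitycheck.py: the names START_TAG and start_tag_is_well_formed are hypothetical, and the name pattern is simplified.

```python
import re

# Hypothetical sketch: one start tag whose attributes all use name=value
# with the value wrapped in a matched pair of quotes.
START_TAG = re.compile(r"""
    <[A-Za-z_:][\w:.-]*          # element name
    (?:
        \s+[A-Za-z_:][\w:.-]*    # attribute name, preceded by whitespace
        \s*=\s*                  # mandatory equals sign
        (?:"[^<"]*"|'[^<']*')    # value in matched double or single quotes
    )*
    \s*/?>                       # optional self-closing slash, then the end of the tag
""", re.VERBOSE)


def start_tag_is_well_formed(tag: str) -> bool:
    """Return True if the whole string is one well-quoted start tag."""
    return START_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ('<p class="blah">', "<p class='bl\"ah'>", "<p class=blah>", "<p class='blah\">"):
        print(tag, start_tag_is_well_formed(tag))
```

A quote of the other kind inside the value is accepted because each alternative only excludes its own delimiter. This sketch is stricter than the results reported later in the thread: it also rejects a value followed directly by more text, such as class=""blah.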

dougmassay · 2024-10-01T14:55:48Z

These all pass the proposed new sanitycheck.py

<p class="blah">&nbsp;</p>
<p class='blah'>&nbsp;</p>
<p class='bl"ah'>&nbsp;</p>
<p class='bl"a"h'>&nbsp;</p>
<p class="bl'ah">&nbsp;</p>
<p class="bl'a'h">&nbsp;</p>

These all fail:

<p class=blah>&nbsp;</p>
<p class="blah>&nbsp;</p>
<p class=blah">&nbsp;</p>
<p class='blah">&nbsp;</p>
<p class="blah'>&nbsp;</p>

Oddly enough... these pass (two double-quotes and two single-quotes):

  <p class=""blah>&nbsp;</p>
  <p class=''blah>&nbsp;</p>

while these fail:

  <p class=blah"">&nbsp;</p>
  <p class=blah''>&nbsp;</p>

But I suspect those last four have always behaved that way.
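
For comparison, the four odd cases above can be run through a strict XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than sanitycheck.py itself. parses_as_xml is a hypothetical helper, and &nbsp; is replaced with &#160; because a bare XML parser has no definition for that entity. Under those assumptions all four should be rejected as not well-formed, which suggests the two that pass reflect leniency in the check rather than valid XHTML.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# The four odd cases from the thread, with &nbsp; swapped for &#160;.
ODD_CASES = [
    '<p class=""blah>&#160;</p>',
    "<p class=''blah>&#160;</p>",
    '<p class=blah"">&#160;</p>',
    "<p class=blah''>&#160;</p>",
]

if __name__ == "__main__":
    for case in ODD_CASES:
        print(case, parses_as_xml(case))
```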

kevinhendricks · 2024-10-01T16:36:13Z

Sounds good. Please push all the fixes to master as I am still over a week away from returning.

dougmassay · 2024-10-02T02:10:55Z

Reverted some of my first commit because of: https://www.mobileread.com/forums/showthread.php?p=4456840#post4456840

I'm going to wait a bit on more testing. I don't want to make a mistake that just moves the problem somewhere else.

kevinhendricks · 2024-10-02T06:15:14Z

FWIW, the html meta charset info and media type should be stripped out somewhere in Sigil's first xhtml file input process (or at least they used to be). Both are misleading for xhtml in an epub: Sigil will always convert to utf-8 and add the xml header for the charset, and the media type is xhtml, not text/html.

Perhaps the process that stripped things out when loading a file got lost over time. Having both can be misleading. Does Mend remove them?

dougmassay · 2024-10-02T11:31:42Z

From what I can see, Mend ignores these two things. They are not stripped out when first opened either.

kevinhendricks · 2024-10-02T16:16:30Z

I must have lost the code in CleanSource that used to do that somehow. I will investigate this when I return, if you do not beat me to it.

kevinhendricks · 2024-10-16T15:56:18Z

I will look into whether and how the html meta charset info and media type should be stripped out.

dougmassay · 2024-10-16T16:04:11Z

Sounds good. As it stands, this pull request will allow the meta charset. I'm not sure how important it is for sanitycheck.py to allow this or fail it, especially if we return to stripping it out on import/mend. I don't know about you, but I'd rather the Epub Well-formed Check not get overly ambitious in what it looks for. It was never intended as a full-blown validator.

kevinhendricks · 2024-10-16T16:24:07Z

Yes, that sounds good. Stripping out the html charset (if it is not utf-8) and the bad "text/html" media type info really is something for CleanSource / Mend anyway.

So please merge this pr and I will separately look to see how/if CleanSource should strip out obsolete charset and media type meta info.

kevinhendricks · 2024-10-16T16:46:22Z

FWIW, CleanSource::RemoveMetaCharset still exists in Sigil CleanSource.cpp.

It was last invoked in Sigil 0.8.x series by the XHTMLTidy clean routine.
It got lost with our move to gumbo.

I will add it back to the Mend and the Mend and Prettify routines, which are the current equivalent.
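
As a rough illustration of the kind of cleanup being discussed, here is a minimal Python sketch that drops meta elements declaring a charset or an http-equiv Content-Type. It is not the logic of CleanSource::RemoveMetaCharset (which is C++ and not shown in this thread). META_CHARSET and strip_meta_charset are hypothetical names, and the sketch removes the element unconditionally rather than only when the charset is not utf-8.

```python
import re

# Hypothetical pattern: a <meta> tag that carries either a charset attribute
# or http-equiv="Content-Type", plus any whitespace that follows it.
META_CHARSET = re.compile(
    r"""<meta\s+[^>]*?(?:charset\s*=|http-equiv\s*=\s*["']content-type["'])[^>]*>\s*""",
    re.IGNORECASE,
)


def strip_meta_charset(xhtml: str) -> str:
    """Remove charset / Content-Type meta elements from an XHTML string."""
    return META_CHARSET.sub("", xhtml)


if __name__ == "__main__":
    source = '<head>\n<meta charset="utf-8"/>\n<title>t</title>\n</head>'
    print(strip_meta_charset(source))
```

A regex is used here only to keep the example short. A real cleanup pass would more likely work on the parsed document.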

dougmassay · 2024-10-16T17:27:54Z

I'll ping @BeckyDTP to see if she wants to add her piece that catches no quotes around an attribute. Otherwise, I'll add it.

BeckyDTP · 2024-10-16T17:32:03Z

I see the discussion.
I have to go shopping now, so feel free to add those few lines (as long as they don't cause problems).

dougmassay · 2024-10-16T17:33:26Z

Those lines are what everyone has been testing with, so I think they're fine. I'll add them. Thanks for the contribution!

Make sanitycheck.py (WellFormedCheckEpub) look for both missing attr …

caeb886

…quotes like xmlsanity.py does

dougmassay added 2 commits October 1, 2024 19:49

Remove problematic change

4b19090

Remove misleading comment

22d2e4b

dougmassay merged commit 293db7e into Sigil-Ebook:master Oct 16, 2024
4 checks passed
