Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sanitycheck.py (WellFormedCheckEpub) look for both missing quotes #777

Merged
merged 3 commits into from
Oct 16, 2024

Conversation

dougmassay
Copy link
Contributor

Any reason why sanitycheck.py shouldn't (or can't) look for both missing opening AND closing attribute quotes like xmlsanitycheck.py does?

https://www.mobileread.com/forums/showthread.php?p=4456608#post4456608

@kevinhendricks
Copy link
Contributor

Please test the cases where both a single double quote or a pair of double quotes is/are embedded inside a set of single quotes, and visa-versa.

Both of these cases can happen often as they can be found in img alt strings.

There are legal html attributes without quotes but xhtml requires both the = sign and a set of matched quotes (either single or double).

@dougmassay
Copy link
Contributor Author

These all pass the proposed new sanitycheck.py

<p class="blah">&nbsp;</p>
<p class='blah'>&nbsp;</p>
<p class='bl"ah'>&nbsp;</p>
<p class='bl"a"h'>&nbsp;</p>
<p class="bl'ah">&nbsp;</p>
<p class="bl'a'h">&nbsp;</p>

These all fail:

<p class=blah>&nbsp;</p>
<p class="blah>&nbsp;</p>
<p class=blah">&nbsp;</p>
<p class='blah">&nbsp;</p>
<p class="blah'>&nbsp;</p>

Oddly enough... these pass (two double-quotes and two single-quotes):

  <p class=""blah>&nbsp;</p>
  <p class=''blah>&nbsp;</p>

while these fail:

  <p class=blah"">&nbsp;</p>
  <p class=blah''>&nbsp;</p>

But I suspect those last four have always been the case.

@kevinhendricks
Copy link
Contributor

kevinhendricks commented Oct 1, 2024

Sounds good. Please push all the fixes to master as I am still over a week away from returning.

@dougmassay
Copy link
Contributor Author

Reverted some of my first commit because of: https://www.mobileread.com/forums/showthread.php?p=4456840#post4456840

I'm going to wait a bit on more testing. I don't want to make a mistake that just moves the problem somewhere else.

@kevinhendricks
Copy link
Contributor

FWIW, the html meta charset info and media type should be stripped out in Sigil's first xhtml file input process someplace (or it used to get stripped out) as both are misleading for xhtml in an epub, as Sigil will always convert to utf-8 and add the xml header for charset and the media type is xhtml as not text/html.

Perhaps the process that stripped things out when loading a file got lost over time. Having both can be misleading. Does Mend remove them?

@dougmassay
Copy link
Contributor Author

From what I can see, Mend ignores theses two things. They are not stripped out when first opened either.

@kevinhendricks
Copy link
Contributor

I must have lost the code in CleanSource that used to do that somehow. I will investigate this when I return, if you do not beat me to it.

@kevinhendricks
Copy link
Contributor

I will look into if and how html meta charset info and media type should be stripped out.

@dougmassay
Copy link
Contributor Author

Sounds good. As it stands, this pull request will allow the meta charset. I'm not sure how important it is for sanity.py to allow this or fail it. Especially if we return to stripping it out on import/mend. I don't know about you, but I'd rather the Epub Well-formed Check not get overly ambitious in what it looks for. It was never intended as full-blown validator.

@kevinhendricks
Copy link
Contributor

Yes that sounds good. The stripping out of html charset if it is not utf-8 and bad html media type info "text/html" really is something for CleanSource / Mend anyway.

So please merge this pr and I will separately look to see how/if CleanSource should strip out obsolete charset and media type meta info.

@kevinhendricks
Copy link
Contributor

FWIW, CleanSource::RemoveMetaCharset still exists in Sigil CleanSource.cpp.

It was last invoked in Sigil 0.8.x series by the XHTMLTidy clean routine.
It got lost with our move to gumbo.

I will add it back to the Mend and Mend and Prettify routine which is the current equivalent.

@dougmassay dougmassay merged commit 293db7e into Sigil-Ebook:master Oct 16, 2024
4 checks passed
@dougmassay
Copy link
Contributor Author

I'll ping @BeckyDTP to see if she wants to add her piece that catches no quotes around an attribute. Otherwise, I'll add it.

@BeckyDTP
Copy link
Contributor

I see the discussion.
I have to go shopping now, so feel free to add those few lines (as long as they don't cause problems).

@dougmassay
Copy link
Contributor Author

Those lines have been all the testing everyone has been doing, so I think they're fine. I'll add them. Thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants