Make sanitycheck.py (WellFormedCheckEpub) look for both missing quotes #777
…quotes like xmlsanity.py does
Please test the cases where a single double quote or a pair of double quotes is embedded inside a set of single quotes, and vice versa. Both cases come up often because they appear in img alt strings. HTML allows some attributes without quotes, but xhtml requires both the = sign and a set of matched quotes (either single or double).
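To make the cases above concrete, here is a minimal sketch of the rule being described: every attribute in a start tag needs an = sign and a value in matched quotes, and the other kind of quote may appear inside the value. The helper name `attributes_well_quoted` and the regular expression are assumptions made for illustration; this is not the code in sanitycheck.py or xmlsanity.py.

```python
import re

# Hypothetical helper for illustration only; not Sigil's sanitycheck.py code.
# One well-formed attribute: whitespace, a name, '=', then a value wrapped in
# matched double quotes or matched single quotes. The other quote character
# is allowed inside the value.
_ATTR = re.compile(r'''\s+[^\s"'<>/=]+\s*=\s*(?:"[^"<]*"|'[^'<]*')''')


def attributes_well_quoted(tag):
    """Return True if every attribute in one start tag has '=' and matched quotes."""
    name = re.match(r'<[^\s/>]+', tag)
    if name is None:
        return False
    pos = name.end()
    # Consume well-formed attributes one at a time.
    while True:
        attr = _ATTR.match(tag, pos)
        if attr is None:
            break
        pos = attr.end()
    # Only optional whitespace followed by '>' or '/>' may remain.
    return re.fullmatch(r'\s*/?>', tag[pos:]) is not None


print(attributes_well_quoted("""<img alt='He said "hi"' src="a.png"/>"""))
print(attributes_well_quoted("""<img alt="missing src="a.png"/>"""))
```

A missing opening or closing quote leaves text that no attribute pattern can consume, so the final check fails. The sketch looks at one tag at a time and does not handle comments, CDATA, or a stray `>` inside an attribute value.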
These all pass the proposed new sanitycheck.py
These all fail:
Oddly enough... these pass (two double-quotes and two single-quotes):
while these fail:
But I suspect those last four have always been the case.
Sounds good. Please push all the fixes to master as I am still over a week away from returning.
Reverted some of my first commit because of: https://www.mobileread.com/forums/showthread.php?p=4456840#post4456840 I'm going to wait a bit on more testing. I don't want to make a mistake that just moves the problem somewhere else.
FWIW, the html meta charset info and media type should be stripped out somewhere in Sigil's first xhtml file input process (or they used to get stripped out), as both are misleading for xhtml in an epub: Sigil will always convert to utf-8 and add the xml header for the charset, and the media type is xhtml, not text/html. Perhaps the process that stripped things out when loading a file got lost over time. Does Mend remove them?
From what I can see, Mend ignores these two things. They are not stripped out when first opened either.
I must have lost the code in CleanSource that used to do that somehow. I will investigate this when I return, if you do not beat me to it.
I will look into if and how html meta charset info and media type should be stripped out.
Sounds good. As it stands, this pull request will allow the meta charset. I'm not sure how important it is for sanity.py to allow this or fail it, especially if we return to stripping it out on import/mend. I don't know about you, but I'd rather the Epub Well-formed Check not get overly ambitious in what it looks for. It was never intended as a full-blown validator.
Yes, that sounds good. Stripping out the html charset if it is not utf-8, and the bad html media type info "text/html", really is something for CleanSource / Mend anyway. So please merge this pr and I will separately look to see how/if CleanSource should strip out obsolete charset and media type meta info.
FWIW, CleanSource::RemoveMetaCharset still exists in Sigil CleanSource.cpp. It was last invoked in the Sigil 0.8.x series by the XHTMLTidy clean routine. I will add it back to the Mend and the Mend and Prettify routines, which are the current equivalent.
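For readers following along, the kind of stripping under discussion can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the real routine is C++ in CleanSource.cpp and its exact behaviour is not shown in this thread, and the name `strip_meta_charset` and the pattern below are hypothetical. The sketch removes every meta tag that carries a charset, whether written as `<meta charset=...>` or as an http-equiv Content-Type declaration, without first checking whether the charset is utf-8.

```python
import re

# Hypothetical sketch; not a port of Sigil's CleanSource::RemoveMetaCharset.
# Matches any <meta ...> tag whose attributes mention "charset=", which covers
# both <meta charset="..."/> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    r'<meta\b(?=[^>]*\bcharset\s*=)[^>]*>\s*',
    re.IGNORECASE,
)


def strip_meta_charset(xhtml):
    """Return xhtml with charset-bearing meta tags removed."""
    return _META_CHARSET.sub('', xhtml)


sample = (
    '<head><meta charset="utf-8"/>'
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>'
    '<meta name="author" content="x"/><title>t</title></head>'
)
print(strip_meta_charset(sample))
```

A regular expression like this would also remove an unrelated meta tag whose content happens to contain "charset=", and it assumes no `>` appears inside an attribute value, so a real implementation would want a proper parser.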
I'll ping @BeckyDTP to see if she wants to add her piece that catches no quotes around an attribute. Otherwise, I'll add it.
I see the discussion.
Those lines have been all the testing everyone has been doing, so I think they're fine. I'll add them. Thanks for the contribution!
Any reason why sanitycheck.py shouldn't (or can't) look for both missing opening AND closing attribute quotes like xmlsanitycheck.py does?
https://www.mobileread.com/forums/showthread.php?p=4456608#post4456608