Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle quoted external CSS URLs in privacy plugin #7651

Merged
merged 1 commit into from
Oct 30, 2024

Conversation

nejch
Copy link
Contributor

@nejch nejch commented Oct 29, 2024

This one is very similar to #7650, but keeping it separate as it covers different files. I was wondering why our external woff/woff2 files weren't being downloaded and noticed this issue.

The back-reference helps get the exact URL and to avoid more false positives on invalid quoting, though there are still some - but still better than just expecting quotes on either side IMO. Also, this is probably caught by CSS linters and build system anyway.

I had to introduce named groups to handle this, hence the slight change in logic as findall doesn't support named groups. But IMO this might also help in the future if more edge cases show up.

🛠️ with ❤️ at Siemens

@nejch nejch marked this pull request as ready for review October 29, 2024 13:12
@squidfunk
Copy link
Owner

Same as #7650 (comment), need a little more input.

@squidfunk
Copy link
Owner

Looking at the CSS regex, in your example, cases that should not match still do:

Bildschirm­foto 2024-10-30 um 10 15 42

Moving the ? quantifier into the capturing group will allow the back reference to accept an empty match:

url\(([\"']?)(?P<url>\s*http?[^)'\"]+)\1\)
Bildschirm­foto 2024-10-30 um 10 15 54

@nejch
Copy link
Contributor Author

nejch commented Oct 30, 2024

Looking at the CSS regex, in your example, cases that should not match still do:
...
Moving the ? quantifier into the capturing group will allow the back reference to accept an empty match:
...

url\(([\"']?)(?P<url>\s*http?[^)'\"]+)\1\)

Ah nice catch, missed that one! Will adapt the PR.

Here's a reproduction zip with all 3 quote styles:

9.5.42-css-quoted-urls-missing-from-privacy-plugin.zip

With the current version, only the unquoted external URL gets fetched.

Let me know if these 2 repro zips are clear enough @squidfunk 🙇

Edit: pushed the new regex and also updated the regex101 link above. Should be ready for another round 🏓

@nejch nejch force-pushed the fix/css-url-quoted branch 2 times, most recently from 7bca798 to c4105b0 Compare October 30, 2024 12:18
@squidfunk
Copy link
Owner

Thanks! I believe we need to allow spaces before and after the URL, as according to syntax level 3, specifically how URL tokens are parsed. There are two ways how URLs are handled: if they contain a string, they're just considered to be normal function tokens, which definitely demands for whitespace after and before. If they contain a verbatim URL, I believe we need to support whitespace as well.

url\(\s*([\"']?)(?P<url>http?[^)'\"]+)\1\s*\)

This now works correctly with the following strings, albeit it consumes the trailing whitespace on the verbatim version, which should not be a problem though:

/* correct */
url("https://example.com/images/myImg.jpg");
url('https://example.com/images/myImg.jpg');
url(  'https://example.com/images/myImg.jpg'  );
url(  "https://example.com/images/myImg.jpg"  );
url(https://example.com/images/myImg.jpg);
url(   https://example.com/images/myImg.jpg   );


/* mismatching */
url('https://example.com/images/myImg.jpg);
url("https://example.com/images/myImg.jpg);
url('https://example.com/images/myImg.jpg");
url(https://example.com/images/myImg.jpg');
url(https://example.com/images/myImg.jpg");

/* non-http links */
url("data:image/jpg;base64,iRxVB0…");
url(myImg.jpg);
url(#IDofSVGpath);

PS: How can I share on regex101? I'm too stupid to find the share button 😅

@nejch
Copy link
Contributor Author

nejch commented Oct 30, 2024

Perfect, thanks! Added the new pattern.

Hehe, Ctrl+S should save it and give you a modal with the link: https://regex101.com/r/LVJJfK/1https://regex101.com/r/LVJJfK/1. Not sure if it requires starting over when receiving a link though.

I also just checked and as you say at least urlparse seems to be happy with trailing whitespace:

>>> from urllib.parse import urlparse
>>> urlparse("https://example.com/images/myImg.jpg   ")
ParseResult(scheme='https', netloc='example.com', path='/images/myImg.jpg   ', params='', query='', fragment='')

@squidfunk
Copy link
Owner

Perfect, thanks for investigating! I think this is safe to merge then 🤟

@squidfunk squidfunk merged commit 4918a10 into squidfunk:master Oct 30, 2024
4 checks passed
@nejch nejch deleted the fix/css-url-quoted branch October 30, 2024 14:22
@squidfunk
Copy link
Owner

Released as part of 9.5.43.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants