add option to --no-check-certs use at own risk #89
Signed-off-by: vsoch <[email protected]>
@SuperKogito it looks like one of your previously working escape sequences is now considered invalid syntax:
And then it's not detecting the URLs, and some tests fail, I think?
Signed-off-by: vsoch <[email protected]>
urlchecker/core/urlmarker.py
@@ -50,7 +50,7 @@
 ")",
 "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
 "(?:",
-"[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
+r"[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
Everything other than this looks good to me. I will check this in a couple of hours from now and see if I can come up with a quick fix.
Thank you! This was my effort to fix the warning: turning the string into a raw string (what I did above) helps sometimes. But the tests are still failing, and it's not even detecting URLs in many cases, so we have a larger issue on our hands.
I appreciate your help @SuperKogito !
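To make the warning and the raw-string fix concrete, here is a minimal sketch. The helper name escape_warnings and the shortened patterns are made up for illustration and are not taken from urlmarker.py. It shows that "\[" in a normal string literal triggers an invalid-escape warning, that a raw literal does not, and that adding the r prefix while keeping the doubled backslashes (as in the diff above) produces a different regex.

```python
import re
import warnings


def escape_warnings(source):
    """Compile a snippet and return the names of any warnings it raises."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<sketch>", "eval")
    return [w.category.__name__ for w in caught]


# Source text of a normal literal containing the unrecognized escape \[
plain_src = r'"\["'
# Source text of the same literal with an r prefix
raw_src = r'r"\["'

# The warning category depends on the Python version
# (DeprecationWarning on older releases, SyntaxWarning on 3.12).
print(escape_warnings(plain_src))
print(escape_warnings(raw_src))

# Shortened stand-ins for one piece of the pattern (illustrative only).
single = r"[^\s()]+"    # raw string, single backslash: excludes whitespace
doubled = r"[^\\s()]+"  # raw string, doubled backslash: excludes "\" and "s"

print(single == "[^\\s()]+")                # same regex as the old escaped form
print(bool(re.fullmatch(single, "docs")))   # whitespace-free text matches
print(bool(re.fullmatch(doubled, "docs")))  # fails: "s" is now excluded
```

So when a pattern is converted to a raw string, the doubled backslashes probably need to be collapsed to single ones at the same time; otherwise the character classes change meaning.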
So the good news: you had the right fix for the warning, and we have some good tests.
The bad news: we got a lot of deprecation warnings and some tough URLs to check, e.g. https://codepen.io/rootwork/ cannot be checked from my machine, but the link is there.
Also check this: in the same test, https://groups.drupal.org/node/278968 causes a failure on the first run and then works on the second. Zero code changed in between, so it is just timing, I guess.
My suggestions are the following:
- Use your fix, but all over the URL_REGEX, and use no escapes in URL_REGEX = "".join(
This is what worked for me:
URL_REGEX = r"".join(
    (
        r"(?i)\b(",
        r"(?:",
        r"https?:(?:/{1,3}|[a-z0-9%]",
        r")",
        r"|[a-z0-9.\-]+[.](?:%s)/)" % domain_extensions,
        r"(?:",
        r"[^\s()<>\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\\s]+?\)",
        r")",
        r"+",
        r"(?:",
        r"\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\\s]+?\)",
        r"|[^\s`!()\[\];:'\".,<>?«»“”‘’]",
        r")",
        r"|",
        r"(?:",
        r"(?<!@)[a-z0-9]",
        r"+(?:[.\-][a-z0-9]+)*[.]",
        r"(?:%s)\b/?(?!@)" % domain_extensions,
        r"))",
    )
)
- Comment out the difficult links causing an issue, for now, until we figure out a better way to check them (this is not a regex issue IMO; I can only point the finger at the driver at the moment :/ ). In my case, commenting out "https://groups.drupal.org/node/298298" and "https://codepen.io/rootwork/" under def test_difficult_urls(file_paths): made the test pass for Python 3.9 and 3.12.
- A key point is to extend test.yml to test different versions of Python (the escape-character warning, among others, only shows up on Python 3.12 and not on 3.9). In this regard, we need to decide which Python versions we want to support and which ones we want to gradually drop. I think the test section of our test.yml should look a bit more like https://github.com/librosa/librosa/blob/main/.github/workflows/ci.yml so that we cover more versions.
*I did not make any direct changes to the branch since I don't have a better solution and your input on these matters is very important.
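For anyone who wants to try the raw-string pattern from this comment outside the test suite, here is a rough sketch. The pattern pieces are copied verbatim from the comment; the three-entry domain_extensions value and the find_urls helper are assumptions for illustration only (the real extension list in urlmarker.py is much longer).

```python
import re

# Hypothetical short subset; the project's real list is much longer.
domain_extensions = "com|org|io"

URL_REGEX = "".join(
    (
        r"(?i)\b(",
        r"(?:",
        r"https?:(?:/{1,3}|[a-z0-9%]",
        r")",
        r"|[a-z0-9.\-]+[.](?:%s)/)" % domain_extensions,
        r"(?:",
        r"[^\s()<>\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        # Kept verbatim: in a raw string, [^\\s] excludes a backslash and
        # the letter "s" rather than whitespace.
        r"|\([^\\s]+?\)",
        r")",
        r"+",
        r"(?:",
        r"\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\\s]+?\)",
        r"|[^\s`!()\[\];:'\".,<>?«»“”‘’]",
        r")",
        r"|",
        r"(?:",
        r"(?<!@)[a-z0-9]",
        r"+(?:[.\-][a-z0-9]+)*[.]",
        r"(?:%s)\b/?(?!@)" % domain_extensions,
        r"))",
    )
)


def find_urls(text):
    """Return every substring of text that URL_REGEX matches."""
    return [m.group(0) for m in re.finditer(URL_REGEX, text)]


# A full URL followed by sentence punctuation, and a bare domain.
print(find_urls("see https://github.com/urlstechie/urlchecker-python."))
print(find_urls("visit example.org today"))
```

The trailing period in the first string is left out of the match because the final character class excludes sentence punctuation.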
overall lgtm.
Signed-off-by: vsoch <[email protected]>
@SuperKogito could you please do a PR to the PR branch here (and then it can be also tested)?
#90, but I am getting the same failure :/
Signed-off-by: vsoch <[email protected]>
Okay, I found the issue. It has nothing to do with our regex or the requests; an update to the selenium webdriver deprecated some of the logic we were using. As a result, the driver was failing and coming back as None, and since the driver is the primary means of getting many of these URLs (e.g., when the initial requests response is not allowed), many checks were failing. This is becoming more common with websites; understandably, they don't want people scraping, but they can't prevent a selenium webdriver from doing so. I'm finishing up local tests now and will push the fixes shortly.
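The failure mode described here can be sketched roughly as follows. This is not the project's code: make_driver, check_url and the fetch_status callable are hypothetical names, and treating a successful driver.get as "URL works" is a deliberate simplification. The point is only that a driver factory which swallows errors and returns None quietly turns every check into a requests-only check.

```python
import logging

logger = logging.getLogger("driver-sketch")


def make_driver():
    """Return a headless Chrome driver, or None if it cannot be created."""
    try:
        # Imported lazily so the sketch still loads without selenium installed.
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Selenium 4 style: pass a Service object rather than a driver path.
        return webdriver.Chrome(service=Service(), options=options)
    except Exception as exc:
        # Logging the reason keeps a broken driver from failing silently.
        logger.warning("web driver unavailable: %s", exc)
        return None


def check_url(url, fetch_status, driver=None):
    """Check a URL with a plain request first, then fall back to the driver.

    fetch_status is any callable returning an HTTP status code for the URL
    (assumed here; it could wrap requests.get(url).status_code).
    """
    if fetch_status(url) == 200:
        return True
    if driver is None:
        # No driver means sites that refuse plain requests are reported broken.
        return False
    try:
        driver.get(url)
        return True
    except Exception:
        return False
```

With a setup like this, a site that answers 403 to a plain request fails whenever make_driver() returned None, which would match the many unexpected failures seen in CI.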
fb4dcd8 to 2e53fd5
Note for myself: we will need to update the driver in the Dockerfile as well, once we find the one that matches GH actions.
The current failures are a result of an update to selenium, so the instantiation of our driver fails, returns as None, and then all the requests are done with only requests. As the web matures (and sites do not want scraping) it is less likely this approach will work - we need the driver. This change will update the selenium UI to ensure the driver works and restore functionality. I will follow up with any tweaks needed for the CI (working locally for me).
Signed-off-by: vsoch <[email protected]>
2e53fd5 to 0b41f0e
That green is sure beautiful :) 🍏
https://github.com/urlstechie/urlchecker-python/actions/runs/7769722953/job/21189125531?pr=89
Just pushed the update for the container, and we should be able to merge and release soon and test with the action.
38d057e to 35a382e
Signed-off-by: vsoch <[email protected]>
35a382e to a19bddd
This will address urlstechie/urlchecker-action#105. After it is tested by the person that opened the issue we will merge, release and update the action.