Catch all exceptions; save full responses and errors #1
@@ -91,11 +90,24 @@ def read_csv_make_requests(skip):

# Write to both stdout and output file
def crawl_url(url, location):
    url, status_code, identified, error = post_request(url, location)
    region = "San Francisco" if location == "" else "Europe"
What gets printed now if no location is specified?
It just prints the proxy URL, so for the direct case it's an empty string.
Given that we'll be putting this in EC2, I think we should print out what the vantage point is.
(I figured the primary way to analyze the results will be running scripts on the output file anyway; the console logging is just a convenience.)
Can you just quickly show what the output looks like for a sample website?
Just ran it for a bit; log output:
failed for https://www.nehnutelnosti.sk (bg.stealthtunnel.net): Navigation timeout of 30000 ms exceeded
failed for https://spserv.microsoft.com (): net::ERR_CERT_COMMON_NAME_INVALID at https://spserv.microsoft.com
failed for https://spserv.microsoft.com (bg.stealthtunnel.net): net::ERR_CERT_COMMON_NAME_INVALID at https://spserv.microsoft.com
failed for https://informaticacloud.com (): net::ERR_NAME_NOT_RESOLVED at https://informaticacloud.com
failed for https://informaticacloud.com (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://informaticacloud.com
failed for https://ninebot.com (): Navigation timeout of 30000 ms exceeded
failed for https://ninebot.com (bg.stealthtunnel.net): Navigation timeout of 30000 ms exceeded
failed for https://first-ns.de (): net::ERR_SSL_PROTOCOL_ERROR at https://first-ns.de
failed for https://first-ns.de (bg.stealthtunnel.net): net::ERR_SSL_PROTOCOL_ERROR at https://first-ns.de
failed for https://cdnhwc8.com (): net::ERR_NAME_NOT_RESOLVED at https://cdnhwc8.com
failed for https://cdnhwc8.com (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://cdnhwc8.com
failed for https://audiencenet.ru (): net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru
failed for https://audiencenet.ru (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://audiencenet.ru
identified for https://www.viabcp.com ()!
identified for https://www.viabcp.com (bg.stealthtunnel.net)!
That only shows failures and identifications, which was the most relevant information for me when checking on the status of a crawl.
The full results from the file look more like this (truncated for brevity, since there are a lot more uninteresting crawl results that did not get logged to the console above):
[200, "", "{\"url\":\"https://www.tanishq.co.in\",\"timestamp\":1736886632362,\"scriptSources\":[\"www.tanishq.co.in\",\"code.jquery.com\",\"ajax.googleapis.com\",\"accounts.tatadigital.com\",\"cdn-api.syteapi.com\",\"asset.fwcdn3.com\",\"cdn.cquotient.com\",\"cdn.syteapi.com\",\"fireworkapi1.com\",\"cdn.mirrar.com\",\"e.cquotient.com\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.tanishq.co.in\",\"timestamp\":1736886640770,\"scriptSources\":[\"www.tanishq.co.in\"],\"classifiersUsed\":[],\"scrollBlocked\":false}"]
[200, "", "{\"url\":\"https://audiencenet.ru\",\"timestamp\":1736886647729,\"scriptSources\":[],\"error\":\"net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru\"}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://audiencenet.ru\",\"timestamp\":1736886649361,\"scriptSources\":[],\"error\":\"net::ERR_TUNNEL_CONNECTION_FAILED at https://audiencenet.ru\"}"]
[200, "", "{\"url\":\"https://www.novibet.gr\",\"timestamp\":1736886652017,\"scriptSources\":[\"www.novibet.gr\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.novibet.gr\",\"timestamp\":1736886658173,\"scriptSources\":[\"www.novibet.gr\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "", "{\"url\":\"https://www.viabcp.com\",\"timestamp\":1736886666177,\"scriptSources\":[\"www.viabcp.com\",\"assets.adobedtm.com\",\"unruffled-shannon-1a7413.netlify.app\",\"apis.google.com\",\"www.google.com\",\"www.gstatic.com\",\"bcpr42sh.staticmon.com\"],\"identified\":true,\"markup\":\"<div class=\\\"bcp_contenedor_aviso container\\\">\\n <div class=\\\"bcp_grupo_texto\\\">\\n <div data-translate=\\\"true\\\" class=\\\"bcp_titulo\\\" tabindex=\\\"0\\\" id=\\\"dialogTitleModalConsentimiento\\\">Pol\u00edtica de Cookies</div>\\n <div class=\\\"bcp_mensaje\\\" tabindex=\\\"0\\\" id=\\\"dialogDescriptionModalConsentimiento\\\">\\n <span data-translate=\\\"true\\\">Esta web utiliza cookies necesarias y, con tu consentimiento, utilizaremos cookies de personalizaci\u00f3n y marketing.</span>\\n <span data-translate=\\\"true\\\">Para m\u00e1s informaci\u00f3n revisa nuestra </span><a data-translate=\\\"true\\\" href=\\\"/transparencia/#protecciondedatos\\\" rel=\\\"noopener noreferrer\\\" target=\\\"_blank\\\" title=\\\"\\\">Pol\u00edtica de Privacidad y Pol\u00edtica de Cookies.</a>\\n </div>\\n </div>\\n <div class=\\\"bcp_grupo_botones\\\">\\n <button class=\\\"bcp_btn_configurar bcp_boton_blanco\\\" data-translate=\\\"true\\\">Configuraci\u00f3n\\n</button>\\n <button class=\\\"bcp_btn_aceptar bcp_boton_naranja\\\" data-translate=\\\"true\\\">Aceptar todo</button>\\n </div>\\n</div>\",\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.viabcp.com\",\"timestamp\":1736886673032,\"scriptSources\":[\"www.viabcp.com\",\"assets.adobedtm.com\",\"unruffled-shannon-1a7413.netlify.app\",\"apis.google.com\",\"bcpr42sh.staticmon.com\",\"www.google.com\",\"www.gstatic.com\"],\"identified\":true,\"markup\":\"<div class=\\\"bcp_contenedor_aviso container\\\">\\n <div class=\\\"bcp_grupo_texto\\\">\\n <div data-translate=\\\"true\\\" class=\\\"bcp_titulo\\\" tabindex=\\\"0\\\" id=\\\"dialogTitleModalConsentimiento\\\">Pol\u00edtica de Cookies</div>\\n <div class=\\\"bcp_mensaje\\\" tabindex=\\\"0\\\" id=\\\"dialogDescriptionModalConsentimiento\\\">\\n <span data-translate=\\\"true\\\">Esta web utiliza cookies necesarias y, con tu consentimiento, utilizaremos cookies de personalizaci\u00f3n y marketing.</span>\\n <span data-translate=\\\"true\\\">Para m\u00e1s informaci\u00f3n revisa nuestra </span><a data-translate=\\\"true\\\" href=\\\"/transparencia/#protecciondedatos\\\" rel=\\\"noopener noreferrer\\\" target=\\\"_blank\\\" title=\\\"\\\">Pol\u00edtica de Privacidad y Pol\u00edtica de Cookies.</a>\\n </div>\\n </div>\\n <div class=\\\"bcp_grupo_botones\\\">\\n <button class=\\\"bcp_btn_configurar bcp_boton_blanco\\\" data-translate=\\\"true\\\">Configuraci\u00f3n\\n</button>\\n <button class=\\\"bcp_btn_aceptar bcp_boton_naranja\\\" data-translate=\\\"true\\\">Aceptar todo</button>\\n </div>\\n</div>\",\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
Each line contains the full response from cookiemonster, so there is no need to decide up front which data to capture; that can all be handled at analysis time.
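For reference, here is a minimal sketch of what such an analysis script could look like. It assumes, based only on the sample lines above, that each output line is a JSON array of `[status_code, proxy, body]`, where `body` is itself a JSON-encoded string carrying fields such as `url`, `error`, and `identified`. The function names `parse_result_line` and `summarize` are hypothetical and are not part of this PR.

```python
import json
from collections import Counter


def parse_result_line(line):
    """Decode one output line, assumed to be a JSON array [status_code, proxy, body]."""
    status_code, proxy, body = json.loads(line)
    # The third element is a JSON document stored as a string, so decode it again.
    response = json.loads(body)
    return {
        "status_code": status_code,
        "proxy": proxy,  # "" is the direct (no proxy) case
        "url": response.get("url"),
        "identified": response.get("identified", False),
        "error": response.get("error"),
        "response": response,  # full payload, kept for later analysis
    }


def summarize(lines):
    """Count outcomes per vantage point: failed, identified, or not_identified."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the output file
        record = parse_result_line(line)
        vantage = record["proxy"] or "direct"
        if record["error"]:
            outcome = "failed"
        elif record["identified"]:
            outcome = "identified"
        else:
            outcome = "not_identified"
        counts[(vantage, outcome)] += 1
    return counts
```

Passing an open file handle for the output file to `summarize` should work, since it only iterates over lines. The result is keyed by `(vantage, outcome)`, so the direct and proxied crawls can be compared side by side.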