Catch all exceptions; save full responses and errors #1

Merged
merged 1 commit into from
Jan 15, 2025

Conversation

antonok-edm
Collaborator

No description provided.

@@ -91,11 +90,24 @@ def read_csv_make_requests(skip):

# Write to both stdout and output file
def crawl_url(url, location):
    url, status_code, identified, error = post_request(url, location)
    region = "San Francisco" if location == "" else "Europe"

Collaborator

what gets printed now if no location is specified?

Collaborator Author

It just prints the proxy URL, so for the direct case it's an empty string.

Collaborator

Given that we'll be putting this in EC2, I think we should print out what the vantage point is.
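
One way to make the vantage point explicit in console output could look like the sketch below. The helper name `vantage_label` and the `"direct"` label are hypothetical and not part of this PR; only the convention that an empty location means the direct (no proxy) case comes from the discussion above.

```python
def vantage_label(location: str, direct_label: str = "direct") -> str:
    """Return a printable label for where a request originates.

    Assumption: an empty location string means no proxy was used,
    and any other value is the proxy host.
    """
    return location if location else direct_label


# Example: label both a direct and a proxied request.
for loc in ["", "bg.stealthtunnel.net"]:
    print(f"vantage point: {vantage_label(loc)}")
```

With a label like this, the direct case no longer shows up as empty parentheses in the log.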

Collaborator Author

(I figured the primary way to analyze the results will be running scripts on the output file anyway; the console logging is just a convenience.)

Collaborator

Can you just quickly show what the output looks like for a sample website?

Collaborator Author

Just ran it for a bit; log output:

failed for https://www.nehnutelnosti.sk (bg.stealthtunnel.net): Navigation timeout of 30000 ms exceeded
failed for https://spserv.microsoft.com (): net::ERR_CERT_COMMON_NAME_INVALID at https://spserv.microsoft.com
failed for https://spserv.microsoft.com (bg.stealthtunnel.net): net::ERR_CERT_COMMON_NAME_INVALID at https://spserv.microsoft.com
failed for https://informaticacloud.com (): net::ERR_NAME_NOT_RESOLVED at https://informaticacloud.com
failed for https://informaticacloud.com (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://informaticacloud.com
failed for https://ninebot.com (): Navigation timeout of 30000 ms exceeded
failed for https://ninebot.com (bg.stealthtunnel.net): Navigation timeout of 30000 ms exceeded
failed for https://first-ns.de (): net::ERR_SSL_PROTOCOL_ERROR at https://first-ns.de
failed for https://first-ns.de (bg.stealthtunnel.net): net::ERR_SSL_PROTOCOL_ERROR at https://first-ns.de
failed for https://cdnhwc8.com (): net::ERR_NAME_NOT_RESOLVED at https://cdnhwc8.com
failed for https://cdnhwc8.com (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://cdnhwc8.com
failed for https://audiencenet.ru (): net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru
failed for https://audiencenet.ru (bg.stealthtunnel.net): net::ERR_TUNNEL_CONNECTION_FAILED at https://audiencenet.ru
identified for https://www.viabcp.com ()!
identified for https://www.viabcp.com (bg.stealthtunnel.net)!

That only shows failures and identifications, which were the most relevant things for me to see when checking on the status of a crawl.
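
A rough sketch of that console filtering, for illustration only: the function name `console_line` is hypothetical, and the `error` and `identified` keys are taken from the sample responses in this thread rather than from the script itself, which is not shown here.

```python
from typing import Optional


def console_line(url: str, location: str, response: dict) -> Optional[str]:
    """Return a console log line for a crawl result, or None if the
    result is neither a failure nor an identification."""
    error = response.get("error")
    if error:
        return f"failed for {url} ({location}): {error}"
    if response.get("identified"):
        return f"identified for {url} ({location})!"
    # Successful crawls with nothing identified are not logged to the console.
    return None


print(console_line("https://ninebot.com", "",
                   {"error": "Navigation timeout of 30000 ms exceeded"}))
print(console_line("https://www.viabcp.com", "bg.stealthtunnel.net",
                   {"identified": True}))
```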

The full results from the file look more like this (truncated for brevity, since there are a lot more uninteresting crawl results that did not get logged to the console above):

[200, "", "{\"url\":\"https://www.tanishq.co.in\",\"timestamp\":1736886632362,\"scriptSources\":[\"www.tanishq.co.in\",\"code.jquery.com\",\"ajax.googleapis.com\",\"accounts.tatadigital.com\",\"cdn-api.syteapi.com\",\"asset.fwcdn3.com\",\"cdn.cquotient.com\",\"cdn.syteapi.com\",\"fireworkapi1.com\",\"cdn.mirrar.com\",\"e.cquotient.com\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.tanishq.co.in\",\"timestamp\":1736886640770,\"scriptSources\":[\"www.tanishq.co.in\"],\"classifiersUsed\":[],\"scrollBlocked\":false}"]
[200, "", "{\"url\":\"https://audiencenet.ru\",\"timestamp\":1736886647729,\"scriptSources\":[],\"error\":\"net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru\"}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://audiencenet.ru\",\"timestamp\":1736886649361,\"scriptSources\":[],\"error\":\"net::ERR_TUNNEL_CONNECTION_FAILED at https://audiencenet.ru\"}"]
[200, "", "{\"url\":\"https://www.novibet.gr\",\"timestamp\":1736886652017,\"scriptSources\":[\"www.novibet.gr\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.novibet.gr\",\"timestamp\":1736886658173,\"scriptSources\":[\"www.novibet.gr\"],\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "", "{\"url\":\"https://www.viabcp.com\",\"timestamp\":1736886666177,\"scriptSources\":[\"www.viabcp.com\",\"assets.adobedtm.com\",\"unruffled-shannon-1a7413.netlify.app\",\"apis.google.com\",\"www.google.com\",\"www.gstatic.com\",\"bcpr42sh.staticmon.com\"],\"identified\":true,\"markup\":\"<div class=\\\"bcp_contenedor_aviso container\\\">\\n  <div class=\\\"bcp_grupo_texto\\\">\\n    <div data-translate=\\\"true\\\" class=\\\"bcp_titulo\\\" tabindex=\\\"0\\\" id=\\\"dialogTitleModalConsentimiento\\\">Pol\u00edtica de Cookies</div>\\n    <div class=\\\"bcp_mensaje\\\" tabindex=\\\"0\\\" id=\\\"dialogDescriptionModalConsentimiento\\\">\\n      <span data-translate=\\\"true\\\">Esta web utiliza cookies necesarias y, con tu consentimiento, utilizaremos cookies de personalizaci\u00f3n y marketing.</span>\\n      <span data-translate=\\\"true\\\">Para m\u00e1s informaci\u00f3n revisa nuestra </span><a data-translate=\\\"true\\\" href=\\\"/transparencia/#protecciondedatos\\\" rel=\\\"noopener noreferrer\\\" target=\\\"_blank\\\" title=\\\"\\\">Pol\u00edtica de Privacidad y Pol\u00edtica de Cookies.</a>\\n    </div>\\n  </div>\\n  <div class=\\\"bcp_grupo_botones\\\">\\n    <button class=\\\"bcp_btn_configurar bcp_boton_blanco\\\" data-translate=\\\"true\\\">Configuraci\u00f3n\\n</button>\\n    <button class=\\\"bcp_btn_aceptar bcp_boton_naranja\\\" data-translate=\\\"true\\\">Aceptar todo</button>\\n  </div>\\n</div>\",\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]
[200, "bg.stealthtunnel.net", "{\"url\":\"https://www.viabcp.com\",\"timestamp\":1736886673032,\"scriptSources\":[\"www.viabcp.com\",\"assets.adobedtm.com\",\"unruffled-shannon-1a7413.netlify.app\",\"apis.google.com\",\"bcpr42sh.staticmon.com\",\"www.google.com\",\"www.gstatic.com\"],\"identified\":true,\"markup\":\"<div class=\\\"bcp_contenedor_aviso container\\\">\\n  <div class=\\\"bcp_grupo_texto\\\">\\n    <div data-translate=\\\"true\\\" class=\\\"bcp_titulo\\\" tabindex=\\\"0\\\" id=\\\"dialogTitleModalConsentimiento\\\">Pol\u00edtica de Cookies</div>\\n    <div class=\\\"bcp_mensaje\\\" tabindex=\\\"0\\\" id=\\\"dialogDescriptionModalConsentimiento\\\">\\n      <span data-translate=\\\"true\\\">Esta web utiliza cookies necesarias y, con tu consentimiento, utilizaremos cookies de personalizaci\u00f3n y marketing.</span>\\n      <span data-translate=\\\"true\\\">Para m\u00e1s informaci\u00f3n revisa nuestra </span><a data-translate=\\\"true\\\" href=\\\"/transparencia/#protecciondedatos\\\" rel=\\\"noopener noreferrer\\\" target=\\\"_blank\\\" title=\\\"\\\">Pol\u00edtica de Privacidad y Pol\u00edtica de Cookies.</a>\\n    </div>\\n  </div>\\n  <div class=\\\"bcp_grupo_botones\\\">\\n    <button class=\\\"bcp_btn_configurar bcp_boton_blanco\\\" data-translate=\\\"true\\\">Configuraci\u00f3n\\n</button>\\n    <button class=\\\"bcp_btn_aceptar bcp_boton_naranja\\\" data-translate=\\\"true\\\">Aceptar todo</button>\\n  </div>\\n</div>\",\"classifiersUsed\":[\"llm\"],\"scrollBlocked\":false}"]

Each line contains the full response from cookiemonster, so there's no need to decide up front which data to capture; it can all be handled at analysis time.
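
As an illustration of what analysis-time handling might look like, here is a minimal sketch that tallies outcomes per vantage point. It assumes each output line is a JSON array of `[status_code, location, body]` where `body` is itself a JSON-encoded string, as in the sample above; the function names are hypothetical, and the sample lines are abbreviated versions of the real ones.

```python
import json
from collections import Counter


def parse_result_line(line: str):
    """Split one output line into (status_code, location, response dict).

    Assumption: the third element is a JSON-encoded string; json.loads
    will raise if a line stores something else there.
    """
    status_code, location, body = json.loads(line)
    return status_code, location, json.loads(body)


def summarize(lines):
    """Count outcomes per (location, outcome) pair."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _, location, response = parse_result_line(line)
        if "error" in response:
            outcome = "error"
        elif response.get("identified"):
            outcome = "identified"
        else:
            outcome = "not identified"
        counts[(location, outcome)] += 1
    return counts


# Abbreviated sample lines, built with json.dumps to mirror the file format.
sample = [
    json.dumps([200, "", json.dumps({
        "url": "https://audiencenet.ru", "scriptSources": [],
        "error": "net::ERR_NAME_NOT_RESOLVED at https://audiencenet.ru"})]),
    json.dumps([200, "bg.stealthtunnel.net", json.dumps({
        "url": "https://www.viabcp.com", "identified": True})]),
    json.dumps([200, "", json.dumps({
        "url": "https://www.novibet.gr", "scrollBlocked": False})]),
]

for key, count in sorted(summarize(sample).items()):
    print(key, count)
```

To run it against a real crawl, pass an open file handle for the output file to `summarize` in place of `sample`.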

@ShivanKaul ShivanKaul merged commit 25c14a2 into main Jan 15, 2025
1 check passed