Scraping domains pointing to a cname #13
Comments
Hi,
The next release, which was supposed to come out earlier this week, has been delayed by issues with implementing proxies. I have been unable to find HTTPS proxies that are reliably available and are not detected by web scraping defenses. Another issue is my ability to ensure their safety.

I am exploring alternatives to work around the site's use of Cloudflare, which is responsible for the IP blocking. These include modifying the existing headers to better mimic authentic ones, and creating a separate script that maintains a constant session so that the cookie assigned at the initial connection is preserved. That cookie expires when the connection is closed; not keeping the session open was a gross oversight that should have been addressed when I originally wrote the code. Additionally, I am investigating ways to spoof the cf_clearance cookie that Cloudflare assigns when a CAPTCHA is completed.

Regarding spoofing the user's IP, I am looking for the most efficient and safe way to tunnel traffic through Tor. The issue with proxies is that those on the free lists available on the web have been known to inject malicious code into HTTP responses. I considered setting up a Docker container to isolate the host OS from any malicious code and running it on a separate subnet from the user's home network, to mitigate any possibility of network-spread malware. However, because free proxies are so easily detected by web scraping defenses, this was not a viable solution. If I implement tunneling through Tor, I will use the Docker container setup, since Tor has many of the same dangers as free proxies.

I apologize for the delay and hope to release the next update sometime next week.
Hey mate, Thank you so much for the updates. Let me know what you think. Thanks,
Hi, Thank you again for your continued contributions and patience,
Hi, I hope you're doing well. I've had some time to look at using the workers as proxies, and it would work extremely well, but none of my projects use services that require registration or payment. I avoid them in order to keep my projects as accessible and anonymous as possible. This does pose a lot of obstacles in development, but I think those obstacles are worth it to provide the best user experience I possibly can.

It does mean that unless I can find a safe and reliable way to route traffic through proxies, which I have so far been unable to do, I will have to bypass the WAF that Cloudflare provides for the site. That means finding a way around its rate limiting and scraping detection. In my experience I get a 403 and a CAPTCHA after 3 to 5 runs of the program. The program sends 9 requests for every URL, so the website receives 27 to 45 requests before presenting a 403 and a CAPTCHA. This leads me to believe that the blocking is not strictly caused by rate limiting but mainly by the website's scraping detection mechanism. I believe the following factors are the main reasons the program is being detected by that mechanism:
I'm still aiming for a release this week. I will need to test and implement solutions for the issues mentioned above. I will write a progress report on Friday and post it here. Thank you again for your continued contributions and patience.
Hey, You are correct, implementing workers as proxies in this project will be a bit hard and might not be the right choice. But can you simply add functionality to use a proxy like socks5 in the program with some flag like Let me know what you think. Regards,
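To make the request above concrete, here is a minimal sketch of how a proxy option could be accepted on the command line. The flag name in the comment was not preserved in the thread, so `--proxy` here is a hypothetical name, as are `parse_proxy` and `build_parser`; nothing below is taken from the project's actual code.

```python
import argparse
from typing import Tuple
from urllib.parse import urlsplit

# Assumption: only SOCKS5 URLs are accepted. "socks5h" is the common
# convention for "resolve DNS through the proxy".
ALLOWED_SCHEMES = {"socks5", "socks5h"}


def parse_proxy(value: str) -> Tuple[str, str, int]:
    """Validate a proxy URL such as socks5://127.0.0.1:9050."""
    parts = urlsplit(value)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise argparse.ArgumentTypeError(
            f"unsupported proxy scheme: {parts.scheme!r}"
        )
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"invalid proxy port: {exc}")
    if not parts.hostname or port is None:
        raise argparse.ArgumentTypeError("proxy must look like socks5://host:port")
    return parts.scheme, parts.hostname, port


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="hypothetical scraper CLI")
    # Hypothetical flag name; the real one is up to the maintainer.
    parser.add_argument("--proxy", type=parse_proxy, default=None,
                        help="SOCKS5 proxy URL, e.g. socks5://127.0.0.1:9050")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.proxy)
```

Validating the URL at parse time means a malformed proxy is rejected before any request is sent, rather than failing halfway through a scrape.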
Hi, I'm sorry I didn't post a progress report last Friday, and for the lack of an update last week. I looked at the idea of using SOCKS5 proxies but initially decided to explore different options because of the security concerns, mainly MITM attacks leading to malicious code being injected into the responses from the website. I did, however, come up with some ways to mitigate the risks: a Docker container that runs unprivileged in a network-isolated environment, to avoid network-based malware and lower the risk of local-scope malware.

After a review of the project and my current focuses relating to it, I think the best move is to pivot back to using proxies, as detailed below. One issue I will have to deal with is finding SOCKS5 proxies that are up and have not already been detected. This is a challenge, but not an enormous one. I will have to find lists that are updated often, preferably daily, like many you can find here on GitHub, and develop a tester that goes through each proxy in the list and checks connectivity and ping, tests for malicious code injection, and cross-references the IP with websites that log IPs that have been determined to be bots.
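The connectivity and ping part of the tester described above could look roughly like the following. This is only a sketch under stated assumptions: it measures the time to open a plain TCP connection and does not perform a SOCKS5 handshake, test for code injection, or consult any bot-IP list. The names `ProxyCheck`, `check_proxy`, and `filter_reachable` are hypothetical.

```python
import socket
import time
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class ProxyCheck:
    host: str
    port: int
    reachable: bool
    latency_ms: Optional[float]  # None when the proxy is unreachable


def check_proxy(host: str, port: int, timeout: float = 3.0) -> ProxyCheck:
    """Open a TCP connection to the proxy and time how long it takes."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return ProxyCheck(host, port, False, None)
    return ProxyCheck(host, port, True, latency_ms)


def filter_reachable(candidates: Iterable[Tuple[str, int]],
                     timeout: float = 3.0) -> List[ProxyCheck]:
    """Return reachable proxies only, fastest first."""
    results = [check_proxy(host, port, timeout) for host, port in candidates]
    alive = [r for r in results if r.reachable]
    return sorted(alive, key=lambda r: r.latency_ms)
```

A reachable TCP port says nothing about whether a proxy is safe, so this step would only narrow the list before the injection and reputation checks are run.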
My main concern with SOCKS5 proxies, and proxies in general, has always been that I cannot guarantee their safety; some of my concerns were mentioned previously. But I think it's worth looking back into because of the obstacles I have encountered while trying to find workarounds for the CAPTCHA system implemented by the website. The only solution I have been able to think of is using something like Playwright to solve the CAPTCHAs and retrieve the cf_clearance cookie for use in subsequent requests. In preliminary testing this has proven ineffective, with the attempts being detected by the website's anti-bot protections. I don't think it's viable to continue exploring this idea: in addition to the issues encountered in preliminary testing, it's not future-proof, because Cloudflare changes its CAPTCHA system frequently. Constant reviews would be needed, with code updates following every change to the system, leading to significant downtime and a poor user experience.

This means the current focus will shift to ensuring the safety of proxies, with my main concern being, as previously mentioned, MITM attacks leading to malicious code injection. As far as malicious code injection is concerned, the bulk of the issues come from ads being injected into the website's response. This does not pose a major security issue, because these ads are there mainly to generate revenue, not to cause harm, although in some cases they can lead the user to a malicious website if clicked or interacted with. The possibility of malware is still a relevant concern, and it comes in 3 main variants: network-based malware such as worms; local-scope malware, which can be anything malicious, such as a crypto miner, that only affects the host computer; and DNS poisoning, which causes the requested URL to resolve to an IP that does not belong to dnshistory.org.
SOCKS5 proxies do support HTTPS connections; however, this is not a surefire way to ensure their safety. There are still security concerns that come with routing traffic through a proxy of unknown integrity, many of which have been mentioned previously: the request being routed to another domain that could respond with malicious code, the request having its SSL stripped, making it an HTTP connection, and DNS poisoning. The main countermeasures I have been able to think of to combat these issues are as follows:
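As an illustration only, and not the author's own list of countermeasures, the three concerns named above (rerouting to another domain, SSL stripping, and DNS poisoning) can each be expressed as a simple check on what came back through the proxy. The function name `check_response_origin` and the `trusted_ips` parameter are hypothetical; how a trusted set of addresses would be obtained is an assumption left to the caller.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def check_response_origin(requested_url: str,
                          final_url: str,
                          resolved_ip: str,
                          trusted_ips: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means no red flags.

    requested_url: the URL the program asked for.
    final_url:     the URL the response actually came from, after redirects.
    resolved_ip:   the IP address the hostname resolved to.
    trusted_ips:   hypothetical allow-list of addresses known to belong
                   to the target site, obtained out of band.
    """
    problems: List[str] = []
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)

    # SSL stripping: the connection was downgraded from HTTPS to HTTP.
    if final.scheme != "https":
        problems.append("connection is not HTTPS (possible SSL stripping)")

    # Rerouting: the response was served by a different host.
    if (final.hostname or "").lower() != (requested.hostname or "").lower():
        problems.append("response came from a different host (possible rerouting)")

    # DNS poisoning: the hostname resolved to an unexpected address.
    if resolved_ip not in set(trusted_ips):
        problems.append("resolved IP is not trusted (possible DNS poisoning)")

    return problems
```

Any non-empty result would be a reason to discard the response and drop the proxy from the list rather than parse what it returned.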
I have come to the conclusion that the best way to enact the steps listed above is, as mentioned before, to use a Docker container. I will ensure that the source code still works independently of a Docker container, but I imagine I will discourage using it outside of a network-isolated VM or sandbox. In conclusion, all of my concerns can be mitigated, but it will take me some time to implement all of the security features designed to do so. Hopefully I'll have a beta version done by the end of the week; at the latest I will have it out by Wednesday of next week. I'll return here every couple of days to report my progress. Thank you again for your help with this project; without it I would have had tunnel vision trying to get this thing to work. Sorry again for not posting a progress report or having a new release last week.
Hey mate,
This looks like an interesting project; I was working on something similar just last night, but got banned by the firewall. Could you add a feature to get all subdomains that point to a particular CNAME, along with a proxy implementation so scraping won't be a problem? For example, you can take a look at the URL below.
Thank you