
Scraping domains pointing to a cname #13

Open
venom26 opened this issue May 7, 2024 · 7 comments
Assignees
Labels
ASAP Get done asap bug Something isn't working In Progress

Comments

@venom26

venom26 commented May 7, 2024

Hey mate,
This looks like an interesting project, something I was working on just last night before I got banned by the firewall. Could you add a feature to get all subdomains that point to a particular CNAME, along with proxy support so scraping won't be a problem? For example, take a look at the URL below.

https://dnshistory.org/points-to/cname/1/mytechcmsprod.azureedge.net

Thank you
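As a sketch of what the requested feature might look like: assuming dnshistory.org's "points-to" pages link each result under a /dns-records/ path (a guess at the page markup, not a verified selector), the extraction step could be as small as:

```python
import re

def extract_domains(html: str) -> list:
    """Pull domain names out of a dnshistory.org 'points-to' results
    page. The /dns-records/ link pattern is an assumption about the
    markup, not a tested selector."""
    return sorted(set(re.findall(r'/dns-records/([A-Za-z0-9.-]+)', html)))
```

The proxy side of the request is the harder part and is discussed at length in the comments below.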

@BirdsAreFlyingCameras BirdsAreFlyingCameras self-assigned this May 18, 2024
@BirdsAreFlyingCameras BirdsAreFlyingCameras added bug Something isn't working ASAP Get done asap labels May 18, 2024
@BirdsAreFlyingCameras
Owner

BirdsAreFlyingCameras commented May 18, 2024

Hi,
Sorry for the late reply; I'm just now seeing this issue. I'll start work on this today and include it in the next version, which should be released early next week. I'll dig up some old code I have for a proxy retriever/validator and implement a rotating system. Thank you for contributing; if you have any improvements, pull requests are always welcome.

@BirdsAreFlyingCameras
Owner

There has been a delay in the next release, which was supposed to go out earlier this week, due to issues with implementing proxies. I have been unable to find HTTPS proxies that are not already detected by web-scraping defenses and are reliably available, and I also cannot ensure their safety. I am exploring alternatives to work around the site's use of Cloudflare, which is responsible for the IP blocking.

These alternatives include modifying the existing headers to better mimic authentic ones, and creating a separate script to maintain a constant session, preserving the cookie that is assigned at the time of the initial connection and expires when the connection is closed. (Not doing this from the start was a gross oversight in the original code.) Additionally, I am investigating ways to spoof the cf_clearance cookie that Cloudflare assigns when a CAPTCHA is completed.
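A minimal sketch of that constant-session idea, using only the standard library (the User-Agent string is a placeholder): one opener stores whatever cookie the site sets on the first response and resends it on every later request.

```python
import http.cookiejar
import urllib.request

def make_session():
    """Build an opener that keeps cookies alive across every request
    in a run, so the cookie assigned on the initial connection is
    reused instead of a fresh session starting each time."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Placeholder browser-like User-Agent; swap in a real one.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]
    return opener, jar

opener, jar = make_session()
# opener.open("https://dnshistory.org/...") would now store and
# automatically resend the site's session cookie.
```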

Regarding the spoofing of the user IP, I am looking for the most efficient and safe way to tunnel traffic through Tor. The issue with using proxies is that those provided in free lists available on the web have been known to inject malicious code into the HTTP responses. I considered setting up a Docker container to isolate the host OS from any malicious code and running it on a separate subnet from the user's home network to mitigate any possibilities of network-spread malware. However, due to the nature of free proxies being easily compromised by web scraping defenses, this was not a viable solution.

In the event that I implement tunneling through Tor, I will use the Docker container setup due to Tor having many of the same dangers as free proxies.
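If the Tor route wins out, the wiring itself is small. A sketch, assuming a local Tor daemon on its default SOCKS port (9050) and the third-party requests library with its socks extra installed:

```python
# Default Tor SOCKS port; adjust if your torrc differs.
TOR_SOCKS = "socks5h://127.0.0.1:9050"

def tor_proxies(addr: str = TOR_SOCKS) -> dict:
    """Proxy mapping in the shape requests' `proxies=` expects.
    socks5h (not socks5) so DNS resolution also happens inside Tor;
    otherwise lookups leak to the local resolver."""
    return {"http": addr, "https": addr}

# Usage (needs a running Tor daemon and `pip install requests[socks]`):
#   import requests
#   r = requests.get("https://dnshistory.org/", proxies=tor_proxies(),
#                    timeout=30)
```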

I apologize for the delay and hope to release the next update sometime next week.

@venom26
Author

venom26 commented May 25, 2024

Hey mate,
I hope you are doing well.

Thank you so much for the updates.
Maybe you could use something like AWS Lambda or Cloudflare Workers as proxies to route the traffic.

Let me know what you think.

Thanks,
venom26

@BirdsAreFlyingCameras
Owner

Hi,
I want to start by sharing my appreciation for your continued contributions to this project. I've taken this long weekend off from projects, but from a brief look, your idea of using Lambda or Cloudflare Workers as proxies shows a lot of promise. I look forward to investigating further and will return here on Wednesday with a progress report. I'm still hoping to release a new update sometime next week and will be working on it constantly until that release comes to fruition.

Thank you again for your continued contributions and patience.
Hope you had a good weekend.

@BirdsAreFlyingCameras
Owner

BirdsAreFlyingCameras commented May 30, 2024

Hi,

I hope you're doing well. I've had some time to look at using Workers as proxies, and it would work extremely well, but in all my projects I avoid services that require registration or payment, in order to keep them as accessible and anonymous as possible. This poses a lot of obstacles in development, but I think they're worth it to provide the best user experience I can.

This does mean that unless I can find a safe and reliable way to route traffic through proxies, which I have so far been unable to do, I will have to bypass the WAF provided by Cloudflare. That means finding a way around its rate limiting and scraping detection.

In my experience, I get a 403 and a CAPTCHA after 3-5 uses of the program. The program sends 9 requests for every URL, so the website receives 27-45 requests before presenting the 403 and CAPTCHA. This leads me to believe the blocking is not strictly caused by rate limiting, but mainly by the website's scraping-detection mechanism.

I believe the following factors are the main reasons the program is flagged by the website's scraping-detection mechanism:

  • The User-Agent is static
  • The records are accessed in the same order every time
  • The Origin and Referer headers are missing, along with other common browser headers
  • The records are accessed by direct links, not reached through a button press
  • The same DNSHistory cookie is not reused across requests within a usage session
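The first three factors could be addressed by assembling browser-like headers per request and shuffling the access order. A sketch, where the User-Agent pool and header values are illustrative rather than a tested browser fingerprint:

```python
import random

# Illustrative pool; a real rotation list should be larger and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def browser_headers(referer: str = "https://dnshistory.org/") -> dict:
    """Headers that mimic a browser navigation rather than a bare
    scripted GET: rotated User-Agent, Referer, Accept-Language."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": referer,
        "Connection": "keep-alive",
    }

# Shuffling the record order per run addresses the fixed-order factor:
record_types = ["a", "ns", "mx", "cname"]
random.shuffle(record_types)
```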

I'm still aiming for a release this week. I will need to test and implement solutions for the issues mentioned above. I will write a progress report on Friday and post it here.

I want to say thank you again for your continued contributions and patience.

@venom26
Author

venom26 commented May 31, 2024

Hey,
Thank you so much for the update.

You are correct, implementing Workers as proxies in this project would be a bit hard and might not be the right choice. But could you simply add functionality to use a proxy like SOCKS5, with a flag like --proxy? Even that would help, and then you would not have to implement proxies yourself, just support using one on every request.
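A flag like that is only a few lines. A sketch (the flag name and CLI shape are hypothetical, not the project's actual interface):

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI sketch: a single --proxy flag the user fills
    in with whatever proxy they already trust."""
    p = argparse.ArgumentParser(description="hypothetical --proxy sketch")
    p.add_argument("--proxy", metavar="SCHEME://HOST:PORT",
                   help="route all requests through this proxy, "
                        "e.g. socks5h://127.0.0.1:1080")
    return p.parse_args(argv)

def proxies_from(args):
    """Return the mapping shape requests' `proxies=` parameter expects,
    or None to use a direct connection."""
    if not args.proxy:
        return None
    return {"http": args.proxy, "https": args.proxy}
```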

Let me know what you think.

Regards,
venom26

@BirdsAreFlyingCameras
Owner

BirdsAreFlyingCameras commented Jun 5, 2024

Hi,
Hope you're doing well.

I'm sorry I didn't post a progress report last Friday, and for the lack of an update last week. I looked at the idea of using SOCKS5 proxies but decided to explore other options due to security concerns, mainly MITM attacks injecting malicious code into the responses from the website. I did, however, come up with some ways to mitigate the risks via a Docker container that runs unprivileged in a network-isolated environment, to avoid network-based malware and lower the risk of local-scope malware. After a review of the project and my current focus, I think the best move is to pivot back to using proxies, as detailed below.

An issue I will have to deal with is finding SOCKS5 proxies that are up and have not already been detected. This is a challenge, but not an enormous one. I will have to find lists that are updated often, preferably daily, like many you can find here on GitHub, and develop a tester that goes through each proxy in the list, checks connectivity and ping, tests for malicious code injection, and cross-references the IP with websites that log IPs known to be bots.
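The per-proxy check described above can be kept testable by injecting the fetch step. A sketch in which the canary check (a marker string that must survive the round trip unmodified) is a crude stand-in for real injection detection:

```python
import time

def check_proxy(proxy, fetch, canary="<title>", max_latency=5.0):
    """Run one request through `proxy` via the injected `fetch`
    callable and score the result.

    Returns (ok, latency_seconds). `ok` is False if the request
    fails, is too slow, or the canary marker is missing from the
    body (a crude signal the response was tampered with)."""
    start = time.monotonic()
    try:
        body = fetch(proxy)
    except Exception:
        return False, None
    latency = time.monotonic() - start
    return (latency <= max_latency and canary in body), latency
```

In real use, `fetch` would perform an HTTP GET of a known page through the proxy; injecting it keeps the scoring logic testable without live proxies.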

My main concern with SOCKS5 proxies, and proxies in general, has always been that I cannot guarantee their safety, for some of the reasons mentioned previously. But I think it's worth looking into again, given the obstacles I've hit trying to work around the website's CAPTCHA system. The only workaround I've been able to think of is using something like Playwright to solve the CAPTCHAs and retrieve the cf_clearance cookie for use in subsequent requests, but in preliminary testing this proved ineffective: the attempts were detected by the website's anti-bot protections. I don't think it's viable to keep exploring this idea. Beyond the issues in testing, it isn't future-proof, since Cloudflare changes its CAPTCHA system regularly; that would mean constant reviews and code updates following every change, leading to significant downtime and a poor user experience.

This means the current focus shifts to ensuring the safety of proxies, my main concern being, as previously mentioned, MITM attacks leading to malicious code injection. Most injected code consists of ads inserted into the website's response. These don't pose a major security issue in themselves, since they are mainly there to generate revenue rather than to cause harm, though in some cases they can lead the user to a malicious website if clicked or interacted with. The possibility of actual malware is still a relevant concern, and it comes in three main variants: network-based malware such as worms; local-scope malware, such as a crypto miner, that only affects the host computer; and DNS poisoning, causing the requested URL to resolve to an IP that does not belong to dnshistory.org.

SOCKS5 proxies do support HTTPS connections, but that is not a surefire way to ensure safety. There are still security concerns with routing traffic through a proxy of unknown integrity, many of which have been previously mentioned: the request being routed to another domain that could respond with malicious code, the connection having its SSL stripped down to plain HTTP, and DNS poisoning.

The main countermeasures I have been able to think of to combat these issues are as follows:

  • Run the program in a network-isolated environment to mitigate the spread of network-based malware

  • Run the program unprivileged to avoid harm to the host or VM in the event that malware bypasses the precautions in place

  • Limit DNS-server usage by using IPs resolved before the proxy is used to make requests

  • Validate all SSL/TLS certificates to help prevent MITM attacks

  • Restrict open ports to help protect against malicious connections

  • Lock the file system down to a read-only state where possible, to prevent malware being written to it

  • Prevent files from being downloaded
I have come to the conclusion that the best way to enact the steps listed above is, as mentioned before, to use a Docker container. The source code will still work independently of Docker, but I will likely discourage using it outside of a network-isolated VM or sandbox.

In conclusion, all of my concerns can be mitigated, but it will take some time to implement the security features designed to do so. Hopefully I'll have a beta version done by the end of the week, but at the latest I will have it out by Wednesday of next week. I'll return here every couple of days to report my progress.

Thank you again for your help with this project; without it I would have had tunnel vision trying to get this thing to work. Sorry again for not posting a progress report or having a new release last week.

@BirdsAreFlyingCameras BirdsAreFlyingCameras unpinned this issue Jun 7, 2024