
Scraping domains pointing to a cname #13

Open
venom26 opened this issue May 7, 2024 · 7 comments
Assignees
Labels
ASAP Get done asap bug Something isn't working In Progress

Comments

@venom26

venom26 commented May 7, 2024

Hey mate,
This looks like an interesting project, something I was working on just last night before I got banned by the firewall. Could you add a feature to get all subdomains that point to a particular CNAME, along with proxy support so scraping won't be a problem? For example, take a look at the URL below.

https://dnshistory.org/points-to/cname/1/mytechcmsprod.azureedge.net

Thank you
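As a sketch of what the requested feature might look like: assuming dnshistory.org's "points-to" pages link each result under a /dns-records/ path (a guess at the page markup, not a verified selector), the extraction step could be as small as:

```python
import re

def extract_domains(html: str) -> list:
    """Pull domain names out of a dnshistory.org 'points-to' results
    page. The /dns-records/ link pattern is an assumption about the
    markup, not a tested selector."""
    return sorted(set(re.findall(r'/dns-records/([A-Za-z0-9.-]+)', html)))
```

The proxy side of the request is the harder part and is discussed at length in the comments below.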

@BirdsAreFlyingCameras BirdsAreFlyingCameras self-assigned this May 18, 2024
@BirdsAreFlyingCameras BirdsAreFlyingCameras added bug Something isn't working ASAP Get done asap labels May 18, 2024
@BirdsAreFlyingCameras
Owner

BirdsAreFlyingCameras commented May 18, 2024

Hi,
Sorry for the late reply; I'm just now seeing this issue. I'll start work on this today and include it in the next version, which should be released early next week. I'll dig up some old code I have for a proxy retriever/validator and implement a rotating system. Thank you for contributing; if you have any improvements, pull requests are always welcome.

@BirdsAreFlyingCameras
Owner

There has been a delay in the next release, which was supposed to go out earlier this week, due to issues with implementing proxies. I have been unable to find HTTPS proxies that are not already detected by web-scraping defenses and are reliably available, and I also cannot ensure their safety. I am exploring alternatives to work around the site's use of Cloudflare, which is responsible for the IP blocking.

These alternatives include modifying the existing headers to better mimic authentic ones, and creating a separate script to maintain a constant session, preserving the cookie that is assigned at the time of the initial connection and expires when the connection is closed. (Not doing this from the start was a gross oversight in the original code.) Additionally, I am investigating ways to spoof the cf_clearance cookie that Cloudflare assigns when a CAPTCHA is completed.
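A minimal sketch of that constant-session idea, using only the standard library (the User-Agent string is a placeholder): one opener stores whatever cookie the site sets on the first response and resends it on every later request.

```python
import http.cookiejar
import urllib.request

def make_session():
    """Build an opener that keeps cookies alive across every request
    in a run, so the cookie assigned on the initial connection is
    reused instead of a fresh session starting each time."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Placeholder browser-like User-Agent; swap in a real one.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]
    return opener, jar

opener, jar = make_session()
# opener.open("https://dnshistory.org/...") would now store and
# automatically resend the site's session cookie.
```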

Regarding the spoofing of the user IP, I am looking for the most efficient and safe way to tunnel traffic through Tor. The issue with using proxies is that those provided in free lists available on the web have been known to inject malicious code into the HTTP responses. I considered setting up a Docker container to isolate the host OS from any malicious code and running it on a separate subnet from the user's home network to mitigate any possibilities of network-spread malware. However, due to the nature of free proxies being easily compromised by web scraping defenses, this was not a viable solution.

In the event that I implement tunneling through Tor, I will use the Docker container setup due to Tor having many of the same dangers as free proxies.
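If the Tor route wins out, the wiring itself is small. A sketch, assuming a local Tor daemon on its default SOCKS port (9050) and the third-party requests library with its socks extra installed:

```python
# Default Tor SOCKS port; adjust if your torrc differs.
TOR_SOCKS = "socks5h://127.0.0.1:9050"

def tor_proxies(addr: str = TOR_SOCKS) -> dict:
    """Proxy mapping in the shape requests' `proxies=` expects.
    socks5h (not socks5) so DNS resolution also happens inside Tor;
    otherwise lookups leak to the local resolver."""
    return {"http": addr, "https": addr}

# Usage (needs a running Tor daemon and `pip install requests[socks]`):
#   import requests
#   r = requests.get("https://dnshistory.org/", proxies=tor_proxies(),
#                    timeout=30)
```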

I apologize for the delay and hope to release the next update sometime next week.

@venom26
Author

venom26 commented May 25, 2024

Hey mate,
I hope you are doing well.

Thank you so much for the updates.
Maybe you could use something like AWS Lambda or Cloudflare Workers as proxies to route the traffic.

Let me know what you think.

Thanks,
venom26

@BirdsAreFlyingCameras
Owner

Hi,
I want to start by sharing my appreciation for your continued contributions to this project. I've taken this long weekend off from projects, but from a brief look, your idea of using Lambda or Cloudflare Workers as proxies shows a lot of promise. I look forward to investigating further and will return here on Wednesday with a progress report. I'm still hoping to release a new update sometime next week and will be working on it constantly until that release comes to fruition.

Thank you again for your continued contributions and patience.
Hope you had a good weekend.

@BirdsAreFlyingCameras
Owner

BirdsAreFlyingCameras commented May 30, 2024

Hi,

I hope you're doing well. I've had some time to look at using Workers as proxies, and it would work extremely well, but in all my projects I avoid services that require registration or payment, in order to keep them as accessible and anonymous as possible. This poses a lot of obstacles in development, but I think they're worth it to provide the best user experience I can.

This does mean that unless I can find a safe and reliable way to route traffic through proxies, which I have so far been unable to do, I will have to bypass the WAF provided by Cloudflare. That means finding a way around its rate limiting and scraping detection.

In my experience, I get a 403 and a CAPTCHA after 3-5 uses of the program. The program sends 9 requests for every URL, so the website receives 27-45 requests before presenting the 403 and CAPTCHA. This leads me to believe the blocking is not strictly caused by rate limiting, but mainly by the website's scraping-detection mechanism.

I believe the following factors are the main reasons the program is flagged by the website's scraping-detection mechanism:

  • The User-Agent is static
  • The records are accessed in the same order every time
  • The Origin and Referer headers are missing, along with other common browser headers
  • The records are accessed by direct links, not reached through a button press
  • The same DNSHistory cookie is not reused across requests within a usage session
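The first three factors could be addressed by assembling browser-like headers per request and shuffling the access order. A sketch, where the User-Agent pool and header values are illustrative rather than a tested browser fingerprint:

```python
import random

# Illustrative pool; a real rotation list should be larger and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def browser_headers(referer: str = "https://dnshistory.org/") -> dict:
    """Headers that mimic a browser navigation rather than a bare
    scripted GET: rotated User-Agent, Referer, Accept-Language."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": referer,
        "Connection": "keep-alive",
    }

# Shuffling the record order per run addresses the fixed-order factor:
record_types = ["a", "ns", "mx", "cname"]
random.shuffle(record_types)
```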

I'm still aiming for a release this week. I will need to test and implement solutions for the issues mentioned above. I will write a progress report on Friday and post it here.

I want to say thank you again for your continued contributions and patience.

@venom26
Author

venom26 commented May 31, 2024

Hey,
Thank you so much for the update.

You are correct, implementing Workers as proxies in this project would be a bit hard and might not be the right choice. But could you simply add functionality to use a proxy like SOCKS5, with a flag like --proxy? Even that would help, and then you would not have to implement proxies yourself, just support using one on every request.
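A flag like that is only a few lines. A sketch (the flag name and CLI shape are hypothetical, not the project's actual interface):

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI sketch: a single --proxy flag the user fills
    in with whatever proxy they already trust."""
    p = argparse.ArgumentParser(description="hypothetical --proxy sketch")
    p.add_argument("--proxy", metavar="SCHEME://HOST:PORT",
                   help="route all requests through this proxy, "
                        "e.g. socks5h://127.0.0.1:1080")
    return p.parse_args(argv)

def proxies_from(args):
    """Return the mapping shape requests' `proxies=` parameter expects,
    or None to use a direct connection."""
    if not args.proxy:
        return None
    return {"http": args.proxy, "https": args.proxy}
```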

Let me know what you think.

Regards,
venom26

@BirdsAreFlyingCameras
Owner

BirdsAreFlyingCameras commented Jun 5, 2024

Hi,
Hope you're doing well.

I'm sorry I didn't post a progress report last Friday, and for the lack of an update last week. I looked at the idea of using SOCKS5 proxies but decided to explore other options due to security concerns, mainly MITM attacks injecting malicious code into the responses from the website. I did, however, come up with some ways to mitigate the risks via a Docker container that runs unprivileged in a network-isolated environment, to avoid network-based malware and lower the risk of local-scope malware. After a review of the project and my current focus, I think the best move is to pivot back to using proxies, as detailed below.

An issue I will have to deal with is finding SOCKS5 proxies that are up and have not already been detected. This is a challenge, but not an enormous one. I will have to find lists that are updated often, preferably daily, like many you can find here on GitHub, and develop a tester that goes through each proxy in the list, checks connectivity and ping, tests for malicious code injection, and cross-references the IP with websites that log IPs known to be bots.
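The per-proxy check described above can be kept testable by injecting the fetch step. A sketch in which the canary check (a marker string that must survive the round trip unmodified) is a crude stand-in for real injection detection:

```python
import time

def check_proxy(proxy, fetch, canary="<title>", max_latency=5.0):
    """Run one request through `proxy` via the injected `fetch`
    callable and score the result.

    Returns (ok, latency_seconds). `ok` is False if the request
    fails, is too slow, or the canary marker is missing from the
    body (a crude signal the response was tampered with)."""
    start = time.monotonic()
    try:
        body = fetch(proxy)
    except Exception:
        return False, None
    latency = time.monotonic() - start
    return (latency <= max_latency and canary in body), latency
```

In real use, `fetch` would perform an HTTP GET of a known page through the proxy; injecting it keeps the scoring logic testable without live proxies.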

My main concern with SOCKS5 proxies, and proxies in general, has always been that I cannot guarantee their safety, for some of the reasons mentioned previously. But I think it's worth looking into again, given the obstacles I've hit trying to work around the website's CAPTCHA system. The only workaround I've been able to think of is using something like Playwright to solve the CAPTCHAs and retrieve the cf_clearance cookie for use in subsequent requests, but in preliminary testing this proved ineffective: the attempts were detected by the website's anti-bot protections. I don't think it's viable to keep exploring this idea. Beyond the issues in testing, it isn't future-proof, since Cloudflare changes its CAPTCHA system regularly; that would mean constant reviews and code updates following every change, leading to significant downtime and a poor user experience.

This means the current focus shifts to ensuring the safety of proxies, my main concern being, as previously mentioned, MITM attacks leading to malicious code injection. Most injected code consists of ads inserted into the website's response. These don't pose a major security issue in themselves, since they are mainly there to generate revenue rather than to cause harm, though in some cases they can lead the user to a malicious website if clicked or interacted with. The possibility of actual malware is still a relevant concern, and it comes in three main variants: network-based malware such as worms; local-scope malware, such as a crypto miner, that only affects the host computer; and DNS poisoning, causing the requested URL to resolve to an IP that does not belong to dnshistory.org.

SOCKS5 proxies do support HTTPS connections, but that is not a surefire way to ensure safety. There are still security concerns with routing traffic through a proxy of unknown integrity, many of which have been previously mentioned: the request being routed to another domain that could respond with malicious code, the connection having its SSL stripped down to plain HTTP, and DNS poisoning.

The main countermeasures I have been able to think of to combat these issues are as follows:

  • Run the program in a network-isolated environment to mitigate the spread of network-based malware

  • Run the program unprivileged to avoid harm to the host or VM in the event that malware bypasses the precautions in place

  • Limit DNS-server usage by using IPs resolved before the proxy is used to make requests

  • Validate all SSL/TLS certificates to help prevent MITM attacks

  • Restrict open ports to help protect against malicious connections

  • Lock the file system down to a read-only state where possible, to prevent malware being written to it

  • Prevent files from being downloaded
I have come to the conclusion that the best way to enact the steps listed above is, as mentioned before, to use a Docker container. The source code will still work independently of Docker, but I will likely discourage using it outside of a network-isolated VM or sandbox.

In conclusion, all of my concerns can be mitigated, but it will take some time to implement the security features designed to do so. Hopefully I'll have a beta version done by the end of the week, but at the latest I will have it out by Wednesday of next week. I'll return here every couple of days to report my progress.

Thank you again for your help with this project; without it I would have had tunnel vision trying to get this thing to work. Sorry again for not posting a progress report or having a new release last week.

@BirdsAreFlyingCameras BirdsAreFlyingCameras unpinned this issue Jun 7, 2024