Fix Immoscout24 Captcha Resolution #630

Open
wants to merge 3 commits into
base: main

@DerLeole DerLeole commented Aug 16, 2024

Solves #577 #589 #513

Immoscout seems to have moved entirely from GeeTest to AWS WAF Captcha.
This PR implements a new solver, Capmonster, to deal with that.

My reasoning behind that:
After trying for hours to get any of the existing captcha solvers to work with the new clientside AWS WAF javascript captchas, I caved and implemented @jukoson's fix using capmonster as a solver in a way that mimics the other implementations.
It should be entirely backwards compatible.

To use the new solver just modify your ENV variables or config:

captcha:
  capmonster:
    api_key: meow

Shortcomings:

  • The new solver implementation only supports the aws captcha for now, but I can change that down the line.
  • There is also some leftover code that, in theory, sniffs background requests to get iv, context (or whatever it is called now), etc. from the loaded JavaScript. However, supplying all that additional information leads to unsolvable captchas for reasons I could not determine, and yielded wrong results with 2captcha in my trials. That's why these values are discarded in the solver itself.
  • The accidentally leaked API keys included in one commit have been rotated :D

Hope this helps!

codders commented Aug 17, 2024

Hi @DerLeole,

First of all, thank you so much for taking the time to implement this. It's a real gift to have an active community on a project like flathunter, and it saves me a lot of stress and headache when people step up and make contributions.

I'll add some feedback for the review - I'll try and be clear about what I consider mandatory for merging and what's just optional. But ultimately if you don't want to implement the feedback you can also just say and I will happily tidy this up for you and get it merged (and will try and preserve your commits so that you also get the attribution).

I signed up at Capmonster and did a test locally and the code works, so that's amazing for a first contribution. Re. leaking keys, you can rebase your commits (git rebase -i HEAD~3) and squash them together so that the commit including the keys disappears (I know it doesn't make a difference now that you've rotated the key, but nevertheless).

I'll test later whether it works for the flathunter cloud deployment. The 2captcha implementation is useless at this point anyway, so your implementation certainly can't be worse :)

Thanks again for the contribution,

Arthur

@codders codders left a comment

Thanks again for this. You've done an amazing job of following the style and layout of the code that's there - excellent work!


# Intercept background network traffic via log sniffing
sleep(2)
logs_raw = driver.get_log("performance")

This is some deep magic right here. Well done for finding your way around this.
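
For readers unfamiliar with the technique, here is a minimal sketch of what "log sniffing" can look like. It assumes Chrome performance logging is enabled and that each entry returned by driver.get_log("performance") carries a JSON string under "message" that wraps a DevTools event. The function name, the sample entries, and the example.invalid URLs are invented for illustration and are not taken from this PR.

```python
import json


def extract_response_urls(log_entries, needle):
    """Return URLs of Network.responseReceived events whose URL contains `needle`."""
    urls = []
    for entry in log_entries:
        # Each entry's "message" field is a JSON string wrapping a DevTools event.
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        url = event.get("params", {}).get("response", {}).get("url", "")
        if needle in url:
            urls.append(url)
    return urls


# Hand-written sample entries in the assumed shape (not real captured traffic).
sample_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"response": {"url": "https://example.invalid/captcha/challenge.js"}},
    }})},
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"request": {"url": "https://example.invalid/captcha/challenge.js"}},
    }})},
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"response": {"url": "https://example.invalid/style.css"}},
    }})},
]

print(extract_response_urls(sample_logs, "challenge"))
```

Keeping the parsing in a pure function like this makes it possible to test the filtering without a running browser.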

iv = response_json["state"]["iv"]
context = response_json["state"]["payload"]
sitekey = response_json["key"]


Nitpick - not sure why we need a double blank line here. One would be plenty.

"""Resolve AWS WAF Captcha"""

# Intercept background network traffic via log sniffing
sleep(2)

Generally it would be nice to avoid arbitrary 'sleep's in the code, but I appreciate that we're doing weird network magic here with an uncooperative third-party system, so for the sake of having things work I'm happy to leave this in.
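
One common alternative to a fixed sleep is to poll for a condition with a timeout. The sketch below is a generic helper, not code from this PR; wait_until and its parameters are hypothetical names, and whether a suitable condition exists for the AWS WAF traffic is an open question.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Call `predicate` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Usage: a stand-in condition that becomes true on the third check.
attempts = {"count": 0}


def third_time_lucky():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(third_time_lucky, timeout=1.0, interval=0.01)
```

Compared with sleep(2), this returns as soon as the condition holds and fails loudly if it never does.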

patternChallenge = r'src="([^"]*challenge\.js)"'
challenge_matches = re.findall(patternChallenge, driver.page_source)
for match in challenge_matches:
    print(f'Challenge SRC Value: {match}')

Please replace print with logger.debug where it appears in this file.

patternJsApi = r'src="([^"]*jsapi\.js)"'
jsapi_matches = re.findall(patternJsApi, driver.page_source)
for match in jsapi_matches:
    print(f'JsApi SRC Value: {match}')

Please replace print with logger.debug. Also, for calls to the logger you'll probably find that the linter prefers a %s placeholder and a second argument over an f-string (because then it doesn't have to do the string interpolation if the log call isn't triggered).
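
A minimal sketch of the difference, using only the standard logging module. The Tracked class and the logger name are made up for illustration and are not part of flathunter.

```python
import logging

logger = logging.getLogger("lazy_logging_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)  # DEBUG calls are filtered out at this level


class Tracked:
    """Counts how often it is converted to a string."""

    def __init__(self, value):
        self.value = value
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return self.value


eager = Tracked("challenge.js")
lazy = Tracked("challenge.js")

# f-string: the message is built before logger.debug is even called.
logger.debug(f"Challenge SRC Value: {eager}")

# %s plus argument: formatting is deferred, and skipped because DEBUG is disabled.
logger.debug("Challenge SRC Value: %s", lazy)

print(eager.str_calls, lazy.str_calls)
```

With the logger at INFO, the f-string variant still pays for the string conversion, while the %s variant never touches the argument.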

@@ -66,6 +67,7 @@ def __retrieve_2captcha_result(self, captcha_id: str):
"key": self.api_key,
"action": "get",
"id": captcha_id,
"json": 0,

What does this do?
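
For context, and assuming this follows the documented 2captcha res.php convention: setting "json" to 0 requests a plain-text reply of the form OK|<token>, which is the shape the split("|", 1) call in this function relies on. A hypothetical stand-alone parser for that shape (the function name is invented here) might look like:

```python
def parse_plain_reply(text):
    """Split a plain-text 'OK|<token>' reply; raise ValueError for anything else."""
    if not text.startswith("OK|"):
        raise ValueError(f"unexpected reply: {text!r}")
    # maxsplit=1 keeps any further '|' characters inside the token intact.
    return text.split("|", 1)[1]


print(parse_plain_reply("OK|abc|def"))
```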

@@ -87,4 +89,4 @@ def __retrieve_2captcha_result(self, captcha_id: str):
if not retrieve_response.text.startswith("OK"):
    raise requests.HTTPError(response=retrieve_response)

return retrieve_response.text.split("|", 1)[1]
return retrieve_response.text.split("|", 1)[1]

Seems like an unnecessary whitespace change - please revert this.

@@ -59,13 +59,12 @@ def get_chrome_driver(driver_arguments):
"""Configure Chrome WebDriver"""
logger.info('Initializing Chrome WebDriver for crawler...')
chrome_options = uc.ChromeOptions() # pylint: disable=no-member
if platform == "darwin":

Did you deliberately remove this? What happens if you add it back? (I'm testing on Linux, so I don't use this code path).

@@ -36,6 +37,7 @@ class Env:
# Captcha setup
FLATHUNTER_2CAPTCHA_KEY = _read_env("FLATHUNTER_2CAPTCHA_KEY")
FLATHUNTER_IMAGETYPERZ_TOKEN = _read_env("FLATHUNTER_IMAGETYPERZ_TOKEN")
FLATHUNTER_CAPMONSTER_KEY = _read_env("FLATHUNTER_CAPMONSTER_KEY")

Nice! Thanks for wiring up the environment config!
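
As an illustration of how the environment variable and the captcha.capmonster.api_key config entry from the PR description could fit together, here is a minimal sketch. The helper name resolve_capmonster_key is hypothetical, and the precedence (environment first, then config) is an assumption rather than something confirmed by this PR.

```python
import os


def resolve_capmonster_key(config, environ=os.environ):
    """Return the Capmonster API key, or None if neither source provides one."""
    # Assumption: the environment variable wins over the config file.
    env_key = environ.get("FLATHUNTER_CAPMONSTER_KEY")
    if env_key:
        return env_key
    return config.get("captcha", {}).get("capmonster", {}).get("api_key")


# Usage with the config shape from the PR description and an empty environment.
example_config = {"captcha": {"capmonster": {"api_key": "meow"}}}
print(resolve_capmonster_key(example_config, environ={}))
```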

@@ -124,6 +124,7 @@ def get_entries_from_javascript(self):
logger.error(
"IS24 bot detection has identified our script as a bot - we've been blocked"
)
logger.info(self.get_driver_force().page_source)

Is this info? Or is it enough if this is debug?

codders commented Aug 17, 2024

Also, the Linter has a bunch of feedback. There are some C messages that have always been there, but the W issues are new and I would need them gone before we can merge the code. Likewise, pyright isn't happy with the types - please resolve the typing issues.

codders added a commit to codders/flathunter that referenced this pull request Sep 17, 2024
codders added a commit to codders/flathunter that referenced this pull request Sep 17, 2024
@devHaitham481

I added my API key after adding funds, but I still get:

[2024/11/13 15:44:50|_common.py              |INFO    ]: Backing off resolve_amazon(...) for 0.4s (flathunter.captcha.captcha_solver.CaptchaUnsolvableError)
[2024/11/13 15:44:58|twocaptcha_solver.py    |INFO    ]: Trying to solve amazon.
[2024/11/13 15:44:58|_common.py              |INFO    ]: Backing off resolve_amazon(...) for 0.9s (flathunter.captcha.captcha_solver.CaptchaUnsolvableError)
[2024/11/13 15:45:06|twocaptcha_solver.py    |INFO    ]: Trying to solve amazon.
[2024/11/13 15:45:06|_common.py              |ERROR   ]: Giving up resolve_amazon(...) after 3 tries (flathunter.captcha.captcha_solver.CaptchaUnsolvableError)
[2024/11/13 15:45:06|hunter.py               |INFO    ]: Error while scraping url https://www.immobilienscout24.de/Suche/radius/wohnung-mieten?centerofsearchaddress=Hamburg;20459;;;;;&numberofrooms=1.5-2.0&livingspace=35.0-65.0&exclusioncriteria=projectlisting,swapflat&pricetype=rentpermonth&geocoordinates=53.5475;9.97941;10.0&enteredFrom=result_list: the captcha was unsolvable

I'm not sure whether I'm missing something or whether Immoscout changed things on their end again.
