Move to Playwright instead of requests #6

Merged 1 commit on Jun 26, 2024
2 changes: 1 addition & 1 deletion .github/workflows/build-container.yml
@@ -3,7 +3,7 @@ name: build-container
 on:
   push:
     tags:
-      - '?[0-9]+.[0-9]+.[0-9]+'
+      - 'v[0-9]+.[0-9]+.[0-9]+'

 jobs:
   build-container:
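
Note on the tag filter: after this change the workflow only runs for tags that carry a `v` prefix. GitHub Actions filters are glob patterns rather than regular expressions, so the snippet below is only an approximation. It is a minimal Python sketch, assuming the dot is literal and `[0-9]+` means "one or more digits", that checks which tag names the new pattern would accept. The helper name `matches_release_tag` is hypothetical and not part of this PR.

```python
import re

# Regex approximation of the workflow glob 'v[0-9]+.[0-9]+.[0-9]+'.
# Assumption: the dot is a literal character in the filter syntax,
# so it is escaped here.
RELEASE_TAG = re.compile(r"v[0-9]+\.[0-9]+\.[0-9]+")


def matches_release_tag(tag: str) -> bool:
    """Return True if the whole tag name looks like vMAJOR.MINOR.PATCH."""
    return RELEASE_TAG.fullmatch(tag) is not None


for candidate in ("v1.2.3", "1.2.3", "v1.2", "v10.0.12"):
    print(candidate, matches_release_tag(candidate))
```

Under these assumptions, a bare `1.2.3` tag would no longer trigger the container build, so existing release habits may need a `v` prefix.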
6 changes: 5 additions & 1 deletion Dockerfile
@@ -17,12 +17,16 @@ COPY . .
 RUN poetry build --format wheel


-FROM python:3.12-alpine
+FROM python:3.12-slim

 VOLUME /app

 COPY --from=compiler /app/dist/*.whl /

 RUN pip3 install --no-cache-dir -- *.whl

+RUN playwright install --with-deps firefox
+
+ENV SB__BROWSER__TYPE="firefox"
+
 ENTRYPOINT python3 -m scraper_bot
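
Note on `SB__BROWSER__TYPE`: the name suggests an `SB__` prefix with `__` as a nesting delimiter, so the variable would map to something like `browser.type`. The diff does not show how `scraper_bot` actually reads it, so the following is only a sketch of that convention under that assumption. The function `nested_settings` and its defaults are hypothetical.

```python
from typing import Any, Mapping


def nested_settings(
    env: Mapping[str, str], prefix: str = "SB__", delimiter: str = "__"
) -> dict[str, Any]:
    """Collect prefixed variables into a nested dict of lowercase keys.

    Assumption: no variable is both a leaf and a branch
    (e.g. SB__BROWSER and SB__BROWSER__TYPE are not set together).
    """
    settings: dict[str, Any] = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue  # unrelated variable such as PATH
        parts = name[len(prefix):].lower().split(delimiter)
        node = settings
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return settings


print(nested_settings({"SB__BROWSER__TYPE": "firefox", "PATH": "/usr/bin"}))
```

If the application follows this convention, the image default could be overridden at run time by setting the same variable on the container.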
4 changes: 2 additions & 2 deletions README.md
@@ -22,7 +22,7 @@ As alternative, you can build by yourself the python package or the container
 ### Fast deploy (docker-compose)

 1. [Create a telegram bot](https://core.telegram.org/bots#3-how-do-i-create-a-bot) and retrieve its token
-2. Download `config.yaml` and put into `/etc/scraperbot` folder
+2. Download `config.example.yaml` and rename it to `config.yaml`
 3. Change the configuration following the [guidelines](#configuration)
 4. Download `docker-compose.yaml`
 5. Start the scraper with `docker-compose`
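
Note on step 2: the README now asks users to turn `config.example.yaml` into `config.yaml` themselves. One way to script that step without clobbering an already edited configuration is sketched below. The helper `prepare_config` is hypothetical and the file locations are assumptions; the PR itself does not ship such a script.

```python
import shutil
from pathlib import Path


def prepare_config(example: Path, target: Path) -> bool:
    """Copy the example config to `target` unless `target` already exists.

    Returns True when a new file was created, False when nothing was done.
    """
    if target.exists():
        return False  # never overwrite a config the user already edited
    shutil.copyfile(example, target)
    return True
```

A call such as `prepare_config(Path("config.example.yaml"), Path("config.yaml"))` would then be run once before starting the scraper with `docker-compose`.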
@@ -44,4 +44,4 @@ Furthermore you can get the config json schema from command line with `--config-
 scraper_bot --config-schema
 ```

-You can also find a configuration example in `config.yaml`.
+You can also find a configuration example in `config.example.yaml`.
42 changes: 42 additions & 0 deletions config.example.yaml
@@ -0,0 +1,42 @@
+#######################
+# Example config.yaml #
+#######################
+# This file contains an example configuration
+# intended for finding real estate ads.
+# In particular, we look for an apartment
+# in Milano with at least three rooms
+notifications:
+  message: |
+    # [{{title}}]({{url}})
+    {% if location %}📍 *{{location}}*{% endif %}
+    {% if price %}💶 *{{price}}€*{% endif %}
+    {% if size %}📐 *{{size}}m²*{% endif %}
+  format: markdown
+  channels:
+    # A list of apprise-supported channels
+    # to which the scraped entities are sent
+    - "tgram://{YOUR_BOT_TOKEN}/{CHAT_ID1}"
+    - "tgram://{YOUR_BOT_TOKEN}/{CHAT_ID2}"
+    - message: "Found a new ad at {{url}}"
+      format: "text"
+      uri: "discord://webhook_id/webhook_token"
+tasks:
+  - name: "immobiliare.it"
+    url: "https://www.immobiliare.it/affitto-case/lodi/?criterio=rilevanza&localiMinimo=3"
+    target: |
+      [...document.querySelectorAll("li.in-searchLayoutListItem")].map(t =>({
+        url: t.querySelector("a.in-listingCardTitle")?.href,
+        title: t.querySelector("a.in-listingCardTitle")?.innerText,
+        price: t.querySelector(".in-listingCardPrice span")?.innerText,
+        size: t.querySelector(".in-listingCardFeatureList__item:nth-child(2) span")?.innerText.replace(/[^0-9]+/g,"")
+      }))
+  - name: "mioaffitto"
+    url: "https://www.mioaffitto.it/search?provincia=50&poblacion=67355"
+    target: |
+      [...document.querySelectorAll(".property-list .propertyCard:not(.property-alternative)")].map(t=> ({
+        url: t.querySelector("a")?.href,
+        title: t.querySelector("a p")?.innerText,
+        price: t.querySelector(".propertyCard__price--value")?.innerText.replace(/[^0-9]+/g,""),
+        size: t.querySelector(".propertyCard__details li:has(.fa-size-o)")?.innerText.replace(/[^0-9]+/g,""),
+        location: t.querySelector(".propertyCard__location p")?.innerText
+      }))
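
Note on the `target` snippets: several fields are normalized with `?.innerText.replace(/[^0-9]+/g,"")`, which keeps only the digits of the element text and yields no value when the element is missing. For readers less familiar with the JavaScript idiom, here is a rough Python counterpart. The function name `digits_only` and the sample strings are illustrative assumptions, not data taken from the scraped sites.

```python
import re
from typing import Optional


def digits_only(text: Optional[str]) -> Optional[str]:
    """Keep only ASCII digits, like `.replace(/[^0-9]+/g, "")` in the config.

    A missing value is passed through, mirroring how optional chaining
    short-circuits when `querySelector` finds no element.
    """
    if text is None:
        return None
    return re.sub(r"[^0-9]+", "", text)


print(digits_only("€ 1.200/mese"))  # thousands separator is dropped as well
print(digits_only(None))
```

One consequence of this approach is that decimal separators are stripped too, so a value such as `1.200,50` would collapse into `120050`; the selectors in the example appear to target whole-number fields, where this does not matter.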
38 changes: 0 additions & 38 deletions config.yaml

This file was deleted.

156 changes: 126 additions & 30 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions pyproject.toml
@@ -17,9 +17,7 @@ classifiers=[

 [tool.poetry.dependencies]
 python = "^3.12"
-beautifulsoup4 = ">=4.10.0,<4.11.0"
 redis = "^4.6.0"
-requests = "^2.32.3"
 ischedule = ">=1.2.2,<1.3.0"
 pyyaml = ">=6.0,<7.0"
 pydantic = "^2.7.4"
@@ -28,6 +26,8 @@ termcolor = "^2.4.0"
 urllib3 = "^2.2.2"
 apprise = "^1.8.0"
 jinja2 = "^3.1.4"
+playwright = "^1.44.0"
+playwright-stealth = "^1.0.6"


 [tool.poetry.group.dev.dependencies]
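
Note on the new constraints: both Playwright packages are pinned with caret requirements, which allow updates that do not change the leftmost non-zero version component. The sketch below spells out the resulting range for full `X.Y.Z` versions only. The helper `caret_bounds` is hypothetical and is not how Poetry itself is implemented.

```python
def caret_bounds(version: str) -> tuple[str, str]:
    """Return (inclusive lower, exclusive upper) for a caret constraint.

    Only full X.Y.Z versions are handled in this sketch.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return version, ".".join(str(part) for part in upper)


for spec in ("1.44.0", "1.0.6"):
    low, high = caret_bounds(spec)
    print(f"^{spec} allows >={low},<{high}")
```

Since Playwright browser builds are tied to the library version, the `playwright install` step in the Dockerfile will fetch whichever Firefox build matches the version resolved in `poetry.lock`.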