Parser / Scraper for classified advertisements and phone numbers from site www.olx.ua
This application downloads classified ads and phone numbers from www.olx.ua and saves them to an H2 database. It can scrape a large number of ads and phone numbers in a short time.
- Install Docker with docker-compose.
- Download the source code. Run
git clone https://github.com/stanikol/olx
- Change dir to 'olx/docker'.
cd olx/docker
- Run
docker-compose up --build
for the first time, then use just docker-compose up.
- Wait until sbt downloads all the required libraries and compiles the sources. This may take a while on the first run.
- Open http://localhost:8080/olx in your browser.
- Set the search parameters and press "Start".
- The H2 database with the downloaded ads is stored in "olx/db/olxdb.mv.db".
- To export the results to CSV, run
call CSVWRITE('out.csv', 'select * from ADS');
from the DB page. The CSV file is saved to olx/db/out.csv.
You can also send POST requests to http://localhost:8080/olx/run to start downloads. Parameters are:
- url - OLX search URL
- name - name of the search
- count - max number of advertisements to download
- parsePhones - when true, parse and save phone numbers
You can send these params with POST requests as form data.
curl -X POST http://localhost:8080/olx/run \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode "name=Test-1" \
--data-urlencode "count=5" \
--data-urlencode "url=https://www.olx.ua/uk/nedvizhimost/odessa/q-%D1%81%D0%BE%D0%B2%D1%96%D0%BD%D1%8C%D0%BE%D0%BD/?currency=UAH"
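The request above does not set parsePhones. As a minimal sketch, the call can be wrapped in a small bash function that sends all four parameters. The function name olx_run and the OLX_DRY_RUN switch are hypothetical and not part of the app; the sketch also assumes the server expects the literal string true for parsePhones, and it simply omits the parameter otherwise.

```shell
#!/usr/bin/env bash
# olx_run NAME COUNT URL [PARSE_PHONES]
# Hypothetical wrapper around the /olx/run endpoint; not shipped with the app.
olx_run() {
  local name=$1 count=$2 url=$3 parse_phones=${4:-false}
  # Build the command as an array so every argument stays correctly quoted.
  local cmd=(curl -X POST http://localhost:8080/olx/run
    -H "Content-Type: application/x-www-form-urlencoded"
    --data-urlencode "name=$name"
    --data-urlencode "count=$count"
    --data-urlencode "url=$url")
  # Assumption: phones are parsed only when parsePhones is the string "true".
  if [ "$parse_phones" = true ]; then
    cmd+=(--data-urlencode "parsePhones=true")
  fi
  if [ -n "${OLX_DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[@]}"   # dry run: print one argument per line
  else
    "${cmd[@]}"                 # send the request
  fi
}
```

For example, olx_run "Test-1" 5 "$URL" true starts a search that also saves phone numbers, where URL holds an OLX search URL. Setting OLX_DRY_RUN=1 before the call prints the curl arguments instead of sending anything.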
Note that the search term inside the URL must be URL-encoded. You can use jq, for example:
SEARCH=$(echo "совіньон" | jq --raw-input --raw-output @uri)
curl -X POST http://localhost:8080/olx/run \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode "name=Test-1" \
--data-urlencode "count=5" \
--data-urlencode "url=https://www.olx.ua/uk/nedvizhimost/odessa/q-$SEARCH/?currency=UAH"
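If jq is not installed, the search term can also be encoded in bash itself. The following is only a sketch: urlencode is a hypothetical helper, not something the app provides. It assumes bash plus the standard od and tr utilities, and it percent-encodes every byte except the RFC 3986 unreserved characters.

```shell
#!/usr/bin/env bash
# urlencode STRING
# Hypothetical helper: percent-encode every byte except the RFC 3986
# unreserved characters (A-Z a-z 0-9 - . _ ~).
urlencode() {
  local hex ch out=''
  # od dumps the raw bytes as hex pairs, so the result does not depend on
  # the current locale; tr upper-cases the hex digits.
  for hex in $(printf '%s' "$1" | od -An -v -tx1 | tr 'a-f' 'A-F'); do
    case $hex in
      2D|2E|5F|7E|3[0-9]|4[1-9A-F]|5[0-9A]|6[1-9A-F]|7[0-9A])
        printf -v ch "\\x$hex"   # unreserved byte: keep the literal character
        out+=$ch ;;
      *)
        out+="%$hex" ;;          # any other byte becomes %XX
    esac
  done
  printf '%s\n' "$out"
}

SEARCH=$(urlencode "совіньон")
echo "https://www.olx.ua/uk/nedvizhimost/odessa/q-$SEARCH/?currency=UAH"
```

The resulting SEARCH value can be used in the url parameter exactly as in the jq example above.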
To stop all downloads, run
curl -X POST http://localhost:8080/olx/stop
This app relies on several great open source projects.
The code is licensed under Apache License v2.0.