This crawler scrapes all houses sold in the past 3 months in the U.S. and saves them into "gp/gp/houses1.csv" and "gp/gp/houses2.csv", which are easy to import into either MongoDB or MySQL. Temporary downloaded files can be found in "gp/gp/spiders/download". It scrapes the following fields:
- url to the house
- sold date
- price
- baths
- beds
- yearbuilt
- sqft
- lotsize
- type
- daysonmarket
- state
- county
- city
- address
- latitude
- longitude
- zipcode

However, the website may be missing some of these fields; in that case the field is left blank.
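As one possible way to move the scraped CSV into a database, the sketch below reads rows into dictionaries keyed by the header names; the sample data and the pymongo collection names are illustrative assumptions, not part of this project:

```python
import csv
import io

# Illustrative sample row; real data would come from gp/gp/houses1.csv.
sample = io.StringIO(
    "url,sold date,price,beds\n"
    "https://www.redfin.com/example,2019-05-01,350000,3\n"
)

# csv.DictReader turns each row into a dict, which is the shape
# MongoDB documents (or parameterized SQL inserts) expect.
rows = list(csv.DictReader(sample))

# With pymongo installed and a local MongoDB running, the rows could then
# be inserted directly (hypothetical names, shown for illustration):
#   from pymongo import MongoClient
#   MongoClient()["redfin"]["houses"].insert_many(rows)

print(rows[0]["price"])  # -> 350000
```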
Before running "main.py", please install the following:
- scrapy
- selenium
- the newest version of Firefox

(csv, re, and time are part of the Python standard library and need no separate installation.)
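The third-party dependencies can be installed from PyPI; note that Selenium also needs geckodriver on the PATH to drive Firefox (the brew command below is just one example for macOS, an assumption about your platform):

```shell
# Install the two third-party Python packages (csv, re, and time ship
# with Python itself, so only these two need installing).
pip install scrapy selenium

# Selenium drives Firefox through geckodriver, which must be on PATH.
# Example for macOS with Homebrew (adapt to your platform):
#   brew install geckodriver
```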
This crawler combines Scrapy and Selenium. RedfinSpider yields requests asynchronously and processes responses produced by the downloader middleware. That middleware uses selenium.webdriver to generate each response; this lets us bypass redfin.com's crawler-blocking mechanism and get the HTML we want. The webdriver is instantiated in the spider and functions as the downloader, in the form of a Firefox browser.
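The request flow above can be sketched as follows. This is a simplified illustration, not this project's actual middleware: to keep the snippet self-contained and runnable, `DummyDriver` stands in for the Firefox webdriver the spider holds, and the middleware returns raw HTML instead of a `scrapy.http.HtmlResponse`.

```python
class DummyDriver:
    """Stand-in for selenium.webdriver.Firefox (illustration only)."""
    def get(self, url):
        # A real driver would load the page in the browser; here we
        # fake the browser-rendered result.
        self.page_source = "<html><body>rendered by browser</body></html>"


class SeleniumMiddleware:
    """Downloader-middleware pattern: fetch via the spider's webdriver.

    In a real Scrapy project, process_request(request, spider) would call
    spider.driver.get(request.url) and return a scrapy.http.HtmlResponse,
    short-circuiting Scrapy's default downloader.
    """
    def process_request(self, url, driver):
        driver.get(url)           # the browser fetches and renders the page
        return driver.page_source # rendered HTML handed back to the spider


html = SeleniumMiddleware().process_request(
    "https://www.redfin.com/example", DummyDriver()
)
print("rendered by browser" in html)  # -> True
```

Because the response body comes from a full browser, pages that block plain HTTP clients still render normally, which is the point of routing Redfin requests through Selenium.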