This crawler fetches data from the websites of various organizations (e.g. clubs, companies) to collect information
about their store locations, clubs, or other company details, such as store names, locations,
coordinates, phone numbers, and operating hours. See the `results`
folder for the crawler output.
- Either crawler 1 or 2 was not working because `robots.txt` was being misread. While the website's `robots.txt` allowed the specific URL to be accessed by crawlers, Scrapy did not read that correctly.
  - Workaround: set `ROBOTSTXT_OBEY` to `False` in `settings.py`.
  - Further investigation needed.
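
As a starting point for that investigation, the rules in a `robots.txt` file can be parsed independently with Python's standard-library `urllib.robotparser` and compared with what Scrapy reports. This is a minimal sketch: the `robots.txt` content and the URLs are made-up examples, not taken from the affected website, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; replace it with the real file from the target site.
ROBOTS_TXT = """\
User-agent: *
Allow: /locations/
Disallow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Example URLs are placeholders for the pages the spider requests.
    print(is_allowed(ROBOTS_TXT, "https://example.com/locations/store-1"))
    print(is_allowed(ROBOTS_TXT, "https://example.com/admin"))
```

Python's parser applies the first rule that matches, so the order of `Allow` and `Disallow` lines matters here. Other parsers may resolve conflicting rules differently, which is one possible source of a disagreement like the one described above.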
Use the following commands to run the crawlers.
Output as JSON file:
scrapy crawl <name> -o results/<name>.json
Output as CSV file:
scrapy crawl <name> -o results/<name>.csv -t csv
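
To run several crawlers in one go, the commands above can be wrapped in a small script. The sketch below is an assumption about how one might do this, not part of the project: `build_crawl_command` and `run_all` are hypothetical helpers, and the script relies on Scrapy inferring the output format from the file extension.

```python
import subprocess

# Spider names taken from the status table; adjust to the spiders you want to run.
SPIDERS = ["towncaredental", "jockey", "rentking"]


def build_crawl_command(name, fmt="json"):
    """Build the `scrapy crawl` command that writes results/<name>.<fmt>."""
    if fmt not in ("json", "csv"):
        raise ValueError("unsupported format: {}".format(fmt))
    return ["scrapy", "crawl", name, "-o", "results/{}.{}".format(name, fmt)]


def run_all(names, fmt="json"):
    """Run each spider in turn and return a mapping of name to exit code."""
    exit_codes = {}
    for name in names:
        # check=False: a failing spider does not stop the remaining ones.
        completed = subprocess.run(build_crawl_command(name, fmt), check=False)
        exit_codes[name] = completed.returncode
    return exit_codes


if __name__ == "__main__":
    print(run_all(SPIDERS))
```

Run the script from the project root so that `scrapy` can find the project settings.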
The crawlers need to be tested and updated regularly to make sure they still work.
Name | Last Run |
---|---|
towncaredental | 2020-07-15 |
rickysalldaygrillcanada | 2020-07-15 |
jockey | 2020-07-15 |
rentking | 2020-07-15 |
uae_free | 2020-07-18 |
marketwatch_ipo | 2020-07-15 |
maac | 2020-07-15 |
`XlsxWriterPipeline` takes the items from a spider and places them in an Excel spreadsheet. If the spider yields multiple items, they are placed in separate sheets in the Excel file.
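
The general shape of such a pipeline can be sketched as follows. This is not the project's actual `XlsxWriterPipeline`: the class name `XlsxSheetsPipelineSketch`, the one-sheet-per-item-class grouping, and the `results/<spider name>.xlsx` output path are all assumptions made for illustration. It uses the third-party XlsxWriter package, which must be installed separately.

```python
from collections import defaultdict


class XlsxSheetsPipelineSketch:
    """Illustrative only: collects items and writes one worksheet per item class."""

    def open_spider(self, spider):
        # Rows grouped by item class name; each group becomes one worksheet.
        self.rows_by_sheet = defaultdict(list)

    def process_item(self, item, spider):
        # Excel limits worksheet names to 31 characters.
        sheet_name = type(item).__name__[:31]
        self.rows_by_sheet[sheet_name].append(dict(item))
        return item

    def close_spider(self, spider):
        import xlsxwriter  # third-party: pip install XlsxWriter

        # Assumed output location; the results folder must already exist.
        workbook = xlsxwriter.Workbook("results/{}.xlsx".format(spider.name))
        for sheet_name, rows in self.rows_by_sheet.items():
            worksheet = workbook.add_worksheet(sheet_name)
            # Use the union of all field names as the header row.
            headers = sorted({key for row in rows for key in row})
            worksheet.write_row(0, 0, headers)
            for row_index, row in enumerate(rows, start=1):
                # Values are written as strings so lists and None are handled uniformly.
                worksheet.write_row(row_index, 0, [str(row.get(h, "")) for h in headers])
        workbook.close()
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in `settings.py`, using whatever module path the class lives under.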
- This crawler was created specifically to answer the StackOverflow.com question "Crawl table data without 'next button' with Scrapy".
- For help, I used the StackOverflow.com answer to the question "Crawling through pages with PostBack data javascript Python Scrapy".
- Scrapy module for Python: https://docs.scrapy.org/en/latest/. Quick start-to-finish example: https://www.codementor.io/andy995/writing-a-simple-web-scraper-using-scrapy-myb7vrmgx
- XPath syntax: https://devhints.io/xpath. Use Google Chrome Inspector (Dev tools) to test XPath to access HTML nodes of a website; example: https://yizeng.me/2014/03/23/evaluate-and-validate-xpath-css-selectors-in-chrome-developer-tools/
- Network Log details/demo: https://developers.google.com/web/tools/chrome-devtools/network/