A baby problem in scraping in Python
For an overview of Scrapy, see the official tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html
I have tried to keep "baby_scaping.py" as basic and bare-bones as possible (I have not even used lxml). I will add more files to this repo on pagination and intelligent crawling, mainly the bits on recursive scraping, which Scrapy handles with a CrawlSpider. With a well-designed CrawlSpider, you can navigate the pages of a website and extract only the elements of interest. A well-designed CrawlSpider also keeps track of what has already been crawled, so that the crawler does not get stuck in an infinite loop. Note also that crawling, even with appropriate rules, can become a very memory-intensive activity, which is one reason there are ways to run bots on your personal machine on behalf of someone else's use case.
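To make the "keep track of what is already crawled" idea concrete, here is a minimal sketch in plain Python, not the code in this repo and not Scrapy's own implementation. The `SITE` dictionary and the `crawl` function are hypothetical stand-ins: `SITE` plays the role of a small website whose pages link back to each other, and no network requests are made.

```python
from collections import deque

# Hypothetical link graph standing in for a real website.
# Each key is a page, each value is the list of pages it links to.
# Note the cycles: "/a" links back to "/", "/c" links back to "/".
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b"],
    "/b": ["/a", "/c"],
    "/c": ["/"],
}


def crawl(start, links_for):
    """Visit every page reachable from `start` exactly once.

    `links_for` is any callable that returns the outgoing links of a page.
    """
    visited = {start}          # memory of pages already queued or crawled
    queue = deque([start])     # pages waiting to be crawled
    order = []                 # pages in the order they were crawled

    while queue:
        page = queue.popleft()
        order.append(page)
        for link in links_for(page):
            # Skipping known pages is what prevents the infinite loop.
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    print(crawl("/", lambda page: SITE.get(page, [])))
```

The `visited` set is also where the memory cost comes from: it grows with every distinct page seen. Scrapy filters duplicate requests for you by default, so in a real spider you would normally not write this loop by hand.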
For now, all you need to understand this is three broad things:
- What a scraper really does, plus an overview of the official tutorial
- Building XPaths and trying them out with the XPath Checker plugin in Mozilla Firefox
- Writing your spider to extract the items of interest from the webpage.
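
The last two points can be tried out without installing anything. The sketch below is an illustration only, not the spider from this repo: it uses the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath, on a made-up, well-formed page fragment. The `PAGE` markup, the `item` class name, and the `extract_items` function are all assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
# Real-world HTML is often not well-formed and needs a proper HTML parser.
PAGE = """
<html>
  <body>
    <div class="item"><a href="/post/1">First post</a></div>
    <div class="item"><a href="/post/2">Second post</a></div>
    <div class="footer"><a href="/about">About</a></div>
  </body>
</html>
"""


def extract_items(markup):
    """Return the title and URL of every link inside a div with class 'item'."""
    root = ET.fromstring(markup.strip())
    items = []
    # XPath: any <div class="item"> at any depth, then its direct <a> children.
    for link in root.findall(".//div[@class='item']/a"):
        items.append({"title": link.text.strip(), "url": link.get("href")})
    return items


if __name__ == "__main__":
    for item in extract_items(PAGE):
        print(item)
```

The footer link is skipped because its parent div does not match the `@class='item'` predicate. In a Scrapy spider the same kind of expression would be passed to `response.xpath(...)` inside the `parse` method, with each dictionary yielded as an item.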
For suggestions, comments, or collaboration, reach out to me at [email protected]