Added optional source url restrictions to Rule #1

Open
wants to merge 3 commits into master

Conversation

fraserharris
Owner

There is no easy way (that I can find) to restrict Rules to specific response URLs, i.e. the URL of the page the links were extracted from [1]. For example, I may want a Rule to apply only to links found on a category.php page [2]. The current way to do this appears to be making your LinkExtractor's XPath overly specific so that it only matches links on category.php pages, which is both fragile and hard to verify.
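
For illustration, the workaround looks roughly like the following sketch (assuming a modern Scrapy layout; the XPath and callback name are hypothetical):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Hypothetical example of the current workaround: over-constrain the XPath so
# that the extractor only matches links inside markup unique to category.php
# pages. This breaks silently whenever the site's markup changes.
rule = Rule(
    LinkExtractor(restrict_xpaths='//div[@id="category-listing"]//a'),
    callback='parse_product',
)
```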

This PR:

  • adds new optional Rule arguments allow_sources and deny_sources, mimicking the implementation of LinkExtractor's allow and deny (sketched after the footnotes below)
  • adds Rule.source_allowed(url) to check whether url is allowed or denied
  • makes CrawlSpider._requests_to_follow check Rule.source_allowed for each rule
  • adds tests for the above

[1] Example of someone else trying to solve this issue: http://stackoverflow.com/questions/22653656/how-to-make-rules-of-crawlspider-context-sensitive
[2] In my use case, I need to scrape a mix of e-commerce product pages, category pages, and "super-category" pages.
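
A minimal sketch of the proposed API, assuming Rule lives in scrapy.spiders; the regex handling below is an assumption modeled on LinkExtractor's allow/deny, and SourceRule is a stand-in name for the PR's modified Rule:

```python
import re

from scrapy.spiders import Rule


def _compile(patterns):
    # Accept a single pattern or an iterable, as LinkExtractor's allow/deny do.
    if isinstance(patterns, str):
        patterns = (patterns,)
    return [re.compile(p) for p in patterns or ()]


class SourceRule(Rule):
    """Stand-in for the PR's Rule: adds allow_sources/deny_sources."""

    def __init__(self, link_extractor, allow_sources=(), deny_sources=(), **kwargs):
        super().__init__(link_extractor, **kwargs)
        self.allow_sources = _compile(allow_sources)
        self.deny_sources = _compile(deny_sources)

    def source_allowed(self, url):
        # Reject the source URL if any deny pattern matches it ...
        if any(p.search(url) for p in self.deny_sources):
            return False
        # ... and accept it if no allow patterns were given, or one matches.
        return not self.allow_sources or any(
            p.search(url) for p in self.allow_sources)
```

With CrawlSpider._requests_to_follow skipping any rule whose source_allowed(response.url) returns False, as the PR proposes, a spider could then scope each rule to a page type, e.g.:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class ShopSpider(CrawlSpider):  # hypothetical spider for illustration
    name = 'shop'
    start_urls = ['http://example.com/']
    rules = (
        # Follow product links only when they were extracted from a
        # category.php response.
        SourceRule(LinkExtractor(allow=r'/product\.php'),
                   allow_sources=r'/category\.php',
                   callback='parse_product'),
    )
```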

- added Rule args allow_sources and deny_sources (paralleling LinkExtractor's allow & deny)
- added Rule method source_allowed to check if url is allowed/denied