startproject and override command line tool for Page Objects development #57
base: master
Conversation
Great job @ivanprado 👍 I've left a couple of comments here and there :)
    po_path=po_path,
    test_path=test_path,
)
self.context = context
I'm a bit confused here. Should we maybe init `self.context` and `self.po_path` with all the typing before assigning any values to them inside the methods?
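To illustrate the suggestion, here is a minimal sketch of declaring the attributes with their types at class level before any method assigns to them. The class name, attribute types, and `run` method are assumptions for illustration; they are not the actual code from this PR.

```python
from pathlib import Path
from typing import Optional

class OverrideCommand:
    # Hypothetical sketch: attributes are declared (with types) up front,
    # so readers and type checkers see them before any method assigns values.
    context: Optional[dict] = None       # filled in later, e.g. by run()
    po_path: Optional[Path] = None       # where the Page Object is written
    po_test_path: Optional[Path] = None  # where the generated test is written

    def run(self) -> None:
        # Values are assigned here, but their types are already
        # documented by the class-level annotations above.
        self.context = {"po_path": self.po_path}
```

With this pattern, `mypy` can check assignments inside the methods against the declared types.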
scrapy_poet/commands/override.py (Outdated)
print("Fixture saved successfully")

self.po_test_path = generate_test(self.context)
print()
Why do we stick with `print` instead of `logging`?
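For reference, a minimal switch from `print` to `logging` might look like the sketch below. The function name `save_fixture` is illustrative, not the actual code from this PR; the point is that log records, unlike `print`, respect Scrapy's logging configuration and verbosity settings.

```python
import logging

logger = logging.getLogger(__name__)

def save_fixture() -> None:
    # Instead of print("Fixture saved successfully"), emit a log record
    # so users can control verbosity through the standard logging config.
    logger.info("Fixture saved successfully")
```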
WARNING: This was developed on top of #56. Merge this PR only after that one.
`scrapy startproject` is modified so that the project is prepared for `scrapy-poet`. Folders for the Page Objects and their tests and fixtures are also created.

`scrapy override` creates a Page Object and a test case for a given web page, which makes development handy. The `override` command can also be used to update the fixture data with fresh web data. It can also be used when the dependencies of a Page Object have changed: in that case, running the command is required to fetch fixtures for the additional dependencies.

TODO
Remaining work for the future:
How the documentation should be structured
Rewrite the tutorial using the new `startproject` and `override` commands. The goal should be to create a generic spider with common crawling logic and then integrate different sites. The spider could, for example, extract books from categories in book review pages. The structure could be:
2.1 Create a new project using `startproject`
2.2 Write a spider that relies on Page Objects (empty implementation)
2.3 Create the first override using the tool
2.3.1 Explain the `handle_url` decorator and link to the web_poet and `url-matcher` docs
2.4 Implement extraction logic in the PO
2.5 Use the unit test to check that the logic is right
2.6 Do the same for the rest of the POs for the site
2.7 Run the spider
2.8 Integrate the second site
2.9 Summary of what happened
Re-running the `override` command over the same PO and URL. When and why:
3.1 To get fresh data, e.g. because the layout of the site changed and we need to update the extraction code
3.2 When new dependencies are present in the PO: the command is then required to fetch the new resources.
4.1 Default templates vs specific ones
python -m web_poet
6.1. ItemPage
6.2. ItemWebPage
6.3. RequestData
6.4. Injectable
Keep in mind that the tutorial will be the entry point for many people. It is really important to have a tutorial that is good and simple, and that convinces readers of its value.
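To make the Page Object idea in the outline above concrete, here is a self-contained sketch. A stand-in `ItemWebPage` base class replaces the real web_poet import so the snippet runs without dependencies, and `BookPage`, its plain-string extraction, and the sample HTML are all hypothetical; real code would use the actual web_poet classes and their `css`/`xpath` shortcuts.

```python
from abc import ABC, abstractmethod

# Stand-in for web_poet's ItemWebPage, so this sketch is self-contained;
# the real class lives in the web_poet package.
class ItemWebPage(ABC):
    def __init__(self, html: str):
        self.html = html

    @abstractmethod
    def to_item(self) -> dict:
        """Return the extracted item as a dict."""

# A Page Object for a hypothetical book page: extraction logic is
# isolated here, so the spider only deals with the returned items.
class BookPage(ItemWebPage):
    def to_item(self) -> dict:
        # Real code would use css()/xpath() selectors; plain string
        # handling keeps this sketch dependency-free.
        title = self.html.split("<h1>")[1].split("</h1>")[0]
        return {"title": title}
```

A spider would then receive a `BookPage` instance via dependency injection and simply `yield page.to_item()`, which is what keeps the crawling logic generic across sites.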