Switch auto field stats to an item pipeline #216
Open
Gallaecio wants to merge 10 commits into scrapy-plugins:main from Gallaecio:auto-fields
10 commits
71567e6 Switch auto field stats to an item pipeline (Gallaecio)
e1eaa04 Add types (Gallaecio)
1935012 Remove unnecessary line (Gallaecio)
619f87c Require the latest zyte-common-items in provider-pinned (Gallaecio)
afe00e8 Require zyte-common-items 0.21.0 (Gallaecio)
52c8717 Test missing InjectionMiddleware (Gallaecio)
bd49a10 ScrapyZyteAPIPoetItemPipeline → ScrapyZyteAPIAutoFieldStatsItemPipeline (Gallaecio)
a89f3b7 Improve code readability (Gallaecio)
959bbdf Add missing return values to process_item (Gallaecio)
a5ffb08 Merge remote-tracking branch 'scrapy-plugins/main' into auto-fields (Gallaecio)
@@ -0,0 +1,89 @@
from logging import getLogger
from typing import Any, Set, Type

from itemadapter import ItemAdapter
from scrapy import Spider
from scrapy.crawler import Crawler
from scrapy.exceptions import NotConfigured
from scrapy.utils.misc import load_object
from scrapy_poet import InjectionMiddleware
from web_poet.fields import get_fields_dict
from web_poet.utils import get_fq_class_name
from zyte_common_items.fields import is_auto_field

logger = getLogger(__name__)


class ScrapyZyteAPIAutoFieldStatsItemPipeline:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler: Crawler):
        if not crawler.settings.getbool("ZYTE_API_AUTO_FIELD_STATS", False):
            raise NotConfigured

        raw_url_fields = crawler.settings.getdict("ZYTE_API_AUTO_FIELD_URL_FIELDS", {})
        self._url_fields = {load_object(k): v for k, v in raw_url_fields.items()}
        self._seen: Set[Type] = set()
        self._crawler = crawler
        self._stats = crawler.stats
        self._item_cls_without_url: Set[Type] = set()

    def open_spider(self, spider):
        for component in self._crawler.engine.downloader.middleware.middlewares:
            if isinstance(component, InjectionMiddleware):
                self._registry = component.injector.registry
                return
        raise RuntimeError(
            "Could not find scrapy_poet.InjectionMiddleware among downloader "
            "middlewares. scrapy-poet may be misconfigured."
        )

    def process_item(self, item: Any, spider: Spider):
        item_cls = item.__class__

        url_field = self._url_fields.get(item_cls, "url")
        adapter = ItemAdapter(item)
        url = adapter.get(url_field, None)
        if not url:
            if item_cls not in self._item_cls_without_url:
                self._item_cls_without_url.add(item_cls)
                logger.warning(
                    f"An item of type {item_cls} was missing a non-empty URL "
                    f"in its {url_field!r} field. An item URL is necessary to "
                    f"determine the page object that was used to generate "
                    f"that item, and hence print the auto field stats that "
                    f"you requested by enabling the ZYTE_API_AUTO_FIELD_STATS "
                    f"setting. If {url_field!r} is the wrong URL field for "
                    f"that item type, use the ZYTE_API_AUTO_FIELD_URL_FIELDS "
                    f"setting to set a different field."
                )
            return item

        page_cls = self._registry.page_cls_for_item(url, item_cls)

        cls = page_cls or item_cls
        if cls in self._seen:
            return item
        self._seen.add(cls)

        if not page_cls:
            field_list = "(all fields)"
        else:
            auto_fields = set()
            missing_fields = False
            for field_name in get_fields_dict(page_cls):
                if is_auto_field(page_cls, field_name):  # type: ignore[arg-type]
                    auto_fields.add(field_name)
                else:
                    missing_fields = True
            if missing_fields:
                field_list = " ".join(sorted(auto_fields))
            else:
                field_list = "(all fields)"

        cls_fqn = get_fq_class_name(cls)
        self._stats.set_value(f"scrapy-zyte-api/auto_fields/{cls_fqn}", field_list)
        return item
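To make the stat value easier to reason about while reviewing, here is a minimal sketch of the field-list logic at the end of process_item, isolated from Scrapy. It assumes the per-field information is already available as a plain name-to-bool mapping; `summarize_auto_fields` and the field names are hypothetical and not part of this PR, where the same information comes from `get_fields_dict()` and `is_auto_field()`.

```python
from typing import Dict


def summarize_auto_fields(field_is_auto: Dict[str, bool]) -> str:
    """Return the stat value for a page class.

    Hypothetical helper: `field_is_auto` maps each field name of a page
    class to whether that field is still an auto field.
    """
    auto_fields = {name for name, is_auto in field_is_auto.items() if is_auto}
    if len(auto_fields) == len(field_is_auto):
        # No field was overridden, so the stat collapses to a marker.
        return "(all fields)"
    # At least one field is custom: list only the auto fields, sorted.
    return " ".join(sorted(auto_fields))


print(summarize_auto_fields({"name": True, "price": True}))
print(summarize_auto_fields({"name": True, "price": False, "brand": True}))
print(repr(summarize_auto_fields({"name": False})))
```

One consequence worth noting: a page class that overrides every field produces an empty string as the stat value, not a marker.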
@@ -0,0 +1 @@
from ._poet_item_pipelines import ScrapyZyteAPIAutoFieldStatsItemPipeline
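For readers configuring the pipeline, the sketch below illustrates how a ZYTE_API_AUTO_FIELD_URL_FIELDS mapping (import path to field name) appears to be resolved and then consulted per item, with "url" as the fallback. It is an approximation under stated assumptions: `load_class` is a stdlib stand-in for `scrapy.utils.misc.load_object`, the helper names are hypothetical, and `collections.OrderedDict` with a "link" field stands in for a real item class.

```python
from collections import OrderedDict
from importlib import import_module
from typing import Any, Dict, Type


def load_class(path: str) -> Type:
    """Stdlib stand-in for scrapy.utils.misc.load_object (classes only)."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)


def resolve_url_fields(raw: Dict[str, str]) -> Dict[Type, str]:
    # Turn {"import.path.ItemClass": "field"} into {ItemClass: "field"}.
    return {load_class(path): field for path, field in raw.items()}


def url_field_for(item: Any, url_fields: Dict[Type, str]) -> str:
    # Exact class match only; anything else falls back to "url".
    return url_fields.get(item.__class__, "url")


# Hypothetical setting value, as it might appear in settings.py.
ZYTE_API_AUTO_FIELD_URL_FIELDS = {"collections.OrderedDict": "link"}

url_fields = resolve_url_fields(ZYTE_API_AUTO_FIELD_URL_FIELDS)
print(url_field_for(OrderedDict(), url_fields))
print(url_field_for({}, url_fields))
```

Because the lookup is keyed on `item.__class__`, a subclass of a configured item class is not matched and uses the default "url" field unless it is listed separately.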
Moving it to scrapy-poet might let us make the name shorter :)