Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add The West Australian #615

Merged
merged 4 commits into from
Sep 30, 2024
Merged

Add The West Australian #615

merged 4 commits into from
Sep 30, 2024

Conversation

MaxDall
Copy link
Collaborator

@MaxDall MaxDall commented Sep 16, 2024

This PR is based on #614 and should be reviewed afterward.

@MaxDall MaxDall added the DON'T MERGE Current PR is based on another PR. When used, indicate in title with [Based on #...] label Sep 16, 2024
Base automatically changed from implement-xpath-search-for-lds to master September 17, 2024 07:01
@MaxDall MaxDall removed the DON'T MERGE Current PR is based on another PR. When used, indicate in title with [Based on #...] label Sep 17, 2024
Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this publisher, seems like it was quite some effort. I only got some minor suggestions :)

parser=WestAustralianParser,
sources=[
NewsMap("https://thewest.com.au/news-sitemap.xml"),
Sitemap("https://thewest.com.au/sitemap.xml", reverse=True),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also a RSS feed containing the most recent articles: https://thewest.com.au/rss


@attribute
def topics(self) -> List[str]:
return generic_topic_parsing(self.precomputed.ld.bf_search("keywords"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason TheWestAustralian seems to include a lot of duplicate topics. I think we should filter them to be unique (maybe it makes sense to implement it in generic_topic_parsing?). This is already visible in the testcase or in This article. Also, they add the topic premium to their plus articles, I would consider filtering that as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Point! I will add this in a separate PR

"string(//script[re:test(text(), 'window.PAGE_DATA')])",
namespaces={"re": "http://exslt.org/regular-expressions"},
)
_json_type_pattern = re.compile(r"^(?P<type>.*)=\s*{")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be leftover. At least I couldn't find any usage in the project files.

@MaxDall MaxDall merged commit 369a6a3 into master Sep 30, 2024
5 checks passed
@MaxDall MaxDall deleted the add-west-australian branch September 30, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants