Commit

Merge pull request #587 from flairNLP/add-globe-and-mail
Add The Globe and Mail
MaxDall authored Sep 3, 2024
2 parents 34a7f30 + defa7f7 commit 376e7f8
Showing 6 changed files with 145 additions and 0 deletions.
15 changes: 15 additions & 0 deletions docs/supported_publishers.md
@@ -129,6 +129,21 @@
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
<td>
<code>TheGlobeAndMail</code>
</td>
<td>
<div>The Globe and Mail</div>
</td>
<td>
<a href="https://www.theglobeandmail.com">
<span>www.theglobeandmail.com</span>
</a>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>

10 changes: 10 additions & 0 deletions src/fundus/publishers/ca/__init__.py
@@ -1,5 +1,6 @@
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.ca.cbc_news import CBCNewsParser
from fundus.publishers.ca.globe_and_mail import TheGlobeAndMailParser
from fundus.publishers.ca.national_post import NationalPostParser
from fundus.scraping.url import NewsMap, RSSFeed, Sitemap

@@ -17,6 +18,15 @@ class CA(metaclass=PublisherGroup):
            RSSFeed("https://www.cbc.ca/webfeed/rss/rss-canada"),
        ],
    )
    TheGlobeAndMail = Publisher(
        name="The Globe and Mail",
        domain="https://www.theglobeandmail.com",
        parser=TheGlobeAndMailParser,
        sources=[
            NewsMap("https://www.theglobeandmail.com/arc/outboundfeeds/news-sitemap-index/?outputType=xml"),
            NewsMap("https://www.theglobeandmail.com/arc/outboundfeeds/sitemap-index/?outputType=xml"),
        ],
    )

    NationalPost = Publisher(
        name="National Post",
49 changes: 49 additions & 0 deletions src/fundus/publishers/ca/globe_and_mail.py
@@ -0,0 +1,49 @@
import datetime
from typing import List, Optional

from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
)


class TheGlobeAndMailParser(ParserProxy):
    class V1(BaseParser):
        _subheadline_selector = CSSSelector("article > h4")
        _paragraph_selector = CSSSelector("article > p")

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                subheadline_selector=self._subheadline_selector,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))

        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.meta.get("og:title")

        @attribute
        def topics(self) -> List[str]:
            topic_list = [topic.lower() for topic in generic_topic_parsing(self.precomputed.meta.get("keywords"))]
            topic_set = set(topic_list)
            topic_duplicates = list(topic_list)
            for element in topic_set:
                topic_duplicates.remove(element)
            for duplicate in topic_duplicates:
                topic_list.remove(duplicate)
            return [topic.title() for topic in topic_list if "news" not in topic]
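
Reviewer note on `topics`: the de-duplication above may be easier to follow in isolation. The sketch below reproduces the same steps outside the parser. The function name `dedupe_topics` and the comma split are assumptions made for illustration only; the split is a simplified stand-in for `generic_topic_parsing`, whose actual behaviour is not shown in this diff.

```python
from typing import List


def dedupe_topics(raw_keywords: str) -> List[str]:
    """Standalone sketch of the de-duplication used in TheGlobeAndMailParser.V1.topics.

    Assumption: keywords arrive as one comma-separated string. The real parser
    delegates that step to fundus' generic_topic_parsing.
    """
    # Hypothetical stand-in for generic_topic_parsing: split on commas, drop blanks.
    parsed = [part.strip() for part in raw_keywords.split(",") if part.strip()]

    topic_list = [topic.lower() for topic in parsed]
    topic_set = set(topic_list)

    # After removing one occurrence of every distinct topic, only the
    # surplus (duplicate) occurrences remain in this copy.
    topic_duplicates = list(topic_list)
    for element in topic_set:
        topic_duplicates.remove(element)

    # list.remove drops the FIRST match, so each topic survives at the
    # position of its LAST occurrence.
    for duplicate in topic_duplicates:
        topic_list.remove(duplicate)

    # Drop anything containing "news", then title-case what is left.
    return [topic.title() for topic in topic_list if "news" not in topic]


if __name__ == "__main__":
    print(dedupe_topics("Business, Nvidia, business, World News"))
    # ['Nvidia', 'Business']
```

Note that the result is ordered by last occurrence. An order-preserving alternative such as `list(dict.fromkeys(topic_list))` would keep the first occurrence instead, so the two approaches can return the same topics in a different order.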
67 changes: 67 additions & 0 deletions tests/resources/parser/test_data/ca/TheGlobeAndMail.json
@@ -0,0 +1,67 @@
{
"V1": {
"authors": [
"Chris Wilson-Smith"
],
"body": {
"summary": [],
"sections": [
{
"headline": [],
"paragraphs": [
"If one analyst catches Nvidia chief executive Jensen Huang so much as sneezing tomorrow, I’d stay away from market news for a few days. Below, as investors sharpen their knives in case the AI giant reports merely terrific earnings, we look at why Nvidia’s long-term growth could be damaging in unexpected ways.",
"Earnings up: Royal Bank of Canada has surpassed analysts’ estimates for quarterly profit as it set aside a smaller than expected sum to protect itself against losses on bad loans.",
"And up again: National Bank of Canada also has posted profits that exceeded analysts expectations, a week ahead of a key vote on its proposed $5-billion takeover of rival Canadian Western Bank.",
"Stare down: Buyers smell blood in the water as distressed commercial properties are put up for sale. But so far, sellers of that troubled real estate are refusing to accept rock-bottom values.",
"Chair down: The inaugural chair of the organization in charge of overseeing Canada’s adoption of international sustainability reporting standards has stepped down, prompting a search for a replacement at a key time in its duties.",
"Deep down: A court battle pitting the two brothers behind Dye & Durham Ltd. against one another has exposed broad discontent among institutional shareholders toward the real-estate software company dating to well before activists launched campaigns against it this year."
]
},
{
"headline": [
"Nvidia’s new vertical: Nation building"
],
"paragraphs": [
"Nvidia’s chief executive is known for a few sayings and stylistic choices: We are at the beginning of “a new industrial revolution” powered by artificial intelligence. His company’s relatively affordable products are “democratizing” access to its computational powers. He has a cool leather jacket.",
"Of late, he seems focused on another vision for the future: “Sovereign AI.”",
"That would be the idea that each nation produces artificial intelligence using its own infrastructure, data, work force and business networks.",
"Canada is among the subscribers to this idea, and the argument is similar to the one made by manufacturers of, say, electric vehicles: If we don’t protect our industry, if we don’t develop and innovate, we are then beholden to the whims of industry giants in other countries. We lose homegrown winners and jobs, and possibly expose ourselves to security threats.",
"It’s a great idea for Jensen Huang, who gets to sell billions of dollars worth of chips to governments. But it’s a little more complicated – and possibly even more dangerous – than he makes it out to be.",
"The biggest problem might have been best illustrated by Huang himself. At the World Governments Summit in Dubai this February, he reminded an audience of leaders across industry and politics that investment in AI infrastructure is essential.",
"He then told these leaders, who were gathered in a country that criminalizes being gay, what he would do if he were a leader of a developing nation: “The first thing that I would do, of course, is I would codify the language, the data of your culture into your own large language model.”",
"Did leaders of authoritarian nations lean forward in their seats?",
"That possibility is one of many concerns held by critics of sovereign AI. They argue embracing the concept, especially with the support of a global AI leader, could legitimize and accelerate state programs that codify belief systems, language preferences, behaviours. And if every nation becomes responsible for its own AI innovations, some of those breakthroughs could be left trapped behind geographic borders. That’s not to mention the risk of fuelling an already dominant Nvidia into a force that could squeeze out competition completely.",
"How does all this square with “democratizing” access to artificial intelligence? And more to Huang’s point: are there lines he wouldn’t feel comfortable seeing crossed?",
"In July, the Digital Forensic Research Lab outlined ways authoritarian governments that embrace sovereign AI could use it to further erode human rights.",
"At a more basic level, the report says, state-backed data initiatives for sovereign AI are likely to hurt marginalized populations, given governments’ views on national identity tend to be rooted in more deeply held – if not completely fixed – ideas. The report points to China, which has already succeeded in censoring models that threaten Beijing’s messaging. But the warning applies to any nation embracing the concept.",
"Canada, which has an “AI Sovereign Compute Strategy” as part of a broader set of measures, seems attuned to these risks. As part of its efforts to spur the development of Canadian-owned and located AI infrastructure, it launched consultations with businesses, developers, researchers and Indigenous groups that end on Sept. 6.",
"We’ll be curious to see what these consultations find, and how they will reflect Canada’s “culture.” (Not that the country has ever struggled to define what that is, of course.)",
"Huang’s own remarks suggest his strategy is to form the building blocks of the next industrial revolution, then leave it to his client countries to decide how to use them.",
"Today, even the slightest hint of weakness in Nvidia’s forecast could make for volatile trading over the coming weeks – but most analysts don’t see much of a threat to the AI giant. Longer-term, though, Nvidia’s expansion might attract more scrutiny to its growing role – whether it acknowledges it or not – as a nation builder with no apparent code of its own.",
"Reliance on the low-wage stream of the temporary foreign work program has shot up since 2022. The federal government agreed to ease access to the program in response to calls from restaurant owners and other employers who said they were struggling to find staff after months of pandemic restrictions. Ottawa announced this week plans to cut the low-wage stream back to prepandemic levels amid criticism of its growing use by Canadian employers.",
"Today: Nvidia and CrowdStrike report after close, assuming no faulty software updates. Investors will be chewing over reports from RBC and National Bank of Canada as they wait for ...",
"Tomorrow: ... Canadian Imperial Bank of Commerce earnings, which will be the last of the Big Six this quarter. Other earnings include Dell Technologies Inc., Dollar General Corp., and Lululemon Athletica Inc.",
"Friday: Canadian Western Bank reports as it awaits approval to be purchased by National Bank. Canada reports monthly GDP growth, and the U.S. releases two indicators of consumer spending and price growth.",
"Long arms of the claw: Inadequate federal enforcement of the lobster fishery in southwestern Nova Scotia is emboldening organized crime that is “terrorizing” the local community."
]
},
{
"headline": [
"Morning markets"
],
"paragraphs": [
"Global markets held steady as investors stayed on the sidelines ahead of Nvidia’s earnings release after the closing bell. Wall Street futures and TSX futures were little changed.",
"Overseas, the pan-European STOXX 600 was up 0.49 per cent in morning trading. Britain’s FTSE 100 slipped 0.14 per cent, Germany’s DAX rose 0.78 per cent and France’s CAC 40 gained 0.49 per cent.",
"In Asia, Japan’s Nikkei closed 0.22 per cent higher, while Hong Kong’s Hang Seng dropped 1.02 per cent.",
"The Canadian dollar traded at 74.27 U.S. cents."
]
}
]
},
"publishing_date": "2024-08-28 11:26:02.614000+00:00",
"title": "Business Brief: Nvidia’s into nation building. We cool with that?",
"topics": [
"Noastack"
]
}
}
Binary file not shown.
4 changes: 4 additions & 0 deletions tests/resources/parser/test_data/ca/meta.info
@@ -6,5 +6,9 @@
"NationalPost_2024_08_28.html.gz": {
"url": "https://nationalpost.com/news/canada/kamala-harris-childhood-montreal-canada",
"crawl_date": "2024-08-28 13:13:43.905282"
},
"TheGlobeAndMail_2024_08_28.html.gz": {
"url": "https://www.theglobeandmail.com/business/article-business-brief-nvidias-into-nation-building-we-cool-with-that/",
"crawl_date": "2024-08-28 13:26:27.319831"
}
}
