SaysWho is a Python package for identifying and attributing quotes in text. It uses a combination of logic and grammar to find quotes and their speakers, then uses a coreferencing model to clarify who is speaking. It's built on Textacy and SpaCy.
- Coreferencing is an experimental feature not fully integrated into SpaCy, and the current pipeline is built on SpaCy 3.4. I haven't had any problems using it with SpaCy 3.5+, but it takes some finesse to navigate the different versions.
- SaysWho grew out of a larger project for analyzing newspaper articles from Lexis, roughly 250 to 2,000 words long, and it is optimized to navigate the syntax and common errors particular to that text.
- The output of this version is kind of open-ended, and possibly not as useful as it could be. HTML viz is coming, but I'm open to any suggestions about how this could be more useful!
Install and update using pip:
$ pip install sayswho
Install the pre-trained SpaCy coreferencing pipeline.
$ pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl
(Optional) If you want to use the most recent version of SpaCy, upgrade it at this point, after installing the coreferencing pipeline (see Notes).
$ pip install spacy -U
Download the SpaCy large English model.
$ spacy download en_core_web_lg
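Before running the examples, it can help to confirm that the pieces installed above are importable. The following is a minimal sketch that uses only the standard library. The helper name `check_installed` is hypothetical and not part of SaysWho, and the module names in the usage line are assumed to match the packages installed above.

```python
from importlib import metadata, util


def check_installed(modules):
    """Map each top-level module name to its installed version.

    The value is None when the module cannot be found, and "unknown"
    when the module is importable but has no distribution metadata
    under the same name.
    """
    report = {}
    for name in modules:
        if util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "unknown"
    return report


# Module names below are assumptions based on the install steps above.
print(check_installed(["spacy", "sayswho", "en_core_web_lg", "en_coreference_web_trf"]))
```

If SpaCy reports a 3.5+ version while the coreferencing pipeline was built for 3.4, that is the version mix described in the Notes.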
Sample text adapted from here:
Nets Coach Jacque Vaughn was optimistic when discussing Ben Simmons's prospects on NBA TV.
“It’s been great, being able to check in with Ben," Vaughn said, via Nets Daily. “I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward.
"He has an innate ability to impact the basketball game on both ends of the floor. So, we missed that in the Philly series and looking forward to it.”
Simmons arrived in Brooklyn during the 2021-22 season, but did not play that year after a back injury. The 26-year-old would make 42 appearances (33 starts) during a tumult-filled season for Brooklyn.
“He is on the court. No setbacks," Vaughn later told reporters about Simmons' workouts. “We’ll continue to see him improve through the offseason.”
from sayswho import SaysWho
sw = SaysWho(text)
print(sw.quotes)
[DQTriple(speaker=[Vaughn], cue=[said], content=“It’s been great, being able to check in with Ben,"),
DQTriple(speaker=[Vaughn], cue=[said], content=“I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward."),
DQTriple(speaker=[Vaughn], cue=[told], content=“He is on the court. No setbacks,"),
DQTriple(speaker=[Vaughn], cue=[told], content=“We’ll continue to see him improve through the offseason.”)]
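Each entry printed above has three fields: `speaker`, `cue`, and `content`. One way to work with that output is to group quote text by speaker. The sketch below is an illustration, not part of SaysWho: `Quote` is a hypothetical stand-in that holds plain strings where the real triples hold SpaCy tokens and spans, and `group_by_speaker` assumes that calling `str()` on those objects yields their text.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the triples shown above, using plain strings.
Quote = namedtuple("Quote", ["speaker", "cue", "content"])


def group_by_speaker(quotes):
    """Collect quote text under the joined speaker name, in input order."""
    grouped = defaultdict(list)
    for q in quotes:
        name = " ".join(str(tok) for tok in q.speaker)
        grouped[name].append(str(q.content))
    return dict(grouped)


sample = [
    Quote(["Vaughn"], ["said"], '“It’s been great, being able to check in with Ben,"'),
    Quote(["Vaughn"], ["told"], '“He is on the court. No setbacks,"'),
]

for speaker, contents in group_by_speaker(sample).items():
    print(speaker, len(contents))
```

With the real object, the same function would be called as `group_by_speaker(sw.quotes)`, under the assumption stated above.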
print(sw.clusters)
[[Ben Simmons's,
Ben,
a healthy Ben Simmons,
him,
He,
Simmons,
The 26-year-old,
He,
Simmons'x,
him],
[Nets Coach Jacque Vaughn, Vaughn, I, Vaughn],
[Nets, The team, our, we],
[an innate ability to impact the basketball game on both ends of the floor,
that,
it],
[the 2021-22 season, that year],
[Brooklyn, Brooklyn, We]]
sw.print_clusters()
0 {'Ben', 'He', 'The 26-year-old', 'a healthy Ben Simmons', "Simmons'x", "Ben Simmons's", 'Simmons', 'him'}
1 {'I', 'Nets Coach Jacque Vaughn', 'Vaughn'}
2 {'The team', 'our', 'we', 'Nets'}
3 {'it', 'an innate ability to impact the basketball game on both ends of the floor', 'that'}
4 {'that year', 'the 2021-22 season'}
5 {'Brooklyn', 'We'}
for qm in sw.quote_matches:
    print(qm)
QuoteClusterMatch(quote_index=0, cluster_index=1)
QuoteClusterMatch(quote_index=1, cluster_index=1)
QuoteClusterMatch(quote_index=2, cluster_index=1)
QuoteClusterMatch(quote_index=3, cluster_index=1)
sw.expand_match()
QUOTE : 0
DQTriple(speaker=[Vaughn], cue=[said], content=“It’s been great, being able to check in with Ben,")
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
QUOTE : 1
DQTriple(speaker=[Vaughn], cue=[said], content=“I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward.")
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
QUOTE : 2
DQTriple(speaker=[Vaughn], cue=[told], content=“He is on the court. No setbacks,")
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
QUOTE : 3
DQTriple(speaker=[Vaughn], cue=[told], content=“We’ll continue to see him improve through the offseason.”)
CLUSTER : 1
['Nets Coach Jacque Vaughn', 'Vaughn']
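Each match pairs an index into the quotes with an index into the clusters, so the pairing can also be resolved by hand. The sketch below labels every matched quote with the longest mention in its cluster. That longest-mention rule is an assumption made for illustration, not how SaysWho chooses names, and `Match` and `label_quotes` are hypothetical stand-ins that operate on plain strings.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields printed for each match above.
Match = namedtuple("Match", ["quote_index", "cluster_index"])


def label_quotes(quotes, clusters, matches):
    """Return (label, quote) pairs, one per match.

    The label is the longest mention in the matched cluster, on the
    assumption that the longest mention is the most descriptive one.
    """
    labelled = []
    for m in matches:
        mentions = [str(mention) for mention in clusters[m.cluster_index]]
        label = max(mentions, key=len)
        labelled.append((label, quotes[m.quote_index]))
    return labelled


quotes = ["“He is on the court. No setbacks,\"", "“We’ll continue to see him improve through the offseason.”"]
clusters = [
    ["Ben Simmons's", "Ben", "him"],
    ["Nets Coach Jacque Vaughn", "Vaughn", "I", "Vaughn"],
]
matches = [Match(0, 1), Match(1, 1)]

for label, quote in label_quotes(quotes, clusters, matches):
    print(label, "->", quote)
```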
Use .render_to_html() to output an HTML file with your text, highlighted quotes, and associated clusters.
sw.render_to_html(article_title="My Article Title")
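For readers who want a rough idea of what highlighting quotes in HTML involves, here is a minimal standard-library sketch. It is not the implementation behind `render_to_html`: the function name `highlight_quotes` is hypothetical, it ignores clusters entirely, and it assumes the quotes are given as strings in the order they appear in the text.

```python
import html


def highlight_quotes(text, quotes):
    """Wrap each quote found in text in <mark> tags and escape the rest.

    Quotes are searched left to right, so they must be supplied in
    document order. Quotes that are not found are skipped.
    """
    parts = []
    pos = 0
    for quote in quotes:
        start = text.find(quote, pos)
        if start == -1:
            continue
        parts.append(html.escape(text[pos:start]))
        parts.append("<mark>" + html.escape(quote) + "</mark>")
        pos = start + len(quote)
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


print(highlight_quotes("Vaughn said “No setbacks” & left.", ["“No setbacks”"]))
```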