Why does the keyword phrase include a PRON, like "it"? #271

chencjiajy · 2024-01-04T13:26:41Z

I ran the following code snippet. The output includes the word "it", even though pos_kept does not include PRON.

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank", config={'pos_kept': ["NOUN", "PROPN", "VERB"]})

text = '''The MCU SDK for WRG1 general firmware has been launched, and it can be automatically generated after creating the product.'''
doc = nlp(text)

for phrase in doc._.phrases[:10]:
    print(phrase.text, phrase.rank, phrase.count, phrase.chunks)

## the output is 
# the product 0.12286712485174818 1 [the product]
# WRG1 general firmware 0.10712303413227088 1 [WRG1 general firmware]
# The MCU SDK 0.0834726982382997 1 [The MCU SDK]
# it 0.0 1 [it]

ceteri · 2024-01-04T22:08:14Z

Hi @chencjiajy, great question.

The library considers noun chunks as candidate phrases, and apparently spaCy parses the term "it" as a noun chunk.

The coreference capabilities for spaCy are currently marked "experimental", which is a nice way to say "Good luck installing and running this part in production" :) I've evaluated multiple options for coreference (including the AllenNLP integration) and they each seem to have serious limitations. That said, if these capabilities were available, it would be relatively simple to resolve a pronoun reference within the graph. In that case, the term "it" would add more weight to "The MCU SDK" instead.

If you want, it might be good to add the term "it" to the stop words list for your application?
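
For reference, here is a minimal sketch of what that stop-word configuration could look like. It assumes the textrank factory's stopwords option, which maps a lowercase lemma to a list of POS tags. The helper name build_pipeline is hypothetical, and the lemma/POS pair used for "it" is an assumption about how the model tags it:

```python
# Sketch only: "stopwords" maps a lowercase lemma to the POS tags to ignore.
# The ("it", "PRON") pair is an assumption about how the model tags the word.
textrank_config = {
    "pos_kept": ["NOUN", "PROPN", "VERB"],
    "stopwords": {"it": ["PRON"]},
}


def build_pipeline(model_name="en_core_web_sm"):
    """Hypothetical helper: load a spaCy model and attach PyTextRank."""
    # Imported lazily so the config above can be inspected without spaCy.
    import spacy
    import pytextrank  # noqa: F401  (registers the "textrank" factory)

    nlp = spacy.load(model_name)
    nlp.add_pipe("textrank", config=textrank_config)
    return nlp
```

Stop words control which tokens are counted when ranking, so it is worth checking the resulting doc._.phrases for your own text to see whether this has the effect you want.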

chencjiajy · 2024-01-08T23:38:21Z

Hi @ceteri, I found that adding the term "it" to the stop words list is not useful, and the same goes for other single PRON words. Because pos_kept doesn't include PRON, I don't need to add a single PRON word to the stop words. In the function _collect_phrases in base.py, pytextrank sums ranks only over the tokens that pass _keep_token, so a single PRON word that is not covered by pos_kept contributes nothing. For a single PRON word, its rank will therefore always be 0.0, and what I need to do is filter out the phrases whose rank is equal to zero.

        phrases: typing.Dict[Span, float] = {
            span: sum(
                ranks[Lemma(token.lemma_, token.pos_)]
                for token in span
                if self._keep_token(token)
            )
            for span in spans
        }
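
A toy sketch of the behaviour described above, in case it helps others. It does not use spaCy or pytextrank. Token, Phrase, span_rank, drop_zero_rank, and the node rank value are hypothetical stand-ins, and only the phrase ranks are copied from the output in the first comment:

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy tokens and pytextrank phrases.
Token = namedtuple("Token", ["lemma", "pos"])
Phrase = namedtuple("Phrase", ["text", "rank"])

POS_KEPT = {"NOUN", "PROPN", "VERB"}


def span_rank(tokens, ranks, pos_kept=POS_KEPT):
    """Sum node ranks over the tokens whose POS tag is kept."""
    return sum(
        (ranks.get((tok.lemma, tok.pos), 0.0)
         for tok in tokens
         if tok.pos in pos_kept),
        0.0,
    )


def drop_zero_rank(phrases):
    """Keep only the phrases with a strictly positive rank."""
    return [p for p in phrases if p.rank > 0.0]


# A span made of a single PRON has no kept tokens, so the sum is empty.
node_ranks = {("product", "NOUN"): 0.2}  # made-up value for illustration
print(span_rank([Token("it", "PRON")], node_ranks))  # 0.0

# Phrase ranks as reported in the issue output.
phrases = [
    Phrase("the product", 0.12286712485174818),
    Phrase("WRG1 general firmware", 0.10712303413227088),
    Phrase("The MCU SDK", 0.0834726982382997),
    Phrase("it", 0.0),
]
kept = drop_zero_rank(phrases)
```

With real output, the same filter would be a comprehension over doc._.phrases that keeps the entries where phrase.rank is greater than zero.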

ceteri added the question label Jan 4, 2024

chencjiajy closed this as completed Jan 8, 2024

chencjiajy reopened this Jan 8, 2024
