Extracting Keywords That Are Subsets Of Each Other #137

shner-elmo · 2023-01-08T17:59:32Z

Hey everyone, I was wondering why is it that when we have multiple keywords that overlap, for example: ["super computer", "computer game"] we only extract the longest one, why not extract both of them?
I would assume that if you want to extract a bunch of keywords from a document it makes sense to get all the matches (even those that overlap), then the user could decide what to do with them.

In this test we want to make sure that with the following sentence: sentence = "distributed super computer game" and these following keywords:

{
    "Distributed Super Computer": ["distributed super computer"],
    "Computer Game": ["computer game"]
}

we only extract the first keyword which is the longest: "Distributed Super Computer", but why not get both in their order? i.e: ["Distributed Super Computer", "computer game"]

The text was updated successfully, but these errors were encountered:

Raidus · 2023-03-15T06:22:08Z

Would be also interested in an efficient way to do that. It would not be compute efficient and would only work on very short text like a keyword list but you could create ngram (1,2, 3) pairs like distributed, computer, distributed super, super computer and distributed super computer and do flashtext the work on the bigrams list. Then you could match it.

You could try out hyperscan. It's extremely fast. But compiling a huge list of regex rules takes a bit of time. Here is a great tutorial: https://geekmonkey.org/regular-expression-matching-at-scale-with-hyperscan/

Sandy4321 · 2023-05-28T20:41:35Z

hyperscan is not working for windows..

NLPShenanigans · 2023-11-23T16:45:17Z

Hey @shner-elmo and @Raidus, while it doesn't look that this repo is well-maintained at this point, I'd like to give a stab why the algorithm doesn't accommodate. In short, it's greedy, so it seeks to locally optimise the length of the extracted tokens. This is done for efficiency, but you both rightly point out that this can lead to instances, such as in named entity recognition, where you'll miss mentions of overlapping name entities. Intuitively, I'd expect that this isn't wholly undesirable though, since in the toy example you mention, the named entity is the whole sentence, "distributed super computer game". In this, I would argue building your vocabulary to add into the keyword processor class instance, either via word tokenisation w/n-gram enrichment, or subword tokenisation via something like BPE, would be a suitable choice to ensure your vocabulary is better aligned to the true named entities you wish to extract. If you still need the smaller token counts too, building a set of unigrams, bigrams and trigrams as your vocabulary, and then a more brute force approach may be required. I hope that helps a little!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extracting Keywords That Are Subsets Of Each Other #137

Extracting Keywords That Are Subsets Of Each Other #137

shner-elmo commented Jan 8, 2023

Raidus commented Mar 15, 2023 •

edited

Loading

Sandy4321 commented May 28, 2023

NLPShenanigans commented Nov 23, 2023

Extracting Keywords That Are Subsets Of Each Other #137

Extracting Keywords That Are Subsets Of Each Other #137

Comments

shner-elmo commented Jan 8, 2023

Raidus commented Mar 15, 2023 • edited Loading

Sandy4321 commented May 28, 2023

NLPShenanigans commented Nov 23, 2023

Raidus commented Mar 15, 2023 •

edited

Loading