Add a setting for how to deal with overlapping matches #523

Open
jan-niestadt opened this issue Jul 2, 2024 · 0 comments

Certain queries, such as [lemma="cat"] [lemma!="dog"]{1,10}, can produce many overlapping hits (cat followed by 1 non-dog; cat followed by 2 non-dogs; etc.). For some queries you want all the possibilities, but for others you would prefer these hits to be filtered down to the ones most relevant to you.
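To make the overlap concrete, here is a minimal Python sketch that enumerates the hits such a query would produce on a toy sentence. It assumes the repetition is meant as a range of 1 to 10 tokens, and it represents a hit as a `(start, end)` token span with an exclusive end. The function name `find_hits` and the sample sentence are made up for illustration; this is not how BlackLab itself finds matches.

```python
def find_hits(lemmas, max_repeat=10):
    """Return (start, end) spans, end exclusive, for:
    a 'cat' token followed by 1..max_repeat tokens that are not 'dog'."""
    hits = []
    for start, lemma in enumerate(lemmas):
        if lemma != "cat":
            continue
        end = start + 1
        # Extend the hit one non-dog token at a time; every extension is its own hit.
        while (end < len(lemmas)
               and lemmas[end] != "dog"
               and end - start <= max_repeat):
            end += 1
            hits.append((start, end))
    return hits


lemmas = ["the", "cat", "sit", "on", "the", "mat", "dog"]
hits = find_hits(lemmas)
print(hits)  # [(1, 3), (1, 4), (1, 5), (1, 6)]
```

A single occurrence of "cat" yields four hits here, all sharing the same start position and each one contained in the next.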

This is somewhat similar to how regex engines usually offer greedy, reluctant and possessive quantifiers. Replicating those exact behaviours in BlackLab would be challenging, though, because it finds matches in a different way, using the reverse index.
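For comparison, a small sketch of the regex behaviour using Python's `re` module (chosen only for illustration). Python calls the reluctant mode "non-greedy". Possessive quantifiers are left out because `re` only supports them from Python 3.11 onwards.

```python
import re

text = "aaa"

# Greedy: the quantifier takes as many characters as it can.
greedy = re.match(r"a+", text).group()

# Reluctant (non-greedy): the quantifier takes as few characters as it can.
reluctant = re.match(r"a+?", text).group()

print(greedy)     # aaa
print(reluctant)  # a
```

In both modes the engine reports a single match per starting point, never all the alternatives at once, which is the main difference from the current BlackLab behaviour described above.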

There are many ways BlackLab could filter out certain overlapping hits, e.g.:

  • keep everything (this is how it currently works)
  • for hits with the same start position, discard all but the longest (or shortest) (but giving start position a special meaning seems arbitrary)
  • when two hits overlap, keep the one that starts the earliest in the document; discard the other (again, seems arbitrary)
  • discard any hits that are fully contained in another hit (or that fully contain another hit)
  • when two hits (partially or fully) overlap, keep the longest (or shortest); discard the other
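The modes listed above can be sketched as simple post-filters over a list of hits. The following Python is only an illustration under some assumptions: hits are `(start, end)` token spans with an exclusive end, all hits come from a single document, and ties are broken by span order. All function names are hypothetical and none of this is BlackLab's API. The pairwise modes are implemented greedily, which is one possible reading of "keep one, discard the other".

```python
def length(hit):
    return hit[1] - hit[0]


def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def keep_all(hits):
    """Current behaviour: no filtering."""
    return sorted(hits)


def best_per_start(hits, shortest=False):
    """For hits with the same start, keep only the longest (or shortest)."""
    best = {}
    for h in hits:
        cur = best.get(h[0])
        better = cur is None or (
            length(h) < length(cur) if shortest else length(h) > length(cur))
        if better:
            best[h[0]] = h
    return sorted(best.values())


def earliest_wins(hits):
    """Scan in document order; drop any hit overlapping one already kept."""
    kept = []
    for h in sorted(hits):
        if not kept or not overlaps(h, kept[-1]):
            kept.append(h)
    return kept


def drop_contained(hits):
    """Discard hits that lie fully inside another hit."""
    unique = set(hits)
    return sorted(
        h for h in unique
        if not any(o != h and o[0] <= h[0] and h[1] <= o[1] for o in unique))


def length_wins(hits, shortest=False):
    """Among overlapping hits, prefer the longest (or shortest)."""
    order = sorted(
        set(hits), key=lambda h: (length(h) if shortest else -length(h), h))
    kept = []
    for h in order:
        if not any(overlaps(h, k) for k in kept):
            kept.append(h)
    return sorted(kept)


hits = [(1, 3), (1, 4), (1, 5), (1, 6), (4, 8), (10, 12)]
print(best_per_start(hits))  # [(1, 6), (4, 8), (10, 12)]
print(earliest_wins(hits))   # [(1, 3), (4, 8), (10, 12)]
print(drop_contained(hits))  # [(1, 6), (4, 8), (10, 12)]
print(length_wins(hits))     # [(1, 6), (10, 12)]
```

Note how the modes diverge on the same input: keeping the longest of any overlapping pair also drops (4, 8), because it partially overlaps the longer (1, 6), while discarding only contained hits keeps it.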

We should try to support some of the most helpful modes.

(via @franklandsbergen)
