Reverse and forward matching use slightly different regex syntax #440
Comments
ChatGPT gives the following suggestion. Not all of it works, but the approach (with some updates/fixes to the code) seems viable:

Lucene's regular expression support is provided through the RegExp class in the org.apache.lucene.util.automaton package. Here's an example of how you might use Lucene's regex engine to find a pattern in a string and iterate over the matches, accessing captured groups:

```java
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.Transition;
import org.apache.lucene.util.automaton.TransitionIterator;

public class LuceneRegexExample {
    public static void main(String[] args) {
        // Your input string
        String input = "The quick brown fox jumps over the lazy dog";
        // Your regular expression pattern
        String pattern = "\\b(\\w+)(\\s+\\w+)*\\b"; // Example pattern to match words
        // Compile the regular expression
        RegExp re = new RegExp(pattern);
        // Get the automaton for the regular expression
        // You can also use re.toAutomaton() if you don't need to modify the automaton further
        // This can be useful for optimizing the regex compilation if it's used multiple times
        RegExp.Automaton automaton = re.toAutomaton();
        // Iterate over transitions to find matches
        TransitionIterator iterator = automaton.getInitialState().getTransitions();
        int currentIndex = 0;
        while (iterator.hasNext()) {
            Transition transition = iterator.next();
            int nextState = transition.getDest().getNumber();
            // Check if the transition is a match
            if (transition.getMin() <= currentIndex && currentIndex < transition.getMax()) {
                System.out.println("Match found at index: " + currentIndex);
                // Access captured groups if needed
                String matchedText = input.substring(currentIndex, currentIndex + (transition.getMax() - transition.getMin()));
                System.out.println("Matched text: " + matchedText);
                // Access captured groups
                for (int group = 1; group <= re.numberOfGroups(); group++) {
                    int start = re.start(group);
                    int end = re.end(group);
                    if (start != -1 && end != -1) {
                        String capturedGroup = input.substring(start, end);
                        System.out.println("Group " + group + ": " + capturedGroup);
                    }
                }
                // Move the current index to the next character after the match
                currentIndex = currentIndex + (transition.getMax() - transition.getMin());
            } else {
                // Move to the next character if there is no match
                currentIndex++;
            }
        }
    }
}
```

In this example, we use |
See e.g. https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html#COMPLEMENT
Maybe we can use Lucene's regex engine there as well? Otherwise we'd have to try to translate the regex to the other engine's syntax, which could be challenging.
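Short of a full translation, one cheaper safeguard might be to flag patterns that contain Lucene-only operators, so that a mismatch is reported instead of silently producing different matches. The sketch below is only an illustration: the class and method names are hypothetical, the character list is my reading of the linked RegExp Javadoc, and it assumes the other engine is `java.util.regex` with default flags (the issue does not say which engine it is).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or of this project.
final class LuceneOnlySyntaxCheck {

    // Characters that can act as operators in Lucene's RegExp syntax
    // (complement, intersection, empty language, any string, intervals,
    // quoted strings), but that java.util.regex reads as plain literals
    // outside of constructs such as named groups.
    private static final String LUCENE_ONLY = "~&#@<>\"";

    // Returns the positions of unescaped Lucene-only characters that occur
    // outside [...] character classes. Deliberately conservative: a Java
    // named group such as (?<name>...) would also be reported.
    static List<Integer> findLuceneOnlyOperators(String pattern) {
        List<Integer> positions = new ArrayList<>();
        boolean inCharClass = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (inCharClass) {
                if (c == ']') inCharClass = false;
            } else if (c == '[') {
                inCharClass = true;
            } else if (LUCENE_ONLY.indexOf(c) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(findLuceneOnlyOperators("a~b"));        // [1]
        System.out.println(findLuceneOnlyOperators("a\\~b|[~]c")); // []
    }
}
```

A non-empty result would mean the pattern may behave differently in the two engines and should either be rejected or be routed to Lucene's engine only.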
Not a huge issue in practice, but could in rare cases lead to baffling matching bugs...
If we want to enable optional features in Lucene's regex engine, such as the complement operator `~`, this becomes more of a problem. We've enabled this for relations matches now, but those never use the forward index.
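To make the divergence concrete: per the linked Javadoc, with the COMPLEMENT flag enabled Lucene reads `~a` as "any string other than `a`", so it would accept `b`. The small sketch below shows the other side, assuming the forward-index matching goes through `java.util.regex` (an assumption; the issue does not name the engine), where `~` is an ordinary literal character. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Illustration only: how java.util.regex interprets a pattern containing '~'.
final class TildeLiteralDemo {

    // Full-string match, comparable to matching a whole token value.
    static boolean javaMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // In java.util.regex, "~a" matches only the two-character string "~a".
        System.out.println(javaMatches("~a", "~a")); // true
        // Under Lucene's complement syntax "b" would be accepted; here it is not.
        System.out.println(javaMatches("~a", "b"));  // false
    }
}
```

So the same pattern applied to the token `b` would give opposite answers depending on which matching path is taken.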