Reverse and forward matching use slightly different regex syntax #440

Open
jan-niestadt opened this issue Jul 7, 2023 · 1 comment
Labels
bug refactor Proposed internal reorganization or cleanup of the code

Comments

@jan-niestadt
Member

See e.g. https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html#COMPLEMENT, which states:

The reserved characters used in the (enabled) syntax must be escaped with backslash (\) or double-quotes ("..."). (In contrast to other regexp syntaxes, this is required also in character classes.)
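
As a rough illustration of the escaping rule quoted above, here is a minimal sketch of a helper that backslash-escapes a literal string for use inside a Lucene regex. The class and method names are hypothetical, and the set of reserved characters is an assumption based on the Lucene 8.7 RegExp Javadoc with all optional features enabled; it should be checked against the Lucene version and syntax flags actually in use.

```java
public class LuceneEscape {

    // Assumed reserved characters: the core operators plus the optional
    // ones (# @ & < > ~). Verify against the flags you enable.
    private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

    /** Prefix every reserved character with a backslash. */
    public static String escape(String literal) {
        StringBuilder sb = new StringBuilder(literal.length() + 8);
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (RESERVED.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a~b"));
        System.out.println(escape("C# <3"));
    }
}
```

Because the quoted rule also applies inside character classes, the same helper could be used for characters placed between [ and ], although characters such as - and ^ that are only special inside a class are not handled here.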

Maybe we can use Lucene's regex engine for forward index matching as well? Otherwise we'd have to translate each regex to the other engine's syntax, which could be challenging.

Not a huge issue in practice, but could in rare cases lead to baffling matching bugs...

If we want to enable optional features in Lucene's regex engine such as the complement operator ~, this becomes more of a problem. We've enabled this for relations matches now, but those never use the forward index.
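
To make the divergence concrete, here is a small sketch that assumes the non-Lucene side uses java.util.regex. The issue does not name the other engine, so that is an assumption, and the class and method names are hypothetical. The code only exercises java.util.regex; the Lucene behaviour described in the comments is taken from the grammar in the RegExp Javadoc and is not executed here.

```java
import java.util.regex.Pattern;

public class SyntaxDivergence {

    /** Whole-string match with java.util.regex (assumed "other" engine). */
    public static boolean javaMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // In java.util.regex, ~ is an ordinary character, so a~b only
        // matches the three-character string "a~b".
        System.out.println(javaMatches("a~b", "a~b"));
        System.out.println(javaMatches("a~b", "ac"));
        // Per the Lucene grammar with COMPLEMENT enabled, a~b means
        // "a followed by anything other than b", so "ac" would be accepted.

        // In java.util.regex, double quotes are ordinary characters and
        // the dot is still a wildcard.
        System.out.println(javaMatches("\"a.c\"", "\"abc\""));
        System.out.println(javaMatches("\"a.c\"", "a.c"));
        // Per the Lucene grammar, "a.c" in double quotes is the literal
        // string a.c, without the quotes.
    }
}
```

Any pattern containing ~ or double quotes would therefore need translating before being handed to the second engine, which is the kind of case where the two code paths could silently disagree.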

@jan-niestadt jan-niestadt added bug refactor Proposed internal reorganization or cleanup of the code labels Jul 7, 2023
@jan-niestadt
Member Author

ChatGPT gives the following suggestion. Not all of it works, but the approach (with some updates/fixes to the code) seems viable:


Lucene's regular expression support is provided through the RegExp class in the org.apache.lucene.util.automaton package. Here's an example of how you might use Lucene's regex engine to test whether strings match a pattern:

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.RegExp;

// Written against Lucene 8.7 (the version linked above); other versions
// may require the automaton to be determinized explicitly.
public class LuceneRegexExample {

    public static void main(String[] args) {
        // Values to test, e.g. terms read from a forward index
        String[] tokens = { "The", "quick", "brown", "fox", "jumps" };

        // Pattern in Lucene's syntax: any lowercase word containing an 'o'.
        // Lucene always matches the whole input, so no anchors are needed.
        String pattern = "[a-z]*o[a-z]*";

        // Parse the regular expression (default flags enable all optional syntax)
        RegExp re = new RegExp(pattern);

        // Convert the parsed expression to a finite-state automaton
        Automaton automaton = re.toAutomaton();

        // Wrap the automaton so it can be run against Java strings.
        // Build this once and reuse it; construction is the expensive part.
        CharacterRunAutomaton matcher = new CharacterRunAutomaton(automaton);

        for (String token : tokens) {
            // run() returns true only if the entire string is accepted
            if (matcher.run(token)) {
                System.out.println("Match: " + token);
            } else {
                System.out.println("No match: " + token);
            }
        }
    }
}

In this example, we use org.apache.lucene.util.automaton.RegExp to parse the regular expression pattern, obtain the automaton for it using re.toAutomaton(), and wrap that in a CharacterRunAutomaton, whose run(String) method reports whether the whole input matches. Lucene's RegExp has no capture groups and no find-style iteration over substrings of a longer text, so each value has to be tested as a whole.
