Adding regex replacement feature #202
I think this is a good example and easy to understand; thanks for the thorough documentation. In the context of Wikipedia extracts, more sentences might actually mean less content: a sentence that fulfills all rule requirements might get split into two, and then only one of the parts gets picked. Of course, this depends heavily on how many potential sentences a given article has. In many cases (such as yours) splitting is beneficial, but it doesn't always have to be. It might be worth adding a short explanation of that here as well.
I think this is where it is most worthwhile: when the article does not have enough sentences to select from (<3) because of the rules, especially max_words and/or max_characters. In that case, this algorithm can kick in and try to produce split sentences.
There is no way for us to know if pre-split or post-split can produce more "valuable" sentences. But many of the "sub-sentences" might be simple introductory wording etc.
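To make the fallback idea concrete, here is a minimal sketch in plain Rust. It is not the code in this PR: `Rules`, `passes`, `split_sentence`, and `select_candidates` are hypothetical names, only the `max_words` and `max_characters` limits come from the discussion above, and splitting on `,` and `;` is an assumption standing in for whatever replacement pattern a language would actually configure.

```rust
/// Hypothetical rule limits; only the two field names come from the discussion.
struct Rules {
    max_words: usize,
    max_characters: usize,
}

/// A sentence passes if it stays within both limits.
fn passes(rules: &Rules, sentence: &str) -> bool {
    sentence.split_whitespace().count() <= rules.max_words
        && sentence.chars().count() <= rules.max_characters
}

/// Assumed splitting strategy: break on ',' and ';' and trim the parts.
fn split_sentence(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| c == ',' || c == ';')
        .map(str::trim)
        .filter(|part| !part.is_empty())
        .map(String::from)
        .collect()
}

/// Keep unsplit sentences when enough of them pass; otherwise also try
/// splitting the sentences that failed and keep the parts that pass.
fn select_candidates(rules: &Rules, sentences: &[&str], min_needed: usize) -> Vec<String> {
    let mut selected: Vec<String> = sentences
        .iter()
        .copied()
        .filter(|s| passes(rules, s))
        .map(|s| s.to_string())
        .collect();
    if selected.len() >= min_needed {
        return selected;
    }
    // Fallback: only sentences that failed the rules are split.
    for sentence in sentences.iter().copied().filter(|s| !passes(rules, s)) {
        selected.extend(
            split_sentence(sentence)
                .into_iter()
                .filter(|part| passes(rules, part)),
        );
    }
    selected
}
```

With `max_words = 5`, the sentence "This long sentence has many words, but it splits nicely" fails unsplit, and after splitting only "but it splits nicely" survives. That is the trade-off mentioned above: the split produces an extra candidate, but the surviving part may be the less valuable half.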
I agree, and that's exactly why I would prefer a short sentence explaining that, so people don't just copy it blindly. If we have an indication that it works for all corpora, then we could also just do it by default.
Added a note on sentence splitting.
This made me wonder whether this implementation should go further than just the 3-value form. Initially I thought such a regex implementation would take only two arguments and basically work like the `replace_all` function. But thinking about it, I can absolutely see why 3 arguments can be even more helpful, though many use cases could also be covered by named capture groups (but not all!). Would you be interested in implementing a second form that accepts two arguments and replaces every matched occurrence with that string? This could of course be done outside this PR as a follow-up.
Check warning on line 34 in src/replacer.rs
GitHub Actions / lint