Adding regex replacement feature #202

72 changes: 48 additions & 24 deletions README.md
@@ -155,30 +155,31 @@ cargo run --release -- -l en -d ../texts/ extract-file >> file.en.txt

The following rules can be configured per language. Add a `<language>.toml` file in the `rules` directory to enable a new locale. Note that the `replacements` get applied before any other rules are checked.

| Name | Description | Values | Default |
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|---------|
| abbreviation_patterns | Regex defining abbreviations | Rust Regex Array | all abbreviations allowed
| allowed_symbols_regex | Regex of allowed symbols or letters. Each character gets matched against this pattern. | String Array | not used
| broken_whitespace | Array of broken whitespaces. This could for example disallow two spaces following each other | String Array | all types of whitespaces allowed
| disallowed_symbols | Use `allowed_symbols_regex` instead. Array of disallowed symbols or letters. Only used when allowed_symbols_regex is not set or is an empty String. | String Array | all symbols allowed
| disallowed_words | Array of disallowed words. Prefer the blocklist approach when possible. | String Array | all words allowed
| even_symbols | Symbols that always need an even count | Char Array | []
| matching_symbols | Symbols that map to another | Array of matching configurations: each configuration is an Array of two values: `["match", "match"]`. See example below. | []
| max_word_count | Maximum number of words in a sentence | integer | 14
| may_end_with_colon | If a sentence can end with a : or not | boolean | false
| min_characters | Minimum of character occurrences | integer | 0
| max_characters | Maximum of character occurrences | integer | MAX
| min_trimmed_length | Minimum length of string after trimming | integer | 3
| min_word_count | Minimum number of words in a sentence | integer | 1
| needs_letter_start | If a sentence needs to start with a letter | boolean | true
| needs_punctuation_end | If a sentence needs to end with a punctuation | boolean | false
| needs_uppercase_start | If a sentence needs to start with an uppercase | boolean | false
| other_patterns | Regex to disallow anything else | Rust Regex Array | all other patterns allowed
| quote_start_with_letter | If a quote needs to start with a letter | boolean | true
| remove_brackets_list | Removes (possibly nested) user defined brackets and content inside them `(anything [else])` from the sentence before replacements and checking other rules | Array of matching brackets: each configuration is an Array of two values: `["opening_bracket", "closing_bracket"]`. See example below. | []
| replacements | Replaces abbreviations or other words according to configuration. This happens before any other rules are checked. | Array of replacement configurations: each configuration is an Array of two values: `["search", "replacement"]`. See example below. | nothing gets replaced
| regex_replacement_list | Finds matches of a regex and replaces the search string within each match. This happens before any other rules are checked. | Array of configurations: each configuration is an Array of three values: `["regex", "search", "replacement"]`. See example below. | nothing gets replaced
| segmenter | Segmenter to use for this language. See below for more information. | "python" | using `rust-punkt` by default
| stem_separator_regex | If given, splits words at the given characters to reach the stem words to check them again against the blacklist, e.g. prevents "Rust's" to pass if "Rust" is in the blacklist. | Simple regex of separators, e.g. for apostrophe `stem_separator_regex = "[']"` | ""

### Example for `matching_symbols`

@@ -239,6 +240,29 @@ Input: I am foo test a test
Output: I am hi a hi
```

### Example for `regex_replacement_list`

```
regex_replacement_list = [
# Split glued sentences
["\\ [a-z]{3,}\\.[A-Z][a-z]{2,}\\ ", ".", ". "],

# Split long sentences
["\\b(?:\\S+\\s+){15,}\\S+[.!?]", ", but ", ". But "],
]
```

The first rule finds places where two sentences are glued together without a space after the period and inserts the missing space.
The second rule splits a long sentence (16 or more words) into two shorter ones at `, but `.

Member

I think this is a good example and easily understandable, thanks for this thorough documentation. In the context of Wikipedia extracts, more sentences might actually mean less content, as a sentence might be fulfilling all rule requirements, but then gets split into two. And then only one of them gets picked. Of course this heavily depends on how many potential sentences a given article has. In many cases (such as yours), this might be beneficial, but it doesn't always have to be. Might be worth it to write a short explanation here for that as well.

Contributor

I think this is where it is most worthwhile: when the article does not have enough sentences to select from (<3) because of the rules, especially `max_word_count` and/or `max_characters`. At that point, this algorithm can kick in and try to produce split sentences.

There is no way for us to know if pre-split or post-split can produce more "valuable" sentences. But many of the "sub-sentences" might be simple introductory wording etc.
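
A rough sketch of the fallback idea in the comment above, not code from this PR. The names `MIN_SENTENCES`, `passes_rules`, `split_long_sentence`, and `pick_sentences` are hypothetical; the word-count check stands in for the extractor's real rules, and the `", but "` split stands in for a `regex_replacement_list` rule.

```rust
// Hypothetical threshold, taken from the "(<3)" in the comment above.
const MIN_SENTENCES: usize = 3;

/// Stand-in for the real rule checks: only the word count is enforced here.
fn passes_rules(sentence: &str, max_word_count: usize) -> bool {
    let words = sentence.split_whitespace().count();
    (1..=max_word_count).contains(&words)
}

/// Stand-in for a `regex_replacement_list` rule: split once at ", but ".
fn split_long_sentence(sentence: &str) -> Vec<String> {
    match sentence.split_once(", but ") {
        Some((head, tail)) => vec![format!("{}.", head), format!("But {}", tail)],
        None => vec![sentence.to_string()],
    }
}

/// Keep sentences that pass as-is; split rejected ones only when too few passed.
fn pick_sentences(sentences: &[&str], max_word_count: usize) -> Vec<String> {
    let mut picked: Vec<String> = sentences
        .iter()
        .filter(|s| passes_rules(s, max_word_count))
        .map(|s| s.to_string())
        .collect();

    if picked.len() < MIN_SENTENCES {
        for sentence in sentences.iter().filter(|s| !passes_rules(s, max_word_count)) {
            for part in split_long_sentence(sentence) {
                // A split part is only useful if it passes the rules itself.
                if passes_rules(&part, max_word_count) {
                    picked.push(part);
                }
            }
        }
    }
    picked
}

fn main() {
    let article = [
        "A short sentence.",
        "A first part of a long sentence that would be rejected, but in fact it could be used.",
    ];
    for sentence in pick_sentences(&article, 14) {
        println!("{}", sentence);
    }
}
```

With this shape, articles that already yield enough sentences are never split, so a complete sentence is not traded for a fragment unless the article would otherwise contribute too little.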

Member

I agree, and that's exactly why I would prefer a short sentence explaining that, so people don't just blindly copy. If we have an indication that it works in all corpuses, then we could also just do it by default.

Author

Added a note on sentence splitting.


```
Input: A sentence.Glued to another.
Output: A sentence. Glued to another.

Input: A first part of a long sentence that would be rejected, but in fact it could be used.
Output: A first part of a long sentence that would be rejected. But in fact it could be used.
```

## Using disallowed words

In order to increase the quality of the final output, you might want to consider filtering out some words that are complex, too long or non-native.
29 changes: 29 additions & 0 deletions src/replacer.rs
@@ -28,6 +28,19 @@
        }
    }

    // regex replacements
    for regex_replacement in rules.regex_replacement_list.iter() {
        if Value::as_array(regex_replacement).unwrap().len() == 3 {

Member

This made me wonder if this implementation should go further than just with 3 values. Initially I thought such a regex implementation would only take two arguments and basically work like the replace_all function. But thinking about it, I can absolutely see why 3 arguments can be even more helpful, though many use cases could also be covered by named capture groups (but not all!).

Would you be interested in implementing a second form of this that accepts two arguments and replaces every matched occurrence with that string? This of course could be done outside this PR as a follow-up.

            // `as_str()` already yields a `&str`, so no extra borrow is needed
            // (this addresses the lint warning reported for this line).
            let regex = Regex::new(regex_replacement[0].as_str().unwrap()).unwrap();
            let search = regex_replacement[1].as_str().unwrap();
            let replacement = regex_replacement[2].as_str().unwrap();

            // Replace `search` only inside the text matched by `regex`.
            result = regex.replace_all(&result, |caps: &regex::Captures| {
                caps[0].replace(search, replacement)
            }).to_string();
        }
    }

    result
}
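
The closure passed to `replace_all` above rewrites `search` only inside each regex match and leaves the rest of the sentence untouched. The sketch below illustrates that behaviour with the standard library alone. It is an illustration, not the PR's code: `replace_within_spans` and `find_glued_spans` are hypothetical helpers, and the hand-written span finder is a simplified stand-in for the regex (an ASCII lowercase letter, a `.`, then an ASCII uppercase letter).

```rust
/// Replace `search` with `replacement`, but only inside the given byte spans.
/// Spans must be sorted, non-overlapping, and lie on char boundaries.
fn replace_within_spans(
    text: &str,
    spans: &[(usize, usize)],
    search: &str,
    replacement: &str,
) -> String {
    let mut result = String::with_capacity(text.len());
    let mut last = 0;
    for &(start, end) in spans {
        // Text between matches is copied unchanged.
        result.push_str(&text[last..start]);
        // Only the matched span is rewritten.
        result.push_str(&text[start..end].replace(search, replacement));
        last = end;
    }
    result.push_str(&text[last..]);
    result
}

/// Hypothetical stand-in for the regex: spans covering `x.Y`, where `x` is an
/// ASCII lowercase letter and `Y` an ASCII uppercase letter.
fn find_glued_spans(text: &str) -> Vec<(usize, usize)> {
    text.as_bytes()
        .windows(3)
        .enumerate()
        .filter(|(_, w)| w[0].is_ascii_lowercase() && w[1] == b'.' && w[2].is_ascii_uppercase())
        .map(|(i, _)| (i, i + 3))
        .collect()
}

fn main() {
    let input = "A sentence.Glued to another. See v1.2 for details.";
    let spans = find_glued_spans(input);
    println!("{}", replace_within_spans(input, &spans, ".", ". "));
}
```

A plain `input.replace(".", ". ")` would also touch `v1.2` and the periods that are already followed by a space, which is why the three-value form restricts the replacement to the matched text.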

@@ -168,4 +181,20 @@
        assert_eq!(replace_strings(&rules, &String::from("Four: (content (and nested one)) should be removed.")), "Four: should be removed.");
        assert_eq!(replace_strings(&rules, &String::from("Five: (one) (two) and [three] 'and' should stay.")), "Five: and 'and' should stay.");
    }

    #[test]
    fn test_regex_replacement() {
        let rules = Rules {
            regex_replacement_list: vec![
                Value::try_from([
                    Value::try_from("\\ [a-z]{3,}\\.[A-Z][a-z]{2,}\\ ").unwrap(),
                    Value::try_from(".").unwrap(),
                    Value::try_from(". ").unwrap()
                ]).unwrap(),
            ],
            ..Default::default()
        };

        assert_eq!(replace_strings(&rules, &String::from("A sentence.Glued to another.")), "A sentence. Glued to another.");
    }
}
3 changes: 3 additions & 0 deletions src/rules.rs
@@ -57,6 +57,7 @@ pub struct Rules {
    pub other_patterns: Array,
    pub stem_separator_regex: String,
    pub replacements: Array,
    pub regex_replacement_list: Array,
    pub even_symbols: Array,
    pub matching_symbols: Array,
}
@@ -84,6 +85,7 @@ impl Default for Rules {
            other_patterns: vec![],
            stem_separator_regex: String::from(""),
            replacements: vec![],
            regex_replacement_list: vec![],
            even_symbols: vec![],
            matching_symbols: vec![],
        }
@@ -121,6 +123,7 @@ mod test {
        assert_eq!(rules.other_patterns, vec![]);
        assert_eq!(rules.stem_separator_regex, String::from(""));
        assert_eq!(rules.replacements, vec![]);
        assert_eq!(rules.regex_replacement_list, vec![]);
        assert_eq!(rules.even_symbols, vec![]);
        assert_eq!(rules.matching_symbols, vec![]);
    }