Releases will be numbered with the following semantic versioning format:
<major>.<minor>.<patch>
And constructed with the following guidelines:
- Breaking backward compatibility bumps the major (and resets the minor and patch)
- New additions without breaking backward compatibility bump the minor (and reset the patch)
- Bug fixes and misc changes bump the patch
BUG FIXES
- replace_emoticon replaced emoticon-like substrings within actual words. Spotted thanks to Carolyn Challoner; see issue #46.
- replace_number failed if the number pattern contained two leading decimals or hyphens. Spotted thanks to Stefano De Sabbata; see issue #60.
- replace_word_elongation failed for repeats of the same character in different case (e.g., replace_word_elongation("Ooo") resulted in NA). This has been corrected. Additionally, the elongation.search.pattern, defined as "(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)", has been moved externally, to a parameter, allowing the user to alter this pattern if desired. Spotted thanks to Stefano De Sabbata; see issue #59.
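To make the case-insensitive behavior of the elongation pattern concrete, here is a minimal sketch in Python rather than R. It reuses the pattern quoted above with R's doubled backslashes undone; the helper name has_elongation is hypothetical and is not part of the package.

```python
import re

# The pattern quoted in the changelog entry, written as a Python raw string.
# (?i) makes both the [a-z] class and the \1 backreference case-insensitive.
ELONGATION = re.compile(r"(?i)(?:^|\b)\w*([a-z])(?:\1{2,})\w*($|\b)")


def has_elongation(word: str) -> bool:
    """Return True when some letter occurs three or more times in a row,
    ignoring case (hypothetical helper, for illustration only)."""
    return ELONGATION.search(word) is not None
```

Under this pattern a mixed-case run such as "Ooo" counts as an elongation, while an ordinary double letter as in "book" does not.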
NEW FEATURES
- replace_misspelling added as a way to replace misspelled words with their most likely replacement using hunspell in the backend. Suggested by Surin Space; see issue #39.
- as_ordinal added as a convenience wrapper for english::ordinal that takes integers and converts them to ordinal form.
- %like% added as a binary operator similar to SQL's LIKE.
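For readers unfamiliar with SQL's LIKE, the following Python sketch shows the general idea of LIKE-style matching. It assumes the standard SQL wildcards (% for any run of characters, _ for exactly one character); the function name like is hypothetical, and the changelog does not say whether %like% follows exactly these rules or how it treats case.

```python
import re


def like(text: str, pattern: str) -> bool:
    """SQL-LIKE style match (illustrative sketch, not the package's code).

    Assumed wildcards: '%' matches any run of characters (including none),
    '_' matches exactly one character. Everything else is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # Escape regex meta-characters so they are treated as plain text.
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None
```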
MINOR FEATURES
- fix_mdyyyy added to correct dates in the form of m/d/yyyy to yyyy-mm-dd.
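The date conversion described above can be sketched with a single regular expression. This is a Python illustration under the assumption that months and days may have one or two digits and years always have four; the name fix_mdyyyy_sketch is hypothetical and the code is not the package's implementation.

```python
import re

# One- or two-digit month and day, four-digit year, separated by slashes.
MDYYYY = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def fix_mdyyyy_sketch(text: str) -> str:
    """Rewrite every m/d/yyyy date in text as yyyy-mm-dd (illustrative)."""
    def reorder(match: re.Match) -> str:
        month, day, year = match.groups()
        # Zero-pad month and day to two digits.
        return f"{year}-{int(month):02d}-{int(day):02d}"

    return MDYYYY.sub(reorder, text)
```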
IMPROVEMENTS
- replace_html picks up the ability to replace "«" & "»" with ASCII equivalents "<<" & ">>". Suggested by Ilya Shutov; see issue #48.
- All internal calls to grepl() now have perl = TRUE added as this is generally a speed up. Suggested by Kyle Haynes (see #51).
CHANGES
- filter_element() and filter_row() have been deprecated for a few years. They have now been removed.
Version update to comply with changes in the glue package's API.
BUG FIXES
- fgsub had a bug in which the original pattern matched the correct location in the string, but the replacement was then applied to the entire string rather than at the location of the first pattern match. This means the extracted string was used as a search term and might be found in places other than the original location (e.g., a leading boundary in '^T' replaced with '__' may have led to '__he __itle' rather than '__he Title' as expected in the string 'The Title'). See #35 for details. The fix will add some time to the computation but is safer.
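The difference between the two behaviors can be reproduced in a few lines. The Python sketch below is an illustration of the bug class only, not the package's R code; both function names are hypothetical.

```python
import re


def replace_extracted_everywhere(text, pattern, fun):
    """Problematic approach: extract the first match, then replace that
    extracted string wherever it occurs in the text."""
    match = re.search(pattern, text)
    if match is None:
        return text
    extracted = match.group(0)
    return text.replace(extracted, fun(extracted))


def replace_at_match(text, pattern, fun):
    """Safer approach: apply the function only at the matched locations."""
    return re.sub(pattern, lambda m: fun(m.group(0)), text)
```

With the string 'The Title', the pattern '^T', and a function that returns '__', the first version yields '__he __itle' and the second yields '__he Title'.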
NEW FEATURES
- replace_to/replace_from added to remove from/to begin/end of string to/from a character(s).
- The following replacement functions were added to provide remediation for problems found in check_text: replace_email, replace_hash, replace_tag, & replace_url.
MINOR FEATURES
- check_text picks up the checks and n arguments. The former allows the user to specify which checks to conduct. The latter allows the user to truncate the output to n elements, ending with a closing ...[truncated]... marker. This makes the function more useful and the code easier to maintain.
IMPROVEMENTS
- replace_non_ascii did not replace all non-ASCII characters. This has been fixed by an explicit replacement of '[^ -~]+', which matches any run of characters outside the printable ASCII range (space through tilde). See issue #34 for details.
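The character class '[^ -~]+' works because the printable ASCII characters form one contiguous range from space to tilde. A short Python sketch of the same idea follows; the function name strip_non_ascii and the empty default replacement are assumptions for illustration.

```python
import re

# Anything outside the range space (0x20) through tilde (0x7E).
OUTSIDE_PRINTABLE_ASCII = re.compile(r"[^ -~]+")


def strip_non_ascii(text: str, replacement: str = "") -> str:
    """Replace each run of characters outside printable ASCII (sketch)."""
    return OUTSIDE_PRINTABLE_ASCII.sub(replacement, text)
```

Note that the class also catches ASCII control characters such as tabs, since they fall below the space character.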
Maintenance release to bring package up to date with the lexicon package API changes.
NEW FEATURES
- match_tokens added to find all the tokens that match a regex(es) within a given text vector. This is useful when combined with the replace_tokens function.
- Fixed versions of drop_element/keep_element added to allow for dropping elements specified by a known vector rather than a regex.
- The collapse and glue functions from the glue package are reexported for easy string manipulation.
- replace_date added for normalizing dates.
- replace_time added for normalizing time stamps.
- replace_money added for normalizing money references.
- mgsub picks up a safe argument using the mgsub package as the backend. In addition, mgsub_regex_safe added to make the usage explicit. The safe mode comes at the cost of speed.
IMPROVEMENTS
- replace_names drops the replacement of c('An', 'To', 'Oh', 'So', 'Do', 'He', 'Ha', 'In', 'Pa', 'Un') which are likely words and not names.
- replace_html picks up some additional symbol replacements including: c("™", "“", "”", "‘", "’", "•", "·", "⋅", "–", "—", "≠", "½", "¼", "¾", "°", "←", "→", "…").
NEW FEATURES
- replace_kern added to replace a form of informal emphasis in which the writer takes words >2 letters long, capitalizes the entire word, and places spaces in between each letter. This was contributed by Stack Overflow's @ctwheels: https://stackoverflow.com/a/47438305/1000343.
- replace_internet_slang added to replace Internet acronyms and abbreviations with machine friendly word equivalents.
- replace_word_elongation added to replace word elongations (a.k.a. "word lengthening") with the most likely normalized word form. See http://www.aclweb.org/anthology/D11-105 for details.
- fgsub added for the ability to match, extract, operate a function over the extracted strings, & replace the original matches with the extracted strings. This performs similar functionality to gsubfn::gsubfn but is less powerful. For more powerful needs see the gsubfn package.
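As an illustration of the kerning cleanup described for replace_kern, here is a rough Python sketch. It assumes kerned words consist of three or more single capital letters separated by single spaces; the regular expression is written for this example and is not the one from the linked Stack Overflow answer, and the name unkern is hypothetical.

```python
import re

# Three or more single capital letters separated by single spaces,
# not attached to other letters on either side.
KERN = re.compile(r"(?<![A-Za-z])(?:[A-Z] ){2,}[A-Z](?![A-Za-z])")


def unkern(text: str) -> str:
    """Collapse spaced-out capital letters into one word (sketch)."""
    return KERN.sub(lambda m: m.group(0).replace(" ", ""), text)
```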
BUG FIXES
- replace_grade did not use fixed = TRUE for its call to mgsub. This could result in the plus signs being interpreted as meta-characters. This has been corrected.
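The effect of treating a grade such as "A+" as a regular expression rather than a fixed string can be shown with a small Python analogue. The function names and the replacement word are made up for this example; the package itself is written in R.

```python
import re


def replace_as_regex(text: str, grade: str, word: str) -> str:
    """Treats the grade as a regex, so '+' acts as a quantifier."""
    return re.sub(grade, word, text)


def replace_as_fixed(text: str, grade: str, word: str) -> str:
    """Treats the grade as a literal string, like fixed = TRUE in R."""
    return text.replace(grade, word)
```

For the text "She earned an A+ on the exam.", the regex version matches only the letter and leaves a stray plus sign behind, while the fixed version replaces the whole grade.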
NEW FEATURES
- replace_names added to remove/replace common first and last names from text data.
- make_plural added to make a vector of singular noun forms plural.
- replace_emoji and replace_emoji_identifier added for replacing emojis with text or an identifier token for use in the sentimentr package.
MINOR FEATURES
- mgsub_regex and mgsub_fixed added to provide wrappers for mgsub that make their use apparent without setting the fixed argument.
- replace_curly_quote added to replace curly quotes with straight versions.
IMPROVEMENTS
- replace_non_ascii now uses stringi::stri_trans_general to coerce more non-ASCII characters to ASCII format.
- check_text now checks for HTML characters/tags. Thanks to @Peter Gensler for suggesting this (see issue #15).
CHANGES
- filter_ functions deprecated in favor of drop_/keep_ versions of filter functions. This change was made to address the opposite meaning that dplyr's filter has, which retains rows matching a pattern by default.
BUG FIXES
- replace_tokens added to complement mgsub for times when the user wants to replace fixed tokens with a single value or remove them entirely. This yields an optimized solution that is much faster than mgsub.
CHANGES
- mgsub no longer uses trim = TRUE by default.
BUG FIXES
- check_text reported to use replace_incomplete rather than add_missing_endmark when endmark is missing.
NEW FEATURES
- The replace_emoticon, replace_grade and replace_rating functions have been moved from the sentimentr package to textclean as these are cleaning functions. This makes the functions more modular and generalizable to all types of text cleaning. These functions are still imported and exported by sentimentr.
- replace_html added to remove html tags and replace symbols with appropriate ASCII symbols.
- add_missing_endmarks added to detect missing endmarks and replace with the desired symbol.
IMPROVEMENTS
- replace_number now uses the english package making it faster and more maintainable. In addition, the function now handles decimal places as well.
BUG FIXES
- check_text reported NA as non-ASCII. This has been fixed.
NEW FEATURES
- check_text added to report on potential problems in a text vector.
- replace_ordinal added to replace ordinal numbers (e.g., 1st) with word representation (e.g., first).
- swap added to swap two patterns simultaneously.
- filter_element added to exclude matching elements from a vector.
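Swapping two patterns "simultaneously" matters because two sequential replacements would undo each other. The Python sketch below illustrates the idea with fixed strings; treating the inputs as literal text is an assumption, and the function names are hypothetical rather than a port of the package's swap.

```python
import re


def swap_sequential(text: str, a: str, b: str) -> str:
    """Naive approach: the second replacement also hits the output of the first."""
    return text.replace(a, b).replace(b, a)


def swap_simultaneous(text: str, a: str, b: str) -> str:
    """Swap a and b in one pass, so no replacement is replaced again."""
    lookup = {a: b, b: a}
    # Match either literal string; escape them so they are not read as regexes.
    pattern = re.compile("|".join(re.escape(key) for key in (a, b)))
    return pattern.sub(lambda m: lookup[m.group(0)], text)
```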
This package is a collection of tools to clean and process text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster.