Text Processing


String methods

  • translate string characters
    • str.maketrans() to get translation table
    • translate() to perform the string mapping based on translation table
  • the first argument to maketrans() is the characters to be replaced, the second is the corresponding replacement characters (both strings must have the same length) and the optional third is the characters to be deleted (mapped to None)
  • character translation examples
>>> greeting = '===== Have a great day ====='
>>> greeting.translate(str.maketrans('=', '-'))
'----- Have a great day -----'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.translate(str.maketrans('=', '-', '!'))
'----- Have a great day -----'

>>> import string
>>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION'
>>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
>>> quote.translate(tr_table)
'simplicity is the ultimate sophistication'

>>> sentence = "Thi1s is34 a senten6ce"
>>> sentence.translate(str.maketrans('', '', string.digits))
'This is a sentence'
>>> greeting.translate(str.maketrans('', '', string.punctuation))
' Have a great day '
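
str.maketrans() also accepts a single dictionary argument, which helps when a character has to map to a longer string. A minimal sketch, where the mapping and the sample text are made up for illustration:

```python
# Illustrative data, not part of the original notes
escape_table = str.maketrans({'<': '&lt;', '>': '&gt;', '&': '&amp;'})

markup = 'a < b && b > c'
# each input character is looked up once, so replacement text is not re-escaped
escaped = markup.translate(escape_table)
print(escaped)  # a &lt; b &amp;&amp; b &gt; c
```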
  • removing leading/trailing/both characters with lstrip(), rstrip() and strip()
  • only consecutive characters from start/end of string are removed
  • by default whitespace characters are stripped
  • if more than one character is specified, the argument is treated as a set of characters and any run made up of those characters is removed, in whatever order they appear
>>> greeting = '      Have a nice day :)     '
>>> greeting.strip()
'Have a nice day :)'
>>> greeting.rstrip()
'      Have a nice day :)'
>>> greeting.lstrip()
'Have a nice day :)     '

>>> greeting.strip(') :')
'Have a nice day'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.strip('=')
' Have a great day!! '
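
Because the argument is a set of characters, strip methods can remove more than intended when the goal is to drop an exact suffix. The sketch below contrasts the two; strip_suffix is a hypothetical helper written for this illustration and the file names are made up:

```python
filename = 'report.txt'

# '.txt' is treated as the set {'.', 't', 'x'}, so the 't' of 'report' goes too
stripped = filename.rstrip('.txt')
print(stripped)  # repor

def strip_suffix(text, suffix):
    """Remove suffix only if text ends with that exact character sequence."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(strip_suffix(filename, '.txt'))    # report
print(strip_suffix('notes.md', '.txt'))  # notes.md
```

Python 3.9 and later provide str.removeprefix() and str.removesuffix() for the same purpose.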
  • styling
  • width argument specifies total output string length, the optional second argument is the fill character (space by default)
>>> ' Hello World '.center(40, '*')
'************* Hello World **************'
  • changing case and case checking
>>> sentence = 'thIs iS a saMple StrIng'

>>> sentence.capitalize()
'This is a sample string'

>>> sentence.title()
'This Is A Sample String'

>>> sentence.lower()
'this is a sample string'

>>> sentence.upper()
'THIS IS A SAMPLE STRING'

>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'

>>> 'good'.islower()
True

>>> 'good'.isupper()
False
  • check if string is made up of numeric characters only (sign and decimal point do not qualify)
>>> '1'.isnumeric()
True
>>> 'abc1'.isnumeric()
False
>>> '1.2'.isnumeric()
False
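
Since isnumeric() rejects strings like '1.2' or '-3', a common alternative is to attempt the conversion and catch the failure. A minimal sketch; is_float is a hypothetical helper name, not a built-in:

```python
def is_float(text):
    """Return True if float() can convert text, False otherwise."""
    try:
        float(text)
    except ValueError:
        return False
    return True

print(is_float('1.2'))   # True
print(is_float('-3'))    # True
print(is_float('abc1'))  # False
```

Note that float() also accepts forms such as '1e5', 'inf' and 'nan', so this check is broader than "digits with an optional decimal point".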
  • check if character sequence is present or not
>>> sentence = 'This is a sample string'
>>> 'is' in sentence
True
>>> 'this' in sentence
False
>>> 'This' in sentence
True
>>> 'this' in sentence.lower()
True
>>> 'is a' in sentence
True
>>> 'test' not in sentence
True
  • get number of times character sequence is present (non-overlapping)
>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0

>>> word = 'phototonic'
>>> word.count('oto')
1
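
If overlapping occurrences have to be counted as well, one option is to restart the search one character after each match. A minimal sketch; count_overlapping is a hypothetical helper built on str.find():

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # resume just after the start of the previous match
        start = text.find(sub, start + 1)
    return count

word = 'phototonic'
print(word.count('oto'))               # 1
print(count_overlapping(word, 'oto'))  # 2
```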
  • matching character sequence at start/end of string
>>> sentence
'This is a sample string'

>>> sentence.startswith('This')
True
>>> sentence.startswith('The')
False

>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False
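
startswith() and endswith() also accept a tuple of strings and return True if any one of them matches. A short sketch; the file names are made up for illustration:

```python
sentence = 'This is a sample string'

starts_ok = sentence.startswith(('The', 'This'))
ends_ok = sentence.endswith(('ly', 'ed'))
print(starts_ok, ends_ok)  # True False

# filter by any of several extensions
files = ['a.txt', 'b.md', 'c.py']
docs = [f for f in files if f.endswith(('.txt', '.md'))]
print(docs)  # ['a.txt', 'b.md']
```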
  • split string based on character sequence
  • returns a list
  • to split using regular expressions, use re.split() instead
>>> sentence = 'This is a sample string'

>>> sentence.split()
['This', 'is', 'a', 'sample', 'string']

>>> "oranges:5".split(':') 
['oranges', '5']
>>> "oranges :: 5".split(' :: ') 
['oranges', '5']

>>> "a e i o u".split(' ', maxsplit=1) 
['a', 'e i o u']
>>> "a e i o u".split(' ', maxsplit=2) 
['a', 'e', 'i o u']

>>> line = '{1.0 2.0 3.0}'
>>> nums = [float(s) for s in line.strip('{}').split()]
>>> nums
[1.0, 2.0, 3.0]
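
split() without an argument and split(' ') behave differently when the input has leading, trailing or repeated spaces. A small sketch with a made-up input line:

```python
line = '  a   b c '

# no argument: split on runs of whitespace, ignore leading/trailing whitespace
default_split = line.split()
print(default_split)  # ['a', 'b', 'c']

# explicit ' ': every single space is a separator, so empty strings appear
space_split = line.split(' ')
print(space_split)  # ['', '', 'a', '', '', 'b', 'c', '']
```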
  • joining list of strings
>>> str_list = ['This', 'is', 'a', 'sample', 'string']
>>> ' '.join(str_list)
'This is a sample string'
>>> '-'.join(str_list)
'This-is-a-sample-string'

>>> c = ' :: '
>>> c.join(str_list)
'This :: is :: a :: sample :: string'
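
join() works only on strings, so other types have to be converted first. A minimal sketch with made-up numbers:

```python
nums = [1, 2, 3]

try:
    ','.join(nums)
    join_failed = False
except TypeError:
    # join() raises TypeError when an item is not a string
    join_failed = True

csv_line = ','.join(str(n) for n in nums)
print(join_failed, csv_line)  # True 1,2,3
```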
  • replace character sequence
  • optional third argument limits how many replacements are made, starting from the left
  • strings are immutable, so replace() returns a new string; the variable has to be explicitly re-assigned to change its value
>>> phrase = '2 be or not 2 be'
>>> phrase.replace('2', 'to')
'to be or not to be'

>>> phrase
'2 be or not 2 be'

>>> phrase.replace('2', 'to', 1)
'to be or not 2 be'

>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'

Further Reading


Regular Expressions

  • Handy reference of regular expression (RE) elements
Meta characters Description
\A anchor to restrict matching to beginning of string
\Z anchor to restrict matching to end of string
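^ anchor to restrict matching to beginning of line (needs re.M flag, otherwise only beginning of string)
$ anchor to restrict matching to end of line (needs re.M flag, otherwise only end of string or just before a trailing newline)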
. Match any character except newline character \n
| OR operator for matching multiple patterns
(RE) capturing group
(?:RE) non-capturing group
[] Character class - match one character among many
\^ prefix \ to literally match meta characters like ^

Greedy Quantifiers Description
* Match zero or more times
+ Match one or more times
? Match zero or one times
{m,n} Match m to n times (inclusive)
{m,} Match at least m times
{,n} Match up to n times (including 0 times)
{n} Match exactly n times

Appending a ? to greedy quantifiers makes them non-greedy
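
A short sketch of the difference between greedy and non-greedy matching, using a made-up input string:

```python
import re

text = '<a> and <b>'

# greedy: .* grabs as much as possible, up to the last '>'
greedy = re.findall(r'<.*>', text)
print(greedy)  # ['<a> and <b>']

# non-greedy: .*? stops at the first '>' that allows a match
lazy = re.findall(r'<.*?>', text)
print(lazy)  # ['<a>', '<b>']
```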


Character classes Description
[aeiou] Match any vowel
[^aeiou] ^ inverts selection, so this matches any character other than these vowels (consonants, digits, punctuation, etc)
[a-f] - defines a range, so this matches any of abcdef characters
\d Match a digit, same as [0-9]
\D Match non-digit, same as [^0-9] or [^\d]
\w Match alphanumeric and underscore character, same as [a-zA-Z0-9_]
\W Match any character other than alphanumeric and underscore, same as [^a-zA-Z0-9_] or [^\w]
\s Match white-space character, same as [\ \t\n\r\f\v]
\S Match non white-space character, same as [^\s]
\b word boundary, see \w for characters constituting a word
\B not a word boundary

For str patterns in Python 3, \d, \w and \s (and their negations) match Unicode characters as well; the ASCII-only equivalents shown above hold exactly when the re.A flag is used

Flags Description
re.I Ignore case
re.M Multiline mode, ^ and $ anchors work on lines
re.S Singleline mode, . will also match \n
re.X Verbose mode, for better readability and adding comments

See Python docs - Compilation Flags for more details and long names for flags
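
The effect of each flag listed above can be sketched as follows. The input strings are made up for illustration, and flags are combined with the | operator:

```python
import re

lines = 'Cat\ncatapult\nconcat'

# without re.M, ^ matches only at the start of the whole string
no_flag = re.findall(r'^cat', lines)
multiline = re.findall(r'^cat', lines, flags=re.M)
both = re.findall(r'^cat', lines, flags=re.I | re.M)
print(no_flag, multiline, both)  # [] ['cat'] ['Cat', 'cat']

# re.S lets . match the newline character as well
dotall = bool(re.search(r'Cat.cat', lines, flags=re.S))
print(dotall)  # True

# re.X ignores whitespace in the pattern and allows comments
big_num = re.compile(r'''
    \b0*          # optional leading zeros
    [1-9]\d{2,}   # a non-zero digit followed by two or more digits
    \b
    ''', flags=re.X)
found = big_num.findall('0501 035 154 12 26 98234')
print(found)  # ['0501', '154', '98234']
```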


Variable Description
\1, \2, \3 ... \99 backreferencing matched patterns
\g<1>, \g<2>, \g<3> ... backreferencing matched patterns, prevents ambiguity
\g<0> entire matched portion

\0 and \100 onwards are considered as octal values, hence cannot be used as backreference.
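
The ambiguity that \g<1> avoids shows up when a backreference is followed by a digit. In the sketch below (made-up input), \10 is read as a reference to group 10, which does not exist in this pattern:

```python
import re

# \g<1>0 means: contents of group 1, followed by a literal 0
tenfold = re.sub(r'(\d+)', r'\g<1>0', '52 apples')
print(tenfold)  # 520 apples

try:
    re.sub(r'(\d+)', r'\10', '52 apples')
    group_10_ok = True
except re.error:
    # only one capture group is defined, so group 10 is invalid
    group_10_ok = False
print(group_10_ok)  # False
```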


Pattern matching and extraction

To match/extract sequence of characters, use

  • re.search() to see if input string contains a pattern or not
  • re.findall() to get a list of all matching portions
  • re.finditer() to get an iterator of re.Match objects of all matching portions
  • re.split() to get a list from splitting input string based on a pattern

Their syntax is as follows:

re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.finditer(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
  • As a good practice, always use raw strings to construct RE, unless other formats are required
    • this will avoid clash of backslash escaping between RE and normal quoted strings
  • examples for re.search
>>> sentence = 'This is a sample string'

# using normal string methods
>>> 'is' in sentence
True
>>> 'xyz' in sentence
False

# need to load the re module before use
>>> import re
# check if 'sentence' contains the pattern described by RE argument
>>> bool(re.search(r'is', sentence))
True
>>> bool(re.search(r'this', sentence, flags=re.I))
True
>>> bool(re.search(r'xyz', sentence))
False
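
The raw string advice above matters as soon as the RE contains a backslash. A minimal sketch with a made-up input, where \b is meant to be a word boundary:

```python
import re

text = 'spar par'

# in a normal string, '\b' is the backspace character, not a word boundary
plain = re.search('\bpar', text)
print(plain)  # None

# raw string passes the backslash through to the re module
raw = re.search(r'\bpar', text)
print(raw.span())  # (5, 8)

# equivalent without raw string, but harder to read
escaped = re.search('\\bpar', text)
print(escaped.span())  # (5, 8)
```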
  • examples for re.findall
# match whole word par with optional s at start and e at end
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
['par', 'spar', 'spare', 'pare']

# numbers >= 100 with optional leading zeros
>>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']

# if multiple capturing groups are used, each element of output
# will be a tuple of strings of all the capture groups
>>> re.findall(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]

# with a capture group, findall returns only the captured portion
# use a non-capturing group to get the whole match
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
['cost', 'akin', 'east', 'against']

# useful for debugging purposes as well before applying substitution
>>> re.findall(r't.*?a', 'that is quite a fabricated tale')
['tha', 't is quite a', 'ted ta']
  • examples for re.split
# split based on one or more digit characters
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']

# split based on digit or whitespace characters
>>> re.split(r'[\d\s]+', '**1\f2\n3star\t7 77\r**')
['**', 'star', '**']

# to include the matching delimiter strings as well in the output
>>> re.split(r'(\d+)', 'Sample123string42with777numbers')
['Sample', '123', 'string', '42', 'with', '777', 'numbers']

# use non-capturing group if capturing is not needed
>>> re.split(r'hand(?:y|ful)', '123handed42handy777handful500')
['123handed42', '777', '500']
  • backreferencing
# whole words that have at least one consecutive repeated character
>>> words = ['effort', 'flee', 'facade', 'oddball', 'rat', 'tool']

>>> [w for w in words if re.search(r'\b\w*(\w)\1\w*\b', w)]
['effort', 'flee', 'oddball', 'tool']
  • The re.search function returns a re.Match object (or None if there is no match) from which various details can be extracted, such as the matched portion of the string and its location
  • Note that output here is shown for Python version 3.7
>>> re.search(r'b.*d', 'abc ac adc abbbc')
<re.Match object; span=(1, 9), match='bc ac ad'>
# retrieving entire matched portion
>>> re.search(r'b.*d', 'abc ac adc abbbc')[0]
'bc ac ad'

# capture group example
>>> m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc')
# to get matched portion of second capture group
>>> m[2]
'c a'
# to get a tuple of all the capture groups
>>> m.groups()
('bc ac a', 'c a')
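
Location details can be read from the same match object with span(), start() and end(). A minimal sketch reusing the capture group example; the None check guards against the no-match case:

```python
import re

m = re.search(r'a(.*)d(.*a)', 'abc ac adc abbbc')

if m:  # re.search returns None when nothing matches
    whole_span = m.span()    # location of the entire match
    group1_span = m.span(1)  # location of the first capture group
    print(whole_span, group1_span)  # (0, 12) (1, 8)
    print(m.start(2), m.end(2))     # 9 12

no_match = re.search(r'xyz', 'abc ac adc abbbc')
print(no_match is None)  # True
```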
  • examples for re.finditer
>>> m_iter = re.finditer(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
>>> [(m[1], m[2]) for m in m_iter]
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]

>>> m_iter = re.finditer(r'ab+c', 'abc ac adc abbbc')
>>> for m in m_iter:
...     print(m.span())
... 
(0, 3)
(11, 16)

Search and Replace

Syntax

re.sub(pattern, repl, string, count=0, flags=0)
  • examples
  • Note that as strings are immutable, re.sub will not change the value of the variable passed to it; the result has to be explicitly assigned
>>> ip_lines = "catapults\nconcatenate\ncat"
>>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat

# replace 'par' only at start of word
>>> re.sub(r'\bpar', r'X', 'par spar apparent spare part')
'X spar apparent spare Xt'

# same as: r'part|parrot|parent'
>>> re.sub(r'par(en|ro)?t', r'X', 'par part parrot parent')
'par X X X'

# remove first two columns where : is delimiter
>>> re.sub(r'\A([^:]+:){2}', r'', 'foo:123:bar:baz', count=1)
'bar:baz'
  • backreferencing
# remove any number of consecutive duplicate words separated by space
# quantifiers can be applied to backreferences too!
>>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')
'aa a 42 f_1 f_13.14'

# add something around the matched strings
>>> re.sub(r'\d+', r'(\g<0>0)', '52 apples and 31 mangoes')
'(520) apples and (310) mangoes'

# swap words that are separated by a comma
>>> re.sub(r'(\w+),(\w+)', r'\2,\1', 'a,b 42,24')
'b,a 24,42'
  • using functions in replace part of re.sub()
  • Note that Python version 3.7 is used here; the m[0] syntax for match objects needs version 3.6 or later, use m.group(0) on older versions
>>> from math import factorial
>>> numbers = '1 2 3 4 5'
# the function receives a re.Match object for each matched portion
>>> def fact_num(m):
...     return str(factorial(int(m[0])))
... 
>>> re.sub(r'\d+', fact_num, numbers)
'1 2 6 24 120'

# using lambda
>>> re.sub(r'\d+', lambda m: str(factorial(int(m[0]))), numbers)
'1 2 6 24 120'
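
A function in the replacement section also allows dictionary-based substitution, for example swapping words in a single pass. A minimal sketch; the dictionary and the input are made up for illustration:

```python
import re

swap = {'cat': 'dog', 'dog': 'cat'}
text = 'cat chases dog'

# look up each word; words not in the dictionary are kept as is
result = re.sub(r'\b\w+\b', lambda m: swap.get(m[0], m[0]), text)
print(result)  # dog chases cat
```

Chaining two str.replace() calls would not work here, since the second call would undo the first.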

Compiling Regular Expressions

  • Regular expressions can be compiled using re.compile function, which gives back a re.Pattern object
  • The top level re module functions are all available as methods for this object
  • Compiling a regular expression helps if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit)
  • Python caches recently compiled RE internally, so the speed benefit is negligible for trivial use cases
>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>
>>> bool(pet.search('They bought a dog'))
True
>>> bool(pet.search('A cat crossed their path'))
False

>>> remove_parentheses = re.compile(r'\([^)]*\)')
>>> remove_parentheses.sub('', 'a+b(addition) - foo() + c%d(#modulo)')
'a+b - foo + c%d'
>>> remove_parentheses.sub('', 'Hi there(greeting). Nice day(a(b)')
'Hi there. Nice day'
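
Flags can be given once at compile time and the resulting object reused across many inputs. A minimal sketch, with a made-up pattern and list of lines:

```python
import re

# words starting with 'par', ignoring case
word_pat = re.compile(r'\bpar\w*', flags=re.I)

lines = ['Par for the course', 'spare parts', 'no match here']
matches = [word_pat.findall(line) for line in lines]
print(matches)  # [['Par'], ['parts'], []]

# the original RE string is available as an attribute
print(word_pat.pattern)  # \bpar\w*
```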

Further Reading on Regular Expressions