Python Topics : Regex - Part 1
RegEx in Python
Python has a built-in module named re, which is used for regular expressions
# importing re module
import re

s = 'GeeksforGeeks: A computer science portal for geeks'

match = re.search(r'portal', s)

print('Start Index:', match.start())
print('End Index:', match.end())
the r prefix stands for raw
a raw string differs from a regular string in that it does not interpret the '\' character as an escape character
this matters because the regular expression engine uses the '\' character for its own escaping purposes
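
To make the difference concrete, here is a minimal sketch comparing a regular and a raw string literal. The variable names and the sample text are illustrative assumptions, not part of the original notes.

```python
import re

# In a regular string literal, '\b' is the backspace character (one character).
regular = '\bcat'
# In a raw string literal the backslash is kept, so the regex engine
# receives \b and treats it as a word boundary.
raw = r'\bcat'

print(len(regular), len(raw))

text = 'a cat sat'
no_match = re.search(regular, text)   # looks for backspace followed by 'cat'
word_match = re.search(raw, text)     # looks for 'cat' at a word boundary

print(no_match)
print(word_match.span())
```

Writing patterns as raw strings avoids this kind of silent mismatch.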
RegEx Functions
function description
re.findall() finds and returns all matching occurrences in a list
re.compile() compiles a regular expression into a pattern object
re.split() splits a string at the occurrences of a character or a pattern
re.sub() replaces all occurrences of a character or pattern with a replacement string
re.escape() escapes special characters
re.search() searches for the first occurrence of a character or pattern

re.findall()
returns all non-overlapping matches of pattern in string as a list of strings
the string is scanned left-to-right
matches are returned in the order found

code below uses a regular expression (\d+) to find all the sequences of one or more digits in the given string
searches for numeric values and stores them in a list

- example - re.findall()

import re
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
regex = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged

match = re.findall(regex, string)
print(match)
# output ['123456789', '987654321']

re.compile()
regular expressions are compiled into pattern objects
objects have methods for various operations such as searching for pattern matches or performing string substitutions

below, the regular expression pattern [a-e] is used to find and list all lowercase letters from 'a' to 'e' in the input string

- example - re.compile()

import re
p = re.compile('[a-e]')

print(p.findall("Aye, said Mr. Gibenson Stark"))
# output ['e', 'a', 'd', 'b', 'e', 'a']
first occurrence is 'e' in 'Aye' and not 'A', as it is case sensitive

- another example - re.compile()

below regular expressions are used to find and list all single digits and sequences of digits in the given input strings
finds single digits with \d and sequences of digits with \d+

import re
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
# output ['1', '1', '4', '1', '8', '8', '6']
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
# output ['11', '4', '1886']
- another example - re.compile()

below regular expressions are used to find and list

  • word characters
  • sequences of word characters
  • non-word characters
returns lists of the matched characters or sequences
import re

p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))
# output ['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']

p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he \
said *** in some_language."))
# output ['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']

p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))
# output [' ', ' ', '*', '*', '*', ' ', ' ', '.']
- another example - re.compile()

below, the regular expression pattern 'ab*' is used to find and list all occurrences of 'a' followed by zero or more 'b' characters in the input string

import re

p = re.compile('ab*')
print(p.findall("ababbaabbb"))
# output ['ab', 'abb', 'a', 'abbb']

re.split()
splits a string at each occurrence of a character or a pattern
the text between matches, including whatever remains after the last match, is returned as the elements of the resulting list
re.split(pattern, string, maxsplit=0, flags=0)
pattern denotes the regular expression
string is the string to be searched and split
maxsplit defaults to 0, which means there is no limit on the number of splits
if a nonzero value is provided, at most that many splits occur
if maxsplit = 1, the string is split only once, resulting in a list of length 2
flags are optional and can help to shorten the pattern
for example flags = re.IGNORECASE
this flag makes matching ignore the difference between lowercase and uppercase

- example - re.split()

from re import split

print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))
the first statement splits the string at runs of non-word characters (here commas and spaces)
the second statement shows that the apostrophe is also treated as a non-word character
the third statement splits at non-word characters only, so the digits stay inside the resulting pieces
the fourth statement splits using runs of digits as the delimiter
['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']
- another example - re.split()
import re

print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', maxsplit=1))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))
the first statement splits the string once, at the first run of one or more digits
the second statement splits the string at runs of the letters a to f, case-insensitively
the third statement splits the string at runs of the lowercase letters a to f only, so 'A' and 'B' are kept
['On ', 'th Jan 2016, at 11:02 AM']
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']

re.sub()
syntax
re.sub(pattern, repl, string, count=0, flags=0)
the 'pattern' arg is the regular expression to be searched for
the 'repl' arg is the string that replaces each match
the 'string' arg is the string to search
the 'count' arg is the maximum number of replacements to make; the default 0 replaces all occurrences

- example - re.sub()

import re
print(re.sub('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))
print(re.sub('ub', '~*', 'Subject has Uber booked already'))
print(re.sub('ub', '~*', 'Subject has Uber booked already', count=1, flags=re.IGNORECASE))
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE))
first statement replaces all occurrences of 'ub' with '~*' (case-insensitive)
second statement replaces all occurrences of 'ub' with '~*' (case-sensitive)
third statement replaces the first occurrence of 'ub' with '~*' (case-insensitive)
fourth statement replaces 'and' surrounded by whitespace with ' & ' (case-insensitive)
S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam

re.subn()
syntax
re.subn(pattern, repl, string, count=0, flags=0)
similar to sub()
replaces all occurrences of a pattern in a string
returns a tuple with the modified string and the count of substitutions made

- example - re.subn()

import re

print(re.subn('ub', '~*', 'Subject has Uber booked already'))

t = re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE)
print(t)
print(len(t))
print(t[0])
output
('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already

re.escape(string)
used to escape special characters in a string, making it safe to be used as a pattern in regular expressions
ensures any characters with special meanings in regular expressions are treated as literal characters

- example - re.escape()

import re
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))
output (Python 3.7 and later; earlier versions also escaped the comma as \,)
This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \    \ \^WoW
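
A typical use of re.escape() is building a pattern from literal text that happens to contain metacharacters. The sketch below assumes a made-up literal and sample text purely for illustration.

```python
import re

# Hypothetical literal text containing the metacharacter '.'
literal = '3.14'
text = 'pi is 3.14, not 3114'

# Unescaped, the dot matches any character, so '3114' is found as well.
unescaped = re.findall(literal, text)

# Escaped, the dot is matched literally.
escaped = re.findall(re.escape(literal), text)

print(unescaped)
print(escaped)
```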

re.search()
the method returns either
  • None (if the pattern doesn't match)
  • a re.Match object that contains information about the matching part of the string
code uses a regular expression to search for a pattern in the string
if a match is found, it extracts and prints the matched portions of the string

- example - re.search()

import re
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24")
if match is not None:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")
the example searches for a pattern that consists of a month (letters) followed by a space and a day (digits)
Match at index 14, 21
Full match: June 24
Month: June
Day: 24
Meta-characters
metacharacters are the characters with special meaning

metacharacter description
\ drops the special meaning of the character following it
[] represents a character class
^ matches the beginning of the string
$ matches the end of the string
. matches any character except newline
| means OR; matches any of the patterns separated by it
? matches zero or one occurrence
* matches any number of occurrences (including 0 occurrences)
+ matches one or more occurrences
{} indicates the number of occurrences of the preceding regex to match
() encloses a group of regex

\ – Backslash
the backslash (\) makes sure that the character is not treated in a special way
can be considered a way of escaping metacharacters
import re

s = 'geeks.forgeeks'

# without using \
match = re.search(r'.', s)
print(match)

# using \
match = re.search(r'\.', s)
print(match)
the first search matches any character not just the period
the second search specifically looks for and matches the period character
<re.Match object; span=(0, 1), match='g'>
<re.Match object; span=(5, 6), match='.'>

[] – Square Brackets
Square Brackets ([]) represent a character class consisting of a set of characters to match
example : the character class [abc] will match any single a, b, or c
can also specify a range of characters using - inside the square brackets
  • [0-3] is the same as [0123]
  • [a-c] is same as [abc]
can also invert the character class using the caret(^) symbol
  • [^0-3] means any character except 0, 1, 2, or 3
  • [^a-c] means any character except a, b, or c
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "[a-m]"
result = re.findall(pattern, string)

print(result)
use regular expressions to find all the characters in the string that fall within the range of 'a' to 'm'
returns a list of all such characters
['h', 'e', 'i', 'c', 'k', 'b', 'f', 'j', 'm', 'e', 'h', 'e', 'l', 'a', 'd', 'g']
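
The negated form of a character class can be tried out in the same way. This is a small sketch with an assumed sample string; it also checks that a range and the spelled-out set behave identically.

```python
import re

text = 'room 101, floor 3'

# [^0-9 ] matches any single character that is neither a digit nor a space.
non_digits = re.findall(r'[^0-9 ]', text)

# [0-3] and [0123] describe the same class.
low_digits_range = re.findall(r'[0-3]', text)
low_digits_set = re.findall(r'[0123]', text)

print(non_digits)
print(low_digits_range)
```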

^ – Caret
Caret (^) symbol matches the beginning of the string
  • ^g will check if the string starts with g
  • ^ge will check if the string starts with ge
import re
regex = r'^The'
strings = ['The quick brown fox', 'The lazy dog', 'A quick brown fox']
for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')
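a regular expression is used to check whether each string in a list starts with 'The'
if a string begins with 'The', it is marked as 'Matched'; otherwise it is labeled 'Not matched'
note that re.match() already anchors at the start of the string, so the ^ is redundant here but harmless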
Matched: The quick brown fox
Matched: The lazy dog
Not matched: A quick brown fox

$ – Dollar
Dollar($) symbol matches the end of the string
  • s$ will check for the string that ends with s
  • ks$ will check for the string that ends with ks
import re

string = "Hello World!"
pattern = r"World!$"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found.")
a regular expression is used to check if the string ends with 'World!'
if a match is found, it prints 'Match found!' otherwise it prints 'Match not found'
Match found!

. – Dot
the dot (.) symbol matches any single character except the newline character (\n)
  • a.b will match a string that contains any character in place of the dot, such as acb, acbd, abbb, etc.
  • .. will check if the string contains at least 2 characters
import re

string = "The quick brown fox jumps over the lazy dog."
pattern = r"brown.fox"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found.")
a regular expression is used to search for the pattern 'brown.fox'
the dot (.) in the pattern represents any character
if a match is found, it prints 'Match found!' otherwise, it prints 'Match not found'
Match found!

 | – Or
or symbol works as the or operator
it checks whether the pattern before or after the or symbol is present

a|b will match any string that contains a or b such as acd, bcd, abcd, etc.
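
As a quick illustration of alternation, the sketch below uses the assumed pattern cat|dog on a few made-up strings.

```python
import re

pattern = r'cat|dog'
texts = ['hotdog stand', 'concatenate', 'bird']

# True where either alternative occurs anywhere in the string.
or_results = [bool(re.search(pattern, t)) for t in texts]

# findall returns whichever alternative matched at each position.
both = re.findall(pattern, 'cat and dog')

print(or_results)
print(both)
```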


? – Question Mark
the question mark (?) is a quantifier in regular expressions
indicates that the preceding element should be matched zero or one time
allows specifying the element is optional, meaning it may occur once or not at all

ab?c will be matched for the string ac, acb, dabc
will not be matched for abbc because there are two b's
it will not be matched for abdc because b is not followed by c
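
The claims above can be checked directly. This sketch simply runs re.search() with ab?c over the strings mentioned in the text.

```python
import re

pattern = r'ab?c'
samples = ['ac', 'acb', 'dabc', 'abbc', 'abdc']

# Keep only the samples in which the pattern is found somewhere.
optional_matched = [s for s in samples if re.search(pattern, s)]

print(optional_matched)
```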


* – Star (Asterisk)
star symbol matches zero or more occurrences of the regex preceding the * symbol

ab*c will be matched for the string ac, abc, abbbc, dabc, etc.
will not be matched for abdc because b is not followed by c
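
To see which part of each string the star pattern actually consumes, this sketch records the matched substring for every sample named above (None when there is no match).

```python
import re

pattern = r'ab*c'
samples = ['ac', 'abc', 'abbbc', 'dabc', 'abdc']

# Map each sample to the substring matched by ab*c, or None.
star_results = {}
for s in samples:
    m = re.search(pattern, s)
    star_results[s] = m.group() if m else None

print(star_results)
```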


+ – Plus
plus (+) symbol matches one or more occurrences of the regex preceding the + symbol

ab+c will be matched for the string abc, abbc, dabc
will not be matched for ac, abdc
there is no b in ac
b is not followed by c in abdc
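
The difference between + and * shows up on the string ac. A short sketch using the strings from the text:

```python
import re

samples = ['abc', 'abbc', 'dabc', 'ac', 'abdc']

# ab+c needs at least one b between a and c.
plus_matched = [s for s in samples if re.search(r'ab+c', s)]

# On 'ac', the star version matches but the plus version does not.
star_on_ac = re.search(r'ab*c', 'ac') is not None
plus_on_ac = re.search(r'ab+c', 'ac') is not None

print(plus_matched)
print(star_on_ac, plus_on_ac)
```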


{m,n} – Braces
braces match between m and n repetitions (both inclusive) of the preceding regex
no space is allowed inside the braces; Python treats a{2, 4} as literal text rather than as a quantifier

a{2,4} will be matched for the strings aaab, baaaac, gaad
will not be matched for strings like abc, bc
there is only one a or no a in these cases
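
A small sketch of the brace quantifier, written without a space inside the braces. In Python's re module a space there stops the braces from acting as a quantifier, which the last two lines demonstrate.

```python
import re

samples = ['aaab', 'baaaac', 'gaad', 'abc', 'bc']

# Map each sample to the run of 2 to 4 a's that was found, or None.
brace_results = {}
for s in samples:
    m = re.search(r'a{2,4}', s)
    brace_results[s] = m.group() if m else None

print(brace_results)

# With a space inside the braces the pattern is plain literal text.
spaced_on_as = re.search(r'a{2, 4}', 'aaaa')
spaced_on_literal = re.search(r'a{2, 4}', 'text a{2, 4} here')
print(spaced_on_as, spaced_on_literal is not None)
```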


 () – Group
group symbol is used to group sub-patterns

(a|b)cd will match for strings like acd, abcd, gacd, etc.
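
Besides grouping, parentheses also capture what they matched. The sketch below reads the captured part back with group(1).

```python
import re

pattern = r'(a|b)cd'

m1 = re.search(pattern, 'gacd')
m2 = re.search(pattern, 'abcd')

# group(0) is the whole match, group(1) is what the parentheses captured.
first = (m1.group(0), m1.group(1))
second = (m2.group(0), m2.group(1))

print(first)
print(second)
```

In 'abcd' the match starts at the b, because an a must be followed directly by cd to match.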

Special Sequences
special sequences do not match one fixed literal character in the string
instead they stand for a class of characters (such as digits) or tell the specific location in the search string where the match must occur
they make it easier to write commonly used patterns

special sequence description (example pattern: example strings)
\A matches only at the beginning of the string (\Afor: for geeks, for the world)
\b matches at a word boundary: \b(string) checks for the beginning of a word, (string)\b checks for the ending of a word (\bge: geeks, get)
\B the opposite of \b: matches only where there is no word boundary (\Bge: together, forge)
\d matches any decimal digit; equivalent to the set class [0-9] when the re.ASCII flag is used (\d: 123, gee1)
\D matches any non-digit character; equivalent to the set class [^0-9] under the same condition (\D: geek, general)
\s matches any whitespace character (\s: is rain, dog and cat)
\S matches any non-whitespace character (\S: ab c, abcd)
\w matches any alphanumeric character or underscore; equivalent to the class [a-zA-Z0-9_] when the re.ASCII flag is used (\w: 123, hello world)
\W matches any character that is not alphanumeric or underscore (\W: >$, <some tag>)
\Z matches only at the end of the string (ab\Z: abcdab, abababab)
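
The position-based sequences are easiest to understand by looking at where they match. This sketch uses an assumed sample sentence assembled from the example words in the table.

```python
import re

text = 'for geeks together forge'

# \bge: 'ge' only where a word starts.
starts_b = [m.start() for m in re.finditer(r'\bge', text)]

# \Bge: 'ge' only where no word boundary precedes it.
starts_B = [m.start() for m in re.finditer(r'\Bge', text)]

# \A and \Z anchor to the very beginning and end of the string.
begins_with_for = re.search(r'\Afor', text) is not None
ends_with_ge = re.search(r'ge\Z', text) is not None

print(starts_b, starts_B)
print(begins_with_for, ends_with_ge)
```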
Sets for Character Matching
set description
{n,} quantifies the preceding character or group and matches at least n occurrences
* quantifies the preceding character or group and matches zero or more occurrences
[0123] matches the specified digits (0, 1, 2, or 3)
[^arn] matches any character EXCEPT a, r, and n
\d matches any digit (0-9)
[0-5][0-9] matches any two-digit number from 00 to 59
\w matches any alphanumeric character (a-z, A-Z, 0-9, or _)
[a-n] matches any lowercase letter between a and n
\D matches any non-digit character
[arn] matches where one of the specified characters (a, r, or n) is present
[a-zA-Z] matches any letter between a and z, lowercase OR uppercase
[0-9] matches any digit between 0 and 9
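
A few of these sets in action. The candidate strings are made up for illustration, and fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

# Two-digit values from 00 to 59, e.g. the minutes of a clock time.
minute = re.compile(r'[0-5][0-9]')
candidates = ['00', '59', '60', '7']
valid_minutes = [c for c in candidates if minute.fullmatch(c)]

# A set and its negation split the characters of a word between them.
in_set = re.findall(r'[arn]', 'rain')
not_in_set = re.findall(r'[^arn]', 'rain')

print(valid_minutes)
print(in_set, not_in_set)
```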
Match Object

a Match object contains all the information about the search and the result
if there is no match found then None will be returned
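
Because a failed search returns None, calling a method such as group() on the result without checking first raises an AttributeError. A minimal sketch of the usual guard follows; first_number is a hypothetical helper introduced only for this example.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r'\d+', text)
    if match is None:
        # No match: there is no Match object to query.
        return None
    return match.group()

found = first_number('order 66 shipped')
missing = first_number('no digits here')

print(found)
print(missing)
```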

Getting the String and the Regex
the match.re attribute returns the compiled regular expression object that produced the match, and the match.string attribute returns the string passed
import re
s = "Welcome to GeeksForGeeks"
res = re.search(r"\bG", s)

print(res.re)
print(res.string)
the code searches for the letter 'G' at a word boundary in the string 'Welcome to GeeksForGeeks'
prints the regular expression pattern (res.re) and the original string (res.string)
re.compile('\\bG')
Welcome to GeeksForGeeks
Getting the Index of the Matched Object
start() method returns the starting index of the matched substring
end() method returns the index just past the end of the matched substring
span() method returns a tuple containing the starting and the ending index of the matched substring
import re

s = "Welcome to GeeksForGeeks"

res = re.search(r"\bGee", s)

print(res.start())
print(res.end())
print(res.span())
the code searches for the substring 'Gee' at a word boundary in the string 'Welcome to GeeksForGeeks'
prints the start index of the match (res.start())
the end index of the match (res.end())
and the span of the match (res.span()).
11
14
(11, 14)
Getting the Matched Substring
group() method returns the part of the string for which the patterns match
import re
s = "Welcome to GeeksForGeeks"
res = re.search(r"\D{2} t", s)
print(res.group())
the pattern matches exactly two non-digit characters followed by a space, and that space is followed by a t
me t