RegEx in Python
Python has a built-in module named re which is used for regular expressions.

    # importing re module
    import re

    s = 'GeeksforGeeks: A computer science portal for geeks'
    match = re.search(r'portal', s)
    print('Start Index:', match.start())
    print('End Index:', match.end())

The r prefix stands for raw. A raw string differs from a regular string in that it does not interpret the '\' character as an escape character. This matters because the regular expression engine uses the '\' character for its own escaping purposes.
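To make the raw-string point concrete, here is a minimal sketch that compares a regular and a raw string literal for the same pattern text. The variable names regular and raw are illustrative and not part of the original notes.

```python
import re

s = 'GeeksforGeeks: A computer science portal for geeks'

# In a regular string literal, "\b" is the backspace character (one character).
regular = "\bportal"
# In a raw string literal, r"\b" stays as backslash + 'b', which the regex
# engine reads as a word boundary.
raw = r"\bportal"

print(len(regular), len(raw))     # 7 8
print(re.search(regular, s))      # None: the text contains no backspace
print(re.search(raw, s).span())   # (34, 40)
```

Writing patterns as raw strings by default avoids this kind of silent mismatch.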
RegEx Functions
re.findall()
Returns all non-overlapping matches of the pattern in the string as a list of strings. The string is scanned left to right, and matches are returned in the order found.

Example: re.findall()
The code below uses the regular expression \d+ to find all sequences of one or more digits in the given string. It searches for numeric values and stores them in a list.

    import re

    string = """Hello my Number is 123456789 and my friend's number is 987654321"""
    regex = r'\d+'
    match = re.findall(regex, string)
    print(match)
    # output ['123456789', '987654321']

re.compile()
Regular expressions are compiled into pattern objects. These objects have methods for various operations, such as searching for pattern matches or performing string substitutions.

Example: re.compile()
Below, the pattern [a-e] is used to find and list all lowercase letters from 'a' to 'e' in the input string.

    import re

    p = re.compile('[a-e]')
    print(p.findall("Aye, said Mr. Gibenson Stark"))
    # output ['e', 'a', 'd', 'b', 'e', 'a']

The first match is 'e' in 'Aye' and not 'A', because matching is case-sensitive.

Another example: re.compile()
Below, regular expressions are used to find and list all single digits and all sequences of digits in the given input string: \d finds single digits and \d+ finds sequences of digits.

    import re

    p = re.compile(r'\d')
    print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
    # output ['1', '1', '4', '1', '8', '8', '6']

    p = re.compile(r'\d+')
    print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
    # output ['11', '4', '1886']

Another example: re.compile()
Below, regular expressions are used to find and list single word characters (\w), sequences of word characters (\w+), and non-word characters (\W).

    import re

    p = re.compile(r'\w')
    print(p.findall("He said * in some_lang."))
    # output ['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']

    p = re.compile(r'\w+')
    print(p.findall("I went to him at 11 A.M., he said *** in some_language."))
    # output ['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']

    p = re.compile(r'\W')
    print(p.findall("he said *** in some_language."))
    # output [' ', ' ', '*', '*', '*', ' ', ' ', '.']

Another example: re.compile()
Below, the pattern 'ab*' is used to find and list all occurrences of 'a' followed by zero or more 'b' characters in the input string.

    import re

    p = re.compile('ab*')
    print(p.findall("ababbaabbb"))
    # output ['ab', 'abb', 'a', 'abbb']

re.split()
Splits the string at each occurrence of a character or pattern. The pieces of the string between the matches are returned as the resulting list.

Syntax: re.split(pattern, string, maxsplit=0, flags=0)

pattern is the regular expression. string is the string to be searched and in which the splitting occurs. maxsplit defaults to 0, which means no limit; if a nonzero value is provided, at most that many splits occur. With maxsplit=1 the string is split only once, resulting in a list of length 2. flags are optional; they are useful and can help shorten code. For example, flags=re.IGNORECASE makes matching ignore the difference between lowercase and uppercase.

Example: re.split()

    from re import split

    print(split(r'\W+', 'Words, words , Words'))
    print(split(r'\W+', "Word's words Words"))
    print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))
    print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

The first statement splits the string on runs of non-word characters (commas and spaces here). The second statement shows that an apostrophe is also a non-word character. The third statement again splits on non-word characters; digits are word characters, so they stay inside the pieces. The fourth statement splits using runs of digits as the delimiter.

Output:

    ['Words', 'words', 'Words']
    ['Word', 's', 'words', 'Words']
    ['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
    ['On ', 'th Jan ', ', at ', ':', ' AM']

Another example: re.split()

    import re

    print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', maxsplit=1))
    print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
    print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))

The first statement splits the string only at the first occurrence of one or more digits. The second statement splits the string using the letters a to f as delimiters, case-insensitively. The third statement does the same, case-sensitively, so only lowercase a to f act as delimiters.

Output:

    ['On ', 'th Jan 2016, at 11:02 AM']
    ['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
    ['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']

re.sub()
Syntax: re.sub(pattern, repl, string, count=0, flags=0)

The pattern argument is the regular expression (or plain substring) to search for. The repl argument is the string that replaces each match. The string argument is the string to search. The count argument is the maximum number of replacements to make; the default 0 replaces all occurrences.

Example: re.sub()

    import re

    print(re.sub('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))
    print(re.sub('ub', '~*', 'Subject has Uber booked already'))
    print(re.sub('ub', '~*', 'Subject has Uber booked already', count=1, flags=re.IGNORECASE))
    print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE))

The first statement replaces all occurrences of 'ub' with '~*' (case-insensitive). The second statement replaces all occurrences of 'ub' with '~*' (case-sensitive). The third statement replaces only the first occurrence of 'ub' with '~*' (case-insensitive). The fourth statement replaces 'AND' surrounded by whitespace with ' & ' (case-insensitive).

Output:

    S~*ject has ~*er booked already
    S~*ject has Uber booked already
    S~*ject has Uber booked already
    Baked Beans & Spam

re.subn()
Syntax: re.subn(pattern, repl, string, count=0, flags=0)

Similar to sub(): it replaces all occurrences of a pattern in a string, but it returns a tuple containing the modified string and the count of substitutions made.

Example: re.subn()

    import re

    print(re.subn('ub', '~*', 'Subject has Uber booked already'))
    t = re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE)
    print(t)
    print(len(t))
    print(t[0])

Output:

    ('S~*ject has Uber booked already', 1)
    ('S~*ject has ~*er booked already', 2)
    2
    S~*ject has ~*er booked already

re.escape(string)
Used to escape special characters in a string, making it safe to use as a pattern in regular expressions. It ensures that any characters with special meanings in regular expressions are treated as literal characters.

Example: re.escape()

    import re

    print(re.escape("This is Awesome even 1 AM"))
    print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

Output (Python 3.7 and later, where the comma is no longer escaped; <tab> stands for the literal tab character produced by \t in the input):

    This\ is\ Awesome\ even\ 1\ AM
    I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \<tab>\ \^WoW

re.search()
This method returns either None (if the pattern does not match) or a Match object containing information about the first matching part of the string. In the example below, if a match is found, the code extracts and prints the matched portions of the string.

Example: re.search()

    import re

    regex = r"([a-zA-Z]+) (\d+)"
    match = re.search(regex, "I was born on June 24")
    if match is not None:
        print("Match at index %s, %s" % (match.start(), match.end()))
        print("Full match: %s" % (match.group(0)))
        print("Month: %s" % (match.group(1)))
        print("Day: %s" % (match.group(2)))
    else:
        print("The regex pattern does not match.")

The example searches for a pattern that consists of a month (letters) followed by a space and a day (digits).

Output:

    Match at index 14, 21
    Full match: June 24
    Month: June
    Day: 24
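Because re.search() can return None, calling code usually has to check the result before using it. The following is a minimal sketch of one way to wrap that check; the helper name extract_month_day, the constant DATE_PATTERN, and the named groups are hypothetical choices made for illustration, not something the original notes define.

```python
import re

# Compile once and reuse; named groups are an illustrative choice.
DATE_PATTERN = re.compile(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)")

def extract_month_day(text):
    """Return (month, day) for the first 'Month DD' found, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        # No match: propagate None instead of raising AttributeError later.
        return None
    return match.group("month"), int(match.group("day"))

print(extract_month_day("I was born on June 24"))  # ('June', 24)
print(extract_month_day("No date here"))           # None
```

Checking with "is None" keeps the failure case explicit, which mirrors the if/else in the re.search() example.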
Meta-characters
Metacharacters are characters that have a special meaning in a regular expression.
\ – Backslash
The backslash (\) makes sure that the character following it is not treated in a special way. It can be considered a way of escaping metacharacters.

    import re

    s = 'geeks.forgeeks'

    # without using \
    match = re.search(r'.', s)
    print(match)

    # using \
    match = re.search(r'\.', s)
    print(match)

The first search matches any character, not just the period, so it returns the first character of the string. The second search specifically looks for and matches the period character.

Output:

    <re.Match object; span=(0, 1), match='g'>
    <re.Match object; span=(5, 6), match='.'>

[] – Square Brackets
Square brackets ([]) represent a character class: a set of characters, any one of which may match. For example, the character class [abc] matches any single a, b, or c. A range of characters can also be specified using - inside the square brackets.

    import re

    string = "The quick brown fox jumps over the lazy dog"
    pattern = "[a-m]"
    result = re.findall(pattern, string)
    print(result)

The code uses a regular expression to find all the characters in the string that fall within the range 'a' to 'm' and returns a list of all such characters.

Output:

    ['h', 'e', 'i', 'c', 'k', 'b', 'f', 'j', 'm', 'e', 'h', 'e', 'l', 'a', 'd', 'g']

^ – Caret
The caret (^) symbol matches the beginning of the string.

    import re

    regex = r'^The'
    strings = ['The quick brown fox', 'The lazy dog', 'A quick brown fox']
    for string in strings:
        if re.match(regex, string):
            print(f'Matched: {string}')
        else:
            print(f'Not matched: {string}')

The regular expression is used to check whether each string in the list starts with 'The'. If a string begins with 'The', it is marked as 'Matched'; otherwise it is labeled 'Not matched'.

Output:

    Matched: The quick brown fox
    Matched: The lazy dog
    Not matched: A quick brown fox

$ – Dollar
The dollar ($) symbol matches the end of the string.

    import re

    string = "Hello World!"
    pattern = r"World!$"
    match = re.search(pattern, string)
    if match:
        print("Match found!")
    else:
        print("Match not found.")

The regular expression is used to check whether the string ends with 'World!'. If a match is found, the code prints 'Match found!'; otherwise it prints 'Match not found.'

Output:

    Match found!

. – Dot
The dot (.) symbol matches exactly one character, which can be any character except the newline character (\n).

    import re

    string = "The quick brown fox jumps over the lazy dog."
    pattern = r"brown.fox"
    match = re.search(pattern, string)
    if match:
        print("Match found!")
    else:
        print("Match not found.")

The regular expression searches for the pattern 'brown.fox'. The dot (.) in the pattern stands for any single character, here the space. If a match is found, the code prints 'Match found!'; otherwise it prints 'Match not found.'

Output:

    Match found!

| – Or
The or symbol works as the or operator: it checks whether the pattern before or after the symbol is present. a|b will match any string that contains a or b, such as acd, bcd, abcd, etc.

? – Question Mark
The question mark (?) is a quantifier. It indicates that the preceding element should be matched zero or one time, which makes the element optional: it may occur once or not at all. ab?c will be matched for the strings ac, acb, dabc. It will not be matched for abbc, because there are two b's, and it will not be matched for abdc, because b is not followed by c.

* – Star (Asterisk)
The star symbol matches zero or more occurrences of the regex preceding the * symbol. ab*c will be matched for the strings ac, abc, abbbc, dabc, etc. It will not be matched for abdc, because b is not followed by c.

+ – Plus
The plus (+) symbol matches one or more occurrences of the regex preceding the + symbol. ab+c will be matched for the strings abc, abbc, dabc. It will not be matched for ac or abdc: there is no b in ac, and b is not followed by c in abdc.

{m,n} – Braces
Braces match from m to n repetitions (both inclusive) of the preceding regex. a{2,4} will be matched for the strings aaab, baaaac, gaad. It will not be matched for strings like abc or bc, because there is only one a or no a in those cases. Note that the pattern must be written without a space after the comma; Python treats a{2, 4} as literal text.

() – Group
The group symbol is used to group sub-patterns. (a|b)cd will match for strings like acd, abcd, gacd, etc.
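The quantifier and grouping claims above can be checked mechanically. The sketch below runs re.search() over the sample strings from the descriptions; the cases table, the results dictionary, and the extra non-matching string "cd" for the group pattern are illustrative additions, not part of the original notes.

```python
import re

# pattern -> (strings expected to match, strings expected not to match)
cases = {
    r"ab?c":    (["ac", "acb", "dabc"], ["abbc", "abdc"]),
    r"ab*c":    (["ac", "abc", "abbbc", "dabc"], ["abdc"]),
    r"ab+c":    (["abc", "abbc", "dabc"], ["ac", "abdc"]),
    r"a{2,4}":  (["aaab", "baaaac", "gaad"], ["abc", "bc"]),
    r"(a|b)cd": (["acd", "abcd", "gacd"], ["cd"]),
}

# results[(pattern, text)] is True when the pattern is found anywhere in text.
results = {}
for pattern, (should_match, should_not_match) in cases.items():
    for text in should_match + should_not_match:
        results[(pattern, text)] = re.search(pattern, text) is not None

for (pattern, text), found in results.items():
    print(f"{pattern!r:12} {text!r:10} {'match' if found else 'no match'}")
```

re.search() is used rather than re.match() because the descriptions talk about a pattern occurring anywhere in the string (for example dabc), not only at its start.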
Special Sequences
Special sequences do not match an actual character in the string; instead, they tell the engine the specific location in the search string where the match must occur. They also make it easier to write commonly used patterns.
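As a rough illustration, the sketch below exercises a few special sequences from Python's re module: \A (start of string), \Z (end of string), \b (word boundary), \d (digit) and \s (whitespace). The choice of sequences and the sample text are assumptions made here, since the list that originally accompanied this section is not available.

```python
import re

text = "Room 42 is open"

# \A anchors the match at the very start of the string.
at_start = re.search(r"\ARoom", text)
not_at_start = re.search(r"\Aopen", text)

# \b requires a word boundary, so the o's inside "Room" do not count.
o_words = re.findall(r"\bo\w*", text)

# \d is shorthand for a digit, \s for a whitespace character.
numbers = re.findall(r"\d+", text)
words = re.split(r"\s", text)

# \Z anchors the match at the very end of the string.
at_end = re.search(r"open\Z", text)

print(at_start.span(), not_at_start)   # (0, 4) None
print(o_words, numbers)                # ['open'] ['42']
print(words)                           # ['Room', '42', 'is', 'open']
print(at_end.span())                   # (11, 15)
```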
Sets for Character Matching
Match Object
A Match object contains all the information about the search and its result. If no match is found, None is returned instead.

Getting the String and the Regex
The match.re attribute returns the regular expression passed, and the match.string attribute returns the string passed.

    import re

    s = "Welcome to GeeksForGeeks"
    res = re.search(r"\bG", s)
    print(res.re)
    print(res.string)

The code searches for the letter 'G' at a word boundary in the string 'Welcome to GeeksForGeeks', then prints the regular expression pattern (res.re) and the original string (res.string).

Output:

    re.compile('\\bG')
    Welcome to GeeksForGeeks

Getting the Index of the Matched Object
The start() method returns the starting index of the matched substring. The end() method returns the ending index of the matched substring. The span() method returns a tuple containing the starting and the ending index of the matched substring.

    import re

    s = "Welcome to GeeksForGeeks"
    res = re.search(r"\bGee", s)
    print(res.start())
    print(res.end())
    print(res.span())

The code searches for the substring 'Gee' at a word boundary in the string 'Welcome to GeeksForGeeks'. It prints the start index of the match (res.start()), the end index of the match (res.end()), and the span of the match (res.span()).

Output:

    11
    14
    (11, 14)

Getting the Matched Substring
The group() method returns the part of the string that the pattern matched.

    import re

    s = "Welcome to GeeksForGeeks"
    res = re.search(r"\D{2} t", s)
    print(res.group())

The pattern matches exactly two non-digit characters followed by a space, and that space is followed by a t.

Output:

    me t
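When the pattern contains capturing groups, group() and span() also accept a group number, and groups() returns all captured groups at once. The sketch below shows this on the same sample string; the pattern with two capturing groups is an illustrative choice, not taken from the original notes.

```python
import re

s = "Welcome to GeeksForGeeks"

# Two capturing groups: the word before " to " and the word after it.
res = re.search(r"(\w+) to (\w+)", s)

print(res.group())    # whole match: 'Welcome to GeeksForGeeks'
print(res.group(1))   # first group: 'Welcome'
print(res.group(2))   # second group: 'GeeksForGeeks'
print(res.groups())   # ('Welcome', 'GeeksForGeeks')
print(res.span(2))    # (11, 24): where the second group sits in s

# With no match, search() returns None, so there is no group() to call.
print(re.search(r"\d", s))   # None
```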