Regular expressions and FlashText
Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. They let us define search patterns for strings, enabling operations such as validation, searching, and replacing. Python provides the re module for this purpose.
The examples below assume that the import re statement appears at the beginning of each script.
search() - finding a match in a string
The search() function locates the first occurrence of a pattern in a string and returns a match object. If no match is found, it returns None.
pattern = "Python"
string = "I am learning Python programming"
match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("No match")
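The match object returned by search() carries more than a yes/no answer. As a minimal sketch (the variable names here are illustrative, not from the original example), the matched text and its position can be read like this:

```python
import re

text = "I am learning Python programming"
match = re.search("Python", text)

if match:
    matched_text = match.group()  # the substring that matched
    start, end = match.span()     # start index (inclusive), end index (exclusive)
    print(matched_text, start, end)  # Python 14 20
```

Slicing the original string with text[start:end] gives back the same matched substring.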
match() - matching from the beginning of a string
The match() function checks whether the pattern matches at the beginning of the string.
pattern = "I am"
string = "I am learning Python programming"
match = re.match(pattern, string)
if match:
    print("Match found at the beginning")
else:
    print("No match at the start")
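The difference between match() and search() is easiest to see on the same input. The sketch below also uses re.fullmatch(), which is not covered in the text above; it is part of the standard re module and succeeds only when the pattern covers the entire string.

```python
import re

text = "I am learning Python programming"

at_start = re.match("Python", text)    # None: "Python" is not at index 0
anywhere = re.search("Python", text)   # match object: found later in the string
whole = re.fullmatch("I am.*", text)   # match object: pattern spans the whole string

print(at_start is None, anywhere is not None, whole is not None)  # True True True
```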
findall() - retrieving all matches
The findall() function returns all non-overlapping occurrences of a pattern in a string as a list.
pattern = r"\d+"
string = "There are 3 apples and 5 oranges."
matches = re.findall(pattern, string)
print(matches) # ["3", "5"]
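When the pattern contains groups, findall() returns tuples instead of plain strings. The sketch below illustrates that, and also shows re.finditer(), a related standard function not discussed above, which yields match objects so that positions are available as well.

```python
import re

text = "There are 3 apples and 5 oranges."

# With two groups, findall() returns one tuple per match.
pairs = re.findall(r"(\d+) (\w+)", text)
print(pairs)  # [('3', 'apples'), ('5', 'oranges')]

# finditer() yields match objects, which also expose the match position.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('3', 10), ('5', 23)]
```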
sub() - replacing patterns
The sub() function replaces all occurrences of a pattern with a specified string.
pattern = r"\d"
string = "Contact me at 123-456-7890."
result = re.sub(pattern, "X", string) # replacing digits with "X"
print(result) # Contact me at XXX-XXX-XXXX.
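Two common variations on sub() are limiting the number of replacements and reusing captured text in the replacement. A short sketch, with illustrative variable names:

```python
import re

text = "Contact me at 123-456-7890."

# count=3 limits the substitution to the first three matches.
partly_masked = re.sub(r"\d", "X", text, count=3)
print(partly_masked)  # Contact me at XXX-456-7890.

# \1, \2 and \3 in the replacement refer to the captured groups.
reformatted = re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"(\1) \2-\3", text)
print(reformatted)  # Contact me at (123) 456-7890.
```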
compile() - creating reusable regex patterns
The compile() function compiles a regex pattern into a regex object, allowing us to reuse the same pattern multiple times.
pattern = re.compile(r"\d{3}-\d{2}-\d{4}") # pattern for a Social Security Number (SSN)
# Using the compiled pattern
string1 = "My SSN is 123-45-6789."
string2 = "Your SSN is 987-65-4321."
match1 = pattern.search(string1)
match2 = pattern.search(string2)
if match1:
    print("Match found in string1:", match1.group()) # 123-45-6789
if match2:
    print("Match found in string2:", match2.group()) # 987-65-4321
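A compiled pattern is most convenient when the same search runs over many strings. The sketch below assumes a small list of input lines (made up for illustration) and collects every match with one compiled object:

```python
import re

ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")  # compiled once, reused below

# Hypothetical input lines for illustration.
lines = [
    "My SSN is 123-45-6789.",
    "No number on this line.",
    "Your SSN is 987-65-4321.",
]

found = []
for line in lines:
    match = ssn_pattern.search(line)
    if match:
        found.append(match.group())

print(found)  # ['123-45-6789', '987-65-4321']
```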
group() - extracting specific parts of a match
The group() method of the match object retrieves specific matched groups in the pattern. Groups are defined using parentheses in the regex.
pattern = r"(\d{3})-(\d{2})-(\d{4})" # grouping digits
string = "My SSN is 123-45-6789."
match = re.search(pattern, string)
if match:
    print("Entire match:", match.group()) # 123-45-6789
    print("Group 1:", match.group(1)) # 123
    print("Group 2:", match.group(2)) # 45
    print("Group 3:", match.group(3)) # 6789
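Groups can also be given names with the (?P<name>...) syntax, which often reads better than numeric indexes. In this sketch the names area, group and serial are chosen only for illustration:

```python
import re

# Group names below are illustrative labels, not part of the original example.
pattern = r"(?P<area>\d{3})-(?P<group>\d{2})-(?P<serial>\d{4})"
match = re.search(pattern, "My SSN is 123-45-6789.")

if match:
    print(match.groups())       # ('123', '45', '6789')
    print(match.group("area"))  # 123
    print(match.groupdict())    # {'area': '123', 'group': '45', 'serial': '6789'}
```

The groups() method returns all captured groups as a tuple, and groupdict() returns the named ones as a dictionary.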
Special sequences
Regex supports special sequences that simplify pattern creation:
\d | Matches any digit (equivalent to [0-9] for ASCII text).
\D | Matches any non-digit character.
\w | Matches any word character (alphanumeric + underscore).
\W | Matches any non-word character.
\s | Matches any whitespace character.
\S | Matches any non-whitespace character.
pattern = r"\d+"
string = "There are 12 cats and 34 dogs."
matches = re.findall(pattern, string)
print(matches) # ["12", "34"]
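The other sequences from the table work the same way. A brief sketch on a made-up input string (note the double space in it):

```python
import re

text = "user_1 paid  42 USD"  # illustrative input with a double space

words = re.findall(r"\w+", text)        # runs of word characters
tokens = re.findall(r"\S+", text)       # runs of non-whitespace characters
digits_only = re.sub(r"\D", "", text)   # drop every non-digit character
normalized = re.sub(r"\s+", " ", text)  # collapse each whitespace run to one space

print(words)        # ['user_1', 'paid', '42', 'USD']
print(tokens)       # ['user_1', 'paid', '42', 'USD']
print(digits_only)  # 142
print(normalized)   # user_1 paid 42 USD
```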
Special characters
Regex supports special characters for advanced pattern matching:
. | Matches any character except a newline.
* | Matches zero or more occurrences of the preceding element.
+ | Matches one or more occurrences of the preceding element.
? | Matches zero or one occurrence of the preceding element.
^ | Matches the start of a string.
$ | Matches the end of a string.
[] | Matches any single character listed within the brackets.
\ | Escapes a special character or introduces a special sequence.
# Matching any character
pattern = r"a.b"
string = "acb aab adb"
matches = re.findall(pattern, string)
print(matches) # ["acb", "aab", "adb"]
# Matching zero or more occurrences
pattern = r"ab*"
string = "abb ab abbb a"
matches = re.findall(pattern, string)
print(matches) # ["abb", "ab", "abbb", "a"]
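The remaining characters from the table can be tried in the same way. The following sketch uses short made-up strings, one per special character:

```python
import re

one_or_more = re.findall(r"ab+", "abb ab abbb a")   # a lone "a" no longer matches
optional = re.findall(r"colou?r", "color colour")   # the "u" may be absent
starts = re.search(r"^There", "There are 12 cats")  # anchored at the start
ends = re.search(r"dogs\.$", "34 dogs.")            # escaped dot, anchored at the end
vowels = re.findall(r"[aeiou]", "regex")            # any one character from the set

print(one_or_more)  # ['abb', 'ab', 'abbb']
print(optional)     # ['color', 'colour']
print(starts is not None, ends is not None)  # True True
print(vowels)       # ['e', 'e']
```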
Regex options (flags)
The re module provides several options to modify regex behavior:
re.IGNORECASE | Makes the regex case-insensitive.
re.DOTALL | Allows the . character to match newline characters as well.
re.VERBOSE | Allows whitespace and comments inside regex patterns for better readability.
# Using re.IGNORECASE
pattern = re.compile(r"python", re.IGNORECASE)
match = pattern.search("I love PYTHON programming.")
if match:
    print("Case-insensitive match found!")
# Using re.DOTALL
pattern = re.compile(r"Hello.*World", re.DOTALL)
string = "Hello\nWorld"
match = pattern.search(string)
if match:
    print("Match found with DOTALL!")
# Using re.VERBOSE
pattern = re.compile(r"""
    \d{3}  # area code
    -      # separator
    \d{2}  # middle part
    -      # separator
    \d{4}  # last part
""", re.VERBOSE)
match = pattern.search("123-45-6789")
if match:
    print("Verbose match found!")
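Flags can be combined with the | operator. The sketch below applies all three flags from the table to one pattern; the input string is made up for illustration:

```python
import re

pattern = re.compile(r"""
    hello  # greeting, in any letter case
    .*     # anything in between, including newlines
    world  # closing word
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)

match = pattern.search("HELLO\nbig\nWorld")
print(match is not None)  # True
```

Without re.DOTALL the .* part would stop at the first newline and the search would fail; without re.IGNORECASE the uppercase letters would not match.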
The FlashText module
The flashtext module is used for fast keyword searching and replacing in text. It can be much faster than regular expressions when the text is large or the list of keywords is long, because it looks for all keywords in a single pass over the text. It matches whole keywords only, not arbitrary patterns. The module has to be installed first (pip install flashtext).
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword("Python")
keyword_processor.add_keyword("programming")
text = "Python is a popular programming language."
found_keywords = keyword_processor.extract_keywords(text)
print("Found keywords:", found_keywords) # ["Python", "programming"]