Regular expressions and FlashText

Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. They allow us to define search patterns for strings, enabling complex string operations such as validation, searching, and replacing. Python provides the re module for this purpose.

Every example below assumes that `import re` appears at the top of the file.

`search()` - finding a match in a string

The search() function locates the first occurrence of a pattern in a string and returns a match object. If no match is found, it returns None.


pattern = "Python"
string = "I am learning Python programming"
match = re.search(pattern, string)

if match:  
    print("Match found!")  
else:  
    print("No match")
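Beyond a simple found/not-found check, the match object also records what was matched and where. The following is a minimal sketch using the standard `group()`, `start()`, `end()`, and `span()` methods on the same sentence:

```python
import re

string = "I am learning Python programming"
match = re.search("Python", string)

if match:
    print(match.group())               # the matched text: Python
    print(match.start(), match.end())  # start and end index: 14 20
    print(match.span())                # both as a tuple: (14, 20)
```

The end index is exclusive, so `string[match.start():match.end()]` yields the matched text again.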

`match()` - matching from the beginning of a string

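The match() function returns a match object only if the pattern matches at the beginning of the string. Unlike search(), it does not scan the rest of the string for a later occurrence.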


pattern = "I am"
string = "I am learning Python programming"
match = re.match(pattern, string)

if match:  
    print("Match found at the beginning")  
else:  
    print("No match at the start")
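The difference between the two functions is easiest to see side by side. This short sketch looks for a word that occurs in the middle of the sentence and also shows `re.fullmatch()`, which succeeds only when the pattern covers the entire string:

```python
import re

string = "I am learning Python programming"

at_start = re.match("Python", string)   # None: "Python" is not at index 0
anywhere = re.search("Python", string)  # match object: found at index 14
whole = re.fullmatch("I am", string)    # None: the pattern does not cover the whole string

print(at_start)              # None
print(anywhere is not None)  # True
print(whole)                 # None
```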

`findall()` - retrieving all matches

The findall() function returns all non-overlapping occurrences of a pattern in a string as a list of strings. If nothing matches, the list is empty.


pattern = r"\d+"
string = "There are 3 apples and 5 oranges."
matches = re.findall(pattern, string)
print(matches) # ["3", "5"]
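When the position of each occurrence matters as well, `re.finditer()` may be a better fit, since it yields match objects instead of plain strings. A small sketch on the same sentence:

```python
import re

string = "There are 3 apples and 5 oranges."

# Each item produced by finditer is a match object with position information
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", string)]
print(positions)  # [('3', 10), ('5', 23)]

# findall returns an empty list when nothing matches
print(re.findall(r"\d+", "no digits here"))  # []
```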

`sub()` - replacing patterns

The sub() function replaces all occurrences of a pattern with a specified string.


pattern = r"\d"
string = "Contact me at 123-456-7890."
result = re.sub(pattern, "X", string) # replacing digits with "X"
print(result) # Contact me at XXX-XXX-XXXX.
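Two optional features of `re.sub()` are worth knowing: the `count` argument limits how many replacements are made, and the replacement can be a function that receives each match object. A minimal sketch (the sample strings are only illustrative):

```python
import re

string = "Contact me at 123-456-7890."

# Replace only the first three digits
first_three = re.sub(r"\d", "X", string, count=3)
print(first_three)  # Contact me at XXX-456-7890.

# Compute each replacement from the matched text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 5 oranges")
print(doubled)  # 6 apples and 10 oranges
```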

`compile()` - creating reusable regex patterns

The compile() function compiles a regex pattern into a regex object, allowing us to reuse the same pattern multiple times.


pattern = re.compile(r"\d{3}-\d{2}-\d{4}") # pattern for a Social Security Number (SSN)

# Using the compiled pattern
string1 = "My SSN is 123-45-6789."
string2 = "Your SSN is 987-65-4321."

match1 = pattern.search(string1)
match2 = pattern.search(string2)

if match1:  
    print("Match found in string1:", match1.group()) # 123-45-6789

if match2:  
    print("Match found in string2:", match2.group()) # 987-65-4321
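A compiled pattern offers the same operations as the module-level functions, so it can also be used with `findall()` and `sub()`. A brief sketch, where the name `ssn_pattern` and the mask string are chosen only for illustration:

```python
import re

ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")
text = "My SSN is 123-45-6789. Your SSN is 987-65-4321."

found = ssn_pattern.findall(text)
print(found)   # ['123-45-6789', '987-65-4321']

masked = ssn_pattern.sub("***-**-****", text)
print(masked)  # My SSN is ***-**-****. Your SSN is ***-**-****.
```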

`group()` - extracting specific parts of a match

The group() method of the match object retrieves specific matched groups in the pattern. Groups are defined using parentheses in the regex.


pattern = r"(\d{3})-(\d{2})-(\d{4})" # grouping digits
string = "My SSN is 123-45-6789."
match = re.search(pattern, string)

if match:  
    print("Entire match:", match.group()) # 123-45-6789
    print("Group 1:", match.group(1)) # 123
    print("Group 2:", match.group(2)) # 45
    print("Group 3:", match.group(3)) # 6789
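Groups can also be retrieved all at once or by name. The sketch below uses `groups()`, named groups with the `(?P<name>...)` syntax, and `groupdict()`. The group names `area`, `middle`, and `serial` are illustrative choices. It also shows that `findall()` returns tuples when the pattern contains several groups:

```python
import re

pattern = r"(?P<area>\d{3})-(?P<middle>\d{2})-(?P<serial>\d{4})"
match = re.search(pattern, "My SSN is 123-45-6789.")

if match:
    print(match.groups())       # ('123', '45', '6789')
    print(match.group("area"))  # 123
    print(match.groupdict())    # {'area': '123', 'middle': '45', 'serial': '6789'}

# With several groups in the pattern, findall returns one tuple per match
tuples = re.findall(r"(\d{3})-(\d{2})-(\d{4})", "123-45-6789 and 987-65-4321")
print(tuples)  # [('123', '45', '6789'), ('987', '65', '4321')]
```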

Special sequences

Regex supports special sequences that simplify pattern creation:

`\d`	Matches any decimal digit (equivalent to `[0-9]` for ASCII text; with `str` patterns it also matches other Unicode digits).
`\D`	Matches any non-digit character.
`\w`	Matches any word character (letters, digits, and underscore).
`\W`	Matches any non-word character.
`\s`	Matches any whitespace character (space, tab, newline, and similar).
`\S`	Matches any non-whitespace character.


pattern = r"\d+"
string = "There are 12 cats and 34 dogs."
matches = re.findall(pattern, string)
print(matches) # ["12", "34"]
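The other sequences work the same way. A short sketch with `\w`, `\S`, and `\D` (the sample strings are only illustrative):

```python
import re

string = "There are 12 cats and 34 dogs."

# \w+ collects runs of word characters, so punctuation is dropped
words = re.findall(r"\w+", string)
print(words)   # ['There', 'are', '12', 'cats', 'and', '34', 'dogs']

# \S+ collects runs of non-whitespace, whatever the separator is
tokens = re.findall(r"\S+", "a  b\tc\nd")
print(tokens)  # ['a', 'b', 'c', 'd']

# \D matches every non-digit, so removing those leaves only the digits
digits = re.sub(r"\D", "", "Contact me at 123-456-7890.")
print(digits)  # 1234567890
```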

Special characters

Regex supports special characters for advanced pattern matching:

`.`	Matches any character except a newline.
`*`	Matches zero or more occurrences of the preceding element.
`+`	Matches one or more occurrences of the preceding element.
`?`	Matches zero or one occurrence of the preceding element.
`^`	Matches the start of a string.
`$`	Matches the end of a string.
`[]`	Matches any single character listed within the brackets.
`\`	Escapes a special character or introduces a special sequence.


# Matching any character
pattern = r"a.b"
string = "acb aab adb"
matches = re.findall(pattern, string)
print(matches) # ["acb", "aab", "adb"]

# Matching zero or more occurrences
pattern = r"ab*"
string = "abb ab abbb a"
matches = re.findall(pattern, string)
print(matches) # ["abb", "ab", "abbb", "a"]
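The remaining characters from the table can be tried in the same way. A compact sketch, with sample strings chosen only for illustration:

```python
import re

# + requires at least one "b", so the lone "a" no longer matches
plus = re.findall(r"ab+", "abb ab abbb a")
print(plus)      # ['abb', 'ab', 'abbb']

# ? makes the preceding "u" optional
optional = re.findall(r"colou?r", "color colour colr")
print(optional)  # ['color', 'colour']

# [] matches one character from the listed set
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)    # ['e', 'e']

# ^ and $ anchor the pattern to the start and end of the string
starts = bool(re.search(r"^Hello", "Hello World"))
ends = bool(re.search(r"World$", "Hello World"))
print(starts, ends)  # True True

# \. matches a literal dot instead of "any character"
versions = re.findall(r"\d+\.\d+", "Version 3.12 costs 3x12")
print(versions)  # ['3.12']
```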

Regex options (flags)

The re module provides several options to modify regex behavior:

`re.IGNORECASE`	Makes the regex case-insensitive.
`re.DOTALL`	Allows the `.` character to match newline characters as well.
`re.VERBOSE`	Allows whitespace and comments inside the pattern for better readability.


# Using re.IGNORECASE
pattern = re.compile(r"python", re.IGNORECASE)
match = pattern.search("I love PYTHON programming.")
if match:  
    print("Case-insensitive match found!")

# Using re.DOTALL
pattern = re.compile(r"Hello.*World", re.DOTALL)
string = "Hello\nWorld"
match = pattern.search(string)
if match:  
    print("Match found with DOTALL!")

# Using re.VERBOSE
pattern = re.compile(r"""
    \d{3}   # first part (three digits)
    -       # separator
    \d{2}   # middle part (two digits)
    -       # separator
    \d{4}   # last part (four digits)
""", re.VERBOSE)

match = pattern.search("123-45-6789")
if match:  
    print("Verbose match found!")
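Flags can be combined with the `|` operator. The following sketch applies all three flags from the table to a single pattern and, for contrast, shows that the same search fails once `re.DOTALL` is left out:

```python
import re

pattern = re.compile(r"""
    hello   # greeting, in any letter case
    .*      # anything in between, including newlines
    world   # closing word, in any letter case
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)

match = pattern.search("HELLO\nbig\nWORLD")
print(match is not None)  # True

# Without DOTALL, "." stops at the first newline and the search fails
no_dotall = re.search(r"hello.*world", "HELLO\nbig\nWORLD", re.IGNORECASE)
print(no_dotall)          # None
```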

The `FlashText` module

The `flashtext` module provides fast keyword search and replacement. It matches whole keywords rather than arbitrary patterns, and it can be much faster than regular expressions when a large text is scanned for many keywords at once. It is a third-party package and must be installed first (`pip install flashtext`).


from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword("Python")
keyword_processor.add_keyword("programming")

text = "Python is a popular programming language."
found_keywords = keyword_processor.extract_keywords(text)
print("Found keywords:", found_keywords) # Found keywords: ['Python', 'programming']
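For comparison, roughly the same lookup can be written with `re` alone by joining the keywords into one alternation. This is only a sketch of the regex counterpart, not of how FlashText works internally. The `keywords` list and the choice of `re.IGNORECASE` are assumptions made for illustration:

```python
import re

keywords = ["Python", "programming"]

# re.escape guards against keywords that contain regex metacharacters;
# \b restricts matches to whole words
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)

text = "Python is a popular programming language."
found = pattern.findall(text)
print(found)  # ['Python', 'programming']
```

With a handful of keywords this approach is perfectly adequate. The alternation grows with every added keyword, which is the situation a dedicated keyword matcher is meant to handle.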
