Regular Expressions for Data Cleaning

Extra Writeups

Author: Jeff Jacobs

Published: September 26, 2023

Since my Data Cleaning with Python video did not end up using regular expressions as part of the data cleaning process, in this writeup I want to quickly show you examples of how learning regular expressions can make your life 1000x easier when cleaning any data involving string variables.

Python’s re Library

Since regular expressions are the key building block for how programming language compilers/interpreters actually figure out what your code is telling them to do, most languages have a regular expression library built in, and Python is no exception! To load Python’s regular expression library, add the following line of code to the top of your Python file/notebook:

import re

Python’s r'' Syntax

Although regular expressions are ordinary string objects in Python, we use them differently from ordinary strings. You’ve seen ordinary non-RegEx strings like s = "Hello" or x = "abc123"; regular expressions can be confusing relative to these, since we use them as a sort of “meta-language” to describe patterns to search for within other strings.

This matters because both Python and the regular expression engine assign special meanings to backslash sequences. In an ordinary string, Python interprets \t as “place a tab character here”; in a regular expression, we instead want the two characters \ and t to reach the re engine intact, so that the engine can interpret them as “I am looking for a tab character.” To accomplish this separation, Python allows you to prefix regular expression strings with the letter r (for “raw”), which tells Python to leave the backslashes alone:

ordinary_string = "Hello\tWorld"
regex_string = r'Hello\tWorld'

Now, if we print the two strings, we can see the difference:

print(ordinary_string)
print(regex_string)
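The first print() call outputs Hello and World separated by an actual tab character, while the second outputs the literal text Hello\tWorld, backslash included.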

This will become important as you start to use regular expressions to clean data. For example, if you happen to come across a dataset in .tsv format and want to convert it into .csv so it is more consistently read and displayed across different operating systems, you can use the following RegEx pattern to find all of the \t characters and replace them with commas (we will learn about each of the functions used here in the sections below, so don’t worry if you don’t understand what’s happening for the moment):

original_file = """id\tvar1\tvar2
0\tJeff\t5
1\tJames\t6
2\tNakul\t7"""
tab_regex_str = r'\t'
tab_regex = re.compile(tab_regex_str)
cleaned_file = tab_regex.sub(",", original_file)
print(cleaned_file)
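The printed result is the same table with every tab replaced by a comma:

id,var1,var2
0,Jeff,5
1,James,6
2,Nakul,7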

Once the library is imported there are lots of functions you could use, but the following are the main ones you will need for data cleaning:

re.compile()

Long story short: although you could technically use all the features of the re library without compiling your regular expression strings, I highly recommend always converting a “raw” regular expression string into a compiled re object, since compiling the pattern once up front can make all of the subsequent matching functions run much faster, especially when you apply the same pattern over and over.

In the above .tsv-to-.csv example, we took the raw RegEx string tab_regex_str and compiled it using tab_regex = re.compile(tab_regex_str), so that we could then use tab_regex as a regular expression object, with all of the RegEx-related Python methods callable using the . operator:

match_result = tab_regex.match(original_file)
print(match_result)
search_result = tab_regex.search(original_file)
print(search_result)
findall_result = tab_regex.findall(original_file)
print(findall_result)
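Here match() prints None (since original_file does not begin with a tab, as discussed below), search() prints a re.Match object for the first tab it finds, and findall() prints a list containing one '\t' entry for each of the eight tabs in the file.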

If, for whatever reason, you don’t want to compile your RegEx strings, you can still call all of these functions by using the re module directly and providing a RegEx string as the first argument to the function, like

match_result_nocompile = re.match(
    tab_regex_str,
    original_file
)
print(match_result_nocompile)
search_result_nocompile = re.search(tab_regex_str, original_file)
print(search_result_nocompile)
findall_result_nocompile = re.findall(tab_regex_str, original_file)
print(findall_result_nocompile)

Since I want you to get in the habit of compiling your regular expression strings, I’m going to use the object-specific syntax (e.g., my_compiled_regex.match()) rather than the global syntax (re.match()), for this and all remaining functions.

Now let’s look at what each of these three functions does!

Binary Accept vs. Reject: match()

You can think of match() as associated with the original purpose of regular expressions: to take in a string and accept or reject that string based on whether or not it matches the pattern described by the regular expression (see Week 05 for examples). This function will return information on the match if the match was successful, or the Python “null” value None otherwise1.

In the above examples, we saw that our RegEx string did not in fact match the provided string original_file. This is the expected behavior, and makes sense: .match() only succeeds if the regular expression matches at the very beginning of the provided string (if you need the entire string to match, use .fullmatch() instead). Since original_file begins with the characters id rather than a tab, the regular expression r'\t' (which matches a single \t character) fails to match it. The simplest string that our RegEx pattern r'\t' will match is the single-character string "\t":

single_tab_character = "\t"
print(tab_regex.match(single_tab_character))
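This prints a re.Match object spanning positions 0 to 1 of the string, rather than None.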

But, even adding a single additional character before the \t will cause this RegEx string to no longer match2:

two_characters = "s\t"
print(tab_regex.match(two_characters))
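This prints None, since two_characters no longer begins with a tab.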

This can be extremely helpful when you want to (for example) validate a column in a dataset, like a phone number column: if you write a phone number RegEx string, you can use it in conjunction with match() and Pandas to check that every string in the column is a match:

import pandas as pd
phone_reg_str = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
phone_reg = re.compile(phone_reg_str)
number_list = ['202-123-4567','202-999-9999','301x123x1234']
phone_df = pd.DataFrame({'phone_num': number_list})
phone_df['match'] = phone_df['phone_num'].apply(phone_reg.match)
phone_df
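Here the first two numbers produce re.Match objects, while the malformed 301x123x1234 produces None. (One caveat: since .match() only anchors at the start of the string, a value like '202-123-4567 ext. 2' would still count as a match; use phone_reg.fullmatch if you want the entire value to conform to the pattern.)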

I do tend to just store the .match() result in a column like this while cleaning. But if you’re going to export the dataset, putting the re.Match object itself into a column is a bit sloppy (it may cause issues if you try to save it as .csv and load it on a different operating system, for example). So, a cleaner version safe for export could look like:

phone_df['match'] = phone_df['phone_num'].apply(lambda x: phone_reg.match(x) is not None)
phone_df
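Now the match column simply holds True for the first two numbers and False for the third.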

Searching a String for Matches: findall()

The findall() function, unlike match(), does not require that the pattern match at the beginning of the string. Instead, it finds all substrings within the bigger string that match the pattern (put another way, every non-overlapping substring that the pattern matches).

This is helpful in a different data cleaning case: extracting values from freeform text. For example, imagine a dataset containing the results from a survey with a freeform text field like “Introduce Yourself!”, and you want to go through this field and extract all of the phone numbers that respondents have entered:

responses = [
    "Hello my number is 202-123-4567",
    "Hi thanks for the survey call me at 301-111-1111 or 925-123-1111",
    "I hate this survey don't ever call 240-999-9999",
    "I don't have a phone number sorry"
]
survey_df = pd.DataFrame({
    'response': responses
})
survey_df

We can run the findall() function on each entry in the response column here, to extract just the phone numbers:

survey_df['matches'] = survey_df['response'].apply(phone_reg.findall)
survey_df

Note that findall() returns a list of matches rather than a single match, so if you’re expecting only one you’ll have to handle the case of multiple matches (like the second response in this example) as well as the case of no matches (the last response).
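One common pattern, for example, is to keep just the first match, or None when the list is empty, in its own column. A minimal sketch (the first_match column name is just my choice here):

survey_df['first_match'] = survey_df['matches'].apply(
    lambda matches: matches[0] if matches else None
)
survey_df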

Smart Substitution: sub()

Oftentimes you’re not worried about analyzing the re.Match objects themselves; you just want to use a RegEx to do a fancier find-and-replace than what’s possible using .replace(x, y) (the Python string method that finds all instances of x in a string and replaces them with y).

For example, in the Data Cleaning with Python video/writeup, I found that one of the counties in one dataset did not match the same county in the other dataset, because in one the county name was abbreviated as St. Mary's while in the other it was fully written out as Saint Mary's.

In that case, I was able to just use .replace(), since there was only one instance and I knew exactly what I wanted to replace with what. But for a trickier case that regular expressions can handle while .replace() can’t, imagine merging one dataset where two-character abbreviations like this are followed by a period, like St. Marys, with another dataset where they are not, like St Marys. (This is actually fairly common, as names like these are sometimes entered into old database systems used by government agencies that don’t allow entering commas and/or apostrophes.)

To handle this, we can write code using regular expressions that finds all two-letter abbreviations—portions of a string starting with a capital letter, followed by a lowercase letter, followed by a period—and replaces them with the same string but without the period at the end.

abbrev_reg_str = r'[A-Z][a-z]\.'
abbrev_reg = re.compile(abbrev_reg_str)
county_list = [
    'St. Marys',
    'Ft. Worth',
    'Montgomery',
    'Mt. Everest',
    'Mt. St. Vincent'
]
county_df = pd.DataFrame({'name': county_list})
county_df

Like before, we can use abbrev_reg.findall() to find instances of this type of abbreviation across the dataset:

county_df['matches'] = county_df['name'].apply(abbrev_reg.findall)
county_df

However, to be able to automatically replace the matches with a given replacement, we’ll have to dive one level deeper into regular expressions. I will talk about this more in lecture, but it boils down to this: you can use parentheses within your regular expression to indicate match groups, subsets of the regular expression that you would like re to keep track of when it finds a match, and then provide a replacement string that utilizes these match groups. It is easier to explain through example than through text: in the following code, we use a named match group to tell Python that we want to extract just the country code from an entire phone number:

area_code_reg_str = r'(?P<country_code>[+][0-9]{1,3})-[0-9]{3}-[0-9]{3}-[0-9]{4}'
area_code_reg = re.compile(area_code_reg_str)
my_num = '+970-202-111-1111'
match_result = area_code_reg.match(my_num)
print(match_result)

So, we see that we have a match, but we took the extra step of putting parentheses around the part of the match that we specifically wanted, and telling Python to call this part country_code. Now we can use the .groupdict() method of the re.Match object to extract just this subset of the full match:

match_result.groupdict()
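This returns {'country_code': '+970'}: just the captured group, rather than the full match.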

We can also directly use the captured groups within the sub() method, like

area_code_reg.sub(r'Country code \1', my_num)
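This produces the string 'Country code +970', where \1 refers to the first captured group. Since we gave the group a name, we could equivalently refer to it by name in the replacement string, as r'Country code \g<country_code>'.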

By applying this same logic to the abbreviation matches above, therefore, we can capture the two letters in a group and substitute them back in without the trailing period.
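Here is a minimal sketch of what that could look like (abbrev_group_reg and the cleaned column are names I’m introducing for illustration):

abbrev_group_reg = re.compile(r'([A-Z][a-z])\.')
county_df['cleaned'] = county_df['name'].apply(
    lambda name: abbrev_group_reg.sub(r'\1', name)
)
county_df

Each two-letter abbreviation is captured by the parenthesized group and substituted back in via \1, so St. Marys becomes St Marys and Mt. St. Vincent becomes Mt St Vincent, while names without abbreviations, like Montgomery, are left untouched.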

Footnotes

  1. You have to be extremely careful when running RegEx functions (or any functions/libraries where functions can return None) in Jupyter notebooks for this reason: the way Jupyter is set up, if the last line of code in a cell has the value None, Jupyter just won’t display anything at all when you run that cell, even if the execution of the cell was totally successful (i.e., even if there were no errors). That’s why, as in the above examples, I try to remember to always print() the results of RegEx functions in Python, rather than just expecting Jupyter to automatically display their result.

  2. To reiterate the point about None: if we hadn’t used print() in this case to explicitly tell Jupyter to print the result of the last line, this cell wouldn’t have produced any output.