Week 5: Data Cleaning

DSAN 5000: Data Science and Analytics

Class Sessions
Author

Prof. Jeff and Prof. James

Published

Thursday, September 26, 2024

Open slides in new window →

Week 04 Recap

  • Dataset Types / Formats
  • Web Scraping
  • APIs

Data Formats: Network Data

  • Network data form two tabular datasets: one for nodes and one for edges
  • Ex: Network on left could be represented via pair of tabular datasets on right
Figure 1: Network of some sort of ancient kinship structure
node_id label
1 Bulbasaur
2 Ivysaur
\(\vdots\) \(\vdots\)
9 Blastoise
Figure 2: A tabular dataset representing the nodes in the network above
edge_id source target weight
0 1 2 16
1 2 3 32
2 4 5 16
3 5 6 36
4 7 8 16
5 8 9 36
Figure 3: A tabular dataset representing the edges in the network above

Web Scraping with requests and BeautifulSoup

What I Want To Do Python Code I Can Use
Send an HTTP GET request response = requests.get(url)
Send an HTTP POST request response = requests.post(url, post_data)
Get just the plain HTML code (excluding headers, JS) returned by the request html_str = response.text
Parse a string containing HTML code soup = BeautifulSoup(html_str, 'html.parser')
Get contents of all <xyz> tags in the parsed HTML xyz_elts = soup.find_all('xyz')
Get contents of the first <xyz> tag in the parsed HTML xyz_elt = soup.find('xyz')
Get just the text (without formatting or tag info) contained in a given element xyz_elt.text

APIs Recap

  • Keep in mind the distinction between your entire application and the API endpoints you want to make available to other developers!
Application Should Be Endpoints Shouldn’t Be Endpoints
Voting Machine cast_vote() (End User), get_vote_totals() (Admin) get_vote(name), get_previous_vote()
Gaming Platform get_points() (Anyone), add_points(), remove_points() (Game Companies) set_points()
Thermometer view_temperature() release_mercury()
Canvas App for Georgetown view_grades() (different for Students and Teachers) SQL Statement for Storing and Retrieving Grades in Georgetown DB

APIs Part 2

REST vs. SOAP vs. GraphQL

  • SOAP: Standard Object Access Protocol: Operates using a stricter schema (data must have this form, must include these keys, etc.), XML only
  • REST (REpresentational State Transfer): Uses standard HTTP, wide range of formats
  • GraphQL (Graph Query Language): Rather than exposing several endpoints, each performing a single function, GraphQL exposes a single endpoint for queries (so that the server then goes and figures out how to satisfy these queries)

G1 cluster_00 Endpoints dev Developer api1 get_users dev->api1 API call api2 get_friends dev->api2 API call (id = 5) figureOut is User 5 friends w Pitbull? dev->figureOut api1->dev api2->dev

Figure 4: Using individual endpoints (get_users and get_friends) to derive answer to “Is User 5 friends with Pitbull?”

G2 cluster_00 (Single) Endpoint api1 get_users gqlServer GraphQL Server api1->gqlServer (internal) dev Developer dev->gqlServer GraphQL API call ("is User 5 friends with Pitbull?") gqlServer->api1 gqlServer->dev api2 get_friends gqlServer->api2 (internal) api2->gqlServer

Figure 5: Answering the same question (“Is User 5 friends with Pitbull?”) directly using GraphQL

What Should I Include in My API?

Key Principle: CRUD

  • Create: Place a new record in some table(s)
  • Read: Get data for all (or subset of) records
  • Update*: Locate record, change its value(s)
    • Upsert”: Update if already exists, otherwise create
  • Delete: Remove record from some table(s)

Cleaning Tabular Data

The Unexpected Pitfalls

  • You find the perfect dataset for your project, only to open it and find…

Data Cleaning

  • The most undervalued skill in data science!
  • Regardless of industry, absurd variety of data formats1

Source: XKCD #927

The Data Cleaning Toolbox

  • Text Editors
  • Regular Expressions
  • Conversion Tools
  • HTML Parsers

Text Editors

  • “Broken” data can often be fixed by manually examining it in a text editor!
my_data.csv
id,var_A,var_B,var_C\n
1,val_1A,val_1B,val_1C\r\n
2,val_2A,val_2B,val_2C\n
3,val_3A,val_3B,val_3C\n
library(readr)
data <- read_csv("assets/my_data.csv")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): index, var_1, var_2, var_3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
index var_1 var_2 var_3
A val_A1 val_A2 val_A3
B val_B1 val_B2 val_B3
C val_C1 val_C2 val_C3

Regular Expressions

  • Language for turning unstructured data into structured data
  • In Computer Science, a whole course, if not two
  • tldr: a regular expression, or a RegEx string, represents a machine that either accepts or rejects input strings
RegEx [A-Za-z0-9]+ @ [A-Za-z0-9.-]+ \. (com|org|edu) Result:
String A jj1088 @ georgetown . edu Accept
String B spammer @ fakesite!! . coolio Reject

Regular Expressions: Intuition

  • The guiding principle: think of the types of strings you want to match:
    • What kinds of characters do they contain? (and, what kinds of characters should they not contain?)
    • What is the pattern in which these characters appear: is there a character limit? Which characters/patterns are optional, which are required?
  • You can then use the RegEx syntax to encode the answers to these questions!

RegEx Syntax: Single Characters

  • z: Match lowercase z, a single time
  • zz: Match two lowercase zs in a row
  • z{n}: Match n lowercase zs in a row
  • [abc]: Match a, b, or c, a single time
  • [A-Z]: Match one uppercase letter
  • [0-9]: Match one numeric digit
  • [A-Za-z0-9]: Match a single alphanumeric character
  • [A-Za-z0-9]{n}: Match n alphanumeric characters

RegEx Syntax: Repeating Patterns

  • z*: Match lowercase z zero or more times
  • z+: Match lowercase z one or more times
  • z?: Match zero or one lowercase zs
z* z+ z? z{3}
""
"z"
"zzz"

Example: US Phone Numbers

  • Area code sometimes surrounded by parentheses:
    • 202-687-1587 and (202) 687-1587 both valid!
  • Which repeating pattern syntax (from the previous slide) helps us here?
RegEx [(]? [0-9]{3} [)]? [ -] [0-9]{3}-[0-9]{4} Result
"202-687-1587" \(\varepsilon\) 202 \(\varepsilon\) - 687-1587 Accept
"(202) 687-1587" ( 202 )   687-1587 Accept
"2020687-1587" \(\varepsilon\) 202 \(\varepsilon\) 0 687-1587 Reject

Building and Testing RegEx Strings

RegExr.com

Recall: Tidy Data

Overview | In-Depth

  • Every column is a variable
  • Every row is an observation
  • Every cell is a value
Var1 Var 2
Obs 1 Val 1 Val 2
Obs 2 Val 3 Val 4
Figure 6: A template for tidy data, with observations in blue, variables in orange, and values in green
Code
library(tidyverse)
Code
table1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ stringr   1.5.1
✔ forcats   1.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583
Figure 7: An example panel dataset in tidy format
Code
table2 |> head(6)
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Figure 8: An example panel dataset in non-tidy format

How Do We Get Our Data Into Tidy Form?

  • R: The Tidyverse
  • Python: Pandas + Regular Expressions (lab demo!)

Cleaning Text Data

One of the Scariest Papers of All Time

Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It (Denny and Spirling 2018) (PDF Link)

Figure 9: A plot where \(x\)-axis represents UK party manifestos (arranged on left-right ideology dimension, \(y\)-axis slots represent an ordering of preprocessing steps, and a filled bar means the manifesto was placed incorrectly on the ideological dimension when that preprocessing combination was used.
Figure 10: An ancient hieroglyph

The Secret Behind All Text Analysis

Figure 11: Excerpts from two data science textbooks, plus another book

doc_id text texts Kékkek voice
0 0 6 0 1
1 0 0 3 1
2 6 0 0 0
Figure 12: The Document-Term Matrix (DTM)

 

doc_id text kekkek voice
0 6 0 1
1 0 3 1
2 6 0 0
Figure 13: The cleaned DTM, after lowercasing, lemmatization, and unicode standardization

Lab Demonstrations

Lab Demo: Data Cleaning in Python

Alternative Lab Demo: Regular Expressions for Data Cleaning

Lab Assignment Overview

Lab-2.2: Cleaning Record Data in R and Python

References

Denny, Matthew J., and Arthur Spirling. 2018. “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It.” Political Analysis 26 (2): 168–89. https://doi.org/10.1017/pan.2017.44.
Immerwahr, Daniel. 2019. How to Hide an Empire: A History of the Greater United States. Farrar, Straus and Giroux.

Footnotes

  1. To fully appreciate the importance of standards in the modern industrial/technology-driven world, see Chapter 18: “The Empire of the Red Octagon” (referring to the US’s octagonal stop sign), in Immerwahr (2019).↩︎