Week 4: Data Formats and APIs

DSAN 5000: Data Science and Analytics

Class Sessions
Author

Prof. Jeff and Prof. James

Published

Thursday, September 19, 2024


Week 03 Recap

Image from Menczer, Fortunato, and Davis (2020, 90)

Git Commands

| Command | What It Does |
|---|---|
| git clone | Downloads a repo from the web to our local computer |
| git init | Creates a new, blank Git repository on our local computer (configuration/change-tracking stored in .git subfolder) |
| git add | Stages one or more files: Git will now track changes in these files |
| git reset | Undoes a git add |
| git status | Shows currently staged files and their status (created, modified, deleted) |
| git commit -m "message" | "Saves" the current version of all staged files, ready to be pushed to a backup dir or remote server like GitHub |
| git push | Transmits local commits to the remote server |
| git pull | Downloads commits from the remote server to the local computer |
| git merge | Merges remote versions of files with local versions |

Reproducible Docs/Literate Programming

  • 1980s: \(\LaTeX\) for \(\widehat{\mathcal{T}}\)ypesetting \(\sqrt{math}^2\)
  • 1990s: Python and R as powerful scripting languages (no compilation required)
  • 2000s/2010s: Interactive Python via Jupyter, fancy IDE for R called RStudio
  • 2020s: Quarto (using pandoc under the hood) enables use of markdown for formatting, \(\LaTeX\) for math, and both Python and R in same document, with choice of output formats (HTML, presentations, Word docs, …)

Quiz Time

Data Gathering

  • Preexisting data sources
  • Web scraping
  • Converting between formats

Preexisting Data Sources

Web Scraping

  • Fun fact: you can view a webpage’s HTML source code by right-clicking on the page and selecting “View Source”
    • On older websites, this means we can just request the page and parse the returned HTML
  • Less fun fact: modern web frameworks (React, Next.js) generate pages dynamically using JS, meaning that what you see on the page will not be visible in the HTML source
    • Data scraping still possible for these sites! Using browser automation tools like Selenium

Scraping Difficulty

| Difficulty | How is data loaded? | Solution / Example |
|---|---|---|
| 😊 Easy | Data in HTML source | "View Source" |
| 😐 Medium | Data loaded dynamically via API | "View Source", find API call, scrape programmatically |
| 😳 Hard | Data loaded dynamically [internally] via web framework | Use Selenium |

Data Structures: Foundations

  • Could be (is) a whole class, hypothetically taught in the Spring
  • Could be (is) a whole class just for one type of data (geographic/spatial)
  • For this class: some foundational principles that should let you figure out fancier data structures you encounter

Opening Datasets With Your Terminator Glasses On

  • What does a row represent?
  • What does a column represent?
  • What does a value in a cell represent?
  • Are there unique identifiers for the objects you care about?
Figure 1: What you should see when you look at a new dataset

From Raw Data to Clean Data

Data Structures: Simple \(\rightarrow\) Complex

| id | name | email |
|---|---|---|
| 0 | K. Desbrow | kd9@dailymail.com |
| 1 | D. Minall | dminall1@wired.com |
| 2 | C. Knight | ck2@microsoft.com |
| 3 | M. McCaffrey | mccaf4@nhs.uk |

(a) Record Data

| year | month | points |
|---|---|---|
| 2023 | Jan | 65 |
| 2023 | Feb | |
| 2023 | Mar | 42 |
| 2023 | Apr | 11 |

(b) Time-Series Data

| id | date | rating | num_rides |
|---|---|---|---|
| 0 | 2023-01 | 0.75 | 45 |
| 0 | 2023-02 | 0.89 | 63 |
| 0 | 2023-03 | 0.97 | 7 |
| 1 | 2023-06 | 0.07 | 10 |

(c) Panel Data

| Source | Target | Weight |
|---|---|---|
| IGF2 | IGF1R | 1 |
| IGF1R | TP53 | 2 |
| TP53 | EGFR | 0.5 |

(d) Network Data
Figure 2: Fake data via Mockaroo and Random.org. Protein-protein interaction network from Agrawal, Zitnik, and Leskovec (2018)

Tabular Data vs. Relational Data

  • All of the datasets on the previous slide are tabular
  • DBs like SQLite, MySQL model relationships within and between tabular datasets
  • Imagine you’re creating the backend for a Georgetown social network. How would you record users and friendships? Your intuition may be record data:
| id | name | friends |
|---|---|---|
| 1 | Purna | [2,3,4] |
| 2 | Jeff | [1,3,4,5,6] |
| 3 | James | [1,2,4,6] |
| 4 | Britt | [1,2,3] |
| 5 | Dr. Fauci | [2,6] |
| 6 | Pitbull | [2,5] |
Figure 3: Our first attempt at a data structure for our social network app’s backend

Long story short…

  • This doesn’t scale
  • Extremely inefficient to find whether two users are friends
  • Redundant information: Have to store friendship between A and B in both A’s row and B’s row

A Better Approach

  • Move the friendship data into its own table!
  • This table now represents relational data (the user table still corresponds to record data):
| user_id | name |
|---|---|
| 1 | Purna |
| 2 | Jeff |
| 3 | James |
| 4 | Britt |
| 5 | Dr. Fauci |
| 6 | Pitbull |
Figure 4: The user table in our relational structure
| id | friend_1 | friend_2 |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
| 4 | 2 | 3 |
| 5 | 2 | 4 |
| 6 | 2 | 5 |
| 7 | 2 | 6 |
| 8 | 3 | 4 |
| 9 | 3 | 6 |
| 10 | 5 | 6 |
Figure 5: The friendships table in our relational structure
  • May seem weird in terms of human readability, but think in terms of memory/computational efficiency: (a) scalable, (b) easy to find whether two users are friends (via sorting/searching algorithms), (c) no redundant info (see the pandas sketch below)
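As a concrete (if tiny) illustration of point (b), here is a minimal pandas sketch of looking up a friendship in the friendships table; the are_friends() helper name is hypothetical, not part of the lecture code:

```python
import pandas as pd

# Users and friendships tables, mirroring Figures 4 and 5
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'name': ['Purna', 'Jeff', 'James', 'Britt', 'Dr. Fauci', 'Pitbull'],
})
friendships = pd.DataFrame({
    'friend_1': [1, 1, 1, 2, 2, 2, 2, 3, 3, 5],
    'friend_2': [2, 3, 4, 3, 4, 5, 6, 4, 6, 6],
})

def are_friends(a, b):
    """Check whether users a and b are friends (each pair stored once, smaller id first)."""
    lo, hi = min(a, b), max(a, b)
    match = (friendships['friend_1'] == lo) & (friendships['friend_2'] == hi)
    return bool(match.any())

print(are_friends(2, 5))  # True: Jeff and Dr. Fauci
print(are_friends(1, 6))  # False: Purna and Pitbull
```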

DBs: Relational or Otherwise

  • For the rest of the lecture we zoom in on cases where data comes as individual files
  • But on top of the relational format from previous slide, there are also non-relational database formats, like the document-based format used by e.g. MongoDB1
  • In either case, data spread over many files \(\Rightarrow\) to obtain a single dataset we use queries.

[Diagram: File (.csv/.json/etc.) → read_csv() → Dataset]

Static datasets (individual files on disk)

[Diagram: Database (Table 1, Table 2, …, Table N) → Query → Dataset]

Datasets formed dynamically via database queries
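As a rough sketch of the query path, Python's built-in sqlite3 module can join the users and friendships tables from the previous slides into a single dataset; the in-memory database and table definitions below are purely illustrative, not a recommended schema:

```python
import sqlite3

# Build a throwaway in-memory database holding the two tables from Figures 4-5
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendships (id INTEGER PRIMARY KEY, friend_1 INTEGER, friend_2 INTEGER);
    INSERT INTO users VALUES (1,'Purna'), (2,'Jeff'), (3,'James'),
                             (4,'Britt'), (5,'Dr. Fauci'), (6,'Pitbull');
    INSERT INTO friendships (friend_1, friend_2) VALUES
        (1,2),(1,3),(1,4),(2,3),(2,4),(2,5),(2,6),(3,4),(3,6),(5,6);
""")

# One query joins the two tables into a single "friend pairs by name" dataset
query = """
    SELECT u1.name AS person, u2.name AS friend
    FROM friendships f
    JOIN users u1 ON f.friend_1 = u1.user_id
    JOIN users u2 ON f.friend_2 = u2.user_id;
"""
for person, friend in con.execute(query):
    print(person, "<->", friend)
```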

Data Formats

  • The most common formats, for most fields:
    • .csv: Comma-Separated Values
    • .tsv: Tab-Separated Values
    • .json: JavaScript Object Notation
    • .xls/.xlsx: Excel format
    • .dta: Stata format
    • .yml: More human-readable alternative to JSON

.csv / .tsv

👍

my_data.csv
index,var_1,var_2,var_3
A,val_A1,val_A2,val_A3
B,val_B1,val_B2,val_B3
C,val_C1,val_C2,val_C3
D,val_D1,val_D2,val_D3

(👎)

my_data.tsv
index var_1 var_2 var_3
A val_A1  val_A2  val_A3
B val_B1  val_B2  val_B3
C val_C1  val_C2  val_C3
D val_D1  val_D2  val_D3

source("../_globals.r")
library(readr)
data <- read_csv("assets/my_data.csv")
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): index, var_1, var_2, var_3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
disp(data)
# | index | var_1 | var_2 | var_3 |
# | - | - | - | - |
# | A | val_A1 | val_A2 | val_A3 |
# | B | val_B1 | val_B2 | val_B3 |
# | C | val_C1 | val_C2 | val_C3 |
# | D | val_D1 | val_D2 | val_D3 | 
  • Python: pd.read_csv() (from Pandas library)
  • R: read_csv() (from readr library)
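For reference, a minimal Python sketch of the same load, assuming my_data.csv and my_data.tsv sit in the working directory:

```python
import pandas as pd

# Read the CSV shown above; the first row becomes the column names
df = pd.read_csv("my_data.csv")
print(df.head())

# For .tsv files, the same function works with a tab separator
df_tsv = pd.read_csv("my_data.tsv", sep="\t")
```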

.json

courses.json
{
  "dsan5000": {
    "title": "Data Science and Analytics",
    "credits": 3,
    "lectures": [
      "Intro",
      "Tools and Workflow"
    ]
  },
  "dsan5100": {
    "title": "Probabilistic Modeling and Statistical Computing",
    "credits": 3,
    "lectures": [
      "Intro",
      "Conditional Probability"
    ]
  }
}
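A minimal sketch of loading this file in Python with the standard-library json module (assuming courses.json is in the working directory):

```python
import json

# Parse the JSON file into nested Python dicts/lists
with open("courses.json", "r") as infile:
    courses = json.load(infile)

print(courses["dsan5000"]["title"])     # "Data Science and Analytics"
print(courses["dsan5100"]["lectures"])  # ["Intro", "Conditional Probability"]
```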

Other Formats

  • .xls/.xlsx: Requires special libraries in Python/R
  • .dta: Stata format, but can be read/written to in Python/R
    • Python: Pandas has built-in pd.read_stata() for reading and DataFrame.to_stata() for writing
    • R: read_dta() from the haven package (part of the tidyverse)
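A hedged Python sketch of both cases; the filenames are hypothetical, and pd.read_excel() requires an engine such as openpyxl to be installed separately:

```python
import pandas as pd

# Excel: needs an extra engine like openpyxl installed (pip install openpyxl)
xlsx_df = pd.read_excel("survey_results.xlsx", sheet_name=0)

# Stata: built into Pandas, no extra dependencies
dta_df = pd.read_stata("survey_results.dta")

# Writing back out to Stata format
xlsx_df.to_stata("survey_results_copy.dta")
```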

Web Scraping

| | Difficulty | How is data loaded? | Solution / Example |
|---|---|---|---|
| This section → | 😊 Easy | Data in HTML source | "View Source" |
| Next section → | 😐 Medium | Data loaded dynamically via API | "View Source", find API call, scrape programmatically |
| Future weeks → | 😳 Hard | Data loaded dynamically [internally] via web framework | Use Selenium |

Scraping HTML with requests and BeautifulSoup

requests Documentation | BeautifulSoup Documentation

Code
import requests
# Perform request
response = requests.get("https://en.wikipedia.org/wiki/Data_science")
# Parse HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
all_header_elts = soup.find_all("h2")
all_header_text = [elt.text for elt in all_header_elts]
#section_headers = [h.find("span", {'class': 'mw-headline'}).text for h in all_headers[1:]]
print("\n".join(all_header_text))
Contents
Foundations
Etymology
Data science and data analysis
Data Science as an Academic Discipline
Cloud computing for data science
Ethical consideration in data science
See also
References

The Power of find_all()

data_page.html
<div class="all-the-data">
    <h4>First Dataset</h4>
    <div class="data-1">
        <div class="dataval">1</div>
        <div class="dataval">2</div>
        <div class="dataval">3</div>
    </div>
    <h4>Second Dataset</h4>
    <div class="data-2">
        <ul>
            <li>4.0</li>
            <li>5.5</li>
            <li>6.7</li>
        </ul>
    </div>
</div>
Figure 6: Data in page elements (<div>, <li>)

First Dataset

1
2
3

Second Dataset

  • 4.0
  • 5.5
  • 6.7
Figure 7: The code from Figure 6, rendered by your browser

from bs4 import BeautifulSoup
# page_html holds the contents of data_page.html (Figure 6)
soup = BeautifulSoup(page_html, 'html.parser')
ds1_elt = soup.find("div", class_='data-1')
ds1 = [e.text for e in ds1_elt.find_all("div")]
ds2_elt = soup.find("div", {'class': 'data-2'})
ds2 = [e.text for e in ds2_elt.find_all("li")]
Figure 8: BeautifulSoup code used to parse HTML

print(f"dataset-1: {ds1}\ndataset-2: {ds2}")
dataset-1: ['1', '2', '3']
dataset-2: ['4.0', '5.5', '6.7']
Figure 9: Python lists containing parsed data

Parsing HTML Tables

table_data.html
<table>
<thead>
    <tr>
        <th>X1</th><th>X2</th><th>X3</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>1</td><td>3</td><td>5</td>
    </tr>
    <tr>
        <td>2</td><td>4</td><td>6</td>
    </tr>
</tbody>
</table>
Figure 10: Data in HTML table format

X1 X2 X3
1 3 5
2 4 6
Figure 11: The HTML table code, as rendered by your browser

from bs4 import BeautifulSoup
# table_html holds the contents of table_data.html (Figure 10)
soup = BeautifulSoup(table_html, 'html.parser')
thead = soup.find("thead")
headers = [e.text for e in thead.find_all("th")]
tbody = soup.find("tbody")
rows = tbody.find_all("tr")
data = [[e.text for e in r.find_all("td")]
            for r in rows]
Figure 12: BeautifulSoup code used to parse HTML

print(f"headers: {headers}\ndata: {data}")
headers: ['X1', 'X2', 'X3']
data: [['1', '3', '5'], ['2', '4', '6']]
Figure 13: Python lists containing parsed data
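One natural next step, not part of the figures above, is to load the parsed headers and rows into a pandas DataFrame:

```python
import pandas as pd

# Values produced by the parsing code in Figure 12 (see Figure 13)
headers = ['X1', 'X2', 'X3']
data = [['1', '3', '5'], ['2', '4', '6']]

# Build a DataFrame and convert the string cells to integers
df = pd.DataFrame(data, columns=headers).astype(int)
print(df)
```

Pandas also provides pd.read_html(), which parses every <table> element on a page directly into a list of DataFrames (it needs an HTML parser such as lxml or BeautifulSoup installed).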

APIs

Application Programming Interfaces: developer-facing part of data pipeline/service. Hides unnecessary details:

| Example | Care about 🧐 | Don't care about 🙅‍♂️ |
|---|---|---|
| Electrical outlet | Electricity | Details of Alternating/Direct Currents |
| Water fountain | Water | Details of how it's pumped into the fountain |
| Car | Accelerate, brake, reverse | Details of combustion engine |

What Does an API Do?

Exposes endpoints for use by developers, without requiring them to know the nuts and bolts of your pipeline/service:

| Example | Endpoint | Not Exposed |
|---|---|---|
| Electrical outlet | Socket | Internal wiring |
| Water fountain | Aerator | Water pump |
| Car | Pedals, Steering wheel, etc. | Engine |

When I’m teaching programming to students in refugee camps who may have never used a computer before, I try to use the idea of “robots”: a program is a robot trained to sit there and wait for inputs, then process them in some way and spit out some output. APIs really capture this notion, honestly.

Example: Math API

Code
import requests
response = requests.get("https://newton.vercel.app/api/v2/factor/x^2-1")
print(response.json())
{'operation': 'factor', 'expression': 'x^2-1', 'result': '(x - 1) (x + 1)'}

Math API Endpoints

| Operation | API Endpoint | Result |
|---|---|---|
| Simplify | /simplify/2^2+2(2) | 8 |
| Factor | /factor/x^2 + 2x | x (x + 2) |
| Derive | /derive/x^2+2x | 2 x + 2 |
| Integrate | /integrate/x^2+2x | 1/3 x^3 + x^2 + C |
| Find 0's | /zeroes/x^2+2x | [-2, 0] |
| Find Tangent | /tangent/2\|x^3 | 12 x + -16 |
| Area Under Curve | /area/2:4\|x^3 | 60 |
| Cosine | /cos/pi | -1 |
| Sine | /sin/0 | 0 |
| Tangent | /tan/0 | 0 |
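As a quick usage sketch, the same requests pattern from the example above can loop over several of these endpoints (here reusing the x^2-1 expression; expressions containing characters like + may need to be URL-encoded first):

```python
import requests

BASE_URL = "https://newton.vercel.app/api/v2"
expression = "x^2-1"  # same expression as the example above

# Hit several endpoints from the table with the same expression
for operation in ["simplify", "factor", "derive", "zeroes"]:
    response = requests.get(f"{BASE_URL}/{operation}/{expression}")
    print(operation, "->", response.json()["result"])
```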

Authentication

  • Unlike the math API, most APIs do not allow anonymous requests; they require authentication.
  • For example, you can access public GitHub repos anonymously, but to access private GitHub repos using GitHub’s API, you’ll need to authenticate that you are in fact the one making the request

Authentication via PyGithub

PyGithub Installation

Install using the following terminal/shell command [Documentation]

pip install PyGithub

PyGithub can handle authentication for you. Example: this private repo in my account does not show up unless the request is authenticated (via a Personal Access Token)2:

import github
g = github.Github()
try:
  g.get_repo("jpowerj/private-repo-test")
except Exception as e:
  print(e)
Figure 14: Using the GitHub API without authentication
# Load the access token securely
import os
my_access_token = os.getenv('GITHUB_TOKEN')
import github
# Use the access token to make an API request
auth = github.Auth.Token(my_access_token)
g = github.Github(auth=auth)
g.get_user().get_repo("private-repo-test")
Figure 15: Using the GitHub API with authentication

References

Agrawal, Monica, Marinka Zitnik, and Jure Leskovec. 2018. “Large-Scale Analysis of Disease Pathways in the Human Interactome.” In PACIFIC SYMPOSIUM on BIOCOMPUTING 2018: Proceedings of the Pacific Symposium, 111–22. World Scientific.
Menczer, Filippo, Santo Fortunato, and Clayton A. Davis. 2020. A First Course in Network Science. Cambridge University Press.

Appendix: Code Examples in R

Scraping HTML with httr2 and xml2

httr2 Documentation | xml2 Documentation

Code
library(httr2)
request_obj <- request("https://en.wikipedia.org/wiki/Data_science")
response_obj <- req_perform(request_obj)
# Parse HTML
library(xml2)

Attaching package: 'xml2'
The following object is masked from 'package:httr2':

    url_parse
Code
html_obj <- response_obj |> resp_body_html()
html_obj |> xml_find_all('//h2') |> xml_text()
[1] "Contents"                              
[2] "Foundations"                           
[3] "Etymology"                             
[4] "Data science and data analysis"        
[5] "Data Science as an Academic Discipline"
[6] "Cloud computing for data science"      
[7] "Ethical consideration in data science" 
[8] "See also"                              
[9] "References"                            

Note: httr2 is a re-written version of the original httr package, which is now deprecated. You’ll still see lots of code using httr, however, so it’s good to know how both versions work. Click here for a helpful vignette on the original httr library.

XPath I: Selecting Elements

mypage.html
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content</p>
  <img class="footer-image m-5" src="footer.png">
</div>
  • '//div' matches all <div> elements in the document:

    <div class="container">
      <h1>Header</h1>
      <p id="page-content">Content</p>
      <img class="footer-image m-5" src="footer.png">
    </div>
  • '//div//img' matches <img> elements which are descendants (at any depth) of <div> elements:

    <img class="footer-image m-5" src="footer.png">

XPath II: Filtering by Attributes

mypage.html
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content</p>
  <img class="footer-image m-5" src="footer.png">
</div>
  • '//p[@id="page-content"]' matches <p> elements whose id is page-content3:

    <p id="page-content">Content</p>
  • Matching classes is a bit trickier:

    '//img[contains(concat(" ", normalize-space(@class), " "), " footer-image ")]'

    matches <img> elements with footer-image as one of their classes4

    <img class="footer-image m-5" src="footer.png">

Example: Math API

Code
library(httr2)
request_obj <- request("https://newton.vercel.app/api/v2/factor/x^2-1")
response_obj <- req_perform(request_obj)
writeLines(response_obj |> resp_body_string())
{"operation":"factor","expression":"x^2-1","result":"(x - 1) (x + 1)"}

Authentication

  • Most APIs don’t allow requests to be made by anonymous requesters, and require authentication.
  • For example, to access private GitHub repos using GitHub’s API, you’ll need to authenticate that you are in fact the one making the request

Authentication via gh

  • The gh package for R can handle this authentication process for you. For example, this private repo in my account does not show up if requested anonymously, but does show up if requested using gh with a Personal Access Token:
Code
library(gh)
tryCatch({
    result <- gh("GET /repos/jpowerj/private-repo-test")
    writeLines(paste0(result$name, ": ",result$description))
}, error = function(e) {
    writeLines(paste0("Failed to authenticate: ",e))
})
private-repo-test: Private repo example for DSAN5000
  • Your code should never contain authentication info, especially when using GitHub. In this case, I created an OS environment variable called GITHUB_TOKEN containing my Personal Access Token, which GH then uses to make authenticated requests
  • On Mac, this is accomplished by adding a line to your ~/.zprofile file: export GITHUB_TOKEN="<token goes here>"

Footnotes

  1. For (much) more on this topic, see this page from Prisma, a high-level “wrapper” that auto-syncs your DB structure with a TypeScript schema, so your code knows exactly “what’s inside” a variable whose content was retrieved from the DB…↩︎

  2. Your code should 🚨never🚨 contain authentication info, especially when using GitHub. In this case, I created an OS environment variable called GITHUB_TOKEN containing my Personal Access Token, which I then loaded using os.getenv() and provided to PyGithub.↩︎

  3. HTML requires ids to be unique to elements (and elements cannot have more than one id), meaning that this returns a single element for valid HTML code (not all webpages!). Also note the quotation marks around the attribute value, which XPath requires for string literals↩︎

  4. Your intuition may be to just use '//img[@class="footer-image"]'. Sadly, however, this will match only elements with footer-image as their only class. That is, it will match <img class="footer-image"> but not <img class="footer-image another-class">. This will usually fail, since most elements have several classes: e.g., in Bootstrap, <p class="p-5 m-3"></p> creates a <p> element with padding and margin set via Bootstrap's spacing scale (levels 5 and 3, not pixels).↩︎