Week 4: Data Formats and APIs

DSAN 5000: Data Science and Analytics

Class Sessions
Author

Prof. Jeff and Prof. James

Published

Thursday, September 19, 2024


Week 03 Recap

Image from Menczer, Fortunato, and Davis (2020, 90)

Git Commands

| Command | What It Does |
|---|---|
| git clone | Downloads a repo from the web to our local computer |
| git init | Creates a new, blank Git repository on our local computer (configuration/change-tracking stored in .git subfolder) |
| git add | Stages one or more files: Git will now track changes in these files |
| git reset | Undoes a git add |
| git status | Shows currently staged files and their status (created, modified, deleted) |
| git commit -m "message" | "Saves" the current version of all staged files, ready to be pushed to a backup dir or remote server like GitHub |
| git push | Transmits local commits to the remote server |
| git pull | Downloads commits from the remote server to the local computer |
| git merge | Merges remote versions of files with local versions |

Reproducible Docs/Literate Programming

  • 1980s: \(\LaTeX\) for \(\widehat{\mathcal{T}}\)ypesetting \(\sqrt{math}^2\)
  • 1990s: Python and R as powerful scripting languages (no compilation required)
  • 2000s/2010s: Interactive Python via Jupyter, fancy IDE for R called RStudio
  • 2020s: Quarto (using pandoc under the hood) enables use of markdown for formatting, \(\LaTeX\) for math, and both Python and R in same document, with choice of output formats (HTML, presentations, Word docs, …)

Quiz Time

Data Gathering

  • Preexisting data sources
  • Web scraping
  • Converting between formats

Preexisting Data Sources

Web Scraping

  • Fun fact: you can view a webpage’s HTML source code by right-clicking on the page and selecting “View Source”
    • On older websites, this means we can just request the page and parse the returned HTML
  • Less fun fact: modern web frameworks (React, Next.js) generate pages dynamically using JS, meaning that what you see on the page will not be visible in the HTML source
    • Data scraping still possible for these sites! Using browser automation tools like Selenium

Scraping Difficulty

| Difficulty | How is data loaded? | Solution / Example |
|---|---|---|
| 😊 Easy | Data in HTML source | "View Source" |
| 😐 Medium | Data loaded dynamically via API | "View Source", find API call, scrape programmatically |
| 😳 Hard | Data loaded dynamically [internally] via web framework | Use Selenium |

Data Structures: Foundations

  • Could be (is) a whole class, hypothetically taught in the Spring
  • Could be (is) a whole class just for one type of data (geographic/spatial)
  • For this class: some foundational principles that should let you figure out fancier data structures you encounter

Opening Datasets With Your Terminator Glasses On

  • What does a row represent?
  • What does a column represent?
  • What does a value in a cell represent?
  • Are there unique identifiers for the objects you care about?
Figure 1: What you should see when you look at a new dataset

From Raw Data to Clean Data

Data Structures: Simple \(\rightarrow\) Complex

| id | name | email |
|---|---|---|
| 0 | K. Desbrow | kd9@dailymail.com |
| 1 | D. Minall | dminall1@wired.com |
| 2 | C. Knight | ck2@microsoft.com |
| 3 | M. McCaffrey | mccaf4@nhs.uk |

(a) Record Data

| year | month | points |
|---|---|---|
| 2023 | Jan | 65 |
| 2023 | Feb | |
| 2023 | Mar | 42 |
| 2023 | Apr | 11 |

(b) Time-Series Data

| id | date | rating | num_rides |
|---|---|---|---|
| 0 | 2023-01 | 0.75 | 45 |
| 0 | 2023-02 | 0.89 | 63 |
| 0 | 2023-03 | 0.97 | 7 |
| 1 | 2023-06 | 0.07 | 10 |

(c) Panel Data

| Source | Target | Weight |
|---|---|---|
| IGF2 | IGF1R | 1 |
| IGF1R | TP53 | 2 |
| TP53 | EGFR | 0.5 |

(d) Network Data
Figure 2: Fake data via Mockaroo and Random.org. Protein-protein interaction network from Agrawal, Zitnik, and Leskovec (2018)

Tabular Data vs. Relational Data

  • All of the datasets on the previous slide are tabular
  • DBs like SQLite, MySQL model relationships within and between tabular datasets
  • Imagine you’re creating the backend for a Georgetown social network. How would you record users and friendships? Your intuition may be record data:
| id | name | friends |
|---|---|---|
| 1 | Purna | [2,3,4] |
| 2 | Jeff | [1,3,4,5,6] |
| 3 | James | [1,2,4,6] |
| 4 | Britt | [1,2,3] |
| 5 | Dr. Fauci | [2,6] |
| 6 | Pitbull | [2,5] |
Figure 3: Our first attempt at a data structure for our social network app’s backend

Long story short…

  • This doesn’t scale
  • Extremely inefficient to find whether two users are friends
  • Redundant information: Have to store friendship between A and B in both A’s row and B’s row

A Better Approach

  • Move the friendship data into its own table!
  • This table now represents relational data (the user table still corresponds to record data):
| user_id | name |
|---|---|
| 1 | Purna |
| 2 | Jeff |
| 3 | James |
| 4 | Britt |
| 5 | Dr. Fauci |
| 6 | Pitbull |
Figure 4: The user table in our relational structure
| id | friend_1 | friend_2 |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
| 4 | 2 | 3 |
| 5 | 2 | 4 |
| 6 | 2 | 5 |
| 7 | 2 | 6 |
| 8 | 3 | 4 |
| 9 | 3 | 6 |
| 10 | 5 | 6 |
Figure 5: The friendships table in our relational structure
  • May seem weird in terms of human readability, but think in terms of memory/computational efficiency: (a) scalable, (b) easy to find whether two users are friends (via sorting/searching algorithms), (c) no redundant info (see the pandas sketch below)
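As a concrete (if tiny) illustration of point (b), here is a minimal pandas sketch of looking up a friendship in the friendships table; the are_friends() helper name is hypothetical, not part of the lecture code:

```python
import pandas as pd

# Users and friendships tables, mirroring Figures 4 and 5
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'name': ['Purna', 'Jeff', 'James', 'Britt', 'Dr. Fauci', 'Pitbull'],
})
friendships = pd.DataFrame({
    'friend_1': [1, 1, 1, 2, 2, 2, 2, 3, 3, 5],
    'friend_2': [2, 3, 4, 3, 4, 5, 6, 4, 6, 6],
})

def are_friends(a, b):
    """Check whether users a and b are friends (each pair stored once, smaller id first)."""
    lo, hi = min(a, b), max(a, b)
    match = (friendships['friend_1'] == lo) & (friendships['friend_2'] == hi)
    return bool(match.any())

print(are_friends(2, 5))  # True: Jeff and Dr. Fauci
print(are_friends(1, 6))  # False: Purna and Pitbull
```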

DBs: Relational or Otherwise

  • For the rest of the lecture we zoom in on cases where data comes as individual files
  • But on top of the relational format from previous slide, there are also non-relational database formats, like the document-based format used by e.g. MongoDB1
  • In either case, data spread over many files \(\Rightarrow\) to obtain a single dataset we use queries.

[Diagram: File (.csv/.json/etc.) → read_csv() → Dataset]

Static datasets (individual files on disk)

[Diagram: Database (Table 1, Table 2, …, Table N) → Query → Dataset]

Datasets formed dynamically via database queries
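As a rough sketch of the query path, Python's built-in sqlite3 module can join the users and friendships tables from the previous slides into a single dataset; the in-memory database and table definitions below are purely illustrative, not a recommended schema:

```python
import sqlite3

# Build a throwaway in-memory database holding the two tables from Figures 4-5
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendships (id INTEGER PRIMARY KEY, friend_1 INTEGER, friend_2 INTEGER);
    INSERT INTO users VALUES (1,'Purna'), (2,'Jeff'), (3,'James'),
                             (4,'Britt'), (5,'Dr. Fauci'), (6,'Pitbull');
    INSERT INTO friendships (friend_1, friend_2) VALUES
        (1,2),(1,3),(1,4),(2,3),(2,4),(2,5),(2,6),(3,4),(3,6),(5,6);
""")

# One query joins the two tables into a single "friend pairs by name" dataset
query = """
    SELECT u1.name AS person, u2.name AS friend
    FROM friendships f
    JOIN users u1 ON f.friend_1 = u1.user_id
    JOIN users u2 ON f.friend_2 = u2.user_id;
"""
for person, friend in con.execute(query):
    print(person, "<->", friend)
```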

Data Formats

  • The most common formats, for most fields:
    • .csv: Comma-Separated Values
    • .tsv: Tab-Separated Values
    • .json: JavaScript Object Notation
    • .xls/.xlsx: Excel format
    • .dta: Stata format
    • .yml: More human-readable alternative to JSON

.csv / .tsv

👍

my_data.csv
index,var_1,var_2,var_3
A,val_A1,val_A2,val_A3
B,val_B1,val_B2,val_B3
C,val_C1,val_C2,val_C3
D,val_D1,val_D2,val_D3

(👎)

my_data.tsv
index var_1 var_2 var_3
A val_A1  val_A2  val_A3
B val_B1  val_B2  val_B3
C val_C1  val_C2  val_C3
D val_D1  val_D2  val_D3

source("../_globals.r")
library(readr)
data <- read_csv("assets/my_data.csv")
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): index, var_1, var_2, var_3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
disp(data)
# | index | var_1 | var_2 | var_3 |
# | - | - | - | - |
# | A | val_A1 | val_A2 | val_A3 |
# | B | val_B1 | val_B2 | val_B3 |
# | C | val_C1 | val_C2 | val_C3 |
# | D | val_D1 | val_D2 | val_D3 | 
  • Python: pd.read_csv() (from Pandas library)
  • R: read_csv() (from readr library)
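For reference, a minimal Python sketch of the same load, assuming my_data.csv and my_data.tsv sit in the working directory:

```python
import pandas as pd

# Read the CSV shown above; the first row becomes the column names
df = pd.read_csv("my_data.csv")
print(df.head())

# For .tsv files, the same function works with a tab separator
df_tsv = pd.read_csv("my_data.tsv", sep="\t")
```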

.json

courses.json
{
  "dsan5000": {
    "title": "Data Science and Analytics",
    "credits": 3,
    "lectures": [
      "Intro",
      "Tools and Workflow"
    ]
  },
  "dsan5100": {
    "title": "Probabilistic Modeling and Statistical Computing",
    "credits": 3,
    "lectures": [
      "Intro",
      "Conditional Probability"
    ]
  }
}
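A minimal sketch of loading this file in Python with the standard-library json module (assuming courses.json is in the working directory):

```python
import json

# Parse the JSON file into nested Python dicts/lists
with open("courses.json", "r") as infile:
    courses = json.load(infile)

print(courses["dsan5000"]["title"])     # "Data Science and Analytics"
print(courses["dsan5100"]["lectures"])  # ["Intro", "Conditional Probability"]
```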

Other Formats

  • .xls/.xlsx: Requires special libraries in Python/R
  • .dta: Stata format, but can be read/written to in Python/R
    • Python: Pandas has built-in pd.read_stata() for reading and DataFrame.to_stata() for writing
    • R: read_dta() from the haven package (part of the tidyverse)
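A hedged Python sketch of both cases; the filenames are hypothetical, and pd.read_excel() requires an engine such as openpyxl to be installed separately:

```python
import pandas as pd

# Excel: needs an extra engine like openpyxl installed (pip install openpyxl)
xlsx_df = pd.read_excel("survey_results.xlsx", sheet_name=0)

# Stata: built into Pandas, no extra dependencies
dta_df = pd.read_stata("survey_results.dta")

# Writing back out to Stata format
xlsx_df.to_stata("survey_results_copy.dta")
```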

Web Scraping

| | Difficulty | How is data loaded? | Solution / Example |
|---|---|---|---|
| This section → | 😊 Easy | Data in HTML source | "View Source" |
| Next section → | 😐 Medium | Data loaded dynamically via API | "View Source", find API call, scrape programmatically |
| Future weeks → | 😳 Hard | Data loaded dynamically [internally] via web framework | Use Selenium |

Scraping HTML with requests and BeautifulSoup

requests Documentation | BeautifulSoup Documentation

Code
import requests
# Perform request
response = requests.get("https://en.wikipedia.org/wiki/Data_science")
# Parse HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
all_header_elts = soup.find_all("h2")
all_header_text = [elt.text for elt in all_header_elts]
#section_headers = [h.find("span", {'class': 'mw-headline'}).text for h in all_headers[1:]]
print("\n".join(all_header_text))
Contents
Foundations
Etymology
Data science and data analysis
Data Science as an Academic Discipline
Cloud computing for data science
Ethical consideration in data science
See also
References

The Power of find_all()

data_page.html
<div class="all-the-data">
    <h4>First Dataset</h4>
    <div class="data-1">
        <div class="dataval">1</div>
        <div class="dataval">2</div>
        <div class="dataval">3</div>
    </div>
    <h4>Second Dataset</h4>
    <div class="data-2">
        <ul>
            <li>4.0</li>
            <li>5.5</li>
            <li>6.7</li>
        </ul>
    </div>
</div>
Figure 6: Data in page elements (<div>, <li>)

First Dataset

1
2
3

Second Dataset

  • 4.0
  • 5.5
  • 6.7
Figure 7: The code from Figure 6, rendered by your browser

from bs4 import BeautifulSoup
# page_html holds the contents of data_page.html (Figure 6)
soup = BeautifulSoup(page_html, 'html.parser')
ds1_elt = soup.find("div", class_='data-1')
ds1 = [e.text for e in ds1_elt.find_all("div")]
ds2_elt = soup.find("div", {'class': 'data-2'})
ds2 = [e.text for e in ds2_elt.find_all("li")]
Figure 8: BeautifulSoup code used to parse HTML

print(f"dataset-1: {ds1}\ndataset-2: {ds2}")
dataset-1: ['1', '2', '3']
dataset-2: ['4.0', '5.5', '6.7']
Figure 9: Python lists containing parsed data

Parsing HTML Tables

table_data.html
<table>
<thead>
    <tr>
        <th>X1</th><th>X2</th><th>X3</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>1</td><td>3</td><td>5</td>
    </tr>
    <tr>
        <td>2</td><td>4</td><td>6</td>
    </tr>
</tbody>
</table>
Figure 10: Data in HTML table format

X1 X2 X3
1 3 5
2 4 6
Figure 11: The HTML table code, as rendered by your browser

from bs4 import BeautifulSoup
# table_html holds the contents of table_data.html (Figure 10)
soup = BeautifulSoup(table_html, 'html.parser')
thead = soup.find("thead")
headers = [e.text for e in thead.find_all("th")]
tbody = soup.find("tbody")
rows = tbody.find_all("tr")
data = [[e.text for e in r.find_all("td")]
            for r in rows]
Figure 12: BeautifulSoup code used to parse HTML

print(f"headers: {headers}\ndata: {data}")
headers: ['X1', 'X2', 'X3']
data: [['1', '3', '5'], ['2', '4', '6']]
Figure 13: Python lists containing parsed data
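One natural next step, not part of the figures above, is to load the parsed headers and rows into a pandas DataFrame:

```python
import pandas as pd

# Values produced by the parsing code in Figure 12 (see Figure 13)
headers = ['X1', 'X2', 'X3']
data = [['1', '3', '5'], ['2', '4', '6']]

# Build a DataFrame and convert the string cells to integers
df = pd.DataFrame(data, columns=headers).astype(int)
print(df)
```

Pandas also provides pd.read_html(), which parses every <table> element on a page directly into a list of DataFrames (it needs an HTML parser such as lxml or BeautifulSoup installed).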

APIs

Application Programming Interfaces: developer-facing part of data pipeline/service. Hides unnecessary details:

| Example | Care about 🧐 | Don't care about 🙅‍♂️ |
|---|---|---|
| Electrical outlet | Electricity | Details of Alternating/Direct Currents |
| Water fountain | Water | Details of how it's pumped into the fountain |
| Car | Accelerate, brake, reverse | Details of combustion engine |

What Does an API Do?

Exposes endpoints for use by developers, without requiring them to know the nuts and bolts of your pipeline/service:

| Example | Endpoint | Not Exposed |
|---|---|---|
| Electrical outlet | Socket | Internal wiring |
| Water fountain | Aerator | Water pump |
| Car | Pedals, Steering wheel, etc. | Engine |

When I’m teaching programming to students in refugee camps who may have never used a computer before, I try to use the idea of “robots”: a program is a robot trained to sit there and wait for inputs, then process them in some way and spit out some output. APIs really capture this notion, honestly.

Example: Math API

Code
import requests
response = requests.get("https://newton.vercel.app/api/v2/factor/x^2-1")
print(response.json())
{'operation': 'factor', 'expression': 'x^2-1', 'result': '(x - 1) (x + 1)'}

Math API Endpoints

| Operation | API Endpoint | Result |
|---|---|---|
| Simplify | /simplify/2^2+2(2) | 8 |
| Factor | /factor/x^2 + 2x | x (x + 2) |
| Derive | /derive/x^2+2x | 2 x + 2 |
| Integrate | /integrate/x^2+2x | 1/3 x^3 + x^2 + C |
| Find 0's | /zeroes/x^2+2x | [-2, 0] |
| Find Tangent | /tangent/2\|x^3 | 12 x + -16 |
| Area Under Curve | /area/2:4\|x^3 | 60 |
| Cosine | /cos/pi | -1 |
| Sine | /sin/0 | 0 |
| Tangent | /tan/0 | 0 |
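As a quick usage sketch, the same requests pattern from the example above can loop over several of these endpoints (here reusing the x^2-1 expression; expressions containing characters like + may need to be URL-encoded first):

```python
import requests

BASE_URL = "https://newton.vercel.app/api/v2"
expression = "x^2-1"  # same expression as the example above

# Hit several endpoints from the table with the same expression
for operation in ["simplify", "factor", "derive", "zeroes"]:
    response = requests.get(f"{BASE_URL}/{operation}/{expression}")
    print(operation, "->", response.json()["result"])
```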

Authentication

  • Unlike the math API, most APIs do not allow anonymous requests; they require authentication.
  • For example, you can access public GitHub repos anonymously, but to access private GitHub repos using GitHub’s API, you’ll need to authenticate that you are in fact the one making the request

Authentication via PyGithub

PyGithub Installation

Install using the following terminal/shell command [Documentation]

pip install PyGithub

PyGithub can handle authentication for you. Example: this private repo in my account does not show up unless the request is authenticated (via a Personal Access Token)2:

import github
g = github.Github()
try:
  g.get_repo("jpowerj/private-repo-test")
except Exception as e:
  print(e)
Figure 14: Using the GitHub API without authentication
# Load the access token securely
import os
my_access_token = os.getenv('GITHUB_TOKEN')
import github
# Use the access token to make an API request
auth = github.Auth.Token(my_access_token)
g = github.Github(auth=auth)
g.get_user().get_repo("private-repo-test")
Figure 15: Using the GitHub API with authentication

References

Agrawal, Monica, Marinka Zitnik, and Jure Leskovec. 2018. “Large-Scale Analysis of Disease Pathways in the Human Interactome.” In PACIFIC SYMPOSIUM on BIOCOMPUTING 2018: Proceedings of the Pacific Symposium, 111–22. World Scientific.
Menczer, Filippo, Santo Fortunato, and Clayton A. Davis. 2020. A First Course in Network Science. Cambridge University Press.

Appendix: Code Examples in R

Scraping HTML with httr2 and xml2

httr2 Documentation | xml2 Documentation

Code
library(httr2)
request_obj <- request("https://en.wikipedia.org/wiki/Data_science")
response_obj <- req_perform(request_obj)
# Parse HTML
library(xml2)

Attaching package: 'xml2'
The following object is masked from 'package:httr2':

    url_parse
Code
html_obj <- response_obj |> resp_body_html()
html_obj |> xml_find_all('//h2') |> xml_text()
[1] "Contents"                              
[2] "Foundations"                           
[3] "Etymology"                             
[4] "Data science and data analysis"        
[5] "Data Science as an Academic Discipline"
[6] "Cloud computing for data science"      
[7] "Ethical consideration in data science" 
[8] "See also"                              
[9] "References"                            

Note: httr2 is a re-written version of the original httr package, which is now deprecated. You’ll still see lots of code using httr, however, so it’s good to know how both versions work. Click here for a helpful vignette on the original httr library.

XPath I: Selecting Elements

mypage.html
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content</p>
  <img class="footer-image m-5" src="footer.png">
</div>
  • '//div' matches all <div> elements in the document:

    <div class="container">
      <h1>Header</h1>
      <p id="page-content">Content</p>
      <img class="footer-image m-5" src="footer.png">
    </div>
  • '//div//img' matches <img> elements which are descendants (at any depth) of <div> elements:

    <img class="footer-image m-5" src="footer.png">

XPath II: Filtering by Attributes

mypage.html
<div class="container">
  <h1>Header</h1>
  <p id="page-content">Content</p>
  <img class="footer-image m-5" src="footer.png">
</div>
  • '//p[@id="page-content"]' matches <p> elements whose id is page-content3:

    <p id="page-content">Content</p>
  • Matching classes is a bit trickier:

    '//img[contains(concat(" ", normalize-space(@class), " "), " footer-image ")]'

    matches <img> elements with footer-image as one of their classes4

    <img class="footer-image m-5" src="footer.png">

Example: Math API

Code
library(httr2)
request_obj <- request("https://newton.vercel.app/api/v2/factor/x^2-1")
response_obj <- req_perform(request_obj)
writeLines(response_obj |> resp_body_string())
{"operation":"factor","expression":"x^2-1","result":"(x - 1) (x + 1)"}

Authentication

  • Most APIs don’t allow requests to be made by anonymous requesters, and require authentication.
  • For example, to access private GitHub repos using GitHub’s API, you’ll need to authenticate that you are in fact the one making the request

Authentication via gh

  • The gh package for R can handle this authentication process for you. For example, this private repo in my account does not show up if requested anonymously, but does show up if requested using gh with a Personal Access Token:
Code
library(gh)
tryCatch({
    result <- gh("GET /repos/jpowerj/private-repo-test")
    writeLines(paste0(result$name, ": ",result$description))
}, error = function(e) {
    writeLines(paste0("Failed to authenticate: ",e))
})
private-repo-test: Private repo example for DSAN5000
  • Your code should never contain authentication info, especially when using GitHub. In this case, I created an OS environment variable called GITHUB_TOKEN containing my Personal Access Token, which GH then uses to make authenticated requests
  • On Mac, this is accomplished by adding a line to your ~/.zprofile file: export GITHUB_TOKEN="<token goes here>"

Footnotes

  1. For (much) more on this topic, see this page from Prisma, a high-level “wrapper” that auto-syncs your DB structure with a TypeScript schema, so your code knows exactly “what’s inside” a variable whose content was retrieved from the DB…↩︎

  2. Your code should 🚨never🚨 contain authentication info, especially when using GitHub. In this case, I created an OS environment variable called GITHUB_TOKEN containing my Personal Access Token, which I then loaded using os.getenv() and provided to PyGithub.↩︎

  3. HTML requires ids to be unique to elements (and elements cannot have more than one id), meaning that this returns a single element for valid HTML code (not all webpages!). Also note the quotation marks around the attribute value, which XPath requires for string literals↩︎

  4. Your intuition may be to just use '//img[@class="footer-image"]'. Sadly, however, this will match only elements with footer-image as their only class. That is, it will match <img class="footer-image"> but not <img class="footer-image another-class">. This will usually fail, since most elements have several classes: e.g., in Bootstrap, <p class="p-5 m-3"></p> creates a <p> element with padding and margin set via Bootstrap's spacing scale (levels 5 and 3, not pixels).↩︎