Week 3: Data Science Workflow

DSAN 5000: Data Science and Analytics

Class Sessions
Author

Prof. Jeff and Prof. James

Published

Thursday, September 12, 2024


Schedule

Today’s Planned Schedule (Section 03):

         Start    End      Topic
Lecture  3:30pm   4:00pm   How the Internet Works
         4:00pm   4:30pm   Quarto and Reproducible Research
         4:30pm   5:00pm   Git and GitHub
Break!   5:00pm   5:10pm
Lab      5:10pm   5:50pm   Lab Demonstrations
         5:50pm   6:00pm   Lab Assignment Overview

How the Internet Works

With great power… (Image from Amazon.com)

…comes great responsibility. (Archive.org, July 1, 2010)

Intranet vs. Internet

  • Crucial distinction: can set up a “mini-internet”, an intranet, within your own home
  • Organizations (businesses, government agencies) with security needs often do exactly this: link a set of computers and servers together, no outside access

  • Internet = basically a giant intranet, open to the whole world

Key Building Blocks: Locating Servers

  • IP Addresses (Internet Protocol addresses): Numeric addresses for uniquely identifying computers on a network
  • URLs (Uniform Resource Locators): The more human-readable website addresses you’re used to: google.com, georgetown.edu, etc.
    • Built on top of IP addresses, via a directory (the Domain Name System, DNS) which maps the domain name in a URL → an IP address
    • georgetown.edu, for example, is really 23.185.0.2 [1]
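
The name → address lookup described above can be sketched with Python's standard library. This is only an illustration, not how a browser implements it: the helper names hostname_of, is_ip_address, and lookup_ip are made up for this example, and the real lookup (socket.gethostbyname) needs a working network connection.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def hostname_of(url):
    """Pull the host name out of a URL, e.g. 'georgetown.edu'."""
    # urlparse only fills in .hostname when the URL starts with a scheme
    if "://" not in url:
        url = "https://" + url
    return urlparse(url).hostname


def is_ip_address(text):
    """Return True if text is already a numeric IP address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def lookup_ip(url):
    """Ask the system's DNS resolver for an IPv4 address."""
    host = hostname_of(url)
    if is_ip_address(host):
        return host  # already numeric, nothing to look up
    return socket.gethostbyname(host)  # needs network access


if __name__ == "__main__":
    print(hostname_of("https://georgetown.edu/admissions"))
    print(is_ip_address("127.0.0.1"))
    # Uncomment with a working internet connection. The result can change
    # over time, since sites are free to move to new IP addresses.
    # print(lookup_ip("georgetown.edu"))
```

Splitting the steps this way makes the point of the slide visible: the human-readable name and the numeric address are two different things, and a lookup is only needed when you start from the name.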

What Happens When I Visit a URL/IP?

  • HTTP(S) (HyperText Transfer Protocol (Secure)): common syntax for web clients to make requests and servers to respond
    • Several types of requests can be made: GET, POST, HEAD; for now, we focus on the GET request, the request your browser makes by default
  • HTML (HyperText Markup Language): For specifying layout and content of page
    • Structure is analogous to boxes of content: <html> box contains <head> (metadata, e.g., page title) and <body> (page content) boxes, <body> box contains e.g. header, footer, navigation bar, and main content of page.
    • Modern webpages also include CSS (Cascading Style Sheets) for styling this content, and JavaScript [2] for interactivity (changing/updating content)
    • HTML allows linking to another page with a special anchor tag (<a>): <a href="https://npr.org/">news</a> creates a link, so when you click “news”, browser will request (fetch the HTML for) the URL https://npr.org
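
To make the "boxes of content" and anchor-tag ideas concrete, here is a small sketch that reads a page the way a very simple client might, using Python's built-in html.parser. The LinkCollector class and the sample page are invented for this example; only the npr.org link comes from the slide.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links.append((self._current_href, "".join(self._text_parts)))
            self._current_href = None


page = """
<html>
  <head><title>Example page</title></head>
  <body>
    <p>Read the <a href="https://npr.org/">news</a> today.</p>
  </body>
</html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # [('https://npr.org/', 'news')]
```

Each pair is exactly what a browser needs for a link: the text to show ("news") and the URL to request when it is clicked.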

HTTP(S) Requests in Action

Image from Menczer, Fortunato, and Davis (2020, 90)

How Does a Web Server Work?

  • We use the term “server” metonymously [3]
    • Sometimes we mean the hardware, the box of processors and hard drives
    • But, sometimes we mean the software that runs on the hardware
  • A web server, in the software sense, is a program that is always running, 24/7
  • Waits for requests (via HTTPS), then serves HTML code in response (also via HTTPS)
hello_server.py
from flask import Flask

app = Flask(__name__)

# Requests for the root URL ("/") get the hello page
@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"

# Requests for "/hack" get a different page
@app.route("/hack")
def hacker_detected():
    return "<p>Hacker detected, pls stop</p>"
$ flask --app hello_server run
 * Serving Flask app 'hello_server'
 * Running on http://127.0.0.1:5000 (CTRL+C to quit)
127.0.0.1 [06/Sep/2023 00:11:05] "GET / HTTP" 200
127.0.0.1 [06/Sep/2023 00:11:06] "GET /hack HTTP" 200
Figure 1: Basic web server (written in Flask)

Figure 2: [Browser-parsed] responses to GET requests
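
For comparison, the same request/response cycle can be sketched without Flask, using only Python's standard library (http.server for the server side, urllib for the client side). This is a swapped-in technique for illustration, not the Flask setup from Figure 1; the names PAGES, HelloHandler, and fetch are made up here, and the sketch speaks plain HTTP on the local machine only.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same two routes as the Flask example, mapped to the HTML they return
PAGES = {
    "/": "<p>Hello, World!</p>",
    "/hack": "<p>Hacker detected, pls stop</p>",
}


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404, "Not Found")
            return
        encoded = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

    def log_message(self, format, *args):
        pass  # keep the demo quiet; the default prints one line per request


def fetch(url):
    """Play the client: send a GET request, return (status code, raw HTML)."""
    # An empty ProxyHandler ignores system proxy settings for this local request
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.status, response.read().decode("utf-8")


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

home = fetch(f"http://127.0.0.1:{port}/")
hack = fetch(f"http://127.0.0.1:{port}/hack")
print(home)
print(hack)

server.shutdown()
server.server_close()
```

Note that fetch returns the raw HTML string: turning that string into something visual is the browser's job, not the server's.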

How Does a Web Client Work?

  • Once the server has responded to your request, you still only have raw HTML code
  • So, the browser is the program that renders this raw HTML code as a visual, (possibly) interactive webpage
  • As a data scientist, the most important thing to know is that different browsers can render the same HTML differently!
  • A headache when pages are accessed through laptops
  • A nightmare when pages are accessed through laptops and mobile

Connecting to Servers

  • We’ve talked about the shell on your local computer, as well as the Georgetown Domains shell
  • We used Georgetown Domains’ web interface to access that shell, but you can remotely connect to any other shell from your local computer using the ssh command!

Transferring Files to/from Servers

  • Recall the copy command, cp, for files on your local computer
  • There is a remote equivalent, scp (Secure Copy Protocol), which you can use to copy files between your local computer and remote servers, in either direction

Important Alternative: rsync

  • Similar to scp, with same syntax, except it synchronizes (only copies files which are different or missing)
sync_files.sh
rsync -avz source_directory/ user@remote_server:/path/to/destination/
  • -a (“archive”) tells rsync to copy recursively and preserve file attributes such as permissions and modification times
  • -v (“verbose”) tells rsync to print information as it copies
  • -z (“zip/compress”) tells rsync to compress file data while it is sent and decompress it on the receiving side (which can noticeably speed up transfers over slow connections)
  • https://explainshell.com/explain?cmd=rsync+-avz
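
The "only copies files which are different or missing" idea can be sketched in a few lines of Python. This is a simplified, local-only illustration and not how rsync is implemented: by default rsync decides what to send by comparing file sizes and modification times, and it works across a network connection. The function name sync_dir and the demo file names are made up for this example.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def sync_dir(source, destination):
    """Copy files from source to destination only if missing or different.

    Returns the relative paths that were actually copied.
    """
    source, destination = Path(source), Path(destination)
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(source)
        dst_file = destination / relative
        # shallow=False compares the actual file contents
        if dst_file.exists() and filecmp.cmp(src_file, dst_file, shallow=False):
            continue  # already up to date, skip it
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also keeps timestamps
        copied.append(relative.as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source_directory"
    dst = Path(tmp) / "destination"
    (src / "data").mkdir(parents=True)
    (src / "notes.txt").write_text("hello")
    (src / "data" / "values.csv").write_text("1,2,3")

    first_run = sync_dir(src, dst)   # everything is missing, so all is copied
    second_run = sync_dir(src, dst)  # nothing changed, so nothing is copied
    (src / "notes.txt").write_text("hello again")
    third_run = sync_dir(src, dst)   # only the edited file is copied

print(first_run)
print(second_run)
print(third_run)
```

The second and third runs are where synchronizing pays off compared to a plain copy: unchanged files are never sent again.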

Quarto and Reproducible Research

Why Do We Need Reproducible Research?

  • Main human motivations (Max Weber): Wealth, Prestige, Power → “TED talk circuit”

Science vs. Human Fallibility

  • Scientific method + replicability/pre-registration = “Tying ourselves to the mast”

John William Waterhouse, Ulysses and the Sirens, Public domain, via Wikimedia Commons
  • If we aim to disprove (!) our hypotheses, and we pre-register our methodology, we are bound to discovering truth, even when it is disadvantageous to our lives…

Human Fallibility is Winning…

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Those are some of the telling figures that emerged from Nature’s survey of 1,576 researchers (Baker 2016)

source("../_globals.r")
library(dplyr)

library(ggplot2)
ga_lawyers <- c(21362, 22254, 23134, 23698, 24367, 24930, 25632, 26459, 27227, 27457)
ski_df <- tibble::tribble(
  ~year, ~varname, ~value,
  2000, "ski_revenue", 1551,
  2001, "ski_revenue", 1635,
  2002, "ski_revenue", 1801,
  2003, "ski_revenue", 1827,
  2004, "ski_revenue", 1956,
  2005, "ski_revenue", 1989,
  2006, "ski_revenue", 2178,
  2007, "ski_revenue", 2257,
  2008, "ski_revenue", 2476,
  2009, "ski_revenue", 2438,
)
ski_mean <- mean(ski_df$value)
ski_sd <- sd(ski_df$value)
ski_df <- ski_df %>% mutate(val_scaled = 12*value, val_norm = (value - ski_mean)/ski_sd)
law_df <- tibble::tibble(year=2000:2009, varname="ga_lawyers", value=ga_lawyers)
law_mean <- mean(law_df$value)
law_sd <- sd(law_df$value)
law_df <- law_df %>% mutate(val_norm = (value - law_mean)/law_sd)
spur_df <- dplyr::bind_rows(ski_df, law_df)
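# Set the factor levels explicitly: by default factor() sorts them alphabetically
# ("ga_lawyers" first), which would attach each label to the wrong series
ggplot(spur_df, aes(x=year, y=val_norm, color=factor(varname, levels = c("ski_revenue","ga_lawyers"), labels = c("Ski Revenue","Lawyers in Georgia")))) +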
  stat_smooth(method="loess", se=FALSE) +
  geom_point(size=g_pointsize/4) +
  labs(
    fill="",
    title="Ski Revenue vs. Georgia Lawyers",
    x="Year",
    color="Correlation: 99.2%",
    linetype=NULL
  ) +
  dsan_theme("custom", 18) +
  scale_x_continuous(
    breaks=seq(from=2000, to=2014, by=2)
  ) +
  #scale_y_continuous(
  #  name="Total Revenue, Ski Facilities (Million USD)",
  #  sec.axis = sec_axis(~ . * law_sd + law_mean, name = "Number of Lawyers in Georgia")
  #) +
  scale_y_continuous(breaks = -1:1,
    labels = ~ . * round(ski_sd,1) + round(ski_mean,1),
    name="Total Revenue, Ski Facilities (Million USD)",
    sec.axis = sec_axis(~ . * law_sd + law_mean, name = "Number of Lawyers in Georgia")) +
  expand_limits(x=2010) +
  #geom_hline(aes(yintercept=x, color="Mean Values"), as.data.frame(list(x=0)), linewidth=0.75, alpha=1.0, show.legend = TRUE) +
  scale_color_manual(
    breaks=c('Ski Revenue', 'Lawyers in Georgia'),
    values=c('Ski Revenue'=cbPalette[1], 'Lawyers in Georgia'=cbPalette[2]))

cor(ski_df$value, law_df$value)
[1] 0.9921178
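
For readers more comfortable in Python, the normalization and the correlation from the R code above can be reproduced by hand with the same two data series. This is a plain-Python sketch (no pandas or NumPy assumed); the helper names sample_sd, standardize, and pearson_r are made up for this example.

```python
from math import sqrt

ski_revenue = [1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438]
ga_lawyers = [21362, 22254, 23134, 23698, 24367, 24930, 25632, 26459, 27227, 27457]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1, like R's sd())."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardize(values):
    """Convert values to z-scores: (value - mean) / sd."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Pearson correlation: sum of z-score products, divided by n - 1."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


r = pearson_r(ski_revenue, ga_lawyers)
print(round(r, 4))  # 0.9921
```

Two series with nothing to do with each other line up almost perfectly simply because both trend upward over the same decade, which is the point of the plot.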

R vs. RStudio vs. Quarto

RStudio

  • GUI wrapper around R (Integrated Development Environment = IDE)
  • Run blocks of R code (.qmd chunks)

The R Language

  • Programming language
  • Runs scripts via Rscript <script>.r

+

Jupyter

  • GUI wrapper around Python (IDE)
  • Run blocks of Python code (.ipynb cells)

The Python Language

  • Scripting language
  • On its own, just runs scripts via python <script>.py

Reproducibility and Literate Programming

  • Reproducible document: includes both the content (text, tables, figures) and the code or instructions required to generate that content.
    • Designed to ensure that others can reproduce the same document, including its data analysis, results, and visualizations, consistently and accurately.
    • tldr: If you’re copying-and-pasting results from your code output to your results document, a red flag should go off in your head!
  • Literate programming is a coding and documentation approach where code and explanations of the code are combined in a single document.
    • Emphasizes clear and understandable code by interleaving human-readable text (explanations, comments, and documentation) with executable code.
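
The copy-and-paste red flag can be illustrated with a tiny sketch: the sentence that reports the results is built by the code, so it cannot drift out of sync with the numbers. This uses plain Python string formatting as a stand-in for what a reproducible-document tool does when it renders; the variable names and the wording of the sentence are made up for this example, and the data is the ski revenue series used earlier.

```python
ski_revenue = [1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438]

# The numbers in the sentence are computed, never typed in by hand
average = sum(ski_revenue) / len(ski_revenue)
growth = (ski_revenue[-1] - ski_revenue[0]) / ski_revenue[0]

report = (
    f"Across {len(ski_revenue)} years, ski facility revenue averaged "
    f"{average:,.1f} million USD and grew by {growth:.0%} overall."
)
print(report)
```

If a data point is corrected later, rerunning the code updates the sentence automatically, which is exactly what a pasted-in number cannot do.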

Single Source, Many Outputs

  • We can create content (text, code, results, graphics) within a source document, and then use different weaving engines to create different document types:
  • Documents
    • Web pages (HTML)
    • Word documents
    • PDF files
  • Presentations
    • HTML
    • PowerPoint
  • Websites/blogs
  • Books
  • Dashboards
  • Interactive documents
  • Formatted journal articles

Interactivity!

  • Are we “hiding something” by choosing a specific bin width? Make it transparent!
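
A quick sketch of why bin width matters: the same values can tell a different-looking story depending on how they are binned. The histogram helper below is made up for this example and reuses the ski revenue series from earlier in the lecture.

```python
from collections import Counter

ski_revenue = [1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438]


def histogram(values, bin_width):
    """Count values per bin; each bin is labeled by its left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


for width in (500, 250):
    print(width, histogram(ski_revenue, width))
# 500 {1500: 6, 2000: 4}
# 250 {1500: 2, 1750: 4, 2000: 1, 2250: 3}
```

With a width of 500 the data looks like two similar-sized groups; with a width of 250 a dip around 2000 appears. Letting the reader change the width interactively makes that choice visible instead of hidden.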

Git and GitHub

Git vs. GitHub

(Important distinction!)

Git

  • Command-line program
  • git init in shell to create a repository
  • git add to track files
  • git commit to commit changes to tracked files

GitHub

  • Code hosting website
  • Create a repository (repo) for each project
  • Can clone repos onto your local machine
git push/git pull: The link between the two!

Git Diagram

Initializing a Repo

  • Let’s make a directory for our project called cool-project, and initialize a Git repo for it
user@hostname:~$ mkdir cool-project
user@hostname:~$ cd cool-project
user@hostname:~/cool-project$ git init
Initialized empty Git repository in /home/user/cool-project/.git/
  • This creates a hidden folder, .git, in the directory:
user@hostname:~/cool-project$ ls -lah
total 12K
drwxr-xr-x  3 user user 4.0K May 28 00:53 .
drwxr-xr-x 12 user user 4.0K May 28 00:53 ..
drwxr-xr-x  7 user user 4.0K May 28 00:53 .git

Adding and Committing a File

We’re writing Python code, so let’s create and track cool_code.py:

user@hostname:~/cool-project$ touch cool_code.py
user@hostname:~/cool-project$ git add cool_code.py
user@hostname:~/cool-project$ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   cool_code.py

user@hostname:~/cool-project$ git commit -m "Initial version of cool_code.py"
[main (root-commit) b40dc25] Initial version of cool_code.py
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 cool_code.py

The Commit Log

  • View the commit log using git log:
user@hostname:~/cool-project$ git log
commit b40dc25b14c0426b06c8d182184e147853f3c12e (HEAD -> main)
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:37:02 2023 +0000

    Initial version of cool_code.py
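
Where do these long hashes come from? Git names every object it stores by a hash of the object's contents. As a hedged illustration, the sketch below reproduces the ID Git gives to a file's contents (a "blob"), assuming the repository uses Git's default SHA-1 hashing; commit IDs are computed in the same spirit, from a commit object that records the files, the parent commit, the author, the date, and the message. The function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives to a file's contents (a "blob" object)."""
    # Git hashes a short header ("blob <size>" plus a zero byte) and the content
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_file_id = git_blob_id(b"")  # e.g. a file right after `touch`
print(empty_file_id)      # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(empty_file_id[:7])  # Git often shows just a short prefix like this
```

This is also why the commit output shows a 7-character ID while git log shows 40 characters: the short form is just the beginning of the full hash.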

Making Changes

user@hostname:~/cool-project$ git status
On branch main
nothing to commit, working tree clean
user@hostname:~/cool-project$ echo "1 + 1" >> cool_code.py
user@hostname:~/cool-project$ more cool_code.py
1 + 1
user@hostname:~/cool-project$ git add cool_code.py
user@hostname:~/cool-project$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   cool_code.py

user@hostname:~/cool-project$ git commit -m "Added code to cool_code.py"
[main e3bc497] Added code to cool_code.py
 1 file changed, 1 insertion(+)

The git log will show the new version:

user@hostname:~/cool-project$ git log
commit e3bc497acbb5a487566ff2014dcd7b83d0c75224 (HEAD -> main)
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:38:05 2023 +0000

    Added code to cool_code.py

commit b40dc25b14c0426b06c8d182184e147853f3c12e
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:37:02 2023 +0000

    Initial version of cool_code.py

More Changes

user@hostname:~/cool-project$ echo "2 + 2" >> cool_code.py
user@hostname:~/cool-project$ more cool_code.py
1 + 1
2 + 2
user@hostname:~/cool-project$ git add cool_code.py
user@hostname:~/cool-project$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   cool_code.py

user@hostname:~/cool-project$ git commit -m "Second version of cool_code.py"
[main 4007db9] Second version of cool_code.py
 1 file changed, 1 insertion(+)

And the git log:

user@hostname:~/cool-project$ git log
commit 4007db9a031ca134fe09eab840b2bc845366a9c1 (HEAD -> main)
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:39:28 2023 +0000

    Second version of cool_code.py

commit e3bc497acbb5a487566ff2014dcd7b83d0c75224
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:38:05 2023 +0000

    Added code to cool_code.py

commit b40dc25b14c0426b06c8d182184e147853f3c12e
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:37:02 2023 +0000

    Initial version of cool_code.py

Undoing a Commit I

First check the git log to find the hash of the commit you want to go back to:

commit e3bc497acbb5a487566ff2014dcd7b83d0c75224
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:38:05 2023 +0000

    Added code to cool_code.py

Undoing a Commit II

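  • Careful: git reset --hard throws away uncommitted changes for good and removes the later commits from the log, so treat this as irreversible!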
user@hostname:~/cool-project$ git reset --hard e3bc497ac
HEAD is now at e3bc497 Added code to cool_code.py
user@hostname:~/cool-project$ git log
commit e3bc497acbb5a487566ff2014dcd7b83d0c75224 (HEAD -> main)
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:38:05 2023 +0000

    Added code to cool_code.py

commit b40dc25b14c0426b06c8d182184e147853f3c12e
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:37:02 2023 +0000

    Initial version of cool_code.py
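
What the reset did to the history can be pictured with a toy model. This is only a mental model of a single branch's log, not how Git stores anything: real Git keeps commits as a graph of objects, and git reset --hard also rewrites the staging area and the files in your working directory. The ToyRepo class is made up for this example; the short hashes and messages are the ones from the demo.

```python
class ToyRepo:
    """Toy model of one branch: an ordered list of (short_hash, message)."""

    def __init__(self):
        self.commits = []

    def commit(self, short_hash, message):
        self.commits.append((short_hash, message))

    def log(self):
        """Newest first, like git log."""
        return list(reversed(self.commits))

    def reset_hard(self, short_hash):
        """Move the branch back to the given commit, dropping newer ones."""
        hashes = [h for h, _ in self.commits]
        if short_hash not in hashes:
            raise ValueError(f"unknown commit: {short_hash}")
        keep = hashes.index(short_hash) + 1
        dropped = self.commits[keep:]
        self.commits = self.commits[:keep]
        return dropped


repo = ToyRepo()
repo.commit("b40dc25", "Initial version of cool_code.py")
repo.commit("e3bc497", "Added code to cool_code.py")
repo.commit("4007db9", "Second version of cool_code.py")

dropped = repo.reset_hard("e3bc497")
print(dropped)     # [('4007db9', 'Second version of cool_code.py')]
print(repo.log())  # newest first: e3bc497, then b40dc25
```

The dropped list is the part to be careful about: after the reset, the branch behaves as if that commit had never been made.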

Onwards and Upwards

user@hostname:~/cool-project$ echo "3 + 3" >> cool_code.py
user@hostname:~/cool-project$ git add cool_code.py
user@hostname:~/cool-project$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   cool_code.py

user@hostname:~/cool-project$ git commit -m "Added different code to cool_code.py"
[main 700d955] Added different code to cool_code.py
 1 file changed, 1 insertion(+)

The final git log:

user@hostname:~/cool-project$ git log
commit 700d955faacb27d7b8bc464b9451851b5e319f20 (HEAD -> main)
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:44:49 2023 +0000

    Added different code to cool_code.py

commit e3bc497acbb5a487566ff2014dcd7b83d0c75224
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:38:05 2023 +0000

    Added code to cool_code.py

commit b40dc25b14c0426b06c8d182184e147853f3c12e
Author: Jeff Jacobs <jjacobs3@cs.stanford.edu>
Date:   Sun May 28 00:37:02 2023 +0000

    Initial version of cool_code.py

But Why These Diagrams?

Even the simplest projects can start to look like:

The GitHub Side: Remote

An Empty Repo

Refresh after git push

Commit History

Checking the diff

Web Development

              Frontend               Backend
Low Level     HTML/CSS/JavaScript    GitHub Pages
Middle Level  JS Libraries           PHP, SQL
High Level    React, Next.js         Node.js, Vercel

Frontend icons: UI+UI elements, what the user sees (on the screen), user experience (UX), data visualization
Backend icons: Databases, Security

Getting Content onto the Internet

  • Step 1: index.html
  • Step 2: Create GitHub repository
  • Step 3: git init, git add -A ., git commit, git push (after linking the local repo to GitHub with git remote add)
  • Step 4: Enable GitHub Pages in repo settings
  • Step 5: <username>.github.io!

Deploying from a Branch/Folder

Lab Demonstrations

Lab Demonstration 1: Transferring Files

  • ssh
  • scp
  • rsync

Lab Demonstration 2: Quarto

Lab Demonstration 3: Git and GitHub

Lab Assignment Overview

Assignment Overview

  1. Create a repo on your private GitHub account called 5000-lab-1.2
  2. Clone the repo to your local machine with git clone
  3. Create a blank Quarto website project, then use a .bib file to add citations
  4. Add content to index.qmd
  5. Add content to about.ipynb
  6. Build a simple presentation in slides/slides.ipynb using the revealjs format
  7. Render the website using quarto render
  8. Sync your changes to GitHub
  9. Use rsync or scp to copy the _site directory to your GU domains server (within ~/public_html)
  10. Create a Zotero (or Mendeley) account, download the software, and add at least one reference to your site by syncing the .bib file

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Menczer, Filippo, Santo Fortunato, and Clayton A. Davis. 2020. A First Course in Network Science. Cambridge University Press.

Footnotes

  1. To see this, you can open your Terminal and run the ping command: ping georgetown.edu.

  2. Incredibly, despite the name, JavaScript has absolutely nothing to do with the Java programming language…

  3. Sorry for the jargon: it just means using the same word for different levels of a system (dangerous when talking computers!)