Week 13: Course Wrapup

DSAN 5000: Data Science and Analytics

Author: Prof. Jeff and Prof. James

Published: Thursday, December 5, 2024


Student Presentations

Course Recap

(Why Do A Course Recap?)

Data = Ground Truth + Noise

  • Depressing but true origin of statistics (as opposed to probability): the Plague 😷

Ground Truth: The Great Plague (Lord Have Mercy on London, Unknown Artist, circa 1665, via Wikimedia Commons)

Noisy Data (Recorded amidst chaos): London Bill of Mortality, 1665 (Public Domain, Wellcome Collection)

Your Toolbox

Basics
  • GitHub
  • File Formats
  • Web Scraping
Nuts and Bolts
  • Data-Generating Processes (DGPs)
  • Distance Metrics
  • Entropy
  • Gaussian / Normal Distributions
  • Clustering
  • Dimensionality Reduction
Drills and Saws
  • Decision Trees

GitHub (W03)

Data Structures: Simple → Complex (W04)

id  name          email
0   K. Desbrow    kd9@dailymail.com
1   D. Minall     dminall1@wired.com
2   C. Knight     ck2@microsoft.com
3   M. McCaffrey  mccaf4@nhs.uk
Figure 1: Record Data

year  month  points
2023  Jan    65
2023  Feb
2023  Mar    42
2023  Apr    11
Figure 2: Time-Series Data

id  date     rating  num_rides
0   2023-01  0.75    45
0   2023-02  0.89    63
0   2023-03  0.97    7
1   2023-06  0.07    10
Figure 3: Panel Data

id  Source  Target  Weight
1   IGF2    IGF1R   1
2   IGF1R   TP53    2
3   TP53    EGFR    0.5
Figure 4: Network Data
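
To keep these formats straight in code, here is a minimal pandas sketch (the column names just mirror the toy tables above): record data has one row per entity, while panel data has one row per (entity, time period) pair, which pandas can express with a MultiIndex.

Code
import pandas as pd

# Record data: one row per entity, no time dimension
record_df = pd.DataFrame({
    "name": ["K. Desbrow", "D. Minall"],
    "email": ["kd9@dailymail.com", "dminall1@wired.com"],
})

# Panel data: one row per (entity, time period) pair
panel_df = pd.DataFrame({
    "id": [0, 0, 1],
    "date": ["2023-01", "2023-02", "2023-06"],
    "rating": [0.75, 0.89, 0.07],
}).set_index(["id", "date"])
print(panel_df)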

Web Scraping (W04)

Difficulty  How is the data loaded?                                  Solution
😊 Easy     Data in HTML source                                      "View Source"
😐 Medium   Data loaded dynamically via API                          "View Source", find the API call, scrape it programmatically
😳 Hard     Data loaded dynamically [internally] via web framework   Use Selenium
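
For the 😊 Easy case, a minimal sketch using requests and BeautifulSoup; the URL and CSS selector here are hypothetical placeholders, not from the course materials.

Code
import requests
from bs4 import BeautifulSoup

# Hypothetical page whose data appears directly in the HTML source
response = requests.get("https://example.com/table-page")
soup = BeautifulSoup(response.text, "html.parser")

# Pull the cell text out of each row of the (hypothetical) data table
for row in soup.select("table.data tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])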

EDA: Why We Can’t Just Skip It (W06)

  • Iterative process: Ask questions of the data, find answers, generate more questions
  • You’re probably already used to Mean and Variance: Fancier EDA/robustness methods build upon these two!
  • Why do we need to visualize? Can’t we just use mean, R²?
  • …Enter Anscombe’s Quartet
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="ticks")
# https://towardsdatascience.com/how-to-use-your-own-color-palettes-with-seaborn-a45bf5175146
# Colorblind-friendly palette (otherwise defined in the course's shared setup code)
cb_palette = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
sns.set_palette(sns.color_palette(cb_palette))

# Load the example dataset for Anscombe's quartet
anscombe_df = sns.load_dataset("anscombe")
#print(anscombe_df)

# Show the results of a linear regression within each dataset
anscombe_plot = sns.lmplot(
    data=anscombe_df, x="x", y="y", col="dataset", hue="dataset",
    col_wrap=4, palette="muted", ci=None,
    scatter_kws={"s": 50, "alpha": 1},
    height=3
);
anscombe_plot;

The Scariest Dataset of All Time

Summary statistics
Code
# Compute dataset means
my_round = lambda x: round(x,2)
data_means = anscombe_df.groupby('dataset').agg(
  x_mean = ('x', 'mean'),
  y_mean = ('y', 'mean')
).apply(my_round)
Code
disp(data_means, floatfmt='.2f')
dataset x_mean y_mean
I 9.00 7.50
II 9.00 7.50
III 9.00 7.50
IV 9.00 7.50
Figure 5: Column means for each dataset
Code
# Compute dataset SDs
data_sds = anscombe_df.groupby('dataset').agg(
  x_sd = ('x', 'std'),
  y_sd = ('y', 'std'),
).apply(my_round)
Code
disp(data_sds, floatfmt='.2f')
dataset x_sd y_sd
I 3.32 2.03
II 3.32 2.03
III 3.32 2.03
IV 3.32 2.03
Figure 6: Column SDs for each dataset
Correlations
Code
import tabulate
from IPython.display import HTML
corr_matrix = anscombe_df.groupby('dataset').corr().apply(my_round)
#Markdown(tabulate.tabulate(corr_matrix))
HTML(corr_matrix.to_html())
dataset        x     y
I        x  1.00  0.82
         y  0.82  1.00
II       x  1.00  0.82
         y  0.82  1.00
III      x  1.00  0.82
         y  0.82  1.00
IV       x  1.00  0.82
         y  0.82  1.00
Figure 7: Correlation between x and y for each dataset

It Doesn’t End There…

Code
import statsmodels.formula.api as smf
# summary_to_df() and disp() are helpers from the course's shared setup code
summary_dfs = []
for cur_ds in ['I','II','III','IV']:
  cur_df = anscombe_df.loc[anscombe_df['dataset'] == cur_ds].copy()
  # Fit a simple linear regression of y on x
  results = smf.ols('y ~ x', data=cur_df).fit()
  # Get R^2
  rsq = round(results.rsquared, 2)
  # Inspect the results
  summary = results.summary()
  summary.extra_txt = None
  summary_df = summary_to_df(summary, corner_col = f'Dataset {cur_ds}<br>R^2 = {rsq}')
  summary_dfs.append(summary_df)
for summary_df in summary_dfs:
  disp(summary_df, include_index=False)
Dataset I (R² = 0.67)
           coef  std err     t  P>|t|  [0.025  0.975]
Intercept  3.00     1.12  2.67   0.03    0.46    5.54
x          0.50     0.12  4.24   0.00    0.23    0.77

Dataset II (R² = 0.67)
           coef  std err     t  P>|t|  [0.025  0.975]
Intercept  3.00     1.12  2.67   0.03    0.46    5.54
x          0.50     0.12  4.24   0.00    0.23    0.77

Dataset III (R² = 0.67)
           coef  std err     t  P>|t|  [0.025  0.975]
Intercept  3.00     1.12  2.67   0.03    0.46    5.54
x          0.50     0.12  4.24   0.00    0.23    0.77

Dataset IV (R² = 0.67)
           coef  std err     t  P>|t|  [0.025  0.975]
Intercept  3.00     1.12  2.67   0.03    0.46    5.54
x          0.50     0.12  4.24   0.00    0.23    0.77

Naïve Bayes (W07)

Guessing House Prices:

  • If I tell you there’s a house, what is your guess for the number of bathrooms it has?
  • If I tell you the house is 50,000 sqft, does your guess go up?

Guessing Word Frequencies:

  • If I tell you there’s a book, how often do you think the word “University” appears?
  • Now if I tell you that the word “Stanford” appears 2,000 times, does your guess go up?
Naïve Bayes’ Answer?

In Math

  • Assume some email $E$ with $N = 5$ words, $E = (w_1, w_2, w_3, w_4, w_5)$. Say $E = (\text{you}, \text{win}, \text{a}, \text{million}, \text{dollars})$.
  • We’re trying to classify $S = \begin{cases} 1 & \text{if spam} \\ 0 & \text{otherwise} \end{cases}$ given $E$
  • Normal person (marine biologist?):¹

$\Pr(S = 1 \mid w_5 = \text{dollars}, w_4 = \text{million}) > \Pr(S = 1 \mid w_5 = \text{dollars}, w_4 = \text{octopus})$

  • Naïve Bayes classifier:

$\Pr(S = 1 \mid w_5) \cdot \Pr(S = 1 \mid w_4)$
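
To make the “naïve” part concrete, here is a minimal sketch (with a toy labeled corpus, not data from the course) that estimates each per-word conditional $\Pr(S = 1 \mid w)$ separately, ignoring all the other words in the email:

Code
from collections import Counter

# Toy labeled corpus: (words, is_spam) pairs -- hypothetical examples
emails = [
    (["you", "win", "a", "million", "dollars"], 1),
    (["million", "dollar", "offer"], 1),
    (["lunch", "at", "noon"], 0),
    (["octopus", "dollars", "research"], 0),
]

# Count how often each word appears in spam vs. in any email
word_total, word_spam = Counter(), Counter()
for words, is_spam in emails:
    for w in set(words):
        word_total[w] += 1
        word_spam[w] += is_spam

def pr_spam_given(w):
    """Empirical Pr(S = 1 | w appears), ignoring all other words."""
    return word_spam[w] / word_total[w]

print(pr_spam_given("million"))  # 1.0 in this toy corpus
print(pr_spam_given("dollars"))  # 0.5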

Clustering (W09)

  • Let $\boldsymbol{\mu}_1 = (0.2, 0.8)$, $\boldsymbol{\mu}_2 = (0.8, 0.2)$, $\Sigma = \mathbf{diag}(1/64)$, and $\mathbf{X} = (X_1, X_2)$.
  • $X_1 \sim \mathcal{N}_2(\boldsymbol{\mu}_1, \Sigma)$, $X_2 \sim \mathcal{N}_2(\boldsymbol{\mu}_2, \Sigma)$
Code
library(tidyverse)
library(ggforce)
library(MASS)
library(patchwork)
# g_linewidth, g_pointsize, dsan_theme(), and cbPalette come from the
# course's shared plotting setup, loaded before this chunk
N <- 50
Mu1 <- c(0.2, 0.8)
Mu2 <- c(0.8, 0.2)
sigma <- 1/24
# Data for concentric circles
circle_df <- tribble(
  ~x0, ~y0, ~r, ~Cluster, ~whichR,
  Mu1[1], Mu1[2], sqrt(sigma), "C1", 1,
  Mu2[1], Mu2[2], sqrt(sigma), "C2", 1,
  Mu1[1], Mu1[2], 2 * sqrt(sigma), "C1", 2,
  Mu2[1], Mu2[2], 2 * sqrt(sigma), "C2", 2,
  Mu1[1], Mu1[2], 3 * sqrt(sigma), "C1", 3,
  Mu2[1], Mu2[2], 3 * sqrt(sigma), "C2", 3
)
#print(circle_df)
Sigma <- matrix(c(sigma,0,0,sigma), nrow=2)
#print(Sigma)
x1_df <- as_tibble(mvrnorm(N, Mu1, Sigma))
x1_df <- x1_df |> mutate(
  Cluster='C1'
)
x2_df <- as_tibble(mvrnorm(N, Mu2, Sigma))
x2_df <- x2_df |> mutate(
  Cluster='C2'
)
cluster_df <- bind_rows(x1_df, x2_df)
cluster_df <- cluster_df |> rename(
  x=V1, y=V2
)
known_plot <- ggplot(cluster_df) +
  geom_point(
    data = circle_df,
    aes(x=x0, y=y0)
  ) +
  geom_circle(
    data = circle_df,
    aes(x0=x0, y0=y0, r=r, fill=Cluster),
    linewidth = g_linewidth,
    alpha = 0.25
  ) +
  geom_point(
    data=cluster_df,
    aes(x=x, y=y, fill=Cluster),
    size = g_pointsize / 2,
    shape = 21
  ) +
  dsan_theme("full") +
  coord_fixed() +
  labs(
    x = "x",
    y = "y",
    title = "Data with Known Clusters"
  ) + 
  scale_fill_manual(values=c(cbPalette[2], cbPalette[1], cbPalette[3], cbPalette[4])) +
  scale_color_manual(values=c(cbPalette[1], cbPalette[2], cbPalette[3], cbPalette[4]))
unknown_plot <- ggplot(cluster_df) +
  geom_point(
    data=cluster_df,
    aes(x=x, y=y),
    size = g_pointsize / 2,
    #shape = 21
  ) +
  dsan_theme("full") +
  coord_fixed() +
  labs(
    x = "x",
    y = "y",
    title = "Same Data with Unknown Clusters"
  )
cluster_df |> write_csv("assets/cluster_data.csv")
known_plot + unknown_plot

Clusters as Latent Variables

  • Recall the Hidden Markov Model (one of many examples):

Modeling the Latent Distribution

  • This observed/latent distinction gives us a modeling framework for inferring “underlying” distributions from data!
  • Let’s begin with an overly-simple model: only one cluster (all data drawn from a single normal distribution)

  • Probability that RV $X_i$ takes on value $\mathbf{v}$:

    $\Pr(X_i = \mathbf{v} \mid \theta_D) = \varphi_2(\mathbf{v}; \boldsymbol{\mu}, \Sigma)$

    where $\varphi_2(\mathbf{v}; \boldsymbol{\mu}, \Sigma)$ is the pdf of $\mathcal{N}_2(\boldsymbol{\mu}, \Sigma)$.

  • Let $\mathbf{X} = (X_1, \ldots, X_N)$, $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_N)$

  • Probability that RV $\mathbf{X}$ takes on values $\mathbf{V}$:

$\Pr(\mathbf{X} = \mathbf{V} \mid \theta_D) = \Pr(X_1 = \mathbf{v}_1 \mid \theta_D) \times \cdots \times \Pr(X_N = \mathbf{v}_N \mid \theta_D)$

So How Do We Infer Latent Vars From Data?

  • If only we had some sort of method for estimating which values of our unknown parameters $\theta_D$ are most likely to have produced our observed data $\mathbf{X}$…²

  • The diagram on the previous slide gave us an equation:

    $\Pr(\mathbf{X} = \mathbf{V} \mid \theta_D) = \Pr(X_1 = \mathbf{v}_1 \mid \theta_D) \times \cdots \times \Pr(X_N = \mathbf{v}_N \mid \theta_D)$

  • And we know that, when we consider the data as given and view this probability as a function of the parameters, we write it as

    $\mathcal{L}(\mathbf{X} = \mathbf{V} \mid \theta_D) = \mathcal{L}(X_1 = \mathbf{v}_1 \mid \theta_D) \times \cdots \times \mathcal{L}(X_N = \mathbf{v}_N \mid \theta_D)$

  • We want to find the most likely $\theta_D$, that is, $\theta_D^* = \operatorname{argmax}_{\theta_D} \mathcal{L}(\mathbf{X} = \mathbf{V} \mid \theta_D)$

  • This value $\theta_D^*$ is called the Maximum Likelihood Estimate of $\theta_D$, and is easy to find using calculus tricks (a minimal sketch follows)
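
For the single-Gaussian model, those calculus tricks give a closed form: the MLE of $\boldsymbol{\mu}$ is the sample mean and the MLE of $\Sigma$ is the (uncorrected) sample covariance. A minimal numpy sketch, using freshly simulated data rather than the clusters above:

Code
import numpy as np

rng = np.random.default_rng(5000)
# Simulated 2D observations from a single normal distribution
X = rng.multivariate_normal(mean=[0.5, 0.5], cov=np.eye(2) / 24, size=100)

# MLE for a single Gaussian: sample mean and (uncorrected) sample covariance
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)
print(mu_hat)     # close to (0.5, 0.5)
print(Sigma_hat)  # close to diag(1/24)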

Handling Multiple Clusters

  • Probability $X_i$ takes on value $\mathbf{v}$:

    $\Pr(X_i = \mathbf{v} \mid C_i = c_i; \theta_D) = \begin{cases} \varphi_2(\mathbf{v}; \boldsymbol{\mu}_1, \Sigma) & \text{if } c_i = 1 \\ \varphi_2(\mathbf{v}; \boldsymbol{\mu}_2, \Sigma) & \text{otherwise,} \end{cases}$

    where $\varphi_2(\mathbf{v}; \boldsymbol{\mu}, \Sigma)$ is the pdf of $\mathcal{N}_2(\boldsymbol{\mu}, \Sigma)$.

  • Let $\mathbf{C} = (\underbrace{C_1}_{\text{RV}}, \ldots, C_N)$, $\mathbf{c} = (\underbrace{c_1}_{\text{scalar}}, \ldots, c_N)$

  • Probability that RV $\mathbf{X}$ takes on values $\mathbf{V}$:

$\Pr(\mathbf{X} = \mathbf{V} \mid \mathbf{C} = \mathbf{c}; \theta_D) = \Pr(X_1 = \mathbf{v}_1 \mid C_1 = c_1; \theta_D) \times \cdots \times \Pr(X_N = \mathbf{v}_N \mid C_N = c_N; \theta_D)$

  • It’s the same math as before! Find $(\mathbf{C}^*, \theta_D^*) = \operatorname{argmax}_{\mathbf{C}, \theta_D} \mathcal{L}(\mathbf{X} = \mathbf{V} \mid \mathbf{C}; \theta_D)$ (see the sketch below)
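
In practice we can’t brute-force the joint argmax over $\mathbf{C}$ and $\theta_D$, so libraries like scikit-learn maximize it iteratively with the EM algorithm. A minimal sketch, assuming the cluster_data.csv file written out by the R chunk above:

Code
import pandas as pd
from sklearn.mixture import GaussianMixture

# The simulated points saved by the clustering chunk above (x, y, Cluster)
cluster_df = pd.read_csv("assets/cluster_data.csv")

# Fit a 2-component Gaussian mixture to the *unlabeled* (x, y) points
gmm = GaussianMixture(n_components=2, random_state=5000)
inferred_clusters = gmm.fit_predict(cluster_df[["x", "y"]])
print(inferred_clusters[:5])  # inferred latent cluster labels C*

# Part of the estimated theta_D: one mean vector per latent cluster
print(gmm.means_)  # should land near (0.2, 0.8) and (0.8, 0.2)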

Dimensionality Reduction (W10)

  • High-level goal: Retain information about word-context relationships while reducing the $M$-dimensional representation of each word down to 3 dimensions.
  • Low-level goal: Generate a rank-$K$ matrix $\mathbf{W}$ that best approximates the distances between words in $M$-dimensional space (rows in $\mathbf{X}$); see the SVD sketch below
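
A minimal numpy sketch of that low-level goal, using truncated SVD, which yields the best rank-$K$ approximation in the least-squares sense (the word-context matrix here is just random noise, to show the shapes involved):

Code
import numpy as np

rng = np.random.default_rng(5000)
M, K = 50, 3
# Stand-in word-context matrix: one M-dimensional row per word
X = rng.normal(size=(100, M))

# Truncated SVD: keep only the K largest singular values/vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :K] * s[:K]            # K-dimensional representation of each word
X_approx = W @ Vt[:K, :]        # best rank-K approximation of X
print(W.shape, X_approx.shape)  # (100, 3) (100, 50)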

Looking Forward

Backing Up: What is a Neural Network?

  • A linked network of $L$ layers, each containing nodes $\nu_i^{[\ell]}$

What Do the Nodes Do?

Each node $\nu_i^{[\ell]}$ in the network:

  • Takes in an input,
  • Transforms it using a weight $w_i^{[\ell]}$ and bias $b_i^{[\ell]}$, and
  • Produces an output, typically using a sigmoid function like $\sigma(x) = \frac{1}{1 + e^{-x}}$:

$\text{output}_i^{[\ell]} = \sigma(w_i^{[\ell]} \cdot \text{input} + b_i^{[\ell]})$
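
A minimal sketch of a single node’s forward pass, with scalar weight and bias exactly as in the formula above:

Code
import numpy as np

def sigmoid(x):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

def node_output(x, w, b):
    """One node: affine transform w*x + b, then sigmoid activation."""
    return sigmoid(w * x + b)

print(node_output(x=2.0, w=0.5, b=-1.0))  # sigmoid(0.0) = 0.5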

How Does it “Learn”?

  • Need a loss function $\mathcal{L}(\widehat{y}, y)$
  • Starting from the end, we backpropagate the loss, updating weights and biases as we go
  • Higher loss ⟹ greater change to weights and biases (minimal sketch below)
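
And a minimal sketch of one learning step for that single node, using squared-error loss and plain gradient descent (the data point and learning rate are arbitrary toy values); the chain rule below is the one-node version of backpropagation:

Code
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One node, one training example (arbitrary toy values)
x, y = 2.0, 1.0
w, b, lr = 0.5, -1.0, 0.1

z = w * x + b                 # affine transform
out = sigmoid(z)              # forward pass: predicted y-hat
loss = (out - y) ** 2         # squared-error loss L(y_hat, y)

# Backward pass (chain rule): dL/dz = dL/dout * dout/dz
dout = 2 * (out - y)
dz = dout * out * (1 - out)   # sigmoid'(z) = out * (1 - out)

# Gradient-descent updates: higher loss => bigger gradients => bigger changes
w -= lr * dz * x              # dL/dw = dL/dz * x
b -= lr * dz                  # dL/db = dL/dz
print(round(loss, 2), w, b)   # 0.25, then the nudged weight and bias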

Core Courses

  • DSAN 5200: Advanced Data Visualization
  • DSAN 5300: Statistical Learning
  • (DSAN 6000: Big Data and Cloud Computing)

Elective Courses

Computer Science in General
  • DSAN 5500: Data Structures, Objects, and Algorithms in Python
  • (DSAN 5700: Blockchain Technologies)
  • DSAN 6800: Principles of Cybersecurity
AI/Machine Learning
  • DSAN 6500: Comp Vision and Generative Image Modeling
  • DSAN 6550: Adaptive Measurement
  • DSAN 6600: Neural Nets & Deep Learning
  • DSAN 6650: Reinforcement Learning
Math/Stats
  • DSAN 5600: Applied Time Series for Data Science
  • (DSAN 6200: Analytics and Math for Streaming and High Dimension Data)
Language
  • DSAN 5400: Comp Ling, Advanced Python
  • (DSAN 5810: NLP with Large Language Models)
Data Science in Society
  • DSAN 5450: Data Ethics and Policy
  • (DSAN 5550: Data Science and Climate Change)
  • DSAN 5900: Digital Storytelling

Types of Data

Data Over Time
  • DSAN 5600: Applied Time Series for Data Science
Text Data
  • DSAN 5400: Computational Linguistics, Advanced Python
  • (DSAN 5810: NLP with Large Language Models)
Surveys/Tests
  • DSAN 6550: Adaptive Measurement
Geographic
  • (DSAN 5550: Data Science and Climate Change)
  • (DSAN 6750: GIS for Spatial Data Science)
Image Data
  • DSAN 6500: Comp Vision and Generative Image Modeling
All Of The Above
  • DSAN 5500: Data Structures
  • DSAN 5900: Digital Storytelling

Future Electives!

  • DSAN 6100: Optimization
  • DSAN 6300: Database Systems and SQL
  • DSAN 6400: Network Analytics [Summer]
  • DSAN 6700: Machine Learning App Deployment
  • DSAN 6750: GIS for Spatial Data Science
  • DSAN 6850: Causal Inference for Computational Social Science [Summer]


Footnotes

  1. (But we might have the opposite result for a marine economist… rly makes u think)

  2. If you’re in my DSAN 5100 class, then you already know this! If not, check out the MLE slides here for more details.