Week 10: Evaluating Spatial Hypotheses II: Areal Data

PPOL 6805 / DSAN 6750: GIS for Spatial Data Science
Fall 2025

Author: Jeff Jacobs
Published: Wednesday, October 29, 2025

Roadmap to the Midterm!

  • Last Week (Oct 22): Evaluating Hypotheses for Point Data
  • Today (Oct 29): Evaluating Hypotheses for Areal Data
  • In-Class Midterm (Nov 5): Basically a “mini-homework” on the Spatial Data Science unit (same topics as HW3)!

The Point of All This: Null Models via Base Rates!

Null Models

  • Given an intensity function, we can generate any number of simulated point patterns (via an inhomogeneous Poisson Point Process), as sketched below…
  • How does an observed point pattern differ from the simulated point patterns?
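For instance, a minimal sketch in spatstat (the intensity function, seed, and unit-square window here are illustrative choices, not from the lecture):

Code
library(spatstat.random) |> suppressPackageStartupMessages()
set.seed(6805)
# Intensity increases from left (x=0) to right (x=1) of the unit square,
# so simulated patterns should be denser on the right
lambda_fn <- function(x, y) { 200 * x }
sim_ppp <- rpoispp(lambda_fn)
plot(sim_ppp, main="One simulated inhomogeneous pattern")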

Constant Risk Hypothesis: Motivation

  • Here is a (fictional) map of flu cases in the US “Midwest”
  • Are people in Chicago and Detroit equally “at risk”?
Code
library(mapview) |> suppressPackageStartupMessages()
city_df <- tibble::tribble(
  ~city, ~lat, ~lon, ~pop,
  "Chicago", 41.950567516553896, -87.93011127491978, 2746388,
  "Detroit", 42.45123004999075, -83.18402319217698, 631524
)
city_sf <- sf::st_as_sf(
  city_df,
  coords = c("lon", "lat"),
  crs=4326
)
city_buf_sf <- city_sf |> sf::st_buffer(20000)
# Sample 10 points within each city's 20km buffer; st_sample() returns the
# points in buffer order (all Chicago points first, then Detroit)
city_cases_sf <- city_buf_sf |> sf::st_sample(size=rep(10,2)) |> sf::st_as_sf()
city_cases_sf$city <- "Detroit (10 Cases)"
city_cases_sf[1:10, 'city'] <- "Chicago (10 Cases)"
city_cases_sf$sample <- "Flu"
mapview(city_cases_sf, zcol="city")

Constant Risk Hypothesis: Base Rates

             Chicago     Detroit
Population   2,746,388   631,524
Code
library(dplyr) |> suppressPackageStartupMessages()  # for bind_rows() and filter() below
library(leaflet) |> suppressPackageStartupMessages()
library(leaflet.extras2) |> suppressPackageStartupMessages()
# Sample "people" at a 16-to-4 ratio, roughly mirroring the two cities' populations
city_pop_sf <- city_buf_sf |> sf::st_sample(size=c(16, 4)) |> sf::st_as_sf()
city_pop_sf$city <- "Detroit"
city_pop_sf[1:16, 'city'] <- "Chicago"
city_pop_sf$sample <- "People"
city_combined_sf <- bind_rows(city_cases_sf, city_pop_sf)
Flu <- makeIcon(
  iconUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Maki2-danger-24.svg/240px-Maki2-danger-24.svg.png",
  iconRetinaUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Maki2-danger-24.svg/24px-Maki2-danger-24.svg.png",
  iconWidth = 20,
  iconHeight = 20
)
People <- makeIcon(
  iconUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/Maki2-pitch-24.svg/24px-Maki2-pitch-24.svg.png",
  iconRetinaUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/Maki2-pitch-24.svg/24px-Maki2-pitch-24.svg.png",
  iconWidth = 20,
  iconHeight = 20
)

city_flu_sf <- city_combined_sf |> filter(sample == "Flu")
city_ppl_sf <- city_combined_sf |> filter(sample == "People")
leaflet(city_flu_sf) |>
  addProviderTiles("CartoDB.Positron") |>
  addMarkers(data=city_flu_sf, icon = Flu) |>
  addMarkers(data=city_ppl_sf, icon = People)

Meaningful Results \(\leftrightarrow\) Comparisons with Null Model(s)

  • Cases of disease form an intensity function \(\lambda(\mathbf{s})\)

  • Controls form an “ambient” intensity function \(\lambda_0(\mathbf{s})\)

  • \(\implies\) The following model allows us to separate the quantities we really care about from the base rate information:

    \[ \lambda(\mathbf{s}) = \alpha \overbrace{\lambda_0(\mathbf{s})}^{\mathclap{\text{Base Rate}}} \underbrace{\rho(\mathbf{s})}_{\mathclap{\text{Relative Risk}}}, \]

    where

    \[ \alpha = \frac{\# \text{Cases}}{\# \text{Controls}} \]

  • (Remember: \(\lambda(\mathbf{s})\) has units of expected number of cases per unit area; it is not a probability density!)

Example from Moraga (2024)

Code
library(sparr) |> suppressPackageStartupMessages()
library(ggplot2) |> suppressPackageStartupMessages()  # for the plots below
data(pbc)
controls <- unmark(pbc[which(pbc$marks == "control"), ])
controls |> sf::st_as_sf() |> ggplot() +
  geom_sf() +
  theme_classic() +
  labs(title="Controls (Population Sample)")

Code
cases <- unmark(pbc[which(pbc$marks == "case"), ])
cases |> sf::st_as_sf() |> ggplot() +
  geom_sf() +
  theme_classic() +
  labs(title="PBC Cases")

Estimating Comparable Intensity Surfaces

  • To ensure the two intensity surfaces are comparable, we need a common bandwidth value
  • density() from spatstat makes “smart” per-pattern choices of this bandwidth, so for now we simply average the bandwidths it estimates for cases and for controls:
Code
library(sparr)
data(pbc)
cases <- unmark(pbc[which(pbc$marks == "case"), ])
controls <- unmark(pbc[which(pbc$marks == "control"), ])
bwcases <- attr(density(cases), "sigma")
bwcontr <- attr(density(controls), "sigma")
(bw <- (bwcases + bwcontr)/2)
[1] 11.45837
Code
int_controls <- density(controls, sigma = bw, eps=0.25)
plot(int_controls, main = "Control Intensity")

Code
int_cases <- density(cases, sigma = bw, eps=0.25)
plot(int_cases, main = "Case Intensity")

Visualizing Relative Risk Surface

  • All that’s left is \(\alpha = \# \text{Cases} / \# \text{Controls}\)!
Code
library(fields)
(alpha_hat <- cases$n/controls$n)
[1] 0.2519868
Code
x <- int_cases$xcol
y <- int_cases$yrow
# Estimated relative risk surface: rho(s) = lambda(s) / (alpha * lambda_0(s))
rr <- t(int_cases$v)/t(alpha_hat * int_controls$v)
image.plot(x, y, rr, asp = 1)
title(xlab = "Easting", ylab = "Northing")

From Exploratory to Confirmatory Data Analysis

  • How do we know whether the ↑ risk around North Newcastle is due to “chance”? Maybe relative risk is actually constant, and people in Newcastle are just unlucky
  • Thought experiment: a bro hands you a coin. You flip it 10 times: HHHHHTHHHH
  • The coin may be biased-towards-H, or it could be fair (lots of heads by chance!)
  • Stats: compare with a null model! For a fair coin, how likely is [9 Heads / 10 Flips]? (see the sketch below)
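We can compute this directly in base R; a minimal sketch under the fair-coin null:

Code
# P(exactly 9 heads in 10 flips of a fair coin)
dbinom(9, size=10, prob=0.5)
[1] 0.009765625

Since \(P(\geq 9 \text{ heads}) = 11/1024 \approx 0.011\), an outcome this extreme is rare enough under the null to cast doubt on the coin’s fairness.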

Exploratory Data Analysis: Collecting evidence

Confirmatory Data Analysis: Making judgements

Null Model: CSR

Figure from Gimond (2024)

The Random Labeling Hypothesis

  • Randomly apply 761 “sick” labels to the 3781 points!
Code
# NOTE: window_to_sf() and points_to_sf() are helper functions (assumed to be
# defined in an earlier setup chunk, along with the cb_palette colors) that
# split a ppp object into its window and its points as sf objects
plot_rlabel <- function(rl_ppp) {
  rl_win_sf <- window_to_sf(rl_ppp)
  rl_points_sf <- points_to_sf(rl_ppp) |>
    rename(
      type = spatstat.geom..marks.x.
    )
  rl_plot <- ggplot() +
    geom_sf(
      data=rl_win_sf,
      alpha=0.3
    ) +
    geom_sf(
      mapping=aes(fill=type, alpha=type),
      data=rl_points_sf,
      size=2,
      shape=21,
      stroke=0.1
    ) +
    scale_fill_manual(
      "Type",
      values=c("control"="gray", "case"=cb_palette[1])
    ) +
    scale_alpha_manual(
      "Type",
      values=c("control"=0.5, "case"=1.0)
    ) +
    theme_classic(base_size=16) +
    labs(
      title="NE Britain Disease Cases"
    )
  return(rl_plot)
}
set.seed(6810)
rl1_ppp <- rlabel(pbc)
plot_rlabel(rl1_ppp)

Code
set.seed(6811)
rl2_ppp <- rlabel(pbc)
plot_rlabel(rl2_ppp)

Code
set.seed(6812)
rl3_ppp <- rlabel(pbc)
plot_rlabel(rl3_ppp)

The Constant Risk Hypothesis

  • Everyone equally at risk of contracting disease, regardless of location: \(\rho(\mathbf{s}) = 1\)
  • In this case: \(\lambda_{CR}(\mathbf{s}) = \alpha \lambda_0(\mathbf{s})\)
Code
plot_pbc_sim <- function() {
    pbc_sim <- spatstat.random::rpoispp(
        lambda = alpha_hat * int_controls
    )
    # Separate window and points
    pbc_sf <- pbc_sim |> sf::st_as_sf()
    pbc_point_sf <- pbc_sf |> filter(label == "point")
    pbc_win_sf <- pbc_sf |> filter(label == "window")
    pbc_plot <- ggplot() +
      geom_sf(data=pbc_win_sf) +
      geom_sf(data=pbc_point_sf, size=0.5, alpha=0.5) +
      theme_classic()
    return(pbc_plot)
}
plot_pbc_sim()

Code
plot_pbc_sim()

Code
plot_pbc_sim()

Point Processes \(\rightarrow\) Areal Processes

  • The “Complete” Part of Complete Spatial Randomness (CSR)
  • Bringing Neighbors Back In
  • Spatial Regression

Points \(\rightarrow\) Areas

Aggregating Point Processes

  • Recall the two “stages” of our Poisson-based point processes:
    • A Poisson-distributed number of points, then
    • Uniformly-distributed coordinates for each point
  • One way to see areal data modeling: we keep the Poisson part, but we no longer observe the coordinates of the points! (see the two-stage sketch after this list)
  • So… why is it still helpful for you to have sat through those lectures on Point Processes? Three reasons…
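To make the two “stages” concrete, here is a minimal base-R sketch of the homogeneous case (the intensity value, unit-square window, and seed are illustrative):

Code
set.seed(6805)
lambda <- 100                  # intensity: expected points per unit area
win_area <- 1                  # unit-square window
# Stage 1: a Poisson-distributed number of points
n <- rpois(1, lambda * win_area)
# Stage 2: uniformly-distributed coordinates for each point
pts <- data.frame(x=runif(n), y=runif(n))
plot(pts, asp=1, pch=16, main=sprintf("n = %d points", n))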

(1) Areal Patterns = Aggregated Point Patterns

  • One application, mentioned last week: areal-weighted interpolation using actual models of how the points are distributed within the area!
  • Regularly-spaced points is rarely a good “default” model!
  • Humans, for example, rarely live at perfect evenly-spaced intervals… they form households, villages, cities
  • Regularly-spaced points (we now know) \(\implies\) negative autocorrelation \(\implies\) typically due to inhibition process (competition)
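For intuition, a minimal spatstat sketch contrasting an inhibition process with CSR (the inhibition radius, point count, and seed are illustrative):

Code
library(spatstat.random) |> suppressPackageStartupMessages()
set.seed(6805)
# Simple Sequential Inhibition: no two points closer than r = 0.05
reg_ppp <- rSSI(r=0.05, n=100)
# CSR benchmark: homogeneous Poisson with the same expected count
csr_ppp <- rpoispp(lambda=100)
par(mfrow=c(1, 2))
plot(reg_ppp, main="Inhibition (regular)")
plot(csr_ppp, main="CSR")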

(2) Spatial Scan Statistics

  • Areal regions are often the result of artificial / “arbitrary” human divisions
    • (Particulate matter doesn’t pass through customs)
  • \(\implies\) If we care about processes which don’t “adhere to” borders (like disease spread), we want to “scan” buffers around points regardless of areal borders (see the sketch below)
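One way to carry out such a scan in R is spatstat’s scan test; a minimal sketch reusing the pbc cases from above (the radius, in the pattern’s coordinate units, and the simulation count are illustrative choices):

Code
library(spatstat.explore) |> suppressPackageStartupMessages()
# Slide a circle of radius r over the window, testing for excess cases
# inside the circle relative to a homogeneous Poisson baseline
st <- scan.test(cases, r=10, nsim=99, method="poisson")
plot(st, main="Scan Statistic Surface")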

(3) Hierarchical Bayesian Smoothing

  • The issue might just be that we don’t have enough data for some areas, while we have an abundance of data in nearby areas…

From Kramer (2023), Spatial Epidemiology

Autocorrelation Reminder: Weight Matrix Defines “Neighbors”

Choices Available in spdep

  • \(w_{ij} = 1\) if \(i\) and \(j\) “touch”
    • Rook: Overlap is 1 or 2-dimensional
    • Queen: Overlap is 0, 1, or 2-dimensional
  • \(w_{ij} = \frac{1}{\text{dist}(i, j)}\)
    • \(w(\textsf{Seoul}, \textsf{Incheon})\) > \(w(\textsf{Seoul}, \textsf{Busan})\) > \(w(\textsf{Seoul}, \textsf{Jeju})\)
  • \(w_{ij} = 1\) if \(\text{dist}(i, j) < \overline{D}\)
  • \(w_{ij} = 1\) for the \(K\) nearest neighbors (see the sketch below for constructing these in spdep)
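A minimal spdep sketch of these options, using the North Carolina counties shapefile that ships with sf as stand-in data:

Code
library(spdep) |> suppressPackageStartupMessages()
library(sf) |> suppressPackageStartupMessages()
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet=TRUE)
# Queen contiguity: any shared boundary point counts as "touching"
nb_queen <- poly2nb(nc, queen=TRUE)
# Rook contiguity: requires a shared boundary segment
nb_rook <- poly2nb(nc, queen=FALSE)
# K-nearest-neighbors alternative: the K=5 nearest county centroids
nb_knn <- knn2nb(knearneigh(st_centroid(st_geometry(nc)), k=5))
# Row-standardized weights list from the queen neighbor definition
nc_listw <- nb2listw(nb_queen, style="W")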

Moran’s \(I\) One More Time

  • Round 1 (Week 6): Moran’s \(I\) as “thermometer”
  • Round 2 (Today): Could an \(I\) value this extreme occur due to random noise alone? (see the sketch below)
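A Monte Carlo answer to exactly this question, reusing the nc and nc_listw objects from the spdep sketch above (BIR74, a county birth count, is just a convenient example variable):

Code
# Permutation test: compare the observed Moran's I to I values computed
# under 999 random re-shufflings of the variable across counties
set.seed(6805)
moran.mc(nc$BIR74, listw=nc_listw, nsim=999)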

Example: Bolshevik Revolution \(\rightarrow\) “Bipolar” World

  • Null hypothesis: no spatial effect of democratization/de-democratization on neighboring countries
  • If the null hypothesis is true, and countries democratize/de-democratize independently… could this pattern still arise?

Spatial Regression

  • (We made it! This is the last Spatial Data Science topic!)
  • (Nearly all extremely-fancy applied GIS models can be implemented via Spatial Regression!)

Motivation: When Does Non-Spatial Regression “Work”?

\[ Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \cdots + \beta_MX_{i,M} + \varepsilon_i \]

  • Importance of OLS regression: it can give us the Best Linear Unbiased Estimator (BLUE)

  • This is only true if the Gauss-Markov assumptions hold; one of these is that the error terms are uncorrelated:

    \[ \text{Cov}[\varepsilon_i, \varepsilon_j] = 0 \; \forall i \neq j \]

Spatial Autocorrelation in Residuals

  • We have now seen several models / datasets where the effect of some variable \(X\) (say, population) on another variable \(Y\) (say, disease count) is spatial! (Kind of the whole point of the class 😜)
  • So, to see when OLS will “work” vs. when you need to incorporate GIS, the key step is to map the spatial distribution of the regression residuals! (a sketch follows below)
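A minimal sketch of this workflow, again with the toy nc / nc_listw objects from above (the model is a stand-in, not the Italian-elections specification):

Code
# Fit a plain OLS model, then test its residuals for spatial autocorrelation
ols_fit <- lm(BIR74 ~ AREA, data=nc)
lm.morantest(ols_fit, listw=nc_listw)
# Map the residuals to eyeball any spatial pattern
nc$ols_resid <- residuals(ols_fit)
plot(nc["ols_resid"], main="OLS Residuals")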

Example: Italian Elections

Ward and Gleditsch (2018), Figure 2.3

Ward and Gleditsch (2018), Figure 2.4

Will OLS “Work” Here?

  • Can we use OLS to derive the BLUE of the effect of GDP on voting?
Plain OLS (\(N = 477\))

                   \(\widehat{\beta}\)   SE     \(t\)
Intercept          35.30                 2.21   15.96
Log GDP per cap    13.46                 0.65   20.84

Moran’s \(I\) for residuals = 0.47(!)

Spatial Regression (\(N = 477\))

                   \(\widehat{\beta}\)   SE     \(t\)
Intercept          4.70                  1.66   2.80
Log GDP per cap    1.77                  0.48   3.66
\(\rho\)           0.87                  0.02   36.7

What Does it Mean that “Spatial Effect” is Significant?

Ward and Gleditsch (2018), Figure 2.5

Case 1: No Residual Spatial Autocorrelation

  • Example: Residuals have Moran’s \(I\) near 0…
  • …Don’t need GIS at all!

Case 2: Conditional Autoregressive Model (CAR)

  • GIS, but… only for “fixing” your regression

\[ Y_i = \underbrace{\mu_i}_{\text{Non-spatial model}} + \underbrace{\frac{1}{w_{i,\cdot}}\sum_{j \neq i} w_{ij}(Y_j - \mu_j)}_{\text{Spatial Autocorrelation}} + \varepsilon_i \]
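One way to fit a CAR specification in R is spatialreg::spautolm; a minimal sketch with the toy nc data from above (the CAR family requires symmetric weights, hence the binary style here):

Code
library(spatialreg) |> suppressPackageStartupMessages()
# Binary (style="B") weights are symmetric, as the CAR family requires
nc_listw_B <- nb2listw(nb_queen, style="B")
car_fit <- spautolm(BIR74 ~ AREA, data=nc, listw=nc_listw_B, family="CAR")
summary(car_fit)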

Case 3: Simultaneous Autoregressive Model (SAR)

  • The main event! Here we are explicitly modeling “spatially lagged” versions of our dependent variable \(Y\)!

\[ Y = \mathbf{X}\beta + \rho \underbrace{\mathbf{W}Y}_{\mathclap{\text{Spatially-lagged }Y}} + \boldsymbol\varepsilon \]
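A matching sketch of the SAR / spatial-lag model via spatialreg::lagsarlm, with the same toy data (the “Rho” in the output corresponds to \(\rho\) above):

Code
# Spatial lag model: Y = X*beta + rho*W*Y + epsilon
sar_fit <- lagsarlm(BIR74 ~ AREA, data=nc, listw=nc_listw)
summary(sar_fit)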

References

Gimond, Manuel. 2024. Introduction to GIS and Spatial Analysis. Colby College (ES214).
Ward, Michael D., and Kristian Skrede Gleditsch. 2018. Spatial Regression Models. SAGE Publications.