Week 10: Evaluating Spatial Hypotheses II: Areal Data

PPOL 6805 / DSAN 6750: GIS for Spatial Data Science
Fall 2025

Author: Jeff Jacobs
Published: Wednesday, October 29, 2025

Roadmap to the Midterm!

  • Last Week (Oct 22): Evaluating Hypotheses for Point Data
  • Today (Oct 29): Evaluating Hypotheses for Areal Data
  • In-Class Midterm (Nov 5): Basically a “mini-homework” on the Spatial Data Science unit (same topics as HW3)!

The Point of All This: Null Models via Base Rates!

Null Models

  • Given an intensity function, we can generate any number of simulated point patterns (via an inhomogeneous Poisson Point Process), as sketched below…
  • How does an observed point pattern differ from the simulated point patterns?
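For instance, a minimal sketch in spatstat (the intensity function, seed, and unit-square window here are illustrative choices, not from the lecture):

Code
library(spatstat.random) |> suppressPackageStartupMessages()
set.seed(6805)
# Intensity increases from left (x=0) to right (x=1) of the unit square,
# so simulated patterns should be denser on the right
lambda_fn <- function(x, y) { 200 * x }
sim_ppp <- rpoispp(lambda_fn)
plot(sim_ppp, main="One simulated inhomogeneous pattern")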

Constant Risk Hypothesis: Motivation

  • Here is a (fictional) map of flu cases in the US “Midwest”
  • Are people in Chicago and Detroit equally “at risk”?
Code
library(mapview) |> suppressPackageStartupMessages()
city_df <- tibble::tribble(
  ~city, ~lat, ~lon, ~pop,
  "Chicago", 41.950567516553896, -87.93011127491978, 2746388,
  "Detroit", 42.45123004999075, -83.18402319217698, 631524
)
city_sf <- sf::st_as_sf(
  city_df,
  coords = c("lon", "lat"),
  crs=4326
)
city_buf_sf <- city_sf |> sf::st_buffer(20000)
# Sample 10 points within each city's 20km buffer; st_sample() returns the
# points in buffer order (all Chicago points first, then Detroit)
city_cases_sf <- city_buf_sf |> sf::st_sample(size=rep(10,2)) |> sf::st_as_sf()
city_cases_sf$city <- "Detroit (10 Cases)"
city_cases_sf[1:10, 'city'] <- "Chicago (10 Cases)"
city_cases_sf$sample <- "Flu"
mapview(city_cases_sf, zcol="city")

Constant Risk Hypothesis: Base Rates

             Chicago     Detroit
Population   2,746,388   631,524
Code
library(dplyr) |> suppressPackageStartupMessages()  # for bind_rows() and filter() below
library(leaflet) |> suppressPackageStartupMessages()
library(leaflet.extras2) |> suppressPackageStartupMessages()
# Sample "people" at a 16-to-4 ratio, roughly mirroring the two cities' populations
city_pop_sf <- city_buf_sf |> sf::st_sample(size=c(16, 4)) |> sf::st_as_sf()
city_pop_sf$city <- "Detroit"
city_pop_sf[1:16, 'city'] <- "Chicago"
city_pop_sf$sample <- "People"
city_combined_sf <- bind_rows(city_cases_sf, city_pop_sf)
Flu <- makeIcon(
  iconUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Maki2-danger-24.svg/240px-Maki2-danger-24.svg.png",
  iconRetinaUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Maki2-danger-24.svg/24px-Maki2-danger-24.svg.png",
  iconWidth = 20,
  iconHeight = 20
)
People <- makeIcon(
  iconUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/Maki2-pitch-24.svg/24px-Maki2-pitch-24.svg.png",
  iconRetinaUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/Maki2-pitch-24.svg/24px-Maki2-pitch-24.svg.png",
  iconWidth = 20,
  iconHeight = 20
)

city_flu_sf <- city_combined_sf |> filter(sample == "Flu")
city_ppl_sf <- city_combined_sf |> filter(sample == "People")
leaflet(city_flu_sf) |>
  addProviderTiles("CartoDB.Positron") |>
  addMarkers(data=city_flu_sf, icon = Flu) |>
  addMarkers(data=city_ppl_sf, icon = People)

Meaningful Results \(\leftrightarrow\) Comparisons with Null Model(s)

  • Cases of disease form an intensity function \(\lambda(\mathbf{s})\)

  • Controls form an “ambient” intensity function \(\lambda_0(\mathbf{s})\)

  • \(\implies\) The following model allows us to separate the quantities we really care about from the base rate information:

    \[ \lambda(\mathbf{s}) = \alpha \overbrace{\lambda_0(\mathbf{s})}^{\mathclap{\text{Base Rate}}} \underbrace{\rho(\mathbf{s})}_{\mathclap{\text{Relative Risk}}}, \]

    where

    \[ \alpha = \frac{\# \text{Cases}}{\# \text{Controls}} \]

  • (Remember: \(\lambda(\mathbf{s})\) has units of expected number of cases per unit area; it is not a probability density!)

Example from Moraga (2024)

Code
library(sparr) |> suppressPackageStartupMessages()
library(ggplot2) |> suppressPackageStartupMessages()  # for the plots below
data(pbc)
controls <- unmark(pbc[which(pbc$marks == "control"), ])
controls |> sf::st_as_sf() |> ggplot() +
  geom_sf() +
  theme_classic() +
  labs(title="Controls (Population Sample)")

Code
cases <- unmark(pbc[which(pbc$marks == "case"), ])
cases |> sf::st_as_sf() |> ggplot() +
  geom_sf() +
  theme_classic() +
  labs(title="PBC Cases")

Estimating Comparable Intensity Surfaces

  • To ensure the two intensity surfaces are comparable, we need a common bandwidth value
  • density() from spatstat makes “smart” per-pattern choices of this bandwidth, so for now we simply average the bandwidths it estimates for cases and for controls:
Code
library(sparr)
data(pbc)
cases <- unmark(pbc[which(pbc$marks == "case"), ])
controls <- unmark(pbc[which(pbc$marks == "control"), ])
bwcases <- attr(density(cases), "sigma")
bwcontr <- attr(density(controls), "sigma")
(bw <- (bwcases + bwcontr)/2)
[1] 11.45837
Code
int_controls <- density(controls, sigma = bw, eps=0.25)
plot(int_controls, main = "Control Intensity")

Code
int_cases <- density(cases, sigma = bw, eps=0.25)
plot(int_cases, main = "Case Intensity")

Visualizing Relative Risk Surface

  • All that’s left is \(\alpha = \# \text{Cases} / \# \text{Controls}\)!
Code
library(fields)
(alpha_hat <- cases$n/controls$n)
[1] 0.2519868
Code
x <- int_cases$xcol
y <- int_cases$yrow
# Estimated relative risk surface: rho(s) = lambda(s) / (alpha * lambda_0(s))
rr <- t(int_cases$v)/t(alpha_hat * int_controls$v)
image.plot(x, y, rr, asp = 1)
title(xlab = "Easting", ylab = "Northing")

From Exploratory to Confirmatory Data Analysis

  • How do we know whether the ↑ risk around North Newcastle is due to “chance”? Maybe relative risk is actually constant, and people in Newcastle are just unlucky
  • Thought experiment: a bro hands you a coin. You flip it 10 times: HHHHHTHHHH
  • The coin may be biased-towards-H, or it could be fair (lots of heads by chance!)
  • Stats: compare with a null model! For a fair coin, how likely is [9 Heads / 10 Flips]? (see the sketch below)
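We can compute this directly in base R; a minimal sketch under the fair-coin null:

Code
# P(exactly 9 heads in 10 flips of a fair coin)
dbinom(9, size=10, prob=0.5)
[1] 0.009765625

Since \(P(\geq 9 \text{ heads}) = 11/1024 \approx 0.011\), an outcome this extreme is rare enough under the null to cast doubt on the coin’s fairness.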

Exploratory Data Analysis: Collecting evidence

Confirmatory Data Analysis: Making judgements

Null Model: CSR

Figure from Gimond (2024)

The Random Labeling Hypothesis

  • Randomly apply 761 “sick” labels to the 3781 points!
Code
# NOTE: window_to_sf() and points_to_sf() are helper functions (assumed to be
# defined in an earlier setup chunk, along with the cb_palette colors) that
# split a ppp object into its window and its points as sf objects
plot_rlabel <- function(rl_ppp) {
  rl_win_sf <- window_to_sf(rl_ppp)
  rl_points_sf <- points_to_sf(rl_ppp) |>
    rename(
      type = spatstat.geom..marks.x.
    )
  rl_plot <- ggplot() +
    geom_sf(
      data=rl_win_sf,
      alpha=0.3
    ) +
    geom_sf(
      mapping=aes(fill=type, alpha=type),
      data=rl_points_sf,
      size=2,
      shape=21,
      stroke=0.1
    ) +
    scale_fill_manual(
      "Type",
      values=c("control"="gray", "case"=cb_palette[1])
    ) +
    scale_alpha_manual(
      "Type",
      values=c("control"=0.5, "case"=1.0)
    ) +
    theme_classic(base_size=16) +
    labs(
      title="NE Britain Disease Cases"
    )
  return(rl_plot)
}
set.seed(6810)
rl1_ppp <- rlabel(pbc)
plot_rlabel(rl1_ppp)

Code
set.seed(6811)
rl2_ppp <- rlabel(pbc)
plot_rlabel(rl2_ppp)

Code
set.seed(6812)
rl3_ppp <- rlabel(pbc)
plot_rlabel(rl3_ppp)

The Constant Risk Hypothesis

  • Everyone equally at risk of contracting disease, regardless of location: \(\rho(\mathbf{s}) = 1\)
  • In this case: \(\lambda_{CR}(\mathbf{s}) = \alpha \lambda_0(\mathbf{s})\)
Code
plot_pbc_sim <- function() {
    pbc_sim <- spatstat.random::rpoispp(
        lambda = alpha_hat * int_controls
    )
    # Separate window and points
    pbc_sf <- pbc_sim |> sf::st_as_sf()
    pbc_point_sf <- pbc_sf |> filter(label == "point")
    pbc_win_sf <- pbc_sf |> filter(label == "window")
    pbc_plot <- ggplot() +
      geom_sf(data=pbc_win_sf) +
      geom_sf(data=pbc_point_sf, size=0.5, alpha=0.5) +
      theme_classic()
    return(pbc_plot)
}
plot_pbc_sim()

Code
plot_pbc_sim()

Code
plot_pbc_sim()

Point Processes \(\rightarrow\) Areal Processes

  • The “Complete” Part of Complete Spatial Randomness (CSR)
  • Bringing Neighbors Back In
  • Spatial Regression

Points \(\rightarrow\) Areas

Aggregating Point Processes

  • Recall the two “stages” of our Poisson-based point processes:
    • A Poisson-distributed number of points, then
    • Uniformly-distributed coordinates for each point
  • One way to see areal data modeling: we keep the Poisson part, but we no longer observe the coordinates of the points! (see the two-stage sketch after this list)
  • So… why is it still helpful for you to have sat through those lectures on Point Processes? Three reasons…
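To make the two “stages” concrete, here is a minimal base-R sketch of the homogeneous case (the intensity value, unit-square window, and seed are illustrative):

Code
set.seed(6805)
lambda <- 100                  # intensity: expected points per unit area
win_area <- 1                  # unit-square window
# Stage 1: a Poisson-distributed number of points
n <- rpois(1, lambda * win_area)
# Stage 2: uniformly-distributed coordinates for each point
pts <- data.frame(x=runif(n), y=runif(n))
plot(pts, asp=1, pch=16, main=sprintf("n = %d points", n))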

(1) Areal Patterns = Aggregated Point Patterns

  • One application, mentioned last week: areal-weighted interpolation using actual models of how the points are distributed within the area!
  • Regularly-spaced points is rarely a good “default” model!
  • Humans, for example, rarely live at perfect evenly-spaced intervals… they form households, villages, cities
  • Regularly-spaced points (we now know) \(\implies\) negative autocorrelation \(\implies\) typically due to inhibition process (competition)
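For intuition, a minimal spatstat sketch contrasting an inhibition process with CSR (the inhibition radius, point count, and seed are illustrative):

Code
library(spatstat.random) |> suppressPackageStartupMessages()
set.seed(6805)
# Simple Sequential Inhibition: no two points closer than r = 0.05
reg_ppp <- rSSI(r=0.05, n=100)
# CSR benchmark: homogeneous Poisson with the same expected count
csr_ppp <- rpoispp(lambda=100)
par(mfrow=c(1, 2))
plot(reg_ppp, main="Inhibition (regular)")
plot(csr_ppp, main="CSR")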

(2) Spatial Scan Statistics

  • Areal regions are often the result of artificial / “arbitrary” human divisions
    • (Particulate matter doesn’t pass through customs)
  • \(\implies\) If we care about processes which don’t “adhere to” borders (like disease spread), we want to “scan” buffers around points regardless of areal borders (see the sketch below)
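One way to carry out such a scan in R is spatstat’s scan test; a minimal sketch reusing the pbc cases from above (the radius, in the pattern’s coordinate units, and the simulation count are illustrative choices):

Code
library(spatstat.explore) |> suppressPackageStartupMessages()
# Slide a circle of radius r over the window, testing for excess cases
# inside the circle relative to a homogeneous Poisson baseline
st <- scan.test(cases, r=10, nsim=99, method="poisson")
plot(st, main="Scan Statistic Surface")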

(3) Hierarchical Bayesian Smoothing

  • The issue might just be that we don’t have enough data for some areas, while we have an abundance of data in nearby areas…

From Kramer (2023), Spatial Epidemiology

Autocorrelation Reminder: Weight Matrix Defines “Neighbors”

Choices Available in spdep

  • \(w_{ij} = 1\) if \(i\) and \(j\) “touch”
    • Rook: Overlap is 1 or 2-dimensional
    • Queen: Overlap is 0, 1, or 2-dimensional
  • \(w_{ij} = \frac{1}{\text{dist}(i, j)}\)
    • \(w(\textsf{Seoul}, \textsf{Incheon})\) > \(w(\textsf{Seoul}, \textsf{Busan})\) > \(w(\textsf{Seoul}, \textsf{Jeju})\)
  • \(w_{ij} = 1\) if \(\text{dist}(i, j) < \overline{D}\)
  • \(w_{ij} = 1\) for the \(K\) nearest neighbors (see the sketch below for constructing these in spdep)
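A minimal spdep sketch of these options, using the North Carolina counties shapefile that ships with sf as stand-in data:

Code
library(spdep) |> suppressPackageStartupMessages()
library(sf) |> suppressPackageStartupMessages()
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet=TRUE)
# Queen contiguity: any shared boundary point counts as "touching"
nb_queen <- poly2nb(nc, queen=TRUE)
# Rook contiguity: requires a shared boundary segment
nb_rook <- poly2nb(nc, queen=FALSE)
# K-nearest-neighbors alternative: the K=5 nearest county centroids
nb_knn <- knn2nb(knearneigh(st_centroid(st_geometry(nc)), k=5))
# Row-standardized weights list from the queen neighbor definition
nc_listw <- nb2listw(nb_queen, style="W")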

Moran’s \(I\) One More Time

  • Round 1 (Week 6): Moran’s \(I\) as “thermometer”
  • Round 2 (Today): Could an \(I\) value this extreme occur due to random noise alone? (see the sketch below)
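A Monte Carlo answer to exactly this question, reusing the nc and nc_listw objects from the spdep sketch above (BIR74, a county birth count, is just a convenient example variable):

Code
# Permutation test: compare the observed Moran's I to I values computed
# under 999 random re-shufflings of the variable across counties
set.seed(6805)
moran.mc(nc$BIR74, listw=nc_listw, nsim=999)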

Example: Bolshevik Revolution \(\rightarrow\) “Bipolar” World

  • Null hypothesis: no spatial effect of democratization/de-democratization on neighboring countries
  • If the null hypothesis is true, and countries democratize/de-democratize independently… could this pattern still arise?

Spatial Regression

  • (We made it! This is the last Spatial Data Science topic!)
  • (Nearly all extremely-fancy applied GIS models can be implemented via Spatial Regression!)

Motivation: When Does Non-Spatial Regression “Work”?

\[ Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \cdots + \beta_MX_{i,M} + \varepsilon_i \]

  • Importance of OLS regression: it can give us the Best Linear Unbiased Estimator (BLUE)

  • This is only true if the Gauss-Markov assumptions hold; one of these is that the error terms are uncorrelated:

    \[ \text{Cov}[\varepsilon_i, \varepsilon_j] = 0 \; \forall i \neq j \]

Spatial Autocorrelation in Residuals

  • We have now seen several models / datasets where the effect of some variable \(X\) (say, population) on another variable \(Y\) (say, disease count) is spatial! (Kind of the whole point of the class 😜)
  • So, to see when OLS will “work” vs. when you need to incorporate GIS, the key step is to map the spatial distribution of the regression residuals! (a sketch follows below)
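A minimal sketch of this workflow, again with the toy nc / nc_listw objects from above (the model is a stand-in, not the Italian-elections specification):

Code
# Fit a plain OLS model, then test its residuals for spatial autocorrelation
ols_fit <- lm(BIR74 ~ AREA, data=nc)
lm.morantest(ols_fit, listw=nc_listw)
# Map the residuals to eyeball any spatial pattern
nc$ols_resid <- residuals(ols_fit)
plot(nc["ols_resid"], main="OLS Residuals")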

Example: Italian Elections

Ward and Gleditsch (2018), Figure 2.3

Ward and Gleditsch (2018), Figure 2.4

Will OLS “Work” Here?

  • Can we use OLS to derive the BLUE of the effect of GDP on voting?
Plain OLS (\(N = 477\))

                   \(\widehat{\beta}\)   SE     \(t\)
Intercept          35.30                 2.21   15.96
Log GDP per cap    13.46                 0.65   20.84

Moran’s \(I\) for residuals = 0.47(!)

Spatial Regression (\(N = 477\))

                   \(\widehat{\beta}\)   SE     \(t\)
Intercept          4.70                  1.66   2.80
Log GDP per cap    1.77                  0.48   3.66
\(\rho\)           0.87                  0.02   36.7

What Does it Mean that “Spatial Effect” is Significant?

Ward and Gleditsch (2018), Figure 2.5

Case 1: No Residual Spatial Autocorrelation

  • Example: Residuals have Moran’s \(I\) near 0…
  • …Don’t need GIS at all!

Case 2: Conditional Autoregressive Model (CAR)

  • GIS, but… only for “fixing” your regression

\[ Y_i = \underbrace{\mu_i}_{\text{Non-spatial model}} + \underbrace{\frac{1}{w_{i,\cdot}}\sum_{j \neq i} w_{ij}(Y_j - \mu_j)}_{\text{Spatial Autocorrelation}} + \varepsilon_i \]
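One way to fit a CAR specification in R is spatialreg::spautolm; a minimal sketch with the toy nc data from above (the CAR family requires symmetric weights, hence the binary style here):

Code
library(spatialreg) |> suppressPackageStartupMessages()
# Binary (style="B") weights are symmetric, as the CAR family requires
nc_listw_B <- nb2listw(nb_queen, style="B")
car_fit <- spautolm(BIR74 ~ AREA, data=nc, listw=nc_listw_B, family="CAR")
summary(car_fit)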

Case 3: Simultaneous Autoregressive Model (SAR)

  • The main event! Here we are explicitly modeling “spatially lagged” versions of our dependent variable \(Y\)!

\[ Y = \mathbf{X}\beta + \rho \underbrace{\mathbf{W}Y}_{\mathclap{\text{Spatially-lagged }Y}} + \boldsymbol\varepsilon \]
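A matching sketch of the SAR / spatial-lag model via spatialreg::lagsarlm, with the same toy data (the “Rho” in the output corresponds to \(\rho\) above):

Code
# Spatial lag model: Y = X*beta + rho*W*Y + epsilon
sar_fit <- lagsarlm(BIR74 ~ AREA, data=nc, listw=nc_listw)
summary(sar_fit)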

References

Gimond, Manuel. 2024. Introduction to GIS and Spatial Analysis. Colby College (ES214).
Ward, Michael D., and Kristian Skrede Gleditsch. 2018. Spatial Regression Models. SAGE Publications.