Lecture 10: Visualizing Big Data

DSAN 5200-03: Advanced Data Visualization

Authors

Abhijit Dasgupta

Jeff Jacobs

Anderson Monken

Marck Vaisman

Published

Tuesday, January 30, 2024


What Makes Big Data Visualization Different?

(…Let’s brainstorm!)

Memory Issues \(\leadsto\) Computational Issues

  • We’ve assumed one-to-one correspondence between (immediately-accessible) data and visual encoding(s)
  • When working with big data, however:
    • Full dataset may not fit in the memory available to the user’s browser!
    • Even if it does, processing (e.g., placing \(N\) points on map) may be prohibitively slow
  • \(\implies\) Some portion of data / some computations need to be handled server side!
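One way to picture moving work to the server is spatial binning: rather than shipping all \(N\) points to the browser, the server collapses them into per-cell counts, and the client draws at most one mark per occupied cell. The sketch below is only an illustration under assumptions: `binPoints`, the `{x, y}` point shape, and `cellSize` are hypothetical names, and a real system might bin by map tile or zoom level instead.

```javascript
// Hypothetical server-side helper (not from the lecture): collapse raw
// points into grid-cell counts so the client receives cells, not points.
function binPoints(points, cellSize) {
  const counts = new Map();
  for (const { x, y } of points) {
    // Integer grid coordinates of the cell containing this point
    const col = Math.floor(x / cellSize);
    const row = Math.floor(y / cellSize);
    const key = `${col},${row}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // One record per occupied cell
  return Array.from(counts, ([key, count]) => {
    const [col, row] = key.split(',').map(Number);
    return { col, row, count };
  });
}

// Made-up sample points, for illustration only
const samplePoints = [
  { x: 0.5, y: 0.5 },
  { x: 0.7, y: 0.2 },
  { x: 1.5, y: 0.1 },
];
const cells = binPoints(samplePoints, 1);
console.log(cells);
```

The payload sent to the client then grows with the number of occupied cells rather than with \(N\).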

Client-Side vs. Server-Side Computing

  • Reliable estimates of computing power (in FLOPS = Floating-Point Operations Per Second) are hard to come by in a world of distributed cloud computing!
  • Back-of-envelope calculation:
    • A given server (AWS, GCP) has 10-100x more computing power than our laptops
    • Servers almost entirely devoted to data processing; laptops have to handle OS GUI, streaming video, conserving battery, etc.

Client-Side vs. Server-Side Memory

  • In Chrome, check the JS heap size limit (in GB) by running this in the DevTools console:

    window.performance.memory.jsHeapSizeLimit / (10**9)

Source                 Size (GB)
My Chrome JS Heap      4.2947
2020 US Census data    4.3487
Google Maps (2012)     20,000,000.0000
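
The comparison in these figures can be written as a quick check. This is a minimal sketch: `fitsInHeap` and `toGB` are hypothetical helper names, the sizes are the ones quoted above, and the heap limit is hard-coded so the snippet also runs outside Chrome. Fitting under the limit is necessary but not sufficient, since the page needs heap space for everything else it does.

```javascript
const BYTES_PER_GB = 10 ** 9; // decimal GB, matching the console snippet

// Convert a byte count (e.g. jsHeapSizeLimit) to GB
function toGB(bytes) {
  return bytes / BYTES_PER_GB;
}

// Hypothetical helper: true if the dataset is no larger than the heap limit
function fitsInHeap(datasetGB, heapLimitGB) {
  return datasetGB <= heapLimitGB;
}

const heapLimitGB = 4.2947; // "My Chrome JS Heap" figure quoted above
const datasets = [
  { name: '2020 US Census data', sizeGB: 4.3487 },
  { name: 'Google Maps (2012)', sizeGB: 20000000 },
];

for (const d of datasets) {
  const ratio = d.sizeGB / heapLimitGB;
  console.log(
    `${d.name}: fits=${fitsInHeap(d.sizeGB, heapLimitGB)}, ` +
    `${ratio.toFixed(2)}x the heap limit`
  );
}
```

Neither dataset fits: the Census data exceeds the limit only slightly, while the Google Maps figure exceeds it by several orders of magnitude.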

New Opportunities

  • Allow users to explore time series for arbitrarily-long windows of time!

Helpful Even When Data Does Fit In Memory!

  • Can free user’s CPU for things like lighting computation
Figure 2: “Astronomically correct lighting allows users to see how different buildings shade each other during different times of day and year.”

Is This Lighting Thing A Gimmick?

  • …or a MILLION DOLLAR IDEA!!! 🤑🤑🤑

Achieving the Best of Both Worlds

The General Idea

  • Ad hoc approach, figuring out what to do server-side vs. client-side “on the fly” ❌
  • Instead, we can use systems which integrate them, drawing on respective strengths!
  • Data Visualization Management System (DVMS)

ZQL = SQL for Visualization

  • Input: Description of desired visualization
  • Output: SQL query
x        y           constraints               viz
carrier  passengers  destination=="New York"   bar(y=sum(passengers))

Produces

    SELECT carrier, SUM(passengers)
    FROM flight_delay
    WHERE destination = 'New York'
    GROUP BY carrier;

  • Maybe non-obvious, a priori, how this helps…
  • Advantages become clear when we start to optimize!
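
To make the input/output relationship concrete, here is a toy translator for the single row shown above. It is a sketch under assumptions, not the actual ZQL compiler: `zqlRowToSql` is a hypothetical function, the table name `flight_delay` is assumed, and only the `chart(y=agg(column))` viz form and a single `column=="value"` constraint are handled. It emits clauses in standard SQL order, with `WHERE` before `GROUP BY`.

```javascript
// Hypothetical translator for one ZQL-style row (illustration only).
function zqlRowToSql(row, table) {
  // Parse a viz spec such as: bar(y=sum(passengers))
  const viz = /^(\w+)\(y=(\w+)\((\w+)\)\)$/.exec(row.viz);
  if (!viz) throw new Error(`Unsupported viz spec: ${row.viz}`);
  const [, chart, agg, aggColumn] = viz;
  if (aggColumn !== row.y) {
    throw new Error('Aggregated column must match the y column');
  }

  // Parse an optional constraint such as: destination=="New York"
  let where = null;
  if (row.constraints) {
    const c = /^(\w+)\s*==\s*"([^"]*)"$/.exec(row.constraints);
    if (!c) throw new Error(`Unsupported constraint: ${row.constraints}`);
    // Escape single quotes for a SQL string literal
    where = `WHERE ${c[1]} = '${c[2].replace(/'/g, "''")}'`;
  }

  const sql = [
    `SELECT ${row.x}, ${agg.toUpperCase()}(${row.y})`,
    `FROM ${table}`,
    where,
    `GROUP BY ${row.x}`,
  ].filter(Boolean).join('\n') + ';';

  return { chart, sql };
}

const result = zqlRowToSql(
  {
    x: 'carrier',
    y: 'passengers',
    constraints: 'destination=="New York"',
    viz: 'bar(y=sum(passengers))',
  },
  'flight_delay' // assumed table name
);
console.log(result.chart);
console.log(result.sql);
```

Because the visualization is described declaratively, the system rather than the user decides how the query is phrased and where it runs, which is what opens the door to optimization.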

Precomputation

  • SQL in general needs to handle arbitrary queries…
  • But for visualization, certain queries will never be made, while others (counting, summing) will be made frequently
  • Hence we can precompute, on the server side, many (most?) of the statistics for the layers / levels of aggregation that users will plausibly want to look at
  • This frees up processing power on the client side, which can be applied instead towards speed, aesthetics, responsive interactivity, etc.
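
A minimal sketch of the idea, assuming the frequent query is a sum grouped by a fixed set of columns: the server scans the raw rows once, stores one total per key combination, and each later request becomes a lookup instead of a scan. The names `precomputeSums` and `lookupSum` are hypothetical, and the flight rows are made-up illustration data.

```javascript
const KEY_SEPARATOR = '\u0000'; // unlikely to occur inside key values

// Hypothetical server-side step: one pass over the raw rows, storing
// SUM(valueCol) for every combination of values in keyCols.
function precomputeSums(rows, keyCols, valueCol) {
  const sums = new Map();
  for (const row of rows) {
    const key = keyCols.map(col => row[col]).join(KEY_SEPARATOR);
    sums.set(key, (sums.get(key) || 0) + row[valueCol]);
  }
  return sums;
}

// Answering a request is now a single Map lookup, not a scan
function lookupSum(sums, keyValues) {
  return sums.get(keyValues.join(KEY_SEPARATOR)) || 0;
}

// Made-up rows, for illustration only
const flights = [
  { destination: 'New York', carrier: 'AA', passengers: 120 },
  { destination: 'New York', carrier: 'AA', passengers: 80 },
  { destination: 'New York', carrier: 'DL', passengers: 150 },
  { destination: 'Chicago', carrier: 'AA', passengers: 90 },
];

const sums = precomputeSums(flights, ['destination', 'carrier'], 'passengers');
console.log(lookupSum(sums, ['New York', 'AA']));
```

The trade-off is storage and refresh cost on the server in exchange for constant-time responses, which only pays off because the set of queries a visualization issues is small and predictable.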

The Power of Precomputation I

The Power of Precomputation II

Precomputation: Designing for an Audience