Lecture 10: Visualizing Big Data

DSAN 5200-03: Advanced Data Visualization

Abhijit Dasgupta

abhijit.dasgupta

Jeff Jacobs

jj1088

Anderson Monken

aem303

Marck Vaisman

marck.vaisman

Tuesday, January 30, 2024

What Makes Big Data Visualization Different?

(…Let’s brainstorm!)

Memory Issues \(\leadsto\) Computational Issues

  • We’ve assumed one-to-one correspondence between (immediately-accessible) data and visual encoding(s)
  • When working with big data, however:
    • Full dataset may not fit in the memory available to the user’s browser!
    • Even if it does, processing (e.g., placing \(N\) points on map) may be prohibitively slow
  • \(\implies\) Some portion of data / some computations need to be handled server side!
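
One way to picture "handling some computations server side" is to aggregate the \(N\) points into grid cells before anything is sent to the browser. The following is only a minimal sketch under stated assumptions: the function name `bin_points`, the cell size, and the synthetic integer coordinates are all hypothetical and not part of the lecture.

```python
from collections import Counter


def bin_points(points, cell_size):
    """Aggregate (x, y) points into square grid cells of side `cell_size`.

    Returns a Counter mapping (col, row) cell indices to point counts, so
    the payload sent to the client scales with the number of occupied
    cells rather than with the number of raw points.
    """
    counts = Counter()
    for x, y in points:
        # Floor division assigns each point to the cell that contains it.
        counts[(x // cell_size, y // cell_size)] += 1
    return counts


# Hypothetical data: 100,000 points with integer coordinates in [0, 1000).
points = [(i % 1000, (i * 7) % 1000) for i in range(100_000)]

# With 100-unit cells there are at most 10 x 10 = 100 cells to transmit.
cells = bin_points(points, cell_size=100)
print(f"{len(points):,} points reduced to {len(cells)} cells")
```

The client then draws one mark per cell (for example, a heatmap tile) instead of one mark per point.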

Client-Side vs. Server-Side Computing

  • Reliable estimates of computing power (in FLOPS = Floating-Point Operations Per Second) hard to come by in a world of distributed cloud computing!
  • Back-of-envelope calculation:
    • A typical cloud server (e.g., on AWS or GCP) has roughly 10-100x the computing power of our laptops
    • Servers almost entirely devoted to data processing; laptops have to handle OS GUI, streaming video, conserving battery, etc.

Client-Side vs. Server-Side Memory

  • In Chrome, check the JS heap size limit (in GB) by running the following in the DevTools console:

    window.performance.memory.jsHeapSizeLimit / (10**9)

| Item | Size |
|---|---|
| My Chrome JS heap | 4.2947 GB |
| 2020 US Census data | 4.3487 GB |
| Google Maps (2012) | 20,000,000.0000 GB |
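
The comparison above can be made explicit with a little arithmetic. This sketch simply reuses the three sizes listed above; the helper `fits_in_heap` is a hypothetical name, and it optimistically assumes the whole heap is available for data, which is not true in practice.

```python
# Sizes in GB, copied from the figures above.
HEAP_LIMIT_GB = 4.2947
datasets_gb = {
    "2020 US Census data": 4.3487,
    "Google Maps (2012)": 20_000_000.0,
}


def fits_in_heap(size_gb, heap_gb=HEAP_LIMIT_GB):
    """True if the raw data alone would fit under the JS heap limit."""
    return size_gb <= heap_gb


for name, size_gb in datasets_gb.items():
    ratio = size_gb / HEAP_LIMIT_GB
    print(f"{name}: {ratio:,.2f}x the heap limit, fits: {fits_in_heap(size_gb)}")
```

The Census data overshoots the limit by about 1%, while the Google Maps figure exceeds it by a factor in the millions, so neither could be loaded client side in full.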

New Opportunities

  • Allow users to explore time series for arbitrarily-long windows of time!

Helpful Even When Data Does Fit In Memory!

  • Can free user’s CPU for things like lighting computation
Figure 2: “Astronomically correct lighting allows users to see how different buildings shade each other during different times of day and year.”

Is This Lighting Thing A Gimmick?

  • …or a MILLION DOLLAR IDEA!!! 🤑🤑🤑

Achieving the Best of Both Worlds

The General Idea

  • Ad hoc approach, figuring out what to do server-side vs. client-side “on the fly” ❌
  • Instead, we can use systems which integrate them, drawing on respective strengths!
  • Data Visualization Management System (DVMS)

ZQL = SQL for Visualization

  • Input: Description of desired visualization
  • Output: SQL query

| x | y | constraints | viz |
|---|---|---|---|
| carrier | passengers | destination=="New York" | bar(y=sum(passengers)) |

Produces

    SELECT carrier, SUM(passengers)
    FROM flight_delay
    WHERE destination = 'New York'
    GROUP BY carrier;

  • Maybe non-obvious, a priori, how this helps…
  • Advantages become clear when we start to optimize!
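
To make the input-to-output mapping concrete, here is a rough sketch of how one simplified ZQL-style row might be turned into a SQL query and run. This is not the actual ZQL implementation: the function `zql_to_sql`, its parameters, the table name `flight_delay`, and the sample rows are all assumptions made for illustration, and the in-memory SQLite database merely stands in for the server.

```python
import sqlite3


def zql_to_sql(x, y, y_agg, table, constraint=None):
    """Build a GROUP BY query from one simplified ZQL-style row.

    Hypothetical helper: real ZQL is richer than this. Column and table
    names are interpolated directly, so they must come from trusted input.
    """
    sql = f"SELECT {x}, {y_agg.upper()}({y})\nFROM {table}"
    params = ()
    if constraint is not None:
        column, value = constraint
        # Filtering (WHERE) has to come before grouping (GROUP BY).
        sql += f"\nWHERE {column} = ?"
        params = (value,)
    sql += f"\nGROUP BY {x};"
    return sql, params


# Made-up sample data, standing in for the server-side table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_delay (carrier TEXT, destination TEXT, passengers INTEGER)"
)
conn.executemany(
    "INSERT INTO flight_delay VALUES (?, ?, ?)",
    [
        ("AA", "New York", 100),
        ("AA", "New York", 50),
        ("UA", "New York", 70),
        ("UA", "Chicago", 999),
    ],
)

sql, params = zql_to_sql(
    x="carrier", y="passengers", y_agg="sum",
    table="flight_delay", constraint=("destination", "New York"),
)
rows = sorted(conn.execute(sql, params).fetchall())
print(sql)
print(rows)
```

Only the aggregated rows (one per carrier) would need to travel to the client, which then draws the bar chart.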

Precomputation

  • SQL in general needs to handle arbitrary queries…
  • But for visualization, certain queries will never be made, while others (counting, summing) will be made frequently
  • Hence we can precompute, on the server side, many (most?) of the statistics for the layers / levels of aggregation that users will plausibly want to look at
  • This frees up processing power on the client side, which can be applied instead towards speed, aesthetics, responsive interactivity, etc.
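
The precomputation idea can be sketched in a few lines: sum once per grouping column ahead of time, then answer each chart request with a lookup. This is an illustrative toy rather than how any particular DVMS does it; the function `precompute_sums`, the column names, and the sample rows are hypothetical.

```python
from collections import defaultdict


def precompute_sums(rows, group_keys, value_key):
    """Precompute SUM(value_key) for each single-column grouping.

    Returns {group_key: {group_value: total}}, built in one pass over
    the data, so later requests never have to rescan `rows`.
    """
    totals = {key: defaultdict(int) for key in group_keys}
    for row in rows:
        for key in group_keys:
            totals[key][row[key]] += row[value_key]
    return totals


# Made-up sample data.
rows = [
    {"carrier": "AA", "destination": "New York", "passengers": 100},
    {"carrier": "AA", "destination": "New York", "passengers": 50},
    {"carrier": "UA", "destination": "New York", "passengers": 70},
    {"carrier": "UA", "destination": "Chicago", "passengers": 30},
]

# Done once, server side, before any user asks for a chart.
totals = precompute_sums(rows, ["carrier", "destination"], "passengers")

# Serving a "passengers by carrier" bar chart is now a dictionary lookup.
by_carrier = dict(totals["carrier"])
print(by_carrier)
```

The trade-off is storage and upfront work on the server in exchange for fast responses, which pays off when the set of likely aggregations is small and predictable.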

The Power of Precomputation I

The Power of Precomputation II

Precomputation: Designing for an Audience
