Lecture 10: Visualizing Big Data
DSAN 5200-03: Advanced Data Visualization
What Makes Big Data Visualization Different?
(…Let’s brainstorm!)
Memory Issues \(\leadsto\) Computational Issues
- We’ve assumed one-to-one correspondence between (immediately-accessible) data and visual encoding(s)
- When working with big data, however:
  - Full dataset may not fit in the memory available to the user’s browser!
  - Even if it does, processing (e.g., placing \(N\) points on a map) may be prohibitively slow
- \(\implies\) Some portion of the data / some computations need to be handled server-side!
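One way to picture handing computation to the server is grid binning: rather than shipping all \(N\) points to the browser, the server counts points per grid cell and sends only the counts. The sketch below is a minimal illustration under assumptions of my own, not a method taken from this lecture: the function name `bin_points`, the 100 x 100 region, and the cell size are all hypothetical.

```python
import random
from collections import Counter

def bin_points(points, cell_size):
    """Aggregate (x, y) points into square grid cells.

    Returns a dict mapping (column, row) -> number of points in that cell.
    """
    counts = Counter()
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return dict(counts)

# Hypothetical "server-side" data: 200,000 random points in a 100 x 100 region
random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(200_000)]

# The client receives one count per occupied cell instead of every point
cells = bin_points(points, cell_size=10)
print(len(points), "points ->", len(cells), "cells")
```

The payload sent to the client now depends on the number of occupied cells, not on \(N\), so it stays small no matter how much the dataset grows.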
Client-Side vs. Server-Side Computing
- Reliable estimates of computing power (in FLOPS = Floating-Point Operations Per Second) hard to come by in a world of distributed cloud computing!
- Back-of-envelope calculation:
  - A given server (AWS, GCP) has 10-100x more computing power than our laptops
  - Servers almost entirely devoted to data processing; laptops have to handle OS GUI, streaming video, conserving battery, etc.
Client-Side vs. Server-Side Memory
In Chrome, check the JS heap size limit (in GB) by running:
window.performance.memory.jsHeapSizeLimit / (10**9)
Item | Size |
---|---|
My Chrome JS Heap | 4.2947 GB |
2020 US Census data | 4.3487 GB |
Google Maps (2012) | 20 000 000.0000 GB |
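The comparison these figures invite can be worked out directly. The snippet below is only arithmetic on the three sizes listed above, using decimal units (1 GB = 1000 MB); the variable names are my own.

```python
# Sizes in GB, copied from the figures above
heap_gb = 4.2947          # Chrome JS heap limit
census_gb = 4.3487        # 2020 US Census data
maps_gb = 20_000_000.0    # Google Maps (2012)

# The Census data misses the heap limit by roughly 54 MB
shortfall_mb = (census_gb - heap_gb) * 1000
print(f"Census data exceeds the heap by {shortfall_mb:.1f} MB")

# Google Maps is roughly 4.7 million times larger than the heap
ratio = maps_gb / heap_gb
print(f"Google Maps is about {ratio:,.0f}x the heap limit")
```

So even a fairly modest public dataset already overflows the browser, and the largest ones are off by more than six orders of magnitude.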
New Opportunities
- Allow users to explore time series for arbitrarily-long windows of time!
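One way a server could support arbitrarily long windows is to downsample: whatever window the user requests, it returns at most a fixed number of aggregated points. This is a minimal sketch assuming bucket means are an acceptable summary; the function name `downsample`, the 500-point budget, and the synthetic series are all hypothetical.

```python
def downsample(series, start, end, max_points):
    """Return at most max_points (time, mean value) pairs for times in [start, end)."""
    window = [(t, v) for t, v in series if start <= t < end]
    if len(window) <= max_points:
        return window  # small enough to send as-is
    width = (end - start) / max_points
    sums = [0.0] * max_points
    counts = [0] * max_points
    for t, v in window:
        # Clamp the index to guard against floating-point rounding at the edge
        i = min(int((t - start) / width), max_points - 1)
        sums[i] += v
        counts[i] += 1
    # Report each non-empty bucket at its midpoint
    return [(start + (i + 0.5) * width, sums[i] / counts[i])
            for i in range(max_points) if counts[i]]

# Hypothetical series: one reading per second for 200,000 seconds
series = [(t, float(t % 60)) for t in range(200_000)]

one_day = downsample(series, 0, 86_400, max_points=500)
everything = downsample(series, 0, 200_000, max_points=500)
print(len(one_day), len(everything))
```

Whether the user asks for one day or the whole series, the client never has to draw more than `max_points` marks.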
Helpful Even When Data Does Fit In Memory!
- Can free user’s CPU for things like lighting computation
Is This Lighting Thing A Gimmick?
- …or a MILLION DOLLAR IDEA!!! 🤑🤑🤑
Achieving the Best of Both Worlds
The General Idea
- Ad hoc approach, figuring out what to do server-side vs. client-side “on the fly” ❌
- Instead, we can use systems which integrate them, drawing on respective strengths!
- Data Visualization Management System (DVMS)
ZQL = SQL for Visualization
- Input: Description of desired visualization
- Output: SQL query
x | y | constraints | viz |
---|---|---|---|
carrier | passengers | destination=="New York" | bar(y=sum(passengers)) |
Produces
SELECT carrier, SUM(passengers)
FROM flight_delay
WHERE destination = 'New York'
GROUP BY carrier;
- Maybe non-obvious, a priori, how this helps…
- Advantages become clear when we start to optimize!
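To make the input-to-output mapping concrete, here is a toy translator that turns one row of a ZQL-style table into an aggregate SQL query and runs it against an in-memory SQLite database. It is a sketch under stated assumptions and not the actual ZQL implementation: the function `row_to_sql`, the table name `flight_delay`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

def row_to_sql(table, x, y, agg, constraint=None):
    """Build an aggregate query from a (hypothetical) ZQL-like row spec.

    Identifiers are interpolated directly into the string, so this toy is
    only safe for trusted, hard-coded specs.
    """
    where = f" WHERE {constraint}" if constraint else ""
    return f"SELECT {x}, {agg.upper()}({y}) FROM {table}{where} GROUP BY {x}"

sql = row_to_sql("flight_delay", x="carrier", y="passengers", agg="sum",
                 constraint="destination = 'New York'")

# Made-up data, just to show that the generated query runs
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_delay (carrier TEXT, destination TEXT, passengers INTEGER)"
)
conn.executemany("INSERT INTO flight_delay VALUES (?, ?, ?)", [
    ("AA", "New York", 120), ("AA", "New York", 80),
    ("UA", "New York", 150), ("UA", "Chicago", 90),
])
rows = sorted(conn.execute(sql).fetchall())
print(sql)
print(rows)  # [('AA', 200), ('UA', 150)]
```

Because the visualization spec is structured data, not free-form SQL, the system knows in advance what shape of query each chart will need, and that is what opens the door to optimization.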
Precomputation
- SQL in general needs to handle arbitrary queries…
- But for visualization, certain queries will never be made, while others (counting, summing) will be made frequently
- Hence we can precompute, on the server side, many (most?) of the statistics for the layers / levels of aggregation that users will plausibly want to look at
- This frees up processing power on the client side, which can be applied instead towards speed, aesthetics, responsive interactivity, etc.
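The precomputation idea can be sketched as follows: sums are computed once per level of aggregation, and each chart request then becomes a dictionary lookup with no scan of the raw rows. This is an illustrative sketch only; the `flights` rows, the chosen aggregation levels, and the helpers `precompute` and `lookup` are hypothetical.

```python
from collections import defaultdict

# Made-up raw rows that would live on the server
flights = [
    {"carrier": "AA", "destination": "New York", "passengers": 120},
    {"carrier": "AA", "destination": "New York", "passengers": 80},
    {"carrier": "AA", "destination": "Chicago", "passengers": 60},
    {"carrier": "UA", "destination": "New York", "passengers": 150},
    {"carrier": "UA", "destination": "Chicago", "passengers": 90},
]

def precompute(rows, dims, measure):
    """Sum `measure` for every combination of values of the columns in `dims`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r[measure]
    return dict(totals)

# Built once on the server: one summary per level of aggregation
cache = {
    ("carrier",): precompute(flights, ("carrier",), "passengers"),
    ("carrier", "destination"): precompute(
        flights, ("carrier", "destination"), "passengers"
    ),
}

def lookup(dims, key):
    """Answer a chart request from the cache without rescanning `flights`."""
    return cache[dims].get(key, 0)

print(lookup(("carrier",), ("AA",)))                           # 260
print(lookup(("carrier", "destination"), ("UA", "New York")))  # 150
```

The trade-off is storage and staleness: each extra level of aggregation costs space on the server, and the summaries must be rebuilt when the underlying data changes.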