DSAN 6000: Big Data and Cloud Computing
Fall 2025
Monday, September 8, 2025
| Term | Definition |
|---|---|
| Local | Your current workstation (laptop, desktop, etc.): wherever you start the terminal/console application. |
| Remote | Any machine you connect to via `ssh` or other means. |
| EC2 | A single virtual machine in the cloud where you can run computation. |
| SageMaker | An Integrated Development Environment (IDE) where you can conduct data science on single machines or run distributed training. |
| GPU | Graphics Processing Unit: specialized hardware for parallel computation, essential for AI/ML. |
| TPU | Tensor Processing Unit: Google's custom AI accelerator chips. |
| Ephemeral | Lasting for a short time: any machine that will get turned off, or any place where you will lose data. |
| Persistent | Lasting for a long time: any environment where your work is NOT lost when the timer goes off. |
| | Multithreading | Asynchronous Execution |
|---|---|---|
| Unconsciously (you do it already, "naturally") | Focus on one speaker within a loud room, with tons of other conversations entering your ears | Put something in the oven, set an alarm, go do something else, take it out of the oven once the alarm goes off |
| Consciously (you can do it with effort/practice) | Pat head (up and down) and rub stomach (circular motion) "simultaneously" | Throw a ball in the air, clap 3 times, catch the ball |
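As a rough illustration in code (a toy sketch; the task names are hypothetical, not from the slides): threads let two tasks make progress at the same time, while asynchronous execution sets an "alarm" and does other work until it goes off.

```python
import asyncio
import threading
import time

# Multithreading: two tasks make progress "simultaneously" on separate threads
def pat_head():
    for _ in range(3):
        print("pat head")
        time.sleep(0.1)

def rub_stomach():
    for _ in range(3):
        print("rub stomach")
        time.sleep(0.1)

t1 = threading.Thread(target=pat_head)
t2 = threading.Thread(target=rub_stomach)
t1.start()
t2.start()
t1.join()
t2.join()

# Asynchronous execution: put the dish in the oven, do something else,
# and come back when the alarm goes off
async def bake():
    print("put dish in oven, set alarm")
    await asyncio.sleep(0.3)  # the alarm: yield control while waiting
    print("alarm went off: take dish out of oven")

async def other_chores():
    print("doing the dishes while the oven runs")

async def kitchen():
    await asyncio.gather(bake(), other_chores())

asyncio.run(kitchen())
```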
**Yes - Parallelize These**

- Data Preparation:
- Data Processing:

**No - Keep Sequential**

- Order-Dependent:
- Global Operations:
Data operations in the "No" column often require global coordination or must maintain strict ordering. However, many can be approximated with parallel algorithms (e.g., approximate deduplication with locality-sensitive hashing).
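As a minimal sketch of the distinction (the function names here are illustrative assumptions): per-record cleaning maps safely across workers, while a running total is inherently order-dependent.

```python
from multiprocessing import Pool

def clean(record: str) -> str:
    # "Yes" column: no record depends on any other record,
    # so this maps safely across parallel workers
    return record.strip().lower()

def running_totals(values):
    # "No" column: each partial sum depends on the previous one,
    # so this loop is inherently order-dependent
    totals, acc = [], 0
    for v in values:
        acc += v
        totals.append(acc)
    return totals

if __name__ == "__main__":
    records = ["  Apple ", "BANANA", " Cherry  "]
    with Pool(processes=2) as pool:
        print(pool.map(clean, records))    # ['apple', 'banana', 'cherry']
    print(running_totals([1, 2, 3, 4]))    # [1, 3, 6, 10]
```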
\[ \begin{align*} x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = \frac{-7 \pm \sqrt{49 - 4(6)(-3)}}{2(6)} = \frac{-7 \pm 11}{12} = \left\{\frac{1}{3},-\frac{3}{2}\right\} \end{align*} \]
| | |
|---|---|
| \(\leadsto\) If code is not embarrassingly parallel (seemingly requiring laborious serial execution)… | \(\underbrace{6x^2 + 7x - 3 = 0}_{\text{Solve using Quadratic Eqn}}\) |
| …but can be split into… | \((3x - 1)(2x + 3) = 0\) |
| …embarrassingly parallel pieces which combine to the same result… | \(\underbrace{3x - 1 = 0}_{\text{Solve directly}}, \underbrace{2x + 3 = 0}_{\text{Solve directly}}\) |
| …then we can use map-reduce to achieve ultra speedup (running the "pieces" on a GPU!) | \(\underbrace{(3x-1)(2x+3) = 0}_{\text{Solutions satisfy this product}}\) |
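A minimal sketch of this split in code (the `(a, b)` factor representation is an assumption for illustration): each linear factor \(ax + b = 0\) is solved independently in the map step, and the reduce step collects the roots.

```python
from functools import reduce

# The factored problem (3x - 1)(2x + 3) = 0, each factor stored as (a, b)
# so that it represents ax + b = 0
factors = [(3, -1), (2, 3)]

# Map: solve each embarrassingly parallel piece independently
roots = map(lambda ab: -ab[1] / ab[0], factors)

# Reduce: combine the solved pieces into one solution set
solutions = reduce(lambda acc, r: acc | {r}, roots, set())
print(solutions)  # {0.3333333333333333, -1.5} == {1/3, -3/2}
```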
Problem from DSAN 5000/5100: Computing SSR (Sum of Squared Residuals)
\(y = (1,3,2), \widehat{y} = (2, 5, 0) \implies \text{SSR} = (1-2)^2 + (3-5)^2 + (2-0)^2 = 9\)
Computing the pieces separately (the map step), then combining the solved pieces (the reduce step):
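A minimal reconstruction of that computation in map-reduce style (the original slide code isn't shown here, so this is a sketch):

```python
from functools import reduce

y     = [1, 3, 2]
y_hat = [2, 5, 0]

# Map: each squared residual is an independent "piece"
sq_resids = list(map(lambda pair: (pair[0] - pair[1]) ** 2, zip(y, y_hat)))
print(sq_resids)  # [1, 4, 4]

# Reduce: combine the solved pieces into the SSR
ssr = reduce(lambda acc, r: acc + r, sq_resids, 0)
print(ssr)  # 9
```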
You may have noticed: `map()` and `reduce()` are "meta-functions": functions that take other functions as inputs. In Python, functions can be used as variables (hence `lambda`):
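For example (a minimal illustration using only the standard library):

```python
from functools import reduce

# A function is an ordinary value: it can be assigned to a variable...
def square(x):
    return x * x

f = square
print(f(4))  # 16

# ...or written inline with lambda and handed straight to a meta-function
print(list(map(lambda x: x * x, [1, 2, 3])))         # [1, 4, 9]
print(reduce(lambda acc, x: acc + x, [1, 2, 3], 0))  # 6
```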
This relates to a whole paradigm, "functional programming": mostly outside the scope of this course, but it offers lots of important and useful takeaways/rules of thumb!
In CS theory: enables formal proofs of correctness.

In CS practice:

> When a program doesn't work, each function is an interface point where you can check that the data are correct. You can look at the intermediate inputs and outputs to quickly isolate the function that's responsible for a bug.

(from Python's "Functional Programming HOWTO")
Easy case: you find a typo in the punctuation-removal step of a pipeline whose stages are marked by comments like `# Convert to lowercase`. Fix the error and add a comment like `# Remove punctuation`.
Rule 1 of FP: transform these comments into function names, as in the sketch below.
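A minimal sketch of Rule 1 in action (the pipeline steps are assumed from the comments above, not taken from the original code): each comment becomes a named function, and each function becomes an interface point you can inspect while debugging.

```python
import string

def convert_to_lowercase(text: str) -> str:
    # Was the comment "# Convert to lowercase"
    return text.lower()

def remove_punct(text: str) -> str:
    # Was the comment "# Remove punctuation"
    return text.translate(str.maketrans("", "", string.punctuation))

def clean_text(text: str) -> str:
    # The pipeline now names its own steps
    return remove_punct(convert_to_lowercase(text))

print(clean_text("Hello, World!"))  # hello world
```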
Hard case: something in `load_text()` modifies a variable that later on breaks `remove_punct()` (this is called a *side-effect*).

Rule 2 of FP: NO SIDE-EFFECTS!!! A function like `remove_punct()` should depend only on its explicit inputs and affect the program only through its return value.
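A minimal sketch of the hard case (both functions are hypothetical reconstructions): the side-effecting loader quietly rebinds shared state that `remove_punct()` depends on, while the pure versions pass everything explicitly.

```python
# Side-effecting style: load_text() quietly rebinds shared state...
PUNCT = ".,!?"

def load_text(raw: str) -> str:
    global PUNCT
    PUNCT = None  # hidden side-effect: breaks remove_punct() much later
    return raw.strip()

def remove_punct(text: str) -> str:
    # ...which this function silently depends on: crashes once PUNCT is None
    return "".join(ch for ch in text if ch not in PUNCT)

# Pure style: everything a function needs arrives as an argument,
# and everything it produces leaves as the return value
def load_text_pure(raw: str) -> str:
    return raw.strip()

def remove_punct_pure(text: str, punct: str = ".,!?") -> str:
    return "".join(ch for ch in text if ch not in punct)

print(remove_punct_pure(load_text_pure("  Hello, world!  ")))  # Hello world
```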
Figure from Leskovec, Rajaraman, and Ullman (2014), which is (legally) free online!
Figure from Cornell Virtual Workshop, "Understanding GPU Architecture".
DSAN 6000 Week 3: Parallelization Concepts