But, just the “plain”, basic version built into Python (map(), functools.reduce())
Last Week: Athena with AWS Glue under the hood
“Automated” ETL Pipeline!
This Week: Hitting a wall with Athena… It doesn’t know in advance what [types of] queries you’re going to make!
If you’re going to group by (e.g.) state, then rent 50 computers to process one state each, Athena doesn’t know to place the Ohio data on the Ohio computer!
We need a way to “steer” Map-Reduce’s choices… Enter Hadoop MapReduce!
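To make the "Ohio data on the Ohio computer" idea concrete, here is a minimal sketch of key-based partitioning in plain Python. The records, the `partition_by_key` helper, and the two-worker setup are all hypothetical, invented for illustration; this is not how Athena or Hadoop route data internally, only the general idea of steering rows by their group-by key.

```python
from collections import defaultdict

# Hypothetical (state, rent) records, invented for illustration
records = [("OH", 900), ("CA", 2400), ("OH", 1100), ("NY", 3000), ("CA", 2600)]

def partition_by_key(rows, num_workers):
    """Route every row with the same key to the same (simulated) worker."""
    # A fixed key -> worker table keeps routing deterministic across runs
    # (Python's built-in hash() of a str is randomized per process).
    keys = sorted({key for key, _ in rows})
    worker_of = {key: i % num_workers for i, key in enumerate(keys)}
    workers = defaultdict(list)
    for key, value in rows:
        workers[worker_of[key]].append((key, value))
    return dict(workers)

def mean_by_state(rows):
    """Group-by mean computed entirely from one worker's local rows."""
    totals = defaultdict(lambda: [0, 0])  # state -> [sum, count]
    for state, rent in rows:
        totals[state][0] += rent
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}

partitions = partition_by_key(records, num_workers=2)

# Each worker already holds *all* rows for its states, so no worker needs
# to talk to another one before producing its final group means.
local_results = [mean_by_state(rows) for rows in partitions.values()]
combined = {state: m for result in local_results for state, m in result.items()}
print(combined)
```

The design point: once the partitioning key matches the group-by key, the per-group work becomes embarrassingly parallel again.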
A Reminder: Map-Reduce as a Paradigm
What Happens When Not Embarrassingly Parallel?
Think of the difference between linear and quadratic equations in algebra:
\(3x - 1 = 0\) is “embarrassingly” solvable, on its own: you can solve it directly, by adding 1 to both sides and dividing by 3 \(\implies x = \frac{1}{3}\). Same for \(2x + 3 = 0 \implies x = -\frac{3}{2}\)
Now consider \(6x^2 + 7x - 3 = 0\): Harder to solve “directly”, so your instinct might be to turn to the laborious quadratic formula: \(x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\)
And yet, \(6x^2 + 7x - 3 = (3x - 1)(2x + 3)\), meaning that we could have split the problem into two “embarrassingly” solvable pieces, then combined their solutions to get the result!
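As a quick sanity check on the factoring claim, here is a small sketch that solves each linear piece independently and then confirms both answers are roots of the original quadratic. The helper names `solve_linear` and `quadratic` are made up for this example; exact fractions are used so there is no floating-point fuzz.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve a*x + b = 0 exactly."""
    return Fraction(-b, a)

def quadratic(x):
    """The original 'hard' problem: 6x^2 + 7x - 3."""
    return 6 * x**2 + 7 * x - 3

# The two "embarrassingly solvable" pieces: (3x - 1) and (2x + 3)
factors = [(3, -1), (2, 3)]

# Each piece is solved on its own, with no knowledge of the other
roots = [solve_linear(a, b) for a, b in factors]

# Plugging each root back into the quadratic should give exactly zero
residuals = [quadratic(x) for x in roots]
print(roots, residuals)
```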
The Analogy to Map-Reduce
\(\leadsto\) If code is not embarrassingly parallel (instinctually requiring laborious serial execution),
\(\underbrace{6x^2 + 7x - 3 = 0}_{\text{Solve using Quadratic Eqn}}\)
But can be split into…
\((3x - 1)(2x + 3) = 0\)
Embarrassingly parallel pieces which combine to the same result,
from functools import reduce
my_reduce = reduce(lambda piece1, piece2: piece1 + piece2, map_result)
my_reduce
9
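For a self-contained version of the same pattern, here is a minimal sketch using only the built-in `map()` and `functools.reduce()`. The `words` list is hypothetical (it is not the data behind the output above). The sketch also splits the input across two pretend "workers" to show that reducing per-chunk and then reducing the partial results gives the same answer as one serial pass, which works here because addition is associative.

```python
from functools import reduce
from operator import add

# Hypothetical input, invented for illustration
words = ["map", "reduce", "hadoop", "athena"]

# Serial version: map each word to its length, then reduce by summing
lengths = list(map(len, words))
serial_total = reduce(add, lengths)

# "Parallel" version: each chunk is mapped and reduced on its own...
chunks = [words[:2], words[2:]]
partial_totals = [reduce(add, map(len, chunk)) for chunk in chunks]

# ...and the partial results are reduced once more to get the final answer
combined_total = reduce(add, partial_totals)

print(serial_total, partial_totals, combined_total)
```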
But… Why is All This Weird Mapping and Reducing Necessary?
Without knowing a bit more of the internals of computing efficiency, it may seem like a huge cost in terms of overly-complicated overhead, not to mention learning curve…
The “Killer Application”: Matrix Multiplication
(I learned from Jeff Ullman, who did the obnoxious Stanford thing of mentioning in passing how “two previous students in the class did this for a cool final project on web crawling and, well, it escalated quickly”, aka became Google)
From Leskovec, Rajaraman, and Ullman (2014), which is (legally) free online!
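To give a feel for why this is the "killer application", here is a minimal sketch of a matrix-vector product (the simplest case of matrix multiplication) written in map-reduce style. The matrix `M`, the vector `v`, and the `mapper` helper are hypothetical, and this is plain single-machine Python rather than Hadoop code: each nonzero entry is mapped to a (row, partial product) pair, pairs are grouped by row, and each group is reduced by summing.

```python
from collections import defaultdict
from functools import reduce
from operator import add

# Hypothetical 2x3 sparse matrix as (row, col, value) triples, plus a vector
M = [(0, 0, 1), (0, 1, 2), (0, 2, 3), (1, 0, 4), (1, 2, 6)]
v = [1, 0, 2]

def mapper(entry):
    """Map one matrix entry m_ij to the pair (i, m_ij * v_j)."""
    i, j, m_ij = entry
    return (i, m_ij * v[j])

# Map step: every entry is processed independently of all the others
mapped = map(mapper, M)

# Shuffle step: group the partial products by output row
groups = defaultdict(list)
for row, product in mapped:
    groups[row].append(product)

# Reduce step: sum each row's partial products
result = {row: reduce(add, products) for row, products in groups.items()}
print(result)
```

Note that the grouping key (the output row) is exactly the kind of thing a framework needs to know in order to place the right data on the right machine.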