Week 12: Final Projects, Map-Reduce

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Class Sessions

Author

Affiliation

Jeff Jacobs

jj1088@georgetown.edu

Published

Thursday, April 3, 2025

Open slides in new window →

Schedule

Today’s Planned Schedule:

	Start	End	Topic
Lecture	6:30pm	6:45pm	Final Projects →
	6:45pm	7:00pm	HW4 on Jetstream →
	7:00pm	8:00pm	Beyond Embarrassingly Parallel →
Break!	8:00pm	8:10pm
	8:10pm	9:00pm	Data Mining in Parallel →

Making Final Projects Less Scary!

How Do I Pick A Topic?

I know that “whatever is interesting to you” can be way overly-vague!
So, one approach is: imagine yourself in a job interview for your dream job, and they bring up DSAN 5450: “Interesting, what did you do in that class?”
[Insert final project elevator pitch] “Wow, that’s such a cool project, we really want someone who can [say] take a data-driven approach to a policy question like that. You’re hired!”
(Jeff gets a commission: 10% of your salary)

Getting From Here to There

Minimum Viable Product (MVP)
$\leadsto$ Final Product (but… Hofstadter’s Law)

Hofstadter’s Law (Paraphrase)

The pieces of your DSAN final project will take longer than you expect, even if you take Hofstadter’s Law into account

The Options

A Deeper Dive
Fixing an Issue
Exploring Applications
Proving Correctness and/or Efficiency
Building an ETL Pipeline

(New examples on next 5 slides!)

1. A Deeper Dive

Example 1.3: Creating an interactive visualization (using Streamlit, for example) of a data structure or algorithm. (Streamlit demo coming in next section of slides!)

2. Fixing an Issue

Example 2.2: As you saw in the Midterm, hash tables “degenerate” from (approximate) $O(1)$ efficiency to $O(\log_2(n))$ efficiency if the hashing function we use is not efficient or not collision-resistant.
So, your project could be to:
- Explore and summarize how efficient and collision-resistant hash functions work, and then
- Implement one such hash function in Python.

3. Exploring Applications

Example 3.2: We learned Breadth-First Search (BFS) and Depth-First Search (DFS) in somewhat of a hurry, as a way to traverse over the nodes in a Binary Search Tree, but they both have many more exciting applications!
For example, if you’re interested in web scraping, you could adapt the HW2 code to create a polymorphic web-scraper:
- Each node is a webpage
- Processing a node means adding each link in that page to the ThingContainer.

4. Proving Things

Example 4.2: We were able to gradually improve retrieval:

Linear		Logarithmic		Constant
`LinkedList` $O(n)$	$\rightarrow$	`BST` $O(\log_2(n))$	$\rightarrow$	`HashTable` $O(1 + \varepsilon)$

But then for search, something was missing 🧐:

Linear		Logarithmic		Constant
Insertion-Sort $O(n^2)$	$\rightarrow$	Merge-Sort $O(n\log_2(n))$	$\rightarrow$	???

Prove that it’s not possible to make the additional “jump” to constant-time search in $O(n)$, except in very special cases.

5. Building an ETL Pipeline

Example 5.2: If there are particular APIs and Database solutions that you’ve been hoping to explore, now’s your chance! Create an ETL pipeline which Extracts data from the API, Transforms it in some interesting way, then Loads it into (for example) an SQL or MongoDB database.

Nuts and Bolts for Projects

Fake Data!

When developing (as opposed to deploying) your projects, real data can add additional complexity beyond the MVP!: It might cost $$$, it might be rate-limited, or they might catch you scraping their website and ban you ☠️
- (Also: might be too complicated to handle in the time available, even if you use Prefect 😉)
Instead, use Faker!
In particular, see Standard Providers page for listing of all the types of data you can fake!

Visual Interfaces

A good number of you are interested in visualizing or allowing interaction with the structures/algorithms from class
One quick way to allow this: Streamlit!
However, Streamlit’s “default” is form-based interaction (form in sidebar $\rightarrow$ results in main panel); a bit more work required to make everything interactive
Demo to see what I mean, code that you can steal!

Parallel Processing in Python $\rightarrow$ Prefect

A Reminder (W10)

Humans are not capable of true multitasking! In CS terms, this would be called multiprocessing (more on this later)
We are capable, however, of various modes of concurrency!

	Multithreading	Asynchronous Execution
Unconsciously (you do it already, “naturally”)	Focus on one speaker within a loud room, with tons of other conversations entering your ears	Put something in oven, set alarm, go do something else, take out of oven once alarm goes off
Consciously (you can do it with effort/practice)	Pat head (up and down) and rub stomach (circular motion) “simultaneously”	Throw a ball in the air, clap 3 times, catch ball

Parallel in Python: The Hard Way

threading
multiprocessing
asyncio

…Joblib

Colab Notebook

Python $\rightarrow$ Prefect

Can think of code on prev slide as (implicitly) submitting numbers to the find_prime_factors() task…
So, let’s make this explicit by using Prefect’s @task annotation!
Lab Time!

Functional Programming (FP)

Functions vs. Functionals

You may have noticed: map() and reduce() are “meta-functions”: functions that take other functions as inputs

def add_5(num):
  return num + 5
add_5(10)

def apply_twice(fn, arg):
  return fn(fn(arg))
apply_twice(add_5, 10)

In Python, functions can be used as vars (Hence lambda):

add_5 = lambda num: num + 5
apply_twice(add_5, 10)

This relates to a whole paradigm, “functional programming”: mostly outside scope of course, but lots of important+useful takeaways/rules-of-thumb! →

Train Your Brain for Functional Approach $\implies$ Master Debugging!

In CS Theory: enables formal proofs of correctness
In CS practice:

When a program doesn’t work, each function is an interface point where you can check that the data are correct. You can look at the intermediate inputs and outputs to quickly isolate the function that’s responsible for a bug.
(from Python’s “Functional Programming HowTo”)

Code $\rightarrow$ Pipelines $\rightarrow$ Debuggable Pipelines

Scenario: Run code, check the output, and… it’s wrong 😵 what do you do?
Usual approach: Read lines one-by-one, figuring out what they do, seeing if something pops out that seems wrong; adding comments like # Convert to lowercase

Easy case: found typo in punctuation removal code. Fix the error, add comment like # Remove punctuation

Rule 1 of FP: transform these comments into function names

Hard case: Something in load_text() modifies a variable that later on breaks remove_punct() (Called a side-effect)

Rule 2 of FP: NO SIDE-EFFECTS!

(Does this way of diagramming a program look familiar?)

With side effects: ❌ $\implies$ issue is somewhere earlier in the chain 😩🏃‍♂️
No side effects: ❌ $\implies$ issue must be in remove_punct()!!! 😎 ⏱️ = 💰

If It’s So Useful, Why Doesn’t Everyone Do It?

Trapped in imperative (sequential) coding mindset: Path dependency / QWERTY
Reason we need to start thinking like this is: it’s 1000x harder to debug parallel code! So we need to be less ad hoc in how we write+debug, from here on out! 🙇‍♂️🙏

From Leskovec, Rajaraman, and Ullman (2014)

The title relates to a classic Economics joke (the best kind of joke): “An economist and a CEO are walking down the street, when the CEO points at the ground and tells the economist, ‘look! A $20 bill on the ground!’ The economist keeps on walking, scoffing at the CEO: ‘don’t be silly, if there was a $20 bill on the ground, somebody would have picked it up already’.”

But… Why is All This Weird Mapping and Reducing Necessary?

Without knowing a bit more of the internals of computing efficiency, it may seem like a huge cost in terms of overly-complicated overhead, not to mention learning curve…

The “Killer Application”: Matrix Multiplication

(I learned from Jeff Ullman, who did the obnoxious Stanford thing of mentioning in passing how “two previous students in the class did this for a cool final project on web crawling and, well, it escalated quickly”, aka became Google)

From Leskovec, Rajaraman, and Ullman (2014), which is (legally) free online!

The Killer Way-To-Learn: Text Counts!

(2014): Text counts (2.2) $\rightarrow$ Matrix multiplication (2.3) $\rightarrow \cdots \rightarrow$ PageRank (5.1)
(And yall thought it was just busywork for HW3 😏)
The goal: User searches “Denzel Curry”… How relevant is a given webpage?
Scenario 1: Entire internet fits on CPU $\implies$ We can just make a big big hash table:

If Everything Doesn’t Fit on CPU…

From Cornell Virtual Workshop, “Understanding GPU Architecture”

Break Problem into Chunks for the Green Bois!

$\implies$ Total = $O(3n) = O(n)$
But also optimized in terms of constants, because of sequential memory reads

HW5A

Scrape quotes in parallel!
Advantage: Builds on HW4
Disadvantage: Embarrassingly parallel, so… no Map-Reduce coding practice

HW5B

Map-Reduced Matrix-Vector Multiplication

References

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.

Week 12: Final Projects, Map-Reduce

Schedule

Making Final Projects Less Scary!

How Do I Pick A Topic?

Getting From Here to There

The Options

1. A Deeper Dive

2. Fixing an Issue

3. Exploring Applications

4. Proving Things

5. Building an ETL Pipeline

Nuts and Bolts for Projects

Fake Data!

Visual Interfaces

Parallel Processing in Python \(\rightarrow\) Prefect

A Reminder (W10)

Parallel in Python: The Hard Way

…Joblib

Python \(\rightarrow\) Prefect

Functional Programming (FP)

Functions vs. Functionals

Train Your Brain for Functional Approach \(\implies\) Master Debugging!

Code \(\rightarrow\) Pipelines \(\rightarrow\) Debuggable Pipelines

If It’s So Useful, Why Doesn’t Everyone Do It?

But… Why is All This Weird Mapping and Reducing Necessary?

The “Killer Application”: Matrix Multiplication

The Killer Way-To-Learn: Text Counts!

If Everything Doesn’t Fit on CPU…

Break Problem into Chunks for the Green Bois!

HW5A

HW5B

References