Week 12: Final Projects, Functional Programming, Map-Reduce

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Class Sessions
Author
Affiliation

Jeff Jacobs

Published

Thursday, April 9, 2026

Open slides in new window →

Schedule

Today’s Planned Schedule:

Start End Topic
Lecture 6:30pm 6:45pm Final Projects →
6:45pm 7:00pm HW4 on Jetstream →
7:00pm 8:00pm Beyond Embarrassingly Parallel →
Break! 8:00pm 8:10pm
8:10pm 9:00pm Data Mining in Parallel →

Making Final Projects Less Scary!

How Do I Pick A Topic?

  • I know that “whatever is interesting to you” can be way overly-vague!
  • So, one approach is: imagine yourself in a job interview for your dream job, and they bring up DSAN 5450: “Interesting, what did you do in that class?”
  • [Insert final project elevator pitch] “Wow, that’s such a cool project, we really want someone who can [say] take a data-driven approach to a policy question like that. You’re hired!”
  • (Jeff gets a commission: 10% of your salary)

Getting From Here to There

  • Minimum Viable Product (MVP)
  • \(\leadsto\) Final Product (but… Hofstadter’s Law)
Note Hofstadter’s Law (Paraphrase)

The pieces of your DSAN final project will take longer than you expect, even if you take Hofstadter’s Law into account

The Options

  1. A Deeper Dive
  2. Fixing an Issue
  3. Exploring Applications
  4. Proving Correctness and/or Efficiency
  5. Building an ETL Pipeline
  • (New examples on next 5 slides!)

1. A Deeper Dive

  • Example 1.3: Creating an interactive visualization (using Streamlit, for example) of a data structure or algorithm. (Streamlit demo coming in next section of slides!)

2. Fixing an Issue

  • Example 2.2: As you saw in the Midterm, hash tables “degenerate” from (approximate) \(O(1)\) efficiency to \(O(\log_2(n))\) efficiency if the hashing function we use is not efficient or not collision-resistant.
  • So, your project could be to:
    • Explore and summarize how efficient and collision-resistant hash functions work, and then
    • Implement one such hash function in Python.

3. Exploring Applications

  • Example 3.2: We learned Breadth-First Search (BFS) and Depth-First Search (DFS) in somewhat of a hurry, as a way to traverse over the nodes in a Binary Search Tree, but they both have many more exciting applications!
  • For example, if you’re interested in web scraping, you could adapt the HW2 code to create a polymorphic web-scraper:
    • Each node is a webpage
    • Processing a node means adding each link in that page to the ThingContainer.

4. Proving Things

  • Example 4.2: We were able to gradually improve retrieval:
Linear Logarithmic Constant
LinkedList
\(O(n)\)
\(\rightarrow\) BST
\(O(\log_2(n))\)
\(\rightarrow\) HashTable
\(O(1 + \varepsilon)\)
  • But then for search, something was missing 🧐:
Linear Logarithmic Constant
Insertion-Sort
\(O(n^2)\)
\(\rightarrow\) Merge-Sort
\(O(n\log_2(n))\)
\(\rightarrow\) ???
  • Prove that it’s not possible to make the additional “jump” to constant-time search in \(O(n)\), except in very special cases.

5. Building an ETL Pipeline

  • Example 5.2: If there are particular APIs and Database solutions that you’ve been hoping to explore, now’s your chance! Create an ETL pipeline which Extracts data from the API, Transforms it in some interesting way, then Loads it into (for example) an SQL or MongoDB database.

Nuts and Bolts for Projects

Fake Data!

  • When developing (as opposed to deploying) your projects, real data can add additional complexity beyond the MVP!: It might cost $$$, it might be rate-limited, or they might catch you scraping their website and ban you ☠️
    • (Also: might be too complicated to handle in the time available, even if you use Prefect 😉)
  • Instead, use Faker!
  • In particular, see Standard Providers page for listing of all the types of data you can fake!

Visual Interfaces

  • A good number of you are interested in visualizing or allowing interaction with the structures/algorithms from class
  • One quick way to allow this: Streamlit!
  • However, Streamlit’s “default” is form-based interaction (form in sidebar \(\rightarrow\) results in main panel); a bit more work required to make everything interactive
  • Demo to see what I mean, code that you can steal!

Functional Programming (FP)

  • 🌱 Everything is a function (literally everything)
    • (0 \(\leadsto\) def make_zero(): return 0 🤯)
  • 🌱 Functions always return a single value
  • 🌱 Functions calculate their return value based only on their arguments (no “incoming” side effects)
  • 🌱 Functions don’t mutate any already-existing values (no “outgoing” side effects)

Functions vs. Functionals

  • You may have noticed: map() and reduce() are “meta-functions”: functions that take other functions as inputs
def add_5(num):
  return num + 5
add_5(10)
15
def apply_twice(fn, arg):
  return fn(fn(arg))
apply_twice(add_5, 10)
20
  • In Python, functions can be used as vars (Hence lambda):
add_5 = lambda num: num + 5
apply_twice(add_5, 10)
20
  • This is what makes Principle (everything is a function) feasible in Python in the first place! (But see also: Scala)

Train Your Brain for Functional Approach \(\implies\) Master Debugging!

When a program doesn’t work, each function is an interface point where you can check that the data are correct. You can look at the intermediate inputs and outputs to quickly isolate the function that’s responsible for a bug.
(from Python’s “Functional Programming HowTo”)

Code \(\rightarrow\) Pipelines \(\rightarrow\) Debuggable Pipelines

  • Scenario: Run code, check the output, and… it’s wrong 😵 what do you do?
  • Usual approach: Read lines one-by-one, figuring out what they do, seeing if something pops out that seems wrong; adding comments like # Convert to lowercase
  • Easy case: found typo in punctuation removal code. Fix the error, add comment like # Remove punctuation

    First “application” of FP thinking: transform these comments into function names (why?)

  • Hard case: Something in load_text() modifies a variable that later on breaks remove_punct() (Called a side-effect)

    Second “application” of FP thinking: Get rid of these side effects! (why?)

G input in.txt load_text load_text (Verb) input->load_text lowercase lowercase (Verb) load_text->lowercase 🧐 ✅ remove_punct remove_punct (Verb) lowercase->remove_punct 🧐 ✅ remove_stopwords remove_stopwords (Verb) remove_punct->remove_stopwords 🧐 ❌❗️ output out.txt remove_stopwords->output

(Does this way of diagramming a program look familiar?)

  • With side effects: ❌ \(\implies\) issue is somewhere earlier in the chain 😩🏃‍♂️
  • No side effects: ❌ \(\implies\) issue must be in remove_punct()!!! 😎 ⏱️ = 💰

If It’s So Useful, Why Doesn’t Everyone Do It?

  • Trapped in imperative (sequential) coding mindset: Path dependency (QWERTY!)
  • Reason we need to start thinking like this is: it’s 1000x harder to debug parallel code! So we need to be less ad hoc in how we write+debug, from here on out! 🙇‍♂️🙏

From Leskovec et al. (2014)

The title relates to a classic Economics joke (the best kind of joke): “An economist and a CEO are walking down the street, when the CEO points at the ground and tells the economist, ‘look! A $20 bill on the ground!’ The economist keeps on walking, scoffing at the CEO: ‘don’t be silly, if there was a $20 bill on the ground, somebody would have picked it up already’.”

But… Why is All This Weird Mapping and Reducing Necessary?

  • Supercharged Debugging
    • (A bit easier to wrap our heads around in the weeks we have left, so we’ll focus more on this advantage!)
  • “Automatic” Parallelization
    • We can learn a bit more about the internals of computing efficiency, to see why the huge cost in terms of overly-complicated overhead (see prev slide), not to mention learning curve, is still worth it!

The “Pareto Principle” (80-20 Rule)

  • Pro-FP argument: “80% of outcomes are the result of 20% of inputs”

  • Specifically: one argument made for FP is that imperative and/or Object-Oriented* code leads to

    20% of cases (corner cases) causing 80% of issues with code

  • Corner cases = cases where we have to “open up” a function and “trace” an issue line-by-line:

    • “Why did this input (line 1) eventually lead to this error (line 200)?
  • FP goal: Function signatures tell their full story (“success” in FP \(\propto\) never have to look “inside” a function!)

Python Code Lying To You 😈

Code
def get_first_character(s):
  return s[0]

Sounds good to me!

Code
get_first_character("Hello")
'H'

So far so good…

Code
get_first_character("")
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 get_first_character("")

Cell In[4], line 2, in get_first_character(s)
      1 def get_first_character(s):
----> 2   return s[0]

IndexError: string index out of range

Python Code Still Lying To You

Code
def divide(a, b):
  return a / b

Ok, I’m a little scared after the last slide, but I’ll try to trust this function… 🙈

Code
divide(2, 4), divide(4, 2)
(0.5, 2.0)

Perhaps I can trust this function after all? 🥹

Code
divide(2, 0)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[9], line 1
----> 1 divide(2, 0)

Cell In[7], line 2, in divide(a, b)
      1 def divide(a, b):
----> 2   return a / b

ZeroDivisionError: division by zero

How To Lie Less 😇

[In FP] we focus on functions that don’t lie. We want their signatures to tell the whole story about the body (Plachta 2022)

OOP vs. FP, Side-By-Side

An Object-Oriented Tip Calculator:

Code
class TipCalculator:
  def __init__(self):
    self.names = []
    self.tip_percentage = 0
  
  def add_person(self, name: str):
    self.names.append(name)
    if(len(self.names) > 5):
      self.tip_percentage = 20
    else:
      self.tip_percentage = 10
  
  def get_names(self) -> list[str]:
    return self.names
  
  def get_tip_percentage(self) -> int:
    return self.tip_percentage
my_calc = TipCalculator()
my_calc.add_person("Jeff")
my_calc.get_tip_percentage()
10
Code
for i in range(5): my_calc.add_person(f'Jeff{i}')
my_calc.get_tip_percentage()
20

A Functional Tip Calculator:

Code
class TipCalculatorFP:
  def add_person(
    names: list[str], name: str
  ) -> list[str]:
    updated = names.copy()
    updated.append(name)
    return updated
  def get_tip_percentage(
    names: list[str]
  ) -> int:
    if len(names) > 5:
      return 20
    elif len(names) > 0:
      return 10
    else:
      return 0
names = []
names = TipCalculatorFP.add_person(names, "Jeff")
TipCalculatorFP.get_tip_percentage(names)
10
Code
for i in range(5): names = TipCalculatorFP.add_person(names, f'Jeff{i}')
TipCalculatorFP.get_tip_percentage(names)
20

Even Fancier FP (…Fancy Not Always Better!)

Code
from toolz.functoolz import pipe
add_jeff = lambda input: TipCalculatorFP.add_person(input, "Jeff")
pipe(
    [],
    add_jeff,
    TipCalculatorFP.get_tip_percentage
)
10

Do we need to name a whole new function each time we want to add a name? No!

Code
pipe(
    [],
    lambda input: TipCalculatorFP.add_person(input, "Jeff"),
    TipCalculatorFP.get_tip_percentage
)
10

Why Are Side-Effects So Bad for Debugging?

Code
import os
os.environ['exponent'] = '2'
def compute_powers(my_nums: list[int]) -> list[int] | list[float]:
  return [n ** int(os.environ['exponent']) for n in my_nums]
compute_powers([1, 2, 3])
[1, 4, 9]
  • You (responsible for your org’s compute_powers()) commit code and go to sleep 😴
  • Meanwhile, teammate in Singapore starts up Python and runs your code…
Code
import dotenv
dotenv.load_dotenv(override=True)
compute_powers([1, 2, 3])
[1, 8, 27]
  • Same function, run in different environments, produces different outputs
  • One solution: Figure out DC-Singapore environment-coordination system… 😫
  • FP solution: Refactor code to eliminate side-effects (Ex: exponent argument would tell user they need to explicitly specify exponent before using your function!)

“Outgoing” Side-Effects

  • Previous slide: “incoming” side-effects (from exponent environment variable)
  • Here: secret “outgoing” side-effect (may be as “surprising” to user)
Code
def add(a, b):
  with open('log.txt', 'w', encoding='utf-8') as logfile:
    logfile.write(f'User adding {a} and {b}')
  return a + b
  • General principle: Think of who is using function and what they will expect
    • Ex: When doing math, people “expect” PEMDAS

User Expectations Across Locales

Code
def parse_money_amount(money_amount: str) -> float:
  return float(money_amount.replace(",",""))
parse_money_amount("1,000,000")
1000000.0
Code
parse_money_amount("1'000'000")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 parse_money_amount("1'000'000")

Cell In[19], line 2, in parse_money_amount(money_amount)
      1 def parse_money_amount(money_amount: str) -> float:
----> 2   return float(money_amount.replace(",",""))

ValueError: could not convert string to float: "1'000'000"

FP “Killer Application”: Matrix Multiplication

  • (I learned from Jeff Ullman, who did the obnoxious Stanford thing of mentioning in passing how “two previous students in the class did this for a cool final project on web crawling and, well, it escalated quickly”, aka became Google)

From Leskovec et al. (2014), which is (legally) free online!

The Killer Way-To-Learn: Text Counts!

  • (2014): Text counts (2.2) \(\rightarrow\) Matrix multiplication (2.3) \(\rightarrow \cdots \rightarrow\) PageRank (5.1)
  • (And yall thought it was just busywork for HW3 😏)
  • The goal: User searches “Denzel Curry”… How relevant is a given webpage?
  • Scenario 1: Entire internet fits on CPU \(\implies\) We can just make a big big hash table:

G internet Scan in O(n): Today Denzel Washington ate a big bowl of Yum's curry. Denzel allegedly rubbed his tum and said "yum yum yum" when he tasted today's curry. "Yum! It is me Denzel, curry is my fav!", he exclaimed. According to his friend Steph, curry is indeed Denzel's fav. We are live with Del Curry in Washington for a Denzel curry update. ccounts Overall Counts ('according',1) ('allegedly',1) ('ate',1) ('big',1) ('bowl',1) ('curry',6) ('del',1) ('denzel',5) ('exclaimed',1) ('fav',2) ('friend',1) ('indeed',1) ('live',1) ('rubbed',1) ('said',1) ('steph',1) ('tasted',1) ('today',2) ('tum',1) ('update',1) ('washington',2) ('yum',4) internet->ccounts Hash Table

If Everything Doesn’t Fit on CPU…

From Cornell Virtual Workshop, “Understanding GPU Architecture”

Break Problem into Chunks for the Green Bois!

G chunked Chunked Document Today Denzel Washington ate a big bowl of Yum's curry. Denzel allegedly rubbed his tum and said "yum yum yum" when he tasted today's curry. "Yum! It is me Denzel, curry is my fav!", he exclaimed. According to his friend Steph, curry is indeed Denzel's fav. We are live with Del Curry in Washington for a Denzel curry update. chcounts Chunked Counts ('today',1) ('denzel',1) ... ('tum',1) ('said',1) ('yum',1) ('yum',1) ('yum',1) ... ('fav',1) ('exclaimed',1) ... ('del',1) ('curry',1) ('washington',1) ('denzel',1) ('curry',1) ('update',1) chunked:p1->chcounts:p1 O(n/4) chunked:p2->chcounts:p2 O(n/4) chunked:p3->chcounts:p3 O(n/4) chunked:p4->chcounts:p4 O(n/4) scounts Hashed Counts ('allegedly',1) ... ('curry',1) ('denzel',2) ... ('yum',1) ('curry',2) ('denzel',1) ... ('yum',4) ('according',1) ('curry',1) ('del',1) ('denzel',1) ... ('curry',2) ('denzel',1) ('update',1) ('washington',1) chcounts:p1->scounts:p1 O(n/4) chcounts:p2->scounts:p2 O(n/4) chcounts:p3->scounts:p3 O(n/4) chcounts:p4->scounts:p4 O(n/4) ccounts Overall Counts ('according',1) ('allegedly',1) ('ate',1) ('big',1) ('bowl',1) ('curry',6) ('del',1) ('denzel',5) ('exclaimed',1) ('fav',2) ('friend',1) ('indeed',1) ('live',1) ('rubbed',1) ('said',1) ('steph',1) ('tasted',1) ('today',2) ('tum',1) ('update',1) ('washington',2) ('yum',4) scounts:p1->ccounts:p1 scounts:p2->ccounts:p1 scounts:p3->ccounts:p1 scounts:p4->ccounts:p1 scounts:p2->ccounts Merge in O(n)

  • \(\implies\) Total = \(O(3n) = O(n)\)
  • But also optimized in terms of constants, because of sequential memory reads

References

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://infolab.stanford.edu/~ullman/mmds/book.pdf.
Plachta, Michal. 2022. Grokking Functional Programming. Simon and Schuster. https://books.google.com?id=Z4aUEAAAQBAJ.