I know that “whatever is interesting to you” can be way overly-vague!
So, one approach is: imagine yourself in a job interview for your dream job, and they bring up DSAN 5450: “Interesting, what did you do in that class?”
[Insert final project elevator pitch] “Wow, that’s such a cool project, we really want someone who can [say] take a data-driven approach to a policy question like that. You’re hired!”
(Jeff gets a commission: 10% of your salary)
Getting From Here to There
Minimum Viable Product (MVP)
\(\leadsto\) Final Product (but… Hofstadter’s Law)
Hofstadter’s Law (Paraphrase)
The pieces of your DSAN final project will take longer than you expect, even if you take Hofstadter’s Law into account
The Options
A Deeper Dive
Fixing an Issue
Exploring Applications
Proving Correctness and/or Efficiency
Building an ETL Pipeline
(New examples on next 5 slides!)
1. A Deeper Dive
Example 1.3: Creating an interactive visualization (using Streamlit, for example) of a data structure or algorithm. (Streamlit demo coming in next section of slides!)
2. Fixing an Issue
Example 2.2: As you saw in the Midterm, hash tables “degenerate” from (approximate) \(O(1)\) efficiency to \(O(\log_2(n))\) efficiency if the hashing function we use is not efficient or not collision-resistant.
So, your project could be to:
Explore and summarize how efficient and collision-resistant hash functions work, and then
Implement one such hash function in Python.
3. Exploring Applications
Example 3.2: We learned Breadth-First Search (BFS) and Depth-First Search (DFS) in somewhat of a hurry, as a way to traverse over the nodes in a Binary Search Tree, but they both have many more exciting applications!
For example, if you’re interested in web scraping, you could adapt the HW2 code to create a polymorphic web-scraper:
Each node is a webpage
Processing a node means adding each link in that page to the ThingContainer.
4. Proving Things
Example 4.2: We were able to gradually improve retrieval:
Linear
Logarithmic
Constant
LinkedList \(O(n)\)
\(\rightarrow\)
BST \(O(\log_2(n))\)
\(\rightarrow\)
HashTable \(O(1 + \varepsilon)\)
But then for search, something was missing 🧐:
Linear
Logarithmic
Constant
Insertion-Sort \(O(n^2)\)
\(\rightarrow\)
Merge-Sort \(O(n\log_2(n))\)
\(\rightarrow\)
???
Prove that it’s not possible to make the additional “jump” to constant-time search in \(O(n)\), except in very special cases.
5. Building an ETL Pipeline
Example 5.2: If there are particular APIs and Database solutions that you’ve been hoping to explore, now’s your chance! Create an ETL pipeline which Extracts data from the API, Transforms it in some interesting way, then Loads it into (for example) an SQL or MongoDB database.
Nuts and Bolts for Projects
Fake Data!
When developing (as opposed to deploying) your projects, real data can add additional complexity beyond the MVP!: It might cost $$$, it might be rate-limited, or they might catch you scraping their website and ban you ☠️
(Also: might be too complicated to handle in the time available, even if you use Prefect 😉)
However, Streamlit’s “default” is form-based interaction (form in sidebar \(\rightarrow\) results in main panel); a bit more work required to make everything interactive
When a program doesn’t work, each function is an interface point where you can check that the data are correct. You can look at the intermediate inputs and outputs to quickly isolate the function that’s responsible for a bug. (from Python’s “Functional Programming HowTo”)
Scenario: Run code, check the output, and… it’s wrong 😵 what do you do?
Usual approach: Read lines one-by-one, figuring out what they do, seeing if something pops out that seems wrong; adding comments like # Convert to lowercase
Easy case: found typo in punctuation removal code. Fix the error, add comment like # Remove punctuation
Rule 1 of FP: transform these comments into function names
Hard case: Something in load_text() modifies a variable that later on breaks remove_punct() (Called a side-effect)
Rule 2 of FP: NO SIDE-EFFECTS!
(Does this way of diagramming a program look familiar?)
With side effects: ❌ \(\implies\) issue is somewhere earlier in the chain 😩🏃♂️
No side effects: ❌ \(\implies\) issue must be in remove_punct()!!! 😎 ⏱️ = 💰
If It’s So Useful, Why Doesn’t Everyone Do It?
Trapped in imperative (sequential) coding mindset: Path dependency / QWERTY
Reason we need to start thinking like this is: it’s 1000x harder to debug parallel code! So we need to be less ad hoc in how we write+debug, from here on out! 🙇♂️🙏
The title relates to a classic Economics joke (the best kind of joke): “An economist and a CEO are walking down the street, when the CEO points at the ground and tells the economist, ‘look! A $20 bill on the ground!’ The economist keeps on walking, scoffing at the CEO: ‘don’t be silly, if there was a $20 bill on the ground, somebody would have picked it up already’.”
But… Why is All This Weird Mapping and Reducing Necessary?
Without knowing a bit more of the internals of computing efficiency, it may seem like a huge cost in terms of overly-complicated overhead, not to mention learning curve…
The “Killer Application”: Matrix Multiplication
(I learned from Jeff Ullman, who did the obnoxious Stanford thing of mentioning in passing how “two previous students in the class did this for a cool final project on web crawling and, well, it escalated quickly”, aka became Google)
From Leskovec, Rajaraman, and Ullman (2014), which is (legally) free online!