import time
from sympy.ntheory import factorint
from joblib import Parallel, delayed
= Parallel(n_jobs=4)
parallel_runner = 500, 580
start, end def find_prime_factors(num):
.01)
time.sleep(return factorint(num, multiple=True)
= lambda start, end: print('{:.4f} s'.format(end - start)) disp_time
Week 12: Final Projects, Interfaces
DSAN 5500: Data Structures, Objects, and Algorithms in Python
Class Sessions
Making Final Projects Less Scary!
How Do I Pick A Topic?
- I know that “whatever is interesting to you” can be way overly-vague!
- So, one approach is: imagine yourself in a job interview for your dream job, and they bring up DSAN 5450: “Interesting, what did you do in that class?”
- [Insert final project elevator pitch] “Wow, that’s such a cool project, we really want someone who can [say] take a data-driven approach to a policy question like that. You’re hired!”
- (Jeff gets a commission: 10% of your salary)
Getting From Here to There
- Minimum Viable Product (MVP)
- \(\leadsto\) Final Product (but… Hofstadter’s Law)
The Options
- A Deeper Dive
- Fixing an Issue
- Exploring Applications
- Proving Correctness and/or Efficiency
- Building an ETL Pipeline
- (New examples on next 5 slides!)
1. A Deeper Dive
- Example 1.3: Creating an interactive visualization (using Streamlit, for example) of a data structure or algorithm. (Streamlit demo coming in next section of slides!)
2. Fixing an Issue
- Example 2.2: As you saw in the Midterm, hash tables “degenerate” from (approximate) \(O(1)\) efficiency to \(O(\log_2(n))\) efficiency if the hashing function we use is not efficient or not collision-resistant.
- So, your project could be to:
- Explore and summarize how efficient and collision-resistant hash functions work, and then
- Implement one such hash function in Python.
3. Exploring Applications
- Example 3.2: We learned Breadth-First Search (BFS) and Depth-First Search (DFS) in somewhat of a hurry, as a way to traverse over the nodes in a Binary Search Tree, but they both have many more exciting applications!
- For example, if you’re interested in web scraping, you could adapt the HW2 code to create a polymorphic web-scraper:
- Each node is a webpage
- Processing a node means adding each link in that page to the
ThingContainer
.
4. Proving Things
- Example 4.2: We were able to gradually improve retrieval:
Linear | Logarithmic | Constant | ||
---|---|---|---|---|
LinkedList \(O(n)\) |
\(\rightarrow\) | BST \(O(\log_2(n))\) |
\(\rightarrow\) | HashTable \(O(1 + \varepsilon)\) |
- But then for search, something was missing 🧐:
Linear | Logarithmic | Constant | ||
---|---|---|---|---|
Insertion-Sort \(O(n^2)\) |
\(\rightarrow\) | Merge-Sort \(O(n\log_2(n))\) |
\(\rightarrow\) | ??? |
- Prove that it’s not possible to make the additional “jump” to constant-time search in \(O(n)\), except in very special cases.
5. Building an ETL Pipeline
- Example 5.2: If there are particular APIs and Database solutions that you’ve been hoping to explore, now’s your chance! Create an ETL pipeline which Extracts data from the API, Transforms it in some interesting way, then Loads it into (for example) an SQL or MongoDB database.
Nuts and Bolts for Projects
Fake Data!
- When developing (as opposed to deploying) your projects, real data can add additional complexity beyond the MVP!: It might cost $$$, it might be rate-limited, or they might catch you scraping their website and ban you ☠️
- (Also: might be too complicated to handle in the time available, even if you use Prefect 😉)
- Instead, use Faker!
- In particular, see Standard Providers page for listing of all the types of data you can fake!
Visual Interfaces
- A good number of you are interested in visualizing or allowing interaction with the structures/algorithms from class
- One quick way to allow this: Streamlit!
- However, Streamlit’s “default” is form-based interaction (form in sidebar \(\rightarrow\) results in main panel); a bit more work required to make everything interactive
- Demo to see what I mean, code that you can steal!
Parallel Processing in Python \(\rightarrow\) Prefect
A Reminder (W10)
- Humans are not capable of true multitasking! In CS terms, this would be called multiprocessing (more on this later)
- We are capable, however, of various modes of concurrency!
Multithreading | Asynchronous Execution | |
---|---|---|
Unconsciously (you do it already, “naturally”) |
Focus on one speaker within a loud room, with tons of other conversations entering your ears | Put something in oven, set alarm, go do something else, take out of oven once alarm goes off |
Consciously (you can do it with effort/practice) |
Pat head (up and down) and rub stomach (circular motion) “simultaneously” | Throw a ball in the air, clap 3 times, catch ball |
Parallel in Python: The Hard Way
threading
multiprocessing
asyncio
…Joblib
In Action
Code
= time.time()
serial_start = [
result
(i,find_prime_factors(i))for i in range(start, end+1)
]= time.time()
serial_end disp_time(serial_start, serial_end)
0.9951 s
Code
= time.time()
par_start = parallel_runner(
result
delayed(find_prime_factors)(i)for i in range(start, end+1)
)= time.time()
par_end disp_time(par_start, par_end)
2.1488 s
Python \(\rightarrow\) Prefect
- Can think of code on prev slide as (implicitly) submitting numbers to the
find_prime_factors()
task… - So, let’s make this explicit by using Prefect’s
@task
annotation! - Lab Time!
HW4
- Do that but for quotes!
- Scrape (one, and then many) page(s) from the default interface, in parallel!