Final Project Details
DSAN 5500: Data Structures, Objects, and Algorithms in Python
- [Sunday, April 14, 11pm] Added some additional examples to the cards in the Overview section
- [Thursday, April 4, 6:15pm] Added section with details on Individual vs. Group Projects
- [Thursday, April 4, 4:30pm] Updated Timeline section to specify that proposals should be approved by April 10th, not April 3rd
Overview
The idea for your final project is as follows:
In pretty much any course on algorithms and data structures, whether in a Data Science, Computer Science, or Applied Math context, there’s only so much that can be covered in a given unit. This means that, for all of the topics we’ve covered thus far, there’s a next step that could be taken in learning about the topic. In general this could mean one of the following four options, where I’ve provided a concrete example specific to the class under each heading:
Diving 'one level deeper' into the details of an algorithm/data structure
- Example 1.1: Efficient implementations of deletion in linked lists, rather than just the insertion and retrieval operations that we looked at in class.
- Example 1.2: Expanding the Binary Search Tree into a Quadtree
- Example 1.3: Creating an interactive visualization (using Streamlit, for example) of a data structure or algorithm.
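To make Example 1.1 concrete, here is a minimal sketch of deletion in a singly linked list. The `Node`/`LinkedList` names are illustrative, not the class implementation; the point is that deletion needs a reference to the *previous* node, which is exactly the kind of detail a "one level deeper" project could dig into (e.g., by comparing against a doubly linked list).

```python
class Node:
    def __init__(self, content):
        self.content = content
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None

    def insert(self, content):
        # Insert at the head: O(1)
        new_node = Node(content)
        new_node.next = self.head
        self.head = new_node

    def delete(self, content):
        # Delete the first node holding `content`, relinking around it.
        # O(n) worst case in a singly linked list, since we must walk to
        # the node *before* the one being removed.
        prev, curr = None, self.head
        while curr is not None:
            if curr.content == content:
                if prev is None:
                    self.head = curr.next
                else:
                    prev.next = curr.next
                return True
            prev, curr = curr, curr.next
        return False
```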
Figuring out how to address some existing drawback of the algorithm/data structure
- Example 2.1: Binary Search Trees lose their logarithmic efficiency if items are added to them in order (since newly-added items are always added to the right subtree of all existing items, thus "degenerating" into a linear linked list structure), an issue which fancier versions of the BST like AVL Trees and Red-Black Trees resolve.
- Example 2.2: As you saw in the Midterm, hash tables "degenerate" from (approximate) $O(1)$ efficiency to $O(\log_2(n))$ efficiency if the **hashing function** we use is not **efficient** or not **collision-resistant**. So, your project could be to explore and summarize how efficient and collision-resistant hash functions work, and then implement one such hash function in Python.
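The degeneration in Example 2.1 is easy to see empirically. The following is a sketch using a toy BST (not the class implementation): inserting keys in sorted order produces a tree whose height equals the number of keys, while a random insertion order stays much closer to \(\log_2(n)\).

```python
import random

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

random.seed(5500)
keys = list(range(127))
shuffled = keys.copy()
random.shuffle(shuffled)

degenerate = None
for k in keys:          # inserted in sorted order
    degenerate = insert(degenerate, k)

balanced_ish = None
for k in shuffled:      # inserted in random order
    balanced_ish = insert(balanced_ish, k)

# Sorted-order insertion: every key hangs off the right subtree, so the
# tree is a chain of height 127. Random order stays far shallower,
# typically within a small multiple of log2(128) = 7.
print(height(degenerate))    # 127
print(height(balanced_ish))
```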
Examining different applications of the data structure that make it helpful for particular contexts
- Example 3.1: In bioinformatics, the sorting algorithms we learned are already helpful for genetic sequence searching, but they can be made even more efficient by applying the Burrows-Wheeler Transform to a collection of genetic sequences before applying the sorting algorithms.
- Example 3.2: We learned Breadth-First Search (BFS) and Depth-First Search (DFS) in somewhat of a hurry, as a way to traverse over the nodes in a Binary Search Tree, but they both have many more exciting applications! For example, if you're interested in web scraping, you could adapt the code from HW2 to create a polymorphic web-scraper, where each node is a webpage, and processing a node means adding each link in the page to the stack/queue.
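The scraper idea in Example 3.2 can be sketched abstractly: here the page-fetching step is left as a caller-supplied `get_links` function (in a real scraper it might use `requests` plus BeautifulSoup; this sketch deliberately avoids network calls so the traversal logic stands on its own).

```python
from collections import deque

def crawl_bfs(start_url, get_links, max_pages=100):
    """Generic BFS traversal over pages: `get_links(url)` is any callable
    returning the outgoing links from a page."""
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # popleft() = queue = BFS; pop() would give DFS
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Swapping the single `popleft()` call for `pop()` turns the crawl into DFS, which is the stack/queue polymorphism idea from HW2 applied to the web.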
Carrying out a more robust analysis of the correctness and/or asymptotic complexity of the algorithms and data structures we looked at
- Example 4.1: In the lecture on Insertion-Sort, I hinted at the notion of a correctness proof using invariants---how it would proceed along the same lines as a proof by induction that you might have seen in a previous math class---but never formally walked you through how to do one.
- Example 4.2: Over the first half of the course, we were able to gradually improve retrieval of elements from data structures in a progression like:
- Retrieval: LinkedList \(O(n)\) \(\rightarrow\) BST \(O(\log_2(n))\) \(\rightarrow\) HashTable \(O(1 + \varepsilon)\)
- Sorting: Insertion-Sort \(O(n^2)\) \(\rightarrow\) Merge-Sort \(O(n\log_2(n))\) \(\rightarrow\) ???
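As a pointer toward Example 4.1, the usual invariant-based correctness argument for Insertion-Sort has three standard parts. This is only a sketch of the shape of such a proof, not the formal write-up a project would produce:

```latex
\textbf{Invariant.} At the start of the outer-loop iteration with index $j$,
the subarray $A[1..j-1]$ contains the same elements it held initially,
in sorted order.
\begin{itemize}
  \item \textbf{Initialization:} when $j = 2$, the subarray $A[1..1]$ is a
        single element and hence trivially sorted.
  \item \textbf{Maintenance:} the inner loop shifts each element of
        $A[1..j-1]$ greater than $A[j]$ one position to the right and
        inserts $A[j]$ into the gap, so $A[1..j]$ is sorted at the start
        of iteration $j+1$.
  \item \textbf{Termination:} when the loop exits with $j = n + 1$, the
        invariant says $A[1..n]$ is sorted, which is exactly correctness.
\end{itemize}
```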
Using data validation and pipeline orchestration libraries to build a pipeline useful for your life or your job
- Example 5.1: If there is some task you currently handle manually---say, entering receipts from your email into a spreadsheet, or (the example I mentioned in class) making a fulltext index of ebooks in your library---your project can explore automating this task using Pydantic and Prefect.
- Example 5.2: If there are particular APIs and Database solutions that you've been hoping to explore, now's your chance! Create an ETL pipeline which Extracts data from the API, Transforms it in some interesting way, then Loads it into (for example) an SQL or MongoDB database.
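The Extract/Transform/Load shape in Examples 5.1 and 5.2 can be sketched with the standard library alone. In the actual project you would wrap each step in Prefect's `@task`/`@flow` decorators and replace the hand-rolled check below with a Pydantic `BaseModel` subclass; the receipt example and all names here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Receipt:
    vendor: str
    amount_cents: int

    def __post_init__(self):
        # Stand-in for a Pydantic validator: reject malformed records early
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")

def extract(raw_rows):
    # E: parse raw records (a list of dicts standing in for an API response),
    # validating each one as it enters the pipeline
    return [Receipt(**row) for row in raw_rows]

def transform(receipts):
    # T: aggregate spending per vendor
    totals = {}
    for r in receipts:
        totals[r.vendor] = totals.get(r.vendor, 0) + r.amount_cents
    return totals

def load(totals, destination):
    # L: write results out (a dict standing in for an SQL/MongoDB database)
    destination.update(totals)
    return destination
```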
So, your task for the final project will be to
- Choose one of the topics we covered
- Determine on your own, or work with your mentor to find, an interesting “next step” with respect to this topic
- Carry out this next step (details below, since this will look different for different choices from the above options!)
- Write up what you did in a report, including any of the code or math or data analysis that you did in Step 3
Concrete Requirements
For each of the options described in the Overview section above, the following info boxes describe the structure of the deliverable(s) you would be responsible for (click a header to expand the box; they’re collapsed by default to avoid overloading you with info if you’ve already chosen!)
Diving ‘one level deeper’ into the details of the algorithm/data structure
Here, the deliverable should be structured like:
- State which algorithm or data structure you chose, from among those covered in class
- Identify a substantive implementation detail that we didn’t cover in class
- Implement it in Python, describing in your own words (through comments or Markdown text above/below the code) how your implementation works
- Write a set of unit tests demonstrating that your implementation of the detail from Step 2 works as intended across a range of possible inputs
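For a sense of what "a set of unit tests" can look like, here is a pytest-style sketch; the `binary_search` function is just a toy stand-in for whatever your project implements, and the test names are illustrative.

```python
def binary_search(xs, target):
    # Toy implementation under test: return the index of `target` in the
    # sorted list `xs`, or -1 if absent.
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def test_finds_present_element():
    assert binary_search([1, 3, 5, 7], 5) == 2

def test_missing_element_returns_minus_one():
    assert binary_search([1, 3, 5, 7], 4) == -1

def test_empty_input():
    assert binary_search([], 1) == -1
```

Running `pytest` on a file like this discovers and runs each `test_*` function automatically; covering present, absent, and edge-case inputs is the "range of possible inputs" we have in mind.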
Figuring out how to address some existing drawback of the algorithm/data structure
Here, the deliverable should be structured like:
- State which algorithm or data structure you chose, from among those covered in class
- Identify a substantive drawback present in the implementation of that topic that we learned in class
- Implement a fix or improvement in Python, describing in your own words (through comments or Markdown text above/below the code) how your implementation works
- Write a set of unit tests demonstrating that your implementation resolves/fixes the drawback that you identified in Step 2
Examining different applications of the data structure that make it helpful for particular fields
- State which course topic you’ve chosen
- Identify a substantive application of that topic in a field you are interested in: e.g., bioinformatics, marketing, finance, NLP, image processing, operations management, social science
- Write a Literature Review section containing a few examples of the use of the topic in your chosen field that are particularly interesting to you
- Choose one of the examples from your literature review and create a Quarto document or Jupyter notebook containing code that you wrote yourself which illustrates the application of the data structure or algorithm within your domain of interest, describing in your own words (through comments or Markdown text above/below the code) how the code works and what it does
Carrying out a more robust analysis of the correctness and/or asymptotic complexity of the algorithms and data structures we looked at
- State which algorithm or data structure you chose, from among those covered in class, and whether your focus is on analyzing correctness or efficiency of the algorithm/data structure (or both!)
- Write out the chosen proof, which should be your own work, though it can draw on existing proofs
- Throughout the proof in Step 2, also provide an explanation of each step—for guidance on what we’re looking for, imagine not only writing the proof but also walking someone through it who has never seen it before (though you can assume they have some level of mathematical background)
(Basically, for this option, we can’t prevent you from looking at an existing proof—and we don’t want to!—so instead we are looking at your ability to convey your understanding of how/why the proof works)
Using data validation and pipeline orchestration libraries to build a pipeline useful for your life or your job
- Give an outline of the problem(s) with how you currently perform the task (for example, inefficiency or difficulty in tracking progress), and why you think they could be helped or resolved with the aid of an ETL pipeline (NOTE: If the pipeline would be too ambitious to complete during the time remaining in the semester, then in this step you should describe a Minimum Viable Product that you are able to do within the remaining time)
- Explicitly state and describe the Extract, Transform, and Load steps that you will implement, making note of the points where you will use Pydantic/Pandera for data validation (you must incorporate at least one data validation step somewhere in your pipeline—for example, a custom class you create which extends Pydantic’s `BaseModel` class)
- Include the actual code from your pipeline, along with substantive comments and descriptions of how it works (at minimum, for each task within your flow, you should describe each of its arguments and what it returns—if you’re comfortable doing so, the best way to do this would be to include a Sphinx Docstring within each of your tasks!)
- Finally, include screenshots from the Prefect UI showing an example run of the pipeline, as well as any artifacts that your pipeline generates upon completion (your pipeline must produce at least one artifact as a completion report, though it doesn’t need to be fancy—for example, it could just be a short Markdown-formatted document which displays the total runtime of the pipeline from start to finish)
Individual vs. Group Projects
It is totally up to you whether you’d like to do the project individually or in a group with other students. However, if you are pursuing the project as a group, please choose one member of the group to serve as the “project lead”, and include this detail in an email to your mentor.
The mentor for the group project will then be whoever was assigned as the individual mentor for the project lead (this choice doesn’t have to be related to the actual work on the project; it is just for us to be able to allocate mentees fairly between the course staff!).
Expectations for group projects will scale based on the number of members in the group: for example, a group with two members will be expected to carry out a more substantive project, such that it requires approximately two times the amount of work that would be expected for individual projects.¹
Timeline
In general, the only “true” due date is the due date for the final submission, but the project will go most smoothly if you are able to hold yourself to the following schedule:
- Proposal: Approved by mentor by Wednesday, April 10th
- Final Draft: Sent to mentor for review by Monday, April 29th
- Submission: Completed project submitted via Google Classroom by Friday, May 6th, 5:59pm EDT
Submission Format
There is now an assignment page for the final project on Google Classroom, where you will upload your final submission for grading. The following is a rough sketch of what we’re looking for in terms of the structure of your submission:
- HTML format, as a rendered Quarto manuscript, would be optimal, but can be PDF if preferred—for example, if you choose Option 4 (involving mathematical proofs), you might instead want to use LaTeX rendered to PDF.
- A strict page-count requirement is difficult to give, but a reasonable range for the PDF format would be 5-20 pages double-spaced. For a Quarto document or Jupyter notebook, the length can be the equivalent of this (for example, you can print-preview the Quarto doc to see how many pages it would produce if printed)
- It should have an abstract, a 250-500 word paragraph at the very top of the manuscript, summarizing what you did
- Citations, if used, should be set up so that they’re handled automatically: by Quarto’s citation manager, for example, or by BibTeX/Biber if you’re using LaTeX.
Footnotes
1. This detail is not something we’re trying to explicitly measure or be harsh about, but it is included here because otherwise (if the expectations for individual and group projects were exactly the same) the work would scale the other way: each person in a two-person project would be doing half the amount of work of a person doing an individual project… Hopefully that makes sense from a fairness perspective!