Final Project Details
DSAN 5500: Data Structures, Objects, and Algorithms in Python
- [Sunday, April 14, 11pm] Added some additional examples to the cards in the Overview section
- [Thursday, April 4, 6:15pm] Added section with details on Individual vs. Group Projects
- [Thursday, April 4, 4:30pm] Updated Timeline section to specify that proposals should be approved by April 10th, not April 3rd
Overview
The idea for your final project is as follows:
In pretty much any course on algorithms and data structures, whether in a Data Science, Computer Science, or Applied Math context, there’s only so much that can be covered in a given unit. This means that, for all of the topics we’ve covered thus far, there’s a next step that could be taken in learning about the topic. In general this could mean one of the following four options, where I’ve provided a concrete example specific to the class under each heading:
Diving 'one level deeper' into the details of an algorithm/data structure
- Example 1.1: Efficient implementations of deletion in linked lists, rather than just the insertion and retrieval operations that we looked at in class.
- Example 1.2: Expanding the Binary Search Tree into a Quadtree
- Example 1.3: Creating an interactive visualization (using Streamlit, for example) of a data structure or algorithm.
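To make Example 1.1 concrete, here is a minimal sketch of deletion in a singly linked list. The `Node`/`LinkedList` names are illustrative, not the class implementation; the point is that deletion needs a reference to the *previous* node, which is exactly the kind of detail a "one level deeper" project could dig into (e.g., by comparing against a doubly linked list).

```python
class Node:
    def __init__(self, content):
        self.content = content
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None

    def insert(self, content):
        # Insert at the head: O(1)
        new_node = Node(content)
        new_node.next = self.head
        self.head = new_node

    def delete(self, content):
        # Delete the first node holding `content`, relinking around it.
        # O(n) worst case in a singly linked list, since we must walk to
        # the node *before* the one being removed.
        prev, curr = None, self.head
        while curr is not None:
            if curr.content == content:
                if prev is None:
                    self.head = curr.next
                else:
                    prev.next = curr.next
                return True
            prev, curr = curr, curr.next
        return False
```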
Figuring out how to address some existing drawback of the algorithm/data structure
- Example 2.1: Binary Search Trees lose their logarithmic efficiency if items are added to them in order (since newly-added items are always added to the right subtree of all existing items, thus "degenerating" into a linear linked list structure), an issue which fancier versions of the BST like AVL Trees and Red-Black Trees resolve.
- Example 2.2: As you saw in the Midterm, hash tables "degenerate" from (approximate) $O(1)$ efficiency to $O(\log_2(n))$ efficiency if the **hashing function** we use is not **efficient** or not **collision-resistant**. So, your project could be to explore and summarize how efficient and collision-resistant hash functions work, and then implement one such hash function in Python.
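The degeneration in Example 2.1 is easy to see empirically. The following is a sketch using a toy BST (not the class implementation): inserting keys in sorted order produces a tree whose height equals the number of keys, while a random insertion order stays much closer to \(\log_2(n)\).

```python
import random

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

random.seed(5500)
keys = list(range(127))
shuffled = keys.copy()
random.shuffle(shuffled)

degenerate = None
for k in keys:          # inserted in sorted order
    degenerate = insert(degenerate, k)

balanced_ish = None
for k in shuffled:      # inserted in random order
    balanced_ish = insert(balanced_ish, k)

# Sorted-order insertion: every key hangs off the right subtree, so the
# tree is a chain of height 127. Random order stays far shallower,
# typically within a small multiple of log2(128) = 7.
print(height(degenerate))    # 127
print(height(balanced_ish))
```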
Examining different applications of the data structure that make it helpful for particular contexts
- Example 3.1: In bioinformatics, the sorting algorithms we learned are already helpful for genetic sequence searching, but they can be made even more efficient by applying the Burrows-Wheeler Transform to a collection of genetic sequences before applying the sorting algorithms.
- Example 3.2: We learned Breadth-First Search (BFS) and Depth-First Search (DFS) in somewhat of a hurry, as a way to traverse over the nodes in a Binary Search Tree, but they both have many more exciting applications! For example, if you're interested in web scraping, you could adapt the code from HW2 to create a polymorphic web-scraper, where each node is a webpage, and processing a node means adding each link in the page to the stack/queue.
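The scraper idea in Example 3.2 can be sketched abstractly: here the page-fetching step is left as a caller-supplied `get_links` function (in a real scraper it might use `requests` plus BeautifulSoup; this sketch deliberately avoids network calls so the traversal logic stands on its own).

```python
from collections import deque

def crawl_bfs(start_url, get_links, max_pages=100):
    """Generic BFS traversal over pages: `get_links(url)` is any callable
    returning the outgoing links from a page."""
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # popleft() = queue = BFS; pop() would give DFS
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Swapping the single `popleft()` call for `pop()` turns the crawl into DFS, which is the stack/queue polymorphism idea from HW2 applied to the web.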
Carrying out a more robust analysis of the correctness and/or asymptotic complexity of the algorithms and data structures we looked at
- Example 4.1: In the lecture on Insertion-Sort, I hinted at the notion of a correctness proof using invariants---how it would proceed along the same lines as a proof by induction that you might have seen in a previous math class---but never formally walked you through how to do one.
- Example 4.2: Over the first half of the course, we were able to gradually improve retrieval of elements from data structures in a progression like:
- Retrieval: LinkedList \(O(n)\) \(\rightarrow\) BST \(O(\log_2(n))\) \(\rightarrow\) HashTable \(O(1 + \varepsilon)\)
- Sorting: Insertion-Sort \(O(n^2)\) \(\rightarrow\) Merge-Sort \(O(n\log_2(n))\) \(\rightarrow\) ???
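As a pointer toward Example 4.1, the usual invariant-based correctness argument for Insertion-Sort has three standard parts. This is only a sketch of the shape of such a proof, not the formal write-up a project would produce:

```latex
\textbf{Invariant.} At the start of the outer-loop iteration with index $j$,
the subarray $A[1..j-1]$ contains the same elements it held initially,
in sorted order.
\begin{itemize}
  \item \textbf{Initialization:} when $j = 2$, the subarray $A[1..1]$ is a
        single element and hence trivially sorted.
  \item \textbf{Maintenance:} the inner loop shifts each element of
        $A[1..j-1]$ greater than $A[j]$ one position to the right and
        inserts $A[j]$ into the gap, so $A[1..j]$ is sorted at the start
        of iteration $j+1$.
  \item \textbf{Termination:} when the loop exits with $j = n + 1$, the
        invariant says $A[1..n]$ is sorted, which is exactly correctness.
\end{itemize}
```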
Using data validation and pipeline orchestration libraries to build a pipeline useful for your life or your job
- Example 5.1: If there is some task you currently handle manually---say, entering receipts from your email into a spreadsheet, or (the example I mentioned in class) making a fulltext index of ebooks in your library---your project can explore automating this task using Pydantic and Prefect.
- Example 5.2: If there are particular APIs and Database solutions that you've been hoping to explore, now's your chance! Create an ETL pipeline which Extracts data from the API, Transforms it in some interesting way, then Loads it into (for example) an SQL or MongoDB database.
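The Extract/Transform/Load shape in Examples 5.1 and 5.2 can be sketched with the standard library alone. In the actual project you would wrap each step in Prefect's `@task`/`@flow` decorators and replace the hand-rolled check below with a Pydantic `BaseModel` subclass; the receipt example and all names here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Receipt:
    vendor: str
    amount_cents: int

    def __post_init__(self):
        # Stand-in for a Pydantic validator: reject malformed records early
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")

def extract(raw_rows):
    # E: parse raw records (a list of dicts standing in for an API response),
    # validating each one as it enters the pipeline
    return [Receipt(**row) for row in raw_rows]

def transform(receipts):
    # T: aggregate spending per vendor
    totals = {}
    for r in receipts:
        totals[r.vendor] = totals.get(r.vendor, 0) + r.amount_cents
    return totals

def load(totals, destination):
    # L: write results out (a dict standing in for an SQL/MongoDB database)
    destination.update(totals)
    return destination
```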
So, your task for the final project will be to
- Choose one of the topics we covered
- Determine on your own, or work with your mentor to find, an interesting “next step” with respect to this topic
- Carry out this next step (details below, since this will look different for different choices from the above options!)
- Write up what you did in a report, including any of the code or math or data analysis that you did in Step 3
Concrete Requirements
For each of the options described in the Overview section above, the following info boxes describe the structure of the deliverable(s) you would be responsible for (click a header to expand the box; they’re collapsed by default to avoid overloading you with info if you’ve already chosen!)
Diving ‘one level deeper’ into the details of the algorithm/data structure
Here, the deliverable should be structured like:
- State which algorithm or data structure you chose, from among those covered in class
- Identify a substantive implementation detail that we didn’t cover in class
- Implement it in Python, describing in your own words (through comments or Markdown text above/below the code) how your implementation works
- Write a set of unit tests demonstrating that your implementation of the detail from Step 2 works as intended across a range of possible inputs
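For a sense of what "a set of unit tests" can look like, here is a pytest-style sketch; the `binary_search` function is just a toy stand-in for whatever your project implements, and the test names are illustrative.

```python
def binary_search(xs, target):
    # Toy implementation under test: return the index of `target` in the
    # sorted list `xs`, or -1 if absent.
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def test_finds_present_element():
    assert binary_search([1, 3, 5, 7], 5) == 2

def test_missing_element_returns_minus_one():
    assert binary_search([1, 3, 5, 7], 4) == -1

def test_empty_input():
    assert binary_search([], 1) == -1
```

Running `pytest` on a file like this discovers and runs each `test_*` function automatically; covering present, absent, and edge-case inputs is the "range of possible inputs" we have in mind.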
Figuring out how to address some existing drawback of the algorithm/data structure
Here, the deliverable should be structured like:
- State which algorithm or data structure you chose, from among those covered in class
- Identify a substantive drawback present in the implementation of that topic that we learned in class
- Implement a fix or improvement in Python, describing in your own words (through comments or Markdown text above/below the code) how your implementation works
- Write a set of unit tests demonstrating that your implementation resolves/fixes the drawback that you identified in Step 2
Examining different applications of the data structure that make it helpful for particular fields
- State which course topic you’ve chosen
- Identify a substantive application of that topic in a field you are interested in: e.g., bioinformatics, marketing, finance, NLP, image processing, operations management, social science
- Write a Literature Review section containing a few examples of the use of the topic in your chosen field that are particularly interesting to you
- Choose one of the examples from your literature review and create a Quarto document or Jupyter notebook containing code that you wrote yourself which illustrates the application of the data structure or algorithm within your domain of interest, describing in your own words (through comments or Markdown text above/below the code) how the code works and what it does
Carrying out a more robust analysis of the correctness and/or asymptotic complexity of the algorithms and data structures we looked at
- State which algorithm or data structure you chose, from among those covered in class, and whether your focus is on analyzing correctness or efficiency of the algorithm/data structure (or both!)
- Write out the chosen proof, which should be your own work, though it can draw on existing proofs
- Throughout the proof in Step 2, also provide an explanation of each step—for guidance on what we’re looking for, imagine not only writing the proof but also walking someone through it who has never seen it before (though you can assume they have some level of mathematical background)
(Basically, for this option, we can’t prevent you from looking at an existing proof—and we don’t want to!—so instead we are looking at your ability to convey your understanding of how/why the proof works)
Using data validation and pipeline orchestration libraries to build a pipeline useful for your life or your job
- Give an outline of the problem(s) with how you currently perform the task (for example, inefficiency or difficulty in tracking progress), and why you think they could be helped or resolved with the aid of an ETL pipeline (NOTE: If the pipeline would be too ambitious to complete during the time remaining in the semester, then in this step you should describe a Minimum Viable Product that you are able to do within the remaining time)
- Explicitly state and describe the Extract, Transform, and Load steps that you will implement, making note of the points where you will use Pydantic/Pandera for data validation (you must incorporate at least one data validation step somewhere in your pipeline—for example, a custom class you create which extends Pydantic’s `BaseModel` class)
- Include the actual code from your pipeline, along with substantive comments and descriptions of how it works (at minimum, for each task within your flow, you should describe each of its arguments and what it returns—if you’re comfortable doing so, the best way to do this would be to include a Sphinx Docstring within each of your tasks!)
- Finally, include screenshots from the Prefect UI showing an example run of the pipeline, as well as any artifacts that your pipeline generates upon completion (your pipeline must produce at least one artifact as a completion report, though it doesn’t need to be fancy—for example, it could just be a short Markdown-formatted document which displays the total runtime of the pipeline from start to finish)
Individual vs. Group Projects
It is totally up to you whether you’d like to do the project individually or in a group with other students. However, if you are pursuing the project as a group, please choose one member of the group to serve as the “project lead”, and include this detail in an email to your mentor.
The mentor for the group project will then be whoever was assigned as the individual mentor for the project lead (this choice doesn’t have to be related to the actual work on the project; it is just for us to be able to allocate mentees fairly between the course staff!).
Expectations for group projects will scale based on the number of members in the group: for example, a group with two members will be expected to carry out a more substantive project, such that it requires approximately two times the amount of work that would be expected for individual projects.¹
Timeline
In general, the only “true” due date is the due date for the final submission, but the project will go most smoothly if you are able to hold yourself to the following schedule:
- Proposal: Approved by mentor by Wednesday, April 10th
- Final Draft: Sent to mentor for review by Monday, April 29th
- Submission: Completed project submitted via Google Classroom by Friday, May 6th, 5:59pm EDT
Submission Format
There is now an assignment page for the final project on Google Classroom, where you will upload your final submission for grading. The following is a rough sketch of what we’re looking for in terms of the structure of your submission:
- HTML format, as a rendered Quarto manuscript, would be optimal, but can be PDF if preferred—for example, if you choose Option 4 (involving mathematical proofs), you might instead want to use LaTeX rendered to PDF.
- A strict page-count requirement is difficult to give, but a reasonable range for the PDF format would be 5-20 pages double-spaced. For a Quarto document or Jupyter notebook, the length can be the equivalent of this (for example, you can print-preview the Quarto doc to see how many pages it would produce if printed)
- It should have an abstract, a 250-500 word paragraph at the very top of the manuscript, summarizing what you did
- Citations, if used, should be set up so that they’re handled automatically: by Quarto’s citation manager, for example, or by BibTeX/Biber if you’re using LaTeX.
Footnotes
1. This detail is not something we’re trying to explicitly measure or be harsh about, but it is included here because otherwise (if the expectations for individual and group projects were exactly the same) the work would scale the other way: each person in a two-person project would be doing half the amount of work of a person doing an individual project… Hopefully that makes sense from a fairness perspective!