Week 4: Data Structures from Scratch with OOP

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Class Sessions

Author

Affiliation

Jeff Jacobs

jj1088@georgetown.edu

Published

Thursday, January 30, 2025

Open slides in new window →

Schedule

Today’s Planned Schedule:

	Start	End	Topic
Lecture	6:30pm	7:00pm	Object-Oriented Programming (OOP) →
	7:00pm	7:30pm	`LinkedList`: Lists from Scratch →
	7:30pm	8:00pm	Other Data Structures = Fancy Linked Lists →
Break!	8:00pm	8:10pm
	8:10pm	9:00pm	`HashTable`: Logarithmic to Constant Efficiency 🤯 →

Object-Oriented Programming

Breaking a Problem into (Interacting) Parts

Python so far: “Data science mode”
- Start at top of file with raw data
- Write lines of code until problem solved
Python in this class: “Software engineering mode”
- Break system down into parts
- Write each part separately
- Link parts together to create the whole
(One implication: .py files may be easier than .ipynb for development! Can you think of why?)

How Does A Calculator Work?

(Calculator image from Wikimedia Commons)

Key OOP Feature #1: Encapsulation

Imagine you’re on a team trying to make a calculator
One person can write the Screen class, another person can write the Button class, and so on
Natural division of labor! (May seem silly for a calculator, but imagine as your app scales up)

Key OOP Feature #2: Abstraction

Abstraction complements this Encapsulation: the Screen team doesn’t need to know the internal details of Button (just it’s API), and vice-versa
Relevant data and functions can be “public”, irrelevant internal data and functions “private”
(Like with type hints), Python doesn’t enforce this distinction, but (unlike with type hints) most libraries do separate public from private by a variable-naming convention…

Public, Protected, Private Attributes in Python

[Public (default)] No underscores: public_var
[Protected] One underscore: _protected_var
[Private] Two underscores: __private_var

Code

class MySecretInfo:
  __the_info = "I love Carly Rae Jepsen"

info_obj = MySecretInfo()
info_obj.__the_info

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 5
      2   __the_info = "I love Carly Rae Jepsen"
      4 info_obj = MySecretInfo()
----> 5 info_obj.__the_info

AttributeError: 'MySecretInfo' object has no attribute '__the_info'

Guess we can’t access it then, right? 😮‍💨

Code

info_obj._MySecretInfo__the_info

'I love Carly Rae Jepsen'

NOO MY SECRET!!! 😭
Pls don’t tell anyone

Key OOP Features #3-4: Inheritance, Polymorphism

Better explained in diagrams than words (next 10 slides!), but we can get a sense by thinking about their etymology:
“Inheritance” comes from “heir”, like “heir to the throne”
- Parent passes on [things they possess] to their children
“Polymorphism”: Poly = “many”, Morphe = “forms”
- How does Python know what to do when we print()?
- It “just works” because print() (through __str__()) takes on many (!) forms (!): each type of object has its own implementation of __str__()

Use Case: Bookstore Inventory Management

In Pictures

Creating Classes

Use case: Creating an inventory system for a Bookstore

Code

class Bookstore:
    def __init__(self, name, location):
        self.name = name
        self.location = location
        self.books = []

    def __getitem__(self, index):
        return self.books[index]

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Bookstore[{self.get_num_books()} books]"

    def add_books(self, book_list):
        self.books.extend(book_list)

    def get_books(self):
        return self.books

    def get_inventory(self):
        book_lines = []
        for book_index, book in enumerate(self.get_books()):
            cur_book_line = f"{book_index}. {str(book)}"
            book_lines.append(cur_book_line)
        return "\n".join(book_lines)

    def get_num_books(self):
        return len(self.get_books())

    def sort_books(self, sort_key):
        self.books.sort(key=sort_key)

class Book:
    def __init__(self, title, authors, num_pages):
        self.title = title
        self.authors = authors
        self.num_pages = num_pages

    def __str__(self):
        return f"Book[title={self.get_title()}, authors={self.get_authors()}, pages={self.get_num_pages()}]"

    def get_authors(self):
        return self.authors

    def get_first_author(self):
        return self.authors[0]

    def get_num_pages(self):
        return self.num_pages

    def get_title(self):
        return self.title

class Person:
    def __init__(self, family_name, given_name):
        self.family_name = family_name
        self.given_name = given_name

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Person[{self.get_family_name()}, {self.get_given_name()}]"

    def get_family_name(self):
        return self.family_name

    def get_given_name(self):
        return self.given_name

Code

my_bookstore = Bookstore("Bookland", "Washington, DC")
plath = Person("Plath", "Sylvia")
bell_jar = Book("The Bell Jar", [plath], 244)
marx = Person("Marx", "Karl")
engels = Person("Engels", "Friedrich")
manifesto = Book("The Communist Manifesto", [marx, engels], 43)
elster = Person("Elster", "Jon")
cement = Book("The Cement of Society", [elster], 311)
my_bookstore.add_books([bell_jar, manifesto, cement])
print(my_bookstore)
print(my_bookstore[0])
print("Inventory:")
print(my_bookstore.get_inventory())

Bookstore[3 books]
Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
Inventory:
0. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
1. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
2. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]

Doing Things With Classes

Now we can use our OOP structure, for example to sort the inventory in different ways!

Alphabetical (By First Author)

Code

sort_alpha = lambda x: x.get_first_author().get_family_name()
my_bookstore.sort_books(sort_key = sort_alpha)
print(my_bookstore.get_inventory())

0. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]
1. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
2. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]

By Page Count

Code

sort_pages = lambda x: x.get_num_pages()
my_bookstore.sort_books(sort_key = sort_pages)
print(my_bookstore.get_inventory())

0. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
1. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
2. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]

Inheritance and Polymorphism

Encapsulate general properties in parent class, specific properties in child classes

(You can edit this or make your own UML diagrams in nomnoml!)

Or… Is This Better?

Design Choices

Goal is encapsulation: which objects should have which properties/methods?
Example: Fiction vs. Non-Fiction. How important is this distinction for your use case?

Option 1: As Property of Book

Code

from enum import Enum
class BookType(Enum):
    NONFICTION = 0
    FICTION = 1

class Book:
    def __init__(self, title: str, authors: list[Person], num_pages: int, type: BookType):
        self.title = title
        self.authors = authors
        self.num_pages = num_pages
        self.type = type

    def __str__(self):
        return f"Book[title={self.title}, authors={self.authors}, pages={self.num_pages}, type={self.type}]"

Code

joyce = Person("Joyce", "James")
ulysses = Book("Ulysses", [joyce], 732, BookType.FICTION)
schelling = Person("Schelling", "Thomas")
micromotives = Book("Micromotives and Macrobehavior", [schelling], 252, BookType.NONFICTION)
print(ulysses)
print(micromotives)

Book[title=Ulysses, authors=[Person[Joyce, James]], pages=732, type=BookType.FICTION]
Book[title=Micromotives and Macrobehavior, authors=[Person[Schelling, Thomas]], pages=252, type=BookType.NONFICTION]

Option 2: Separate Classes

Code

# class Book defined as earlier
class FictionBook(Book):
    def __init__(self, title, authors, num_pages, characters):
        super().__init__(title, authors, num_pages)
        self.characters = characters

class NonfictionBook(Book):
    def __init__(self, title, authors, num_pages, topic):
        super().__init__(title, authors, num_pages)
        self.topic = topic

Code

joyce = Person("Joyce", "James")
ulysses = FictionBook("Ulysses", [joyce], 732, ["Daedalus"])
schelling = Person("Schelling", "Thomas")
micromotives = NonfictionBook("Micromotives and Macrobehavior", [schelling], 252, "Economics")
print(ulysses)
print(micromotives)

Book[title=Ulysses, authors=[Person[Joyce, James]], pages=732]
Book[title=Micromotives and Macrobehavior, authors=[Person[Schelling, Thomas]], pages=252]

Core Data Structure: `LinkedList`

Looking Under the Hood of a Data Structure

Last week we saw the math for why we can “abstract away from” the details of how a particular language works
We want to understand these structures independently of the specifics of their implementation in Python (for now)
So, let’s construct our own simplified versions of the basic structures, and use these simplified versions to get a sense for their efficiency
- (The “true” Python versions may be hyper-optimized but, as we’ll see, there are fundamental constraints on runtime, assuming $P \neq NP$)

Tuples

Code

class MyTuple:
  def __init__(self, thing1, thing2):
    self.thing1 = thing1
    self.thing2 = thing2

  def __repr__(self):
    return f"({self.thing1}, {self.thing2})"

  def __str__(self):
    return self.__repr__()

t1 = MyTuple('a','b')
t2 = MyTuple(111, 222)
print(t1)
print(t2)

(a, b)
(111, 222)

Lists

The list itself just points to a root item:

Code

class MyList:
  def __init__(self):
    self.root = None

  def append(self, new_item):
    if self.root is None:
      self.root = MyListItem(new_item)
    else:
      self.root.append(new_item)

  def __repr__(self):
    return self.root.__repr__()

An item has contents, pointer to next item:

Code

class MyListItem:
  def __init__(self, content):
    self.content = content
    self.next = None

  def append(self, new_item):
    if self.next is None:
      self.next = MyListItem(new_item)
    else:
      self.next.append(new_item)

  def __repr__(self):
    my_content = self.content
    return my_content if self.next is None else f"{my_content}, {self.next.__repr__()}"

Code

users = MyList()
users.append('Jeff')
users.append('Alma')
users.append('Bo')
print(users)

Jeff, Alma, Bo

So, How Many “Steps” Are Required…

To retrieve the first element in a MyTuple?
To retrieve the last element in a MyTuple?
To retrieve the first element in a MyList?
To retrieve the last element in a MyList?

How Many Steps?

With a MyTuple:

Code

t1.thing1

'a'

$\implies$ 1 step

Code

t1.thing2

'b'

$\implies$ 1 step

With a MyList:

Code

print(users.root.content)

Jeff

$\implies$ 1 step

Code

current_node = users.root
while current_node.next is not None:
  current_node = current_node.next
print(current_node.content)

Bo

$\implies$ (3 steps)

…But why 3? How many steps if the list contained 5 elements? $N$ elements?

Visualizing `LinkedList`s

Code

from hw2 import LinkedList, InventoryItem
ll = LinkedList()
item1 = InventoryItem('Mango', 50)
ll.append(item1)
item2 = InventoryItem('Pickle', 60)
ll.append(item2)
item3 = InventoryItem('Artichoke', 55)
ll.append(item3)
item5 = InventoryItem('Banana', 123)
ll.append(item5)
item6 = InventoryItem('Aardvark', 11)
ll.append(item6)
HTML(visualize_ll(ll))

Onwards and Upwards: Fancier Algorithms

LinkedList: Foundation for Most(?) Data Structures!

class LinkedList(BaseModel):
  root: LinkedListNode | None

class LinkedListNode(BaseModel):
  content: object
  next: LinkedListNode | None

class BinaryTree(BaseModel):
  root: BinaryTreeNode | None

class BinaryTreeNode(BaseModel):
  content: object
  left: BinaryTreeNode | None
  right: BinaryTreeNode | None

class QuadTree(BaseModel):
  root: QuadTreeNode | None

class QuadTreeNode:
  content: object
  nw: QuadTreeNode | None
  ne: QuadTreeNode | None
  sw: QuadTreeNode | None
  se: QuadTreeNode | None

Visualizing `BinarySearchTree`

Code

from hw2 import BinarySearchTree
bst = BinarySearchTree()
item1 = InventoryItem('Mango', 50)
bst.add(item1)
item2 = InventoryItem('Pickle', 60)
bst.add(item2)
item3 = InventoryItem('Artichoke', 55)
bst.add(item3)
item5 = InventoryItem('Banana', 123)
bst.add(item5)
item6 = InventoryItem('Aardvark', 11)
bst.add(item6)
HTML(visualize(bst))

Warning: node None_0, port f1 unrecognized
Warning: node None_1, port f1 unrecognized
Warning: node None_2, port f1 unrecognized
Warning: node None_3, port f1 unrecognized
Warning: node None_4, port f1 unrecognized
Warning: node None_5, port f1 unrecognized

So Then… Why Is This a Whole DSAN Class?

The core structures are identical, but we can optimize different goals (efficient insertion, sorting, retrieval, deletion, …) by changing the invariants maintained by the algorithms internal to our structure
Crucial Insertion-Sort invariant: $\textsf{Sorted}(1,i)$ true when we move to entry $i + 1$ (key)
Crucial HW2(!) invariant: $\textsf{Up-To-Date-Favorite}(1,i-1)$ true when entry $i + 1$ (next result in dataset) arrives
$\implies$ Efficiency of obtaining favorite style guaranteed to be constant-time, $\overline{O}(1)$!
Otherwise, would be $\overline{O}(n) > \overline{O}(1)$ (linear approach) or at best $\overline{O}(\log_2(n)) > \overline{O}(1)$ (divide-and-conquer)

Hash Tables

*(Spoiler alert, so you know I’m not lying to you: this is a LinkedList with some additional structure!)
You just got hired as a cashier (Safeway cashiering alum myself 🫡)
The scanner is broken (spoiler #2: the scanner uses a hash table), so you start writing down items along with their prices, one-by-one, as items come in…

Our List of (Item, Price) Pairs

Code

price_list = []
price_list.append(('Banana', 10))
price_list.append(('Apple', 2))
price_list.append(('Four Loko', 5))
price_list

[('Banana', 10), ('Apple', 2), ('Four Loko', 5)]

As the list gets longer, it gets harder and harder to find where you wrote down a specific item and its price
As you now know, you could use linear search, $\overline{O}(n)$, or if you ensure alphabetical order (an invariant!), you could use binary, divide-and-conquer search, $\overline{O}(\log_2(n))$
We can do even better: $\overline{O}(1)$. First w/magic, but then math

Strange-At-First Technique for Algorithm Analysis: Oracles

What if we had a magical wizard who could just tell us where to find an item we were looking for?
Sounds like I’m joking or saying “what if we had a billion $ and infinite aura and we could fly and walk through walls”
And yet, through the magic of math and computer science, there are concrete hashing algorithms which ensure (in a mathematically-provable way!) “almost certain” $\overline{O}(1)$ lookup time

Mathematical Strategy of Oracles

We’ll use a concrete, simplified hash function to illustrate
Mathematically we’ll be able to get something like

\[ T(n) = \overline{O}(1 + \underbrace{\epsilon}_{\mathclap{\text{Collision rate}}} \cdot n) \]

Which tells us: if we had an oracle who could ensure near-0 collision rates, then $T(n) = \overline{O}(1)$.
And, by a beautiful division-of-labor, other computer scientists figure out the near-0 collision rates part, giving us

\[ p^{✅} = [T(n) = \overline{O}(1 + \epsilon n)], q^{✅} = [\epsilon \approx 0],\text{ so } p \wedge q \implies T(n) \overset{✅}{=} \overline{O}(1). \]

Back to the Price List

Our hash function: hash(item) = first letter of item

\[ h(\texttt{x}) = \texttt{x[0]} \]

h('Banana') = 'B', h('Monkey') = 'M'
With this function in hand, we can create a length-26 array, one slot for each letter in alphabet, and then write down (item, price) pairs in whatever slot item hashes to

The Importance of Differentiating Operations: Insertion vs. Lookup

So far, we have $\overline{O}(1)$ insertion via hashing
We also get $\overline{O}(1)$ lookup!
When customer hands us an item (say, 'Banana'), we compute the hash (B), look in that slot, and obtain the price for bananas.
We also get $\overline{O}(1)$ updating (hash to find the old price, update it to have new price) and $\overline{O}(1)$ deletion (hash to find the slot containing the item, then erase it from that slot)

So What’s the Catch???

BLUEBERRIES show up to ruin our day (as usual 😞)
We hash, so far so good: h('Blueberries') = 'B'
But then we go to the B slot and see that (Bananas, 10) is already there!!! Wtf do we do here… don’t panic!
The answer? We open our HW2 from DSAN 5500 and remember that we have our lovely friend the LinkedList that we can use whenever and however we want!

Arrays vs. Linked Lists

Jeff is hiding something here… Why jump to LinkedList? Why not just… another length-26 array, for example?
For this we open up our Week 1 slides and remember the stack vs. heap distinction: we know how many letters in the alphabet, we don’t know how many items starting with B (or, if we do, we want to be able to expand/contract our price list to handle new/discontinued items)
Terminology for this kind of “hybrid” data structure: HashTable is an Array that “degenerates into” a LinkedList (when there are collisions)

Look How Far We Came!

Beginning of class: Only structure we knew allowing insertion (LinkedList) was $\overline{O}(n)$ for everythihg
End of class: New structure where suddenly everything is $\overline{O}(1)$, except in “unlucky” cases, in which it partially “degenerates” into a LinkedList
$\implies$ The “inevitable” $\overline{O}(n)$ runtime has transformed into the unlucky worst-case upper bound
$\implies$ By taking core data structures/algorithms from your toolkit, you can “piece together” hybrid structures whose whole (runtime) is better than the sum of its parts

Taking This Idea and Running With It

Next week we’ll look at BinarySearchTree (BST)
Since it’s just a glorified LinkedList, we’ll be able to take our HashMap from today and “drop in” the BST to play the role the LinkedList is playing right now
If collision, we’ll create a BST with its $\overline{O}(\log(n))$ operations, rather than a LinkedList with its $\overline{O}(n)$ operations
$\implies$ HashMap will go from:
- $\overline{O}(1)$ best-case but $\overline{O}(n)$ worst-case to
- $\overline{O}(1)$ best-case but $\overline{O}(\log_2(n))$ worst-case!

Schedule

Object-Oriented Programming

Breaking a Problem into (Interacting) Parts

How Does A Calculator Work?

Key OOP Feature #1: Encapsulation

Key OOP Feature #2: Abstraction

Public, Protected, Private Attributes in Python

Key OOP Features #3-4: Inheritance, Polymorphism

Use Case: Bookstore Inventory Management

In Pictures

Creating Classes

Doing Things With Classes

Inheritance and Polymorphism

Or… Is This Better?

Design Choices

Core Data Structure: LinkedList

Looking Under the Hood of a Data Structure

Tuples

Lists

So, How Many “Steps” Are Required…

How Many Steps?

Visualizing LinkedLists

Onwards and Upwards: Fancier Algorithms

LinkedList: Foundation for Most(?) Data Structures!

Visualizing BinarySearchTree

So Then… Why Is This a Whole DSAN Class?

Hash Tables

Our List of (Item, Price) Pairs

Strange-At-First Technique for Algorithm Analysis: Oracles

Mathematical Strategy of Oracles

Back to the Price List

The Importance of Differentiating Operations: Insertion vs. Lookup

So What’s the Catch???

Arrays vs. Linked Lists

Look How Far We Came!

Taking This Idea and Running With It

Core Data Structure: `LinkedList`

Visualizing `LinkedList`s

Visualizing `BinarySearchTree`