Week 4: Data Structures from Scratch with OOP

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Class Sessions
Author
Affiliation

Jeff Jacobs

Published

Thursday, January 30, 2025

Open slides in new window →

Schedule

Today’s Planned Schedule:

Start End Topic
Lecture 6:30pm 7:00pm Object-Oriented Programming (OOP) →
7:00pm 7:30pm LinkedList: Lists from Scratch →
7:30pm 8:00pm Other Data Structures = Fancy Linked Lists →
Break! 8:00pm 8:10pm
8:10pm 9:00pm HashTable: Logarithmic to Constant Efficiency 🤯 →

Object-Oriented Programming

Breaking a Problem into (Interacting) Parts

  • Python so far: “Data science mode”
    • Start at top of file with raw data
    • Write lines of code until problem solved
  • Python in this class: “Software engineering mode”
    • Break system down into parts
    • Write each part separately
    • Link parts together to create the whole
  • (One implication: .py files may be easier than .ipynb for development! Can you think of why?)

How Does A Calculator Work?

(Calculator image from Wikimedia Commons)

Key OOP Feature #1: Encapsulation

  • Imagine you’re on a team trying to make a calculator
  • One person can write the Screen class, another person can write the Button class, and so on
  • Natural division of labor! (May seem silly for a calculator, but imagine as your app scales up)

Key OOP Feature #2: Abstraction

  • Abstraction complements this Encapsulation: the Screen team doesn’t need to know the internal details of Button (just it’s API), and vice-versa
  • Relevant data and functions can be “public”, irrelevant internal data and functions “private”
  • (Like with type hints), Python doesn’t enforce this distinction, but (unlike with type hints) most libraries do separate public from private by a variable-naming convention

Public, Protected, Private Attributes in Python

  • [Public (default)] No underscores: public_var
  • [Protected] One underscore: _protected_var
  • [Private] Two underscores: __private_var
Code
class MySecretInfo:
  __the_info = "I love Carly Rae Jepsen"

info_obj = MySecretInfo()
info_obj.__the_info
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 5
      2   __the_info = "I love Carly Rae Jepsen"
      4 info_obj = MySecretInfo()
----> 5 info_obj.__the_info

AttributeError: 'MySecretInfo' object has no attribute '__the_info'

Guess we can’t access it then, right? 😮‍💨

Code
info_obj._MySecretInfo__the_info
'I love Carly Rae Jepsen'
  • NOO MY SECRET!!! 😭
  • Pls don’t tell anyone

Key OOP Features #3-4: Inheritance, Polymorphism

  • Better explained in diagrams than words (next 10 slides!), but we can get a sense by thinking about their etymology:
  • “Inheritance” comes from “heir”, like “heir to the throne”
    • Parent passes on [things they possess] to their children
  • “Polymorphism”: Poly = “many”, Morphe = “forms”
    • How does Python know what to do when we print()?
    • It “just works” because print() (through __str__()) takes on many (!) forms (!): each type of object has its own implementation of __str__()

Use Case: Bookstore Inventory Management

In Pictures

G Bookstore Relational Diagram Bookstore Bookstore Name Location Booklist Get_Inventory() Sort_Inventory() Place Place City State Country Print_Map() Bookstore:loc->Place:placehead Has One Book Book Title Authors Num Pages Preview() Bookstore:bl->Book:bookhead Has Multiple Person Person Family Name Given Name Book:auths->Person:personhead Has Multiple

Creating Classes

  • Use case: Creating an inventory system for a Bookstore
Code
class Bookstore:
    def __init__(self, name, location):
        self.name = name
        self.location = location
        self.books = []

    def __getitem__(self, index):
        return self.books[index]

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Bookstore[{self.get_num_books()} books]"

    def add_books(self, book_list):
        self.books.extend(book_list)

    def get_books(self):
        return self.books

    def get_inventory(self):
        book_lines = []
        for book_index, book in enumerate(self.get_books()):
            cur_book_line = f"{book_index}. {str(book)}"
            book_lines.append(cur_book_line)
        return "\n".join(book_lines)

    def get_num_books(self):
        return len(self.get_books())

    def sort_books(self, sort_key):
        self.books.sort(key=sort_key)

class Book:
    def __init__(self, title, authors, num_pages):
        self.title = title
        self.authors = authors
        self.num_pages = num_pages

    def __str__(self):
        return f"Book[title={self.get_title()}, authors={self.get_authors()}, pages={self.get_num_pages()}]"

    def get_authors(self):
        return self.authors

    def get_first_author(self):
        return self.authors[0]

    def get_num_pages(self):
        return self.num_pages

    def get_title(self):
        return self.title

class Person:
    def __init__(self, family_name, given_name):
        self.family_name = family_name
        self.given_name = given_name

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Person[{self.get_family_name()}, {self.get_given_name()}]"

    def get_family_name(self):
        return self.family_name

    def get_given_name(self):
        return self.given_name
Code
my_bookstore = Bookstore("Bookland", "Washington, DC")
plath = Person("Plath", "Sylvia")
bell_jar = Book("The Bell Jar", [plath], 244)
marx = Person("Marx", "Karl")
engels = Person("Engels", "Friedrich")
manifesto = Book("The Communist Manifesto", [marx, engels], 43)
elster = Person("Elster", "Jon")
cement = Book("The Cement of Society", [elster], 311)
my_bookstore.add_books([bell_jar, manifesto, cement])
print(my_bookstore)
print(my_bookstore[0])
print("Inventory:")
print(my_bookstore.get_inventory())
Bookstore[3 books]
Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
Inventory:
0. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
1. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
2. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]

Doing Things With Classes

  • Now we can use our OOP structure, for example to sort the inventory in different ways!
Alphabetical (By First Author)
Code
sort_alpha = lambda x: x.get_first_author().get_family_name()
my_bookstore.sort_books(sort_key = sort_alpha)
print(my_bookstore.get_inventory())
0. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]
1. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
2. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
By Page Count
Code
sort_pages = lambda x: x.get_num_pages()
my_bookstore.sort_books(sort_key = sort_pages)
print(my_bookstore.get_inventory())
0. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
1. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
2. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]

Inheritance and Polymorphism

  • Encapsulate general properties in parent class, specific properties in child classes

(You can edit this or make your own UML diagrams in nomnoml!)

Or… Is This Better?

Edit in nomnoml

Design Choices

  • Goal is encapsulation: which objects should have which properties/methods?
  • Example: Fiction vs. Non-Fiction. How important is this distinction for your use case?
Option 1: As Property of Book
Code
from enum import Enum
class BookType(Enum):
    NONFICTION = 0
    FICTION = 1

class Book:
    def __init__(self, title: str, authors: list[Person], num_pages: int, type: BookType):
        self.title = title
        self.authors = authors
        self.num_pages = num_pages
        self.type = type

    def __str__(self):
        return f"Book[title={self.title}, authors={self.authors}, pages={self.num_pages}, type={self.type}]"
Code
joyce = Person("Joyce", "James")
ulysses = Book("Ulysses", [joyce], 732, BookType.FICTION)
schelling = Person("Schelling", "Thomas")
micromotives = Book("Micromotives and Macrobehavior", [schelling], 252, BookType.NONFICTION)
print(ulysses)
print(micromotives)
Book[title=Ulysses, authors=[Person[Joyce, James]], pages=732, type=BookType.FICTION]
Book[title=Micromotives and Macrobehavior, authors=[Person[Schelling, Thomas]], pages=252, type=BookType.NONFICTION]
Option 2: Separate Classes
Code
# class Book defined as earlier
class FictionBook(Book):
    def __init__(self, title, authors, num_pages, characters):
        super().__init__(title, authors, num_pages)
        self.characters = characters

class NonfictionBook(Book):
    def __init__(self, title, authors, num_pages, topic):
        super().__init__(title, authors, num_pages)
        self.topic = topic
Code
joyce = Person("Joyce", "James")
ulysses = FictionBook("Ulysses", [joyce], 732, ["Daedalus"])
schelling = Person("Schelling", "Thomas")
micromotives = NonfictionBook("Micromotives and Macrobehavior", [schelling], 252, "Economics")
print(ulysses)
print(micromotives)
Book[title=Ulysses, authors=[Person[Joyce, James]], pages=732]
Book[title=Micromotives and Macrobehavior, authors=[Person[Schelling, Thomas]], pages=252]

Core Data Structure: LinkedList

Looking Under the Hood of a Data Structure

  • Last week we saw the math for why we can “abstract away from” the details of how a particular language works
  • We want to understand these structures independently of the specifics of their implementation in Python (for now)
  • So, let’s construct our own simplified versions of the basic structures, and use these simplified versions to get a sense for their efficiency
    • (The “true” Python versions may be hyper-optimized but, as we’ll see, there are fundamental constraints on runtime, assuming \(P \neq NP\))

Tuples

Code
class MyTuple:
  def __init__(self, thing1, thing2):
    self.thing1 = thing1
    self.thing2 = thing2

  def __repr__(self):
    return f"({self.thing1}, {self.thing2})"

  def __str__(self):
    return self.__repr__()

t1 = MyTuple('a','b')
t2 = MyTuple(111, 222)
print(t1)
print(t2)
(a, b)
(111, 222)

Lists

The list itself just points to a root item:

Code
class MyList:
  def __init__(self):
    self.root = None

  def append(self, new_item):
    if self.root is None:
      self.root = MyListItem(new_item)
    else:
      self.root.append(new_item)

  def __repr__(self):
    return self.root.__repr__()

An item has contents, pointer to next item:

Code
class MyListItem:
  def __init__(self, content):
    self.content = content
    self.next = None

  def append(self, new_item):
    if self.next is None:
      self.next = MyListItem(new_item)
    else:
      self.next.append(new_item)

  def __repr__(self):
    my_content = self.content
    return my_content if self.next is None else f"{my_content}, {self.next.__repr__()}"
Code
users = MyList()
users.append('Jeff')
users.append('Alma')
users.append('Bo')
print(users)
Jeff, Alma, Bo

So, How Many “Steps” Are Required…

  • To retrieve the first element in a MyTuple?
  • To retrieve the last element in a MyTuple?
  • To retrieve the first element in a MyList?
  • To retrieve the last element in a MyList?

How Many Steps?

With a MyTuple:

Code
t1.thing1
'a'

\(\implies\) 1 step

Code
t1.thing2
'b'

\(\implies\) 1 step

With a MyList:

Code
print(users.root.content)
Jeff

\(\implies\) 1 step

Code
current_node = users.root
while current_node.next is not None:
  current_node = current_node.next
print(current_node.content)
Bo

\(\implies\) (3 steps)

…But why 3? How many steps if the list contained 5 elements? \(N\) elements?

Visualizing LinkedLists

Code
from hw2 import LinkedList, InventoryItem
ll = LinkedList()
item1 = InventoryItem('Mango', 50)
ll.append(item1)
item2 = InventoryItem('Pickle', 60)
ll.append(item2)
item3 = InventoryItem('Artichoke', 55)
ll.append(item3)
item5 = InventoryItem('Banana', 123)
ll.append(item5)
item6 = InventoryItem('Aardvark', 11)
ll.append(item6)
HTML(visualize_ll(ll))

Onwards and Upwards: Fancier Algorithms

LinkedList: Foundation for Most(?) Data Structures!

class LinkedList(BaseModel):
  root: LinkedListNode | None
class LinkedListNode(BaseModel):
  content: object
  next: LinkedListNode | None

class BinaryTree(BaseModel):
  root: BinaryTreeNode | None
class BinaryTreeNode(BaseModel):
  content: object
  left: BinaryTreeNode | None
  right: BinaryTreeNode | None

class QuadTree(BaseModel):
  root: QuadTreeNode | None
class QuadTreeNode:
  content: object
  nw: QuadTreeNode | None
  ne: QuadTreeNode | None
  sw: QuadTreeNode | None
  se: QuadTreeNode | None

Visualizing BinarySearchTree

Code
from hw2 import BinarySearchTree
bst = BinarySearchTree()
item1 = InventoryItem('Mango', 50)
bst.add(item1)
item2 = InventoryItem('Pickle', 60)
bst.add(item2)
item3 = InventoryItem('Artichoke', 55)
bst.add(item3)
item5 = InventoryItem('Banana', 123)
bst.add(item5)
item6 = InventoryItem('Aardvark', 11)
bst.add(item6)
HTML(visualize(bst))
Warning: node None_0, port f1 unrecognized
Warning: node None_1, port f1 unrecognized
Warning: node None_2, port f1 unrecognized
Warning: node None_3, port f1 unrecognized
Warning: node None_4, port f1 unrecognized
Warning: node None_5, port f1 unrecognized

So Then… Why Is This a Whole DSAN Class?

  • The core structures are identical, but we can optimize different goals (efficient insertion, sorting, retrieval, deletion, …) by changing the invariants maintained by the algorithms internal to our structure
  • Crucial Insertion-Sort invariant: \(\textsf{Sorted}(1,i)\) true when we move to entry \(i + 1\) (key)
  • Crucial HW2(!) invariant: \(\textsf{Up-To-Date-Favorite}(1,i-1)\) true when entry \(i + 1\) (next result in dataset) arrives
  • \(\implies\) Efficiency of obtaining favorite style guaranteed to be constant-time, \(\overline{O}(1)\)!
  • Otherwise, would be \(\overline{O}(n) > \overline{O}(1)\) (linear approach) or at best \(\overline{O}(\log_2(n)) > \overline{O}(1)\) (divide-and-conquer)

Hash Tables

  • *(Spoiler alert, so you know I’m not lying to you: this is a LinkedList with some additional structure!)
  • You just got hired as a cashier (Safeway cashiering alum myself 🫡)
  • The scanner is broken (spoiler #2: the scanner uses a hash table), so you start writing down items along with their prices, one-by-one, as items come in…

Our List of (Item, Price) Pairs

Code
price_list = []
price_list.append(('Banana', 10))
price_list.append(('Apple', 2))
price_list.append(('Four Loko', 5))
price_list
[('Banana', 10), ('Apple', 2), ('Four Loko', 5)]
  • As the list gets longer, it gets harder and harder to find where you wrote down a specific item and its price
  • As you now know, you could use linear search, \(\overline{O}(n)\), or if you ensure alphabetical order (an invariant!), you could use binary, divide-and-conquer search, \(\overline{O}(\log_2(n))\)
  • We can do even better: \(\overline{O}(1)\). First w/magic, but then math

Strange-At-First Technique for Algorithm Analysis: Oracles

  • What if we had a magical wizard who could just tell us where to find an item we were looking for?
  • Sounds like I’m joking or saying “what if we had a billion $ and infinite aura and we could fly and walk through walls”
  • And yet, through the magic of math and computer science, there are concrete hashing algorithms which ensure (in a mathematically-provable way!) “almost certain” \(\overline{O}(1)\) lookup time

Mathematical Strategy of Oracles

  • We’ll use a concrete, simplified hash function to illustrate
  • Mathematically we’ll be able to get something like

\[ T(n) = \overline{O}(1 + \underbrace{\epsilon}_{\mathclap{\text{Collision rate}}} \cdot n) \]

  • Which tells us: if we had an oracle who could ensure near-0 collision rates, then \(T(n) = \overline{O}(1)\).
  • And, by a beautiful division-of-labor, other computer scientists figure out the near-0 collision rates part, giving us

\[ p^{✅} = [T(n) = \overline{O}(1 + \epsilon n)], q^{✅} = [\epsilon \approx 0],\text{ so } p \wedge q \implies T(n) \overset{✅}{=} \overline{O}(1). \]

Back to the Price List

  • Our hash function: hash(item) = first letter of item

\[ h(\texttt{x}) = \texttt{x[0]} \]

  • h('Banana') = 'B', h('Monkey') = 'M'
  • With this function in hand, we can create a length-26 array, one slot for each letter in alphabet, and then write down (item, price) pairs in whatever slot item hashes to

The Importance of Differentiating Operations: Insertion vs. Lookup

  • So far, we have \(\overline{O}(1)\) insertion via hashing
  • We also get \(\overline{O}(1)\) lookup!
  • When customer hands us an item (say, 'Banana'), we compute the hash (B), look in that slot, and obtain the price for bananas.
  • We also get \(\overline{O}(1)\) updating (hash to find the old price, update it to have new price) and \(\overline{O}(1)\) deletion (hash to find the slot containing the item, then erase it from that slot)

So What’s the Catch???

  • BLUEBERRIES show up to ruin our day (as usual 😞)
  • We hash, so far so good: h('Blueberries') = 'B'
  • But then we go to the B slot and see that (Bananas, 10) is already there!!! Wtf do we do here… don’t panic!
  • The answer? We open our HW2 from DSAN 5500 and remember that we have our lovely friend the LinkedList that we can use whenever and however we want!

Arrays vs. Linked Lists

  • Jeff is hiding something here… Why jump to LinkedList? Why not just… another length-26 array, for example?
  • For this we open up our Week 1 slides and remember the stack vs. heap distinction: we know how many letters in the alphabet, we don’t know how many items starting with B (or, if we do, we want to be able to expand/contract our price list to handle new/discontinued items)
  • Terminology for this kind of “hybrid” data structure: HashTable is an Array that “degenerates into” a LinkedList (when there are collisions)

Look How Far We Came!

  • Beginning of class: Only structure we knew allowing insertion (LinkedList) was \(\overline{O}(n)\) for everythihg
  • End of class: New structure where suddenly everything is \(\overline{O}(1)\), except in “unlucky” cases, in which it partially “degenerates” into a LinkedList
  • \(\implies\) The “inevitable” \(\overline{O}(n)\) runtime has transformed into the unlucky worst-case upper bound
  • \(\implies\) By taking core data structures/algorithms from your toolkit, you can “piece together” hybrid structures whose whole (runtime) is better than the sum of its parts

Taking This Idea and Running With It

  • Next week we’ll look at BinarySearchTree (BST)
  • Since it’s just a glorified LinkedList, we’ll be able to take our HashMap from today and “drop in” the BST to play the role the LinkedList is playing right now
  • If collision, we’ll create a BST with its \(\overline{O}(\log(n))\) operations, rather than a LinkedList with its \(\overline{O}(n)\) operations
  • \(\implies\) HashMap will go from:
    • \(\overline{O}(1)\) best-case but \(\overline{O}(n)\) worst-case to
    • \(\overline{O}(1)\) best-case but \(\overline{O}(\log_2(n))\) worst-case!