Week 3: Data Structures and Computational Complexity

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Jeff Jacobs

jj1088@georgetown.edu

Thursday, January 23, 2025

Why We Can’t Learn These Topics Separately

Data Structure Choice \(\Leftrightarrow\) Efficiency for Task

  • Do we need to be able to insert quickly?
  • Do we need to be able to sort quickly?
  • Do we need to be able to search quickly?
  • Are we searching for individual items or for ranges?

Basic Data Structures

Recall: Primitives

  • bool
  • int
  • float
  • None
  • Now we want to put these together, to form… structures! 👀
  • Structures are the things that live in the heap; the stack just points to them

Tuples

  • Fixed-size collection of \(N\) objects
  • Unless otherwise specified, we’re talking about \(2\)-tuples
  • Example: We can locate something on the Earth by specifying two floats: latitude and longitude!
Code
gtown_loc = (38.9076, -77.0723)
gtown_loc
(38.9076, -77.0723)
  • But what if we don’t know in advance how many items we want to store? Ex: how can we store users for a new app?

Sequences

  • In General: Mapping of integer indices to objects
  • x = ['a','b','c']
    • \(\implies\) x[0] = 'a'
    • \(\implies\) x[1] = 'b'
    • \(\implies\) x[2] = 'c'
  • In Python: list
  • Nice built-in language constructs for looping over lists, and especially for performing operations on each element

Looping Over Sequences

  • C/C++/Java:
List<String> myList = Arrays.asList("a", "b", "c");
for (int i = 0; i < x.size(); i++) {
    System.out.println(myList.get(i));
}

a
b
c
  • Python:
Code
my_list = ['a','b','c']
for list_element in my_list:
  print(list_element)
a
b
c

List Comprehensions: Apply Function to Each Element

  • Construct new list by applying operation to each element:
Code
my_nums = [4,5,6,7]
my_squares = [num ** 2 for num in my_nums]
my_squares
[16, 25, 36, 49]
  • Can also filter the elements of the list with if:
Code
my_odd_squares = [num ** 2 for num in my_nums if num % 2 == 1]
my_odd_squares
[25, 49]

Sets

  • Extremely helpful + efficient for finding unique elements:
Code
animals_i_saw = ['bird','bird','fish','bird','cat','bird','lizard']
print(f"Number of animals I saw: {len(animals_i_saw)}")
Number of animals I saw: 7
Code
unique_animals_me = set(animals_i_saw)
print(f"Set of unique animals: {unique_animals_me}")
print(f"Number of unique animals: {len(unique_animals_me)}")
Set of unique animals: {'fish', 'bird', 'cat', 'lizard'}
Number of unique animals: 4
  • Supports all set operators from math:
Code
animals_you_saw = ['lizard','dog','bird','bird','bird']
unique_animals_you = set(animals_you_saw)
unique_animals_both = unique_animals_me.intersection(unique_animals_you)
print(f"Animals we both saw: {unique_animals_both}")
Animals we both saw: {'bird', 'lizard'}
Code
unique_animals_either = unique_animals_me.union(unique_animals_you)
print(f"Animals either of us saw: {unique_animals_either}")
Animals either of us saw: {'fish', 'dog', 'cat', 'bird', 'lizard'}
Code
unique_animals_meonly = unique_animals_me - unique_animals_you
print(f"Animals I saw that you didn't see: {unique_animals_meonly}")
unique_animals_youonly = unique_animals_you - unique_animals_me
print(f"Animals you saw that I didn't see: {unique_animals_youonly}")
Animals I saw that you didn't see: {'fish', 'cat'}
Animals you saw that I didn't see: {'dog'}

Maps / Dictionaries

  • While other language like Java have lots of fancy types of Map, Python has a single type, the dictionary:
Code
gtown_data = {
  'name': 'Georgetown University',
  'founded': 1789,
  'coordinates': (38.9076, -77.0723),
  'location': {
    'city': 'Washington',
    'state': 'DC', # <__<
    'country': 'USA'
  }
}
print(gtown_data.keys())
print(gtown_data.values())
dict_keys(['name', 'founded', 'coordinates', 'location'])
dict_values(['Georgetown University', 1789, (38.9076, -77.0723), {'city': 'Washington', 'state': 'DC', 'country': 'USA'}])
  • Be careful when looping! Default behavior is iteration over keys:
Code
for k in gtown_data:
  print(k)
name
founded
coordinates
location
  • For key-value pairs use .items():
Code
for k, v in gtown_data.items():
  print(k, v)
name Georgetown University
founded 1789
coordinates (38.9076, -77.0723)
location {'city': 'Washington', 'state': 'DC', 'country': 'USA'}

Complexity With Respect To A Structure

  • Last week: analyzing complexity of an algorithm (no broader context)
  • This week: analyzing complexity of a structure as a collection of variables + algorithms
  • Object-Oriented Programming: Design pattern for “organizing” data and algorithms into structures

Looking Under the Hood of a Data Structure

  • Last week we saw the math for why we can “abstract away from” the details of how a particular language works
  • We want to understand these structures independently of the specifics of their implementation in Python (for now)
  • So, let’s construct our own simplified versions of the basic structures, and use these simplified versions to get a sense for their efficiency
    • (The “true” Python versions may be hyper-optimized, but as we saw, there are fundamental constraints on runtime, assuming \(P \neq NP\))

Tuples

Code
class MyTuple:
  def __init__(self, thing1, thing2):
    self.thing1 = thing1
    self.thing2 = thing2

  def __repr__(self):
    return f"({self.thing1}, {self.thing2})"

  def __str__(self):
    return self.__repr__()

t1 = MyTuple('a','b')
t2 = MyTuple(111, 222)
print(t1, t2)
(a, b) (111, 222)

Lists

  • The list itself just points to a root item:
Code
class MyList:
  def __init__(self):
    self.root = None

  def append(self, new_item):
    if self.root is None:
      self.root = MyListItem(new_item)
    else:
      self.root.append(new_item)

  def __repr__(self):
    return self.root.__repr__()
  • An item has contents, pointer to next item:
Code
class MyListItem:
  def __init__(self, content):
    self.content = content
    self.next = None

  def append(self, new_item):
    if self.next is None:
      self.next = MyListItem(new_item)
    else:
      self.next.append(new_item)

  def __repr__(self):
    my_content = self.content
    return my_content if self.next is None else f"{my_content}, {self.next.__repr__()}"
Code
users = MyList()
users.append('Jeff')
users.append('Alma')
users.append('Bo')
print(users)
Jeff, Alma, Bo

So, How Many “Steps” Are Required…

  • To retrieve the first element in a MyTuple?
  • To retrieve the last element in a MyTuple?
  • To retrieve the first element in a MyList?
  • To retrieve the last element in a MyList?

How Many Steps?

  • With a MyTuple:
Code
t1.thing1
'a'
  • \(\implies\) 1 step
Code
t1.thing2
'b'
  • \(\implies\) 1 step
  • With a MyList:
Code
print(users.root.content)
Jeff
  • \(\implies\) 1 step
Code
current_node = users.root
while current_node.next is not None:
  current_node = current_node.next
print(current_node.content)
Bo
  • \(\implies\) (3 steps)
  • …But why 3? How many steps if the list contained 5 elements? \(N\) elements?

Pairwise-Concatenating List Elements

  • Now rather than just printing, let’s pairwise concatenate:
Code
cur_pointer1 = users.root
while cur_pointer1 is not None:
  cur_pointer2 = users.root
  while cur_pointer2 is not None:
    print(cur_pointer1.content + cur_pointer2.content)
    cur_pointer2 = cur_pointer2.next
  cur_pointer1 = cur_pointer1.next
JeffJeff
JeffAlma
JeffBo
AlmaJeff
AlmaAlma
AlmaBo
BoJeff
BoAlma
BoBo
  • How many steps did this take? How about for a list with \(5\) elements? \(N\) elements?

Last Example: Pairwise-Concat + End Check

Code
printed_items = []
cur_pointer1 = users.root
while cur_pointer1 is not None:
  cur_pointer2 = users.root
  while cur_pointer2 is not None:
    print(cur_pointer1.content + cur_pointer2.content)
    printed_items.append(cur_pointer1.content)
    printed_items.append(cur_pointer2.content)
    cur_pointer2 = cur_pointer2.next
  cur_pointer1 = cur_pointer1.next
check_pointer = users.root
while check_pointer is not None:
  if check_pointer.content in printed_items:
    print(f"Phew. {check_pointer.content} printed at least once.")
  else:
    print(f"Oh no! {check_pointer.content} was never printed!!!")
  check_pointer = check_pointer.next
JeffJeff
JeffAlma
JeffBo
AlmaJeff
AlmaAlma
AlmaBo
BoJeff
BoAlma
BoBo
Phew. Jeff printed at least once.
Phew. Alma printed at least once.
Phew. Bo printed at least once.

Generalizing

  • Algorithms are “efficient” relative to how their runtime scales as the objects grow larger and larger!
  • Tuple operations take 1 step no matter what
  • For lists, retrieving the first element takes 1 step no matter what, but retrieving the last element takes \(n\) steps!
  • Pairwise concatenation requires \(n^2\) steps!

The Complexity of Our Examples

  • Tuple operations: \(O(1)\)
  • Retrieving the first element of a list: \(O(1)\)
  • Retrieving the last element of a list: \(O(n)\)
  • Pairwise concatenation: \(O(n^2)\)
  • Pairwise concatenation+check: \(O(n^2 + n) = O(n^2) \leftarrow !!!\)
  • Crucial to think asymptotically to wrap our heads around this!

Doing Better Than Insertion Sort

How Can Merge Sort Work That Much Better!?

  • With the linear approach, each time we check a word and it’s not our word we eliminate… one measly word 😞
  • But with the divide-and-conquer approach… we eliminate 🔥HALF OF THE REMAINING WORDS🔥

Merging Two Sorted Lists in \(O(n)\) Time

From Cormen et al. (2001), pg. 37

Merge Sort (Merging as Subroutine)

From Cormen et al. (2001), pg. 40

Complexity Analysis

  • Hard way: re-do the line-by-line analysis we did for Insertion-Sort 😣 Easy way: stand on shoulders of giants!
  • Using a famous+fun theorem (the Master Theorem): Given a recurrence \(T(n) = aT(n/b) + f(n)\), compute its:
    • Watershed function \(W(n) = n^{\log_b(a)}\) and
    • Driving function \(D(n) = f(n)\)
  • The Master Theorem gives closed-form asymptotic solution for \(T(n)\), split into three cases: (1) \(W(n)\) grows faster than \(D(n)\), (2) grows at same rate as \(D(n)\), or (3) grows slower than \(D(n)\)

Bounding the Runtime of Merge Sort

  • How about Merge-Sort? \(T(n) = 2T(n/2) + \Theta(n)\)
    • \(a = b = 2\), \(W(n) = n^{\log_2(2)} = n\), \(D(n) = \Theta(n)\)
  • \(W(n)\) and \(D(n)\) grow at same rate \(\implies\) Case 21:

Applying the Master Theorem When \(W(n) = \Theta(D(n))\) (Case 2)

  1. Is there a \(k \geq 0\) satisfying \(D(n) = \Theta(n^{\log_b(a)}\log_2^k(n))\)?
  2. If so, your solution is \(T(n) = \Theta(n^{\log_b(a)}\log_2^{k+1}(n))\)
  • Merge-Sort: \(k = 0\) works! \(\Theta(n^{\log_2(2)}\log_2^0(n)) = \Theta(n)\)
  • Thus \(T(n) = \Theta(n^{\log_b(a)}\log_2^{k+1}(n)) = \boxed{\Theta(n\log_2n)}\) 😎

Object-Oriented Programming

Breaking a Problem into (Interacting) Parts

  • Python so far: “Data science mode”
    • Start at top of file with raw data
    • Write lines of code until problem solved
  • Python in this class: “Software engineering mode”
    • Break system down into parts
    • Write each part separately
    • Link parts together to create the whole
  • (One implication: .py files may be easier than .ipynb for development!)

How Does A Calculator Work?

(Calculator image from Wikimedia Commons)

Key OOP Feature #1: Encapsulation

  • Imagine you’re on a team trying to make a calculator
  • One person can write the Screen class, another person can write the Button class, and so on
  • Natural division of labor! (May seem silly for a calculator, but imagine as your app scales up)

Key OOP Feature #2: Abstraction

  • Abstraction complements this Encapsulation: the Screen team doesn’t need to know the internal details of Button (just it’s API), and vice-versa
  • Relevant data and functions can be “public”, irrelevant internal data and functions “private”
  • (Like with type hints), Python doesn’t enforce this distinction, but (unlike with type hints) most libraries do separate public from private by a variable-naming convention

Public, Protected, Private Attributes in Python

  • [Public (default)] No underscores: public_var
  • [Protected] One underscore: _protected_var
  • [Private] Two underscores: __private_var
Code
class MyTopSecretInfo:
  __the_info = "I love Carly Rae Jepsen"

info_obj = MyTopSecretInfo()
info_obj.__the_info
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[23], line 5
      2   __the_info = "I love Carly Rae Jepsen"
      4 info_obj = MyTopSecretInfo()
----> 5 info_obj.__the_info

AttributeError: 'MyTopSecretInfo' object has no attribute '__the_info'

Guess we can’t access it then, right? 😮‍💨

Code
info_obj._MyTopSecretInfo__the_info
'I love Carly Rae Jepsen'
  • NOO MY SECRET!!! 😭
  • Pls don’t tell anyone

Key OOP Features #3-4: Inheritance, Polymorphism

  • Better explained in diagrams than words (next 10 slides!), but we can get a sense by thinking about their etymology:
  • “Inheritance” comes from “heir”, like “heir to the throne”
    • Parent passes on [things they possess] to their children
  • “Polymorphism”: Poly = “many”, Morphe = “forms”
    • How does Python know what to do when we print()?
    • It “just works” because print() (through __str__()) takes on many (!) forms (!): each type of object has its own implementation of __str__()

Use Case: Bookstore Inventory Management

Image source

In Pictures

G Bookstore Relational Diagram Bookstore Bookstore Name Location Booklist Get_Inventory() Sort_Inventory() Place Place City State Country Print_Map() Bookstore:loc->Place:placehead Has One Book Book Title Authors Num Pages Preview() Bookstore:bl->Book:bookhead Has Multiple Person Person Family Name Given Name Book:auths->Person:personhead Has Multiple

Creating Classes

  • Use case: Creating an inventory system for a Bookstore
Code
class Bookstore:
    def __init__(self, name, location):
        self.name = name
        self.location = location
        self.books = []

    def __getitem__(self, index):
        return self.books[index]

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Bookstore[{self.get_num_books()} books]"

    def add_books(self, book_list):
        self.books.extend(book_list)

    def get_books(self):
        return self.books

    def get_inventory(self):
        book_lines = []
        for book_index, book in enumerate(self.get_books()):
            cur_book_line = f"{book_index}. {str(book)}"
            book_lines.append(cur_book_line)
        return "\n".join(book_lines)

    def get_num_books(self):
        return len(self.get_books())

    def sort_books(self, sort_key):
        self.books.sort(key=sort_key)

class Book:
    def __init__(self, title, authors, num_pages):
        self.title = title
        self.authors = authors
        self.num_pages = num_pages

    def __str__(self):
        return f"Book[title={self.get_title()}, authors={self.get_authors()}, pages={self.get_num_pages()}]"

    def get_authors(self):
        return self.authors

    def get_first_author(self):
        return self.authors[0]

    def get_num_pages(self):
        return self.num_pages

    def get_title(self):
        return self.title

class Person:
    def __init__(self, family_name, given_name):
        self.family_name = family_name
        self.given_name = given_name

    def __repr__(self):
        return self.__str__()

    def __str__(self):
        return f"Person[{self.get_family_name()}, {self.get_given_name()}]"

    def get_family_name(self):
        return self.family_name

    def get_given_name(self):
        return self.given_name
Code
my_bookstore = Bookstore("Bookland", "Washington, DC")
plath = Person("Plath", "Sylvia")
bell_jar = Book("The Bell Jar", [plath], 244)
marx = Person("Marx", "Karl")
engels = Person("Engels", "Friedrich")
manifesto = Book("The Communist Manifesto", [marx, engels], 43)
elster = Person("Elster", "Jon")
cement = Book("The Cement of Society", [elster], 311)
my_bookstore.add_books([bell_jar, manifesto, cement])
print(my_bookstore)
print(my_bookstore[0])
print("Inventory:")
print(my_bookstore.get_inventory())
Bookstore[3 books]
Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
Inventory:
0. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
1. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
2. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]

Doing Things With Classes

  • Now we can use our OOP structure, for example to sort the inventory in different ways!
Alphabetical (By First Author)
Code
sort_alpha = lambda x: x.get_first_author().get_family_name()
my_bookstore.sort_books(sort_key = sort_alpha)
print(my_bookstore.get_inventory())
0. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]
1. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
2. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
By Page Count
Code
sort_pages = lambda x: x.get_num_pages()
my_bookstore.sort_books(sort_key = sort_pages)
print(my_bookstore.get_inventory())
0. Book[title=The Communist Manifesto, authors=[Person[Marx, Karl], Person[Engels, Friedrich]], pages=43]
1. Book[title=The Bell Jar, authors=[Person[Plath, Sylvia]], pages=244]
2. Book[title=The Cement of Society, authors=[Person[Elster, Jon]], pages=311]

Inheritance and Polymorphism

  • Encapsulate general properties in parent class, specific properties in child classes

(You can edit this or make your own UML diagrams in nomnoml!)

Or… Is This Better?

Edit in nomnoml

Design Choices

  • The goal is to encapsulate as best as possible: which objects should have which properties, and which methods?
  • Example: Fiction vs. Non-Fiction. How important is this distinction for your use case?
Option 1: As Property of Book
Code
from enum import Enum
class BookType(Enum):
    NONFICTION = 0
    FICTION = 1

class Book:
    def __init__(self, title: str, authors: list[Person], num_pages: int, type: BookType):
        self.title = title
        self.authors = authors
        self.num_pages = num_pages
        self.type = type

    def __str__(self):
        return f"Book[title={self.title}, authors={self.authors}, pages={self.num_pages}, type={self.type}]"
Code
joyce = Person("Joyce", "James")
ulysses = Book("Ulysses", [joyce], 732, BookType.FICTION)
schelling = Person("Schelling", "Thomas")
micromotives = Book("Micromotives and Macrobehavior", [schelling], 252, BookType.NONFICTION)
print(ulysses)
print(micromotives)
Book[title=Ulysses, authors=[Person[Joyce, James]], pages=732, type=BookType.FICTION]
Book[title=Micromotives and Macrobehavior, authors=[Person[Schelling, Thomas]], pages=252, type=BookType.NONFICTION]
Option 2: Separate Classes
Code
# class Book defined as earlier
class FictionBook(Book):
    def __init__(self, title, authors, num_pages, characters):
        super().__init__(title, authors, num_pages)
        self.characters = characters

class NonfictionBook(Book):
    def __init__(self, title, authors, num_pages, topic):
        super().__init__(title, authors, num_pages)
        self.topic = topic
Code
joyce = Person("Joyce", "James")
ulysses = FictionBook("Ulysses", [joyce], 732, ["Daedalus"])
schelling = Person("Schelling", "Thomas")
micromotives = NonfictionBook("Micromotives and Macrobehavior", [schelling], 252, "Economics")
print(ulysses)
print(micromotives)
Book[title=Ulysses, authors=[Person[Joyce, James]], pages=732]
Book[title=Micromotives and Macrobehavior, authors=[Person[Schelling, Thomas]], pages=252]

HW1: Python Fundamentals

Appendix: The Full Master Theorem

Master Theorem: Let \(a > 0\) and \(b > 1\) be constants, and let \(f(n)\) be a driving function defined and nonnegative on all sufficiently large reals. Define \(T(n)\) on \(n \in \mathbb{N}\) by

\[ T(n) = aT(n/b) + f(n) \]

where \(aT(n/b) = a'T(\lfloor n/b \rfloor) + a''T(\lceil n/b \rceil)\) for some \(a' \geq 0\) and \(a'' \geq 0\) satisfying \(a = a' + a''\). Then the asymptotic behavior of \(T(n)\) can be characterized as follows:

  1. If there exists \(\epsilon > 0\) such that \(f(n) = O(n^{\log_b(a) - \epsilon})\), then \(T(n) = \Theta(n^{\log_b(a)})\)
  2. If there exists \(k \geq 0\) such that \(f(n) = \Theta(n^{\log_b(a)}\log_2^k(n))\), then \(T(n) = \Theta(n^{\log_b(a)}\log_2^{k+1}(n))\).
  3. If there exists \(\epsilon > 0\) such that \(f(n) = \Omega(n^{\log_b(a) + \epsilon})\), and if \(f(n)\) satisfies the regularity condition \(af(n/b) \leq cf(n)\) for some constant \(c < 1\) and all sufficiently large \(n\), then \(T(n) = \Theta(f(n))\).

Proof. See Cormen et al. (2001), pg. 107-114.

(← Back to Merge Sort slides)

References

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001. Introduction To Algorithms. MIT Press.