Week 1: Course Intro and Motivation

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Class Sessions
Author
Affiliation

Jeff Jacobs

Published

Wednesday, January 10, 2024

Open slides in new window →

Welcome to DSAN 5500!

  • Principles for the Course
  • Algorithmic Thinking vs. Coding
  • Data Structures: Motivation
  • Algorithms and Complexity: Motivation

Course Principles

  • Comparative Understanding
  • Learning Data Structures and Complexity Simultaneously

Before and After

Python before taking DSAN 5500

Python after taking DSAN 5500

Developing a Comparative Understanding

“We hardly know ourselves, if we know nobody else”

–(Blue Scholars, “Sagaba”)

  • The course focuses on Python, but part of understanding Python is understanding how Python does things differently from other languages!
  • Just as C was “overtaken” by Java, then Java was “overtaken” by Python, Python will someday be overtaken

The Numbers

Code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
lang_df = pd.read_csv("assets/gh_issues.csv")
# The data for 2022 is essentially useless
lang_df = lang_df[lang_df['year'] <= 2021].copy()
lang_df['time'] = lang_df['year'].astype(str) + "_" + lang_df['quarter'].astype(str)
lang_df['prop'] = lang_df['count'] / lang_df.groupby('time')['count'].transform('sum')
lang_df.head()
#sns.lineplot(data=lang_df, x='year', y='count', color='name')
# Keep only most popular languages
keep_langs = ['Python','JavaScript','C','C++','C#','Java','Ruby']
pop_df = lang_df[lang_df['name'].isin(keep_langs)].copy()
fig = px.line(pop_df,
  x='time', y='prop', color='name',
  template='simple_white', title='Programming Language Popularity Since 2012',
  labels = {
    'time': 'Year',
    'prop': 'Proportion of GitHub Issues'
  }
)
fig.update_layout(
  xaxis = dict(
    tickmode = 'array',
    tickvals = [f"{year}_1" for year in range(2012,2022)],
    ticktext = [f"{year}" for year in range(2012,2022)]
  )
)
fig.show()

Research Particular Subfields!

  • For example, if you’re interested in pursuing Economics, you’ll want to learn Stata
  • Physics? You may want to learn MATLAB
  • For pure mathematics: Julia / Mathematica
  • Statistics, sociology, psychology, political science: R
  • Web development: JavaScript / TypeScript
  • The holy grail: you’re comfortable with Python but can also think in general, language-agnostic terms about algorithmic and data-structural efficiency!

Avoid Analysis Paralysis

  • (Easier said than done, admittedly…)

Image source: XKCD

Tie Yourself to the Mast

  • The exhausted 3am version of you will thank present you for writing useful comments, exceptions/errors, and type hints!

John William Waterhouse, Public domain, via Wikimedia Commons

Part 1: Coding in General

Types of Languages

  • Compiled
  • Interpreted

Primitive Types

  • Boolean (True or False)
  • Numbers (Integers, Decimals)
  • Strings
  • None

Stack and Heap

Let’s look at what happens, in the computer’s memory, when we run the following code:

Code
import datetime
import pandas as pd
country_df = pd.read_csv("assets/country_pop.csv")
pop_col = country_df['pop']
num_rows = len(country_df)
filled = all(~pd.isna(country_df))
alg_row = country_df.loc[country_df['name'] == "Algeria"]
num_cols = len(country_df.columns)
username = "Jeff"
cur_date = datetime.datetime.now()
i = 0
j = None
z = 314
country_df
name pop
0 Albania 2.8
1 Algeria 44.2
2 Angola 34.5

Algorithmic Thinking

  • What are the inputs?
  • What are the outputs?
  • Standard cases vs. edge cases
  • Adversarial development: brainstorm all of the ways an evil hacker might break your code!

Example: Finding An Item Within A List

  • Seems straightforward, right? Given a list l, and a value v, return the index of l which contains v
  • Corner cases galore…
  • What if l contains v more than once? What if it doesn’t contain v at all? What if l is None? What if v is None? What if l isn’t a list at all? What if v is itself a list?

Part 2: Python Specifically

#1 Sanity-Preserving Tip!

  • (For our purposes) the answer to “what is Python?” is: an executable file that runs .py files!
    • e.g., we can run python mycode.py in Terminal/PowerShell
  • Everything else: pip, Jupyter, Pandas, etc., is an add-on to this basic functionality!

Code Blocks via Indentation

Code
for i in range(5):
    print(i)
0
1
2
3
4
Code
for i in range(5):
print(i)
IndentationError: expected an indented block after 'for' statement on line 1 (3695896917.py, line 2)

Type Hints

  • Not a “standard” Python feature, not enforced by the Python interpreter, but can help you maintain sanity!
Code
def multiply(thing1, thing2):
  return thing1 * thing2
print(multiply(5, 3))
print(multiply("fiveee", 3))
15
fiveeefiveeefiveee
Code
from numbers import Number
def multiply(thing1: Number, thing2: Number) -> Number:
  return thing1 * thing2
print(multiply(5, 3))
print(multiply("fiveee", 3))
15
fiveeefiveeefiveee
Code
from mypy import api
result = api.run(['-c',_i])
print(result[0])
<string>:3: error: Unsupported left operand type for * ("Number")  [operator]
<string>:4: error: Argument 1 to "multiply" has incompatible type "int"; expected "Number"  [arg-type]
<string>:4: note: Types from "numbers" aren't supported for static type checking
<string>:4: note: See https://peps.python.org/pep-0484/#the-numeric-tower
<string>:4: note: Consider using a protocol instead, such as typing.SupportsFloat
<string>:4: error: Argument 2 to "multiply" has incompatible type "int"; expected "Number"  [arg-type]
<string>:5: error: Argument 1 to "multiply" has incompatible type "str"; expected "Number"  [arg-type]
<string>:5: note: Types from "numbers" aren't supported for static type checking
<string>:5: note: See https://peps.python.org/pep-0484/#the-numeric-tower
<string>:5: note: Consider using a protocol instead, such as typing.SupportsFloat
<string>:5: error: Argument 2 to "multiply" has incompatible type "int"; expected "Number"  [arg-type]
Found 5 errors in 1 file (checked 1 source file)

Unit Testing

  • For your life: test-driven development
    • If you’re coding a duck, you should test that it looks like a duck, quacks like a duck, etc.
  • For this class:
    • Public tests: Fully visible, see result plus full code
    • Hidden tests: See result + description of test, but no code
    • Secret tests: We run these after you submit, as a major portion of the total grade

Data Structures: Motivation

Why Does The NYC Subway Have Express Lines?

Why Stop At Two Levels?

From Skip List Data Structure Explained, Sumit’s Diary blog

How TF Does Google Maps Work?

  • A (mostly) full-on answer: soon to come! Data structures for spatial data
  • A step in that direction: Quadtrees! (Fractal DC)

Jim Kang’s Quadtree Visualizations

Algorithmic Complexity: Motivation

The Secretly Exciting World of Matrix Multiplication

  • Fun Fact 1: Most of modern Machine Learning is, at the processor level, just a bunch of matrix operations
  • Fun Fact 2: The way we’ve all learned how to multiply matrices requires \(O(N^3)\) operations, for two \(N \times N\) matrices \(A\) and \(B\)
  • Fun Fact 3: \(\underbrace{x^2 - y^2}_{\mathclap{\times\text{ twice, }\pm\text{ once}}} = \underbrace{(x+y)(x-y)}_{\times\text{once, }\pm\text{ twice}}\)
  • Fun Fact 4: These are not very fun facts at all

Why Is Jeff Rambling About Matrix Math From 300 Years Ago?

  • The way we all learned it in school (for \(N = 2\)):

\[ AB = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix} \]

  • 12 operations: 8 multiplications, 4 additions \(\implies O(N^3) = O(2^3) = O(8)\)
  • Are we trapped? Like… what is there to do besides performing these \(N^3\) operations, if we want to multiply two \(N \times N\) matrices? Why are we about to move onto yet another slide about this?

Block-Partitioning Matrices

  • Now let’s consider big matrices, whose dimensions are a power of 2 (for ease of illustration): \(A\) and \(B\) are now \(N \times N = 2^n \times 2^n\) matrices
  • We can “decompose” the matrix product \(AB\) as:

\[ AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix} \]

  • Which gives us a recurrence relation representing the total number of computations required for this big-matrix multiplication: \(T(N) = \underbrace{8T(N/2)}_{\text{Multiplications}} + \underbrace{\Theta(1)}_{\text{Additions}}\)
  • It turns out that (using a method we’ll learn in Week 3), given this recurrence relation and our base case from the previous slide, this divide-and-conquer approach via block-partitioning doesn’t help us: we still get \(T(n) = O(n^3)\)
  • So why is Jeff still torturing us with this example?

Time For Some 🪄🔥MATRIX MAGIC!🔥🪄

  • If we define

\[ \begin{align*} m_1 &= (a_{11}+a_{22})(b_{11}+b_{22}) \\ m_2 &= (a_{21}+a_{22})b_{11} \\ m_3 &= a_{11}(b_{12}-b_{22}) \\ m_4 &= a_{22}(b_{21}-b_{11}) \\ m_5 &= (a_{11}+a_{12})b_{22} \\ m_6 &= (a_{21}-a_{11})(b_{11}+b_{12}) \\ m_7 &= (a_{12}-a_{22})(b_{21}+b_{22}) \end{align*} \]

  • Then we can combine these seven scalar products to obtain our matrix product:

\[ AB = \begin{bmatrix} m_1 + m_4 - m_5 + m_7 & m_3 + m_5 \\ m_2 + m_4 & m_1 - m_2 + m_3 + m_6 \end{bmatrix} \]

  • Total operations: 7 multiplications, 18 additions

Block-Partitioned Matrix Magic

  • Using the previous slide as our base case and applying this same method to the block-paritioned big matrices, we get the same result, but where the four entries in \(AB\) here are now matrices rather than scalars:

\[ AB = \begin{bmatrix} M_1 + M_4 - M_5 + M_7 & M_3 + M_5 \\ M_2 + M_4 & M_1 - M_2 + M_3 + M_6 \end{bmatrix} \]

  • We now have a different recurrence relation: \(T(N) = \underbrace{7T(N/2)}_{\text{Multiplications}} + \underbrace{\Theta(N^2)}_{\text{Additions}}\)
  • And it turns out, somewhat miraculously, that the additional time required for the increased number of additions is significantly less than the time savings we obtain by doing 7 instead of 8 multiplications, since this method now runs in \(T(N) = O(N^{\log_2(7)}) \approx O(N^{2.807}) < O(N^3)\) 🤯

Lab: Assignment Workflow