Week 8: Data Validation, Data Processing Pipelines

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Jeff Jacobs

jj1088@georgetown.edu

Monday, March 11, 2024

Micro \(\rightarrow\) Meso \(\rightarrow\) Macro

  • ✅ Micro-level: Individual, “core” algorithms/data structures and their big-\(O\) runtimes
    • e.g., LinkedList, Merge-Sort
  • ✅ Meso-level: Algorithms/data structures that “piece together” a small collection of core algorithms
    • Hash tables: Fixed-Length Array + Hashing Algorithm + Collision Handling Algorithm (BST); see the sketch after this list
  • 🤔 Macro-level: Pipelines of data structures and algorithms:
    • raw data source \(\rightarrow\) processing \(\rightarrow\) storage
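
A minimal sketch of that meso-level composition (hypothetical class names, not the HW2 implementation): a fixed-length array of slots, a hashing step to pick a slot, and a small BST in each slot to handle collisions.

Code
class Node:
    """One key-value pair stored inside a bucket's BST."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class BSTBucket:
    """Collision handler: keys that hash to the same slot live in this BST."""
    def __init__(self):
        self.root = None
    def insert(self, key, value):
        if self.root is None:
            self.root = Node(key, value)
            return
        cur = self.root
        while True:
            if key == cur.key:
                cur.value = value  # key already present: overwrite
                return
            side = "left" if key < cur.key else "right"
            child = getattr(cur, side)
            if child is None:
                setattr(cur, side, Node(key, value))
                return
            cur = child
    def get(self, key):
        cur = self.root
        while cur is not None:
            if key == cur.key:
                return cur.value
            cur = cur.left if key < cur.key else cur.right
        raise KeyError(key)

class SimpleHashTable:
    """Fixed-length array + hashing algorithm + BST collision handling."""
    def __init__(self, num_slots=16):
        self.slots = [BSTBucket() for _ in range(num_slots)]
    def __setitem__(self, key, value):
        self.slots[hash(key) % len(self.slots)].insert(key, value)
    def __getitem__(self, key):
        return self.slots[hash(key) % len(self.slots)].get(key)

table = SimpleHashTable()
table["Banana"] = 10.0
table["Banana"]
10.0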

Most Common HW1-to-Midterm Struggles

  • Ingesting data (HW1: Swimmer class)
    • \(\implies\) Data-processing pipelines!
  • Maintaining invariants as the dataset grows (HW2: NoneInventoryItemBinarySearchTree)
    • \(\implies\) Data validation!

Data Validation

  • What you already know: Type hints (see the sketch after this list)
  • Type hints on steroids: Pydantic 😎
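
For contrast, a minimal illustration of why plain type hints aren't enough on their own: they document intent, but nothing checks them at runtime.

Code
# Plain type hints are *not* enforced when the code runs
class PlainInventoryItem:
    def __init__(self, item_name: str, price: float):
        self.item_name = item_name
        self.price = price

# No error here, even though price is clearly not a float!
oops = PlainInventoryItem(item_name="Banana", price="100 dollar")
type(oops.price)
str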

Pydantic

Pydantic in Action

Code
from pydantic import BaseModel
class InventoryItem(BaseModel):
    item_name: str
    price: float
# Note: the int 10 is coerced into the declared float type
my_item = InventoryItem(item_name="Banana", price=10)
my_item
InventoryItem(item_name='Banana', price=10.0)
Code
# A string that can't be parsed as a number fails validation
invalid_item = InventoryItem(item_name="Banana", price="100 dollar")
ValidationError: 1 validation error for InventoryItem
price
  Input should be a valid number, unable to parse string as a number [type=float_parsing, input_value='100 dollar', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/float_parsing
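
In a pipeline you usually want to catch these errors instead of crashing on the first bad row; a minimal sketch reusing the InventoryItem model above:

Code
from pydantic import ValidationError

raw_rows = [
    {"item_name": "Banana", "price": 10},
    {"item_name": "Apple", "price": "100 dollar"},  # will fail validation
]
valid_items = []
for row in raw_rows:
    try:
        valid_items.append(InventoryItem(**row))
    except ValidationError as err:
        print(f"Skipping invalid row {row}: {err.errors()[0]['msg']}")
valid_items
Skipping invalid row {'item_name': 'Apple', 'price': '100 dollar'}: Input should be a valid number, unable to parse string as a number
[InventoryItem(item_name='Banana', price=10.0)]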

Pydantic in Even-Cooler Action

Code
# EmailStr requires the optional email-validator dependency: pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr, PositiveInt
class Employee(BaseModel):
    name: str
    email: EmailStr
    age: PositiveInt
# EmailStr rejects strings without exactly one @-sign
invalid_employee = Employee(
  name="Jeef",
  email="fakeemail!!!",
  age=50
)
ValidationError: 1 validation error for Employee
email
  value is not a valid email address: The email address is not valid. It must have exactly one @-sign. [type=value_error, input_value='fakeemail!!!', input_type=str]
Code
# PositiveInt rejects the negative age
invalid_employee2 = Employee(
  name="Jeeferson",
  email="valid@email.com",
  age=-3
)
ValidationError: 1 validation error for Employee
age
  Input should be greater than 0 [type=greater_than, input_value=-3, input_type=int]
    For further information visit https://errors.pydantic.dev/2.6/v/greater_than
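
Since pipeline data often arrives as plain dicts (e.g., rows read from a CSV), Pydantic's model_validate is handy; a small sketch reusing the Employee model above:

Code
row = {"name": "Jeeferson", "email": "valid@email.com", "age": 50}
valid_employee = Employee.model_validate(row)
valid_employee
Employee(name='Jeeferson', email='valid@email.com', age=50)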

Data-Processing Pipelines

Two Generations of Python Pipeline Libraries

  • Gen 1: Airflow (Airbnb), Luigi (Spotify)
  • Gen 2: Dagster, Prefect

For Now: Prefect!

  • But the principles (ETL) remain the same!
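
A minimal sketch of an ETL flow using Prefect's @task and @flow decorators (the task bodies here are hypothetical placeholders):

Code
from prefect import flow, task

@task
def extract() -> list[dict]:
    # E: pull raw rows from a file, API, etc.
    return [{"item_name": "Banana", "price": "10"}]

@task
def transform(rows: list[dict]) -> list[dict]:
    # T: clean/coerce fields before loading
    return [{**row, "price": float(row["price"])} for row in rows]

@task
def load(rows: list[dict]) -> None:
    # L: write into a database or data structure
    print(f"Loaded {len(rows)} row(s)")

@flow
def etl_pipeline():
    load(transform(extract()))

if __name__ == "__main__":
    etl_pipeline()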

ETL = Extract, Transform, Load

[ETL pipeline diagram, from Analytics Vidhya]

You Have Already Been Doing This!

  • HW1: Swim team example (sketched in code after this list)
    • Extract: Load data from .csv
    • Transform: Convert times to milliseconds
    • Load: Store into Team object
  • HW2 (Throughout)
    • Extract: Load products from .csv into Pandas
    • Transform: (Construct InventoryItem object for each row)
    • Load: Store into LogarithmicHashTable
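
For example, the HW1 swim-team flow, sketched with hypothetical column names and simplified classes (not the official solution):

Code
import csv
from dataclasses import dataclass, field

@dataclass
class Swimmer:
    name: str
    time_ms: int

@dataclass
class Team:
    swimmers: list[Swimmer] = field(default_factory=list)

def extract(csv_path: str) -> list[dict]:
    """E: read raw rows from the .csv file."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[Swimmer]:
    """T: convert 'MM:SS.ss'-style times into milliseconds."""
    swimmers = []
    for row in rows:
        minutes, seconds = row["time"].split(":")
        time_ms = round((int(minutes) * 60 + float(seconds)) * 1000)
        swimmers.append(Swimmer(name=row["name"], time_ms=time_ms))
    return swimmers

def load(swimmers: list[Swimmer]) -> Team:
    """L: store the cleaned records in a Team object."""
    return Team(swimmers=swimmers)

# team = load(transform(extract("swimmers.csv")))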

(And You Have Already Run into the Need for Validation!)

  • HW1: Are the times in seconds or milliseconds?
  • HW2: Are users providing key and value separately, or together?
    • If together: as tuples or InventoryItems?
    • If InventoryItems: What type(s) for item_name? What type(s) for price? 😵
  • \(\implies\) One of the two main focuses for HW3 will be validation of input data!
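
A taste of what that validation might look like, sketched with Pydantic (hypothetical rules, not the actual HW3 spec):

Code
from pydantic import BaseModel, PositiveFloat, field_validator

class ValidatedInventoryItem(BaseModel):
    item_name: str
    price: PositiveFloat  # rejects non-numeric and non-positive prices

    @field_validator("item_name")
    @classmethod
    def name_must_not_be_blank(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("item_name cannot be blank")
        return v.strip()

# If users hand us a (key, value) tuple, unpack it explicitly rather than guessing
key, value = ("Banana", 10)
ValidatedInventoryItem(item_name=key, price=value)
ValidatedInventoryItem(item_name='Banana', price=10.0)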

Can’t We Just Skip to the Cool, Fun ML Part? (The T Step)

  • (Hopefully the examples on the previous slide have shown you the dangers of this!)
  • It’s like Anscombe’s Quartet from DSAN 5000/5100/5200: We can be misled by fancy tools if we don’t take time for the (sometimes boring) step of looking at the data
  • Example: ETL for processing text data (books for now)

BookshelfScan (An Example I Use IRL!)

  • Extract: Dropbox folder containing ebooks
  • Transform: Compute word counts
  • Load: Save word counts into a database
  • Enables me to figure out which books talk about a topic \(t\) (data structures, algorithms, data ethics and policy, etc.)!
  • BUT (the devil is in the details): the Extract step can look very different for .txt, .epub (can contain images), .pdf (sometimes the text is embedded, sometimes not), images of scans, and audiobooks
  • \(\implies\) (Sad but true) The pipeline will crash and burn if you don't handle these cases before applying cool NLP tools in the T step
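
One way to guard the E step, sketched with hypothetical per-format handlers (the real epub/pdf/OCR logic is much hairier):

Code
from pathlib import Path

def extract_txt(path: Path) -> str:
    return path.read_text(errors="ignore")

def extract_epub(path: Path) -> str:
    raise NotImplementedError("needs an epub parser (e.g., ebooklib)")

def extract_pdf(path: Path) -> str:
    raise NotImplementedError("needs embedded-text extraction or OCR")

HANDLERS = {".txt": extract_txt, ".epub": extract_epub, ".pdf": extract_pdf}

def extract_book(path: Path) -> str | None:
    """Dispatch on file type *before* the T step, instead of crashing inside it."""
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        print(f"Skipping unsupported format: {path.name}")
        return None
    try:
        return handler(path)
    except NotImplementedError as err:
        print(f"Skipping {path.name}: {err}")
        return None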

HW3 Possibility 1: Pet Store Inventory

HW3 Possibility 2: Books to Scrape

Lab(!): Data Validation

References