Week 9: BST Iteration, Functional Programming, and the RSA Algorithm

DSAN 5500: Data Structures, Objects, and Algorithms in Python

Jeff Jacobs

jj1088@georgetown.edu

Thursday, March 12, 2026

Schedule

Today’s Planned Schedule:

         Start    End      Topic
Lecture  6:30pm   6:50pm   Key Concepts →
         6:50pm   7:00pm   Execution Graphs →
         7:00pm   7:15pm   Deployments (Preview) →
         7:15pm   8:00pm   Lab Part 1 →
Break!   8:00pm   8:10pm
         8:10pm   9:00pm   Lab Part 2 →

Back to BSTs

  • For HW2, we provide you with an InventoryItem class
  • Two instance variables: item_name and price
  • Equivalence relations:
    • __eq__(other), __ne__(other)
  • Ordering relations:
    • __lt__(other), __le__(other), __gt__(other), __ge__(other)
  • Bonus: __repr__() and __str__()
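A minimal sketch of what such a class could look like (not the HW2-provided implementation, which may differ). Consistent with the traversal output shown later in these slides, items here are ordered alphabetically by item_name; functools.total_ordering derives the remaining ordering relations from __eq__ and __lt__:

```python
from functools import total_ordering

# Hypothetical sketch of an InventoryItem-style class; the real HW2 class
# may differ. Items are compared by item_name (alphabetical order).
@total_ordering
class InventoryItem:
    def __init__(self, item_name, price):
        self.item_name = item_name
        self.price = price

    def __eq__(self, other):
        return self.item_name == other.item_name

    def __lt__(self, other):
        return self.item_name < other.item_name

    def __repr__(self):
        return f"InventoryItem[item_name={self.item_name},price={self.price}]"

mango = InventoryItem('Mango', 50)
artichoke = InventoryItem('Artichoke', 55)
print(artichoke < mango)  # → True ('Artichoke' precedes 'Mango' alphabetically)
```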

Node Traversal: Computational Tree-Climbing

LLs \(\rightarrow\) BSTs: The Hard Part

  • When we were working with LinkedLists, we could access all items by just “looping through”, from one element to the next, printing as we go along.
  • But… for a BinarySearchTree, since our structure can now branch as we traverse it… How do we “loop through” a BST?
  • Two fundamentally different ways to traverse every node in our BST
  • “Opposites” of each other, so that one is often extremely efficient and the other extremely inefficient for a given task
  • Your job as a data scientist is to think carefully about which one is more efficient for a given goal!

Two Ways to Traverse: IRL Version

  • Imagine we’re trying to learn about a topic \(\tau\) using Wikipedia, so we find its article \(\tau_0\)
  • There are two “extremes” in terms of strategies we could follow for learning, given the contents of the article as well as the links it contains to other articles

Depth-First Search (DFS)

  • Open \(\tau_0\) and start reading it; When we encounter a link we always click it and immediately start reading the new article.
  • If we hit an article with no links (or a dead end/broken link), we finish reading it and click the back button, picking up where we left off in the previous article. When we reach the end of \(\tau_0\), we’re done!

Breadth-First Search (BFS)

  • Bookmark \(\tau_0\) in a folder called “Level 0 Articles”; open and start reading it
  • When we encounter a link, we put it in a “Level 1 Articles” folder, but continue reading \(\tau_0\) until we reach the end.
  • We then open all “Level 1 Articles” in new tabs, placing links we encounter in these articles into a “Level 2 Articles” folder, which we only start reading once all “Level 1 Articles” are read
  • We continue like this, reading “Level 3 Articles” once we’re done with “Level 2 Articles”, “Level 4 Articles” once we’re done with “Level 3 Articles”, and so on. (Can you see a sense in which this is the “opposite” of DFS?)

Two Ways to Traverse: Picture Version

Code
from hw2 import LinkedList, InventoryItem, BinarySearchTree, NodeProcessor, IterAlgorithm
from IPython.display import HTML  # renders the drawing inline; visualize() is a course-provided helper
bst = BinarySearchTree()
item1 = InventoryItem('Mango', 50)
bst.add(item1)
item2 = InventoryItem('Pickle', 60)
bst.add(item2)
item3 = InventoryItem('Artichoke', 55)
bst.add(item3)
item5 = InventoryItem('Banana', 123)
bst.add(item5)
item6 = InventoryItem('Aardvark', 11)
bst.add(item6)
HTML(visualize(bst))

Code
print("DFS:")
dfs_processor = NodeProcessor(IterAlgorithm.DEPTH_FIRST)
#print(type(dfs_processor.node_container))
dfs_processor.iterate_over(bst)

print("\nBFS:")
bfs_processor = NodeProcessor(IterAlgorithm.BREADTH_FIRST)
#print(type(bfs_processor.node_container))
bfs_processor.iterate_over(bst)
DFS:
InventoryItem[item_name=Mango,price=50]
InventoryItem[item_name=Pickle,price=60]
InventoryItem[item_name=Artichoke,price=55]
InventoryItem[item_name=Banana,price=123]
InventoryItem[item_name=Aardvark,price=11]

BFS:
InventoryItem[item_name=Mango,price=50]
InventoryItem[item_name=Artichoke,price=55]
InventoryItem[item_name=Pickle,price=60]
InventoryItem[item_name=Aardvark,price=11]
InventoryItem[item_name=Banana,price=123]

Two Ways to Traverse: In-Words Version

  1. Depth-First Search (DFS): With this approach, we iterate through the BST by always taking the left child as the “next” node until we hit a leaf node (one with no left or right child, so we cannot follow a left-child pointer any further). Only at that point do we back up and take the right children we skipped.
  2. Breadth-First Search (BFS): This is the “opposite” of DFS in the sense that we traverse the tree level-by-level, never moving to the next level of the tree until we’re sure that we have visited every node on the current level.

Two Ways to Traverse: Animated Version

Depth-First Search (from Wikimedia Commons)

Breadth-First Search (from Wikimedia Commons)

Two Ways to Traverse: Underlying Data Structures Version

  • Now that you have some intuition, you may be thinking that they might require very different code to implement 🤔

  • This is where the formal mathematical linkage between the two becomes helpful!

  • It turns out (and a full-on algorithmic theory course would have you prove) that

  • Depth-First Search can be accomplished by processing nodes in an order determined by adding each to a stack, while

  • Breadth-First Search can be accomplished by processing nodes in an order determined by adding each to a queue!

  • \(\implies\) The code is literally identical: you can “pull out” the word stack and replace it with the word queue (or vice-versa).

  • With Software Engineer Hat on, you’ll see this as a job for an abstraction layer!

Two Ways to Traverse: HW2 Version

  • You’ll make a class called NodeProcessor, with a single iterate_over(tree) function
  • This function—without any changes in the code or even any if statements!—will be capable of both DFS and BFS
  • It will take in a ThingContainer (could be a stack or a queue, you won’t know which), which has two functions:
    • put_new_thing_in(new_thing)
    • take_existing_thing_out()
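One way the container abstraction could be realized (a sketch, not the HW2 solution: put_new_thing_in and take_existing_thing_out are the names from the spec, while Node, Stack, Queue, and iterate_over are illustrative stand-ins):

```python
from collections import deque

# Sketch of the ThingContainer abstraction: not the HW2 solution.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

class Stack:
    def __init__(self):
        self._items = []
    def put_new_thing_in(self, new_thing):
        self._items.append(new_thing)
    def take_existing_thing_out(self):
        return self._items.pop()       # LIFO -> depth-first
    def __len__(self):
        return len(self._items)

class Queue:
    def __init__(self):
        self._items = deque()
    def put_new_thing_in(self, new_thing):
        self._items.append(new_thing)
    def take_existing_thing_out(self):
        return self._items.popleft()   # FIFO -> breadth-first
    def __len__(self):
        return len(self._items)

def iterate_over(root, container):
    """Identical code for DFS and BFS: only the container differs."""
    visited = []
    container.put_new_thing_in(root)
    while len(container) > 0:
        node = container.take_existing_thing_out()
        visited.append(node.value)
        for child in (node.left, node.right):
            if child is not None:
                container.put_new_thing_in(child)
    return visited

# The inventory tree from the earlier slides (ordered alphabetically):
root = Node('Mango',
            left=Node('Artichoke', left=Node('Aardvark'), right=Node('Banana')),
            right=Node('Pickle'))
print(iterate_over(root, Stack()))  # → ['Mango', 'Pickle', 'Artichoke', 'Banana', 'Aardvark']
print(iterate_over(root, Queue()))  # → ['Mango', 'Artichoke', 'Pickle', 'Aardvark', 'Banana']
```

Note that these match the DFS and BFS outputs shown earlier, even though iterate_over never checks which container it was handed.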

Three Animals in the DFS Species

DFS Procedure           Algorithm
Pre-Order Traversal     1. Print node → 2. Traverse left subtree → 3. Traverse right subtree
In-Order Traversal 🧐‼️  1. Traverse left subtree → 2. Print node → 3. Traverse right subtree
Post-Order Traversal    1. Traverse left subtree → 2. Traverse right subtree → 3. Print node
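The three orders can be written as short recursive functions (a sketch, with a hypothetical Node class standing in for the BST node, and "print" replaced by collecting values into a list):

```python
# Minimal sketch of the three DFS variants; Node is a hypothetical
# stand-in for the BST node class.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def pre_order(node):
    if node is None:
        return []
    return [node.value] + pre_order(node.left) + pre_order(node.right)

def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)

def post_order(node):
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.value]

# The inventory tree from the earlier slides (ordered alphabetically):
root = Node('Mango',
            left=Node('Artichoke', left=Node('Aardvark'), right=Node('Banana')),
            right=Node('Pickle'))
print(in_order(root))
# → ['Aardvark', 'Artichoke', 'Banana', 'Mango', 'Pickle']
```

The in-order traversal visits a BST's nodes in sorted order, which is why it earns the 🧐‼️ in the table.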

The Three Animals Traverse our Inventory Tree

Code
visualize(bst)

Final Notes for HW3

  • The last part challenges you to ask: why stop at a hash based on just the first letter of the key?
  • We could just as easily use the first two letters:
  • h('AA') = 0, h('AB') = 1, …, h('AZ') = 25,
  • h('BA') = 26, h('BB') = 27, …, h('BZ') = 51,
  • h('CA') = 52, …, h('ZZ') = 675.
  • You will see how this gets us even closer to the elusive \(O(1)\)! And we could get even closer with three letters, four letters, … 🤔🤔🤔
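The two-letter hash above could be written as follows (a sketch; it assumes uppercase A–Z keys of length at least two):

```python
def two_letter_hash(key):
    """Hash on the first two (uppercase A-Z) letters of key:
    h('AA') = 0, ..., h('AZ') = 25, h('BA') = 26, ..., h('ZZ') = 675."""
    first = ord(key[0]) - ord('A')
    second = ord(key[1]) - ord('A')
    return 26 * first + second

print(two_letter_hash('AA'))  # → 0
print(two_letter_hash('BA'))  # → 26
print(two_letter_hash('ZZ'))  # → 675
```

With 676 buckets instead of 26, collisions become rarer, which is what pushes lookups closer to \(O(1)\).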

Functional Programming (FP)

Functions vs. Functionals

  • You may have noticed: map() and reduce() are “meta-functions”: functions that take other functions as inputs
def add_5(num):
  return num + 5
add_5(10)
15
def apply_twice(fn, arg):
  return fn(fn(arg))
apply_twice(add_5, 10)
20
  • In Python, functions can be used as vars (Hence lambda):
add_5 = lambda num: num + 5
apply_twice(add_5, 10)
20
  • This relates to a whole paradigm, “functional programming”! We will use it to approach encryption
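Since map() and reduce() are the motivating meta-functions here, a quick sketch of both, reusing add_5 from above:

```python
from functools import reduce

add_5 = lambda num: num + 5

# map() applies a function to each element of a sequence;
# reduce() folds a two-argument function across a sequence to a single value.
nums = [1, 2, 3]
print(list(map(add_5, nums)))            # → [6, 7, 8]
print(reduce(lambda a, b: a + b, nums))  # → 6
```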

Train Your Brain for Functional Approach \(\implies\) Master Debugging!

When a program doesn’t work, each function is an interface point where you can check that the data are correct. You can look at the intermediate inputs and outputs to quickly isolate the function that’s responsible for a bug.
(from Python’s “Functional Programming HOWTO”)

Code \(\rightarrow\) Pipelines \(\rightarrow\) Debuggable Pipelines

  • Scenario: Run code, check the output, and… it’s wrong 😵 what do you do?
  • Usual approach: Read lines one-by-one, figuring out what they do, seeing if something pops out that seems wrong; adding comments like # Convert to lowercase
  • Easy case: found typo in punctuation removal code. Fix the error, add comment like # Remove punctuation

    Rule 1 of FP: transform these comments into function names

  • Hard case: Something in load_text() modifies a variable that later on breaks remove_punct() (called a side effect)

    Rule 2 of FP: NO SIDE-EFFECTS!

[Pipeline diagram] in.txt → load_text (Verb) → 🧐 ✅ → lowercase (Verb) → 🧐 ✅ → remove_punct (Verb) → 🧐 ❌❗️ → remove_stopwords (Verb) → out.txt

(Does this way of diagramming a program look familiar?)

  • With side effects: ❌ \(\implies\) issue is somewhere earlier in the chain 😩🏃‍♂️
  • No side effects: ❌ \(\implies\) issue must be in remove_punct()!!! 😎 ⏱️ = 💰
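A minimal sketch of the pipeline in pure-function style (the load_text stage is omitted since it does file I/O, and the stopword list is purely illustrative): each comment-turned-function takes a string in and returns a new string, with no side effects.

```python
import string

# Each stage is a pure function: same input -> same output, nothing mutated.
def lowercase(text):
    return text.lower()

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_stopwords(text, stopwords=('the', 'a', 'an')):
    return ' '.join(w for w in text.split() if w not in stopwords)

# Because each stage is pure, every intermediate result is inspectable:
raw = "The Quick, Brown Fox!"
step1 = lowercase(raw)           # 'the quick, brown fox!'
step2 = remove_punct(step1)      # 'the quick brown fox'
step3 = remove_stopwords(step2)  # 'quick brown fox'
print(step3)
```

If step2 looks wrong, the bug must be in remove_punct(); no earlier stage could have reached in and changed anything.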

If It’s So Useful, Why Doesn’t Everyone Do It?

  • Trapped in imperative (sequential) coding mindset: Path dependency / QWERTY
  • The reason we need to start thinking like this now: it’s ~1000x harder to debug parallel code! So we need to be less ad hoc in how we write+debug, from here on out! 🙇‍♂️🙏

From Leskovec, Rajaraman, and Ullman (2014)

Cryptographic Algorithms

  • RSA
  • Elliptic-Curve Cryptography

The RSA Algorithm

The RSA public-key cryptosystem relies on the dramatic difference between the ease of finding large prime numbers and the difficulty of factoring the product of two large prime numbers (Cormen et al. 2001, sec. 31.7)

Bob encrypts \(M\) using Alice’s public key \(P_A\) and transmits the result \(C = P_A(M)\) over a communication channel to Alice. An eavesdropper who captures \(C\) gains no info about \(M\). Alice decrypts \(C\) using her secret key to obtain the original message: \(M = S_A(C)\)

Task 1: Encrypting a Message

  • Public Keys \(P_A\) and Secret Keys \(S_A\) are inverses:

\[ \begin{aligned} M &= S_A(P_A(M)) \\ M &= P_A(S_A(M)) \end{aligned} \]

  • Alice, and only Alice, is able to compute the function \(S_A(\cdot)\) in any practical amount of time!
  • Bob \(\overset{M}{\rightarrow}\) Alice: Bob computes \(C = P_A(M)\) and sends \(C\) to Alice; Alice obtains \(M\) via \(S_A(C) = S_A(P_A(M)) = M\)

Task 2: Signing a Message!

  • Alice computes her digital signature \(\sigma = S_A(M')\), then sends the pair \((M', \sigma)\) \(\rightarrow\) Bob
  • When Bob receives \((M', \sigma)\), he verifies that it originated from Alice by using Alice’s public key to check \(M' = P_A(\sigma)\)

So How Do \(P_A\) and \(S_A\) Work?

  • Pick two big prime numbers \(p\) and \(q\), compute \(n = pq\)
  • Pick a small odd int \(e\) relatively prime to \((p-1)(q-1)\)
  • Find \(d\) such that \(de \equiv 1\text{ mod }{(p-1)(q-1)}\)
  • Your public key is \(P = (e,n)\)
  • Your secret key is \(S = (d,n)\)
  • Encrypt: \(P(M) = M^e \text{ mod }{n}\); Decrypt: \(S(C) = C^d \text{ mod }{n}\)
  • Works because \(P(S(M)) = S(P(M)) = M^{ed}\text{ mod }{n} \overset{\small{*}}{=} M\)
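The recipe above, at toy scale (a sketch using the classic small-prime example p = 61, q = 53; real RSA uses primes hundreds of digits long, which is exactly what makes factoring n infeasible):

```python
# Toy RSA: tiny primes for illustration only; never use sizes like this!
p, q = 61, 53
n = p * q                 # 3233
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # small odd int, relatively prime to phi
d = pow(e, -1, phi)       # d*e ≡ 1 (mod (p-1)(q-1)); Python 3.8+ modular inverse

def encrypt(M):           # P(M) = M^e mod n
    return pow(M, e, n)

def decrypt(C):           # S(C) = C^d mod n
    return pow(C, d, n)

M = 65
C = encrypt(M)
print(C, decrypt(C))      # → 2790 65

# Signing is the same pair of operations in the other order:
sigma = decrypt(M)             # σ = S(M')
print(encrypt(sigma) == M)     # Bob checks M' = P(σ)  → True
```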

Why Algorithmic Complexity Matters Here!

The security of the RSA cryptosystem rests in large part on the difficulty of factoring large integers.

If an adversary can factor the modulus \(n\) in a public key, they can derive the secret key from the public key, using the knowledge of the factors \(p\) and \(q\) in the same way that the creator of the public key used them.

Therefore, if factoring large integers is easy, then breaking the RSA cryptosystem is easy.

The converse statement, that if factoring large integers is hard, then breaking RSA is hard, is unproven (\(P \overset{\smash{\small{?}}}{\small{=}} NP\))

References

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001. Introduction To Algorithms. MIT Press. https://books.google.com?id=NLngYyWFl_YC.
Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://infolab.stanford.edu/~ullman/mmds/book.pdf.