DSAN 6000: Big Data and Cloud Computing
Fall 2025
Monday, November 3, 2025
…HEY! WAKE UP! NEW DATA JUST CAME IN!

You before this week:
.ipynb and/or Spark cluster
You after this week:
EmailOperator sends update
From Astronomer Academy’s Airflow 101
(Underlined words link to Airflow docs “Core Concepts” section)
Directed Acyclic Graph (DAG): Your pipeline as a whole!
DAGs consist of multiple tasks, which you “string together” using the control flow operators >> and <<
Example: second_task, third_task can’t start until first_task completes:
Example: fourth_task can’t start until third_task completes:
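The `>>` / `<<` chaining can be modeled in plain Python — this is a toy sketch, not Airflow itself; the `Task` class here just records downstream edges the way Airflow's operators do when you chain them:

```python
class Task:
    """Toy stand-in for an Airflow task: records downstream edges."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # first_task >> second_task: "other" runs after self
        targets = other if isinstance(other, list) else [other]
        for t in targets:
            self.downstream.append(t)
        return other

    def __lshift__(self, other):
        # first_task << second_task: self runs after "other"
        sources = other if isinstance(other, list) else [other]
        for s in sources:
            s.downstream.append(self)
        return other

first_task, second_task, third_task, fourth_task = (
    Task(n) for n in ["first", "second", "third", "fourth"]
)
first_task >> [second_task, third_task]   # both wait on first_task
third_task >> fourth_task                 # fourth waits on third

print([t.task_id for t in first_task.downstream])  # ['second', 'third']
```

Returning `other` from `__rshift__` is what lets Airflow-style chains like `a >> b >> c` read left to right.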
What kinds of tasks can we create? Brings us to another concept…
Operators: What Kind of Task?
Operators: BashOperator and PythonOperator
Operators: HttpOperator, S3FileTransformOperator, SQLExecuteQueryOperator, EmailOperator, SlackAPIOperator
+ Jinja templating for managing how data “passes” from one step to the next:
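Airflow renders Jinja expressions like `{{ ds }}` (the logical date) inside templated operator arguments before a task runs. A toy stdlib renderer showing the idea — real Airflow uses the jinja2 library; this regex-based version is just an illustration:

```python
import re

def render(template, context):
    """Toy Jinja-style renderer: replace {{ var }} with context values."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(context[m.group(1)]),
        template,
    )

# An Airflow-style templated bash_command; 'ds' is the logical date
cmd = render("echo processing data for {{ ds }}", {"ds": "2025-11-03"})
print(cmd)  # echo processing data for 2025-11-03
```

This is how one `bash_command` string can behave differently on every scheduled run without any code changes.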
Task vs. Operator: A Sanity-Preserving Distinction
From the Operator docs:
When we talk about a Task, we mean the generic “unit of execution” of a DAG; when we talk about an Operator, we mean a [specific] pre-made Task template, whose logic is all done for you and that just needs some arguments.
Task:
Operator:
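A sketch of that distinction in plain Python — `MiniBashOperator` is a hypothetical mini-operator, not Airflow's real `BashOperator`: the template's logic (run a shell command) is written once, and each Task instance just supplies arguments:

```python
import subprocess

class MiniBashOperator:
    """Hypothetical operator: the 'run a shell command' logic is
    pre-made; instances only supply task_id and bash_command."""
    def __init__(self, task_id, bash_command):
        self.task_id = task_id
        self.bash_command = bash_command

    def execute(self):
        # The generic "unit of execution": run the command, return stdout
        result = subprocess.run(
            self.bash_command, shell=True,
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

# Two Tasks stamped out of one Operator template, differing only in arguments
hello = MiniBashOperator(task_id="hello", bash_command="echo hello")
shout = MiniBashOperator(task_id="shout", bash_command="echo HEY")

print(hello.execute())  # hello
```

One Operator class, many Tasks: that is the whole "sanity-preserving" point.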


start-airflow.sh
A sequence of terminal (bash) commands!
db migrate: The Airflow Metastore
start-airflow.sh
start-airflow.sh
Web UI for managing Airflow (much more on this later!)
Default Login Info
By default, db migrate generates an admin password in ~/simple_auth_manager_passwords.json.generated
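To grab that password programmatically rather than opening the file by hand — a sketch that assumes the file is a flat `{username: password}` JSON map (the exact layout may differ across Airflow versions), demonstrated on an inline sample instead of the real file:

```python
import json
from pathlib import Path

def read_generated_passwords(path):
    """Read the simple-auth-manager password file (assumed format:
    a flat JSON object mapping username -> generated password)."""
    return json.loads(Path(path).read_text())

# Demonstration with an inline sample instead of the real generated file
sample = '{"admin": "FAKE-GENERATED-PW"}'
passwords = json.loads(sample)
print(passwords["admin"])  # FAKE-GENERATED-PW
```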

Scheduler → Executor
start-airflow.sh
LocalExecutor
EcsExecutor (AWS ECS), KubernetesExecutor
SparkSubmitOperator
DAG Processor
start-airflow.sh
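Which executor the scheduler hands tasks to is set in Airflow's configuration — a minimal airflow.cfg fragment (LocalExecutor runs tasks as subprocesses on a single machine; the remote executors are swapped in the same way once their provider packages are installed):

```ini
[core]
# Run tasks as local subprocesses on a single machine
executor = LocalExecutor
# Alternatives (provider packages required), e.g.: KubernetesExecutor
```

The same setting can be overridden with the `AIRFLOW__CORE__EXECUTOR` environment variable, which is handy in containerized deployments.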
From Ruiter et al. (2026)
The schedule Argument
(Airflow uses pendulum under the hood, rather than datetime!)
dag_scheduling.py
Cron: Full-on scheduling language (used by computers since 1975!)
crontab.sh
Cron Presets: None, "@once", "@continuous", "@hourly", "@daily", "@weekly"
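Most presets are just shorthand for ordinary cron expressions (`None` means no schedule, and `"@once"` / `"@continuous"` are special-cased by Airflow with no cron equivalent). A quick reference as code:

```python
# Airflow schedule presets and the cron expressions they stand for
CRON_PRESETS = {
    "@hourly": "0 * * * *",   # minute 0 of every hour
    "@daily":  "0 0 * * *",   # midnight every day
    "@weekly": "0 0 * * 0",   # midnight every Sunday
}

def expand_preset(schedule):
    """Return the cron expression for a preset, or the input unchanged."""
    return CRON_PRESETS.get(schedule, schedule)

print(expand_preset("@daily"))      # 0 0 * * *
print(expand_preset("30 2 * * 1"))  # already cron: unchanged
```

Reading a cron field list left to right: minute, hour, day-of-month, month, day-of-week.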
| Service | Command |
|---|---|
| AWS | pip install 'apache-airflow[amazon]' |
| Azure | pip install 'apache-airflow[microsoft-azure]' |
| Databricks | pip install 'apache-airflow[databricks]' |
| GitHub | pip install 'apache-airflow[github]' |
| Google Cloud | pip install 'apache-airflow[google]' |
| MongoDB | pip install 'apache-airflow[mongo]' |
| OpenAI | pip install 'apache-airflow[openai]' |
| Slack | pip install 'apache-airflow[slack]' |
| Spark | pip install 'apache-airflow[apache-spark]' |
| Tableau | pip install 'apache-airflow[tableau]' |
(And many more:)
Jinja Example
Main challenge: converting “intuitive” pipelines in our heads:

Into DAGs with concrete Tasks, dependencies, and triggers:
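A minimal sketch of that translation in plain Python (not Airflow: the three hypothetical callables below are what extract >> transform >> load tasks would each do, executed here in dependency order):

```python
# Toy ETL pipeline: each step is what one Airflow task would do
def extract():
    # e.g., pull raw records from an API or S3 (hardcoded here)
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]

def transform(rows):
    # aggregate: total clicks across all users
    return sum(r["clicks"] for r in rows)

def load(total, store):
    # write the result to a destination (a dict standing in for a DB)
    store["total_clicks"] = total

# Dependencies: extract >> transform >> load
store = {}
load(transform(extract()), store)
print(store)  # {'total_clicks': 8}
```

In a real DAG each function becomes a task (e.g. via PythonOperator), and Airflow, not a direct function call, enforces the ordering and handles retries and scheduling.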

From Ruiter et al. (2026)
DSAN 6000 Week 10: ETL Pipelines with Airflow