Resource Hub

In-Class Demos

Lab .zips

Access the ZIP files here

Extra Writeups

Key Books By Week

Week 1: Course Overview and Week 2: Cloud Computing

  • The Boar Book: Kleppmann (2017), Designing Data-Intensive Applications

Week 3: Parallel Concepts

  • The Wolohan MapReduce Book: Wolohan (2020), Mastering Large Datasets with Python, Chapters 1-6

Week 4: DuckDB, Polars, File Formats

  • The Lynx Book: Janssens and Nieuwdorp (2025), Python Polars: The Definitive Guide
  • The I-Need-Ham, Hunger Book: Needham, Hunger, and Simons (2024), DuckDB in Action

Week 5: Data Warehousing

Two Packt Books:

  • General Data Engineering on AWS: Eagar (2021), Data Engineering with AWS
  • Data Engineering with Athena: Virtuoso et al. (2021), Serverless Analytics with Amazon Athena

Weeks 6 and 7: Hadoop → PySpark

  • The Wolohan MapReduce Book: Wolohan (2020), Mastering Large Datasets with Python, Chapters 7-10

Weeks 8 and 9: Spark in Depth

  • The Electric Eel Book: Damji et al. (2020), Learning Spark

PDFs of Books

The remainder of this page is auto-generated from the references cited across each week's slides. Click the name of a reference to download the ebook version, if available.

Barber, David. 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.
Damji, Jules S., Brooke Wenig, Tathagata Das, and Denny Lee. 2020. Learning Spark. O’Reilly Media, Inc.
Eagar, Gareth. 2021. Data Engineering with AWS: Learn How to Design and Build Cloud-Based Data Transformation Pipelines Using AWS. 1st ed. Birmingham: Packt Publishing Limited.
Firth, John Rupert. 1957. Papers in Linguistics, 1934-1951. Oxford University Press.
Gopalan, Rukmani. 2022. The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture. O’Reilly Media, Inc.
Harenslak, Bas P., and Julian de Ruiter. 2021. Data Pipelines with Apache Airflow. Simon and Schuster.
“Introduction to Unstructured Data - Zilliz Learn.” n.d. Accessed November 10, 2025.
Janssens, Jeroen, and Thijs Nieuwdorp. 2025. Python Polars: The Definitive Guide: Transforming, Analyzing, and Visualizing Data with a Fast and Expressive DataFrame API. O’Reilly Media, Inc.
Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, Inc.
Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.
Loukides, Mike. 2010. What Is Data Science? O’Reilly Media.
Mell, Peter, and Timothy Grance. 2011. “The NIST Definition of Cloud Computing.” National Institute of Standards and Technology, Special Publication 800 (2011): 145.
Needham, Mark, Michael Hunger, and Michael Simons. 2024. DuckDB in Action. Simon and Schuster.
Raasveldt, Mark, and Hannes Mühleisen. 2019. “DuckDB: An Embeddable Analytical Database.” In Proceedings of the 2019 International Conference on Management of Data, 1981–84. SIGMOD ’19. New York, NY, USA: Association for Computing Machinery.
Raff, Edward, Drew Farris, and Stella Biderman. n.d. “How Large Language Models Work.”
Reis, Joe, and Matt Housley. 2022. Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O’Reilly Media, Inc.
Ruiter, Julian de, Ismael Cabral, Kris Geusebroek, Daniel van der Ende, and Bas Harenslak. 2026. Data Pipelines with Apache Airflow, Second Edition. Simon and Schuster.
Saussure, Ferdinand de. 1916. Course in General Linguistics. Open Court.
Topol, Matthew, and Wes McKinney. 2024. In-Memory Analytics with Apache Arrow. Packt Publishing Ltd.
Virtuoso, Anthony, Mert Turkay Hocanin, Aaron Wishnick, and Rahul Pathak. 2021. Serverless Analytics with Amazon Athena: Query Structured, Unstructured, or Semi-Structured Data in Seconds Without Setting up Any Infrastructure. Packt Publishing Ltd.
White, Tom E. 2015. Hadoop: The Definitive Guide. O’Reilly Media, Inc.
Wolohan, John. 2020. Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code. Simon and Schuster.