Week 4: DuckDB, Polars, File Formats

DSAN 6000: Big Data and Cloud Computing
Fall 2025

Amit Arora and Jeff Jacobs

jj1088@georgetown.edu

Monday, September 15, 2025

Agenda and Goals for Today

Lecture

  • Distributed file systems
  • Modern file types
  • Working with large tabular data on a single node
    • DuckDB
    • Polars

Lab

  • Run a similar task with Pandas, Polars, and DuckDB

Logistics and Review

Deadlines

  • Assignment 1: Python Skills Due Sept 5 11:59pm
  • Lab 2: Cloud Tooling Due Sept 5 6pm
  • Assignment 2: Shell & Linux Due Sept 11 11:59pm
  • Lab 3: Parallel Computing Due Sept 12 6pm
  • Assignment 3: Parallelization Due Sept 18 11:59pm
  • Lab 4: Docker and Lambda Due Sept 19 6pm
  • Assignment 4: Containers Due Sept 25 11:59pm
  • Lab 5: DuckDB & Polars Due Sept 26 6pm

Look back and ahead

  • Continue to use Slack for questions!
  • Docker (containerization)
  • Lambda functions
  • Coming up: Spark and project

Filesystems

Raw Ingredients of Storage Systems

  • Disk drives (magnetic HDDs or SSDs)
  • RAM
  • Networking and CPU
  • Serialization
  • Compression
  • Caching

From Reis and Housley (2022)

Single-Machine vs. Distributed Storage

From Reis and Housley (2022)

Single-Machine

  • Single-machine filesystems are commonly used to store operating system files, application files, and user data files.
  • Databases also use them to store data files, transaction logs, and backups.

Distributed Storage

  • A distributed filesystem is a type of filesystem that spans multiple computers.
  • It provides a unified view of files across all the computers in the system.
  • Distributed filesystems existed well before the cloud

File Storage Types

Local Disk

  • OS-managed filesystems on a local disk partition (SSD or magnetic disk):
    • NTFS (Windows)
    • HFS+ (macOS)
    • ext4 (Linux)

Network-Attached (NAS)

  • Accessed by clients over a network
  • Provides redundancy and reliability, fine-grained control of resources, storage pooling across multiple disks into large virtual volumes, and file sharing across multiple machines

Cloud Filesystems

  • Not an object store (more on that later)
  • Not the virtual hard drive attached to a virtual machine
  • Fully managed: takes care of networking, disk clusters, failures, and configuration (Azure Files, Amazon Elastic File System)
  • Backed by an object store

Based on Reis and Housley (2022)

Object Stores

  • Somewhat confusing because “object” has several meanings in computer science.
  • In this context, we’re talking about a specialized file-like construct. It could be any type of file: TXT, CSV, JSON, images, videos, audio

  • Contains objects of all shapes and sizes: each gets a unique identifier
  • Objects are immutable: cannot be modified in place (unlike local FS)

Distributed FS vs Object Store

                 Distributed File System                Object Storage
Organization     Files in hierarchical directories      Flat organization (though overlays can provide a hierarchical file structure)
Method           POSIX file operations                  REST API
Immutability     None: random writes anywhere in a file Immutable: must replace/append the entire object
Performance      Performs best for smaller files        Performs best for large files
Scalability      Millions of files                      Billions of objects

Both provide:

  • Fault tolerance
  • Availability and consistency

Before: Data locality (for Hadoop)

From White (2015)

Today: De-Coupling Storage from Compute

From Gopalan (2022)

Data-on-Disk Formats

  • Plain Text (CSV, TSV, FWF)
  • JSON
  • Binary Files

Plain Text (CSV, TSV, FWF)

  • Pay attention to encodings!
  • Lines end in a linefeed, a carriage return, or both, depending on the OS that generated the file
  • Typically, a single line of text contains a single record

JSON

Warning

JSON files come in two flavors: JSON Lines vs. JSON. Typically, when we say data is in JSON format, we mean JSON Lines, which has a single JSON object per line and multiple lines per file.

JSON Lines

4 records, one per line, no end comma

{"id":1, "name":"marck", "last_name":"vaisman"}
{"id":2, "name":"anderson", "last_name":"monken"}
{"id":3, "name":"amit", "last_name":"arora"}
{"id":4, "name":"abhijit", "last_name":"dasgupta"}

JSON

4 records enclosed in 1 JSON Array

[
  {"id":1, "name":"marck", "last_name":"vaisman"},
  {"id":2, "name":"anderson", "last_name":"monken"},
  {"id":3, "name":"amit", "last_name":"arora"},
  {"id":4, "name":"abhijit", "last_name":"dasgupta"},
]
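
A minimal sketch of how the two flavors are read differently in Python (assuming the records above are saved as records.jsonl and records.json; the file names are illustrative):

import pandas as pd

# JSON Lines: one JSON object per line -> lines=True
df_jsonl = pd.read_json("records.jsonl", lines=True)

# Standard JSON: a single array of objects
df_json = pd.read_json("records.json")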

Binary Files

Issues with Common File Formats (Especially CSV)

  • Ubiquitous but highly error-prone
  • Default delimiter is the comma, a character that appears frequently in English text (and therefore in the data itself)
  • Ambiguities:
    • Delimiter (comma, tab, semi-colon, custom)
    • Quote characters (single or double quotes)
    • Escaping to appropriately handle string data
  • Doesn’t natively encode schema information
  • No direct support for nested structures
  • Encoding and schema must be configured on the target system to ensure successful ingestion
  • Autodetection is provided in many cloud environments but is inappropriate for production ingestion (and can be painfully slow)
  • Data engineers are often forced to work with CSV data, and must build robust exception handling and error detection to ensure data quality (Pydantic! see the sketch after this list)
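
A minimal sketch of that validation pattern with Pydantic (the file name people.csv and its columns are hypothetical, chosen only for illustration):

import csv
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    id: int
    name: str
    last_name: str

valid, errors = [], []
with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        try:
            valid.append(Person(**row))   # coerces strings ("1" -> 1) and validates each record
        except ValidationError as e:
            errors.append((row, e))       # keep bad rows for inspection instead of failing the whole load

print(f"{len(valid)} valid rows, {len(errors)} rejected")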

Introducing Apache Parquet

  • Free, open-source, column-oriented data storage format created by Twitter and Cloudera (v1.0 released July 2013)
  • Data stored in columnar format (as opposed to row format), designed for read and write performance
  • Builds in schema information and natively supports nested data
  • Supported by R and Python through Apache Arrow (more coming up!)

Traditional Row-Store

  • Query: “How many balls did we sell?”
  • The engine must scan each and every row until the end!

Column-Store

Row Groups

Data is stored in row groups!

Only required fields

Metadata, compression, and dictionary encoding
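
A small sketch of what the columnar layout buys you in practice, using pyarrow (the file name is a placeholder for any Parquet file):

import pyarrow.parquet as pq

# Read only the columns a query needs -- the other column chunks are never read from disk
table = pq.read_table("yellow_tripdata_2023-06.parquet",
                      columns=["passenger_count", "total_amount"])

# Parquet files carry their own schema and per-row-group metadata
pf = pq.ParquetFile("yellow_tripdata_2023-06.parquet")
print(pf.schema_arrow)             # column names and types
print(pf.metadata.num_row_groups)  # how the data is chunked into row groups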

Apache Arrow for In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. (Topol and McKinney 2024)
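
To make the “standardized language-independent columnar memory format” concrete, here is a minimal sketch (the column names are made up) of one Arrow table being consumed by two different libraries that both speak Arrow:

import pyarrow as pa
import polars as pl

# One Arrow table in memory...
table = pa.table({"id": [1, 2, 3], "amount": [10.5, 3.2, 7.8]})

# ...handed to different tools without a bespoke converter for each pair of systems
pdf = table.to_pandas()     # pandas DataFrame
plf = pl.from_arrow(table)  # Polars DataFrame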

Before Arrow

Topol and McKinney (2024)

After Arrow

Topol and McKinney (2024)

Arrow Compatibility

Topol and McKinney (2024)

Arrow Performance

Topol and McKinney (2024)

Topol and McKinney (2024)

Using Arrow with CSV and Parquet

Python

pyarrow or from pandas

import pandas as pd
pd.read_csv("data.csv", engine="pyarrow")   # Pandas with the Arrow CSV reader
pd.read_parquet("data.parquet")             # uses the pyarrow engine by default

import pyarrow.csv
pyarrow.csv.read_csv("data.csv")            # returns an Arrow Table

import pyarrow.parquet
pyarrow.parquet.read_table("data.parquet")  # returns an Arrow Table

R

Use the arrow package

library(arrow)

read_csv_arrow("data.csv")
read_parquet("data.parquet")
read_json_arrow("data.json")

# given a data frame df
write_csv_arrow(df, "data.csv")
write_parquet(df, "data.parquet")

Recommendation: save your intermediate and analytical datasets as Parquet!

Polars

Lightning-fast DataFrame library for Rust and Python

Before We Begin…

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Pandas is slow, but less slow if you use it the right way!

Polars

Why is Polars Faster than Pandas?

  • Polars is written in Rust. Rust is compiled; Python is interpreted
    • Compiled language: the source is translated to machine code once; subsequent runs skip the compilation step.
    • Interpreted language: code has to be parsed and interpreted every single time it runs.
  • Parallelization: vectorized operations can be executed in parallel on multiple cores
  • Lazy evaluation: Polars offers both a lazy API and an eager API (eager is what Pandas uses). With lazy evaluation, a query is executed only when its result is needed, which lets Polars optimize the whole plan first; with eager evaluation, each operation runs immediately (see the sketch after this list).
  • Polars uses Arrow for its in-memory data representation, similar to how Pandas uses NumPy (Pandas 2 can also use Arrow as a backend)
  • Polars \(\approx\) in-memory DataFrame library + query optimizer
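
A minimal sketch of lazy vs. eager, assuming the taxi Parquet file used later in these slides has been downloaded locally (nothing is read until .collect() is called):

import polars as pl

# Lazy: build a query plan; Polars can push the filter and column selection
# down into the Parquet scan before reading any data
lazy_plan = (
    pl.scan_parquet("yellow_tripdata_2023-06.parquet")
      .filter(pl.col("passenger_count") > 2)
      .select(["passenger_count", "total_amount"])
)
result = lazy_plan.collect()   # execution happens here

# Eager (Pandas-style): each step runs immediately on materialized data
eager = pl.read_parquet("yellow_tripdata_2023-06.parquet")
eager = eager.filter(pl.col("passenger_count") > 2)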

Ease of Use

  • Familiar API for users of Pandas: the syntax differs in places, but it is still a DataFrame API, making it straightforward to perform common operations such as filtering, aggregating, and joining data

  • See Migrating from Pandas

  • Reading data

    # must install s3fs -> "pip install s3fs"
    # Using Polars
    import polars as pl
    polars_df = pl.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
    
    # using Pandas
    import pandas as pd
    pandas_df = pd.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
  • Selecting columns (see Pushdown optimization)

    # Using Polars
    selected_columns_polars = polars_df[['column1', 'column2']]
    
    # Using Pandas
    selected_columns_pandas = pandas_df[['column1', 'column2']]

Ease of Use (contd.)

  • Filtering data

    # Using Polars
    filtered_polars = polars_df[polars_df['column1'] > 10]
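    # note: recent Polars versions may reject boolean-mask indexing like the line above;
    # the idiomatic equivalent is: filtered_polars = polars_df.filter(pl.col('column1') > 10)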
    
    # Using Pandas
    filtered_pandas = pandas_df[pandas_df['column1'] > 10]
  • Though you can write Polars code that looks like Pandas, better to write idiomatic Polars code that takes advantage of Polars’ features

  • Migrating from Apache Spark: Whereas Spark DataFrame is a collection of rows, Polars DataFrame is closer to a collection of columns

Installation, Data Loading, and Basic Operations

Install via pip:

pip install polars

Import polars in your Python code and read data as usual:

import polars as pl
df = pl.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
df.head()

shape: (5, 19)
┌──────────┬──────────────────────┬───────────────────────┬─────────────────┬─────┬───────────────────────┬──────────────┬──────────────────────┬─────────────┐
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime ┆ passenger_count ┆ ... ┆ improvement_surcharge ┆ total_amount ┆ congestion_surcharge ┆ Airport_fee │
│ ---      ┆ ---                  ┆ ---                   ┆ ---             ┆     ┆ ---                   ┆ ---          ┆ ---                  ┆ ---         │
│ i32      ┆ datetime[ns]         ┆ datetime[ns]          ┆ i64             ┆     ┆ f64                   ┆ f64          ┆ f64                  ┆ f64         │
╞══════════╪══════════════════════╪═══════════════════════╪═════════════════╪═════╪═══════════════════════╪══════════════╪══════════════════════╪═════════════╡
│ 1        ┆ 2023-06-01 00:08:48  ┆ 2023-06-01 00:29:41   ┆ 1               ┆ ... ┆ 1.0                   ┆ 33.6         ┆ 2.5                  ┆ 0.0         │
│ 1        ┆ 2023-06-01 00:15:04  ┆ 2023-06-01 00:25:18   ┆ 0               ┆ ... ┆ 1.0                   ┆ 23.6         ┆ 2.5                  ┆ 0.0         │
│ 1        ┆ 2023-06-01 …         ┆ 2023-06-01 …          ┆ 1               ┆ ... ┆ 1.0                   ┆ 60.05        ┆ 0.0                  ┆ 1.75        │
│ …        ┆ …                    ┆ …                     ┆ …               ┆ …   ┆ …                     ┆ …            ┆ …                    ┆ …           │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴─────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

Polars Pipeline Example

We’ll run this as part of the lab in a little bit; think how you might code this in Pandas…

Polars pipeline
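
We’ll see the actual pipeline in the lab; as a flavor of idiomatic Polars, here is a hypothetical pipeline over the taxi data (assumes a recent Polars version and the standard yellow-taxi columns; not necessarily the lab’s exact query):

import polars as pl

result = (
    pl.scan_parquet("yellow_tripdata_2023-06.parquet")
      .filter(pl.col("total_amount") > 0)
      .with_columns((pl.col("tip_amount") / pl.col("total_amount")).alias("tip_pct"))
      .group_by("passenger_count")
      .agg(
          pl.len().alias("trips"),
          pl.col("tip_pct").mean().alias("avg_tip_pct"),
      )
      .sort("trips", descending=True)
      .collect()
)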

Further reading

DuckDB

An in-process SQL OLAP database management system

DuckDB

  • DuckDB is an in-process SQL OLAP DB management system
  • Like SQLite, but for analytics. What does this mean? It means that your database runs inside your process: there are no servers to manage and no remote system to connect to, which makes it easy to experiment with SQL.
  • Vectorized processing: operates on chunks of data at a time (trying to keep everything in the CPU’s L1 and L2 caches) and can therefore handle datasets bigger than the amount of RAM available.
  • Supports Python, R and a host of other languages
  • Important paper on DuckDB: Raasveldt and Mühleisen (2019)

Key Features

  • Columnar Storage, Vectorized Query Processing: DuckDB contains a columnar-vectorized query execution engine, where queries are run on a large batch of values (a “vector”) in one operation
    • Most analytical queries (think group by and summarize), and even data retrieval for training ML models, need only a subset of columns and not the entire row; columnar storage makes this faster (see the sketch after this list)
  • In-Memory Processing: All data needed for processing is brought within the process memory (recall that columnar storage format helps with this) making queries run faster (no DB call over the network)
  • SQL Support: a highly Postgres-compatible dialect of SQL
  • ACID Compliance: Transactional guarantees (ACID properties) through bulk-optimized Multi-Version Concurrency Control (MVCC).
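
A minimal sketch of the columnar/vectorized point (the Parquet file name is a placeholder): DuckDB only scans the column the query touches and processes it in vectors, so the aggregate below never reads the other columns of the file.

import duckdb

# only the total_amount column chunks are read from the Parquet file
duckdb.sql("""
    SELECT AVG(total_amount) AS avg_fare
    FROM 'yellow_tripdata_2023-06.parquet'
""").show()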

Use Cases for DuckDB

  • Data Warehousing
  • Business Intelligence
  • Real-Time Analytics
  • IoT Data Processing

DuckDB in the Wild

DuckDB: DIY

Datalake and DuckDB

Using DuckDB in AWS Lambda

DuckDB: Fully-Managed

MotherDuck Architecture

Setting Up DuckDB

Configuration and Initialization: DuckDB is integrated into Python and R for efficient interactive data analysis (APIs for Java, C, C++, Julia, Swift, and others)

pip install duckdb

Connecting to DuckDB

import duckdb
# directly query a Pandas DataFrame
import pandas as pd
data_url = "https://raw.githubusercontent.com/anly503/datasets/main/EconomistData.csv"
df = pd.read_csv(data_url)
duckdb.sql('SELECT * FROM df')
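# the result is a DuckDB relation; convert it back to a Pandas DataFrame if needed
duckdb.sql('SELECT * FROM df').df()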

Setting Up DuckDB (contd.)

Supported data formats: DuckDB can ingest data from a wide variety of formats – both on-disk and in-memory. See the data ingestion page for more information.

import duckdb
duckdb.read_csv('example.csv')                # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')        # read a Parquet file into a Relation
duckdb.read_json('example.json')              # read a JSON file into a Relation

duckdb.sql('SELECT * FROM "example.csv"')     # directly query a CSV file
duckdb.sql('SELECT * FROM "example.parquet"') # directly query a Parquet file
duckdb.sql('SELECT * FROM "example.json"')    # directly query a JSON file

Querying DuckDB

Essential Reading

Basic SQL Queries

import duckdb
import pandas as pd
babynames = pd.read_parquet("https://github.com/anly503/datasets/raw/main/babynames.parquet.zstd")
duckdb.sql("select count(*)  from babynames where Name='John'")

Aggregations and Grouping

duckdb.sql("select State, Name, count(*) as count  from babynames group by State, Name order by State desc, count desc") 

Querying DuckDB (contd.)

Essential reading: FROM and JOIN clauses

Joins and Subqueries

# Join two tables together
duckdb.sql("SELECT * FROM table_name JOIN other_table ON (table_name.key = other_table.key")

Window Functions

powerplants = pd.read_csv("https://raw.githubusercontent.com/anly503/datasets/main/powerplants.csv", parse_dates=["date"])
q = """
SELECT "plant", "date",
    AVG("MWh") OVER (
        PARTITION BY "plant"
        ORDER BY "date" ASC
        RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
                  AND INTERVAL 3 DAYS FOLLOWING)
        AS "MWh 7-day Moving Average"
FROM powerplants
ORDER BY 1, 2;
"""
duckdb.sql(q)

Using the DuckDB CLI and Shell

  • Install the DuckDB CLI (or use it in the browser at shell.duckdb.org) for data exploration via SQL; once installed, import a local file into the shell and run queries

  • You can download powerplants.csv here

    C:\Users\<username>\Downloads\duckdb_cli-windows-amd64> duckdb
    v0.8.1 6536a77232
    Enter ".help" for usage hints.
    Connected to a transient in-memory database.
    Use ".open FILENAME" to reopen on a persistent database.
    D CREATE TABLE powerplants AS SELECT * FROM read_csv_auto('powerplants.csv');
    D DESCRIBE powerplants;
    ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
    │ column_name │ column_type │  null   │   key   │ default │ extra │
    │   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
    ├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
    │ plant       │ VARCHAR     │ YES     │         │         │       │
    │ date        │ DATE        │ YES     │         │         │       │
    │ MWh         │ BIGINT      │ YES     │         │         │       │
    └─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
    D SELECT * from powerplants where plant='Boston' and date='2019-01-02';
    ┌─────────┬────────────┬────────┐
    │  plant  │    date    │  MWh   │
    │ varchar │    date    │ int64  │
    ├─────────┼────────────┼────────┤
    │ Boston  │ 2019-01-02 │ 564337 │
    └─────────┴────────────┴────────┘

Profiling in DuckDB

Query Optimization: use EXPLAIN and EXPLAIN ANALYZE to understand how your query is executed and how much time is spent in each step

D EXPLAIN ANALYZE SELECT * from powerplants where plant='Boston' and date='2019-01-02';

DuckDB will use all the cores available on the underlying compute, but you can adjust this (Full configuration details here)

D select current_setting('threads');
┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          8 │
└────────────────────────────┘
D SET threads=4;
D select current_setting('threads');
┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          4 │
└────────────────────────────┘

Benchmarks and Comparisons

  • Tricky topic! You can make your chosen solution look better by focusing on the metrics where it gives better results
  • In general, TPC-H and TPC-DS are considered the standard benchmarks for data processing.

TPC-DS Homepage

Further Reading on DuckDB

  1. Parallel Grouped Aggregation in DuckDB
  2. Meta queries
  3. Profiling queries in DuckDB
  4. DuckDB tutorial for beginners
  5. DuckDB CLI API
  6. Using DuckDB in AWS Lambda
  7. Revisiting the Poor Man’s Data Lake with MotherDuck
  8. Supercharge your data processing with DuckDB
  9. Friendlier SQL with DuckDB
  10. Building and deploying data apps with DuckDB and Streamlit

Lab 4

GitHub Classroom Link

References

Gopalan, Rukmani. 2022. The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture. O’Reilly Media, Inc. https://www.dropbox.com/scl/fi/7u469ccc8y169us5hydk0/The-cloud-data-lake-a-guide-to-building-robust-cloud-data-Rukmani-Gopalan.pdf?rlkey=ey06a6zt9g90d4zq8tfncndta&dl=1.
Raasveldt, Mark, and Hannes Mühleisen. 2019. “DuckDB: An Embeddable Analytical Database.” In Proceedings of the 2019 International Conference on Management of Data, 1981–84. SIGMOD ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3299869.3320212.
Reis, Joe, and Matt Housley. 2022. Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O’Reilly Media, Inc. https://www.dropbox.com/scl/fi/w4y4ka22jiraqwo0n4ymm/Fundamentals-of-data-engineering-_-plan-and-build-robust-Joe-Reis-Matt-Housley.pdf?rlkey=4b8d3d5hgnbq6s5o5a16tbznu&dl=1.
Topol, Matthew, and Wes McKinney. 2024. In-Memory Analytics with Apache Arrow. Packt Publishing Ltd. https://www.dropbox.com/scl/fi/cl5bwkjxl8jvxd52fkb4s/Matthew-Topol-In-Memory-Analytics-with-Apache-Arrow-2nd-Edition.pdf?rlkey=anqcenvrfs6t4kaqcunkms2fz&dl=1.
White, Tom E. 2015. Hadoop: The Definitive Guide. O’Reilly Media, Inc. https://www.dropbox.com/scl/fi/j25hdcdr81bautbywm8yl/Tom-White-Hadoop-The-Definitive-Guide.pdf?rlkey=n91jivtd72nb0yjwfkhudxpes&dl=1.