Week 4: DuckDB, Polars, File Formats

DSAN 6000: Big Data and Cloud Computing
Fall 2025

Amit Arora and Jeff Jacobs

jj1088@georgetown.edu

Monday, September 15, 2025

Agenda and Goals for Today

Lecture

  • Distributed file systems
  • Modern file types
  • Working with large tabular data on a single node
    • DuckDB
    • Polars

Lab

  • Run a similar task with Pandas, Polars, and DuckDB

Logistics and Review

Deadlines

  • Assignment 1: Python Skills Due Sept 5 11:59pm
  • Lab 2: Cloud Tooling Due Sept 5 6pm
  • Assignment 2: Shell & Linux Due Sept 11 11:59pm
  • Lab 3: Parallel Computing Due Sept 12 6pm
  • Assignment 3: Parallelization Due Sept 18 11:59pm
  • Lab 4: Docker and Lambda Due Sept 19 6pm
  • Assignment 4: Containers Due Sept 25 11:59pm
  • Lab 5: DuckDB & Polars Due Sept 26 6pm

Look back and ahead

  • Continue to use Slack for questions!
  • Docker (containerization)
  • Lambda functions
  • Coming up: Spark and project

Filesystems

Raw Ingredients of Storage Systems

  • Disk drives (magnetic HDDs or SSDs)
  • RAM
  • Networking and CPU
  • Serialization
  • Compression
  • Caching

From Reis and Housley (2022)

Single-Machine vs. Distributed Storage

From Reis and Housley (2022)

Single-Machine

  • Single-machine filesystems are commonly used to store operating system files, application files, and user data files.
  • Databases also use them to store data files, transaction logs, and backups.

Distributed Storage

  • A distributed filesystem is a type of filesystem that spans multiple computers.
  • It provides a unified view of files across all the computers in the system.
  • Distributed filesystems existed well before the cloud

File Storage Types

Local Disk

  • OS-managed filesystems on a local disk partition (SSD or magnetic disk):
    • NTFS (Windows)
    • HFS+ (macOS)
    • ext4 (Linux)

Network-Attached (NAS)

  • Accessed by clients over a network
  • Provides redundancy and reliability, fine-grained control of resources, storage pooling across multiple disks into large virtual volumes, and file sharing across multiple machines

Cloud Filesystems

  • Not an object store (more on that later)
  • Not the virtual hard drive attached to a virtual machine
  • Fully managed: takes care of networking, disk clusters, failures, and configuration (Azure Files, Amazon Elastic File System)
  • Backed by an object store

Based on Reis and Housley (2022)

Object Stores

  • Somewhat confusing because “object” has several meanings in computer science.
  • In this context, we’re talking about a specialized file-like construct. It could be any type of file: TXT, CSV, JSON, images, videos, audio

  • Contains objects of all shapes and sizes: each gets a unique identifier
  • Objects are immutable: cannot be modified in place (unlike local FS)

Distributed FS vs Object Store

                 Distributed File System                Object Storage
Organization     Files in hierarchical directories      Flat organization (though overlays can provide a hierarchical file structure)
Method           POSIX file operations                  REST API
Immutability     None: random writes anywhere in a file Immutable: must replace/append the entire object
Performance      Performs best for smaller files        Performs best for large files
Scalability      Millions of files                      Billions of objects

Both provide:

  • Fault tolerance
  • Availability and consistency

Before: Data locality (for Hadoop)

From White (2015)

Today: De-Coupling Storage from Compute

From Gopalan (2022)

Data-on-Disk Formats

  • Plain Text (CSV, TSV, FWF)
  • JSON
  • Binary Files

Plain Text (CSV, TSV, FWF)

  • Pay attention to encodings!
  • Lines end in a linefeed, a carriage return, or both, depending on the OS that generated the file
  • Typically, a single line of text contains a single record

JSON

Warning

JSON files come in two flavors: JSON Lines vs. JSON. Typically, when we say data is in JSON format, we mean JSON Lines, which has a single JSON object per line and multiple lines per file.

JSON Lines

4 records, one per line, no end comma

{"id":1, "name":"marck", "last_name":"vaisman"}
{"id":2, "name":"anderson", "last_name":"monken"}
{"id":3, "name":"amit", "last_name":"arora"}
{"id":4, "name":"abhijit", "last_name":"dasgupta"}

JSON

4 records enclosed in 1 JSON Array

[
  {"id":1, "name":"marck", "last_name":"vaisman"},
  {"id":2, "name":"anderson", "last_name":"monken"},
  {"id":3, "name":"amit", "last_name":"arora"},
  {"id":4, "name":"abhijit", "last_name":"dasgupta"},
]
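
A minimal sketch of how the two flavors are read differently in Python (assuming the records above are saved as records.jsonl and records.json; the file names are illustrative):

import pandas as pd

# JSON Lines: one JSON object per line -> lines=True
df_jsonl = pd.read_json("records.jsonl", lines=True)

# Standard JSON: a single array of objects
df_json = pd.read_json("records.json")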

Binary Files

Issues with Common File Formats (Especially CSV)

  • Ubiquitous but highly error-prone
  • Default delimiter is the comma, a character that appears frequently in English text (and therefore in the data itself)
  • Ambiguities:
    • Delimiter (comma, tab, semi-colon, custom)
    • Quote characters (single or double quotes)
    • Escaping to appropriately handle string data
  • Doesn’t natively encode schema information
  • No direct support for nested structures
  • Encoding and schema must be configured on the target system to ensure successful ingestion
  • Autodetection is provided in many cloud environments but is inappropriate for production ingestion (and can be painfully slow)
  • Data engineers are often forced to work with CSV data, and must build robust exception handling and error detection to ensure data quality (Pydantic! see the sketch after this list)
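
A minimal sketch of that validation pattern with Pydantic (the file name people.csv and its columns are hypothetical, chosen only for illustration):

import csv
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    id: int
    name: str
    last_name: str

valid, errors = [], []
with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        try:
            valid.append(Person(**row))   # coerces strings ("1" -> 1) and validates each record
        except ValidationError as e:
            errors.append((row, e))       # keep bad rows for inspection instead of failing the whole load

print(f"{len(valid)} valid rows, {len(errors)} rejected")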

Introducing Apache Parquet

  • Free, open-source, column-oriented data storage format created by Twitter and Cloudera (v1.0 released July 2013)
  • Data stored in columnar format (as opposed to row format), designed for read and write performance
  • Builds in schema information and natively supports nested data
  • Supported by R and Python through Apache Arrow (more coming up!)

Traditional Row-Store

  • Query: “How many balls did we sell?”
  • The engine must scan each and every row until the end!

Column-Store

Row Groups

Data is stored in row groups!

Only required fields

Metadata, compression, and dictionary encoding
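
A small sketch of what the columnar layout buys you in practice, using pyarrow (the file name is a placeholder for any Parquet file):

import pyarrow.parquet as pq

# Read only the columns a query needs -- the other column chunks are never read from disk
table = pq.read_table("yellow_tripdata_2023-06.parquet",
                      columns=["passenger_count", "total_amount"])

# Parquet files carry their own schema and per-row-group metadata
pf = pq.ParquetFile("yellow_tripdata_2023-06.parquet")
print(pf.schema_arrow)             # column names and types
print(pf.metadata.num_row_groups)  # how the data is chunked into row groups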

Apache Arrow for In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. (Topol and McKinney 2024)
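
To make the “standardized language-independent columnar memory format” concrete, here is a minimal sketch (the column names are made up) of one Arrow table being consumed by two different libraries that both speak Arrow:

import pyarrow as pa
import polars as pl

# One Arrow table in memory...
table = pa.table({"id": [1, 2, 3], "amount": [10.5, 3.2, 7.8]})

# ...handed to different tools without a bespoke converter for each pair of systems
pdf = table.to_pandas()     # pandas DataFrame
plf = pl.from_arrow(table)  # Polars DataFrame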

Before Arrow

Topol and McKinney (2024)

After Arrow

Topol and McKinney (2024)

Arrow Compatibility

Topol and McKinney (2024)

Arrow Performance

Topol and McKinney (2024)

Topol and McKinney (2024)

Using Arrow with CSV and Parquet

Python

pyarrow or from pandas

import pandas as pd
pd.read_csv("data.csv", engine="pyarrow")   # Pandas with the Arrow CSV reader
pd.read_parquet("data.parquet")             # uses the pyarrow engine by default

import pyarrow.csv
pyarrow.csv.read_csv("data.csv")            # returns an Arrow Table

import pyarrow.parquet
pyarrow.parquet.read_table("data.parquet")  # returns an Arrow Table

R

Use the arrow package

library(arrow)

read_csv_arrow("data.csv")
read_parquet("data.parquet")
read_json_arrow("data.json")

# given a data frame df
write_csv_arrow(df, "data.csv")
write_parquet(df, "data.parquet")

Recommendation: save your intermediate and analytical datasets as Parquet!

Polars

Lightning-fast DataFrame library for Rust and Python

Before We Begin…

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Pandas is slow, but less slow if you use it the right way!

Polars

Why is Polars Faster than Pandas?

  • Polars is written in Rust. Rust is compiled; Python is interpreted
    • Compiled language: the source is translated to machine code once; subsequent runs skip the compilation step.
    • Interpreted language: code has to be parsed and interpreted every single time it runs.
  • Parallelization: vectorized operations can be executed in parallel on multiple cores
  • Lazy evaluation: Polars offers both a lazy API and an eager API (eager is what Pandas uses). With lazy evaluation, a query is executed only when its result is needed, which lets Polars optimize the whole plan first; with eager evaluation, each operation runs immediately (see the sketch after this list).
  • Polars uses Arrow for its in-memory data representation, similar to how Pandas uses NumPy (Pandas 2 can also use Arrow as a backend)
  • Polars \(\approx\) in-memory DataFrame library + query optimizer
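
A minimal sketch of lazy vs. eager, assuming the taxi Parquet file used later in these slides has been downloaded locally (nothing is read until .collect() is called):

import polars as pl

# Lazy: build a query plan; Polars can push the filter and column selection
# down into the Parquet scan before reading any data
lazy_plan = (
    pl.scan_parquet("yellow_tripdata_2023-06.parquet")
      .filter(pl.col("passenger_count") > 2)
      .select(["passenger_count", "total_amount"])
)
result = lazy_plan.collect()   # execution happens here

# Eager (Pandas-style): each step runs immediately on materialized data
eager = pl.read_parquet("yellow_tripdata_2023-06.parquet")
eager = eager.filter(pl.col("passenger_count") > 2)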

Ease of Use

  • Familiar API for users of Pandas: the syntax differs in places, but it is still a DataFrame API, making it straightforward to perform common operations such as filtering, aggregating, and joining data

  • See Migrating from Pandas

  • Reading data

    # must install s3fs -> "pip install s3fs"
    # Using Polars
    import polars as pl
    polars_df = pl.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
    
    # using Pandas
    import pandas as pd
    pandas_df = pd.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
  • Selecting columns (see Pushdown optimization)

    # Using Polars
    selected_columns_polars = polars_df[['column1', 'column2']]
    
    # Using Pandas
    selected_columns_pandas = pandas_df[['column1', 'column2']]

Ease of Use (contd.)

  • Filtering data

    # Using Polars
    filtered_polars = polars_df[polars_df['column1'] > 10]
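    # note: recent Polars versions may reject boolean-mask indexing like the line above;
    # the idiomatic equivalent is: filtered_polars = polars_df.filter(pl.col('column1') > 10)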
    
    # Using Pandas
    filtered_pandas = pandas_df[pandas_df['column1'] > 10]
  • Though you can write Polars code that looks like Pandas, better to write idiomatic Polars code that takes advantage of Polars’ features

  • Migrating from Apache Spark: Whereas Spark DataFrame is a collection of rows, Polars DataFrame is closer to a collection of columns

Installation, Data Loading, and Basic Operations

Install via pip:

pip install polars

Import polars in your Python code and read data as usual:

import polars as pl
df = pl.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
df.head()

shape: (5, 19)
┌──────────┬──────────────────────┬───────────────────────┬─────────────────┬─────┬───────────────────────┬──────────────┬──────────────────────┬─────────────┐
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime ┆ passenger_count ┆ ... ┆ improvement_surcharge ┆ total_amount ┆ congestion_surcharge ┆ Airport_fee │
│ ---      ┆ ---                  ┆ ---                   ┆ ---             ┆     ┆ ---                   ┆ ---          ┆ ---                  ┆ ---         │
│ i32      ┆ datetime[ns]         ┆ datetime[ns]          ┆ i64             ┆     ┆ f64                   ┆ f64          ┆ f64                  ┆ f64         │
╞══════════╪══════════════════════╪═══════════════════════╪═════════════════╪═════╪═══════════════════════╪══════════════╪══════════════════════╪═════════════╡
│ 1        ┆ 2023-06-01 00:08:48  ┆ 2023-06-01 00:29:41   ┆ 1               ┆ ... ┆ 1.0                   ┆ 33.6         ┆ 2.5                  ┆ 0.0         │
│ 1        ┆ 2023-06-01 00:15:04  ┆ 2023-06-01 00:25:18   ┆ 0               ┆ ... ┆ 1.0                   ┆ 23.6         ┆ 2.5                  ┆ 0.0         │
│ 1        ┆ 2023-06-01 …         ┆ 2023-06-01 …          ┆ 1               ┆ ... ┆ 1.0                   ┆ 60.05        ┆ 0.0                  ┆ 1.75        │
│ …        ┆ …                    ┆ …                     ┆ …               ┆ …   ┆ …                     ┆ …            ┆ …                    ┆ …           │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴─────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

Polars Pipeline Example

We’ll run this as part of the lab in a little bit; think how you might code this in Pandas…

Polars pipeline
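
We’ll see the actual pipeline in the lab; as a flavor of idiomatic Polars, here is a hypothetical pipeline over the taxi data (assumes a recent Polars version and the standard yellow-taxi columns; not necessarily the lab’s exact query):

import polars as pl

result = (
    pl.scan_parquet("yellow_tripdata_2023-06.parquet")
      .filter(pl.col("total_amount") > 0)
      .with_columns((pl.col("tip_amount") / pl.col("total_amount")).alias("tip_pct"))
      .group_by("passenger_count")
      .agg(
          pl.len().alias("trips"),
          pl.col("tip_pct").mean().alias("avg_tip_pct"),
      )
      .sort("trips", descending=True)
      .collect()
)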

Further reading

DuckDB

An in-process SQL OLAP database management system

DuckDB

  • DuckDB is an in-process SQL OLAP DB management system
  • Like SQLite, but for analytics. What does this mean? It means that your database runs inside your process: there are no servers to manage and no remote system to connect to, which makes it easy to experiment with SQL.
  • Vectorized processing: operates on chunks of data at a time (trying to keep everything in the CPU’s L1 and L2 caches) and can therefore handle datasets bigger than the amount of RAM available.
  • Supports Python, R and a host of other languages
  • Important paper on DuckDB: Raasveldt and Mühleisen (2019)

Key Features

  • Columnar Storage, Vectorized Query Processing: DuckDB contains a columnar-vectorized query execution engine, where queries are run on a large batch of values (a “vector”) in one operation
    • Most analytical queries (think group by and summarize), and even data retrieval for training ML models, need only a subset of columns and not the entire row; columnar storage makes this faster (see the sketch after this list)
  • In-Memory Processing: All data needed for processing is brought within the process memory (recall that columnar storage format helps with this) making queries run faster (no DB call over the network)
  • SQL Support: a highly Postgres-compatible dialect of SQL
  • ACID Compliance: Transactional guarantees (ACID properties) through bulk-optimized Multi-Version Concurrency Control (MVCC).
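
A minimal sketch of the columnar/vectorized point (the Parquet file name is a placeholder): DuckDB only scans the column the query touches and processes it in vectors, so the aggregate below never reads the other columns of the file.

import duckdb

# only the total_amount column chunks are read from the Parquet file
duckdb.sql("""
    SELECT AVG(total_amount) AS avg_fare
    FROM 'yellow_tripdata_2023-06.parquet'
""").show()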

Use Cases for DuckDB

  • Data Warehousing
  • Business Intelligence
  • Real-Time Analytics
  • IoT Data Processing

DuckDB in the Wild

DuckDB: DIY

Datalake and DuckDB

Using DuckDB in AWS Lambda

DuckDB: Fully-Managed

MotherDuck Architecture

Setting Up DuckDB

Configuration and Initialization: DuckDB is integrated into Python and R for efficient interactive data analysis (APIs for Java, C, C++, Julia, Swift, and others)

pip install duckdb

Connecting to DuckDB

import duckdb
# directly query a Pandas DataFrame
import pandas as pd
data_url = "https://raw.githubusercontent.com/anly503/datasets/main/EconomistData.csv"
df = pd.read_csv(data_url)
duckdb.sql('SELECT * FROM df')
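# the result is a DuckDB relation; convert it back to a Pandas DataFrame if needed
duckdb.sql('SELECT * FROM df').df()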

Setting Up DuckDB (contd.)

Supported data formats: DuckDB can ingest data from a wide variety of formats – both on-disk and in-memory. See the data ingestion page for more information.

import duckdb
duckdb.read_csv('example.csv')                # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')        # read a Parquet file into a Relation
duckdb.read_json('example.json')              # read a JSON file into a Relation

duckdb.sql('SELECT * FROM "example.csv"')     # directly query a CSV file
duckdb.sql('SELECT * FROM "example.parquet"') # directly query a Parquet file
duckdb.sql('SELECT * FROM "example.json"')    # directly query a JSON file

Querying DuckDB

Essential Reading

Basic SQL Queries

import duckdb
import pandas as pd
babynames = pd.read_parquet("https://github.com/anly503/datasets/raw/main/babynames.parquet.zstd")
duckdb.sql("select count(*)  from babynames where Name='John'")

Aggregations and Grouping

duckdb.sql("select State, Name, count(*) as count  from babynames group by State, Name order by State desc, count desc") 

Querying DuckDB (contd.)

Essential reading: FROM and JOIN clauses

Joins and Subqueries

# Join two tables together
duckdb.sql("SELECT * FROM table_name JOIN other_table ON (table_name.key = other_table.key")

Window Functions

powerplants = pd.read_csv("https://raw.githubusercontent.com/anly503/datasets/main/powerplants.csv", parse_dates=["date"])
q = """
SELECT "plant", "date",
    AVG("MWh") OVER (
        PARTITION BY "plant"
        ORDER BY "date" ASC
        RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
                  AND INTERVAL 3 DAYS FOLLOWING)
        AS "MWh 7-day Moving Average"
FROM powerplants
ORDER BY 1, 2;
"""
duckdb.sql(q)

Using the DuckDB CLI and Shell

  • Install the DuckDB CLI (or use it in the browser at shell.duckdb.org) for data exploration via SQL; once installed, import a local file into the shell and run queries

  • You can download powerplants.csv here

    C:\Users\<username>\Downloads\duckdb_cli-windows-amd64> duckdb
    v0.8.1 6536a77232
    Enter ".help" for usage hints.
    Connected to a transient in-memory database.
    Use ".open FILENAME" to reopen on a persistent database.
    D CREATE TABLE powerplants AS SELECT * FROM read_csv_auto('powerplants.csv');
    D DESCRIBE powerplants;
    ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
    │ column_name │ column_type │  null   │   key   │ default │ extra │
    │   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
    ├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
    │ plant       │ VARCHAR     │ YES     │         │         │       │
    │ date        │ DATE        │ YES     │         │         │       │
    │ MWh         │ BIGINT      │ YES     │         │         │       │
    └─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
    D SELECT * from powerplants where plant='Boston' and date='2019-01-02';
    ┌─────────┬────────────┬────────┐
    │  plant  │    date    │  MWh   │
    │ varchar │    date    │ int64  │
    ├─────────┼────────────┼────────┤
    │ Boston  │ 2019-01-02 │ 564337 │
    └─────────┴────────────┴────────┘

Profiling in DuckDB

Query Optimization: use EXPLAIN and EXPLAIN ANALYZE to understand how your query is executed and how much time is spent in each step

D EXPLAIN ANALYZE SELECT * from powerplants where plant='Boston' and date='2019-01-02';

DuckDB will use all the cores available on the underlying compute, but you can adjust this (Full configuration details here)

D select current_setting('threads');
┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          8 │
└────────────────────────────┘
D SET threads=4;
D select current_setting('threads');
┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          4 │
└────────────────────────────┘

Benchmarks and Comparisons

  • Tricky topic! You can make your chosen solution look better by focusing on the metrics where it gives better results
  • In general, TPC-H and TPC-DS are considered the standard benchmarks for data processing.

TPC-DS Homepage

Further Reading on DuckDB

  1. Parallel Grouped Aggregation in DuckDB
  2. Meta queries
  3. Profiling queries in DuckDB
  4. DuckDB tutorial for beginners
  5. DuckDB CLI API
  6. Using DuckDB in AWS Lambda
  7. Revisiting the Poor Man’s Data Lake with MotherDuck
  8. Supercharge your data processing with DuckDB
  9. Friendlier SQL with DuckDB
  10. Building and deploying data apps with DuckDB and Streamlit

Lab 4

GitHub Classroom Link

References

Gopalan, Rukmani. 2022. The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture. O’Reilly Media, Inc. https://www.dropbox.com/scl/fi/7u469ccc8y169us5hydk0/The-cloud-data-lake-a-guide-to-building-robust-cloud-data-Rukmani-Gopalan.pdf?rlkey=ey06a6zt9g90d4zq8tfncndta&dl=1.
Raasveldt, Mark, and Hannes Mühleisen. 2019. “DuckDB: An Embeddable Analytical Database.” In Proceedings of the 2019 International Conference on Management of Data, 1981–84. SIGMOD ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3299869.3320212.
Reis, Joe, and Matt Housley. 2022. Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O’Reilly Media, Inc. https://www.dropbox.com/scl/fi/w4y4ka22jiraqwo0n4ymm/Fundamentals-of-data-engineering-_-plan-and-build-robust-Joe-Reis-Matt-Housley.pdf?rlkey=4b8d3d5hgnbq6s5o5a16tbznu&dl=1.
Topol, Matthew, and Wes McKinney. 2024. In-Memory Analytics with Apache Arrow. Packt Publishing Ltd. https://www.dropbox.com/scl/fi/cl5bwkjxl8jvxd52fkb4s/Matthew-Topol-In-Memory-Analytics-with-Apache-Arrow-2nd-Edition.pdf?rlkey=anqcenvrfs6t4kaqcunkms2fz&dl=1.
White, Tom E. 2015. Hadoop: The Definitive Guide. O’Reilly Media, Inc. https://www.dropbox.com/scl/fi/j25hdcdr81bautbywm8yl/Tom-White-Hadoop-The-Definitive-Guide.pdf?rlkey=n91jivtd72nb0yjwfkhudxpes&dl=1.