From Pandas to SQL

Converting Code into Queries

Extra Writeups

Author: Jeff Jacobs

Published: September 20, 2025

At some point during Week 4, you may have noticed a somewhat-jarring shift in our approach to data analytics…

Specifically, you may have noticed that we've shifted from writing data-processing code to writing queries.

So, given that your work in DSAN courses like DSAN 5000 and DSAN 5300 has been primarily oriented around Pandas syntax and coding patterns, the goal of this writeup is to get you up-to-speed with how you can “translate” the types of things you’re used to doing with Pandas code into equivalent SQL queries.

Eager vs. Lazy Evaluation

As a final note before we “zoom in” and focus on SQL, though: keep in mind that this move from code to queries is part of our endeavor in this class to start decoupling data from computation! In an eager-evaluation setting like Pandas code, when we write a data-processing pipeline, the steps are immediately evaluated, line-by-line (like any other Python code). Consider the following example, where the goal of the pipeline is to figure out our average ratio of sleep hours to work hours on weekdays:

Code
import calendar
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=6000)
sleep_df = pd.DataFrame({
  'day': list(calendar.day_name),
  'sleep_hrs': rng.uniform(3, 8, size=7),
  'work_hrs': rng.uniform(6, 10, size=7),
})
display(sleep_df)
day sleep_hrs work_hrs
0 Monday 6.089095 7.418997
1 Tuesday 7.912447 8.747633
2 Wednesday 5.393727 6.011502
3 Thursday 4.347202 7.318025
4 Friday 4.556042 6.341836
5 Saturday 4.552443 7.279092
6 Sunday 6.606959 6.652768
Code
# Step 1: Compute ratios
ratio_df = sleep_df.copy()
ratio_df['ratio'] = ratio_df['work_hrs'] / ratio_df['sleep_hrs']
display(ratio_df)
day sleep_hrs work_hrs ratio
0 Monday 6.089095 7.418997 1.218407
1 Tuesday 7.912447 8.747633 1.105553
2 Wednesday 5.393727 6.011502 1.114536
3 Thursday 4.347202 7.318025 1.683387
4 Friday 4.556042 6.341836 1.391962
5 Saturday 4.552443 7.279092 1.598942
6 Sunday 6.606959 6.652768 1.006934
Code
# Step 2: Filter to just weekdays
weekend_days = ['Saturday', 'Sunday']
weekday_df = ratio_df[~ratio_df['day'].isin(weekend_days)]
display(weekday_df)
day sleep_hrs work_hrs ratio
0 Monday 6.089095 7.418997 1.218407
1 Tuesday 7.912447 8.747633 1.105553
2 Wednesday 5.393727 6.011502 1.114536
3 Thursday 4.347202 7.318025 1.683387
4 Friday 4.556042 6.341836 1.391962
Code
# Step 3: Compute mean ratio over the weekdays
weekday_df['ratio'].mean()
1.3027691063719424

Notice here, however, that we are carrying out unnecessary computations in Step 1! Since we’re going to drop the weekend rows in Step 2 anyway, there is no need to compute the ratio for those two days. Since Python (and thus Pandas) is evaluated line-by-line, however, Pandas has no way of knowing that these two ratios don’t need to be computed. It simply does what it is told, dutifully computing all 7 ratios.

There is an alternative approach here: we could instead write out our “plan” for this pipeline in advance, in the form of a query, before actually running any code! We can then use a library with a query optimization engine, like Polars, which can scan over our query and figure out that we don’t need to compute ratios for weekend days, since they would be dropped in Step 2! Here is the same pipeline re-written using a Polars LazyFrame in place of a Pandas DataFrame:

Code
import polars as pl
sleep_pl = pl.LazyFrame(sleep_df)
compute_ratio_expr = pl.col('work_hrs') / pl.col('sleep_hrs')
compute_ratios_expr = (
  sleep_pl.with_columns(
    compute_ratio_expr.alias('ratio')
  ).filter(~pl.col('day').is_in(weekend_days))
)

Note that no computation has actually taken place yet! We have just written out our plan for the computation, in the form of Polars expressions (which don’t actually run until they’re paired with a context).

Because of this, we can see Polars’ query optimization engine in action. First, we can use the optimized=False argument to show_graph() to see what our pipeline looks like without any optimizations (that is, running line-by-line as written, as our Pandas code did above):

Code
compute_ratios_expr.show_graph(optimized=False)

Remember that these query plans should be read from bottom to top. Now, compare that with the following graph, which shows our query after Polars scans over our instructions, detecting points where it can re-arrange our pipeline for greater efficiency:

Code
compute_ratios_expr.show_graph()

Notice the difference in the ordering of steps: Polars has successfully figured out that it should filter out the weekend days before carrying out the calculations!

Onto DuckDB!

Once you get the hang of the material in Weeks 4-6, it may slowly dawn on you that Polars is kind of a… “middle ground” compromise, allowing:

  • Pandas enthusiasts who are most comfortable writing Python code but want to write it in a way that it can be optimized before execution to “meet halfway” with
  • SQL enthusiasts who are already used to writing all of their queries out before executing them

I choose the “meet halfway” metaphor very specifically because, when you write Polars expressions like the compute_ratios_expr expression in the above code block, you are essentially writing SQL queries programmatically! In other words, as you become more and more comfortable with Polars, you are also implicitly becoming more and more comfortable with writing queries; you just don’t know it yet!

To see what I mean, let’s do one final “translation” and re-write our pipeline once more using DuckDB this time:

Code
import duckdb
con = duckdb.connect()
ratio_query = """
SELECT
  day,
  sleep_hrs,
  work_hrs,
  work_hrs / sleep_hrs AS ratio
FROM sleep_df
WHERE day NOT IN ('Saturday','Sunday')
"""
result_df = con.execute(ratio_query).df()
result_df
day sleep_hrs work_hrs ratio
0 Monday 6.089095 7.418997 1.218407
1 Tuesday 7.912447 8.747633 1.105553
2 Wednesday 5.393727 6.011502 1.114536
3 Thursday 4.347202 7.318025 1.683387
4 Friday 4.556042 6.341836 1.391962

Now, just like with Polars, we can ask DuckDB to explain how it will carry out this query: the syntax is a bit weird, but you can just add the keywords EXPLAIN ANALYZE to the beginning of any query to obtain DuckDB’s query plan:

Code
ratio_query_explain = f"EXPLAIN ANALYZE {ratio_query}"
print(ratio_query_explain)
explanation = con.execute(ratio_query_explain).fetchone()
print(explanation[1])
EXPLAIN ANALYZE 
SELECT
  day,
  sleep_hrs,
  work_hrs,
  work_hrs / sleep_hrs AS ratio
FROM sleep_df
WHERE day NOT IN ('Saturday','Sunday')

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE  SELECT   day,   sleep_hrs,   work_hrs,   work_hrs / sleep_hrs AS ratio FROM sleep_df WHERE day NOT IN ('Saturday','Sunday') 
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 0.0004s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            day            │
│         sleep_hrs         │
│          work_hrs         │
│           ratio           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             5             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((day != 'Saturday') AND  │
│     (day != 'Sunday'))    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             5             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│        PANDAS_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            day            │
│         sleep_hrs         │
│          work_hrs         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 7           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             7             │
│          (0.00s)          │
└───────────────────────────┘                             

And, like before (again reading the query plan from bottom to top), we can see that even though we wrote the division work_hrs / sleep_hrs earlier in our query than the filtering step (the WHERE clause), DuckDB has also successfully figured out that executing these in the reverse order is more efficient! (The PROJECTION step, occurring after the FILTER step, is the first step where the ratio result appears.)

Basic Pandas Operations in SQL

With all that in mind, you can now move towards thinking through how you can carry out the above sequence of “translations” – from Pandas to Polars to DuckDB – for other common Pandas tasks you’re used to from e.g. DSAN 5000!

To nudge you in that direction, in this section we’ll conclude with a few examples of common Pandas pipelines and how they translate to SQL code.

Filtering Rows

This is probably the most straightforward operation in SQL world, since SQL provides us with the WHERE operator which we can pair with a boolean condition to filter the rows of our DataFrame.

So, for example, if we wanted to select one specific row in our DataFrame on the basis of some value, we can do this using the general syntax SELECT * FROM df WHERE condition. The following code uses this syntax to select just the row containing data for Friday:

Code
con.execute("""
SELECT * FROM sleep_df
WHERE day = 'Friday'
""").df()
day sleep_hrs work_hrs
0 Friday 4.556042 6.341836
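For comparison, here is one way the same filter could look back in Pandas land, using the boolean-mask pattern you’re used to (with a small illustrative stand-in for sleep_df):

```python
import pandas as pd

# Illustrative stand-in for sleep_df
sleep_df = pd.DataFrame({
    'day': ['Thursday', 'Friday', 'Saturday'],
    'sleep_hrs': [4.3, 4.6, 4.6],
    'work_hrs': [7.3, 6.3, 7.3],
})
# Pandas equivalent of: SELECT * FROM sleep_df WHERE day = 'Friday'
friday_df = sleep_df[sleep_df['day'] == 'Friday']
print(friday_df)
```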

Notice how, since SQL is a “declarative” language, it avoids the confusion which comes up in Python between the single-equals-sign assignment operator = and the double-equals-sign equality operator ==. In DuckDB, you can actually use either of these in a boolean condition to check for equality, though the single-equals-sign = used above is standard SQL and thus more portable:

Code
con.execute("""
SELECT * FROM sleep_df
WHERE day == 'Friday'
""").df()
day sleep_hrs work_hrs
0 Friday 4.556042 6.341836

You may have noticed, however, that we didn’t use this = operator when filtering out weekend days above. Instead, we used the IN operator, along with the NOT operator for negation. On its own, the IN operator works as follows:

Code
con.execute("""
SELECT * FROM sleep_df
WHERE day IN ('Monday','Wednesday','Friday')
""").df()
day sleep_hrs work_hrs
0 Monday 6.089095 7.418997
1 Wednesday 5.393727 6.011502
2 Friday 4.556042 6.341836

And then we can use the NOT operator to obtain the exact opposite (the days which are not among the three listed days in the query):

Code
con.execute("""
SELECT * FROM sleep_df
WHERE day NOT IN ('Monday','Wednesday','Friday')
""").df()
day sleep_hrs work_hrs
0 Tuesday 7.912447 8.747633
1 Thursday 4.347202 7.318025
2 Saturday 4.552443 7.279092
3 Sunday 6.606959 6.652768
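These two queries translate back to the Pandas .isin() pattern we used in the eager pipeline at the start of this writeup; a minimal sketch with illustrative data:

```python
import pandas as pd

# Illustrative stand-in for sleep_df
sleep_df = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Friday', 'Saturday'],
    'sleep_hrs': [6.1, 7.9, 5.4, 4.6, 4.6],
    'work_hrs': [7.4, 8.7, 6.0, 6.3, 7.3],
})
mwf_days = ['Monday', 'Wednesday', 'Friday']
# IN (...)      ->  .isin(...)
in_df = sleep_df[sleep_df['day'].isin(mwf_days)]
# NOT IN (...)  ->  ~ .isin(...)
not_in_df = sleep_df[~sleep_df['day'].isin(mwf_days)]
print(in_df['day'].tolist(), not_in_df['day'].tolist())
```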

Selecting Columns

Notice how, up until now, we’ve been using SELECT * at the beginning of our queries. That’s because our goal was to view the entire contents of the DataFrame, just with certain rows selected and others filtered out.

If we instead want to select certain columns, we can just replace the wildcard symbol * that we placed after SELECT with a specific list of columns:

Code
con.execute("""
SELECT day, work_hrs FROM sleep_df
""").df()
day work_hrs
0 Monday 7.418997
1 Tuesday 8.747633
2 Wednesday 6.011502
3 Thursday 7.318025
4 Friday 6.341836
5 Saturday 7.279092
6 Sunday 6.652768
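The Pandas translation of this column selection is the familiar double-bracket syntax; a minimal sketch with illustrative data:

```python
import pandas as pd

# Illustrative stand-in for sleep_df
sleep_df = pd.DataFrame({
    'day': ['Monday', 'Tuesday'],
    'sleep_hrs': [6.1, 7.9],
    'work_hrs': [7.4, 8.7],
})
# Pandas equivalent of: SELECT day, work_hrs FROM sleep_df
subset_df = sleep_df[['day', 'work_hrs']]
print(subset_df.columns.tolist())
```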

Importantly, we can also select columns whose names match a regular expression pattern, by using SELECT COLUMNS() rather than just SELECT. Here, for example, we use it to select all of the columns whose names end with _hrs:

Code
con.execute("""
SELECT COLUMNS('.+_hrs') FROM sleep_df
""").df()
sleep_hrs work_hrs
0 6.089095 7.418997
1 7.912447 8.747633
2 5.393727 6.011502
3 4.347202 7.318025
4 4.556042 6.341836
5 4.552443 7.279092
6 6.606959 6.652768

Note that this COLUMNS() syntax is specific to DuckDB’s “flavor” of SQL, and not standard in other implementations of SQL.
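The closest Pandas analogue to COLUMNS() is the DataFrame.filter() method with a regex argument, sketched here with illustrative data:

```python
import pandas as pd

# Illustrative stand-in for sleep_df
sleep_df = pd.DataFrame({
    'day': ['Monday'],
    'sleep_hrs': [6.1],
    'work_hrs': [7.4],
})
# Pandas analogue of DuckDB's SELECT COLUMNS('.+_hrs')
hrs_df = sleep_df.filter(regex=r'.+_hrs')
print(hrs_df.columns.tolist())
```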

Defining New Columns

Finally, you may have noticed that the SELECT operator is basically doing double duty: not only have we used it to request the three original columns in sleep_df (day, sleep_hrs, and work_hrs), but in our very first DuckDB query we also used it to define a new column named ratio, by dividing work_hrs by sleep_hrs.

This is a general feature of SQL and DuckDB (and of the select() function in Polars)! We can SELECT the columns “as-is” by just including their names, or we can define new columns within the select statement. For example, if we wanted the squared values of our two numeric columns for some reason, we could do the following (here ^ is DuckDB’s exponentiation operator, as in PostgreSQL):

Code
con.execute("""
SELECT
  day,
  sleep_hrs,
  sleep_hrs^2 AS sleep_squared,
  work_hrs,
  work_hrs^2 AS work_squared
FROM sleep_df
""").df()
day sleep_hrs sleep_squared work_hrs work_squared
0 Monday 6.089095 37.077083 7.418997 55.041509
1 Tuesday 7.912447 62.606823 8.747633 76.521076
2 Wednesday 5.393727 29.092288 6.011502 36.138152
3 Thursday 4.347202 18.898161 7.318025 53.553486
4 Friday 4.556042 20.757516 6.341836 40.218888
5 Saturday 4.552443 20.724741 7.279092 52.985182
6 Sunday 6.606959 43.651905 6.652768 44.259327
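In Pandas, the analogous move is .assign() (or direct column assignment); a minimal sketch with illustrative values:

```python
import pandas as pd

# Illustrative stand-in for sleep_df
sleep_df = pd.DataFrame({
    'day': ['Monday'],
    'sleep_hrs': [6.0],
    'work_hrs': [7.0],
})
# Pandas equivalent of: SELECT ..., sleep_hrs^2 AS sleep_squared, ...
squared_df = sleep_df.assign(
    sleep_squared=sleep_df['sleep_hrs'] ** 2,
    work_squared=sleep_df['work_hrs'] ** 2,
)
print(squared_df)
```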

Combining Operations

To bring everything together, let’s see how we can combine the above three tasks into a single query!

Code
con.execute("""
SELECT
  day,
  work_hrs,
  work_hrs^2 AS work_squared
FROM sleep_df
WHERE day = 'Tuesday'
""").df()
day work_hrs work_squared
0 Tuesday 8.747633 76.521076
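And, closing the loop, here is one way the combined query could look as a chained Pandas pipeline (with an illustrative stand-in for sleep_df):

```python
import pandas as pd

# Illustrative stand-in for sleep_df
sleep_df = pd.DataFrame({
    'day': ['Monday', 'Tuesday'],
    'sleep_hrs': [6.1, 7.9],
    'work_hrs': [7.4, 8.7],
})
# Filter rows, select columns, then define the new column
combined_df = (
    sleep_df[sleep_df['day'] == 'Tuesday']
    [['day', 'work_hrs']]
    .assign(work_squared=lambda df: df['work_hrs'] ** 2)
)
print(combined_df)
```

Notice how the order of operations is explicit here (filter, then select, then compute), whereas in the SQL version we could write the steps in any order and let DuckDB’s optimizer choose the efficient one.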