DSAN 6000: Big Data and Cloud Computing
Fall 2025
Monday, October 6, 2025
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf
collect() CAUTION
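A minimal sketch of the risk and the usual workarounds (the DataFrame name df is illustrative):

```{python}
# CAUTION: collect() pulls every row back to the driver.
# On a large DataFrame this can exhaust driver memory.
rows = df.collect()      # risky: materializes the whole DataFrame on the driver

# Safer alternatives when you only need to inspect a few rows:
some_rows = df.take(5)   # returns the first 5 rows as a list
df.limit(5).show()       # limits on the cluster before bringing rows back
```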
The Spark Application UI shows important facts about your Spark job:
Adapted from AWS Glue Spark UI docs and Spark UI docs
Problem: create a new column containing only the ages of the adult guests
+-------+--------------+
|room_id| guests_ages|
+-------+--------------+
| 1| [18, 19, 17]|
| 2| [25, 27, 5]|
| 3|[34, 38, 8, 7]|
+-------+--------------+
Adapted from UDFs in Spark
```{python}
from pyspark.sql.functions import udf, col

# declare the return type with a DDL-formatted type string
@udf("array<integer>")
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))

# alternatively, declare the return type with type objects
from pyspark.sql.types import IntegerType, ArrayType

@udf(returnType=ArrayType(IntegerType()))
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))
```
+-------+----------------+------------+
|room_id| guests_ages | adults_ages|
+-------+----------------+------------+
| 1 | [18, 19, 17] | [18, 19]|
| 2 | [25, 27, 5] | [25, 27]|
| 3 | [34, 38, 8, 7] | [34, 38]|
| 4 |[56, 49, 18, 17]|[56, 49, 18]|
+-------+----------------+------------+
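The result above can be produced by applying the UDF with withColumn; a minimal sketch, assuming the input DataFrame is named df:

```{python}
# add the adults-only column by applying the UDF to the guests_ages column
df.withColumn("adults_ages", filter_adults(col("guests_ages"))).show()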
```{python}
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# define the function so that it can be tested locally
def squared(s):
    return s * s

# wrap the function in udf for Spark and define the output type
squared_udf = udf(squared, LongType())

# execute the udf (display() is Databricks-specific; use .show() elsewhere)
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))
```
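The same function can also be registered for use directly in Spark SQL; a minimal sketch (the registered name squared_sql is an illustrative assumption):

```{python}
# register the Python function under a SQL-callable name
spark.udf.register("squared_sql", squared, LongType())
spark.sql("SELECT id, squared_sql(id) AS id_squared FROM test").show()
```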
Costs:
Other ways to make your Spark jobs faster (source):
From the PySpark docs: Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using pandas_udf as a decorator or to wrap the function, and no additional configuration is required. A Pandas UDF behaves as a regular PySpark function API in general.
```{python}
import pandas as pd
from pyspark.sql.functions import pandas_udf

# split a column of full names into two string columns
@pandas_udf("first string, last string")
def split_expand(s: pd.Series) -> pd.DataFrame:
    return s.str.split(expand=True)

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(split_expand("name")).show()
# +------------------+
# |split_expand(name)|
# +------------------+
# |       [John, Doe]|
# +------------------+
```
Input of the user-defined function:
Output of the user-defined function:
Grouping semantics:
Output size:
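These dimensions distinguish the Pandas UDF variants. For contrast with the Series-to-DataFrame example above, a minimal Series-to-Series sketch (the function name times_two is an illustrative assumption): the input and output are both pandas Series, there are no grouping semantics, and the output has the same size as the input.

```{python}
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series -> Series: each batch of the input column arrives as a pandas Series,
# and the returned Series must have the same length as the input batch
@pandas_udf("long")
def times_two(s: pd.Series) -> pd.Series:
    return s * 2

df = spark.range(3)
df.select(times_two("id")).show()
```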
