Data Cleaning with Python

Extra Writeups
Author
Affiliation

Jeff Jacobs

Published

September 26, 2024

Links
import pandas as pd
import numpy as np
Source: Data Cleaning with Python

(1) Overview

A series of longitudinal studies in the Journal of the American Medical Association (JAMA), most recently Reuben et al. (2019) direct link, have found consistent and robust associations between childhood exposure to lead and adult psychopathology:

In this multidecade, longitudinal study of lead-exposed children, higher childhood blood lead level was associated with greater psychopathology across the life course and difficult adult personality traits. Childhood lead exposure may have long-term consequences for adult mental health and personality. (Reuben et al. 2019)

Though we won’t explore that association directly here, I wanted to gather two datasets that would be relevant for studying it, to demonstrate some basic data-cleaning steps using Python libraries like pandas and openpyxl.

If you’re interested in applying your DSAN 5000 skills to the healthcare sphere, definitely consider studying this issue! I have a close friend, for example, who teaches at a school in East Oakland which is currently facing a lead contamination crisis. As a third dataset on top of the two we clean here, for example, you could try finding data on income by neighborhood—there is a reason why lead contamination is continually “discovered” at schools in neighborhoods like East Oakland and West Baltimore, and not at schools in neighborhoods like Georgetown.

(3) All Aboard the SHIP

SHIP stands (weirdly) for State Health Insurance Assistance Program, a program in the state of Maryland which subsidizes health insurance payments for low-income healthcare recipients. You can find the full info and codebook for the SHIP data here

Here we also see a third way to use glob(): somewhat like tab completion in the Linux shell, we can type the beginning of the filename followed by the wildcard character * to obtain the full filename we need. Note that glob.glob() always returns a list, so we need to add [0] at the end to access the first result (even if there is only one result in total!).

ship_fpath = glob.glob("data/SHIP*.csv")[0]
ship_df = pd.read_csv(ship_fpath)
ship_df.head()
Jurisdiction Value Race/ ethnicity Year Measure
0 State 4291.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
1 Allegany 3309.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
2 Anne Arundel 5734.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
3 Baltimore City 10093.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
4 Baltimore County 4210.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits

(4) Lead Levels Data from the CDC

Especially given

%pip install openpyxl
Requirement already satisfied: openpyxl in /Users/jpj/.pyenv/versions/3.11.5/lib/python3.11/site-packages (3.1.2)
Requirement already satisfied: et-xmlfile in /Users/jpj/.pyenv/versions/3.11.5/lib/python3.11/site-packages (from openpyxl) (1.1.0)
Note: you may need to restart the kernel to use updated packages.

And data on childhood lead levels can be found from the CDC website’s data portal. The most recent sample was conducted in 2021, the results of which are contained in the Excel-format dataset we load here.

xlsx_fpath = glob.glob("data/*.xlsx")[0]
places_df = pd.read_excel(xlsx_fpath, skiprows=3)
places_df.rename(columns={'Unnamed: 0': 'fips', 'Unnamed: 1': 'county'}, inplace=True)
places_df = places_df.rename(columns={'Unnamed: 0': 'fips', 'Unnamed: 1': 'county'})
rename_map = {
    'Unnamed: 0': 'fips',
    'Unnamed: 1': 'county',
    #'Unnamed: 2': 'child_pop',
    'Percent': 'pct_lead'
}
places_df.rename(columns=rename_map, inplace=True)
cols_to_keep = list(rename_map.values())
places_df = places_df[cols_to_keep]
places_df
fips county pct_lead
0 001 Allegany County 0.020888
1 003 Anne Arundel County 0.003220
2 005 Baltimore County 0.009980
3 009 Calvert County 0.006550
4 011 Caroline County 0.019973
5 013 Carroll County 0.006377
6 015 Cecil County 0.006893
7 017 Charles County 0.002288
8 019 Dorchester County 0.027439
9 021 Frederick County 0.003054
10 023 Garrett County NaN
11 025 Harford County 0.003109
12 027 Howard County 0.008291
13 029 Kent County NaN
14 031 Montgomery County 0.003826
15 033 Prince George's County 0.012376
16 035 Queen Anne's County NaN
17 037 St. Mary's County NaN
18 039 Somerset County 0.015837
19 041 Talbot County 0.012442
20 043 Washington County 0.008881
21 045 Wicomico County 0.013187
22 047 Worcester County 0.012959
23 510 Baltimore (city) County 0.041661
24 NaN Unknown NaN
25 NaN NaN NaN
26 Notes: 'N/A' indicates data are supressed when... NaN NaN
27 Population estimates calculated as population ... NaN NaN
28 Data received and processed by CDC as of April... NaN NaN

First, we note that only the rows from index 0 to index 23 contain the data we want (the rest contain footnotes added to the cells below the main dataset). So, we use the .iloc accessor from Pandas to “slice” just these first 24 rows, keeping in mind that Python’s slice operator : is inclusive-exclusive, meaning that:

  • The number before the : will be the first index included, but
  • The number after the : will be one greater than the last index included
places_df = places_df.iloc[0:24].copy()
places_df
fips county pct_lead
0 001 Allegany County 0.020888
1 003 Anne Arundel County 0.003220
2 005 Baltimore County 0.009980
3 009 Calvert County 0.006550
4 011 Caroline County 0.019973
5 013 Carroll County 0.006377
6 015 Cecil County 0.006893
7 017 Charles County 0.002288
8 019 Dorchester County 0.027439
9 021 Frederick County 0.003054
10 023 Garrett County NaN
11 025 Harford County 0.003109
12 027 Howard County 0.008291
13 029 Kent County NaN
14 031 Montgomery County 0.003826
15 033 Prince George's County 0.012376
16 035 Queen Anne's County NaN
17 037 St. Mary's County NaN
18 039 Somerset County 0.015837
19 041 Talbot County 0.012442
20 043 Washington County 0.008881
21 045 Wicomico County 0.013187
22 047 Worcester County 0.012959
23 510 Baltimore (city) County 0.041661
places_df['county'] = places_df['county'].str.replace(" County","")
places_df
fips county pct_lead
0 001 Allegany 0.020888
1 003 Anne Arundel 0.003220
2 005 Baltimore 0.009980
3 009 Calvert 0.006550
4 011 Caroline 0.019973
5 013 Carroll 0.006377
6 015 Cecil 0.006893
7 017 Charles 0.002288
8 019 Dorchester 0.027439
9 021 Frederick 0.003054
10 023 Garrett NaN
11 025 Harford 0.003109
12 027 Howard 0.008291
13 029 Kent NaN
14 031 Montgomery 0.003826
15 033 Prince George's 0.012376
16 035 Queen Anne's NaN
17 037 St. Mary's NaN
18 039 Somerset 0.015837
19 041 Talbot 0.012442
20 043 Washington 0.008881
21 045 Wicomico 0.013187
22 047 Worcester 0.012959
23 510 Baltimore (city) 0.041661
places_df['county'] = places_df['county'].str.replace('(city)', 'City', regex=False)
places_df
fips county pct_lead
0 001 Allegany 0.020888
1 003 Anne Arundel 0.003220
2 005 Baltimore 0.009980
3 009 Calvert 0.006550
4 011 Caroline 0.019973
5 013 Carroll 0.006377
6 015 Cecil 0.006893
7 017 Charles 0.002288
8 019 Dorchester 0.027439
9 021 Frederick 0.003054
10 023 Garrett NaN
11 025 Harford 0.003109
12 027 Howard 0.008291
13 029 Kent NaN
14 031 Montgomery 0.003826
15 033 Prince George's 0.012376
16 035 Queen Anne's NaN
17 037 St. Mary's NaN
18 039 Somerset 0.015837
19 041 Talbot 0.012442
20 043 Washington 0.008881
21 045 Wicomico 0.013187
22 047 Worcester 0.012959
23 510 Baltimore City 0.041661
ship_df['Year'] == 2017
0        True
1        True
2        True
3        True
4        True
        ...  
1245    False
1246    False
1247    False
1248    False
1249    False
Name: Year, Length: 1250, dtype: bool
ship_df = ship_df.loc[ship_df['Year'] == 2017]
ship_df = ship_df[ship_df['Jurisdiction'] != "State"]
ship_df = ship_df[ship_df['Race/ ethnicity'] == "All races/ ethnicities (aggregated)"]
ship_df
Jurisdiction Value Race/ ethnicity Year Measure
1 Allegany 3309.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
2 Anne Arundel 5734.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
3 Baltimore City 10093.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
4 Baltimore County 4210.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
5 Calvert 2999.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
6 Caroline 7556.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
7 Carroll 4216.0 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
8 Cecil 9584.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
9 Charles 2817.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
10 Dorchester 11251.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
11 Frederick 3064.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
12 Garrett 7967.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
13 Harford 3020.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
14 Howard 3082.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
15 Kent 13662.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
16 Montgomery 2312.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
17 Prince George's 1955.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
18 Queen Anne's 6119.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
19 Saint Mary's 6173.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
20 Somerset 2696.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
21 Talbot 7661.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
22 Washington 5410.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
23 Wicomico 2897.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
24 Worcester 3502.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
ship_df
Jurisdiction Value Race/ ethnicity Year Measure
1 Allegany 3309.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
2 Anne Arundel 5734.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
3 Baltimore City 10093.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
4 Baltimore County 4210.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
5 Calvert 2999.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
6 Caroline 7556.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
7 Carroll 4216.0 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
8 Cecil 9584.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
9 Charles 2817.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
10 Dorchester 11251.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
11 Frederick 3064.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
12 Garrett 7967.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
13 Harford 3020.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
14 Howard 3082.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
15 Kent 13662.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
16 Montgomery 2312.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
17 Prince George's 1955.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
18 Queen Anne's 6119.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
19 Saint Mary's 6173.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
20 Somerset 2696.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
21 Talbot 7661.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
22 Washington 5410.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
23 Wicomico 2897.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
24 Worcester 3502.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
ship_df['Jurisdiction'] = ship_df['Jurisdiction'].str.replace("Baltimore County", "Baltimore")
ship_df
Jurisdiction Value Race/ ethnicity Year Measure
1 Allegany 3309.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
2 Anne Arundel 5734.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
3 Baltimore City 10093.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
4 Baltimore 4210.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
5 Calvert 2999.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
6 Caroline 7556.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
7 Carroll 4216.0 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
8 Cecil 9584.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
9 Charles 2817.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
10 Dorchester 11251.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
11 Frederick 3064.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
12 Garrett 7967.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
13 Harford 3020.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
14 Howard 3082.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
15 Kent 13662.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
16 Montgomery 2312.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
17 Prince George's 1955.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
18 Queen Anne's 6119.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
19 Saint Mary's 6173.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
20 Somerset 2696.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
21 Talbot 7661.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
22 Washington 5410.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
23 Wicomico 2897.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
24 Worcester 3502.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
ship_df['Jurisdiction'] = ship_df['Jurisdiction'].str.replace("Saint", "St.")
ship_df
Jurisdiction Value Race/ ethnicity Year Measure
1 Allegany 3309.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
2 Anne Arundel 5734.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
3 Baltimore City 10093.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
4 Baltimore 4210.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
5 Calvert 2999.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
6 Caroline 7556.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
7 Carroll 4216.0 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
8 Cecil 9584.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
9 Charles 2817.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
10 Dorchester 11251.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
11 Frederick 3064.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
12 Garrett 7967.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
13 Harford 3020.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
14 Howard 3082.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
15 Kent 13662.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
16 Montgomery 2312.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
17 Prince George's 1955.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
18 Queen Anne's 6119.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
19 St. Mary's 6173.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
20 Somerset 2696.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
21 Talbot 7661.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
22 Washington 5410.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
23 Wicomico 2897.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
24 Worcester 3502.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
merged_df = places_df.merge(ship_df, left_on='county', right_on='Jurisdiction', how='left', indicator=False)
merged_df
fips county pct_lead Jurisdiction Value Race/ ethnicity Year Measure
0 001 Allegany 0.020888 Allegany 3309.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
1 003 Anne Arundel 0.003220 Anne Arundel 5734.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
2 005 Baltimore 0.009980 Baltimore 4210.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
3 009 Calvert 0.006550 Calvert 2999.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
4 011 Caroline 0.019973 Caroline 7556.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
5 013 Carroll 0.006377 Carroll 4216.0 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
6 015 Cecil 0.006893 Cecil 9584.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
7 017 Charles 0.002288 Charles 2817.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
8 019 Dorchester 0.027439 Dorchester 11251.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
9 021 Frederick 0.003054 Frederick 3064.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
10 023 Garrett NaN Garrett 7967.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
11 025 Harford 0.003109 Harford 3020.2 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
12 027 Howard 0.008291 Howard 3082.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
13 029 Kent NaN Kent 13662.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
14 031 Montgomery 0.003826 Montgomery 2312.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
15 033 Prince George's 0.012376 Prince George's 1955.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
16 035 Queen Anne's NaN Queen Anne's 6119.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
17 037 St. Mary's NaN St. Mary's 6173.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
18 039 Somerset 0.015837 Somerset 2696.1 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
19 041 Talbot 0.012442 Talbot 7661.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
20 043 Washington 0.008881 Washington 5410.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
21 045 Wicomico 0.013187 Wicomico 2897.6 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
22 047 Worcester 0.012959 Worcester 3502.8 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
23 510 Baltimore City 0.041661 Baltimore City 10093.5 All races/ ethnicities (aggregated) 2017 Mental Health ED visits
merged_df[['pct_lead','Value']].corr()
pct_lead Value
pct_lead 1.00000 0.59498
Value 0.59498 1.00000

References

Reuben, Aaron, Jonathan D. Schaefer, Terrie E. Moffitt, Jonathan Broadbent, Honalee Harrington, Renate M. Houts, Sandhya Ramrakha, Richie Poulton, and Avshalom Caspi. 2019. “Association of Childhood Lead Exposure With Adult Personality Traits and Lifelong Mental Health.” JAMA Psychiatry 76 (4): 418–25. https://doi.org/10.1001/jamapsychiatry.2018.4192.