import pandas as pd
import numpy as np
import seaborn as sns
Exploratory Data Analysis of the THOR Dataset in Seaborn
The recording of me going through each step of this writeup in order is now available here. Because of the length of the video, please don’t think of it as something you need to watch from start to finish. Instead you can just use it as a resource where, if you encounter something in the writeup that you don’t understand, or if there is a particular cleaning step / plot that you want more information about, you can jump directly to that portion of the video.
(1) Lab Background
(1.1) The Vietnam / Second Indochinese War
A massive bombing campaign in Cambodia. Anything that flies on anything that moves.
–Henry Kissinger, US National Security Advisor, Dec. 9, 1970.
In 1965, the United States launched Operation Rolling Thunder, a years-long aerial assault on Vietnam which devastated the country and its people (killing over 3 million), irreversibly damaged its environment (with more extensive use of chemical weapons like Napalm than in any war before or since), its crops and forests (with 19 million gallons of Agent Orange sprayed over 500,000 acres of crops and 5 million acres of forests), and its infrastructure. As a campaign of genuinely unprecedented ferocity and destruction launched by a military superpower against a poor, mostly rural/agrarian country (only 20% of its citizens lived in cities at the time), its devastating effects are still felt to this day by those in Vietnam, Laos, and Cambodia.
(All of which is my not-so-subtle chance to plug, before moving onto the data analysis, the donation page for the HALO Trust: an org headquartered right here in NW DC which is leading the worldwide effort to defuse unexploded landmines in these countries, where for example 50,000 Laotians have been killed by these mines since the war “officially” ended)
(1.2) The Role of Data Science in the Devastation
As an example of a war where modern science and technology was seen as a central military resource (with chemistry research leading to the deployment of Agent Orange and Napalm, computer science research used to track and target individuals via early instances of machine learning models, and social science research used to e.g. justify driving villagers from their homes into concentration camps called “Strategic Hamlets”), it is an important piece of history for data scientists to grapple with: how the technologies we develop could be used to help people, but also could be used to inflict unimaginable horrors upon millions of innocent civilians. Datasets pertaining to this war, many of which have only become publicly/easily-available in the past few years, provide the perfect opportunity to use data science itself to study this example of a military “data science project” of a prior generation.
(1.3) EDA and Bayesian Statistics: Challenging Prior Beliefs to Derive Posterior Beliefs
Although the entire point of EDA in theory is to go in “without any assumptions” (but see below/slides), I’ll give a quick summary of the impression(s) I came away with after the cleaning, tabulating, and visualizing below, in combination with what I knew about it beforehand, so (a) you know what to expect going in and/or (b) you can use it as a “prior” that you update while working through the visualizations (challenging received assumptions, rather than pretending we have none, should be the goal of EDA imo!)
Briefly: although Operation Rolling Thunder is often cited as the beginning of what people in the US usually call the “Vietnam War”, by looking at the basic collection of tabulations and visualizations herein it slowly emerges that probably the most horrifying and unprecedented destruction was experienced not even by those in Vietnam itself but by those in Laos over the timespan represented in the dataset (and, after a US-sponsored coup in 1970, Cambodia as well). I knew beforehand, for example, that Laos was the most-bombed (per square meter) territory in history, but in terms of the goals of EDA I definitely found my priors about the Vietnam War challenged by what the tables/plots seemed to be “saying” about the underlying historical event.
As a final piece of information to have as part of your prior knowledge, though, before we dive in: Keep in mind that although the numbers we’ll be examining are mostly in absolute units such as the number of bombings, they should be interpreted in the context of the relative populations of each country in the dataset, at the beginning of the campaign:
Country | Population | Year | Source |
---|---|---|---|
Cambodia | 6.2 million | 1965 | World Bank |
Laos | 2.4 million | 1965 | World Bank |
Democratic Republic of Vietnam (North Vietnam) | 17.7 million | 1965 | CIA Intelligence Report (via CREST), 1966, p. 26 |
Republic of Vietnam (South Vietnam) | 16 million | End of 1964 | CIA Intelligence Report (via CREST), 1968, p. 11 |
(1.4) Dataset Info
The dataset we’ll be analyzing is called the Vietnam War Theater History of Operations (THOR) dataset, released by the US Department of Defense in 2015. The raw data is too large (~1.5GB) to easily download and work with, so I’ll be using a reduced-form version below, but the full dataset is available at the previous link (from data.world).¹
(2) Loading the Data
As you should slowly be getting in the habit of doing, we start by importing:
- Pandas (for dataset storage and manipulation),
- NumPy (for any mathematical transformations we might want to apply), and
- Seaborn (for visualization, whether EDA-focused or otherwise)
Next we download the dataset. Technically Pandas’ read_csv() function does support URL arguments, so we could load the dataset directly into Pandas via URL. But since we often open and close notebooks (or Jupyter crashes), to avoid re-downloading the data each time you can run the following cell, which should download the .csv file into the same directory as wherever this notebook is stored (on Colab, for example, the same portion of your Google Drive that Colab allocates for the storage of the notebook).
Since we’ve already used Python’s requests
library before, for data scraping, we’ll use that here to quickly request and save the file:
import requests
= "https://jpj.georgetown.domains/dsan5000-scratch/eda/thor_strikes.csv"
thor_url with open("thor_strikes.csv", 'wb') as outfile:
= requests.get(thor_url, stream=True).content
csv_content outfile.write(csv_content)
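(Optionally, you could wrap the download in a file-existence check so that re-running the cell doesn’t re-fetch the file at all; a minimal sketch, reusing the same URL and filename as above:)

# Optional tweak (a sketch): skip the download entirely if the file is already
# on disk, so re-running this cell after a kernel restart costs nothing
import os
import requests

thor_url = "https://jpj.georgetown.domains/dsan5000-scratch/eda/thor_strikes.csv"
if not os.path.isfile("thor_strikes.csv"):
    with open("thor_strikes.csv", 'wb') as outfile:
        outfile.write(requests.get(thor_url, stream=True).content)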
If the download was successful, you should have a thor_strikes.csv
file in the same folder as this notebook (in Colab you can check the file tab in the sidebar on the left side of the page). We can then use pd.read_csv()
to load the .csv
-formatted dataset, and we can re-load the dataset again in the future without having to worry about re-downloading it.
strike_df = pd.read_csv("thor_strikes.csv")
Another good habit you can get into is always checking the output of the .head()
function after loading the dataset, just to make sure that everything loaded as you expected. It usually doesn’t (😅)
strike_df.head()
MFUNC_DESC | MISSIONID | TGTCOUNTRY | THOR_DATA_VIET_ID | MSNDATE | SOURCEID | NUMWEAPONSDELIVERED | TGTTYPE | TGTLATDD_DDD_WGS84 | TGTLONDDD_DDD_WGS84 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | STRIKE | 1047 | LAOS | 4 | 1970-02-02 | 642780 | 2 | TRUCKS | 16.902500 | 106.014166 |
1 | STRIKE | 1407 | LAOS | 6 | 1970-11-25 | 642782 | 6 | AAA\37MM CR MORE | 19.602222 | 103.597222 |
2 | STRIKE | 9064 | LAOS | 7 | 1972-03-08 | 642783 | 0 | TRUCKS | 14.573611 | 106.689722 |
3 | STRIKE | 8630 | LAOS | 16 | 1971-05-12 | 642792 | 4 | TRK\PRK\STORE AREA | 17.563611 | 105.756666 |
4 | STRIKE | 1391 | LAOS | 18 | 1971-12-19 | 642794 | 0 | PERSONNEL\ANY | 16.864166 | 105.349166 |
In our case, though, the output looks pretty much as expected, in terms of Pandas being able to automatically detect the header row and the splits between columns. So, let’s move on to the EDA!
(3) Missing Data
First things first: before we worry about missing data itself, we should think about which variables we want to analyze, since then we only need to check for missing values within the columns corresponding to those variables.
The official 59-page codebook for the dataset, an absolutely crucial thing to have on hand during any data analysis, can be found at this link. To spare you from reading a 59-page document, though, I’ll now introduce a shortened, simplified codebook.
In our case, since we’re doing exploratory data analysis, in theory we shouldn’t have any hypotheses in mind yet. In reality we do have hypotheses in our heads, subconsciously at least, so when I say that I just mean that we’re not directly trying to confirm or deny these hypotheses at this stage. We are just hoping to question/interrogate them as we go along…
Keep in mind Tukey’s metaphor, that:
- EDA = detective work, collecting evidence for a future trial, while
- CDA = the trial itself!
I’m just going to focus on a few key variables which are central to understanding the processes underlying the data (the destruction of the countries across former Indochina). Let’s also think about what type of data each column represents—categorical vs. numeric, ordinal vs. cardinal, etc. Just like in the lab we can then use this info, as recorded in the following table, as our codebook for understanding what each variable represents:
Variable Name | Description | Variable Type | Datatype |
---|---|---|---|
TGTCOUNTRY | This should tell us, for a given strike, the country within whose borders the strike was carried out. | Categorical | string |
MSNDATE | This should tell us the date (in Y/M/D format) that the strike was carried out. | Discrete numeric (date) | object (!) (see below) |
TGTTYPE | This should tell us the type of target, e.g., whether it was a vehicle, a group of people, an individual person, etc. | Categorical | string |
TGTLATDD_DDD_WGS84 | The latitude coordinate of the target | Continuous | float |
TGTLONDDD_DDD_WGS84 | The longitude coordinate of the target | Continuous | float |
As our first EDA task, let’s look into the first variable in the table above: TGTCOUNTRY
. As the section header suggests, we’re going to see if this column has any missing data and, if so, what the missing data means and what to do about it.
[Crucial point here:] If we started using Pandas methods with their default arguments, we might get to thinking that nothing is wrong here, and move on from missing data checks… Take a look at what the value_counts()
function returns when we use this function to look at the distribution of values in TGTCOUNTRY
without including any additional parameters:
strike_df['TGTCOUNTRY'].value_counts()
TGTCOUNTRY
LAOS 690161
NORTH VIETNAM 245977
SOUTH VIETNAM 54253
CAMBODIA 13271
THAILAND 146
UNKNOWN 23
WESTPAC WATERS 6
Name: count, dtype: int64
All seems well from this table: we might think “ok, great, they have encoded the 23 missing values from the original non-digitized data as "UNKNOWN"
, and all the other values are interpretable non-missing labels, so we’re good right?”
Sadly we’re NOT GOOD 😭. Let’s check one more thing, to slowly move towards the hidden issue here. Let’s start by summing up the counts that value_counts()
has given us for each value:
value_counts_sum = strike_df['TGTCOUNTRY'].value_counts().sum()
value_counts_sum
1003837
Next let’s take a look at the total number of rows and columns in our DataFrame
, by checking the shape
attribute
[Here it’s important to note that, unlike most of the other attributes that we use to analyze DataFrame
objects, shape
is not a function but a static variable, an attribute of the DataFrame
object. That’s why we just use .shape
rather than .shape()
to access this information.]
num_rows = strike_df.shape[0]
num_rows
1007674
Do you see the issue? If not, let’s check the difference between these two values:
num_rows - value_counts_sum
3837
This tells us, to our horror, that there are 3,837 rows in our DataFrame
which are completely unaccounted-for in the output produced by value_counts()
😰
And the reason is, as it turns out, the value_counts()
function excludes missing values by default, so that the above output actually tells us absolutely nothing about whether or not there are missing values in the column! Scary stuff.
There is an easy fix, however: we can just include the dropna = False
argument in our call to the value_counts()
function, which will override this default behavior and show missing values as one of the possible values in the table of counts:
strike_df['TGTCOUNTRY'].value_counts(dropna = False)
TGTCOUNTRY
LAOS 690161
NORTH VIETNAM 245977
SOUTH VIETNAM 54253
CAMBODIA 13271
NaN 3837
THAILAND 146
UNKNOWN 23
WESTPAC WATERS 6
Name: count, dtype: int64
And now we see that, in fact, there are thousands of strikes for which no target country was recorded in the dataset: about 167 times more missing data than we originally thought when we only saw that 23 of the rows had a TGTCOUNTRY
value of "UNKNOWN"
.
So, what can we do about this? In later weeks of the course and/or in later weeks of DSAN5100 we’ll learn about some more advanced methods for imputing missing values, for example by explicitly modeling the data-generating process that led some of the values to be missing. Until we know how to do that, however, for the sake of moving on with the EDA we will ignore these rows when we move onto other EDA tasks, keeping in mind that this is NOT acceptable practice in general for data science, outside of tutorial examples like this!!
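In code, that “ignore them for now” approach could look something like the minimal sketch below (strike_nonmissing_df is just a hypothetical name; in the rest of the writeup I’ll mostly keep working with strike_df and let value_counts()/groupby() quietly drop the NaN rows for me):

# A minimal sketch of the "ignore for now" approach: keep only the rows where
# TGTCOUNTRY was actually recorded. (Fine for a tutorial; in a real analysis,
# dropping rows like this can bias every downstream result.)
strike_nonmissing_df = strike_df[strike_df['TGTCOUNTRY'].notna()].copy()
print(len(strike_df) - len(strike_nonmissing_df))  # should print 3837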
Before we move to those tasks, however, let’s quickly look at how we can visualize the distribution of missing values in this column (along with the other columns), using the third-party missingno
package mentioned in the lab instructions!
!pip install missingno
Requirement already satisfied: missingno in /Users/jpj/.pyenv/versions/3.11.5/lib/python3.11/site-packages (0.5.2)
I don’t want to give away the answer to this portion of the lab by producing a missing-data matrix directly, so instead I will use a different but still very helpful function from missingno
, the msno.bar()
function, to visualize missingness across the different variables in our DataFrame
:
import missingno as msno
msno.bar(strike_df)
Although we see in hindsight that the visual properties of this plot would not have been that useful in discovering the missing values in TGTCOUNTRY
(I say “visual properties” because it still helpfully provides the non-missing counts at the top of each bar, which could have allowed us to see this issue), the plot is very useful for discovering the missingness prevalent in the TGTTYPE
variable. So, just like in the case of TGTCOUNTRY
, we will move on while keeping in mind that this variable has an even more extreme issue with missingness that will bias our results (until we learn how to explicitly incorporate and de-bias the missingness!)
(4) Different Formats Within The Same Column
(4.1) The Ambiguous object Type
As our next EDA task, let’s look into that (!) in the table above: while the rest of the variables are in fairly unambiguous formats, object
is a particularly scary datatype in Pandas, since “object” in fact just means that the datatype of a given value in this column could be anything. If you know Java, for example, you know that Object
is the datatype that all other datatypes are subtypes of. Similarly, in Python, every class implicitly inherits the properties of the object
class.
Looking at the output of head()
from above, however, you may reasonably think that the entries in this column will all be of string
type. And this instinct would be further reinforced by checking the datatypes of the first five entries in the column:
strike_df['MSNDATE'].head().apply(type)
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>
4 <class 'str'>
Name: MSNDATE, dtype: object
Therefore, you might start moving ahead, working with this column as if it is a string column. You should push back on this instinct!! It will save you so many headaches later on, when you are neck-deep in the data analysis, if you verify what’s going on with any columns that Pandas loaded as object
type right at the beginning!
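One quick way to see which columns deserve that scrutiny is to list everything Pandas loaded as object; a small sketch using select_dtypes():

# List every column that Pandas loaded with the ambiguous `object` dtype,
# so we know which ones to verify before leaning on any assumptions about them
object_cols = strike_df.select_dtypes(include='object').columns
print(list(object_cols))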
For example, rather than drawing inferences about the entire dataset from only the first five rows (a procedure jokingly referred to as “engineer’s induction”), we can do something a bit less error-prone by making a new column MSNDATE_type
which just tells us the type of each value in the MSNDATE
column, then checking whether or not every entry in MSNDATE_type
is string
. If it is, then we’re fine adopting the it’s-a-string-variable assumption and moving on. But if it’s not, we’ll have to do something to handle the mismatch. Let’s see what happens. We’ll make the column, then we’ll use the value_counts()
function from Pandas to see the distribution of datatypes in the column (where now we remember that we pretty much always want to include the dropna = False
parameter):
strike_df['MSNDATE_type'] = strike_df['MSNDATE'].apply(type)
strike_df['MSNDATE_type'].value_counts(dropna=False)
MSNDATE_type
<class 'str'> 1007674
Name: count, dtype: int64
So far so good: Pandas loaded the column as type object
, which often means but does not always mean that the column is filled with string
variables. There is still one more issue with this column, however, looming on the horizon…
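To see why object does not guarantee strings, here is a tiny toy illustration (made-up values, not from THOR):

# An object-dtype Series can mix strings, ints, and None, so the dtype alone
# tells us very little about what each individual value actually is
mixed = pd.Series(["1970-02-02", 19700202, None])
print(mixed.dtype)                       # object
print(mixed.apply(type).value_counts())  # three different Python types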
(4.2) Splitting The Column Into Parts
Since in general we want to analyze the yearly volume of strikes in each country, the format of the values in the MSNDATE
column is not that helpful to us at the moment. So, again we might proceed by intuition, using engineer’s induction on the first five rows of the DataFrame
to infer that we can just split on the -
character to quickly divide this single column into three separate columns: one for year, one for month, and one for day.
Let’s see what happens when we try this out.
Writing the Parse Function
In Python, unlike the nice separate_wider_*()
functions in R that I talked about in this writeup, I personally haven’t found all that many helpful functions specially-made for splitting a string up into pieces.
So, what I usually do in this type of situation (but see a few sections below) is utilize the following setup: as a nice “trick” for generating multiple columns from one individual column, Pandas allows you to write functions which return pd.Series
objects (these Series
objects are sort of like one-dimensional DataFrame
s), and then use this function to set the values of multiple columns in one “sweep” using syntax like:
def function_returning_series_object(value):
    """
    A function that will take in a single value from a column and
    return a pd.Series object, where the **keys** in the pd.Series
    correspond to the **names of the columns** you're hoping to create
    """
    # Do something with `value` to generate these results
    resulting_x_value = 'first result'
    resulting_y_value = 'second result'
    resulting_z_value = 'third result'
    series_values = {
        'x': resulting_x_value,
        'y': resulting_y_value,
        'z': resulting_z_value
    }
    return pd.Series(series_values)

# Specify the names of the columns you're about to create
columns_to_create = ['x', 'y', 'z']
# And use the apply() function to "send" the values of some existing column
# to the function_returning_series_object() function defined above
df[columns_to_create] = df['some_existing_column'].apply(function_returning_series_object)
In our case, so you can see this in action, let’s define a function that will take in the values of the MSNDATE column (which we expect to have the format YYYY-MM-DD), split these values on the - character, and return the three resulting pieces in pd.Series form with the keys year, month, and day.
def split_date(date_val):
    date_elts = date_val.split("-")
    date_values = {
        'year': date_elts[0],
        'month': date_elts[1],
        'day': date_elts[2]
    }
    return pd.Series(date_values)
Quick Note: The tqdm Library
When we use the apply()
function from Pandas to apply our split_date()
function to each value of the MSNDATE
column, it will take kind of a long time to iterate through and parse each date. For this reason, I’m including example code here showing how to import and use the third-party Python library tqdm
to create a new progress_apply()
function. If the waiting feels tedious on your machine, for example, you can call progress_apply()
in the place of apply()
, which will display a progress bar showing how far along it is. In Colab this will work right away, since tqdm
comes pre-installed. Depending on your environment it may crash though, since some versions of tqdm
are incompatible with some versions of pandas
, so to be safe I’m sticking with the regular .apply()
function.
from tqdm.notebook import tqdm
tqdm.pandas()
Back to EDA
The running time of the progress_apply()
call will be bad enough, but as we’ll see, it gets even worse: right near the end, about 63% of the way through, the code will crash with an error! (You can uncomment the following code, run it, and wait a bit to see the error if you’d like, or just jump past the following cell)
## And apply it to MSNDATE
#columns_to_create = ['year','month','day']
#strike_df[columns_to_create] = strike_df['MSNDATE'].progress_apply(split_date)
The error looks like:
<ipython-input-35-ec79f5e7def4> in split_date(date_val)
3 date_values = {
4 'year': date_elts[0],
----> 5 'month': date_elts[1],
6 'day': date_elts[2]
7 }
IndexError: list index out of range
So what happened?
Long story short, engineer’s induction failed us again. It turns out that some entries within the MSNDATE
column contain dates in a different format from the format we saw in the first 5 rows of the dataset. But how did I magically figure this out?
Let’s start by carrying out Engineer’s Induction 2.0 (TM), by looking at the tail of strike_df
rather than the head. Sometimes, if the column has one format at the beginning, but another near the end, this will show us the alternative format:
strike_df.tail()
MFUNC_DESC | MISSIONID | TGTCOUNTRY | THOR_DATA_VIET_ID | MSNDATE | SOURCEID | NUMWEAPONSDELIVERED | TGTTYPE | TGTLATDD_DDD_WGS84 | TGTLONDDD_DDD_WGS84 | MSNDATE_type | |
---|---|---|---|---|---|---|---|---|---|---|---|
1007669 | STRIKE | 1405 | LAOS | 4559448 | 1973-01-01 | 58102 | 0 | PERSONNEL\ANY | 15.191111 | 106.206666 | <class 'str'> |
1007670 | STRIKE | 8719 | LAOS | 4560531 | 1972-05-08 | 68563 | 0 | NaN | 16.758055 | 106.461944 | <class 'str'> |
1007671 | STRIKE | 9526 | SOUTH VIETNAM | 4561255 | 1972-07-10 | 69287 | 0 | NaN | 16.684166 | 107.053888 | <class 'str'> |
1007672 | STRIKE | 0692 | LAOS | 4650354 | 1970-11-28 | 527864 | 0 | TRUCKS | 17.286388 | 105.609722 | <class 'str'> |
1007673 | STRIKE | 7232 | LAOS | 4651825 | 1971-10-19 | 541658 | 4 | NaN | 15.292777 | 107.137500 | <class 'str'> |
We’re not so lucky this time. The last 5 values of the MSNDATE
column look like they’re formatted the same way as the first 5 values are.
So, next let’s do something a bit smarter, and just check how many -
characters each entry has—we can code this check in a very similar way to how we coded the check above when we looked at the type
of each entry in the column.
In fact, here we don’t even need to come up with our own function and use apply()
or anything like that, since Pandas has built-in functionality allowing us to access (by using the .str
suffix after extracting just the string
-format column) any of Python’s built-in functions for string
s, like len()
or replace()
, at which point Pandas will then automatically apply our chosen function to each value in the column.
So, let’s use this .str
suffix to apply the built-in count()
function (that Python provides for all string
objects) to each value of MSNDATE
, counting the number of times that -
appears in each value, then look at the distribution of these counts across the entire column:
strike_df['MSNDATE_dashcount'] = strike_df['MSNDATE'].str.count("-")
strike_df['MSNDATE_dashcount'].value_counts()
MSNDATE_dashcount
2 993524
0 14150
Name: count, dtype: int64
And we’ve found our issue: 14,150 of the rows do not use a YYYY-MM-DD
format! Since we now have this MSNDATE_dashcount
column available to use, let’s filter the full DataFrame
to look at just (the first few) rows where it is equal to 0
, i.e., rows where the MSNDATE
column does not contain any dashes:
strike_df.loc[strike_df['MSNDATE_dashcount'] == 0,].head()
MFUNC_DESC | MISSIONID | TGTCOUNTRY | THOR_DATA_VIET_ID | MSNDATE | SOURCEID | NUMWEAPONSDELIVERED | TGTTYPE | TGTLATDD_DDD_WGS84 | TGTLONDDD_DDD_WGS84 | MSNDATE_type | MSNDATE_dashcount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
639026 | STRIKE | JH301 | SOUTH VIETNAM | 2876015 | 19700409 | 258195 | 12 | AREA\DEPOT | 14.825953 | 107.716270 | <class 'str'> | 0 |
639027 | STRIKE | JH309 | SOUTH VIETNAM | 2876638 | 19700216 | 262548 | 24 | AREA\DEPOT | 14.510215 | 108.768815 | <class 'str'> | 0 |
639028 | STRIKE | 01305 | SOUTH VIETNAM | 2876852 | 19651029 | 221263 | 0 | UNKNOWN\UNIDENTIFIED | 10.186505 | 106.229604 | <class 'str'> | 0 |
639029 | STRIKE | JG402 | SOUTH VIETNAM | 2877103 | 19701114 | 219532 | 24 | AREA\DEPOT | 16.331801 | 107.012509 | <class 'str'> | 0 |
639030 | STRIKE | J8012 | LAOS | 2877181 | 19700721 | 219801 | 48 | TRUCK PARK\STOP | 16.286964 | 106.922250 | <class 'str'> | 0 |
And we see that, to our relief, the issue isn’t so bad: if engineer’s induction does hold for this situation, then (hopefully) we can split those 14,150 rows without any dashes by just taking the MSNDATE
values in these rows and directly extracting the first 4 characters as the year
, the next 2 characters as the month
and the final 2 characters as the day
. As an additional sanity check here, since we are using engineer’s induction, we can check the full range of possible lengths of the MSNDATE
values:
strike_df['MSNDATE_len'] = strike_df['MSNDATE'].str.len()
strike_df['MSNDATE_len'].value_counts()
MSNDATE_len
10 993524
8 14150
Name: count, dtype: int64
And we see another encouraging sign: that there is not (for example) some third length that the values of this column can have, that we’d need to worry about on top of the two different formats we’ve found.
(There still could be more than two formats, if the additional formats happened to produce date strings with length 8 or 10, but we’ll worry about that if we experience more issues with splitting MSNDATE
under the assumption of two formats)
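If you want one more check in that spirit, something like the following sketch would at least confirm that every 8-character value is purely numeric (I’m assuming here that an all-digit 8-character value really is a YYYYMMDD date):

# Extra hedged sanity check: confirm that every 8-character MSNDATE value is
# purely numeric, i.e. at least *consistent* with a YYYYMMDD reading
eight_char_dates = strike_df.loc[strike_df['MSNDATE_len'] == 8, 'MSNDATE']
print(eight_char_dates.str.isdigit().all())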
Re-Coding the Date Parsing Function: v2.1
So let’s make an updated version of our split_date()
function, which explicitly checks whether the date it is given is encoded in the 8-character or 10-character format, and does the splitting accordingly:
def split_date_v2_1(date_val):
    if len(date_val) == 10:
        # Do what we did before, using .split()
        date_elts = date_val.split("-")
        date_values = {
            'year': date_elts[0],
            'month': date_elts[1],
            'day': date_elts[2]
        }
    else:
        # Assume it is in YYYYMMDD format, so that we can
        # just extract substrings to obtain Y, M, and D
        date_values = {
            'year': date_val[0:4],
            'month': date_val[4:6],
            'day': date_val[6:8]
        }
    return pd.Series(date_values)
And now we could run this using the same .apply()
code as before, to see if it successfully parses the column. In the first version of the writeup I did run it, but it took about 5 minutes, which is too long for a reasonable quick-learning writeup, imo! (You could just go ahead and create a cell containing the following code, and run it, if you are ok waiting for 5 mins:)
columns_to_create = ['year','month','day']
strike_df[columns_to_create] = strike_df['MSNDATE'].progress_apply(split_date_v2_1)
For me, though, 5 minutes is too slow, so here is a screenshot of the output after this call to progress_apply()
has finished:
Another Quick Note: The swifter Library
Like last time, this may take a while to complete: cases like these are where I highly recommend using tqdm
, at a minimum!
There is also the third-party library swifter
, which would be nice for the kinds of embarrassingly parallel tasks we’re applying to the column here, except that (as far as I can tell) swifter
is only able to provide speedups for numeric operations :/ I thought I’d mention it here, however, since if you are doing some numeric operation that’s taking this long, you might try seeing if swifter
will speed up your code!
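For reference, here is a hedged sketch of how swifter is typically invoked (left commented out, matching how the other slow/optional cells above are handled, since it isn’t installed in every environment):

# Hedged sketch of swifter usage: importing it registers a .swifter accessor
# on Series/DataFrames, so you swap .apply() for .swifter.apply()
# !pip install swifter
# import swifter
# strike_df[columns_to_create] = strike_df['MSNDATE'].swifter.apply(split_date_v2_1)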
And as an example of the vast differences that can arise between efficient and inefficient versions of processing code, even in EDA: since I really only need to extract the year (rather than the three elements of each date) for the remaining analysis, for speed I used the following lambda
function instead, which just scoops out the first 4 characters of the MSNDATE
value (which works because, in either format, the first 4 characters of the string contain the year), and it took approximately 1 second to finish:
# Depending on your Jupyter environment, you may need to install/update the
# `ipywidgets` library for `tqdm` to work correctly
#!pip install ipywidgets
# progress_apply() version (with progress bar)
#strike_df['year'] = strike_df['MSNDATE'].progress_apply(lambda x: x[0:4])
# apply() version
strike_df['year'] = strike_df['MSNDATE'].apply(lambda x: x[0:4])
Finally, now that we’ve extracted just the year of the mission into its own column, let’s convert that column from string
format into int
format. There are a lot of ways to do this type of conversion, but for me the pd.to_numeric()
function is most useful for several reasons, so I’ll use that:
strike_df['year'] = pd.to_numeric(strike_df['year'])
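One of those reasons, as a quick hedged aside on toy data, is the errors parameter:

# With errors='coerce', any value that can't be parsed becomes NaN instead of
# raising an exception -- often the forgiving behavior you want during EDA
print(pd.to_numeric(pd.Series(['1968', '1969', 'unknown']), errors='coerce'))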
And as we can see by checking the .dtypes
attribute (as with shape, remember that dtypes is not a function!), year
now has the int64
(64-bit integer) format, as we wanted:
strike_df.dtypes
MFUNC_DESC object
MISSIONID object
TGTCOUNTRY object
THOR_DATA_VIET_ID int64
MSNDATE object
SOURCEID int64
NUMWEAPONSDELIVERED int64
TGTTYPE object
TGTLATDD_DDD_WGS84 float64
TGTLONDDD_DDD_WGS84 float64
MSNDATE_type object
MSNDATE_dashcount int64
MSNDATE_len int64
year int64
dtype: object
(5) Collapsing Messy Categorical Variables Into Clean Ones
The one variable we should still look at is TGTTYPE
: although we already saw how messy it was in the sense of having lots of missing values, it turns out that it is also messy in the sense of having lots and lots of potential values, some of which are very similar but not identical, so that (for example) trying to include this variable as a “factor variable” in a cross-tabulation or a two-way plot (see below) would result in a very messy plot with too many bars/lines/colors/etc.
This becomes clear if we run the following code, showing that there are 235 possible values that this categorical variable can take on:
strike_df['TGTTYPE'].value_counts()
TGTTYPE
AREA\DEPOT 118271
MOTOR VEHICLE 108533
TRUCK PARK\STOP 77840
ROAD 60494
ANTI-AIRCRAFT 52870
...
AMMO BLDG 1
CP 1
MOUNTAIN PASS 1
SAM OXIDIZER 1
MIG19 1
Name: count, Length: 235, dtype: int64
Since we’re focusing on EDA and visualization, though, let’s treat this value_counts()
table as a “mini-dataset” and generate an EDA-style visualization of how many instances of each possible category actually appear in the broader dataset. We’ll call this mini-dataset target_dist
, since it represents just the counts (and therefore, as a whole, the distribution) of the various possible target-type categories. Note, however, that because value_counts()
returns a pd.Series
object, not a pd.DataFrame
object, we’ll need to explicitly convert its return value into a pd.DataFrame
(plus turn the index column into a “regular” column and rename it from index
into something more informative) in order to use it with Seaborn.
target_dist_df = pd.DataFrame(strike_df['TGTTYPE'].value_counts()).reset_index()
target_dist_df.columns = ['TGTTYPE_val','count']
target_dist_df
TGTTYPE_val | count | |
---|---|---|
0 | AREA\DEPOT | 118271 |
1 | MOTOR VEHICLE | 108533 |
2 | TRUCK PARK\STOP | 77840 |
3 | ROAD | 60494 |
4 | ANTI-AIRCRAFT | 52870 |
... | ... | ... |
230 | AMMO BLDG | 1 |
231 | CP | 1 |
232 | MOUNTAIN PASS | 1 |
233 | SAM OXIDIZER | 1 |
234 | MIG19 | 1 |
235 rows × 2 columns
target_dist_df.iloc[:20]
TGTTYPE_val | count | |
---|---|---|
0 | AREA\DEPOT | 118271 |
1 | MOTOR VEHICLE | 108533 |
2 | TRUCK PARK\STOP | 77840 |
3 | ROAD | 60494 |
4 | ANTI-AIRCRAFT | 52870 |
5 | TROOPS | 47532 |
6 | BRIDGE | 28272 |
7 | UNKNOWN\UNIDENTIFIED | 27328 |
8 | BUILDINGS | 24453 |
9 | SEGMENT\TRANS ROUTE | 24181 |
10 | FORD | 13893 |
11 | WATER VEHICLES | 10845 |
12 | TRUCKS | 10224 |
13 | INTERDICTION POINT | 9414 |
14 | FERRY | 9121 |
15 | BIVOUAC | 8505 |
16 | ROADS\HIGHWAYS | 7762 |
17 | STORAGE AREA | 7762 |
18 | CAVE | 6350 |
19 | PERSONNEL\ANY | 6314 |
Now we can think about how to visualize this: since target_dist
represents a distribution (in particular, a distribution of counts), essentially we have already done the set of computations that the sns.displot()
function demonstrated in the very beginning of the W06 Lab Demo performs under the hood. Meaning: if we hadn’t already computed counts for each value of TGTTYPE
, we could use sns.displot()
with the original strike_df
dataset to visualize these counts. Since we’ve already computed counts, however, we can more straightforwardly just plot each count as a bar in a sns.barplot()
visualization (which will look identical to what sns.displot()
would generate given strike_df
).
(I mentioned the fact that target_dist
is a distribution of counts since, although displot()
can be used for both probability distributions and distributions of counts, the y-axis for plots of counts will be even easier to interpret than the y-axis for plots of probability mass values, since in the latter case the y-axis values will be “squished down” into decimals ranging from 0 to 1)
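For the curious, that displot()-on-raw-data route would look roughly like the sketch below (left commented out, matching how the slow cells above are handled; Seaborn would tally the per-category counts itself):

# g = sns.displot(data=strike_df, x='TGTTYPE')
# g.set(xticks=[], xticklabels=[])  # the 235 tick labels would be unreadable here too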
# Since Seaborn is built on top of `matplotlib`, we usually import these two libraries
# in the following order, and use the aliases `plt` and `sns`, respectively
import matplotlib.pyplot as plt
import seaborn as sns
sns.barplot(
    data=target_dist_df,
    x="TGTTYPE_val",
    y='count'
).set(xticks=[], xticklabels=[])
plt.show()
After the call to sns.barplot()
in the previous cell, notice that I’m also calling .set()
with the options xticks=[]
and xticklabels=[]
. This call with these options simply removes all ticks and tick labels on the x-axis of the plot, since otherwise it would have displayed a chaotic overlapping mass of 235 ticks and 235 labels. Without these ticks/labels, we are able to focus just on the distribution of the y-axis values, and especially the fact that they start to decrease rapidly after a certain point.
This rapid decrease tells us, for example, that we can probably either
- (a) aggregate the rarely-appearing values into broader categories, to see if they are (for example) just slightly-modified versions of the more-frequently-appearing values [the safer, more statistically-principled approach], or
- (b) ignore these values when we’re examining cross-tabulations and/or generating visualizations where data is grouped by these values.
Since we’re focusing on EDA, I’ll be employing approach (b), but please keep in mind that you might be throwing away important information by ignoring these less-frequently-occurring values (many examples come to mind, but to avoid making the writeup any longer I will just say, feel free to ask about these cases!)
Therefore, I’m going to (arbitrarily!) keep the top 10 most-frequently occurring values, and then collapse any values beyond this top 10 down into an 11th catch-all category called “other”:
# Extract just the 10 most-frequently occurring values
rank_cutoff = 10
most_frequent_values = target_dist_df['TGTTYPE_val'].iloc[:rank_cutoff].values
# Print them out (for future reference)
print(most_frequent_values)
# Then take the **set difference** between the **full set of values** and the
# **values that we're keeping**, and re-map these remaining values onto the single value
# "other"
all_values = target_dist_df['TGTTYPE_val'].values
non_frequent_values = [v for v in all_values if v not in most_frequent_values]
# Sanity check
num_non_frequent = len(non_frequent_values)
print(f"{num_non_frequent} non-frequent values")
# And apply the mapping to the original strike_df dataset
rename_map = {nfv: "other" for nfv in non_frequent_values}
strike_df['TGTTYPE'] = strike_df['TGTTYPE'].apply(lambda x: rename_map[x] if x in rename_map else x)
# Print out the new value_counts, to check that the mapping worked
strike_df['TGTTYPE'].value_counts()
['AREA\\DEPOT' 'MOTOR VEHICLE' 'TRUCK PARK\\STOP' 'ROAD' 'ANTI-AIRCRAFT'
'TROOPS' 'BRIDGE' 'UNKNOWN\\UNIDENTIFIED' 'BUILDINGS'
'SEGMENT\\TRANS ROUTE']
225 non-frequent values
TGTTYPE
other 161206
AREA\DEPOT 118271
MOTOR VEHICLE 108533
TRUCK PARK\STOP 77840
ROAD 60494
ANTI-AIRCRAFT 52870
TROOPS 47532
BRIDGE 28272
UNKNOWN\UNIDENTIFIED 27328
BUILDINGS 24453
SEGMENT\TRANS ROUTE 24181
Name: count, dtype: int64
(6) Tabulation and Cross-Tabulation
Now that we’re finally done with those fairly messy/tedious cleaning steps, let’s make some tables!
(6.1) Summary Statistics Table: .describe()
First, as in the other lab demo you have, let’s use .describe()
to create a table showing us some basic summary statistics about our now-cleaned variables. Your first instinct might be to just call strike_df.describe()
, as in the following code cell, but that’s another instinct you should probably push back on:
strike_df.describe()
THOR_DATA_VIET_ID | SOURCEID | NUMWEAPONSDELIVERED | TGTLATDD_DDD_WGS84 | TGTLONDDD_DDD_WGS84 | MSNDATE_dashcount | MSNDATE_len | year | |
---|---|---|---|---|---|---|---|---|
count | 1.007674e+06 | 1.007674e+06 | 1.007674e+06 | 969931.000000 | 969931.000000 | 1.007674e+06 | 1.007674e+06 | 1.007674e+06 |
mean | 2.264882e+06 | 1.712625e+06 | 1.202383e+01 | 17.221362 | 105.875013 | 1.971916e+00 | 9.971916e+00 | 1.969216e+03 |
std | 1.278396e+06 | 1.321200e+06 | 5.958607e+01 | 1.801880 | 1.249900 | 2.353301e-01 | 2.353301e-01 | 1.738883e+00 |
min | 4.000000e+00 | 3.000000e+00 | 0.000000e+00 | 0.247721 | 43.416666 | 0.000000e+00 | 8.000000e+00 | 1.965000e+03 |
25% | 1.164722e+06 | 5.796808e+05 | 0.000000e+00 | 16.430555 | 105.712000 | 2.000000e+00 | 1.000000e+01 | 1.968000e+03 |
50% | 2.264671e+06 | 1.454324e+06 | 4.000000e+00 | 17.125000 | 106.155833 | 2.000000e+00 | 1.000000e+01 | 1.969000e+03 |
75% | 3.218455e+06 | 2.593384e+06 | 1.200000e+01 | 18.000000 | 106.638000 | 2.000000e+00 | 1.000000e+01 | 1.970000e+03 |
max | 4.670411e+06 | 4.909001e+06 | 9.822000e+03 | 135.717675 | 167.500000 | 2.000000e+00 | 1.000000e+01 | 1.975000e+03 |
We see that this output looks kind of messy, if we don’t filter out non-numeric-variable columns like the id
variables or the _dashcount
and _len
variables we created as helper columns above. If we do take care to filter out those columns first, .describe()
provides us with more focused and more useful information.
First we drop the helper columns, since we don’t need these anymore now that MSNDATE
has been cleaned to our satisfaction. We use the inplace = True
argument to Pandas’ drop()
function to indicate that we want to permanently drop these two columns:
cols_to_drop = [
    'MSNDATE_len',
    'MSNDATE_dashcount',
    'MSNDATE_type'
]
strike_df.drop(columns = cols_to_drop, inplace = True)
And now we specify just the subset of numeric columns we’d like summaries of, and call .describe()
only on that subset:
# I break the column names onto separate lines here to make it easy for
# myself to go back and include/exclude columns from the call to describe():
# I just comment/uncomment the line corresponding to the column I want to
# hide/show, respectively
numeric_cols = [
    'NUMWEAPONSDELIVERED',
    'TGTLATDD_DDD_WGS84',
    'TGTLONDDD_DDD_WGS84',
    'year'
]
# pd.option_context() has kind of a strange syntax, but it just allows us to
# specify how to format each number in the result of the call to describe()
# without having to worry about permanently changing our Pandas display
# settings
with pd.option_context('display.float_format', '{:.2f}'.format):
display(strike_df[numeric_cols].describe())
NUMWEAPONSDELIVERED | TGTLATDD_DDD_WGS84 | TGTLONDDD_DDD_WGS84 | year | |
---|---|---|---|---|
count | 1007674.00 | 969931.00 | 969931.00 | 1007674.00 |
mean | 12.02 | 17.22 | 105.88 | 1969.22 |
std | 59.59 | 1.80 | 1.25 | 1.74 |
min | 0.00 | 0.25 | 43.42 | 1965.00 |
25% | 0.00 | 16.43 | 105.71 | 1968.00 |
50% | 4.00 | 17.12 | 106.16 | 1969.00 |
75% | 12.00 | 18.00 | 106.64 | 1970.00 |
max | 9822.00 | 135.72 | 167.50 | 1975.00 |
And already, just from this description, we can make some simple but important inferences: for example that 75% of the bombings were carried out before or during the year 1970, despite the fact that the bombing campaign as a whole did not end until 1975 (which we can see from the max value of year
)
(6.2) Cross-Tabulation of the Relationship Between Two Variables
If you googled “cross-tabulation in pandas” like the lab asks you to (which I hope you did in the days between the lab handout and this writeup!), you’ll find the Pandas function crosstab()
.
crosstab()
works similarly to describe()
except that, rather than computing summary statistics across the entire dataset, it computes a set of conditional summary statistics, so that we can use it e.g. to examine the distributions of a numeric variable (say, year
) for different values of a categorical variable (say, TGTCOUNTRY
). Here we use it with its default parameter settings to generate a table showing us the number of strikes within each cross-tabulated “bin”—that is, each (year,country)
pair where year
is some value in the year
column and country
is some value in the TGTCOUNTRY
column:
pd.crosstab(strike_df['year'], strike_df['TGTCOUNTRY'])
TGTCOUNTRY | CAMBODIA | LAOS | NORTH VIETNAM | SOUTH VIETNAM | THAILAND | UNKNOWN | WESTPAC WATERS |
---|---|---|---|---|---|---|---|
year | |||||||
1965 | 0 | 1197 | 2596 | 9238 | 0 | 0 | 0 |
1966 | 0 | 16825 | 24261 | 4918 | 11 | 0 | 0 |
1967 | 18 | 29738 | 69420 | 1004 | 6 | 0 | 0 |
1968 | 0 | 105046 | 108110 | 3763 | 81 | 19 | 6 |
1969 | 23 | 167010 | 729 | 1353 | 41 | 4 | 0 |
1970 | 6099 | 209273 | 1801 | 2684 | 7 | 0 | 0 |
1971 | 1130 | 107675 | 3342 | 7932 | 0 | 0 | 0 |
1972 | 1243 | 48531 | 34281 | 23303 | 0 | 0 | 0 |
1973 | 4758 | 4866 | 1437 | 54 | 0 | 0 | 0 |
1975 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
And again, especially with whatever domain knowledge we possess, we can detect patterns in this table that help us understand the underlying phenomena:
For example, we see that bombing runs in Cambodia were sporadic before 1970, but quickly jumped to a high of 6,099 in 1970, mirroring a turbulent event in Cambodian history that year
([very long story short,] Cambodia’s somewhat-popular leader Prince Sihanouk was overthrown in a US-sponsored coup in 1970 and replaced by the less-popular army general Lon Nol, leading the now-exiled-in-China Sihanouk to devote his energy towards supporting/accelerating communist [mostly, though not entirely, Khmer Rouge] resistance to the Lon Nol military dictatorship, which in turn led the US to begin bombing Khmer Rouge/other communist strongholds in support of the regime)
More relevant to our discussion of Laos in the beginning, though, we can also start to see how Laos became the most-bombed territory in history: unlike the other countries, wherein bombings had high “spikes” in some years but low-level “lulls” in other years, Laos was continually bombed over 100,000 times per year every year between 1968 and 1971 (inclusive).
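If you want to compare the timing of each country’s bombing peak rather than the raw scale, crosstab() also accepts a normalize argument; a quick hedged sketch:

# The same cross-tabulation as within-country *shares* rather than raw counts:
# normalize='columns' makes each country's column sum to 1
pd.crosstab(strike_df['year'], strike_df['TGTCOUNTRY'], normalize='columns').round(3)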
(7) Seaborn Time!
As mentioned above (and just in case you’re running these cells without running the cells above first), we import matplotlib
and seaborn
using aliases as follows:
import matplotlib.pyplot as plt
import seaborn as sns
(7.1) Aggregation
Now, since this data has a much larger number of rows than the cars-by-country dataset used in the W06 lab demonstration, using sns.catplot()
will take a while to produce the same types of visualizations. So for our use case, we’re going to start off by working with aggregated data, to make the number of datapoints manageable and quickly visualizable using Seaborn.
As mentioned earlier (and as we saw in the cross-tabulation), we can start to get a good feel for “what’s going on” in the data by focusing on year-by-year statistics first, before “zooming in” on the particular months and/or days of the bombings. So, let’s perform that aggregation now: we’ll use Pandas’ agg() function to count the number of bombings per year in each country, while averaging the latitude, the longitude, and the number of weapons delivered.
While these particular aggregation choices might be a bit contrived in this example, in general this is where a lot of your thought should go when you are first starting on EDA! You should ask yourself which variables you want to group by (country, in our case) and which variables you want to aggregate, and for the latter set of variables which specific aggregations you want to apply that will best allow you to answer the types of EDA questions you are brainstorming.
For example, while at first I thought about computing the mode of the longitude and latitude per year, to try and find the most-bombed location in each country during each year, in reality the latitude and longitude values are too fine-grained: a single village being bombed 100 times might show up as 100 different sets of coordinates, unless two planes happened to bomb the exact same point on earth down to the maximum possible level of precision in the numeric recording mechanism used by the US military, which is not a very likely prospect.
(If that was confusing, it’s the same intuition around why, if we had 100 students each draw a unit circle and measure that circle’s circumference, all of these values would be approximately \(2\pi\), and yet it would be unlikely to find that any two students’ values were equal down to 50 or 100 decimal places)
So, let’s run the agg()
function to compute these aggregate values! There are a ton of different ways to use this function, syntax-wise, but I find the following syntax to be the most flexible for my most common use-cases (if you are curious about other ways to use it, or you can’t get this to work as a “template” for the aggregations you are trying to perform, let me know/schedule an office hour!)
(Also, we are now going to drop the extraneous id columns that we’ve kept around until now: I kept them in there because I want you to get comfortable with cases where you keep variables around just in case you end up needing them later on in your analysis, but since agg() only needs the columns we’re grouping by or aggregating, I will just drop whatever columns we’re not including in the aggregation at this point.)
# Printing the column names here for reference
strike_df.columns
Index(['MFUNC_DESC', 'MISSIONID', 'TGTCOUNTRY', 'THOR_DATA_VIET_ID', 'MSNDATE',
'SOURCEID', 'NUMWEAPONSDELIVERED', 'TGTTYPE', 'TGTLATDD_DDD_WGS84',
'TGTLONDDD_DDD_WGS84', 'year'],
dtype='object')
Here, so that I have some actual column in the dataset that I can aggregate to get the count per group, I make a column called num_bombings
filled with the value 1
for now, so that when we aggregate this column by summing it, the column will thereafter contain the count of the number of observations (bombings) per group
strike_df['num_bombings'] = 1
# And specifying columns to keep, that will be used as either *group* or
# *aggregated* variables
cols_to_keep = [
    'TGTCOUNTRY',
    'year',
    'TGTTYPE',
    'NUMWEAPONSDELIVERED',
    'TGTLATDD_DDD_WGS84',
    'TGTLONDDD_DDD_WGS84',
    'num_bombings'
]

# And run!
agg_df = strike_df[cols_to_keep].groupby(['TGTCOUNTRY','year','TGTTYPE']).agg(
    mean_weapons = ('NUMWEAPONSDELIVERED', 'mean'),
    num_bombings = ('num_bombings', 'sum'),
    mean_lat = ('TGTLATDD_DDD_WGS84', 'mean'),
    mean_lon = ('TGTLONDDD_DDD_WGS84', 'mean'),
)
As we can see from the output of head(), it looks like it has aggregated the data as we wanted; it’s just a bit messy, since the aggregated data is grouped in a three-level hierarchy (country->year->target type):
agg_df.head()
mean_weapons | num_bombings | mean_lat | mean_lon | |||
---|---|---|---|---|---|---|
TGTCOUNTRY | year | TGTTYPE | ||||
CAMBODIA | 1967 | AREA\DEPOT | 3.333333 | 12 | 14.693574 | 106.988342 |
ROAD | 4.000000 | 2 | 14.675000 | 106.783333 | ||
TRUCK PARK\STOP | 12.750000 | 4 | 14.573833 | 106.990167 | ||
1969 | ANTI-AIRCRAFT | 5.000000 | 11 | 14.298727 | 107.505909 | |
UNKNOWN\UNIDENTIFIED | 2.333333 | 12 | 14.593815 | 107.539481 |
If this type of hierarchical index is what you want for your use-case, then you’re done! In this case, though, we’d like to analyze various relationships among these variables which are unrelated to the specific country->year->target type hierarchy, so we’ll just use reset_index()
(remembering to also include the inplace = True
option!) to convert the hierarchical index columns into “regular” data columns:
agg_df.reset_index(inplace=True)
agg_df
TGTCOUNTRY | year | TGTTYPE | mean_weapons | num_bombings | mean_lat | mean_lon | |
---|---|---|---|---|---|---|---|
0 | CAMBODIA | 1967 | AREA\DEPOT | 3.333333 | 12 | 14.693574 | 106.988342 |
1 | CAMBODIA | 1967 | ROAD | 4.000000 | 2 | 14.675000 | 106.783333 |
2 | CAMBODIA | 1967 | TRUCK PARK\STOP | 12.750000 | 4 | 14.573833 | 106.990167 |
3 | CAMBODIA | 1969 | ANTI-AIRCRAFT | 5.000000 | 11 | 14.298727 | 107.505909 |
4 | CAMBODIA | 1969 | UNKNOWN\UNIDENTIFIED | 2.333333 | 12 | 14.593815 | 107.539481 |
... | ... | ... | ... | ... | ... | ... | ... |
229 | UNKNOWN | 1968 | ROAD | 0.800000 | 5 | 18.925000 | 105.595900 |
230 | UNKNOWN | 1968 | SEGMENT\TRANS ROUTE | 6.833333 | 6 | 18.401953 | 105.744138 |
231 | UNKNOWN | 1968 | other | 4.250000 | 8 | 16.690458 | 106.489083 |
232 | UNKNOWN | 1969 | UNKNOWN\UNIDENTIFIED | 0.000000 | 4 | NaN | NaN |
233 | WESTPAC WATERS | 1968 | other | 0.000000 | 6 | 17.566667 | 107.044389 |
234 rows × 7 columns
As a final consideration before we create our Seaborn catplot
s: if we had spent more time cleaning the TGTTYPE
variable—for example, by collapsing the individual values down into a smaller set of categories like “people”, “vehicles”, “buildings”—that would provide us with another easy-to-analyze categorical variable. For the sake of getting through the visualizations without more code, though, I won’t be generating plots involving this variable, so I’m now going to aggregate our dataset one more time, to sum/average over the different target-type values for each (country, year) pair. But, if you decide you want to use this dataset for a project, you can stop here and clean TGTTYPE
to carry out an EDA investigation in a different direction!
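As a rough illustration of what that alternative direction could look like, here is a hedged sketch (the groupings and the column name TGTTYPE_broad are hypothetical choices made up for this example, not taken from the codebook):

# Collapse the retained TGTTYPE values into a few broad, illustrative categories
broad_map = {
    'TROOPS': 'people',
    'MOTOR VEHICLE': 'vehicles',
    'TRUCK PARK\\STOP': 'vehicles',
    'AREA\\DEPOT': 'buildings',
    'BUILDINGS': 'buildings',
    'ROAD': 'infrastructure',
    'BRIDGE': 'infrastructure',
    'SEGMENT\\TRANS ROUTE': 'infrastructure',
    'ANTI-AIRCRAFT': 'air defenses',
}
# Anything not in the map (including "other" and UNKNOWN\UNIDENTIFIED) falls through
strike_df['TGTTYPE_broad'] = strike_df['TGTTYPE'].map(broad_map).fillna('other')

With that aside noted, here is the final aggregation down to (country, year) pairs: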
country_year_df = strike_df.groupby(['TGTCOUNTRY','year']).agg(
    mean_weapons = ('NUMWEAPONSDELIVERED', 'mean'),
    num_bombings = ('num_bombings', 'sum'),
    mean_lat = ('TGTLATDD_DDD_WGS84', 'mean'),
    mean_lon = ('TGTLONDDD_DDD_WGS84', 'mean'),
).reset_index()
(7.2) Categorical Plots with catplot()
Let’s generate our first catplot()! Here we’ll create a non-stacked bar plot, where the x-axis will represent years and the y-axis will represent the number of bombings, while the bars themselves will be colored by country (the hue option in catplot):
sns.catplot(
    data = country_year_df,
    kind = "bar",
    x = 'year',
    y = 'num_bombings',
    hue = 'TGTCOUNTRY',
    orient = 'v'
)
plt.show()
We can already infer some things from this plot, but since there were so few bombings in Thailand and in the “Unknown” and “WESTPAC WATERS” categories, let’s exclude these to make the plot more readable (it may be tempting to drop them, and it would typically be ok in many cases, but data-science-wise it’s a bit sketchy in the sense that it would skew our results if we wanted to compute e.g. the proportion of all bombings in each country at some point in our analysis later on)
countries_to_keep = [
    'CAMBODIA',
    'LAOS',
    'NORTH VIETNAM',
    'SOUTH VIETNAM'
]
country_year_sub_df = country_year_df.loc[
    country_year_df['TGTCOUNTRY'].isin(countries_to_keep),
].copy()

sns.catplot(
    data = country_year_sub_df,
    kind = "bar",
    x = 'year',
    y = 'num_bombings',
    hue = 'TGTCOUNTRY',
    orient = 'v'
)
plt.show()
So for me, at least, the plot in this form really already drives home the point from the beginning of the writeup: that “The Vietnam War” really could be called “The Vietnam War Plus The Continuous Onslaught Of Modern Weaponry Rained Down Upon Laos, A Similarly Poor And Agrarian Country With A Fraction Of The Population Of Vietnam”.
(7.3) Plotting Non-Aggregated Data
Now that we’ve seen how to make a basic catplot() using aggregated data, we can move back to the non-aggregated (individual-bombing-level) data to see how catplot() can automatically aggregate data for us, in whatever way is needed to make the plot we ask for. This makes it a super powerful function, but it also runs much more slowly (which is why I wanted to show some quick-to-generate examples first!)
Looking again at the columns in strike_df
here, as a reminder:
strike_df.columns
Index(['MFUNC_DESC', 'MISSIONID', 'TGTCOUNTRY', 'THOR_DATA_VIET_ID', 'MSNDATE',
'SOURCEID', 'NUMWEAPONSDELIVERED', 'TGTTYPE', 'TGTLATDD_DDD_WGS84',
'TGTLONDDD_DDD_WGS84', 'year', 'num_bombings'],
dtype='object')
We see that so far we’ve neglected the NUMWEAPONSDELIVERED
variable. Let’s use a violin plot this time to see what the data within this variable looks like:
strike_sub_df = strike_df.loc[strike_df['TGTCOUNTRY'].isin(countries_to_keep),].copy()

sns.catplot(
    data=strike_sub_df,
    kind='violin',
    x='NUMWEAPONSDELIVERED',
    y='TGTCOUNTRY',
    hue='TGTCOUNTRY'
)
We now see, though, that violin plots are probably not the best way to visualize this variable. In general, violin plots are intended to show the “curvature” of a distribution over some range of values, but in this case the distribution of our data is almost entirely concentrated at 0, with a few very extreme outliers. We can still make some inferences, however: Laos was not only the most-bombed country but also the site of the largest single bombing of the entire campaign, the extreme outlier corresponding to a bombing run in which over 8,000 weapons were delivered to the target.
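(As a quick sanity check on that outlier claim, here’s a sketch that pulls the largest single NUMWEAPONSDELIVERED value per country out of strike_sub_df:)
# Largest single NUMWEAPONSDELIVERED value recorded for each of the four countries.
print(
    strike_sub_df
        .groupby('TGTCOUNTRY')['NUMWEAPONSDELIVERED']
        .max()
        .sort_values(ascending=False)
)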
Finally, while we unfortunately don’t have two full-on continuous numeric variables to plot in the form of (e.g.) a scatterplot, we can use the year of each mission as a kind of pseudo-continuous numeric variable (the full date would be even better, if we wanted to spend more time cleaning the date variable so we could visualize day-by-day statistics), plotted against the NUMWEAPONSDELIVERED value for that mission (similar to the bar graph from above, but this time generated automatically by Seaborn from the raw, non-aggregated data):
# Here, (1) Again we use strike_sub_df to filter out the countries outside of our 4 main countries
# and thus make the plot slightly more readable, and (2) I'm using Pandas' sample() function to
# make the number of points manageable for Seaborn to plot (since otherwise it takes over a minute,
# which is mostly spent drawing nearly 1 million points that are almost entirely down near 0)
strike_sub_sample_df = strike_sub_df.sample(10000)

sns.lmplot(
    data=strike_sub_sample_df,
    x="year",
    y="NUMWEAPONSDELIVERED",
    col="TGTCOUNTRY",
    hue="TGTCOUNTRY"
)
plt.show()
And from these four plots we see what we could already probably see in the tabulations from earlier: that Cambodia is the only country which saw a clear year-to-year increase in bombings experienced, from the 1970 coup onwards. As another potential “branching path” in our EDA journey, we could feasibly dive far more deeply into this phenomenon in the Cambodia data specifically: visualizing these bombing-rate increases at a more fine-grained daily level, for example, and seeing whether/how they relate to political and military developments in Sihanouk’s/China’s backing of the Khmer Rouge.
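If you wanted to start down that path, here’s a hedged sketch of the first step. It assumes MSNDATE parses cleanly with pd.to_datetime (the raw format may need its own cleaning first), and the names introduced here (cambodia_df, msn_date, monthly_bombings) are just illustrative:
# Sketch: monthly bombing counts for Cambodia, assuming MSNDATE parses as a date.
cambodia_df = strike_sub_df.loc[strike_sub_df['TGTCOUNTRY'] == 'CAMBODIA'].copy()
cambodia_df['msn_date'] = pd.to_datetime(cambodia_df['MSNDATE'], errors='coerce')
monthly_bombings = (
    cambodia_df.dropna(subset=['msn_date'])
               .set_index('msn_date')
               .resample('MS')['num_bombings']  # 'MS' = month-start bins; swap in 'D' for daily
               .sum()
               .reset_index()
)
sns.relplot(data=monthly_bombings, x='msn_date', y='num_bombings', kind='line')
plt.show()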
And, last but certainly not least, a correlogram of our aggregated variables, generated using the same code as used in the W06 lab demo (we first display the full country_year_sub_df
, so we can glance back and forth between the correlations encoded as colors in the correlogram and the values themselves)
country_year_sub_df
 | TGTCOUNTRY | year | mean_weapons | num_bombings | mean_lat | mean_lon |
---|---|---|---|---|---|---|
0 | CAMBODIA | 1967 | 5.500000 | 18 | 14.664901 | 106.965969 |
1 | CAMBODIA | 1969 | 3.608696 | 23 | 14.452686 | 107.523425 |
2 | CAMBODIA | 1970 | 19.907526 | 6099 | 12.656550 | 106.588893 |
3 | CAMBODIA | 1971 | 84.664602 | 1130 | 13.221369 | 106.461660 |
4 | CAMBODIA | 1972 | 80.004827 | 1243 | 12.587244 | 106.407726 |
5 | CAMBODIA | 1973 | 106.332072 | 4758 | 11.898209 | 105.176818 |
6 | LAOS | 1965 | 10.939850 | 1197 | 17.552708 | 105.521857 |
7 | LAOS | 1966 | 11.938068 | 16825 | 16.618527 | 106.264797 |
8 | LAOS | 1967 | 12.266561 | 29738 | 16.364346 | 106.365654 |
9 | LAOS | 1968 | 9.391200 | 105046 | 17.052071 | 105.898343 |
10 | LAOS | 1969 | 10.355326 | 167010 | 17.470780 | 105.498338 |
11 | LAOS | 1970 | 11.221041 | 209273 | 17.145868 | 105.686660 |
12 | LAOS | 1971 | 13.407160 | 107675 | 17.140651 | 105.549005 |
13 | LAOS | 1972 | 18.685891 | 48531 | 17.093275 | 105.228933 |
14 | LAOS | 1973 | 46.943280 | 4866 | 17.607340 | 104.471988 |
15 | NORTH VIETNAM | 1965 | 12.527735 | 2596 | 20.248847 | 105.176626 |
16 | NORTH VIETNAM | 1966 | 9.555418 | 24261 | 18.333420 | 106.140361 |
17 | NORTH VIETNAM | 1967 | 9.060746 | 69420 | 18.516883 | 106.320436 |
18 | NORTH VIETNAM | 1968 | 7.279752 | 108110 | 17.895921 | 106.223039 |
19 | NORTH VIETNAM | 1969 | 2.847737 | 729 | 17.957101 | 106.128279 |
20 | NORTH VIETNAM | 1970 | 7.349806 | 1801 | 17.063296 | 105.956814 |
21 | NORTH VIETNAM | 1971 | 12.736385 | 3342 | 17.170001 | 106.070248 |
22 | NORTH VIETNAM | 1972 | 14.750387 | 34281 | 18.364637 | 106.144553 |
23 | NORTH VIETNAM | 1973 | 28.956159 | 1437 | 18.439947 | 105.850921 |
24 | SOUTH VIETNAM | 1965 | 13.495670 | 9238 | 13.340375 | 107.601053 |
25 | SOUTH VIETNAM | 1966 | 13.735665 | 4918 | 13.053547 | 107.553177 |
26 | SOUTH VIETNAM | 1967 | 9.340637 | 1004 | 15.226087 | 107.357685 |
27 | SOUTH VIETNAM | 1968 | 11.776774 | 3763 | 15.712929 | 107.287747 |
28 | SOUTH VIETNAM | 1969 | 17.725795 | 1353 | 14.273498 | 107.630294 |
29 | SOUTH VIETNAM | 1970 | 8.029434 | 2684 | 14.339186 | 107.127581 |
30 | SOUTH VIETNAM | 1971 | 7.367625 | 7932 | 16.382974 | 106.494151 |
31 | SOUTH VIETNAM | 1972 | 7.496674 | 23303 | 14.872121 | 106.908634 |
32 | SOUTH VIETNAM | 1973 | 95.425926 | 54 | 13.715959 | 107.502381 |
33 | SOUTH VIETNAM | 1975 | 1.000000 | 4 | 11.007222 | 106.834444 |
="white")
sns.set_theme(style= ['mean_weapons', 'num_bombings', 'mean_lat', 'mean_lon']
corr_vars = country_year_sub_df[corr_vars].corr() #Compute the correlation matrix
corr # Generate a mask for the upper triangle
= np.triu(np.ones_like(corr, dtype=bool))
mask = plt.subplots(figsize=(11, 9)) #initialize figure
f, ax
= sns.diverging_palette(230, 20, as_cmap=True) #custom diverging colormap
cmap
# # Draw the heatmap with the mask and correct aspect ratio
=mask, cmap=cmap, vmax=.3, center=0,
sns.heatmap(corr, mask=True, linewidths=.5, cbar_kws={"shrink": .5})
square plt.show()
Footnotes
My old link to the dataset, from data.mil, no longer seems to work, and I can’t locate the dataset on the Department of Defense’s official website either, so if someone knows what happened there, let me know :D↩︎