[Quick reminder if you’re reading this: before the workshop try to get as far in this interactive tutorial as you can – we’re gonna recap the six topics there fairly quickly.]

What is “Text-As-Data”? Why should I care?

Standing on the Shoulders of Giants

The lazy answer: there are entire excursuses[^excurs] on exactly this question, in basically whatever field you could imagine, at this point. I was originally only going to write up a brief overview of other resources, but that eventually turned into a whole separate document, the “Text-As-Data Resource Extravaganza”. I’ll only briefly look at it in the workshop, but definitely keep it at the front of your mind, and I’ll probably refer back to it a lot (e.g., for examples of applications, and especially for citing the “founding documents” of various text-as-data methods without cluttering these tutorial files).

Picking Up Your Data and Looking At It from Weird Angles

Now, to answer the second question: At its most basic level, the reason you should learn text-as-data methods is because they transform an unimaginably large body of salient but non-numeric information into salient numeric information, which can then be integrated into the “standard” processes of empirical analyses that social scientists have been using for decades: regressions, causal inference, even qualitative exploration and evaluation of field notes.

However, there’s really a two-way symbiotic (might I even say… DIALECTICAL) relationship between text-as-data and social science, and that’s where things get really exciting IMO. In a nutshell, the epistemic frameworks and models of the social sciences bring clarity to the cold statistical quantities produced by text-as-data methods, while at the same time those quantities give us completely new perspectives/angles from which to view social science problems. In other words, text-as-data methods have the potential to exponentiate the efficacy of Rawls’ “reflective equilibrium” approach to model-building (Rawls (1951)).

In my dissertation work I’m trying to argue (by example) that these methods can enrich scholarship even outside of the standard empirical domains – specifically, in studies of intellectual history and the history of political/economic thought. So if that interests you definitely talk to me afterwards, but in this tutorial I’ll have empirical social scientists (i.e., people doing more straightforward independent-and-dependent-variables estimate-a-parameter-style work) in mind as my primary audience. Now to some pretty pictures – carrots to hang in front of your head in the coming weeks!

Pretty Pictures

Topic Modeling

[Figure: example topic model output, from Roberts et al.’s presentation]

(Roberts et al’s presentation cites Nielsen (2013) for this figure, but I can’t find it in there… Let me know if you see it!)

Word Embeddings

Use of the word Iraq over time

This figure from Rudolph and Blei (2018) shows how the “meaning” of Iraq, in the world of Congressional debates, changed from 1858 to 2009. For now (until you learn Principal Component Analysis!), you can roughly think of the y-axis as how close the meaning was to “some country in the Middle East” (the top) vs. “hotbed of terrorism” (the bottom) in a given year.

When Text-As-Data Meets Information Theory

Innovation and influence of Marxist texts

I couldn’t resist including one of my own plots. We may get to learn this method at the very end, but it combines dynamic topic modeling (how do emphases on different topics change over time?) and information theory (specifically, Kullback-Leibler Divergence which measures how surprised you are at time \(t+1\) if you expected the same topic emphases as seen in time \(t\)). See the slides from my talk at the Santa Fe Institute if you’re curious!
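If you want to see the KL part in action, here’s a minimal sketch in R (the topic proportions are made up purely for illustration – they’re not from the actual model):

# Hypothetical topic proportions at time t and at time t+1 (each sums to 1)
topics_t  <- c(0.5, 0.3, 0.2)
topics_t1 <- c(0.2, 0.3, 0.5)
# KL Divergence of the time-t+1 mixture from the time-t mixture:
# sum over topics of p_{t+1} * log(p_{t+1} / p_t)
# (bigger = more "surprising", given what you saw at time t)
sum(topics_t1 * log(topics_t1 / topics_t))
## [1] 0.2748872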

Neural Networks

A neural network learning artistic style

Gatys, Ecker, and Bethge (2015) you absolutely wild for this one. And that’s all I have to say about that.

And now to the main event!

Installing R and R Studio

Why are there two things? What’s the difference between R and R Studio?

Deep breath…

So before we download stuff it’s actually important to understand the difference between R and R Studio. To me, the easiest way to think about it is:

  • R is the actual black-box machine that takes in R code and spits out… whatever the R code asked it to do
  • R Studio is just an interface (like a nice picture frame) built around R.

To drive the point home, stare at this picture for a long time to see how it’s possible to do R programming (and even text-as-data in R) entirely without R Studio. Stare for an even longer time to see how this would be extremely tedious and… kind of like browsing the paintings at the Louvre by peeking through the keyhole on the front door (there’s no keyhole on the front door, but follow me on this one).

R in the Terminal

For example, just look at that first line after R started: the terminal cuts the code off after like 40 characters – there’s a LOT more code in that line overflowing past the righthand side, and it’s… not straightforward to go back and see what the full line was, after you’ve executed it :(

Doesn’t this look much less… masochistic?

R Studio

Now that you’re sold, let’s actually download and install both R (first) and R Studio (afterwards)!

For (just) R, go here. You want the “base” version, in case they ask. Windows and Linux users I can help out here. Mac users you’re out of luck, I’ve never owned an Apple product…

Now, for R Studio, go here. See above for Jeff-help possibilities.

Now, if you successfully cloned the GitHub repo, you’ll see a file inside the w1_files folder called “first_program.r”. Double-click on that, and if R Studio doesn’t open with the contents of that file in the upper-left corner, come talk to me… unless you have a Mac (jk you can still talk to me, sry I’m a saltboi):

R Studio Panels

If your screen looks like the above, you’re ready to move to the next section!

The four panels of R Studio

Before we start clicking and dragging and hacking, let’s look at what the four squares in front of you mean (in counterclockwise order for some reason):

Top Left: The Editor

The Editor

This is where you will (or at least should) spend 90% of your time. It’s where any file you open will open “into”, and where buttons to save and run your code live.

Bottom Left: The Console

The Console

I was serious when I said R Studio is just a fancy wrapper around R, and this is where that manifests: when you run whatever code is in your Editor, all that R Studio really does is copy it and then paste it line-by-line into this Console. This doesn’t prevent you, however, from typing your own commands directly into the Console. You can test this out by clicking the panel to bring up the cursor and then typing 1+1.

The difference between being the next Mark Zuckerberg and the next Marky Mark (sans Funky Bunch) can boil down to knowing the following shortcut: if you type a command into the Console but want to change one thing about it (say, plot \(y = x+3\) rather than \(y = x+2\)), don’t retype it! Just click on the panel to bring up the cursor, then press the up arrow to browse through the last-entered commands. If you accidentally go past the command you want, you can use the down arrow to move to the subsequently-entered command.

Bottom Right: The File Browser

File Browser

This is where you can look at files, just like you do in Windows Explorer (Windows), the ls command in Linux, and… “Finder” I think? in OSX. Here, however, you can single-click on the filename to open it into your Editor. Knowing how Working Directories work is integral to using this File Browser, so… don’t mess with it until you read that section below.

Top Right: The Environment Browser

Environment Browser

This is where you will be able to see (and “browse”) all of the variables and functions you create during your R session. For example, if you load a dataset, this will tell you how many observations (rows) and variables (columns) your dataset has.

I put “browse” in scare-quotes above because unlike files (which just open in the Editor regardless of their type), different types of variables have different ways of browsing them. For our purposes, the only important thing to know is that data frames (the variable type that R uses to hold loaded datasets) will have a tiny spreadsheet icon all the way on the right side of their row in the Environment Browser, and if you click this icon you will be able to see a “spreadsheet”-style view of the entries in your dataset. HOWEVER, note that this spreadsheet view is read-only – if you want to edit the data in your dataset, you’ll need to get through a few more sections.

.Rmd files vs. .r files vs. the console

Now we can finally get down to details regarding the actual day-to-day “workflow” of an R coder hacker ninja.
The power of hacking in R

For the remainder of the workshop, we’re not going to be using the .r format, the format of the first_program.r file that you loaded above. Although the .r format beats the Console when it comes to ease of use, the .Rmd format leaves it in the dust.

Basically, while the bare-bones .r format just allows you to type lines of R code and then run them one-by-one, the .Rmd format (short for R Markdown) lets you

  • Combine your code, its output, and your (formatted, hyperlinked, image-filled) documentation of the code/output into a single HTML, PDF, or MS Word document
    • As a quick example, this tutorial itself was written in R Markdown!
  • Organize your code into named “cells”, for example to separate the different tasks you need to perform.
    • These cells also have a huge range of parameters, for example providing the ability to cache the results of a cell so that you don’t have to re-run it every time you re-generate your HTML/PDF/Word document.
  • Customize the document output format however you want
    • For example, a block of code at the top of this tutorial’s .Rmd file tells R Studio to generate an HTML document with a particular title and author, format it so that it can be previewed on GitHub, auto-generate a table of contents, load the bibliographic data from a .bib file (W1_files/TAD_Week_1.bib), and so on.

So, to see the .Rmd format in action, click on the fancy_program.Rmd file you’ll see in the W1_files directory within the File Browser, then click the “Knit” button at the top of the Editor to generate your first HTML-format R Markdown file!
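For reference, here’s a minimal sketch of the general shape of an .Rmd file (the title, chunk name, and options below are made-up illustrations, not the actual contents of fancy_program.Rmd):

---
title: "My Analysis"
author: "Your Name"
output: html_document
---

Regular (formatted, hyperlinked) documentation text goes here, outside the code cells.

```{r load_data, cache=TRUE}
# A named, cached cell: re-knitting skips it unless this code changes
x <- rnorm(100)
mean(x)
```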

Real quick before moving on: any time you want to make a new R Markdown file, just click on the… paper-with-green-plus-sign icon (the icon all the way on the left, in the toolbar directly below the “File, Edit, Code, …” toolbar). R Studio will provide you with a Wizard allowing you to specify the title, author, and output format. Then, when you’re ready to save the file, you’ll want to save it in your Working Directory, which we explain… NOW

The Working Directory

What Is It?

No joke, I genuinely think the idea of the “working directory” is the number-one stumbling block and source of headaches for people first learning R. So maybe pay attention to this section if you’ve zoned out for everything before this :P

In theory, the Working Directory setting is a way for you to organize your folder+file structure such that everything is organized in a logical manner. For example, this week’s tutorial has a w1_files subfolder which itself has an img subfolder, so that we don’t have to wade through the image files when looking for a particular .Rmd or .bib file.

In practice, however, the Working Directory (WD) is an opaque gatekeeper that gives you FATAL ERRORS if you try to run code when the WD is set to \(X\), but proceeds without any issues when the WD is set to a different directory \(Y\). In fact, this is the best-case scenario. In an even deeper layer of WD hell, your code gives some output \(O_X\) when the WD is \(X\), and some completely different output \(O_Y\) when it’s set to \(Y\). Which one is correct? Without understanding how the WD works, you’ll never know (and… paradoxically, once you do understand WDs, you’ll never run into this situation in the first place.)

So, long story short, the Working Directory is just:

a folder somewhere on your computer where R will look for code/images/datafiles/whatever whenever you click “Knit” (or “Run”, “Source”, anything of that nature).

The problem is, though, that R Studio makes it completely unclear (in my opinion) what exactly your WD is set to. Obnoxiously, there’s no guarantee that the folder shown in the File Browser is the same as your WD. Even MORE obnoxiously, there’s no guarantee that the folder containing the current .Rmd or .r file you’re editing is the WD, or that the WD contains any file in your Editor panel.

SO, as far as I can tell, the most straightforward way to find out our WD is to go to the Console panel, click so it displays the cursor, and then type:

getwd()
## [1] "/home/jjacobs/tad_workshop"

[Quick sidenote: that was the first example of an R Markdown “code cell” in this tutorial]

If this is the folder containing all the files you want to work with in this session, you’re good. Otherwise, you’ll need to CHANGE IT…

How Do I Change It?

This is also not so straightforward, in the sense that there are multiple ways to do it, from multiple panels/menus (specifically the “File, Edit, Code, …” toolbar, the Console, and the File Browser… oh and the keyboard too):

  • “File, Edit, Code, …” Toolbar: This is the way I would (highly) recommend doing it. Click “Session”, then “Set Working Directory”, and then “Choose Directory…”. Then just browse to the directory you want to work in, and click “Open”.
  • Keyboard: Press Ctrl+Shift+H. This is a keyboard shortcut that just does the same exact thing as the prior method. OSX/Linux users… sorry, I don’t know what your shortcut is, but if you do the prior method it should be written right next to “Choose Directory…” in the submenu.
  • File Browser: A two-parter.
    • First, browse to the folder you want to work in using the File Browser. Helpfully, you can click the button with the ellipsis (3 tiny dots) on it at the top right of the panel (underneath… the loopy arrow all the way at the top-right of the panel).
    • Then, click on the “More” button with the gear icon at the top of the File Browser (it’s after the “New Folder”, “Delete”, and “Rename” buttons), then click “Set As Working Directory”.
  • Console: Probably the most complicated way to do it. You can use the command setwd(<path>), replacing <path> with the actual path to the directory you want to work with. For example, on Windows you could use
setwd("C:\\Users\\jj\\Dropbox\\tad-workshop\\week-1")

[Quick note: for this code cell, I set the option eval to FALSE, since I don’t actually want R to run it (I like my WD as-is), just display it.]

Why the double-backslashes, you ask? Programming languages interpret a single backslash \ in a string as a note from the coder that they’re about to insert a “special”/non-standard character. For example, \t inserts a tab, while \n moves the text down to the next line (Windows technically uses the two-character sequence \r\n as its line ending, while Linux/OSX just use \n). So to tell R that we actually want a backslash, we have to say \\, i.e., “Hey R, I’m about to put a special character” and then “Just kidding, I actually want a normal backslash for once”.
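A quick illustration (safe to run anywhere, since it doesn’t touch your WD – the path in the comment is just the hypothetical Windows path from above):

# "\t" becomes a tab and "\n" ends the line when the string is printed with cat()
cat("name\tborn\nMarx\t1818\n")
# On Windows you can also sidestep the whole issue by using forward slashes,
# which R happily accepts in file paths:
# setwd("C:/Users/jj/Dropbox/tad-workshop/week-1")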

LAST THING regarding the Working Directory: the best way I’ve found to minimize WD issues overall is to always start R Studio by double-clicking on my .Rmd file from within Windows Explorer/Finder. The one reasonable thing R Studio does is set the WD to be the location of the file you just double-clicked if you start it this way. So yeah, do this instead of (for example) starting R Studio from the Start menu or… the little icon bar at the bottom of the screen in OSX. I’d actually recommend removing it from the Start menu or the bottom bar.

Exploring the R language itself

Basic Syntax and the Five Main Data Structures

Finally finally finally we get to do some code (see above .gif). If you did the Interactive R Tutorial mentioned in the survey and at the very top of this page, you can… take a stretch break or something. But I’m going to really quickly zoom through the topics covered, in case people didn’t have time to do it or got stuck somewhere (don’t worry, getting stuck is a very normal experience for anyone at any skill level in R/in programming in general). Here we go.

Basics

R can be used as a glorified calculator!

# You can type
1+1
## [1] 2
# or
2^5
## [1] 32
# or even
sqrt(pi)
## [1] 1.772454

Those # symbols at the beginning of a line tell R “this is just a comment (a note-to-self), not a line of code that you should try and run”. You should use them early and often in your code, so that when you go back to it the next month you remember why you did what you did…

Values can be assigned to variables using the <- operator:

# The variable x will point to the value 3
x <- 3
# The variable y will point to the value sqrt(2)
y <- sqrt(2)
# The variable z will point to the string "Hello"
z <- "Hello"

And then the variable names can be used in the same way you used numbers in the arithmetic above:

# Compute the sum of x and y
x + y
## [1] 4.414214
# Learn what sqrt(2)/2 is
y / 2
## [1] 0.7071068
# Adding numbers and strings doesn't make sense and makes R angry
x + z
## Error in x + z: non-numeric argument to binary operator

[Note that for this last cell to display I had to add error=TRUE to the code-cell settings. Otherwise the error stops the R Markdown document from compiling (which is good… usually you don’t want to make a document with errors in it)]

R has a few very simple “basic data types” (though I learned to call these “primitives” in my CS classes…):

  • numeric: A number, like 1, 0.338, pi, etc.
  • character: A (potentially empty) string of character(s), like "Hello", "?!?", or "". It can get a bit tricky since these strings can also hold numeric “characters”, so 1 is a numeric value but "1" is a character value!
  • logical: Either TRUE or FALSE. Usually used to control program logic (do something if computer_temperature > 100 is TRUE), but can also be used to “scoop out” rows from a dataset that match certain criteria. More on this later, but if you know Stata it works like the drop if and keep if statements.
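If you’re ever unsure which of these types a value is, the built-in class() function will tell you (and it illustrates the 1 vs. "1" distinction nicely):

class(1)
## [1] "numeric"
class("1")
## [1] "character"
class(TRUE)
## [1] "logical"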

Why doesn’t R tell me anything when I press Enter and store something in a variable? Good question. Very good question. The short answer: assignment quietly stores the value without displaying anything. If you want to see the result of the value-storing, type the name of the variable again on its own line.
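For example (nothing new here, just the display behavior spelled out):

x <- 3   # stores the value, displays nothing
x        # typing the name by itself displays the value
## [1] 3
# (Wrapping the whole assignment in parentheses also displays it)
(x <- 3)
## [1] 3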

Vectors

Ordered collections of values! You create these by putting a c and then a parenthetical list of values, separated by commas. For example:

# A numeric vector
x <- c(1,2,3)
x
## [1] 1 2 3
# A character vector
y <- c("a","b","c")
y
## [1] "a" "b" "c"
# A logical vector
z <- c(TRUE, FALSE, TRUE)
z
## [1]  TRUE FALSE  TRUE

Once again, you can do “arithmetic” on these vectors, and R will usually (though not always!!) do what you expect:

# Produces a vector holding the *square* of every number in x
x^2
## [1] 1 4 9
# Appends an exclamation point to every character value in y. Now's as good a time as any to learn the paste() and paste0() functions
paste0(y,"!")
## [1] "a!" "b!" "c!"
# Produces the *negation* of z. "!" is the "NOT" operator in R.
!z
## [1] FALSE  TRUE FALSE

Matrices

From sequences of stuff to BOXES FILLED WITH STUFF wow! Matrices are a bit weird in R, since basically to make them you need to provide a vector of all the values first, and then tell R how to “convert” this vector into a matrix using the nrow, ncol, and byrow options to the matrix() function:

x <- matrix(
    c(1,2,3,4),
    nrow=2,
    ncol=2,
    byrow=TRUE
)
# Display x
x
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
# See what is in row 2 column 1 of x
x[2,1]
## [1] 3

If yall remember your matrix multiplication rules, you can perform “normal” matrix multiplication using the new %*% operator:

y <- matrix(
    c(-1,-2,-3,-4),
    nrow=2,
    ncol=2,
    byrow=TRUE
)
y
##      [,1] [,2]
## [1,]   -1   -2
## [2,]   -3   -4
# Normal matrix multiplication, weird %*% thing
x %*% y
##      [,1] [,2]
## [1,]   -7  -10
## [2,]  -15  -22
# Weird matrix multiplication (the Hadamard Product), normal * thing
x * y
##      [,1] [,2]
## [1,]   -1   -4
## [2,]   -9  -16

So yeah, be extremely careful with matrices, since the multiplication you’re expecting is NOT * but %*%. Lastly the t() function gives you the transpose of a matrix:

t(x)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Factors

Inching closer to working with (numeric) data, factors let us create and specify the different levels of categorical variables:

fav_color <- factor(c(1,2,2,2,3,2,1,1),labels=c("Red","Black","Green"))
fav_color
## [1] Red   Black Black Black Green Black Red   Red  
## Levels: Red Black Green

Don’t get confused about the correspondence between the first and second arguments to the factor() function! The first (unnamed) argument, a vector, says “Here is my data”, while the second argument (named “labels”) says “A value of 1 means this, a value of 2 means this, …”:

goat_votes <- c(4,4,4,3,4,1,2,1,4)
goat_labels <- c("MJ", "LeBron", "Kobe", "Sabonis")
goat_factor <- factor(goat_votes, labels=goat_labels)
goat_factor
## [1] Sabonis Sabonis Sabonis Kobe    Sabonis MJ      LeBron  MJ      Sabonis
## Levels: MJ LeBron Kobe Sabonis
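One nice payoff of factors: functions like table() know about the labels, so (using goat_factor from above) you get a readable tally of how many observations fall into each level:

table(goat_factor)
## goat_factor
##      MJ  LeBron    Kobe Sabonis
##       2       1       1       5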

Data Frames

Probably (definitely) the most important data structure for social scientists. Basically everything you would expect from a spreadsheet, aka a glorified matrix with column names, which makes it easier to specify and “scoop out” particular observations of interest:

x <- data.frame(
    name=c("Ja Rule","Lenin","Mao","Marx"),
    born=c(1976,1870,1893,1818),
    successful_rev=c(TRUE,TRUE,TRUE,FALSE)
)
# Display x, column names and all
x
##      name born successful_rev
## 1 Ja Rule 1976           TRUE
## 2   Lenin 1870           TRUE
## 3     Mao 1893           TRUE
## 4    Marx 1818          FALSE
# "Scoop out" observations of people born after 1890
x[x$born > 1890,]
##      name born successful_rev
## 1 Ja Rule 1976           TRUE
## 3     Mao 1893           TRUE

That last statement is a bit hairy: the basic syntax for “scooping” is x[<rows you want>,<columns you want>]. And you leave <rows you want> blank if you want all rows, and <columns you want> blank if you want all columns. So then the statement boils down to: “I want to scoop out rows in x where the born column is greater than 1890, but keep all the columns for these rows”.
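Just to show the <columns you want> slot in action too, here’s the same row filter but keeping only two of the columns (specified by name):

# Same rows as before, but only the "name" and "born" columns
x[x$born > 1890, c("name", "born")]
##      name born
## 1 Ja Rule 1976
## 3     Mao 1893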

Still, the syntax is super weird – why does x appear twice in succession (x[x...])? It should make more sense if we break this “compound” statement apart, giving the “moving parts” separate names and lines:

# First, we scoop out just the "born" column using the "$" operator
born_vals <- x$born
born_vals
## [1] 1976 1870 1893 1818
# Next, we generate a *logical* vector which is TRUE for slots where born is greater than 1890 and FALSE everywhere else
born_after_1890 <- born_vals > 1890
born_after_1890
## [1]  TRUE FALSE  TRUE FALSE
# Finally we use this logical vector to *scoop out* just the TRUE rows
obs_i_want <- x[born_after_1890,]
obs_i_want
##      name born successful_rev
## 1 Ja Rule 1976           TRUE
## 3     Mao 1893           TRUE

Note, however, that you’re almost never going to be making a data frame from within R like I did above. Instead, you’ll usually have R create a data frame automatically from data you load in from a file… almost a beautiful segue into the next section, except we have to talk about Lists real quick. But AFTER THAT! loading data files.

Lists

Lists are basically like vectors except that they can hold any type of value (really, anything – including matrices, functions, and even MORE LISTS) in any slot, whereas the values in a vector have to all be of the same type:

boring_vector <- c(1,2,3)
boring_vector
## [1] 1 2 3
exciting_list <- list(1,"Sabonis",c(-1,0,1),boring_vector,diag(3),list("we","need","to","go","deeper"))
exciting_list
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "Sabonis"
## 
## [[3]]
## [1] -1  0  1
## 
## [[4]]
## [1] 1 2 3
## 
## [[5]]
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## [[6]]
## [[6]][[1]]
## [1] "we"
## 
## [[6]][[2]]
## [1] "need"
## 
## [[6]][[3]]
## [1] "to"
## 
## [[6]][[4]]
## [1] "go"
## 
## [[6]][[5]]
## [1] "deeper"

See how exciting that is? The problem is that they require an entirely different syntax from vectors/matrices to pull things out:

exciting_list[[3]]
## [1] -1  0  1
exciting_list[[5]]
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
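One extra wrinkle worth knowing (a quick illustration using the same list): single brackets give you back a smaller list, while double brackets give you the element itself.

# Single brackets: a list of length 1 that *contains* the vector
exciting_list[3]
## [[1]]
## [1] -1  0  1
# Double brackets: the vector itself
exciting_list[[3]]
## [1] -1  0  1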

Functions

Here I’m adding one of two “bonus” R constructs on top of the five covered in the interactive tutorial. Namely, the function construct. If you haven’t noticed, a lot of parentheses start popping up as soon as you start on even the simplest task in R (even the “Hello world” program has parentheses: print("Hello world!")). Which raises the question,

Why do I always need to type these parentheses?

And the answer is that the parentheses are a way of sending data to a special type of R variable, called a function. While “normal” variable names just point to a value, like how x <- 3 makes the name x point to the numerical value 3, function variables actually point to another block of R code that either you or somebody else has already written. So when you type print("blah"), you’re basically telling R “go to the code block someone wrote that’s called print(), and give it the character string "blah" so it can do its thing and figure out how to display that down in the R Console”.

So, what that means is that there are two separate parts of a function in R, though often people just say “function” despite the fact that they’re only talking about one part (metonymy is a hell of a drug):

  1. The actual named block of code, which can take in any number of arguments (variables that get passed to it) and return the result of its computations. This looks like the following:
# Define the function
my_sqrt <- function(number, positive=TRUE){
  if (positive){
    return(sqrt(number))
  } else {
    return(-sqrt(number))
  }
}
  2. Other points in the code where you call the function, via (say) result <- function_name(arguments) and thus tell R “give these variables to the code block and when it has produced an output, store it in the result variable”:
x <- 5

#
# ...other code...
#

# Now call the function from above *without* the optional keyword arg
my_sqrt(x)
## [1] 2.236068
# And *with* the keyword arg
my_sqrt(x, positive=FALSE)
## [1] -2.236068

As the comments here suggest, there’s one other important thing to know about R functions, which is that you don’t have to send them all of the arguments every time you call them: there are positional arguments that are required for any call to the function, and separate keyword arguments that are optional and typically have sensible default values. For example, even the basic print() function has a secret keyword argument, na.print, that you can send to change the way R prints NA values in numeric variables. Otherwise, though, it uses the sensible default of just printing the two characters "NA":

print(c(1,2,NA,4))
## [1]  1  2 NA  4
print(c(1,2,NA,4),na.print = "SOMETHING HAS GONE HORRIBLY WRONG")
## [1]                                 1                                 2
## [3] SOMETHING HAS GONE HORRIBLY WRONG                                 4

Lastly, just note that you can “string” functions together, by plugging the output of one in as an argument to another. For example, this line calls my_sqrt first, then takes its output and sends it as an argument to print:

print(my_sqrt(5))
## [1] 2.236068

And you can still include optional arguments to either function in these compound statements:

print(my_sqrt(c(4,9,NA), positive=FALSE), na.print = "UH OH")
## [1]    -2    -3 UH OH
# Note that NA is *not* the same as NaN:
print(my_sqrt(c(-4,9,NA), positive=FALSE), na.print="NA_But_*Not*_NaN")
## Warning in sqrt(number): NaNs produced
## [1]              NaN               -3 NA_But_*Not*_NaN

If Statements and Loops

This is the last thing I can think of that’s absolutely necessary for you to be able to utilize the core functionality of R. Basically, an if statement and a for loop are two fundamental constructs which:

  1. Allow your code to produce different results based on the contents of some variable or the output of some function (the if statement):

# This is necessary so you get the same results as me (it ensures that our computers' random number generators produce the same series of numbers)
set.seed(1948)
# Flip a (fair) coin, where 0 is tails and 1 is heads
coinflip_result <- rbinom(n=1, size=1, prob=0.5)
print(coinflip_result)
## [1] 1
# Now let the user know what the result *means*
if (coinflip_result == 0){
  print("Tails")
} else {
  print("Heads")
}
## [1] "Heads"
  2. Allow you to repeat some code a specified number of times (the for loop), eliminating the need to ever copy-and-paste code to handle (for example) multi-dimensional data:
my_numbers <- 1:20
for (current_number in my_numbers){
  # Print the (positive) square root of the number
  print(my_sqrt(current_number))
}
## [1] 1
## [1] 1.414214
## [1] 1.732051
## [1] 2
## [1] 2.236068
## [1] 2.44949
## [1] 2.645751
## [1] 2.828427
## [1] 3
## [1] 3.162278
## [1] 3.316625
## [1] 3.464102
## [1] 3.605551
## [1] 3.741657
## [1] 3.872983
## [1] 4
## [1] 4.123106
## [1] 4.242641
## [1] 4.358899
## [1] 4.472136

Loading Data: .csv, .dta, .RData

Alright! Now that we’ve got the basics down, and we understand the 5 most-used data structures in R (to refresh your memory: Vectors, Matrices, Factors, Data Frames, and Lists), let’s load our first real (numeric) data file! Specifically, we’re going to load .csv (comma-separated values), .RData (R’s default format), and .dta (Stata-formatted) data files.

.csv

We’ll start with the most common format, .csv. The built-in R function read.csv() is the “standard” way to load a .csv file:

unga_res <- read.csv("datasets/UNGA_votes/DescriptionsUN1-72.csv")
head(unga_res)
##   session rcid abstain yes no importantvote     date   unres amend para
## 1       1    3       4  29 18             0 1/1/1946  R/1/66     1    0
## 2       1    4       8   9 34             0 1/2/1946  R/1/79     0    0
## 3       1    5       1  28 22             0 1/4/1946  R/1/98     0    0
## 4       1    6      10  12 27             0 1/4/1946 R/1/107     0    0
## 5       1    7       0  25 18             0 1/2/1946 R/1/295     1    0
## 6       1    8       2  38  1             0 1/5/1946 R/1/297     1    0
##                            short
## 1 AMENDMENTS, RULES OF PROCEDURE
## 2     SECURITY COUNCIL ELECTIONS
## 3               VOTING PROCEDURE
## 4    DECLARATION OF HUMAN RIGHTS
## 5     GENERAL ASSEMBLY ELECTIONS
## 6                  ECOSOC POWERS
##                                                                                                                                                                                                                                                  descr
## 1  TO ADOPT A CUBAN AMENDMENT TO THE UK PROPOSAL REFERRING THE PROVISIONAL RULES OF PROCEDURE AND ANY AMENDMENTS THEREOF TO THE 6TH COMMITTEE, SAID AMENDMENT PRESCRIBING A 1-WEEK TIME LIMIT WITHIN WHICH THE 6TH COMM. MUST SUBMIT ITS REPORT ON THE
## 2                                                                                                   TO ADOPT A USSR PROPOSAL ADJOURNING DEBATE ON AND POSTPONINGELECTIONS OF THE NON-PERMANENT MEMBERS OF THE SECURITY COUNCIL, TO THE FOLLOWING WEEK.
## 3                                                                                         TO ADOPT THE KOREAN PROPOSAL THAT INVALID BALLOTS BE INCLUDED IN THE TOTAL NUMBER OF \\MEMBERS PRESENT AND VOTING\\\\, IN CALCULATING THE MAJORITY VOTE.\\""
## 4                                                                                                                                           TO ADOPT A CUBAN PROPOSAL (A/3-C) THAT AN ITEM ON A DECLARATION OF THE RIGHTS AND DUTIES OF MAN BE TABLED.
## 5                                                       TO ADOPT A 6TH COMMITTEE AMENDMENT (A/14) TO THE PROVISIONAL RULES OF PROCEDURE, WHICH AMENDMENT PROVIDES THAT RULE 73 SHOULD END WITH:\\THERE SHALL BE NO NOMINATIONS.\\\\"\t0\t0\t0\t0\t0\t0
## 6 TO ADOPT A SECOND 6TH COMM. AMENDMENT (A/14) TO THE PROVISIONAL RULES OF PROCEDURE, WHICH AMENDMENT EPLACES PROVISIONAL RULE T WITH A NEW TEXT AUTHORIZING THE ECONOMIC & SOC. COUNCIL TO CALL INTERNATIONAL CONFERENCES ON ANY MATTER WITHIN ITS CO
##   me nu di hr co ec
## 1  0  0  0  0  0  0
## 2  0  0  0  0  0  0
## 3  0  0  0  0  0  0
## 4  0  0  0  1  0  0
## 5 NA NA NA NA NA   
## 6  0  0  0  0  0  1

[As you can see, this is also your first experience with malformed .csv files that you’ll never be able to fix no matter how hard you try. I spent ~30 minutes trying to fix the NA entries to no avail.]

Now we can do cool stuff, like scoop out only the votes pertaining to the Israeli ethnic cleansing of Palestine:

isr_pal_res <- unga_res[unga_res$me == 1,]
head(isr_pal_res)
##      session rcid abstain yes no importantvote      date     unres amend
## NA        NA   NA      NA  NA NA            NA      <NA>      <NA>    NA
## 28         1   30       6  41  5             0 12/6/1946 R/1/1288A     0
## 32         1   34       5  41  6             0 12/6/1946 R/1/1288E     0
## 75         2   77      10  33 13             0 11/6/1947  R/2/1424     0
## NA.1      NA   NA      NA  NA NA            NA      <NA>      <NA>    NA
## 77         2 9002       7   7 39             0  5/6/1947  SS/1/114     1
##      para                        short
## NA     NA                         <NA>
## 28      0  FRENCH TOGOLAND TRUSTEESHIP
## 32      0 BRITISH TOGOLAND TRUSTEESHIP
## 75      0    PALESTINE, ECONOMIC UNION
## NA.1   NA                         <NA>
## 77      0     PALESTINE, JEWISH AGENCY
##                                                                                                                                                                                                                                    descr
## NA                                                                                                                                                                                                                                  <NA>
## 28                                                          TO ADOPT THE TRUSTEESHIP AGREEMENT FOR FRENCH TOGOLAND SUBMITTED BY FRANCE; SAID AGREEMENT APPROVED IN 4TH COMM DRAFT RESOL. (A/258 AND ADD. 1, CORR. 2, CORR. 3 AND REV.1).
## 32                                                          TO ADOPT THE TRUSTEESHIP AGREEMENT FOR BRITISH TOGOLAND SUBMITTED BY THE U.K.; AGREEMENT WAS APPROVED IN 4TH COMM. DRAFT RESOL. (A/258 AND ADD. 1, CORR. 2, CORR. 3, REV.1).
## 75                                                    TO ADOPT THE REPORT AND RESOLUTIONS A AND B (A/516) OF THE AD HOC COMM. ON THE PALESTINE QUESTION, PROVIDING AND APPROVING A PLAN OF PARTITION WITH ECONOMIC UNION, FOR PALESTINE.
## NA.1                                                                                                                                                                                                                                <NA>
## 77   TO ADOPT POLISH RESOLUTION (A/BUR/80) AS AMENDED BY CZECHOSLOVAKIA, DECIDING TO INVITE THE JEWISH AGENCY FOR PALESTINE TO STATE THE POINT OF VIEW OF THE JEWISH PEOPLE ON PALESTINE BEFORE THE GENERAL ASSEMBLY IN PLENARY MEETING.
##      me nu di hr co   ec
## NA   NA NA NA NA NA <NA>
## 28    1  0  0  0  1    0
## 32    1  0  0  0  1    0
## 75    1  0  0  0  0    1
## NA.1 NA NA NA NA NA <NA>
## 77    1  0  0  0  0    0

I personally never use the read.csv() function, however, since there is an alternative function called fread() that is dramatically faster than read.csv() (often by orders of magnitude on large files). BUT since fread() is not a “built-in” R function, it requires us to download, install, and load the data.table package. Thus we start our first foray into downloading non-built-in R packages…

The general procedure, if you know which package contains the function you want, is as follows:

  1. In the Console (not in a .r or .Rmd file, since we’ll only ever have to do this once, whereas those files are for storing code you may want to run multiple times), type install.packages(<package name>), replacing <package name> with the name of the package you want to download. R Studio should automatically start downloading, and (hopefully) let you know when everything is complete. This means that the package is downloaded and installed onto your hard drive, BUT you still need to tell R that you actually want to use the package (if it wasn’t set up like this, R would have to have all packages on your hard drive loaded into memory at all times, which would make your computer slow to a crawl… thus we tell it specifically which packages to “activate”).
  2. To actually load the package into memory, type and run library(<package name>) (in either the Console or a code file). Once this has been run in a session, you don’t need to run it again, but you will have to run it each time you close and re-open R.

Lastly, if there is a newer version of a package you’ve already downloaded, you can update it by re-running install.packages(<package name>). If you want to update all of your packages at once, just run update.packages().
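Putting the procedure together for fread(), here’s a minimal sketch (the file path just reuses the UN votes .csv from above, and the install line only needs to be run once, in the Console):

# Run once, in the Console (downloads and installs the data.table package):
# install.packages("data.table")

# Run once per session, to load the package into memory:
library(data.table)

# fread() is used much like read.csv(), just (much) faster on big files
unga_res_fast <- fread("datasets/UNGA_votes/DescriptionsUN1-72.csv")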

Viewing and Editing Your Dataset

The viewing part is easy: just click the little spreadsheet icon on the righthand side of the Environment Browser, in the row corresponding to your dataframe.

Editing, on the other hand, is far less straightforward, but R Studio provides a (very) bare-bones interface allowing you to both view and modify the data frame: the edit() function:

my_data <- data.frame(x=c(1,2,3),y=c("a","b","c"))
my_data <- edit(my_data)

Note that if you don’t take the output of edit(my_data) and store it back into my_data, you will be editing the values in the spreadsheet but literally everything you do will be thrown out right when you close the spreadsheet window… Another great R Studio design choice.

Last But Not Least: Clearing the Workspace

rm(list = ls())

[Note that I’ve set the eval code-cell option to FALSE, since I don’t actually need/want R to clear out all my variables at the moment.]

To understand this line we need to break it into two parts: first, we note that the ls() function simply gives a list of all the variables R is keeping track of – in other words, all of the variables you’ve made since you started your R session (or even before that if you chose to load the workspace from the prior session). So then the rm() function takes a single keyword argument, list, and then clears out everything you give it from the workspace. Thus, if we give it all the variables in our workspace, as produced by ls(), we clear the entire thing.
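In case it helps, here’s the same idea broken into steps (don’t actually run this unless you want things deleted):

# List the variables currently in your workspace (output depends on your session)
ls()
# Remove just one specific variable
rm(my_data)
# Remove *everything* ls() returns, i.e., clear the whole workspace
rm(list = ls())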

Bibliography

Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. 2015. “A Neural Algorithm of Artistic Style.” arXiv:1508.06576 [Cs, Q-Bio], August. http://arxiv.org/abs/1508.06576.

Nielsen, Richard Alexander. 2013. “The Lonely Jihadist: Weak Networks and the Radicalization of Muslim Clerics,” September. https://dash.harvard.edu/handle/1/11124850.

Rawls, John. 1951. “Outline of a Decision Procedure for Ethics.” The Philosophical Review 60 (2): 177–97. doi:10.2307/2181696.

Rudolph, Maja, and David Blei. 2018. “Dynamic Embeddings for Language Evolution.” In Proceedings of the 2018 World Wide Web Conference, 1003–11. WWW ’18. Republic; Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. doi:10.1145/3178876.3185999.