Two Ways of Sampling from a Data Frame

Extra Writeups

Author

Affiliation

Jeff Jacobs

jj1088@georgetown.edu

A few students have run into issues when trying to use the base-R sample() function to sample rows from a full data.frame or tibble. In this writeup I’ll argue that this is a case where using a tidyverse function called slice_sample() will make your life much easier, but I will also show how to do this sampling using only base-R functions.

Before we start, we call set.seed(5100) at the beginning, so that your grader gets the same results as you do even when working with random processes!

set.seed(5100)
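
As a quick illustration of why the seed matters, here is a minimal sketch (separate from the deck example that follows) that resets the seed before each of two draws and checks that they match. The variable names first_draw and second_draw are just placeholders for this illustration.

```r
# Resetting the seed before each draw makes the "random" result repeatable
set.seed(5100)
first_draw <- sample(1:52, 5)

set.seed(5100)
second_draw <- sample(1:52, 5)

# TRUE: both draws started from the same seed
identical(first_draw, second_draw)
```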

Creating a Deck of Cards Using `expand.grid()`

We build the deck the same way it was introduced in the Bootcamp:

ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(ranks, suits)
colnames(deck_df) <- c("Rank", "Suit")
head(deck_df)

Rank	Suit
Ace	Hearts
Two	Hearts
Three	Hearts
Four	Hearts
Five	Hearts
Six	Hearts

And we can check the dimensions of deck_df just to make sure it created the right number of cards:

dim(deck_df)

[1] 52  2
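
The 52 comes from expand.grid() pairing every rank with every suit. The sketch below checks that arithmetic directly; naming the arguments (Rank = ..., Suit = ...) is just a shortcut I am using here in place of the separate colnames() call, not a change to the approach above.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")

# Named arguments become the column names of the result
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# One row per (rank, suit) combination: 13 * 4 = 52
nrow(deck_df) == length(ranks) * length(suits)

# anyDuplicated() returns 0 when no row is repeated
anyDuplicated(deck_df) == 0
```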

Sampling without replacement is the default for both sample() and slice_sample(). So, to illustrate how to pass arguments to these functions, I will be sampling with replacement instead.
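
To make the difference concrete, here is a small base-R sketch on plain indices (not yet on the deck itself). It shows that a draw without replacement never repeats an index, and that asking for more draws than there are cards only works once replace = TRUE is set.

```r
# Without replacement (the default), each index is drawn at most once
no_repeat <- sample(52, 52)
anyDuplicated(no_repeat) == 0   # TRUE

# Asking for 60 draws from 52 indices fails unless replace = TRUE
too_many <- tryCatch(sample(52, 60), error = function(e) "error")
too_many                        # "error"

# With replacement the same request succeeds (and must contain repeats)
with_repeat <- sample(52, 60, replace = TRUE)
length(with_repeat)             # 60
```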

Using `tidyverse` to Sample WITH Replacement

The following code uses the pipe operator |> to take the data.frame, deck_df, and “pipe it into” the slice_sample() function from the tidyverse (the tidyverse's own pipe operator, %>%, works the same way here). We have to provide two arguments:

n: The number of samples we’d like to take, and
replace: If set to TRUE, the sampling is performed with replacement. Otherwise (the default), the sampling is performed without replacement.

library(tidyverse)
deck_sample_df <- deck_df |> slice_sample(n = 12, replace = TRUE)
deck_sample_df

Rank	Suit
Seven	Spades
Two	Clubs
Ace	Hearts
Eight	Diamonds
Ace	Clubs
Five	Diamonds
Five	Clubs
Nine	Spades
Eight	Spades
Seven	Clubs
Five	Spades
King	Diamonds

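Because we set replace = TRUE, the same card is allowed to appear more than once in the sample. Whether a repeat actually shows up depends on the random draw: in the 12 cards shown above every card happens to be distinct, which is still a valid outcome when sampling with replacement.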

Note that, although using a pipe operator is the “standard” way to use tidyverse functions, you can still use the functions without it. Long story short, the pipe just takes whatever comes before the |> and “plugs it in” as the first argument to the function that comes after it. So we can get the same effect by specifying the first argument to the slice_sample() function explicitly:

deck_sample_df <- slice_sample(deck_df, n = 12, replace = TRUE)
deck_sample_df

Rank	Suit
Nine	Hearts
Ten	Diamonds
Nine	Hearts
Jack	Hearts
Jack	Clubs
King	Spades
Eight	Spades
Ace	Clubs
Eight	Diamonds
Five	Diamonds
Ace	Hearts
Four	Clubs
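
To see the “plugging in” without any randomness getting in the way, here is a minimal sketch using the base-R head() function on a tiny stand-in data.frame (small_df is made up for this illustration). It assumes R 4.1 or later, where the |> operator is available. The same equivalence holds for slice_sample(), provided both calls start from the same seed.

```r
# A tiny stand-in for the deck, just for this illustration
small_df <- expand.grid(Rank = c("Ace", "Two", "Three"),
                        Suit = c("Hearts", "Spades"))

# The pipe passes small_df as the first argument to head()
piped  <- small_df |> head(2)
direct <- head(small_df, 2)

# TRUE: the two calls are the same thing written two ways
identical(piped, direct)
```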

In this second draw we can see the replacement at work: the Nine of Hearts appears twice (in slots 1 and 3). To me, one nice aspect of slice_sample() compared with the base-R functions is that (among other things) it ensures that the column names are maintained when you sample, which is not always true for the base-R functions. The same sampling is also possible in base-R (without using tidyverse libraries/functions), though, just less straightforwardly.
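
One situation where base-R subsetting can lose the column name is a data.frame with a single column. This may or may not be the exact case the remark above has in mind, so treat the following as one illustrative sketch; ranks_df is a made-up example, not part of the deck code.

```r
# A one-column data.frame (hypothetical example)
ranks_df <- data.frame(Rank = c("Ace", "Two", "Three", "Four"))
idx <- sample(nrow(ranks_df), 3, replace = TRUE)

# Plain row subsetting collapses the single column into a bare vector
dropped <- ranks_df[idx, ]
is.data.frame(dropped)   # FALSE

# drop = FALSE keeps the data.frame structure and the column name
kept <- ranks_df[idx, , drop = FALSE]
is.data.frame(kept)      # TRUE
names(kept)              # "Rank"
```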

Using Base-R to Sample WITH Replacement

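First off, note that applying sample() directly to the deck will not produce the outcome we expect or want, which unfortunately goes against our intuitions for how this function should work. A data.frame is stored as a list of columns, so sample() draws from its 2 columns rather than from its 52 rows: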

sample(deck_df, 5, replace = TRUE)

Suit	Suit.1	Rank	Suit.2	Suit.3
Hearts	Hearts	Ace	Hearts	Hearts
Hearts	Hearts	Two	Hearts	Hearts
Hearts	Hearts	Three	Hearts	Hearts
Hearts	Hearts	Four	Hearts	Hearts
Hearts	Hearts	Five	Hearts	Hearts
Hearts	Hearts	Six	Hearts	Hearts
Hearts	Hearts	Seven	Hearts	Hearts
Hearts	Hearts	Eight	Hearts	Hearts
Hearts	Hearts	Nine	Hearts	Hearts
Hearts	Hearts	Ten	Hearts	Hearts
Hearts	Hearts	Jack	Hearts	Hearts
Hearts	Hearts	Queen	Hearts	Hearts
Hearts	Hearts	King	Hearts	Hearts
Diamonds	Diamonds	Ace	Diamonds	Diamonds
Diamonds	Diamonds	Two	Diamonds	Diamonds
Diamonds	Diamonds	Three	Diamonds	Diamonds
Diamonds	Diamonds	Four	Diamonds	Diamonds
Diamonds	Diamonds	Five	Diamonds	Diamonds
Diamonds	Diamonds	Six	Diamonds	Diamonds
Diamonds	Diamonds	Seven	Diamonds	Diamonds
Diamonds	Diamonds	Eight	Diamonds	Diamonds
Diamonds	Diamonds	Nine	Diamonds	Diamonds
Diamonds	Diamonds	Ten	Diamonds	Diamonds
Diamonds	Diamonds	Jack	Diamonds	Diamonds
Diamonds	Diamonds	Queen	Diamonds	Diamonds
Diamonds	Diamonds	King	Diamonds	Diamonds
Clubs	Clubs	Ace	Clubs	Clubs
Clubs	Clubs	Two	Clubs	Clubs
Clubs	Clubs	Three	Clubs	Clubs
Clubs	Clubs	Four	Clubs	Clubs
Clubs	Clubs	Five	Clubs	Clubs
Clubs	Clubs	Six	Clubs	Clubs
Clubs	Clubs	Seven	Clubs	Clubs
Clubs	Clubs	Eight	Clubs	Clubs
Clubs	Clubs	Nine	Clubs	Clubs
Clubs	Clubs	Ten	Clubs	Clubs
Clubs	Clubs	Jack	Clubs	Clubs
Clubs	Clubs	Queen	Clubs	Clubs
Clubs	Clubs	King	Clubs	Clubs
Spades	Spades	Ace	Spades	Spades
Spades	Spades	Two	Spades	Spades
Spades	Spades	Three	Spades	Spades
Spades	Spades	Four	Spades	Spades
Spades	Spades	Five	Spades	Spades
Spades	Spades	Six	Spades	Spades
Spades	Spades	Seven	Spades	Spades
Spades	Spades	Eight	Spades	Spades
Spades	Spades	Nine	Spades	Spades
Spades	Spades	Ten	Spades	Spades
Spades	Spades	Jack	Spades	Spades
Spades	Spades	Queen	Spades	Spades
Spades	Spades	King	Spades	Spades
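
The shape of that output can be checked directly. The sketch below rebuilds the deck (with named arguments to expand.grid(), as a shortcut) and confirms that sample() treats the data.frame as having length 2, so the result keeps all 52 rows and has 5 sampled columns.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# As far as sample() is concerned, the deck has 2 elements: its columns
length(deck_df)

# So we get 5 columns drawn with replacement, each still 52 rows long
col_sample <- sample(deck_df, 5, replace = TRUE)
dim(col_sample)
```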

A way to avoid this is to use the sample() function NOT on the entire data.frame object, but on its row indices, and then use the sampled indices to select a subset of the rows of the data.frame, like the following:

deck_df[sample(nrow(deck_df), 15, replace = TRUE),]

	Rank	Suit
15	Two	Diamonds
16	Three	Diamonds
37	Jack	Clubs
8	Eight	Hearts
32	Six	Clubs
51	Queen	Spades
42	Three	Spades
34	Eight	Clubs
26	King	Diamonds
10	Ten	Hearts
19	Six	Diamonds
14	Ace	Diamonds
7	Seven	Hearts
49	Ten	Spades
22	Nine	Diamonds

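First off, notice that the row ids on the left give us a way to check for repeats. R keeps row names unique by appending suffixes, so if card #34 (the Eight of Clubs) ended up in our sample 3 times, it would appear with the ids 34, 34.1, and 34.2. In the draw shown above all 15 ids happen to be distinct, which is still a possible outcome when sampling with replacement.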
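
To see those suffixes without waiting for a lucky draw, here is a small sketch that deliberately asks for row 34 three times. The deck is rebuilt with named arguments to expand.grid() so the block runs on its own.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# Request the same row three times, as a with-replacement sample might
repeated <- deck_df[c(34, 34, 34), ]

# R disambiguates the repeated row names with numeric suffixes
rownames(repeated)   # "34" "34.1" "34.2"
```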

Also note how, rather than sampling from the data.frame, which may be intuitively/linguistically how we would describe what we want, we are actually sampling from the set of indices of the data.frame, then asking R to give us the rows corresponding to those sampled indices. Concretely, to see what’s going on, let’s just look at the row filter we’ve provided (the portion of the full code that is within the square brackets [], before the comma):

sample(nrow(deck_df), 15, replace = TRUE)

 [1] 37 42 28 29 45 16 20 41 34 25 44 37 35 29 48

This is a fresh draw, so the indices differ from the rows shown earlier, but it makes the mechanics visible: we are not really sampling from the data.frame itself so much as sampling from a vector of its indices (from 1 to 52), and only afterwards asking R to give us the rows at the indices that ended up in this sample. Here the indices 37 and 29 each appear twice, which confirms that the sampling was done with replacement. Keeping this distinction in mind (between the rows themselves and their indices) can be helpful for debugging code like this.
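
One way to keep the two steps separate while debugging is to store the sampled indices in a variable before using them. The sketch below (idx, two_step, and one_line are just illustrative names) does this and checks that, starting from the same seed, it gives exactly the same rows as the one-line version.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# Step 1: sample row indices; Step 2: look up the rows at those indices
set.seed(5100)
idx <- sample(nrow(deck_df), 15, replace = TRUE)
two_step <- deck_df[idx, ]

# The one-line version, restarted from the same seed
set.seed(5100)
one_line <- deck_df[sample(nrow(deck_df), 15, replace = TRUE), ]

# TRUE: both versions select the same rows in the same order
identical(two_step, one_line)

# Any indices that were drawn more than once
idx[duplicated(idx)]
```

Having idx on hand means you can inspect the indices (for example, with table(idx)) before looking at the rows they select.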
