Two Ways of Sampling from a Data Frame

Extra Writeups
Author
Affiliation

Jeff Jacobs

A few students have run into issues when trying to use the sample() function, from base-R, to sample from a full data.frame or tibble. In this writeup I’ll argue that this is a case where using a function from the tidyverse called slice_sample() will make your life much easier, but I will also show how to do this sampling using only base-R functions.

Before we start, we make sure to use set.seed(5100) at the beginning, so that your grader gets the same results as you do even when working with random processes!

set.seed(5100)

Creating a Deck of Cards Using expand.grid()

This is done as was introduced in the Bootcamp:

Code
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(ranks, suits)
colnames(deck_df) <- c("Rank", "Suit")
head(deck_df)
Rank Suit
Ace Hearts
Two Hearts
Three Hearts
Four Hearts
Five Hearts
Six Hearts

And we can check the dimensions of deck_df just to make sure it created the right number of cards:

dim(deck_df)
[1] 52  2

Now, since sampling without replacement is the default case for both sample functions, to illustrate how to use parameters to these functions I will be sampling with replacement.

Using tidyverse to Sample WITH Replacement

The following code uses the pipe operator %>% to take the data.frame, deck_df, and “pipe it into” the slice_sample() function from the tidyverse. We have to provide two arguments:

  • n: The number of samples we’d like to take, and
  • replace: If set to TRUE, the sampling is performed with replacement. Otherwise (the default), the sampling is performed without replacement.
Code
library(tidyverse)
deck_sample_df <- deck_df |> slice_sample(n = 12, replace = TRUE)
deck_sample_df
Rank Suit
Seven Spades
Two Clubs
Ace Hearts
Eight Diamonds
Ace Clubs
Five Diamonds
Five Clubs
Nine Spades
Eight Spades
Seven Clubs
Five Spades
King Diamonds

Here we can confirm that it sampled with replacement since we see that it selected the Jack of Spades twice (once in slot 4 and once in slot 12).

Note that, although using the pipe operator %>% is the “standard” way to use tidyverse functions, you can still use the functions without using the pipe operator (long story short, the pipe operator just takes whatever comes before the %>% and “plugs it in” as the first argument to the function that comes after the %>%), by specifying the first argument to the slice_sample() function explicitly:

deck_sample_df <- slice_sample(deck_df, n = 12, replace = TRUE)
deck_sample_df
Rank Suit
Nine Hearts
Ten Diamonds
Nine Hearts
Jack Hearts
Jack Clubs
King Spades
Eight Spades
Ace Clubs
Eight Diamonds
Five Diamonds
Ace Hearts
Four Clubs

To me, one nice aspect of slice_sample() over other base-R functions is (among other things) it ensures that the column names are maintained when you sample, which is not always true for the base-R functions. It’s also possible to do in base-R (without using tidyverse libraries/functions), though, just less straightforwardly.

Using Base-R to Sample WITH Replacement

First off, note that just applying sample() to the deck will not produce the outcome we expect, or want, which probably unfortunately goes against our intuitions for how this function should work:

sample(deck_df, 5, replace = TRUE)
Suit Suit.1 Rank Suit.2 Suit.3
Hearts Hearts Ace Hearts Hearts
Hearts Hearts Two Hearts Hearts
Hearts Hearts Three Hearts Hearts
Hearts Hearts Four Hearts Hearts
Hearts Hearts Five Hearts Hearts
Hearts Hearts Six Hearts Hearts
Hearts Hearts Seven Hearts Hearts
Hearts Hearts Eight Hearts Hearts
Hearts Hearts Nine Hearts Hearts
Hearts Hearts Ten Hearts Hearts
Hearts Hearts Jack Hearts Hearts
Hearts Hearts Queen Hearts Hearts
Hearts Hearts King Hearts Hearts
Diamonds Diamonds Ace Diamonds Diamonds
Diamonds Diamonds Two Diamonds Diamonds
Diamonds Diamonds Three Diamonds Diamonds
Diamonds Diamonds Four Diamonds Diamonds
Diamonds Diamonds Five Diamonds Diamonds
Diamonds Diamonds Six Diamonds Diamonds
Diamonds Diamonds Seven Diamonds Diamonds
Diamonds Diamonds Eight Diamonds Diamonds
Diamonds Diamonds Nine Diamonds Diamonds
Diamonds Diamonds Ten Diamonds Diamonds
Diamonds Diamonds Jack Diamonds Diamonds
Diamonds Diamonds Queen Diamonds Diamonds
Diamonds Diamonds King Diamonds Diamonds
Clubs Clubs Ace Clubs Clubs
Clubs Clubs Two Clubs Clubs
Clubs Clubs Three Clubs Clubs
Clubs Clubs Four Clubs Clubs
Clubs Clubs Five Clubs Clubs
Clubs Clubs Six Clubs Clubs
Clubs Clubs Seven Clubs Clubs
Clubs Clubs Eight Clubs Clubs
Clubs Clubs Nine Clubs Clubs
Clubs Clubs Ten Clubs Clubs
Clubs Clubs Jack Clubs Clubs
Clubs Clubs Queen Clubs Clubs
Clubs Clubs King Clubs Clubs
Spades Spades Ace Spades Spades
Spades Spades Two Spades Spades
Spades Spades Three Spades Spades
Spades Spades Four Spades Spades
Spades Spades Five Spades Spades
Spades Spades Six Spades Spades
Spades Spades Seven Spades Spades
Spades Spades Eight Spades Spades
Spades Spades Nine Spades Spades
Spades Spades Ten Spades Spades
Spades Spades Jack Spades Spades
Spades Spades Queen Spades Spades
Spades Spades King Spades Spades

A way to avoid this is to make sure that you are using the sample() function NOT on the entire data.frame object, but just to select a subset of the rows of the data.frame, like the following:

deck_df[sample(nrow(deck_df), 15, replace = TRUE),]
Rank Suit
15 Two Diamonds
16 Three Diamonds
37 Jack Clubs
8 Eight Hearts
32 Six Clubs
51 Queen Spades
42 Three Spades
34 Eight Clubs
26 King Diamonds
10 Ten Hearts
19 Six Diamonds
14 Ace Diamonds
7 Seven Hearts
49 Ten Spades
22 Nine Diamonds

First off, notice how here we can again confirm that it sampled with replacement since it had to create additional ids like 34.1 and 34.2 to represent the fact that card #34 (the Eight of Clubs) ended up in our sample 3 times.

Also note how, rather than sampling from the data.frame, which may be intuitively/linguistically how we would describe what we want, we are actually sampling from the set of indices of the data.frame, then asking R to give us the rows corresponding to those sampled indices. Concretely, to see what’s going on, let’s just look at the row filter we’ve provided (the portion of the full code that is within the square brackets [], before the comma):

sample(nrow(deck_df), 15, replace = TRUE)
 [1] 37 42 28 29 45 16 20 41 34 25 44 37 35 29 48

We see that, in fact, we are not really sampling from the data.frame itself, so much as sampling from a list of its indices (from 1 to 52), and then after performing this sample we are going and asking R to give us the rows at the indices that ended up in this sample. Keeping this distinction in mind (between the rows themselves and their indices) can be helpful for debugging code like this.