set.seed(5100)
Two Ways of Sampling from a Data Frame
A few students have run into issues when trying to use the base-R sample() function to sample rows from a full data.frame or tibble. In this writeup I’ll argue that this is a case where using the tidyverse function slice_sample() will make your life much easier, but I will also show how to do this sampling using only base-R functions.
Before we start, we make sure to call set.seed(5100) at the beginning, so that your grader gets the same results as you do even when working with random processes!
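As a small illustration of why the seed matters, the sketch below resets the seed before each of two draws and checks that they match. The variable names first_draw and second_draw are only for this example.

```r
# Setting the same seed before each draw restarts the random number stream,
# so both calls to sample() return exactly the same values.
set.seed(5100)
first_draw <- sample(1:52, 5)

set.seed(5100)
second_draw <- sample(1:52, 5)

identical(first_draw, second_draw)
```

Without the second set.seed() call, the two draws would almost certainly differ.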
Creating a Deck of Cards Using expand.grid()
We build the deck the same way it was introduced in the Bootcamp:
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
# expand.grid() builds every (rank, suit) combination, one per row
deck_df <- expand.grid(ranks, suits)
colnames(deck_df) <- c("Rank", "Suit")
head(deck_df)
head(deck_df)
Rank | Suit |
---|---|
Ace | Hearts |
Two | Hearts |
Three | Hearts |
Four | Hearts |
Five | Hearts |
Six | Hearts |
And we can check the dimensions of deck_df just to make sure it created the right number of cards:
dim(deck_df)
[1] 52 2
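Beyond the dimensions, a couple of extra sanity checks are possible. The sketch below is one optional way to do this; it rebuilds the deck so that it runs on its own, and it passes the column names directly to expand.grid() as named arguments, which is a shortcut not used in the code above.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")

# Named arguments become the column names, so no colnames() call is needed
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# anyDuplicated() returns 0 when no row is repeated
anyDuplicated(deck_df)

# Each suit should contain 13 cards
table(deck_df$Suit)
```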
Now, since sampling without replacement is the default for both sampling functions (sample() and slice_sample()), I will sample with replacement below, to illustrate how to pass arguments to these functions.
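To make the difference concrete for the base-R sample() function only, here is a minimal sketch using plain indices from 1 to 52. The names idx_without, too_many, and idx_with are made up for this example.

```r
set.seed(5100)

# Default (replace = FALSE): drawing all 52 indices is just a shuffle,
# so no index can appear twice.
idx_without <- sample(52, 52)
anyDuplicated(idx_without)

# Asking for more draws than there are indices fails without replacement.
# tryCatch() captures the error message instead of stopping the script.
too_many <- tryCatch(sample(52, 60), error = function(e) conditionMessage(e))
too_many

# With replace = TRUE the same request works, and repeats must occur
# because 60 draws cannot all be distinct among 52 indices.
idx_with <- sample(52, 60, replace = TRUE)
length(idx_with)
```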
Using tidyverse to Sample WITH Replacement
The following code uses a pipe operator to take the data.frame, deck_df, and “pipe it into” the slice_sample() function from the tidyverse. (The code uses R’s native pipe |>; the tidyverse pipe %>% behaves the same way here.) We have to provide two arguments:
- n: The number of samples we’d like to take, and
- replace: If set to TRUE, the sampling is performed with replacement. Otherwise (the default), the sampling is performed without replacement.
library(tidyverse)
deck_sample_df <- deck_df |> slice_sample(n = 12, replace = TRUE)
deck_sample_df
Rank | Suit |
---|---|
Seven | Spades |
Two | Clubs |
Ace | Hearts |
Eight | Diamonds |
Ace | Clubs |
Five | Diamonds |
Five | Clubs |
Nine | Spades |
Eight | Spades |
Seven | Clubs |
Five | Spades |
King | Diamonds |
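Because we sampled with replacement, the same card can appear more than once in the result. The 12 cards printed above happen to be distinct, but repeats do occur: in the second sample shown further below, the Nine of Hearts appears twice (rows 1 and 3).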
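Rather than scanning the output by eye, repeated cards can also be found programmatically. The sketch below uses a small hand-made data frame, toy_sample, as a hypothetical stand-in for a with-replacement sample; the same base-R functions work on any data frame, including the output of slice_sample().

```r
# Hypothetical stand-in for a sample drawn with replacement
toy_sample <- data.frame(
  Rank = c("Nine", "Ten", "Nine", "Jack"),
  Suit = c("Hearts", "Diamonds", "Hearts", "Hearts")
)

# Position of the first row that repeats an earlier row (0 if there is none)
first_repeat <- anyDuplicated(toy_sample)
first_repeat

# The rows that repeat an earlier row
repeated_rows <- toy_sample[duplicated(toy_sample), ]
repeated_rows
```

Here the third row repeats the first one, so first_repeat is 3 and repeated_rows contains the Nine of Hearts.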
Note that, although using a pipe operator (the tidyverse’s %>%, or the native |> used above) is the “standard” way to use tidyverse functions, you can still use these functions without it. Long story short, the pipe operator just takes whatever comes before it and “plugs it in” as the first argument to the function that comes after it, so you can get the same result by specifying the first argument to slice_sample() explicitly:
deck_sample_df <- slice_sample(deck_df, n = 12, replace = TRUE)
deck_sample_df
Rank | Suit |
---|---|
Nine | Hearts |
Ten | Diamonds |
Nine | Hearts |
Jack | Hearts |
Jack | Clubs |
King | Spades |
Eight | Spades |
Ace | Clubs |
Eight | Diamonds |
Five | Diamonds |
Ace | Hearts |
Four | Clubs |
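The “plugging in” behavior of the pipe can be checked with a deterministic function, so that randomness does not get in the way of the comparison. This sketch assumes R 4.1 or later (needed for the native |> pipe) and uses a tiny made-up data frame, small_df, with head() in place of slice_sample().

```r
# Tiny made-up data frame, just for this comparison
small_df <- expand.grid(Rank = c("Ace", "Two", "Three"),
                        Suit = c("Hearts", "Spades"))

# Piped form: small_df becomes the first argument of head()
piped <- small_df |> head(n = 3)

# Explicit form: the first argument is written out
direct <- head(small_df, n = 3)

identical(piped, direct)
```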
To me, one nice aspect of slice_sample() compared with the base-R functions is (among other things) that it ensures the column names are maintained when you sample, which is not always true for the base-R functions. The same sampling is also possible in base-R (without using tidyverse libraries/functions), though, just less straightforwardly.
Using Base-R to Sample WITH Replacement
First off, note that just applying sample() to the deck will not produce the outcome we expect or want, which unfortunately goes against our intuitions for how this function should work:
sample(deck_df, 5, replace = TRUE)
Suit | Suit.1 | Rank | Suit.2 | Suit.3 |
---|---|---|---|---|
Hearts | Hearts | Ace | Hearts | Hearts |
Hearts | Hearts | Two | Hearts | Hearts |
Hearts | Hearts | Three | Hearts | Hearts |
Hearts | Hearts | Four | Hearts | Hearts |
Hearts | Hearts | Five | Hearts | Hearts |
Hearts | Hearts | Six | Hearts | Hearts |
Hearts | Hearts | Seven | Hearts | Hearts |
Hearts | Hearts | Eight | Hearts | Hearts |
Hearts | Hearts | Nine | Hearts | Hearts |
Hearts | Hearts | Ten | Hearts | Hearts |
Hearts | Hearts | Jack | Hearts | Hearts |
Hearts | Hearts | Queen | Hearts | Hearts |
Hearts | Hearts | King | Hearts | Hearts |
Diamonds | Diamonds | Ace | Diamonds | Diamonds |
Diamonds | Diamonds | Two | Diamonds | Diamonds |
Diamonds | Diamonds | Three | Diamonds | Diamonds |
Diamonds | Diamonds | Four | Diamonds | Diamonds |
Diamonds | Diamonds | Five | Diamonds | Diamonds |
Diamonds | Diamonds | Six | Diamonds | Diamonds |
Diamonds | Diamonds | Seven | Diamonds | Diamonds |
Diamonds | Diamonds | Eight | Diamonds | Diamonds |
Diamonds | Diamonds | Nine | Diamonds | Diamonds |
Diamonds | Diamonds | Ten | Diamonds | Diamonds |
Diamonds | Diamonds | Jack | Diamonds | Diamonds |
Diamonds | Diamonds | Queen | Diamonds | Diamonds |
Diamonds | Diamonds | King | Diamonds | Diamonds |
Clubs | Clubs | Ace | Clubs | Clubs |
Clubs | Clubs | Two | Clubs | Clubs |
Clubs | Clubs | Three | Clubs | Clubs |
Clubs | Clubs | Four | Clubs | Clubs |
Clubs | Clubs | Five | Clubs | Clubs |
Clubs | Clubs | Six | Clubs | Clubs |
Clubs | Clubs | Seven | Clubs | Clubs |
Clubs | Clubs | Eight | Clubs | Clubs |
Clubs | Clubs | Nine | Clubs | Clubs |
Clubs | Clubs | Ten | Clubs | Clubs |
Clubs | Clubs | Jack | Clubs | Clubs |
Clubs | Clubs | Queen | Clubs | Clubs |
Clubs | Clubs | King | Clubs | Clubs |
Spades | Spades | Ace | Spades | Spades |
Spades | Spades | Two | Spades | Spades |
Spades | Spades | Three | Spades | Spades |
Spades | Spades | Four | Spades | Spades |
Spades | Spades | Five | Spades | Spades |
Spades | Spades | Six | Spades | Spades |
Spades | Spades | Seven | Spades | Spades |
Spades | Spades | Eight | Spades | Spades |
Spades | Spades | Nine | Spades | Spades |
Spades | Spades | Ten | Spades | Spades |
Spades | Spades | Jack | Spades | Spades |
Spades | Spades | Queen | Spades | Spades |
Spades | Spades | King | Spades | Spades |
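To see what sample() was actually drawing from in the call above, it helps to look at the length of the data frame. The sketch below rebuilds the deck so it runs on its own; col_sample is a name made up for this example.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# A data.frame is a list with one element per column, so its length is 2
length(deck_df)

# sample() therefore picks 5 of those 2 columns (with replacement),
# keeping all 52 rows in every selected column
set.seed(5100)
col_sample <- sample(deck_df, 5, replace = TRUE)
dim(col_sample)
```

The result has 52 rows and 5 columns, which matches the shape of the table printed above.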
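The reason is that sample() treats the data.frame as a list of its columns, so it drew 5 columns rather than 5 rows. A way to avoid this is to make sure that you are applying sample() NOT to the entire data.frame object, but to its row indices, and then using the result to select a subset of the rows of the data.frame, like the following: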
deck_df[sample(nrow(deck_df), 15, replace = TRUE),]
Row | Rank | Suit |
---|---|---|
15 | Two | Diamonds |
16 | Three | Diamonds |
37 | Jack | Clubs |
8 | Eight | Hearts |
32 | Six | Clubs |
51 | Queen | Spades |
42 | Three | Spades |
34 | Eight | Clubs |
26 | King | Diamonds |
10 | Ten | Hearts |
19 | Six | Diamonds |
14 | Ace | Diamonds |
7 | Seven | Hearts |
49 | Ten | Spades |
22 | Nine | Diamonds |
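First off, note that when a row is drawn more than once, R keeps the row names unique by appending suffixes: if card #34 (the Eight of Clubs) ended up in our sample 3 times, its copies would be labeled 34, 34.1, and 34.2. Seeing such suffixes confirms that the sampling was done with replacement. (In the particular output shown above, every row happened to be drawn only once, so no suffixes appear.)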
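The row-name suffixes can be reproduced deterministically by asking for the same row several times on purpose. The sketch below rebuilds the deck so it runs on its own; three_copies and copy_names are names made up for this example.

```r
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(Rank = ranks, Suit = suits)

# Request row 34 three times, as a with-replacement sample might
three_copies <- deck_df[c(34, 34, 34), ]

# R makes the repeated row names unique: "34" "34.1" "34.2"
copy_names <- rownames(three_copies)
copy_names

# Optional: reset the row names to 1, 2, 3 if the suffixes are distracting
rownames(three_copies) <- NULL
three_copies
```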
Also note how, rather than sampling from the data.frame, which may be intuitively/linguistically how we would describe what we want, we are actually sampling from the set of indices of the data.frame, then asking R to give us the rows corresponding to those sampled indices. Concretely, to see what’s going on, let’s just look at the row filter we’ve provided (the portion of the full code that is within the square brackets [], before the comma):
sample(nrow(deck_df), 15, replace = TRUE)
[1] 37 42 28 29 45 16 20 41 34 25 44 37 35 29 48
We see that, in fact, we are not really sampling from the data.frame itself, so much as sampling from the vector of its row indices (from 1 to 52), and then asking R to give us the rows at the indices that ended up in this sample. Because the draws are random, these indices differ from the ones behind the table above. Keeping this distinction in mind (between the rows themselves and their indices) can be helpful for debugging code like this.
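The two steps described above (sample the indices, then select the rows) can be wrapped in a small helper. The function sample_rows() below is hypothetical, written only for this illustration; it is not part of base-R or the tidyverse.

```r
# Hypothetical helper: sample rows of a data frame by sampling its row indices
sample_rows <- function(df, n, replace = FALSE) {
  # Step 1: sample from the indices 1..nrow(df), not from df itself
  idx <- sample(nrow(df), n, replace = replace)
  # Step 2: select the rows at those indices, keeping all columns
  out <- df[idx, , drop = FALSE]
  # Reset row names so repeated rows do not show suffixes like 34.1
  rownames(out) <- NULL
  out
}

ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
           "Nine", "Ten", "Jack", "Queen", "King")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck_df <- expand.grid(Rank = ranks, Suit = suits)

set.seed(5100)
hand <- sample_rows(deck_df, 15, replace = TRUE)
hand
```

Resetting the row names is a design choice: it hides which original row each card came from, so drop that line if you want to keep the original indices visible.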