Tutorial

Randomly Sampling your Data with slice_sample()

There are times when it is helpful to have a representative random sample of your data. Sometimes a smaller dataset makes it easier to develop your code (especially when a really large dataset slows your computer down). At other times, you want to split a large dataset into a training set and a testing set for modeling. The idea is to randomly sample roughly 70-80% of your data to train several models, then pick the best one and test it on the remaining, complementary testing data. The training and testing sets should have no rows in common, so that you get an unbiased estimate of your model's accuracy.
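As a quick illustration of the two ways to ask for a sample, here is a minimal sketch. It uses the built-in mtcars dataset (32 rows) as a stand-in for your own data. The seed value is an arbitrary assumption; any fixed integer makes the random draw repeatable.

```r
library(dplyr)

# mtcars ships with R and stands in for your own dataset here.
set.seed(123)  # arbitrary seed (an assumption), so the same rows are drawn each run

# Sample a fixed number of rows with n
small_n <- mtcars %>%
  slice_sample(n = 10)

# Sample a fraction of the rows with prop (25% of 32 rows)
small_prop <- mtcars %>%
  slice_sample(prop = 0.25)

nrow(small_n)
nrow(small_prop)
```

Use n when you need an exact number of rows, and prop when the sample should scale with the size of the dataset.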

Exercise 1

Write the R code required to randomly sample 70 rows of the supra dataset.

supra %>% 
  ---(---)

supra %>% 
  slice_sample(n = ---)

supra %>% 
  slice_sample(n = 70)

Exercise 2

Write the R code required to randomly sample 30% of the smartpill dataset.

smartpill %>% 
  ---(---)

smartpill %>% 
  slice_sample(prop = ---)

smartpill %>% 
  slice_sample(prop = 0.3)

Exercise 3

Write the R code required to use slice_sample() to randomly sample a 75% training set from cmv, and assign it to training_set. Then use the anti_join() function to make a complementary 25% testing set from the cmv dataset. Check example 2 in the flipbook if you need help.

cmv  %>% 
  ---(???) ->
training_set

cmv %>% 
  ---(???)

cmv  %>% 
  slice_sample(???) ->
training_set

cmv %>% 
  anti_join(???)

cmv  %>% 
  slice_sample(prop = .75) ->
training_set

cmv %>% 
  anti_join(training_set)
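When no by argument is given, anti_join() matches on every column the two tables share. If a dataset happens to contain exact duplicate rows, a duplicate of a sampled row would also be dropped from the testing set. One way to guard against this, sketched below, is to add a row identifier before sampling and join on it. The sketch uses the built-in mtcars dataset as a stand-in for cmv, and the names cars, row_id, and testing_set are hypothetical, chosen only for illustration.

```r
library(dplyr)

set.seed(2024)  # arbitrary seed (an assumption) for a repeatable split

# Add a hypothetical row_id column so every row is uniquely identifiable
cars <- mtcars %>%
  mutate(row_id = row_number())

# 75% of the rows go to the training set
cars %>%
  slice_sample(prop = 0.75) ->
training_set

# Everything not in the training set becomes the testing set
cars %>%
  anti_join(training_set, by = "row_id") ->
testing_set

# Check the split: the two sets cover all rows and share none
nrow(training_set) + nrow(testing_set) == nrow(cars)
length(intersect(training_set$row_id, testing_set$row_id)) == 0
```

Both checks should return TRUE if the split is complete and the two sets do not overlap.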