
The Problem of Duplicate Data

Sometimes you get duplicate entries of the same observation, whether from data entry errors, from errors in merging data from different sources, or from data queries run on different days. It is important to identify and remove these duplicates before you do your analysis. The distinct() function from the {dplyr} package keeps only the unique rows, so the row count of its result tells you how many distinct rows you have, and the get_dupes() function from the {janitor} package will help you identify the duplicates for fixing.
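As a minimal sketch of how these two functions behave, the example below uses a small made-up table (the `visits` data and its columns are hypothetical, not part of the tutorial datasets) in which one row has been entered twice. It assumes the {dplyr} and {janitor} packages are installed.

```r
library(dplyr)
library(janitor)

# Hypothetical toy data: the row for id 2 was entered twice.
visits <- tibble(
  id  = c(1, 2, 2, 3),
  age = c(64, 71, 71, 58)
)

# Total number of rows, duplicates included.
nrow(visits)      # 4

# Keep one copy of each unique row, then count what remains.
deduped <- visits %>%
  distinct()
nrow(deduped)     # 3

# List the duplicated rows. With no columns named, get_dupes()
# compares all columns and adds a dupe_count column.
dupes <- visits %>%
  get_dupes()
dupes
```

Comparing the two row counts shows how many extra copies are present, and the get_dupes() output shows which observations they belong to, which helps when deciding whether a repeat is a true duplicate or a legitimate second record.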

Exercise 1

The example below counts the number of rows in a version of the prostate dataset which has some duplicate rows. Add the R code required to count the number of distinct rows.

prostate %>% 
  nrow()

prostate %>% 
  ---()

Solution

prostate %>% 
  nrow()

prostate %>% 
  distinct()

Exercise 2

Write the R code required to find the duplicate rows in the prostate dataset.

prostate %>% 
  ---()

Solution

prostate %>% 
  get_dupes()

Exercise 3

This version of the cmv dataset has some duplicate rows. Add the R code required to count the number of distinct rows.

cmv %>% 
  nrow()

cmv %>% 
  ---()

Solution

cmv %>% 
  nrow()

cmv %>% 
  distinct()

Exercise 4

Write the R code required to find the duplicate rows in the cmv dataset.

cmv %>% 
  ---()

Solution

cmv %>% 
  get_dupes()
