The Problem of Missing Data
Missing data are common in real-world datasets and cause many problems, particularly in modeling, where many functions silently drop incomplete rows.
Let’s take a look at three datasets in the {medicaldata} package to see if they have missing data. We will use the {visdat} package to visualize the missing data.
Prostate
First, let’s explore the missingness of the prostate dataset:
prostate %>%
  visdat::vis_dat()
You can see that, among the 316 observations, some values are missing for the variables p_vol, t_vol, and t_stage, and a few for preop_psa.
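A plot gives a quick visual impression, and a count per variable can complement it. The following is a minimal base R sketch on a small made-up data frame (toy is hypothetical and only borrows a few prostate column names for illustration; its values are invented):

```r
# Hypothetical toy data; NOT the real prostate dataset
toy <- data.frame(
  age   = c(61, 58, 70, 66),
  p_vol = c(54.2, NA, 43.0, NA),
  t_vol = c(1, 2, NA, 2)
)

# is.na() marks each missing cell; colSums() counts them per column
na_counts <- colSums(is.na(toy))
na_counts
# age: 0, p_vol: 2, t_vol: 1
```

Running the same colSums(is.na(...)) call on a real dataset should give the exact counts behind what the plot shows.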
CMV
Next, let’s explore the missingness of the cmv dataset:
cmv %>%
  visdat::vis_dat()
You can see that there are some missing data in diagnosis_type and cd8_dose, and some in cd3_dose.
Smartpill
Next, let’s explore missingness in the smartpill dataset:
smartpill %>%
  visdat::vis_dat()
You can see that many values are missing across the 95 observations for the variables that measure something other than time. The missing values cluster in particular patients, who may have used older smartpills that could measure only time, and not contractions or amplitudes.
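One way to check whether missing values cluster in the same patients is to count missing values per row. This is a minimal base R sketch on made-up data (toy and its column names are hypothetical, not the real smartpill variables):

```r
# Hypothetical toy data; column names and values are invented
toy <- data.frame(
  transit_time = c(30, 42, 28, 35, 51),
  contractions = c(12, NA, 9, NA, 14),
  amplitude    = c(3.1, NA, 2.8, NA, 3.5)
)

# Number of missing values in each row (patient)
na_per_row <- rowSums(is.na(toy))

# Tabulate how many rows have 0, 1, 2, ... missing values
na_table <- table(na_per_row)
na_table
# 3 rows have 0 missing values, 2 rows have 2 missing values
```

If rows are either complete or missing several measurements at once, as in this toy example, the missingness is likely tied to the patient (or device) rather than scattered at random.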
Exercise 1
Write the R code required to filter the prostate dataset to rows with missing data (NA) in the prostate volume (p_vol) variable.
prostate %>%
  select(age, aa, p_vol) %>%
  filter(---)
prostate %>%
  select(age, aa, p_vol) %>%
  filter(is.na(p_vol))
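The solution uses is.na() rather than a comparison such as p_vol == NA because any comparison with NA returns NA, never TRUE. A short base R sketch with an invented vector x illustrates the difference:

```r
# Invented values for illustration
x <- c(54.2, NA, 43.0)

# Comparing with NA yields NA for every element
eq_result <- x == NA
eq_result
# NA NA NA

# is.na() returns a proper TRUE/FALSE for each element
na_result <- is.na(x)
na_result
# FALSE TRUE FALSE
```

Because filter() keeps only rows where the condition is TRUE, a condition built on == NA would keep no rows at all.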
Exercise 2
Write the R code required to filter the cmv dataset to exclude rows that have a missing value (NA) for cd8_dose.
cmv %>%
  select(id, diagnosis, cd8_dose) %>%
  filter(---)
cmv %>%
  select(id, diagnosis, cd8_dose) %>%
  filter(!is.na(---))
cmv %>%
  select(id, diagnosis, cd8_dose) %>%
  filter(!is.na(cd8_dose))
Exercise 3
The smartpill_empty dataset has completely empty rows in rows 20 to 50. Write the R code required to remove these empty rows from this dataset.
smartpill_empty %>%
  ---_---("---")
smartpill_empty %>%
  remove_empty("---")
smartpill_empty %>%
  remove_empty("rows")
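remove_empty() comes from the {janitor} package and, with "rows", drops only rows in which every value is missing. To make that logic concrete, here is a rough base R sketch of the same idea on made-up data (toy is hypothetical; this is an illustration, not how {janitor} is implemented):

```r
# Hypothetical toy data: row 2 is completely empty, row 3 is only partly missing
toy <- data.frame(
  a = c(1, NA, NA, 4),
  b = c("x", NA, "y", "z")
)

# A row is "empty" when the number of NAs equals the number of columns
all_missing <- rowSums(is.na(toy)) == ncol(toy)

# Keep every row that has at least one observed value
cleaned <- toy[!all_missing, ]
nrow(cleaned)
# 3
```

Note that the partly missing row 3 is kept: only completely empty rows are removed.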
Exercise 4
Write the R code required to drop all observations in the cmv dataset that have at least one variable with missing data (NA).
cmv %>%
  ----()
cmv %>%
  drop_na()
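Dropping every row that has any missing value can discard a large share of a dataset, so it may be worth checking how many rows would survive first. A minimal base R sketch on made-up data (toy is hypothetical), using complete.cases() to flag rows with no missing values:

```r
# Hypothetical toy data; values are invented
toy <- data.frame(
  a = c(1, NA, NA, 4),
  b = c("x", NA, "y", "z")
)

# TRUE for rows with no missing values in any column
is_complete <- complete.cases(toy)

# Share of rows that would be kept after dropping incomplete rows
prop_complete <- mean(is_complete)
prop_complete
# 0.5

# Keep only the complete rows
complete_rows <- toy[is_complete, ]
nrow(complete_rows)
# 2
```

Here half of the rows are lost, including row 3, which still had an observed value in b.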
Exercise 5
Write the R code required to visualize the missing data (with the vis_dat() function from the {visdat} package) before and after dropping missing observations from the cmv dataset.
cmv %>%
  ---()
cmv %>%
  drop_na() %>%
  ---()
cmv %>%
  vis_dat()
cmv %>%
  drop_na() %>%
  ---()
cmv %>%
  vis_dat()
cmv %>%
  drop_na() %>%
  vis_dat()