• Instructions
  • Basic Operations
  • Cleaning and Counting
  • Combining Data
  • Plotting
  • Functional Programming
  • Wrapping Up


Instructions

  1. This exam covers material from R for Data Science. You may find the study guide useful. If you have any questions about scope, please get in touch.

  2. You must complete the exam within 90 minutes.

  3. You may use any books or digital resources you want during this examination, but you may not communicate with any person other than your examiner.

  4. You are required to use the RStudio IDE for the practical portions of this exam. You may use either the desktop edition or rstudio.cloud as you prefer.


There is, of course, no single correct way to solve each of these questions. I’ve included the solutions that first came to my mind when I did this practice exam. If you find a mistake or a better solution, please feel free to submit a pull request on GitHub.


You can download the blank sample exam .rmd here. You can download this current .rmd containing the solutions here.

By default, the code for each solution is hidden. You can toggle the code on and off by clicking Code/Hide at the top right of each chunk or Code > Show All Code / Hide All Code at the top right of this document.


library(tidyverse)

Basic Operations

  1. Read the file person.csv and store the result in a tibble called person.
(person <- read_csv(here::here("person.csv")))
person_id  personal_name  family_name
<chr>      <chr>          <chr>
dyer       William        Dyer
pb         Frank          Pabodie
lake       Anderson       Lake
roe        Valentina      Roerich
danforth   Frank          Danforth
  1. Create a tibble containing only family and personal names, in that order. You do not need to assign this tibble or any others to variables unless explicitly asked to do so. However, as noted in the introduction, you must use the pipe operator %>% and code that follows the tidyverse style guide.
person %>% 
  select(family_name, personal_name)
family_name  personal_name
<chr>        <chr>
Dyer         William
Pabodie      Frank
Lake         Anderson
Roerich      Valentina
Danforth     Frank
  1. Create a new tibble containing only the rows in which family names come before the letter M. Your solution should work for tables with more rows than the example, i.e., you cannot rely on row numbers or select specific names.
before_m <- letters[seq_len(which(letters == "m") - 1)]

person %>%
  mutate(family_first_letter = str_to_lower(str_sub(family_name, 1, 1))) %>%
  filter(family_first_letter %in% before_m) %>%
  select(-family_first_letter)
person_id  personal_name  family_name
<chr>      <chr>          <chr>
dyer       William        Dyer
lake       Anderson       Lake
danforth   Frank          Danforth

Another, much more elegant solution, courtesy of Beatriz Milz:

person %>% filter(family_name < "M")
person_id  personal_name  family_name
<chr>      <chr>          <chr>
dyer       William        Dyer
lake       Anderson       Lake
danforth   Frank          Danforth
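
The comparison family_name < "M" depends on the session's collation order, so lower-case names may sort differently from one locale to another. A minimal sketch of a more defensive variant, assuming it is acceptable to compare only the upper-cased first letter; person_demo is a hypothetical inline stand-in for person.csv so the example runs on its own:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical inline copy of the person table (stand-in for person.csv).
person_demo <- tribble(
  ~person_id, ~personal_name, ~family_name,
  "dyer",     "William",      "Dyer",
  "pb",       "Frank",        "Pabodie",
  "lake",     "Anderson",     "Lake",
  "roe",      "Valentina",    "Roerich",
  "danforth", "Frank",        "Danforth"
)

# Compare the upper-cased first letter only, so "dyer" and "Dyer" behave alike.
before_m_demo <- person_demo %>%
  filter(str_to_upper(str_sub(family_name, 1, 1)) < "M")

before_m_demo
```

This keeps the family_name column untouched and still works for tables with more rows than the example.
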
  1. Display all the rows in person sorted by family name length with the longest name first.
person %>% 
  arrange(desc(str_length(family_name)))
person_id  personal_name  family_name
<chr>      <chr>          <chr>
danforth   Frank          Danforth
pb         Frank          Pabodie
roe        Valentina      Roerich
dyer       William        Dyer
lake       Anderson       Lake

Cleaning and Counting

  1. Read the file measurements.csv to create a tibble called measurements. (The strings "rad", "sal", and "temp" in the quantity column stand for “radiation”, “salinity”, and “temperature” respectively.)
(measurements <- read_csv(here::here("measurements.csv")))
visit_id  visitor  quantity  reading
<dbl>     <chr>    <chr>     <dbl>
619       dyer     rad       9.82
619       dyer     sal       0.13
622       dyer     rad       7.80
622       dyer     sal       0.09
734       pb       rad       8.41
734       lake     sal       0.05
734       pb       temp      -21.50
735       pb       rad       7.22
735       NA       sal       0.06
735       NA       temp      -26.00
(first 10 rows shown)
  1. Create a tibble containing only rows where none of the values are NA and save in a tibble called cleaned.
(cleaned <- measurements %>% 
  drop_na())
visit_id  visitor  quantity  reading
<dbl>     <chr>    <chr>     <dbl>
619       dyer     rad       9.82
619       dyer     sal       0.13
622       dyer     rad       7.80
622       dyer     sal       0.09
734       pb       rad       8.41
734       lake     sal       0.05
734       pb       temp      -21.50
735       pb       rad       7.22
751       pb       rad       4.35
751       pb       temp      -18.50
(first 10 rows shown)
  1. Count the number of measurements of each type of quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".
cleaned %>% 
  count(quantity)
quantity  n
<chr>     <int>
rad       8
sal       7
temp      3
  1. Display the minimum and maximum value of reading separately for each quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".
cleaned %>% 
  group_by(quantity) %>% 
  summarize(reading_min = min(reading),
            reading_max = max(reading))
quantity  reading_min  reading_max
<chr>     <dbl>        <dbl>
rad       1.46         11.25
sal       0.05         41.60
temp      -21.50       -16.00

Note: You could also use dplyr::across() and a named list of functions! 😎

cleaned %>% 
  group_by(quantity) %>% 
  summarize(across(reading, list(min = min, max = max)))
quantity  reading_min  reading_max
<chr>     <dbl>        <dbl>
rad       1.46         11.25
sal       0.05         41.60
temp      -21.50       -16.00
  1. Create a tibble in which all salinity ("sal") readings greater than 1 are divided by 100. (This is needed because some people wrote percentages as numbers from 0.0 to 1.0, but others wrote them as 0.0 to 100.0.)
cleaned %>%
  mutate(
    reading = case_when(
      quantity == "sal" & reading > 1 ~ reading / 100,
      TRUE ~ reading
    )
  )
visit_id  visitor  quantity  reading
<dbl>     <chr>    <chr>     <dbl>
619       dyer     rad       9.820
619       dyer     sal       0.130
622       dyer     rad       7.800
622       dyer     sal       0.090
734       pb       rad       8.410
734       lake     sal       0.050
734       pb       temp      -21.500
735       pb       rad       7.220
751       pb       rad       4.350
751       pb       temp      -18.500
(first 10 rows shown)
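
Because there is only one condition here, the same rescaling can also be written with if_else(). A minimal sketch, assuming a small inline table; readings_demo and its values are hypothetical, chosen so that one salinity reading is written as a percentage:

```r
library(dplyr)
library(tibble)

# Hypothetical readings; the 41.6 stands in for a salinity recorded on the 0-100 scale.
readings_demo <- tribble(
  ~visit_id, ~quantity, ~reading,
  619,       "rad",     9.82,
  619,       "sal",     0.13,
  752,       "sal",     41.6,
  752,       "temp",    -16.0
)

# Divide only the salinity readings above 1; leave every other reading unchanged.
rescaled_demo <- readings_demo %>%
  mutate(reading = if_else(quantity == "sal" & reading > 1, reading / 100, reading))

rescaled_demo
```

if_else() is stricter about types than case_when() used to be, which is harmless here because both branches are doubles.
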

Combining Data

  1. Read visited.csv and drop rows containing any NAs, assigning the result to a new tibble called visited.
(visited <- read_csv(here::here("visited.csv")) %>% 
  drop_na())
visit_id  site_id  visit_date
<dbl>     <chr>    <date>
619       DR-1     1927-02-08
622       DR-1     1927-02-10
734       DR-3     1930-01-07
735       DR-3     1930-01-12
751       DR-3     1930-02-26
837       MSK-4    1932-01-14
844       DR-1     1932-03-22
  1. Use an inner join to combine visited with cleaned using the visit_id column for matches.
(combined <- inner_join(visited, cleaned, by = "visit_id"))
visit_id  site_id  visit_date  visitor  quantity  reading
<dbl>     <chr>    <date>      <chr>    <chr>     <dbl>
619       DR-1     1927-02-08  dyer     rad       9.82
619       DR-1     1927-02-08  dyer     sal       0.13
622       DR-1     1927-02-10  dyer     rad       7.80
622       DR-1     1927-02-10  dyer     sal       0.09
734       DR-3     1930-01-07  pb       rad       8.41
734       DR-3     1930-01-07  lake     sal       0.05
734       DR-3     1930-01-07  pb       temp      -21.50
735       DR-3     1930-01-12  pb       rad       7.22
751       DR-3     1930-02-26  pb       rad       4.35
751       DR-3     1930-02-26  pb       temp      -18.50
(first 10 rows shown)
  1. Find the highest radiation ("rad") reading at each site. (Sites are identified by values in the site_id column.)
(max_rad <- combined %>% 
  pivot_wider(names_from = quantity, values_from = reading) %>% 
  group_by(site_id) %>% 
  summarize(max_rad = max(rad, na.rm = TRUE)))
site_id  max_rad
<chr>    <dbl>
DR-1     11.25
DR-3     8.41
MSK-4    1.46
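
The same per-site maximum can be reached without reshaping, by filtering to the "rad" rows first. A minimal sketch, assuming a hypothetical inline table combined_demo that copies only a few of the joined rows displayed earlier (so its DR-1 maximum is 9.82, not the 11.25 found in the full data):

```r
library(dplyr)
library(tibble)

# Hypothetical subset of the joined data, typed in by hand for illustration.
combined_demo <- tribble(
  ~visit_id, ~site_id, ~quantity, ~reading,
  619,       "DR-1",   "rad",     9.82,
  619,       "DR-1",   "sal",     0.13,
  622,       "DR-1",   "rad",     7.80,
  734,       "DR-3",   "rad",     8.41,
  734,       "DR-3",   "temp",    -21.50,
  735,       "DR-3",   "rad",     7.22
)

# Keep radiation rows only, then take the maximum reading per site.
max_rad_demo <- combined_demo %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  summarize(max_rad = max(reading))

max_rad_demo
```

Filtering first also means no NA values are introduced, so na.rm = TRUE is not needed in this variant.
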
  1. Find the date of the highest radiation reading at each site.
combined %>% 
  pivot_wider(names_from = quantity, values_from = reading) %>% 
  group_by(site_id, visit_date) %>% 
  summarize(max_rad = max(rad, na.rm = TRUE)) %>% 
  semi_join(max_rad) %>% 
  select(visit_date, everything())
visit_date  site_id  max_rad
<date>      <chr>    <dbl>
1932-03-22  DR-1     11.25
1930-01-07  DR-3     8.41
1932-01-14  MSK-4    1.46
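
Another way to get the date of each site's highest radiation reading is dplyr::slice_max(), which keeps the whole row of the maximum, so no second join is required. A minimal sketch, assuming a hypothetical inline table rad_demo built from a few of the joined rows displayed earlier (the results therefore differ from the full-data answer for DR-1):

```r
library(dplyr)
library(tibble)

# Hypothetical subset of the joined data, typed in by hand for illustration.
rad_demo <- tribble(
  ~site_id, ~visit_date,  ~quantity, ~reading,
  "DR-1",   "1927-02-08", "rad",     9.82,
  "DR-1",   "1927-02-10", "rad",     7.80,
  "DR-3",   "1930-01-07", "rad",     8.41,
  "DR-3",   "1930-01-07", "temp",    -21.50,
  "DR-3",   "1930-01-12", "rad",     7.22
) %>%
  mutate(visit_date = as.Date(visit_date))

# Within each site, keep the row holding the largest radiation reading.
top_rad_dates <- rad_demo %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  slice_max(reading, n = 1) %>%
  ungroup() %>%
  select(visit_date, site_id, max_rad = reading)

top_rad_dates
```

By default slice_max() keeps ties, so a site with two equal maxima would return two rows.
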

Plotting

  1. The code below is supposed to read the file home-range-database.csv to create a tibble called hra_raw, but contains a bug. Describe and fix the problem. (There are several ways to fix it: please use whichever you prefer.)
hra_raw <- read_csv(here::here("data", "home-range-database.csv"))

Note: The file home-range-database.csv is currently saved in the root directory of the project, not in a subdirectory called data, as the code above would suggest. You could either create the data folder and move the data file there, or you could update the code that imports the data (as I demonstrate below).

(hra_raw <- read_csv(here::here("home-range-database.csv")))
taxon          common.name             class           order
<chr>          <chr>                   <chr>           <chr>
lake fishes    american eel            actinopterygii  anguilliformes
river fishes   blacktail redhorse      actinopterygii  cypriniformes
river fishes   central stoneroller     actinopterygii  cypriniformes
river fishes   rosyside dace           actinopterygii  cypriniformes
river fishes   longnose dace           actinopterygii  cypriniformes
river fishes   muskellunge             actinopterygii  esociformes
marine fishes  pollack                 actinopterygii  gadiformes
marine fishes  saithe                  actinopterygii  gadiformes
marine fishes  lined surgeonfish       actinopterygii  perciformes
marine fishes  orangespine unicornfish actinopterygii  perciformes
(first 10 rows and first 4 columns shown)
  1. Convert the class column (which is text) to create a factor column class_fct and assign the result to a tibble hra. Use forcats to order the factor levels as:
    1. mammalia
    2. reptilia
    3. aves
    4. actinopterygii
(hra <- hra_raw %>%
  mutate(class_fct = fct_relevel(class,
                                 "mammalia",
                                 "reptilia",
                                 "aves",
                                 "actinopterygii")) %>%
  relocate(class_fct, .after = class))
taxon          common.name             class           class_fct
<chr>          <chr>                   <chr>           <fctr>
lake fishes    american eel            actinopterygii  actinopterygii
river fishes   blacktail redhorse      actinopterygii  actinopterygii
river fishes   central stoneroller     actinopterygii  actinopterygii
river fishes   rosyside dace           actinopterygii  actinopterygii
river fishes   longnose dace           actinopterygii  actinopterygii
river fishes   muskellunge             actinopterygii  actinopterygii
marine fishes  pollack                 actinopterygii  actinopterygii
marine fishes  saithe                  actinopterygii  actinopterygii
marine fishes  lined surgeonfish       actinopterygii  actinopterygii
marine fishes  orangespine unicornfish actinopterygii  actinopterygii
(first 10 rows and first 4 columns shown)
  1. Create a scatterplot showing the relationship between log10.mass and log10.hra in hra.
hra %>% 
  ggplot(aes(log10.mass, log10.hra)) + 
  geom_point(size = 2, alpha = 0.7) + 
  theme_minimal()

  1. Colorize the points in the scatterplot by class_fct.
hra %>% 
  ggplot(aes(log10.mass, log10.hra, color = class_fct)) + 
  geom_point(size = 2, alpha = 0.7) + 
  scale_color_viridis_d(end = .9) + 
  theme_minimal()

  1. Display a scatterplot showing only data for birds (class aves) and fit a linear regression to that data using the lm function.
hra %>% 
  filter(class == "aves") %>% 
  ggplot(aes(log10.mass, log10.hra)) + 
  geom_point(size = 2, alpha = 0.7) + 
  geom_smooth(method = "lm") + 
  labs(title = "Linear relationship between home range and mass for Aves") + 
  theme_minimal()
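
geom_smooth(method = "lm") fits and draws the regression in one step. If the coefficients themselves are needed, lm() can be called directly on the filtered data. A minimal sketch, assuming a tiny inline table hra_demo whose values are invented purely for illustration and are not taken from home-range-database.csv:

```r
library(dplyr)
library(tibble)

# Invented values for illustration only; column names mirror those used in hra.
hra_demo <- tribble(
  ~class,     ~log10.mass, ~log10.hra,
  "aves",     1.0,         3.1,
  "aves",     2.0,         4.9,
  "aves",     3.0,         7.2,
  "mammalia", 2.5,         5.0
)

# Keep birds only, then regress home range on mass (both already on a log10 scale).
aves_demo <- hra_demo %>%
  filter(class == "aves")

aves_fit <- lm(log10.hra ~ log10.mass, data = aves_demo)

coef(aves_fit)
```

The fitted object can then be passed to summary() or broom::tidy() for standard errors and p-values.
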

Functional Programming

  1. Write a function called summarize_table that takes a title string and a tibble as input and returns a string that says something like, “title has # rows and # columns”. For example, summarize_table('our table', person) should return the string "our table has 5 rows and 3 columns".
summarize_table <- function(title, df) {
  n_rows <- nrow(df)
  n_cols <- ncol(df)

  glue::glue("{title} has {n_rows} rows and {n_cols} columns")
}

summarize_table("our table", person)
## our table has 5 rows and 3 columns
  1. Write another function called show_columns that takes a string and a tibble as input and returns a string that says something like, “table has columns name, name, name”. For example, show_columns('person', person) should return the string "person has columns person_id, personal_name, family_name".
show_columns <- function(title, df) {
  col_names <- names(df) %>%
    str_c(collapse = ", ")

  glue::glue("{title} has columns {col_names}")
}

show_columns("person", person)
## person has columns person_id, personal_name, family_name
  1. The function rows_from_file returns the first N rows from a table in a CSV file given the file’s name and the number of rows desired. Modify it so that if no value is specified for the number of rows, a default of 3 is used.
rows_from_file <- function(filename, num_rows) {
  readr::read_csv(filename) %>% head(n = num_rows)
}

rows_from_file("measurements.csv") # should show 3 rows

rows_from_file <- function(filename, num_rows = 3) {
  readr::read_csv(filename) %>% head(n = num_rows)
}

rows_from_file("measurements.csv")
visit_id  visitor  quantity  reading
<dbl>     <chr>    <chr>     <dbl>
619       dyer     rad       9.82
619       dyer     sal       0.13
622       dyer     rad       7.80
  1. The function long_name checks whether a string is longer than 4 characters. Use this function and a function from purrr to create a logical vector that contains the value TRUE where family names in the tibble person are longer than 4 characters, and FALSE where they are 4 characters or less.
long_name <- function(name) {
  stringr::str_length(name) > 4
}

person %>%
  mutate(long_family_name = map_lgl(family_name, long_name))
person_id  personal_name  family_name  long_family_name
<chr>      <chr>          <chr>        <lgl>
dyer       William        Dyer         FALSE
pb         Frank          Pabodie      TRUE
lake       Anderson       Lake         FALSE
roe        Valentina      Roerich      TRUE
danforth   Frank          Danforth     TRUE
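
The question asks for a logical vector, and the solution above stores it as a column. To get the bare vector, map_lgl() can be applied to the names directly. A minimal sketch, assuming a character vector family_names_demo that stands in for person$family_name:

```r
library(purrr)
library(stringr)

# Same helper as in the question.
long_name <- function(name) {
  str_length(name) > 4
}

# Stand-in for person$family_name, typed inline so the sketch runs on its own.
family_names_demo <- c("Dyer", "Pabodie", "Lake", "Roerich", "Danforth")

# map_lgl() calls long_name() once per element and returns a logical vector.
is_long <- map_lgl(family_names_demo, long_name)

is_long
```

With the real tibble, the equivalent call would be person %>% pull(family_name) %>% map_lgl(long_name).
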

Wrapping Up

  1. Modify the YAML header of this file so that a table of contents is automatically created each time this document is knit, and fix any errors that are preventing the document from knitting cleanly.
---
title: "Tidyverse Exam Version 2.0"
output:
html_document:
    theme: flatly
---

Corrected YAML header:

---
title: "Tidyverse Exam Version 2.0"
output:
  html_document:
    theme: flatly
    toc: TRUE
---

Note: You need to indent html_document: beneath output: and add toc: TRUE at the same level as theme.


You can read more about the RStudio Instructor Training and Certification Program here. There is another sample exam available with solutions, courtesy of Marly Gotti. I wrote about my own experience with the training and shared some of my exam prep materials here. Feel free to reach out with any questions!


