Tidyverse Tips & Tricks Part 1

Data Analysis in R

August 02, 2023 · 10 mins read

The Paradox of Naivety

American Pie, The Sopranos, Enema of the State, The Matrix, Tony Hawk’s Pro Skater, the launch of Napster, Y2K! Good times. One thing 1999 is not usually remembered for, though, is being the year the Dunning-Kruger effect was first described. You might be wondering “what does a renowned theory from the field of psychology have to do with scripting in R?”. Well, not that much, as it happens. But the eponymous concept coined by David Dunning and Justin Kruger does have some relevance here, as it gives rise to the paradox of naivety and, in turn, one of my favourite sayings which, despite some misguided attributions floating about the internet, can’t be credited to former US Secretary of Defence Donald Rumsfeld: “You don’t know what you don’t know”. Quite.

It’s true though. In this case, the average data analyst is all too likely unaware of how little of the absolute potential offered by their platform of choice that their skills fulfil. In some cases, this will be due to a genuine overestimation of their ability. Often though it’s because, in a truly expansive and ever-evolving ecosystem like R, many will simply not realise that there’s a package, function, or nicely streamlined solution for a particular task. To at least know what you don’t know would be a good start.

Many seasoned users are walking around with tricks: things we know how to do in R or the tidyverse that we no longer think of as obscure, but which could help others achieve or streamline a critical part of their workflow. Exactly the kind of thing people should know.

But why tricks? Is the Tidyverse nothing more than a big bag of tricks? Well…yeah. That’s exactly why it’s so good. Have you ever actually tried to reshape data without tidyr? When you begin to combine all the specific functions and nifty little ways of doing things you begin to realise how fluently they all work together. The benefit that this brings is not just a little bit of convenience here and there, but convenience that accumulates to both save time and keep you in the zen-like flow state that is so conducive to productivity.

This is the first tutorial of a series in which I will showcase both the most useful and some lesser-known functions from the tidyverse that, along with a bunch of little tips and tricks I have picked up over the years, will elevate your tidyverse capability from the basics of my previous tutorial to a Miyagi-esque level of mastery 🥷

TL;DW

In this tutorial, I use the volcano eruptions data set taken from the TidyTuesday project and cover some more obscure functions from the core tidyverse, some “hidden” ways of using workhorse functions like group_by() and filter() from the dplyr package, and a couple of really handy tricks with common operators.

Tutorial Code

As usual, the first code block will load the packages required for the tutorial. Nice and simple this time.

# load libraries
library("tidyverse")

The data set used in this tutorial is the volcano eruptions data set originally posted on the 12th of May 2020 as part of the TidyTuesday project, which is a weekly data project run by the R for Data Science Online Learning Community aimed at users of the R ecosystem.

The data set comprises five data tables, but we will only be using two of them:

# load data
volcanoes <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv")
eruptions <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv")

These tables jointly give detailed information about global volcanoes and their location, geology, and eruption events over time.

1. pipes all ’round

Last time around, I introduced the pipe operator %>%, which is used in the tidyverse for chaining series of function calls together. One thing that people often don’t realise is that the pipe operator can also be used inside function calls, chaining a series of operations together instead of nesting functions. In this case, the primary_volcano_type variable gets a bit of a cleaning using some of the functions from stringr inside a call to the dplyr workhorse mutate().

# view the levels in the 'primary_volcano_type' variable
volcanoes %>% 
  pull(primary_volcano_type) %>% 
  unique()

# clean the 'primary_volcano_type' variable using pipes within functions
volcanoes_cln <- 
  volcanoes %>% 
  mutate(
    primary_volcano_type = primary_volcano_type %>% 
      str_remove("\\(.+\\)|\\?") %>% 
      str_to_title()
  )

# confirm cleanup has worked
volcanoes_cln %>% 
  pull(primary_volcano_type) %>% 
  unique()

Chaining series of functions together with pipes inside a function call helps to keep code readable, particularly when correct indentation is used. It is not always appropriate or necessary, though, so some judgement is required.

As a rule of thumb, if you are applying more than two functions, chain them together with pipes rather than nesting them inside one another.
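As a quick illustration of that rule of thumb, here are the same two cleaning steps from above written both ways, applied to a small toy character vector (made up for illustration) rather than the volcano data:

```r
library(tidyverse)

x <- c("Stratovolcano(es)", "Caldera ?")

# nested: reads inside-out, last step first
str_to_title(str_remove(x, "\\(.+\\)|\\?"))

# piped: reads top-to-bottom in the order the steps happen
x %>% 
  str_remove("\\(.+\\)|\\?") %>% 
  str_to_title()

# both produce the same cleaned vector
```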

2. counting

“Most of data science is counting, and sometimes dividing.” — Hadley Wickham

The count() function will count the number of rows for each unique value of a variable or combination of grouping variables specified as the first and only mandatory argument; it is comparable to using group_by(), summarise() and n() functions in conjunction but a lot less verbose.

volcanoes_cln %>% 
  group_by(country, primary_volcano_type) %>% 
  summarise(n())

# same output using count()
volcanoes_cln %>% 
  count(country, primary_volcano_type)

There are useful arguments that let you give the calculated frequency counts a more intuitive variable name (name) and sort the output in descending order (sort = TRUE).

volcanoes_cln %>% 
  group_by(country, primary_volcano_type) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))

# same output using count()
volcanoes_cln %>% 
  count(country, primary_volcano_type, name = "count", sort = TRUE)

There’s also an argument, wt, that changes the behaviour from calculating frequency counts to summing the values of a numeric variable. This is analogous to changing from group_by(), summarise() and n() to group_by(), summarise() and sum().

volcanoes_cln %>% 
  group_by(country, primary_volcano_type) %>% 
  summarise(total_pop_5km = sum(population_within_5_km)) %>% 
  arrange(desc(total_pop_5km))

# same output using count()
volcanoes_cln %>% 
  count(
    country, primary_volcano_type,
    wt = population_within_5_km,
    name = "total_pop_5km",
    sort = TRUE
  )

The related function add_count() has the same arguments and behaviour as count() but is used to add a variable to an existing data frame, in a manner equivalent to using the group_by(), mutate(), n() and ungroup() functions.

volcanoes_cln %>% 
  group_by(country) %>% 
  mutate(count = n()) %>% 
  ungroup()

# same output using add_count()
volcanoes_cln %>% 
  add_count(country)

These two functions can be a powerful and time saving option when combined.

volcanoes_cln %>% 
  group_by(country, primary_volcano_type) %>% 
  summarise(count = n()) %>% 
  mutate(total = sum(count)) %>% 
  ungroup() %>% 
  arrange(desc(total), desc(count))

# same output using our power duo
volcanoes_cln %>% 
  count(country, primary_volcano_type, name = "count", sort = TRUE) %>% 
  add_count(country, wt = count, name = "total", sort = TRUE)

Here, we add the total count for a country in addition to the count of each volcano type for that country, and then sort by the total country count and, within country, by the frequency count for each volcano type. All that with two function calls!

3. creating variables inside functions

Many regular dplyr users aren’t aware that it is possible to create variables on the fly within the count(), group_by() and filter() functions, which are then operated on by the function.

The example given here first shows the long-winded way: creating a decade variable from the start_year variable using integer division (bonus trick!) and using it to obtain frequency counts for the number of eruptions per decade. It then shows that the mutate() call can be skipped entirely by creating the variable inside the count() call, with the same outcome.

# number of eruptions per decade - the long(ish) way
eruptions %>% 
  mutate(decade = (start_year %/% 10) * 10) %>% 
  count(decade, name = "n_erupt", sort = TRUE)

# same output creating the variable on-the-fly
eruptions %>% 
  count(
    decade = (start_year %/% 10) * 10,
    name = "n_erupt",
    sort = TRUE
  )
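Since the integer division operator is the bonus trick here, a couple of standalone examples of how %/% floors a year down to its decade (the values are picked purely for illustration):

```r
# %/% divides and drops the remainder
1997 %/% 10          # 199

# multiplying back up gives the start of the decade
(1997 %/% 10) * 10   # 1990

# the companion modulo operator %% returns the remainder instead
1997 %% 10           # 7

# %/% floors towards negative infinity, so negative (BCE) start
# years in the eruptions data still land on a decade boundary
(-347 %/% 10) * 10   # -350
```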

This ability to create grouping variables on the fly also applies to the group_by() function, as is evident if the group_by() and summarise() approach is used to reproduce the summary table created in the code chunk above.

eruptions %>% 
  group_by(decade = (start_year %/% 10) * 10) %>% 
  summarise(n_erupt = n()) %>% 
  arrange(desc(n_erupt))

The filter() function allows analogous behaviour too. Here we filter the data to the observations corresponding to volcanoes with more than 180 recorded eruptions, creating the frequency counts with group_by() and n() during the filtering step rather than beforehand.

eruptions %>% 
  group_by(volcano_name) %>% 
  filter(n() > 180) %>% 
  ungroup()
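If you’d rather avoid grouped state altogether, one alternative (not the only way) is to lean on add_count() from earlier and then drop the helper column. A minimal sketch using a made-up, miniature stand-in for the eruptions table, with the threshold lowered to suit the toy data:

```r
library(tidyverse)

# hypothetical miniature version of the eruptions table
eruptions_toy <- tibble(
  volcano_name = c("Etna", "Etna", "Etna", "Krakatau"),
  start_year   = c(1991, 2001, 2011, 1883)
)

# keep volcanoes with more than 2 recorded eruptions,
# then drop the helper count column
eruptions_toy %>% 
  add_count(volcano_name) %>% 
  filter(n > 2) %>% 
  select(-n)
```

The group_by() and filter(n() > …) version avoids the intermediate column, while this one avoids grouping and ungrouping; which reads better is largely a matter of taste.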

All of these examples can save you an extra step of creating a variable, thereby saving you time, memory and typing.

Summary()

That brings me to the end of this set of tips and tricks, but I will be back very soon with the next part in the series and, with it, more tidyverse trickery to help get more complex tasks done with relative ease.

In the meantime, I hope you enjoyed the video and the tutorial. Please like and subscribe to my YouTube channel if you want to keep up to date with my latest tutorial videos, and feel free to get in touch in the comments or via my website.

Have fun with your data analyses and I will catch you next time!

View Session Info
# ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
# setting  value
# version  R version 4.2.1 (2022-06-23)
# os       macOS Ventura 13.4
# system   x86_64, darwin17.0
# ui       RStudio
# language (EN)
# collate  en_US.UTF-8
# ctype    en_US.UTF-8
# tz       Europe/London
# date     2023-08-02
# rstudio  2023.03.2+454 Cherry Blossom (desktop)
# pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
# 
# ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
# package     * version date (UTC) lib source
# bit           4.0.5   2022-11-15 [1] CRAN (R 4.2.0)
# bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
# cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
# colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.2.0)
# crayon        1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
# curl          5.0.1   2023-06-07 [1] CRAN (R 4.2.0)
# digest        0.6.33  2023-07-07 [1] CRAN (R 4.2.1)
# dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.2.1)
# evaluate      0.21    2023-05-05 [1] CRAN (R 4.2.0)
# fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
# fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
# forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.0)
# generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
# ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.2.0)
# glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
# gtable        0.3.3   2023-03-21 [1] CRAN (R 4.2.0)
# hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
# htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
# knitr         1.43    2023-05-25 [1] CRAN (R 4.2.0)
# lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
# lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.2.0)
# magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
# munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
# pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.0)
# pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
# purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
# R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
# readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.0)
# rlang         1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
# rmarkdown     2.23    2023-07-01 [1] CRAN (R 4.2.0)
# rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.2.0)
# scales        1.2.1   2022-08-20 [1] CRAN (R 4.2.0)
# sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
# stringi       1.7.12  2023-01-11 [1] CRAN (R 4.2.0)
# stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
# tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.0)
# tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.1)
# tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
# tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.0)
# timechange    0.2.0   2023-01-11 [1] CRAN (R 4.2.0)
# tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.2.0)
# utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.0)
# vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.2.0)
# vroom         1.6.3   2023-04-28 [1] CRAN (R 4.2.0)
# withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
# xfun          0.39    2023-04-20 [1] CRAN (R 4.2.1)
# yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
# 
# [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
# 
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

Happy Data Analysis!

. . . . .

Disclaimer: All views expressed on this site are exclusively my own and do not represent the opinions of any entity whatsoever with which I have been, am now or will be affiliated.