At my previous gig as a junior data scientist trainer, my team was given some time to revamp our base-R syllabus to be more tidyverse-friendly while retaining significant base-R content.
This article is not written as an introduction to the tidyverse. It assumes you already work with the tidyverse, and is really just me jotting down my notes from the revamp exercise.
The tidyverse is opinionated
There are decisions made for your own good, and that’s bound to split people. For example, you can’t make dual y-axis charts in ggplot2, and that’s probably for the best (imo).
The tidyverse is designed to be readable
After skimming over base data structures, I start my dplyr classes by showing students nothing but a code chunk along these lines

mtcars %>%
  rownames_to_column() %>%
  select(cyl, mpg) %>%
  filter(cyl == 4) %>%
  arrange(mpg) %>%
  mutate(litres_per_100km = 235.215 / mpg)
and asking students to guess what the code is doing. I only tried this on two groups of students, but the example generally made sense to them, and it’s a great motivation for using the tidyverse.
For me, skimming over the base basics first is probably the ideal approach: at the least, cover atomic vectors and lists. Without that grounding in base R, it’d be difficult to show how to use vectorised functions inside mutate() (and maybe the .data pronoun).
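To illustrate why the vector mental model matters inside mutate(), a minimal sketch using mtcars (the conversion factors and the threshold of 25 mpg are arbitrary choices for the example):

```r
library(dplyr)

# arithmetic, round() and if_else() all operate on whole columns at once,
# which only makes sense if students already think in vectors
mtcars %>%
  mutate(
    km_per_litre = round(mpg * 0.425, 1),            # vectorised arithmetic
    economy = if_else(mpg > 25, "good", "poor")      # vectorised condition
  )
```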
As we designed the course, we discovered that dplyr has convenient backends (dbplyr, sparklyr) that translate the same verbs to SQL databases and Spark, which was an added motivation for learning dplyr syntax well.
The tidyverse is designed to be safe
I take safety in the tidyverse to mean anticipating user mistakes and guarding against them. I think they call it ‘defensive programming’? However, I concede most of the progress on safety is in purrr and less so in dplyr (my impression).
No side-effects: A key principle in the design of dplyr functions is that a function should not modify the original data frame. We designed our material in Jupyter Notebooks, where out-of-order execution is an issue: if cells mutate state, running them out of order can give different results, so a no-side-effects workflow is a real advantage. This is less of a problem in R Markdown, because the user is encouraged to knit the whole document. pandas seems to be headed in a similar direction, deprecating the inplace = True argument altogether and encouraging the use of method chaining.
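A quick sketch of the no-side-effects principle on mtcars:

```r
library(dplyr)

small <- mtcars %>% filter(cyl == 4)

nrow(small)   # 11 rows in the filtered result
nrow(mtcars)  # still 32: the original data frame is untouched
```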
safer functions: The tidyverse includes type-safe versions of base functions, such as if_else() in place of ifelse(), and especially type-safe functional programming: purrr’s map_*() family replaces sapply() and its hard-to-predict simplification rules.
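A minimal sketch of the type-safety difference between dplyr::if_else() and base ifelse():

```r
library(dplyr)

# base ifelse() silently strips the Date class from its output
dates <- as.Date(c("2024-01-01", "2024-06-01"))
ifelse(TRUE, dates[1], dates[2])   # a bare number, not a Date

# if_else() insists that both branches have the same type
if_else(c(TRUE, FALSE), "yes", "no")   # fine
# if_else(c(TRUE, FALSE), "yes", 0)   # errors: branches must match
```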
Pipes are good, but to teach?
magrittr pipes were inspired by UNIX pipes. While I’m not familiar with using UNIX pipes myself, they’ve been around for a long while.
Pipes, combined with dplyr verbs, give your code a SQL-like logic. In R for Data Science, the authors compare nested calls like

x(y(z(a)))

with the piped equivalent

a %>% z() %>% y() %>% x()

where the nested form is less human-readable, since it doesn’t follow the sequence of operations performed.
Pandas method chaining follows a similar workflow. I’m so used to dplyr now that the first thing I look up before a tabular data-analysis project in Python is the pandas counterpart of each dplyr verb. While I’m less familiar with the pandas methods, using them in conjunction with method chaining makes for much more readable code, and using them is a priority for me whenever I’m doing extensive analysis in pandas.
Given my limited experience, I’m not sure pipes are easy to teach to most audiences, but imo they are critical to teaching the tidyverse. It’s worked for me so far, though students often forget that the pipe supplies the first argument for them.
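A minimal sketch of the point students forget, namely that the pipe passes its left-hand side as the first argument:

```r
library(dplyr)

# these two calls are exactly equivalent
mtcars %>% filter(cyl == 4)
filter(mtcars, cyl == 4)
```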
tidyverse functions have a generally consistent design: arguments are named and ordered consistently across functions and packages. For learners, this means that learning a few functions lets them reuse that knowledge with other functions with relative ease.
The biggest common denominator across the tidyverse is the data-first paradigm. One example of how this helps: it may be easier to teach functional programming with purrr than with base R, because the multivariate apply functions, such as mapply(), take the function first and hide the data inside the ellipsis argument, unlike sapply(), where the data comes first. purrr’s map(), map2() and pmap() are data-first throughout.
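A sketch of the contrast, base mapply() versus purrr::map2():

```r
library(purrr)

x <- c(1, 2, 3)
y <- c(10, 20, 30)

# base: the function comes first, the data disappears into ...
mapply(function(a, b) a + b, x, y)

# purrr: data first, function last, consistent with map() and pmap()
map2_dbl(x, y, ~ .x + .y)   # 11 22 33
```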
The tidyverse is designed as a grammar, or perhaps a way to speak about the operations you’re running. dplyr is the tidyverse’s grammar for manipulating tabular data, encapsulated in six verbs and their variants, whereas ggplot2 is the tidyverse’s grammar for constructing charts.
Within the tidyverse grammar, every function is a verb. I particularly like the classification of joins into mutating and filtering joins; it’s a useful heuristic for deciding which join to use.
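A sketch of the distinction, using dplyr’s built-in demo tables band_members and band_instruments:

```r
library(dplyr)

# mutating join: adds columns from the second table
band_members %>% left_join(band_instruments, by = "name")

# filtering joins: keep or drop rows, add no columns
band_members %>% semi_join(band_instruments, by = "name")
band_members %>% anti_join(band_instruments, by = "name")
```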
Small core function set (relative to pandas?)
A common way to teach dplyr, as I found in many tutorials, is to help students master the six core dplyr functions first. While teaching, I emphasise that functions such as rename() are shortcuts built on these six, and that there’s always a workaround even if you can’t find or don’t remember the shortcut. It takes the pressure off the student to ‘remember everything’.
mtcars %>% rename(miles_per_gallon = mpg)

# is equivalent to (provided you don't care about column order)
mtcars %>% select(miles_per_gallon = mpg, everything())
Tibbles are simpler
There are only two differences between the base data frame and the tibble: (1) a different print method that takes console size into account, and (2) no support for rownames. (If you go by R4DS, tibbles are also more restrictive and complain more.) If anything, students don’t need to learn about row names, because frankly they don’t need them. (Wrangling a MultiIndex is something I certainly don’t miss from pandas … )
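A small sketch of the rownames difference using mtcars:

```r
library(tibble)

# converting to a tibble silently drops the rownames...
as_tibble(mtcars)

# ...so they have to be made an explicit column first
as_tibble(rownames_to_column(mtcars, "model"))
```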
The requirement for tidy data is shared between ggplot2 and Python’s seaborn. Tidy data seems to get less emphasis in the Python ecosystem (this pandas course seems to be an excellent exception), and students didn’t have a good mental model for preparing their data before plotting or modelling.
Base R pains
Not having worked much with base R before, I didn’t appreciate its pains until after reading a few chapters of R for Data Science.
is.vector(), for example, returns TRUE for a list, because a list is just a recursive vector. This is annoying if you wanted to simplify teaching by equating ‘vector’ with ‘atomic vector’, and a student then tries running is.vector() on a list.
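The surprise in question:

```r
x <- list(1, "a", TRUE)

is.vector(x)  # TRUE: lists are (recursive) vectors
is.atomic(x)  # FALSE: this is the distinction students actually need
```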
stringsAsFactors = TRUE, the historical default in read.csv(), is less than ideal for data analysis. It silently converts character columns to factors, fixing the order of the factor levels (alphabetically) before the user has even looked at the data; if the user isn’t aware of the default, analysis can be a bit of a pain. (Although I concede that read_csv() has failed me on one occasion. However, I never understood reprex well enough to file a good bug report when it happened.)
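The level-order issue in a nutshell: factor levels default to alphabetical order, which is rarely the meaningful one:

```r
f <- factor(c("low", "high", "medium"))
levels(f)  # "high" "low" "medium": alphabetical, not logical

# the user has to fix the order explicitly
f <- factor(f, levels = c("low", "medium", "high"))
```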
We’ve been bitten by the ellipsis (...) argument in functions such as mean(). Because of the ellipsis, the following lines of code run without error:

# rm.na is a typo for na.rm, but no error is raised:
# the misspelled argument is silently swallowed by ...
mean(c(1, 2, 3), rm.na = TRUE)  # returns 2

# only the first argument is the data; 2 and 3 become trim and na.rm
mean(1, 2, 3)  # returns 1
Emphasis on code style from the outset
The tidyverse emphasises consistent casing for object names; snake case is preferred across all objects. janitor::clean_names() is a great way to enforce this, coercing all column names to snake case. Given good examples and the tidyverse’s opinionated design, it’s easy to sell students on the perks of a consistent code style; otherwise having to escape non-syntactic names with backticks is a good motivation on its own.
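A sketch of clean_names() in action (the column names are made up for the example):

```r
library(janitor)

df <- data.frame("Sepal Length" = 1, "Petal.Width" = 2, check.names = FALSE)
names(clean_names(df))  # "sepal_length" "petal_width"
```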
One thing I like about teaching
dplyr functions is that they implicitly discourage integer indexing in favour of making filtering/selecting decisions explicit, although it is still possible, for instance,
df %>% select(1:3)
df %>% slice(4:6)
One could argue indexing is still required to partition data for predictive modelling, but I’d argue rsample circumvents that:
library(rsample)

df_split <- initial_split(df)
df_train <- training(df_split)
df_test  <- testing(df_split)
Of course, there’s always
dplyr::sample_frac(), which can be used in conjunction with a filtering join:
df_train <- df %>% sample_frac(0.8)
df_test  <- df %>% anti_join(df_train)
It is very true that debugging a long series of pipes can be a pain, and I think dplyr should emit more messages. Stata usefully reports the outcome of each data operation, something dplyr could benefit from implementing, and independent efforts such as tidylog help report the outcomes of data wrangling.
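tidylog works by masking the dplyr verbs with logging wrappers, so loading it after dplyr is enough; a sketch (the exact message wording may differ by version):

```r
library(dplyr)
library(tidylog)   # masks dplyr verbs with logging versions

mtcars %>%
  filter(cyl == 4) %>%   # tidylog reports how many rows were removed
  select(mpg, wt)        # and which columns were dropped
```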
There are parts I still find awkward to teach in the tidyverse.
- Like how to explain how bare column names work when used in dplyr functions. Teaching it as ‘the data frame’s columns are exposed as vectors within an environment’ just doesn’t seem very beginner-friendly, and I need to find a simpler way.
- The multitude of packages needed to accomplish different tasks gets annoying at times (I wish I could start off with here() as soon as we talk about importing data, without confusing students with an additional import!).
- I wish dplyr returned a warning when removing rownames, because users aren’t aware that the rownames are being dropped (but the only use case here is mtcars, really …).
- I still get top_n() wrong. All the time.
- And don’t even get me started with tidyeval …
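On the top_n() point: newer dplyr (1.0.0+) offers slice_max() and slice_min(), which make the intent explicit:

```r
library(dplyr)

# top_n() sorts by the last column unless wt is given,
# and a negative n means "bottom n"; easy to get wrong
mtcars %>% top_n(3, mpg)

# slice_max() / slice_min() say what they do
mtcars %>% slice_max(mpg, n = 3)
mtcars %>% slice_min(mpg, n = 3)
```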
My teaching philosophy is to teach people to work with data efficiently using good data-analysis workflows, not to turn them into package authors (although that’d be nice!). I find that dplyr helps me think through my analysis faster. I don’t deal with datasets so big that I really care how long my code takes to run, as long as I get it right. But my teaching days are over … for now.