
What is Pipe?

  • Pipe is a new paradigm for calling multiple functions in R.
  • Pipe is an operator, written as %>% from the dplyr or magrittr package.
  • The magrittr package provides another pipe operator %$%.

Why do we use pipes?

We have an input \(x\) and want to perform two calculations on it in sequence, for example by calling two functions, \(f\) and \(g\), one after the other:

\[ \begin{eqnarray} x & \\ y &=& f(x) \\ z &=& g(y) \\ \end{eqnarray} \]

Alternatively, if there is no need to create new variables to store intermediate results, the calculations can be written in a single statement by nesting the functions:

\[ g(f(x)) \]

For example, let \(x\) be a numeric vector:

x <- c(2, 5, 4, 3, 1)
x
## [1] 2 5 4 3 1

We want to find the square root of each element in \(x\) and return the largest one. The R functions for this are sqrt and max, respectively.

Traditionally, the calculation is written as a nested function call:

max(sqrt(x))
## [1] 2.236068

The R expression above works, but it is hard to read the way we read natural language, especially when the code contains many parentheses that have to be read from the inside out. With pipes in R, we can write nested function calls naturally and read them from left to right as a chain.

\[ x \rightarrow sqrt \rightarrow max \]

How do we write pipes in R?

The pipe operator is well established in R, especially since the tidyverse collection of packages became popular.

Before using the pipe operator, install the dplyr package and load it with library().

Note: A package only needs to be installed once, but it must be loaded in every script that uses it.

install.packages("dplyr")  # run once
library(dplyr)             # run in every script that uses the pipe

With the pipe operator, the previous nested function call can be rewritten as

x %>% sqrt %>% max
## [1] 2.236068
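
As a quick sanity check, the piped form and the nested form should return the same value. The sketch below assumes the magrittr package is installed (dplyr re-exports the same %>% operator); the variable names nested and piped are only illustrative.

```r
library(magrittr)  # provides %>%; assumed to be installed

x <- c(2, 5, 4, 3, 1)

nested <- max(sqrt(x))        # read from the inside out
piped  <- x %>% sqrt %>% max  # read from left to right

# Both forms apply sqrt first and max second, so the results match
identical(nested, piped)
```

Each %>% passes the value on its left to the function on its right, which is why the chain reads in the same order as the calculation is performed.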

Additional arguments

Besides the data argument, if a function in a pipe requires one or more additional arguments, pass them inside the function's parentheses.

Example 1

x <- c(-1, 0, NA, 22, NA)
x %>% mean(na.rm=TRUE)
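
By default the pipe inserts the left-hand side as the first argument of the function on the right, so the call in Example 1 is equivalent to an ordinary call to mean with x written first. A minimal sketch, assuming magrittr is installed; piped and direct are illustrative names.

```r
library(magrittr)  # provides %>%; assumed to be installed

x <- c(-1, 0, NA, 22, NA)

piped  <- x %>% mean(na.rm = TRUE)  # x becomes the first argument of mean()
direct <- mean(x, na.rm = TRUE)     # the equivalent ordinary call

piped
direct
```

With the two NA values dropped, both calls average -1, 0 and 22, which gives 7.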

Example 2

The diff function calculates the differences between values of a vector that are lag positions apart. With the default lag=1, these are the differences between consecutive values.

temperature <- c(30, 0, 12, 40, 28)
diff(temperature, lag=1) #between two consecutive days
diff(temperature, lag=2) #between a day and two days ago
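
To make the effect of lag concrete, the differences can be checked by hand. The sketch below uses only base R; the names lag1 and lag2 are illustrative.

```r
temperature <- c(30, 0, 12, 40, 28)

# lag = 1: each value minus the value one position earlier
# 0 - 30, 12 - 0, 40 - 12, 28 - 40
lag1 <- diff(temperature, lag = 1)

# lag = 2: each value minus the value two positions earlier
# 12 - 30, 40 - 0, 28 - 12
lag2 <- diff(temperature, lag = 2)

lag1
lag2
```

The results are -30, 12, 28, -12 for lag=1 and -18, 40, 16 for lag=2. Because the lag=2 result contains a negative value, applying log to it yields NaN for that element, together with a warning.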

Rewrite the nested function call

\[ round(log(diff(temperature,lag=2)),digits=2) \]

with the pipe:

temperature %>% 
  diff(lag=2) %>% 
  log() %>% 
  round(digits=2)

Another pipe operator %$%

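Some R functions do not take a data argument, which makes them hard to include in a pipe. The %$% operator from the magrittr package solves this by exposing the columns of the data frame on its left to the expression on its right. Because dplyr does not provide %$%, load magrittr with library(magrittr) before using it.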

In the following code,

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)
## [1] 0.3361992

the cor function takes only two column vectors from the iris data rather than the entire data frame. The %$% operator is therefore needed to connect cor to the chain.
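
One way to think about %$% is that it behaves much like base R's with(): the column names of the left-hand data frame become available to the expression on the right. The sketch below assumes magrittr is installed; above_avg, piped and base_r are illustrative names.

```r
library(magrittr)  # provides %$%; assumed to be installed

# Rows of iris whose sepal length is above the overall mean
above_avg <- subset(iris, Sepal.Length > mean(Sepal.Length))

piped  <- above_avg %$% cor(Sepal.Length, Sepal.Width)      # exposition pipe
base_r <- with(above_avg, cor(Sepal.Length, Sepal.Width))   # base R equivalent

all.equal(piped, base_r)
```

The with() form is a reasonable fallback when magrittr is not available, at the cost of breaking the left-to-right chain.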

When should we not use pipes?

  • Do you want to store intermediate results?
  • Do you have more than ten pipes in a single statement?
  • Do you have more than one input data?
  • Do you develop your own package?

If the answer to any of these questions is yes, pipes are likely to reduce the readability and flexibility of the code rather than improve it.
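
As an illustration of the first question, the sketch below computes the same quantity twice: once as a single chain, and once with a named intermediate result that can be inspected or reused later. It assumes magrittr is installed; the names result_chain, daily_change and largest_swing are hypothetical.

```r
library(magrittr)  # provides %>%; assumed to be installed

temperature <- c(30, 0, 12, 40, 28)

# One chain: compact, but the day-to-day differences are not kept
result_chain <- temperature %>% diff(lag = 1) %>% abs() %>% max()

# Named steps: the intermediate vector stays available for later use
daily_change  <- diff(temperature, lag = 1)
largest_swing <- max(abs(daily_change))

result_chain
largest_swing
```

Both approaches return 30, the largest day-to-day change. The second keeps daily_change around, which is the better choice when that vector is needed again.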
