Installing dplyr 0.3 on Mac OS X (Mavericks)

By Bob Rudis (@hrbrmstr)
Thu 25 September 2014 | tags: r, rstats, dplyr

UPDATE: Per the author, running devtools::install_github("hadley/devtools") first should take care of everything you need before installing the latest dplyr (though I did not have the postgres libs installed and suspect they might still be needed).

The R dplyr package just turned 0.3 and to get it working in my development environment (OS X Mavericks) I had to do the following:

  • brew install postgresql (you are using Homebrew on your Mac, right?)
  • install.packages("DBI", type="source")
  • install.packages("RPostgreSQL", type="source")
  • devtools::install_github("rstudio/rmarkdown")
  • devtools::install_github("hadley/lazyeval")
  • devtools::install_github("hadley/dplyr")
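
The R side of that list can be scripted. Here is a minimal sketch, assuming brew install postgresql has already been run in a terminal and devtools is already installed. needs_install() and install_dplyr_dev() are hypothetical helper names made up for this example, not part of any package.

```r
# Hypothetical helper: TRUE when `pkg` is missing or older than `min_version`.
needs_install <- function(pkg, min_version = "0.0.0") {
  if (!requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  utils::packageVersion(pkg) < min_version
}

# Hypothetical wrapper around the R steps listed above, in the same order.
install_dplyr_dev <- function() {
  # Database packages, built from source against the Homebrew postgres libs.
  for (pkg in c("DBI", "RPostgreSQL")) {
    if (needs_install(pkg)) install.packages(pkg, type = "source")
  }
  # Development versions from GitHub.
  for (repo in c("rstudio/rmarkdown", "hadley/lazyeval", "hadley/dplyr")) {
    devtools::install_github(repo)
  }
}

# Only do the work in an interactive session, and only if dplyr is older than 0.3.
if (interactive() && needs_install("dplyr", "0.3")) install_dplyr_dev()
```

The version check is there so that re-running the script on a machine that is already up to date does nothing.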

Such is the way of things when living on the cutting edge of the Hadleyverse.

Why go through the trouble of using the newest version of dplyr? Take a look at some of the new capabilities available:

  • between() vector function efficiently determines if numeric values fall in a range, and is translated to a special form for SQL (#503).

  • count() makes it even easier to do (weighted) counts (#358).

  • data_frame() by @kevinushey is a nicer way of creating data frames. It never coerces column types (no more stringsAsFactors = FALSE!), never munges column names, and never adds row names. You can use previously defined columns to compute new columns (#376).

  • distinct() returns distinct (unique) rows of a tbl (#97). Supply additional variables to return the first row for each unique combination of variables.

  • Set operations, intersect(), union() and setdiff() now have methods for data frames, data tables and SQL database tables (#93). They pass their arguments down to the base functions, which will ensure they raise errors if you pass in too many arguments.

  • Joins (e.g. left_join(), inner_join(), semi_join(), anti_join()) now allow you to join on different variables in x and y tables by supplying a named vector to by. For example, by = c("a" = "b") joins x.a to y.b.

  • n_groups() function tells you how many groups are in a tbl. It returns 1 for ungrouped data. (#477)

  • transmute() works like mutate() but drops all variables that you didn’t explicitly refer to (#302).

  • rename() makes it easy to rename variables - it works similarly to select() but it preserves columns that you didn’t otherwise touch.

  • slice() allows you to select rows by position (#226). It includes positive integers, drops negative integers and you can use expressions like n().
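
Several of these are easier to appreciate on a toy example. The sketch below uses made-up data: flights and carriers are hypothetical tables, built with base data.frame() so the example does not depend on data_frame() being available. It is meant as a quick tour under those assumptions, not a full demonstration of each function.

```r
library(dplyr)

# Hypothetical sample data; rows 1 and 2 are deliberate duplicates.
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(5, 5, 30, 12, 45),
  seats   = c(100, 100, 150, 120, 200),
  stringsAsFactors = FALSE
)
carriers <- data.frame(
  code = c("AA", "UA"),
  name = c("American", "United"),
  stringsAsFactors = FALSE
)

# between(): is each delay in the closed range [10, 40]?
in_range <- between(flights$delay, 10, 40)

# count(): weighted count, here total seats per carrier (result column is `n`).
seats_by_carrier <- count(flights, carrier, wt = seats)

# distinct(): drop fully duplicated rows.
unique_rows <- distinct(flights)

# Join on differently named keys: flights$carrier matches carriers$code.
joined <- left_join(flights, carriers, by = c("carrier" = "code"))

# n_groups(): 1 for ungrouped data, one per carrier once grouped.
groups_before <- n_groups(flights)
groups_after  <- n_groups(group_by(flights, carrier))

# transmute(): keep only the columns you compute.
hours <- transmute(flights, delay_hours = delay / 60)

# rename(): change one name, keep every other column.
renamed <- rename(flights, delay_min = delay)

# slice(): pick rows by position; n() is the last row, negatives drop rows.
first_and_last <- slice(flights, c(1, n()))
without_first  <- slice(flights, -1)
```

Printing any of the intermediate objects at the console is a good way to see what each verb keeps and what it drops.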

Also, the lazyeval package looks pretty interesting.

