More Than One Way To Skin (and time) A Data Frame Subset

By Bob Rudis (@hrbrmstr)
Thu 03 April 2014 | tags: R, Fundamentals

There was an interesting question recently on StackOverflow on how to apply a function over a rolling window on a column in a data frame grouped by subset. It was a pretty vanilla SO question as those things go, but there were no fewer than four useful and different answers to it which, I believe, shows the power and flexibility of R.

All of them used either the rollapply or rollapplyr functions from the zoo package, and each handles the “rolling window” component (the “hard work”). There were three different approaches to performing the subset and applying the rollapply/sd combo across the four solutions:

using ave (either wrapped in with or without it)
using by
using ddply from the plyr package
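
To make the “rolling window” part concrete, here is a minimal base-R sketch of what a width-2 rolling sd computes. It is a stand-in for illustration only, not how zoo implements rollapply, and the helper name roll_sd_sketch is made up for this example.

```r
# Hypothetical helper (not part of zoo): right-aligned rolling sd,
# padded with NA so the result is as long as the input.
roll_sd_sketch <- function(x, width = 2) {
  n_windows <- length(x) - width + 1
  if (n_windows < 1) return(rep(NA_real_, length(x)))
  # one sd() call per window of `width` consecutive values
  rolled <- sapply(seq_len(n_windows), function(i) sd(x[i:(i + width - 1)]))
  c(rep(NA_real_, width - 1), rolled)
}

x <- c(2, 3, 5, 6, 8)
roll_sd_sketch(x)
```

As far as I can tell, the c(NA, rollapply(...)) and rollapplyr(..., fill = NA) forms used in the answers produce this same shape for a width of 2: one leading NA followed by one value per window.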

The short description of the ave function, “Group Averages Over Level Combinations of Factors”, hides the fact that it’s really a general-purpose function that will do subsetting of the first argument by the factors provided in the following arguments, using FUN=mean as the default function to call. You can use any function (like sd in this case) instead, which might not be obvious to new R users. Two of the answers used ave with minor differences (but different enough to show below).
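
A small base-R sketch of that behaviour, using a made-up toy data frame (the names toy, group_mean and group_sd are only for this illustration):

```r
toy <- data.frame(
  category = c("A", "A", "B", "B", "A", "B"),
  value    = c(2, 4, 10, 20, 6, 30)
)

# Default FUN = mean: each row gets the mean of its own group
toy$group_mean <- ave(toy$value, toy$category)

# Any function can be swapped in, but FUN must be passed by name
# because the arguments after the first are treated as grouping factors
toy$group_sd <- ave(toy$value, toy$category, FUN = sd)

toy
```

Note that ave returns its result in the original row order, which is what makes it convenient for adding a column to an existing data frame.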

I think by is an oft-neglected function since the plyr functions came about, but it’s a workhorse and gets the job done pretty well here (with the help of unlist).
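
Here is a rough sketch of the by plus unlist pattern on a tiny, made-up data frame (toy, per_group and flat are illustrative names only):

```r
toy <- data.frame(
  category = c("B", "A", "B", "A"),
  value    = c(10, 1, 20, 2)
)

# by() splits the data frame by category and returns one result per group
per_group <- by(toy, toy$category, function(d) d$value * 2)

# unlist() flattens the per-group results into a single vector,
# ordered by group (all of A first, then all of B)
flat <- unlist(per_group)
unname(flat)
```

Because the flattened vector follows group order rather than the original row order, it only lines up with the data frame's rows if the data is already sorted by the grouping column.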

The ddply solution is as straightforward and self-explanatory as the other three.

Given how close they all were in syntactic implementation, I wanted to see if there was a difference under the covers speed-wise, so I modified the original example data frame (made it bigger and slightly more complex with four factors instead of two) and used each different method to create a new column (100x) and captured the results to compare. The system.time function is often called in a standalone context, but when not printing basic timing stats to stdout it returns an object of class proc_time which has five values (only two of which are relevant to our testing).
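
For anyone who hasn't poked at the return value before, this short base-R sketch shows what system.time hands back (the loop being timed is just a placeholder workload):

```r
# Time a throwaway loop and keep the result instead of printing it
timing <- system.time(for (i in 1:1000) sqrt(i))

class(timing)   # the proc_time class mentioned above
names(timing)   # user.self, sys.self, elapsed, user.child, sys.child

# Pull out the two values this comparison keeps
timing[c("user.self", "elapsed")]
```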

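library(zoo)       # for rollapply() and rollapplyr()
library(plyr)      # for ldply(), ddply(), mutate() and .()
library(reshape2)  # for melt()
library(ggplot2)   # for the timing plot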

set.seed(1492)

category <- rep(sample(c("A", "B", "C", "D"), 20000, replace = TRUE))
year <- rep(sample(c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993), 20000, 
    replace = TRUE))
value <- rep(sample(c(2, 3, 5, 6, 8, 9, 4, 5), 20000, replace = TRUE))

df <- data.frame(category, year, value)

# Run the rolling sd calculation 100x for each method and capture the results
# in data frames:
#   - <<- is needed to modify df from inside the function
#   - system.time() does the timing
#   - lapply() runs the code 100x and shoves the results into a list
#   - ldply() turns the list into a data frame
# Lather, rinse, repeat.

t.with <- ldply(lapply(1:100, function(x) system.time(df$stdev.with <<- with(df, 
    ave(value, category, FUN = function(x) c(NA, rollapply(x, width = 2, sd)))))))

t.ave <- ldply(lapply(1:100, function(x) system.time(df$stdev.ave <<- ave(df$value, 
    df$category, FUN = function(x) rollapplyr(x, 2, sd, fill = NA)))))

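# note: by() and ddply() return values in category order rather than the
# original row order, so these two columns are only useful for timing here
t.by <- ldply(lapply(1:100, function(x) system.time(df$stdev.by <<- unlist(by(df, 
    df$category, function(x) c(NA, rollapply(x$value, width = 2, sd)))))))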

t.ddply <- ldply(lapply(1:100, function(x) system.time(df$stdev.ddply <<- ddply(df, 
    .(category), mutate, stdev = rollapplyr(value, width = 2, sd, fill = NA))$stdev)))

# prepare & crunch some data
# melt so we can use geom_boxplot easily
t.with <- melt(t.with)
t.ave <- melt(t.ave)
t.by <- melt(t.by)
t.ddply <- melt(t.ddply)

# add a column we can facet on to show each method
t.with$method <- "with"
t.ave$method <- "ave"
t.by$method <- "by"
t.ddply$method <- "ddply"

# combine all the timing data sets
dat <- rbind(t.with, t.ave, t.by, t.ddply)
# sys.self and the two child timings aren't useful for this analysis
dat <- dat[!(dat$variable %in% c("sys.self", "user.child", "sys.child")), ]

# put the factors in a particular order for the ggplot
dat$method <- factor(dat$method, levels = c("with", "ave", "by", "ddply"))

# make the actual plot of crunched stats
gg <- ggplot(data = dat, aes(factor(variable), value))
gg <- gg + geom_boxplot(aes(color = method))
gg <- gg + facet_wrap(~method, ncol = 4)
gg <- gg + labs(x = "", y = "# secs")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position = "none", strip.background = element_blank())
gg

While they are all pretty speedy, the with/ave combo consistently “wins” when I run this test code, and the by method consistently comes in second. The ddply method is consistently the slowest, but none of them are laggards.

This was a pretty fun exercise and reinforces my belief that taking a stab at answering SO questions is a pretty neat way to see how others “think” in R and can help you see solutions from different perspectives. Plus, it can lead you down a path to discovery in terms of finding an optimal way of solving a problem.