There was an interesting question recently on StackOverflow on how to apply a function over a rolling window on a column in a data frame grouped by subset. It was a pretty vanilla SO question as those things go, but there were no fewer than four useful and different answers to it which, I believe, shows the power and flexibility of R.
All of them used either the rollapply or rollapplyr functions from the zoo package, which handle the “rolling window” component (the “hard work”). There were three different approaches to performing the subset and applying the rollapply/sd combo across the four solutions: ave (with or without with), by, and ddply.
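To make the rolling-window piece concrete, here is a minimal sketch on a made-up vector (x is purely illustrative, not the data from the question). It suggests why the two styles seen in the answers are interchangeable: rollapply with a window of 2 returns one fewer value than its input, and padding that result with a leading NA by hand matches what rollapplyr gives with fill = NA.

```r
library(zoo)

# Hypothetical toy vector, just for illustration
x <- c(2, 4, 4, 8)

# Rolling sd over windows of 2: one result per window, so length(x) - 1 values
plain <- rollapply(x, width = 2, FUN = sd)

# Style 1: pad the front by hand so the result is as long as the input
padded <- c(NA, plain)

# Style 2: right-aligned windows, with positions lacking a full window filled with NA
right <- rollapplyr(x, width = 2, FUN = sd, fill = NA)

length(plain)
all.equal(padded, right)
```

Either way the result has the same length as the input, which is what a grouping function like ave needs in order to put the values back in place.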
The short description of the
ave function—“Group Averages Over Level Combinations of Factors”—hides the fact that it’s really a generic function that will do subsetting of the first argument by the factors provided in the following arguments using
FUN=mean as the default function to call. You can use any function (like
sd in this case) instead, which might not be obvious to new R users. Two of the answers used
ave with minor differences (but different enough to show below).
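As a quick illustration of that point about ave, here is a small sketch using invented vectors (scores and group are hypothetical names). It first relies on the default FUN = mean and then swaps in sd; note that FUN has to be passed by name, because every unnamed argument after the first is treated as a grouping factor.

```r
# Hypothetical toy data, just for illustration
scores <- c(1, 3, 10, 20)
group  <- c("a", "a", "b", "b")

# Default behaviour: each value is replaced by its group's mean
group_mean <- ave(scores, group)
group_mean  # 2 2 15 15

# Any other function works too, as long as FUN is named explicitly
group_sd <- ave(scores, group, FUN = sd)
group_sd
```

The result always has the same length and row order as the first argument, which makes it easy to assign straight back into a data frame column.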
by is an oft-neglected function since the plyr functions came about, but it’s a workhorse and gets the job done pretty well here (with the help of unlist). The ddply solution is just as straightforward and self-explanatory as the other three.
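One practical difference between by plus unlist and ave is the order of what comes back. The sketch below uses an invented, deliberately unsorted data frame (toy, g and v are hypothetical names) and a trivial function in place of the rolling sd, so treat it as an illustration of the ordering only.

```r
# Hypothetical toy data with the groups interleaved
toy <- data.frame(g = c("b", "a", "b", "a"), v = c(1, 2, 3, 4))

# by() processes one group at a time; unlist() then concatenates group by group
by_result <- unlist(by(toy, toy$g, function(d) d$v * 10))

# ave() puts each result back in the position of the row it came from
ave_result <- ave(toy$v, toy$g, FUN = function(x) x * 10)

unname(by_result)  # the "a" rows first, then the "b" rows
ave_result         # original row order
```

For a timing comparison this makes no difference, but if you assign the by result back onto a data frame that is not sorted by the grouping column, it is worth checking that the rows still line up.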
Given how close they all were in syntactic implementation, I wanted to see if there was a difference under the covers speed-wise, so I modified the original example data frame (made it bigger and slightly more complex, with four factors instead of two), used each different method to create a new column (100x), and captured the results to compare. The
system.time function is often called in a standalone context to print basic timing stats to stdout, but it also returns an object of class proc_time, which has five values (only two of which, user.self and elapsed, are relevant to our testing).
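Here is a minimal sketch of capturing that return value instead of just printing it. The loop being timed is an arbitrary stand-in, not part of the benchmark.

```r
# Time an arbitrary bit of work and keep the returned object
timing <- system.time(for (i in 1:100000) sqrt(i))

class(timing)   # "proc_time"
names(timing)   # the five values: user.self, sys.self, elapsed, user.child, sys.child

# Pull out just the two values this comparison cares about
timing[c("user.self", "elapsed")]
```

Because the object behaves like a named numeric vector, a list of these can be stacked into a data frame with one row per run.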
```r
library(zoo)      # for rollapply() and rollapplyr()
library(plyr)     # for ldply() and ddply()
library(reshape2) # for melt()
library(ggplot2)  # for the plot

set.seed(1492)

category <- rep(sample(c("A", "B", "C", "D"), 20000, replace = TRUE))
year <- rep(sample(c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993), 20000, replace = TRUE))
value <- rep(sample(c(2, 3, 5, 6, 8, 9, 4, 5), 20000, replace = TRUE))
df <- data.frame(category, year, value)

# run rolling sd calculation 100x for each method and capture results in data frames
# <<- is needed to modify df
# system.time does the timing
# lapply runs the code 100x and shoves the results into a list
# ldply turns the list into a data frame
# lather, rinse, repeat
# note: by() and ddply() return values in group order, so those two columns
# only line up with df's rows if df is sorted by category (timings are unaffected)

t.with <- ldply(lapply(1:100, function(x) system.time(
  df$stdev.with <<- with(df, ave(value, category,
    FUN = function(x) c(NA, rollapply(x, width = 2, sd))))
)))

t.ave <- ldply(lapply(1:100, function(x) system.time(
  df$stdev.ave <<- ave(df$value, df$category,
    FUN = function(x) rollapplyr(x, 2, sd, fill = NA))
)))

t.by <- ldply(lapply(1:100, function(x) system.time(
  df$stdev.by <<- unlist(by(df, df$category,
    function(x) c(NA, rollapply(x$value, width = 2, sd))))
)))

t.ddply <- ldply(lapply(1:100, function(x) system.time(
  df$stdev.ddply <<- ddply(df, .(category), mutate,
    stdev = rollapplyr(value, width = 2, sd, fill = NA))$stdev
)))

# prepare & crunch some data

# melt so we can use geom_boxplot easily
t.with <- melt(t.with)
t.ave <- melt(t.ave)
t.by <- melt(t.by)
t.ddply <- melt(t.ddply)

# add a column we can facet on to show each method
t.with$method <- "with"
t.ave$method <- "ave"
t.by$method <- "by"
t.ddply$method <- "ddply"

# combine all the timing data sets
dat <- rbind(t.with, t.ave, t.by, t.ddply)

# sys.self, etc aren't useful for this analysis
dat <- dat[!(dat$variable %in% c("sys.self", "user.child", "sys.child")), ]

# put the factors in a particular order for the ggplot
dat$method <- factor(dat$method, levels = c("with", "ave", "by", "ddply"))

# make the actual plot of crunched stats
gg <- ggplot(data = dat, aes(factor(variable), value))
gg <- gg + geom_boxplot(aes(color = method))
gg <- gg + facet_wrap(~method, ncol = 4)
gg <- gg + labs(x = "", y = "# secs")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position = "none", strip.background = element_blank())
gg
```
While they are all pretty speedy, the
with/ave combo consistently “wins” when I run this test code. The
by method consistently comes in second. The
ddply method is consistently the slowest, but none of them are laggards.
This was a pretty fun exercise and reinforces my belief that taking a stab at answering SO questions is a pretty neat way to see how others “think” in R and can help you see solutions from different perspectives. Plus, it can lead you down a path to discovery in terms of finding an optimal way of solving a problem.