By Bob Rudis (@hrbrmstr)
Sat 08 February 2014
tags: R, reproducible research, botnet
It’s super-#spiffy to see organizations like Sucuri share data and insight. Since they did some great work (both in data capture and sharing of their analyses), I thought it might be fun (yes, Jay & I have a strange notion of “fun”) to “show the work” in R. You should read their post first before playing along at home. We’ll provide links to the data file at the end of this post.
I combined the three Darkleech bit.ly files and stuck a proper header on the result, which makes it much easier to handle with read.csv(). I also normalized all the timestamp formats (they are all “%Y-%m-%d %H:%M:%S” now).
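If you want to reproduce that preprocessing step, here is a minimal sketch; the raw input file names are hypothetical and it assumes as.POSIXct() can parse each dump’s timestamps:

# hypothetical preprocessing sketch -- the three input file names are made up
files <- c("darkleech1.txt", "darkleech2.txt", "darkleech3.txt")
raw <- do.call(rbind, lapply(files, read.csv, sep = "\t", header = FALSE,
                             stringsAsFactors = FALSE))
colnames(raw) <- c("bitly.link.id", "ts", "click.count", "long.url")
# normalize every timestamp to a single format
raw$ts <- format(as.POSIXct(raw$ts), "%Y-%m-%d %H:%M:%S")
write.table(raw, "grantdad.txt", sep = "\t", row.names = FALSE, quote = FALSE)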
library(plyr)
library(RCurl)
library(ggplot2)  # used for the box plot later in the post

grantdad.URL <- "https://raw.github.com/ddsbook/blog/master/data/2014/02/sucuri/grantdad.txt"
grantdad <- read.csv(textConnection(getURL(grantdad.URL)), stringsAsFactors = FALSE, sep = "\t")

# strip the scheme/path, then the leading host label, to factor by host
grantdad$host <- factor(gsub("^[a-zA-Z0-9]*\\.", "", gsub("(http[s]*://|/.*$)", "", grantdad$long.url)))

# drop the seconds from the timestamps for aggregating by minute
grantdad$ts.min <- as.POSIXct(gsub("\\:[0-9][0-9]$", "", grantdad$ts))
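To make that nested gsub() a bit less opaque, here is what the two passes do to a made-up long URL (the URL itself is hypothetical):

# the inner gsub() strips the scheme and the path...
gsub("(http[s]*://|/.*$)", "", "http://abc123.myftp.biz/some/path")  # "abc123.myftp.biz"
# ...the outer gsub() then drops the leading host label
gsub("^[a-zA-Z0-9]*\\.", "", "abc123.myftp.biz")                     # "myftp.biz"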
I initially assumed that the timestamp in the original files is the creation date+time of the short URL and that the click count is there just for convenience (neither the post nor the pastebin ’splains). Looking at the long.url field, though, it seems that if this assumption is right, we might have an issue with the way the data was collected:
# show the short link and timestamp for entries with duplicated long URLs
g.dups <- grantdad[grantdad$long.url %in% grantdad[duplicated(grantdad$long.url), ]$long.url, c(2, 3, 1)]
g.dups[order(g.dups$bitly.link.id), ]
## ts click.count bitly.link.id
## 6431 2014-01-28 18:50:17 2 19YJNDs
## 6433 2014-01-28 18:50:17 2 19YJNDs
## 9812 2014-01-28 10:05:05 2 1bu6vhO
## 9813 2014-01-28 10:05:05 2 1bu6vhO
## 9802 2014-01-28 10:05:07 13 1bu6vyj
## 9804 2014-01-28 10:05:07 13 1bu6vyj
## 9442 2014-01-28 11:00:12 33 1budiYU
## 9444 2014-01-28 11:00:12 33 1budiYU
## 9332 2014-01-28 11:15:07 0 1bueQ57
## 9333 2014-01-28 11:15:07 0 1bueQ57
## 9322 2014-01-28 11:15:09 8 1bueT0T
## 9323 2014-01-28 11:15:09 8 1bueT0T
## 9212 2014-01-28 11:30:07 19 1bugqnt
## 9214 2014-01-28 11:30:07 19 1bugqnt
## 9222 2014-01-28 11:30:05 2 1bugsvy
## 9224 2014-01-28 11:30:05 2 1bugsvy
## 9032 2014-01-28 11:55:09 0 1buixrs
## 9033 2014-01-28 11:55:09 0 1buixrs
## 9020 2014-01-28 11:55:12 1 1buizPW
## 9023 2014-01-28 11:55:12 1 1buizPW
## 8631 2014-01-28 13:00:05 8 1bunIb1
## 8633 2014-01-28 13:00:05 8 1bunIb1
## 8622 2014-01-28 13:00:09 0 1bunIrx
## 8623 2014-01-28 13:00:09 0 1bunIrx
## 10618 2014-02-05 02:15:10 3 1c0EVcm
## 10619 2014-02-05 02:15:10 3 1c0EVcm
## 11672 2014-02-04 23:35:15 0 1evhUT3
## 11675 2014-02-04 23:35:15 0 1evhUT3
## 10796 2014-02-05 01:50:09 3 1evyC4O
## 10797 2014-02-05 01:50:09 3 1evyC4O
## 4400 2014-01-25 07:40:05 3 1hUgqCn
## 4410 2014-01-25 07:40:05 3 1hUgqCn
## 3671 2014-01-25 09:25:15 0 1hUyNHt
## 3675 2014-01-25 09:25:15 0 1hUyNHt
## 5490 2014-01-28 21:40:10 12 1i7jPOj
## 5496 2014-01-28 21:40:10 12 1i7jPOj
## 5485 2014-01-28 21:40:10 4 1i7jRFX
## 5494 2014-01-28 21:40:10 4 1i7jRFX
## 5487 2014-01-28 21:40:08 2 1i7jScV
## 5498 2014-01-28 21:40:08 2 1i7jScV
## 1652 2014-01-25 14:45:12 2 KTOyCl
## 1654 2014-01-25 14:45:12 2 KTOyCl
## 1053 2014-01-25 16:25:03 4 KU505B
## 1055 2014-01-25 16:25:03 4 KU505B
## 952 2014-01-25 16:40:13 0 KU7i4F
## 954 2014-01-25 16:40:13 0 KU7i4F
## 172 2014-01-25 18:45:06 0 LVmfEb
## 174 2014-01-25 18:45:06 0 LVmfEb
## 163 2014-01-25 18:45:08 0 LVmgYU
## 166 2014-01-25 18:45:08 0 LVmgYU
Those click.count numbers are close enough (OK, exact) that it looks like it might be a data collection/management issue (these RESTful APIs can be annoying at times). From my own examination of the bit.ly API, I’m pretty sure the timestamp is supposed to be the creation time of the link, so we’ll remove the duplicates before continuing:
grantdad <- grantdad[!duplicated(grantdad$long.url), ]
With the data cleaned up, we can aggregate clicks and counts (short URL creations) by anything we want. We’ll start with by-minute aggregation:
# aggregate URL creations and clicks by minute
# clicks: per-minute, per-host sum of click.count (wt_var weights the tally)
clicks <- count(grantdad, c("ts.min", "host"), wt_var = "click.count")
colnames(clicks) <- c("ts", "host", "clicks")
# counts: per-minute, per-host number of short links created
counts <- count(grantdad, c("ts.min", "host"))
colnames(counts) <- c("ts", "host", "counts")
# merge() joins on the shared "ts" and "host" columns
by.min <- merge(clicks, counts)
# across all hosts
summary(by.min$counts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 58.0 59.0 57.8 59.0 60.0
# per-host
by(by.min, by.min$host, function(x) {
summary(x$counts)
})
## by.min$host: myftp.biz
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 58.0 58.0 57.6 59.0 60.0
## --------------------------------------------------------
## by.min$host: myftp.org
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.0 58.0 59.0 58.5 59.0 60.0
## --------------------------------------------------------
## by.min$host: serveftp.com
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.0 58.0 58.0 57.2 59.0 60.0
gg <- ggplot(by.min, aes(factor(host), counts))
gg <- gg + geom_boxplot(aes(fill = host))
gg <- gg + theme_bw()
gg <- gg + labs(x = "Target Host", y = "Click Count (per min)", title = "Click Counts (per-host, by minute)")
gg <- gg + theme(legend.position = "none")
gg
The bit.ly API best practices page does not explicitly state what the per-minute link-creation rate limit is, but it sure looks like grantdad at least assumed it was 60 short links per minute. (NOTE: grantdad could have been under a no-ip.com API rate-limit threshold as well… I didn’t look at the no-ip.com API details.)
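A quick check of the per-minute aggregate backs that up; no minute in the data set ever tops 60 creations:

# confirm nothing exceeds the assumed 60-links-per-minute ceiling
max(by.min$counts)
sum(by.min$counts > 60)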
Before we do further work with the per-minute information (i.e. try to do hourly aggregation), we should examine the source data a bit more closely. Since the original post documents that the time periods are:
- 18:30 and 19:10 (Jan 25)
- 09:40 and 22:10 (Jan 28)
- 23:20 and 23:59 (Feb 04)
- 00:00 and 03:40 (Feb 05)
let’s look at the block of complete hours on January 28th (the longest contiguous stretch in the data set) to see if there might be more method to grantdad’s maliciousness (and to get a feel for how we should do any extrapolation):
# extract 10:00 up to (but not including) 20:00
jan28 <- grantdad[grep("2014-01-28 1[0-9]", grantdad$ts), ]
by(jan28, jan28$host, function(x) {
summary(factor(gsub("(^2014-01-28 |\\:00$)", "", as.character(x$ts.min))))
})
## jan28$host: myftp.biz
## 10:05 11:00 11:20 11:30 12:20 13:40 14:35 14:55 15:30 15:40 16:15 16:40
## 57 58 59 57 59 59 59 59 59 59 59 59
## 17:15 17:20 17:30 18:00 18:10 18:25 19:20 19:25 19:30 19:50
## 59 59 59 59 59 59 59 1 60 60
## --------------------------------------------------------
## jan28$host: myftp.org
## 10:20 10:30 10:40 10:50 11:15 11:40 11:50 12:05 12:30 12:40 12:50 13:05
## 59 59 59 59 57 59 59 59 59 59 59 59
## 13:25 13:50 14:05 14:15 14:30 15:50 16:25 16:30 16:50 17:50 18:15 18:35
## 59 59 59 59 59 59 59 59 59 59 59 59
## 19:10
## 59
## --------------------------------------------------------
## jan28$host: serveftp.com
## 10:15 11:05 11:55 12:10 13:00 13:20 14:00 14:45 15:10 15:20 16:00 16:05
## 59 59 57 59 57 59 59 59 59 59 59 59
## 17:05 17:40 18:45 18:50 19:00 19:40
## 59 59 59 58 59 60
While this continues to show grantdad kept below the (again, assumed) rate limit of 60 link creations per minute, he/she also spaced out the creation runs (albeit somewhat inconsistently) to every 5 or 10 minutes (and needed a bathroom break or fell asleep at 19:25, perhaps suggesting they were firing a script off by hand).
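We can see that spacing directly by tabulating the gaps (in minutes) between successive burst start times in the Jan 28 subset:

# gaps between successive creation bursts on Jan 28
jan28.times <- sort(unique(jan28$ts.min))
table(as.numeric(diff(jan28.times), units = "mins"))  # mostly 5- and 10-minute gaps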
We can look at each “minute chunk” in aggregate for that time period as well:
jan28.bymin <- factor(gsub("(^2014-01-28 1[0-9]\\:|\\:00$)", "", as.character(jan28$ts.min)))
summary(jan28.bymin)
## 00 05 10 15 20 25 30 35 40 45 50 55
## 351 411 236 352 413 178 471 118 473 118 531 116
plot(jan28.bymin, col = "steelblue", xlab = "Minute", ylab = "Links Created",
main = "Links Created", sub = "Jan 28 (1000-1959) grouped by Minute-in-hour")
This is either the world’s most inconsistent (or crafty) cron job or grantdad likes to press ↑ + ENTER a lot.
We can use this summary to get an idea of the average number of links being created in a five-minute period:
# each minute-in-hour bucket had roughly nine hourly chances in the
# 10:00-19:59 block, so divide by 9 to get a per-slot average
summary(as.numeric(table(jan28.bymin)/9))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.9 18.1 39.1 34.9 47.5 59.0
If we use the mean, we have ~35 links created every five minutes and can use that fact to extrapolate over the Jan 25-Feb 5 time period suggested in the article. That yields 120,576 total estimated links created during that 12-day period, which is about 10K more than estimated in the Sucuri post, and puts the complete estimate of created malicious links (assuming a start on Dec 16th) at 512,448.
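Here’s the back-of-the-envelope version of that arithmetic (the 51-day multiplier for the Dec 16th start is implied by the published totals):

# extrapolate the per-slot mean over the full windows
mean.per.slot <- mean(as.numeric(table(jan28.bymin)/9))  # ~34.9 links per 5-minute slot
slots.per.day <- 24 * 12                                 # twelve 5-minute slots per hour
mean.per.slot * slots.per.day * 12   # Jan 25-Feb 5 (12 days)  : 120,576
mean.per.slot * slots.per.day * 51   # Dec 16th start (51 days): 512,448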
It looks like my assumption about the fields in the data files was accurate, and both Sucuri and DDSec came to roughly the same conclusions (both are estimates, so neither is “right”).
We may delve into the rest of the data provided by Sucuri, but we want to express kudos again for sharing it and helping further the reproducible research movement in the security domain.
You can grab all of the data files, including our combined grantdad.txt file, over on github. We stuck the Rmd file used to create this post there as well. #reproducibleresearch