Data Driven Security (http://datadrivensecurity.info/blog/)

Mike Sconzo’s Ten Commandments of Python Data Science (2016-03-08, Mike Sconzo, @sooshie)

<p>Straight from the Book of <span class="caps">PEP</span></p>
<ol>
<li>thou shalt have no other languages before me</li>
<li>thou shalt not compare me to R</li>
<li>thou shalt not take the name of python or scikit-learn in vain</li>
<li>keep holy the jupyter notebook</li>
<li>honour thy pip and thy modules</li>
<li>thou shalt not ^C any running program, but shall exit cleanly</li>
<li>thou shalt not “experiment” with R</li>
<li>thou shalt utilize the whole <span class="caps">CPU</span> for thine is a single thread</li>
<li>thou shalt not defame lesser languages (e.g. all of them)</li>
<li>thou shalt not attempt to reproduce in python what others do in R</li>
</ol>

Your Data-Driven Guide To 2016 RSA Conference (U.S. Edition) (2016-01-23, Bob Rudis, @hrbrmstr)

<p>While I may not be able to attend the 2016 <span class="caps">RSA</span> Conference, I <em>can</em> provide some recommendations for those seeking a more data-driven schedule between parties and recovery breakfasts.</p>
<ul>
<li>There is a high likelihood that <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2343/advancing-information-risk-practices-seminar">Advancing Information Risk Practices Seminar</a> will have sage <span class="amp">&</span> practical advice on how to use data to best manage risk in your organization.</li>
<li>The always amazing Anton Chuvakin’s session on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2452/demystifying-security-analytics-data-methods-use">Demystifying Security Analytics: Data, Methods, Use Cases</a> will be a great primer for those who have struggled to get a successful analytics practice off the ground.</li>
<li>I’ve been assured no IPv4 addresses, malware hashes or crafty URLs were harmed in the making of Wade Baker’s talk on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2364/bridging-the-gap-between-threat-intelligence-and">Bridging the Gap between Threat Intelligence and Risk Management</a>. If there’s anyone who is more data-driven than Jay <span class="amp">&</span> me, it’s Wade.</li>
<li><span class="dquo">“</span>Maturity models” always terrify me as they are prone to simplicity. But, if you’re starting from scratch, they <em>can</em> be an effective gateway drug into more advanced data-driven security practices. Give <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2509/grow-up-a-maturity-model-and-roadmap-for">Grow Up: A Maturity Model and Roadmap for Vulnerability Management (Core Security)</a> a listen if you’re just starting on the path.</li>
<li>I also wince at the mere hint of “big data”, but <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2771/security-monitoring-in-the-real-world-with">Security Monitoring in the Real World with Petabytes of Data</a> may be worth a listen if you’re in a large org and are tired of fighting (and paying for) Splunk.</li>
<li>If data-driven devops is your thing, Scott Kennedy’s <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2297/devsecops-the-tao-of-security-science">DevSecOps—The Tao of Security Science</a> was spot-instanced just for you.</li>
<li>Moar “big data” at this one, but at-scale data classification is a real issue in large orgs. <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2610/applying-auto-data-classification-techniques-for">Applying Auto-Data Classification Techniques for Large Data Sets</a> by Anchit Arora may help you carve your towering data peaks down to size.</li>
<li>The economics of security go beyond security department budgets. Destabilizing the cybercrime economy is an approach orgs don’t often think about. You may find key elements of how to do that at <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2302/malware-as-a-service-kill-the-supply-chain">Malware as a Service: Kill the Supply Chain</a>.</li>
<li>Jack Jones seems to be arguing against maturity models in his talk: <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2354/how-infosec-maturity-models-are-missing-the-point">How Infosec Maturity Models Are Missing the Point</a>. Go to both and decide for yourself!</li>
<li><a href="https://www.rsaconference.com/events/us16/agenda/sessions/2790/data-driven-app-sec">Data-Driven App Sec</a>. With a title like that, it has to be on the list, no?</li>
<li>Hubbard wrote the book on measuring anything and his new book on doing so in cyber is sure to be a hit with the data-driven community. Get a preview of it at <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2384/how-to-measure-anything-in-cybersecurity-risk">How to Measure Anything in Cybersecurity Risk</a>.</li>
<li>Lance<sup>2</sup> will definitely be including the use of data in their talk on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2308/transforming-your-security-culture-from-awareness">Transforming Your Security Culture: From Awareness to Practice to Maturity</a>.</li>
<li>I don’t know Clay and “best practices” terrify me more than <span class="caps">FBI</span> iOS hacking, but <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2759/building-security-data-science-capability">Building Security Data Science Capability</a> may be chock full of sage advice.</li>
<li>One more where the title alone seems to mandate inclusion: <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2358/data-science-transforming-security-operations">Data Science Transforming Security Operations</a>. It’s by an RSAer at an <span class="caps">RSA</span> conference, so <em>caveat spectator</em>.</li>
<li>You might want to check out <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2357/effectively-measuring-cybersecurity-improvement-a">Effectively Measuring Cybersecurity Improvement: A <span class="caps">CSF</span> Use Case</a> for good-er-ah-<em>measure</em>?</li>
<li>Despite now working for a router company, the former OpenDNS folks always have interesting talks. While there’s yet moar “big data” in <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2336/using-large-scale-data-to-provide-attacker">Using Large Scale Data to Provide Attacker Attribution for Unknown IoCs</a> it will most likely be a fun and informative session.</li>
<li>There seems to be a whole lotta <em>measuring</em> going on this year at <span class="caps">RSA</span> and Lisa’s talk on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2524/measuring-what-matters">Measuring What Matters</a> may help you focus on asking the right questions so you can get your metrics program back on track (or start one!).</li>
<li>I don’t know how data-driven <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2479/this-doesnt-end-well-the-tld-explosion">This Doesn’t End Well: The <span class="caps">TLD</span> Explosion</a> will be, but I despise these new silly TLDs, and if you walk away from this talk hating them, too, then it’s Mission: Accomplished for me.</li>
<li><a href="https://www.rsaconference.com/events/us16/agenda/sessions/2412/leveraging-analytics-for-data-protection-decisions">Leveraging Analytics for Data Protection Decisions</a> is a guaranteed 5-star talk (<span class="caps">NOTE</span>: David did not pay me in pastries to say that).</li>
</ul>
<p>Did I miss any? Disagree with my choices? Drop me a note in the comments or on Twitter!</p>
<p>If you do attend any or all of these and would like to be on the podcast to give us your first-person review, drop us a note or find Jay at <span class="caps">RSA</span> and get us your contact info.</p>

Data-Driven Security Podcast & Book Update (2016-01-11, Bob Rudis, @hrbrmstr)

<p>We’re starting off the new year with two new ways to listen to the <a href="http://podcast.datadrivensecurity.info">Data-Driven Security Podcast</a>!</p>
<p>First, we have our own <a href="https://overcast.fm/itunes791001982/data-driven-security">Overcast station</a> fully loaded with the previous two seasons of shows. You can listen to them online right on Overcast.fm or use their minimalist but highly functional app for <a href="https://overcast.fm/itunes791001982/data-driven-security">iOS</a>.</p>
<p>You can also find and add the podcast on <a href="http://tunein.com/radio/Data-Driven-Security-p824534/">TuneIn</a>! It was crazy-cool to be able to tell the Amazon Echo: “<code>Alexa, tune in to the Data-Driven Security podcast</code>” and have it actually work. You can also play it directly from the TuneIn site or with their <a href="https://play.google.com/store/apps/details?id=tunein.player&hl=en&referrer=utm_medium%3dreferral%26utm_source%3dtunein.com%26utm_campaign%3dweb_redirect%26utm_content%3dp824534%26sourceGuideId%3dp824534">Android</a> <span class="amp">&</span> <a href="https://itunes.apple.com/us/app/tunein-radio/id418987775?mt=8&uo=4&at=1l3v4iy&referrer=utm_medium%3dreferral%26utm_source%3dtunein.com%26utm_campaign%3dweb_redirect%26utm_content%3dp824534%26sourceGuideId%3dp824534">iOS</a> apps.</p>
<p>We’re working hard to showcase great and useful work on the podcast and welcome <a href="http://datadrivensecurity.info/blog/pages/topic-request.html">topic requests</a>. If you’re not crazy about Google Forms, just drop us a note on Twitter or long-form in e-mail <a href="mailto:bob@datadrivensecurity.info">bob@datadrivensecurity.info</a>. Tell us about cool work you’re involved with or point us in the direction of other work you’ve found helpful or interesting.</p>
<p>In case you missed our holiday, “Tools” special, complete with <em>“The Grinch Who Stole Data Science”</em>, give it a listen below.</p>
<p><center><iframe src="http://tunein.com/embed/player/t102877112/" style="width:100%;height:110px;" scrolling="no" frameborder="no"></iframe></center></p>
<p>Data-Driven Security, <a href="http://datadrivensecurity.info/amzn">The Book</a> (which started this whole thing) is back in stock at Amazon (and other online booksellers). We received some complaints about poor-quality versions shipping in November/December 2015 but Wiley <span class="amp">&</span> Amazon have rectified the issue. If you were a recipient of a non-color copy of the printed version, please contact Bob or Jay for information on how to get a color copy.</p>
<p>Both Jay <span class="amp">&</span> Bob will be hitting up various conferences this year and look forward to meeting listeners and readers.</p>

The Fallacy of Sample Size (2015-11-07, Jay Jacobs, @jayjacobs)

<p>There is a lot of misperception around sample sizes and the confusion
happens on both sides of the research. A common question when
researchers are starting out is, “<a href="http://stats.stackexchange.com/search?q=sample+size">How big should my sample size
be?</a>” To help with
that, there are handy calculators all over the Internet. But the more
troubling part of misunderstanding sample size happens when people
consume research and attempt to dismiss it claiming <a href="http://www.csoonline.com/article/2931839/data-breach/154-or-58-cents-whats-the-real-cost-of-a-breached-data-record.html">the sample size is
too
small</a>.
To make matters worse, we are in the age of big data where millions of
samples are the norm, and so seeing a study with “just” 500 samples
seems easy to dismiss. But the data just don’t work that way and I
wanted to provide some context around sample size and experimentation.</p>
<p><em>What’s a good sample size? How many samples should a study have?</em>
Unfortunately, the answer depends on how much confidence or accuracy the
research needs and the size of the effect being measured. Additionally,
these are generally balanced against the cost of additional data. It is
impossible to look at any sample size and determine if it’s
“statistically significant”. Let me repeat and rephrase that
differently: <em>You can never say a sample size is too small if you just
know the sample size.</em> And if the researcher is working with a
convenience sample (where they take all the data they can get), they
should include estimations of uncertainty in their inferences that
account for the sample size, even if the sample isn’t small.</p>
<h3 id="small-samples-can-easily-detect-large-differences">Small samples can easily detect large differences</h3>
<p>Another way to say this, is that as the experimenter increases the
number of samples, they are able to detect smaller and smaller
differences. If an experimenter is looking at two things that are vastly
different (such as perhaps opinions between “experts” and non-experts),
the large difference should be obvious even with a small sample.
However, if the experimenter is trying to compare two samples that are
very similar (yet still different), it may take a larger sample to find
that difference. These are factored into sample size calculations. As a
thought experiment, imagine flipping a novelty coin that produced heads
90% of the time. How many flips would it take before you (even
intuitively) raised an eyebrow on the difference between heads and
tails? It’d be weird (that’s a technical term) if you flipped a coin ten
times and only got one tails. Maybe you wouldn’t make any claims about
the coin after ten flips, but as you continue to flip the coin, your
confidence to say something is wrong would increase, right? And with a
hugely unfair coin (that flips heads 90% of the time), it wouldn’t take
too many flips before you are convinced. Sometimes, just a handful of
samples is still enough to detect a difference.</p>
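That intuition can be made precise with a quick binomial tail calculation. This is a Python sketch (not from the original post): under a genuinely fair coin, how surprising is seeing at least nine heads in ten flips?

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability a *fair* coin gives 9 or more heads in 10 flips
p_value = tail_prob(10, 9)
print(round(p_value, 4))  # 0.0107 -- already eyebrow-raising after ten flips
```

About a 1% chance, which is why even a handful of flips of a hugely unfair coin is enough to make you suspicious.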
<h3 id="samples-size-dictates-the-amount-of-confidence-in-an-estimate">Sample size dictates the amount of confidence in an estimate</h3>
<p>Let’s continue the coin flip thought experiment and say we don’t want
to test whether it’s fair or not (we know it’s not). Instead, we want to
estimate the probability of flipping a heads with this coin. Let’s say
we flip it 10 times and get 9 heads, can we say the probability is 90%?
Perhaps, but it’d be reckless. Because with <a href="http://www.danielsoper.com/statcalc3/calc.aspx?id=85">a little
math</a>, we find
that the actual probability of getting a heads could be anywhere between
55% and 99% given 9 heads out of 10 flips. If we doubled that to 20
flips and got 18 heads, we could still only say the range is
66% to 99%. We could even run a simulation and make a picture of what
the number of flips does to the confidence we have in the estimate (with
90% probability of heads).</p>
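The “little math” behind intervals like that 55%–99% one can be approximated in a few lines. This Python sketch uses the Wilson score interval (an approximation; the exact Clopper-Pearson interval that calculators like the one linked above use gives somewhat wider bounds):

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion (95% default)."""
    phat = successes / n
    denom = 1 + z**2 / n
    center = phat + z**2 / (2 * n)
    margin = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_interval(9, 10)
print(f"{lo:.0%} to {hi:.0%}")  # roughly 60% to 98%
```

Doubling to 18 heads out of 20 flips narrows the interval, but not by as much as you might hope.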
<p><img alt="Sample size and confidence interval for an unfair coin" src="/blog/images/2015/11/confidence-sample-size.png" /></p>
<p>Look at the left side of that plot, look how our confidence increases
rapidly as we add a few more samples. Then look at the rate of
improvement between 100 and 200 samples. Statisticians refer to the
amount of confidence in an experiment as the
“<a href="http://www.statmethods.net/stats/power.html">power</a>” of that
experiment. Power is defined (in simple terms) as the “<a href="http://effectsizefaq.com/2010/05/31/what-is-statistical-power/">likelihood that
a study will detect an effect when there is an effect there to be
detected</a>.”</p>
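That definition can be turned into a small calculation. This Python sketch (an illustrative choice, not from the post: a one-sided binomial test at α = 0.05) asks how often n flips of the 90%-heads coin would let us reject the “fair coin” hypothesis:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power_of_flips(n, true_p=0.9, alpha=0.05):
    """Power of a one-sided test for 'more heads than a fair coin would give'."""
    # smallest head count a fair coin reaches with probability <= alpha
    k_crit = next(k for k in range(n + 1) if binom_tail(n, k, 0.5) <= alpha)
    # chance the unfair coin reaches that threshold
    return binom_tail(n, k_crit, true_p)

print(round(power_of_flips(10), 3))  # 0.736 -- ten flips usually suffice
```

With a coin that extreme, ten flips already detect the effect about three times out of four; more flips push the power toward 1.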
<h3 id="nobody-turns-down-more-data">Nobody turns down more data</h3>
<p>Okay, there are cases where someone would turn down more data, but my
point here is that sample size is never limited by a casual decision.
Collecting data has real costs associated with it. There are either
direct costs (such as paying participants, salaries, etc.) or indirect
costs of time and effort to gather and clean the data. At some point, it
becomes infeasible (perhaps even impossible) to get more data. The cost
of that data must be balanced with the benefit of more data. But keep in
mind that the benefit of getting more data isn’t linear. To reduce the
uncertainty (confidence interval) by half, the sample size must
quadruple. So, if you collect 30 samples you can double your precision
by adding 90 more samples, but if you are at 500 samples, you’d have to
collect and clean 1,500 more samples to have the same proportional
benefit in the effect.</p>
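The quadrupling rule falls out of the fact that a confidence interval’s width shrinks like 1/√n. A quick check, as a Python sketch using the normal-approximation margin of error:

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% normal-approximation CI for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# Quadrupling the sample size halves the margin of error
print(round(margin_of_error(30) / margin_of_error(120), 6))    # 2.0
print(round(margin_of_error(500) / margin_of_error(2000), 6))  # 2.0
```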
<h3 id="some-points-of-reference">Some points of reference</h3>
<ul>
<li><span class="caps">R.A.</span> Fisher, who developed the <a href="https://en.wikipedia.org/wiki/The_Design_of_Experiments">design of
experiments</a>
and whose techniques are used in almost every modern experiment,
designed his famous “<a href="https://en.wikipedia.org/wiki/Lady_tasting_tea">Lady Tasting
Tea</a>” experiment
with just 8 cups of tea.</li>
<li>Anyone who’s researched risk analysis has undoubtedly come across
<a href="http://www.simplypsychology.org/loftus-palmer.html">Kahneman and Tversky’s Prospect
theory</a>. Their
initial study
<a href="http://www.princeton.edu/~kahneman/docs/Publications/prospect_theory.pdf">pdf</a>
had a sample size of 95 students.</li>
<li>Ivan Pavlov had 40 dogs (“Pavlov’s Dogs”) from which he developed
his <a href="https://en.wikipedia.org/wiki/Classical_conditioning">Classical
Conditioning</a> work.</li>
<li><a href="https://en.wikipedia.org/wiki/Asch_conformity_experiments">Asch’s conformity
experiments</a>,
influential research on social and peer pressure, used 50 subjects.</li>
</ul>

Getting into the zone(s) with R + jsonlite (2015-10-07, Bob Rudis, @hrbrmstr)

<p>We have some <em>strange</em> data in cybersecurity. One of the (<span class="caps">IMO</span>) stranger data files is a Domain Name System (<span class="caps">DNS</span>) <a href="https://en.wikipedia.org/wiki/Zone_file">zone file</a>. This file contains mappings between domain names and <span class="caps">IP</span> addresses (and other things) represented by “resource records”.</p>
<p>Here’s an example for the dummy/example domain <code>example.com</code>:</p>
<div class="highlight"><pre>$ORIGIN example.com. ; designates the start of this zone file in the namespace
$TTL 1h ; default expiration time of all resource records without their own TTL value
example.com. IN SOA ns.example.com. username.example.com. ( 2007120710 1d 2h 4w 1h )
example.com. IN NS ns ; ns.example.com is a nameserver for example.com
example.com. IN NS ns.somewhere.example. ; ns.somewhere.example is a backup nameserver for example.com
example.com. IN MX 10 mail.example.com. ; mail.example.com is the mailserver for example.com
@ IN MX 20 mail2.example.com. ; equivalent to above line, "@" represents zone origin
@ IN MX 50 mail3 ; equivalent to above line, but using a relative host name
example.com. IN A 192.0.2.1 ; IPv4 address for example.com
IN AAAA 2001:db8:10::1 ; IPv6 address for example.com
ns IN A 192.0.2.2 ; IPv4 address for ns.example.com
IN AAAA 2001:db8:10::2 ; IPv6 address for ns.example.com
www IN CNAME example.com. ; www.example.com is an alias for example.com
wwwtest IN CNAME www ; wwwtest.example.com is another alias for www.example.com
mail IN A 192.0.2.3 ; IPv4 address for mail.example.com
mail2 IN A 192.0.2.4 ; IPv4 address for mail2.example.com
mail3 IN A 192.0.2.5 ; IPv4 address for mail3.example.com
</pre></div>
<p>(that came from the Wikipedia link above).</p>
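Before reaching for a full parser, it helps to see how little structure a single resource record has. Here’s a toy Python regex sketch that handles only simple MX lines (it ignores comments, TTLs, and multi-line records, which is exactly why a real parser like the dns-zonefile module is worth shimming in):

```python
import re

zone_snippet = """\
example.com. IN MX 10 mail.example.com.
@ IN MX 20 mail2.example.com.
@ IN MX 50 mail3
"""

# owner name, class (IN), type (MX), preference, exchange host
mx_re = re.compile(r"^(\S+)\s+IN\s+MX\s+(\d+)\s+(\S+)", re.MULTILINE)

records = [{"name": name, "preference": int(pref), "host": host}
           for name, pref, host in mx_re.findall(zone_snippet)]
print(records[0])  # {'name': 'example.com.', 'preference': 10, 'host': 'mail.example.com.'}
```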
<p><span class="caps">DNS</span> is a hierarchical, distributed service and companies reel in the Benjamins by parsing these files from the <a href="https://en.wikipedia.org/wiki/Top-level_domain">top level domains</a> (TLDs) and providing data in a more structured format. Some also capture passive <span class="caps">DNS</span> data (i.e. data obtained from the queries to—usually—large-scale <span class="caps">DNS</span> server deployments) and integrate it into the massive data set.</p>
<p>The <span class="caps">TLD</span> zones are really what make the internet “go”. They provide pointers to everything below them so the entire system knows where to route requests. Monitoring these <span class="caps">TLD</span> zone files for changes can reveal many things both operationally benign and malicious. Thankfully, you can get access to some of the (now <em>hundreds</em> of) <span class="caps">TLD</span> zones by filling out a form over <a href="https://czds.icann.org/">at <span class="caps">ICANN</span></a>. You won’t get approval for all of the <span class="caps">TLD</span> zone files and you’ll need to go to other sites to try to get the big guns like <code>.com</code>, <code>.net</code> <span class="amp">&</span> <code>.org</code>.</p>
<p>Once you have a zone file you need to be able to do something with it. R did not have a zone file parser, but <a href="https://github.com/hrbrmstr/zoneparser">now it does</a> thanks to the <a href="https://cran.rstudio.com/web/packages/V8/index.html">V8 package</a> and a modified version of the Node.js <a href="https://github.com/elgs/dns-zonefile">dns-zonefile module</a>.</p>
<h3 id="why-v8">Why V8?</h3>
<p>I had a dual purpose for this post. One was to introduce the <code>zoneparser</code> package, but the other was to show how you can add missing functionality to R with V8. Shimming JavaScript (or even Java or other languages for that matter) won’t necessarily get you the bare-metal performance of implementing something in R or Rcpp, but it <em>will</em> get you functional <em>quickly</em> and you can focus on Getting Things Done now and performance later. This recently happened with the package <a href="https://github.com/hrbrmstr/humanparser"><code>humanparser</code></a> that I wrote to answer a question on Stack Overflow. It’s based on a Node.js module of the same name and is written using V8. Oliver Keyes spun that into the Rcpp-backed <a href="https://github.com/hrbrmstr/humaniformat"><code>humaniformat</code></a> package (and added some functionality) that is <em>much</em> faster.</p>
<p>For these <span class="caps">TLD</span> zone files, I only need to process them once a day and there aren’t thousands or tens of thousands of them. Rather than code up a parser in R or munge some existing C/C++ domain parser code into an R package, all I had to do was this:</p>
<div class="highlight"><pre><span class="c1">#' Parse a Domain Name System (<span class="caps">DNS</span>) zone file</span>
<span class="c1">#'</span>
<span class="c1">#' @param path path to <span class="caps">DNS</span> zone file to parse</span>
<span class="c1">#' @return \code{list} with <span class="caps">DNS</span> zone parsed</span>
<span class="c1">#' @export</span>
<span class="c1">#' @examples</span>
<span class="c1">#' parse_zone(system.file("zones/20151001-wtf-zone-data.txt", package="zoneparser"))</span>
parse_zone <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>path<span class="p">)</span> <span class="p">{</span>
  ct<span class="o">$</span>call<span class="p">(</span><span class="s">"zonefile.parse"</span><span class="p">,</span> paste<span class="p">(</span>readLines<span class="p">(</span>path<span class="p">),</span> collapse<span class="o">=</span><span class="s">"\n"</span><span class="p">))</span>
<span class="p">}</span>

.onAttach <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>libname<span class="p">,</span> pkgname<span class="p">)</span> <span class="p">{</span>
  ct <span class="o"><<-</span> V8<span class="o">::</span>new_context<span class="p">()</span>
  ct<span class="o">$</span>source<span class="p">(</span>system.file<span class="p">(</span><span class="s">"js/zoneparser.js"</span><span class="p">,</span> package<span class="o">=</span><span class="s">"zoneparser"</span><span class="p">))</span>
<span class="p">}</span>
</pre></div>
<p>Those are the only two functions in the package. The <code>.onAttach</code> sets up a V8 JavaScript context for <code>parse_zone</code> to use and loads the slightly modified <code>zoneparser.js</code> <a href="https://cran.rstudio.org/web/packages/V8/vignettes/npm.html">browserified</a> JavaScript file which makes the function <code>zonefile.parse()</code> available to the context.</p>
<p>The <code>parse_zone</code> function takes in a file path to a zone file and returns a parsed structure. And, it’s as easy to use as:</p>
<div class="highlight"><pre>library<span class="p">(</span>zoneparser<span class="p">)</span>
example <span class="o"><-</span> parse_zone<span class="p">(</span><span class="s">"example-tld.txt"</span><span class="p">)</span>
<span class="c1"># see all the resource records types that were parsed</span>
<span class="p">(</span>names<span class="p">(</span>example<span class="p">))</span>
</pre></div>
<div class="highlight"><pre><span class="c">## [1] "$origin" "$ttl" "soa" "ns" "mx" "a" "aaaa" </span>
<span class="c">## [8] "cname"</span>
</pre></div>
<div class="highlight"><pre><span class="c1"># look at the mail exchangers</span>
<span class="p">(</span>example<span class="o">$</span>mx<span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span class="c">## name preference host</span>
<span class="c">## 1 example.com. 10 mail.example.com.</span>
<span class="c">## 2 @ 20 mail2.example.com.</span>
<span class="c">## 3 @ 50 mail3</span>
</pre></div>
<p>Those can be easily exported into a database or structured plain text files for further data science-y processing.</p>
<h3 id="fin">Fin</h3>
<p>As of this post, there are ~198,000 Node.js modules out there and tons of browser-oriented JavaScript libraries. Many of these can be easily made to work in V8 (some cannot due to lack of functionality in the V8 engine).</p>
<p>If you have a “plumbing” task missing from R that needs implementing, try the V8/JavaScript route first since it took me less than 10 minutes to code up that package (I tweaked documentation, etc. afterwards, though). You don’t want to be three days into an Rcpp implementation when you “could have just used V8”!</p>
<p><center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/PUPdW3ba6F4" frameborder="0" allowfullscreen>
</iframe>
</center></p>

Modern Honey Network Machinations with R, Python, phantomjs, HTML & JavaScript (2015-08-23, Bob Rudis, @hrbrmstr)

<p>This was (initially) going to be a blog post announcing the new <a href="http://github.com/hrbrmstr/mhn">mhn R package</a> (more on what that is in a bit) but somewhere along the way we ended up taking a left turn at Albuquerque (as we often do here at ddsec hq) and had an adventure in a twisty maze of <a href="http://threatstream.github.io/mhn/">Modern Honey Network</a> passages that we thought we’d relate to everyone.</p>
<h3 id="episode-0-the-quest">Episode 0 : The Quest!</h3>
<p>We find our <strike>intrepid heroes</strike> data scientists finally getting around to playing with the Modern Honey Network (<span class="caps">MHN</span>) software that they promised <a href="https://twitter.com/jason_trost">Jason Trost</a> they’d do <em>ages</em> ago. <span class="caps">MHN</span> makes it easy to [freely] centrally setup, control, monitor and collect data from one or more <a href="https://en.wikipedia.org/wiki/Honeypot_(computing)">honeypots</a>. Once you have this data you can generate threat indicator feeds from it and also do analysis on it (which is what we’re interested in eventually doing and what <a href="https://www.threatstream.com/">ThreatStream</a> <em>does do</em> with their global network of <span class="caps">MHN</span> contributors).</p>
<p>Jason has a <a href="https://www.vagrantup.com/">Vagrant</a> <a href="https://github.com/threatstream/mhn/wiki/Getting-up-and-running-using-Vagrant">quickstart</a> version of <span class="caps">MHN</span> which lets you kick the tyres locally, safely and securely before venturing out into the enterprise (or internet). You stand up the server (mostly Python-y things), then tell it what type of honeypot you want to deploy. You get a handy cut-and-paste-able string which you paste-and-execute on a system that will become an actual honeypot (which can be a “real” box, a <span class="caps">VM</span> or even a RaspberryPi!). When the honeypot is finished installing the necessary components it registers with your <span class="caps">MHN</span> server and you’re ready to start catching cyber bad guys.</p>
<p><center><img src="https://farm8.staticflickr.com/7035/6437570877_cf5b1a35de_o_d.jpg"/><br/>(cyber bad guy)</center></p>
<h3 id="episode-1-live-r-package">Episode 1 : Live! R! Package!</h3>
<p>We decided to deploy a test <span class="caps">MHN</span> server and series of honeypots on <a href="https://www.digitalocean.com/?refcode=4bb3577c3b73">Digital Ocean</a> since they work <em><span class="caps">OK</span></em> on the smallest droplet size (not recommended for a production <span class="caps">MHN</span> setup).</p>
<p>While it’s great to peruse the incoming attacks:</p>
<p><center><a href="attacks.png"><img style="max-width:100%" src="http://datadrivensecurity.info/blog/images/2015/08/attacks.png"/></a></center></p>
<p>we wanted programmatic access to the data, so we took a look at all the <a href="https://github.com/threatstream/mhn/blob/master/server/mhn/api/views.py">routes in their <span class="caps">API</span></a> and threw together an <a href="https://github.com/hrbrmstr/mhn">R package</a> to let us work with it.</p>
<div class="highlight"><pre>library<span class="p">(</span>mhn<span class="p">)</span>
attacks <span class="o"><-</span> sessions<span class="p">(</span>hours_ago<span class="o">=</span><span class="m">24</span><span class="p">)</span><span class="o">$</span>data
tail<span class="p">(</span>attacks<span class="p">)</span>
<span class="c1">## _id destination_ip destination_port honeypot</span>
<span class="c1">## 3325 55d93cb8b5b9843e9bb34c75 111.222.33.111 22 p0f</span>
<span class="c1">## 3326 55d93cb8b5b9843e9bb34c74 111.222.33.111 22 p0f</span>
<span class="c1">## 3327 55d93d30b5b9843e9bb34c77 111.222.33.111 22 p0f</span>
<span class="c1">## 3328 55d93da9b5b9843e9bb34c79 <<span class="caps">NA</span>> 6379 dionaea</span>
<span class="c1">## 3329 55d93f1db5b9843e9bb34c7b <<span class="caps">NA</span>> 9200 dionaea</span>
<span class="c1">## 3330 55d94062b5b9843e9bb34c7d <<span class="caps">NA</span>> 23 dionaea</span>
<span class="c1">## identifier protocol source_ip source_port</span>
<span class="c1">## 3325 bf7a3c5e-48e7-11e5-9fcf-040166a73101 pcap 45.114.11.23 58621</span>
<span class="c1">## 3326 bf7a3c5e-48e7-11e5-9fcf-040166a73101 pcap 45.114.11.23 58621</span>
<span class="c1">## 3327 bf7a3c5e-48e7-11e5-9fcf-040166a73101 pcap 93.174.95.81 44784</span>
<span class="c1">## 3328 83e2f4e0-4876-11e5-9fcf-040166a73101 pcap 184.105.139.108 43000</span>
<span class="c1">## 3329 83e2f4e0-4876-11e5-9fcf-040166a73101 pcap 222.186.34.160 6000</span>
<span class="c1">## 3330 83e2f4e0-4876-11e5-9fcf-040166a73101 pcap 113.89.184.24 44028</span>
<span class="c1">## timestamp</span>
<span class="c1">## 3325 2015-08-23T03:23:34.671000</span>
<span class="c1">## 3326 2015-08-23T03:23:34.681000</span>
<span class="c1">## 3327 2015-08-23T03:25:33.975000</span>
<span class="c1">## 3328 2015-08-23T03:27:36.810000</span>
<span class="c1">## 3329 2015-08-23T03:33:48.665000</span>
<span class="c1">## 3330 2015-08-23T03:39:13.899000</span>
</pre></div>
<p><span class="caps">NOTE</span>: that’s not the real <code>destination_ip</code> so don’t go poking since it’s probably someone else’s real system (if it’s even up).</p>
<p>You can also get details about the attackers (this is just one example):</p>
<div class="highlight"><pre>attacker_stats<span class="p">(</span><span class="s">"45.114.11.23"</span><span class="p">)</span><span class="o">$</span>data
<span class="c1">## $count</span>
<span class="c1">## [1] 1861</span>
<span class="c1">## </span>
<span class="c1">## $first_seen</span>
<span class="c1">## [1] "2015-08-22T16:43:59.654000"</span>
<span class="c1">## </span>
<span class="c1">## $honeypots</span>
<span class="c1">## [1] "p0f"</span>
<span class="c1">## </span>
<span class="c1">## $last_seen</span>
<span class="c1">## [1] "2015-08-23T03:23:34.681000"</span>
<span class="c1">## </span>
<span class="c1">## $num_sensors</span>
<span class="c1">## [1] 1</span>
<span class="c1">## </span>
<span class="c1">## $ports</span>
<span class="c1">## [1] 22</span>
</pre></div>
<p>The package makes it really easy (<span class="caps">OK</span>, we’re probably a <em>bit</em> biased) to grab giant chunks of time series and associated metadata for further analysis.</p>
<p>While cranking out the <span class="caps">API</span> package we noticed that there were no endpoints for the <span class="caps">MHN</span> HoneyMap. <em>Yes</em>, they do the “attacks on a map” thing but don’t think too badly of them since most of you seem to want them.</p>
<p><center><a href="map.png"><img style="max-width:100%" src="http://datadrivensecurity.info/blog/images/2015/08/map.png"/></a></center></p>
<p>After poking around the <span class="caps">MHN</span> source a bit more (and navigating the <code>view-source</code> of the map page) we discovered that they use a <a href="https://github.com/threatstream/mhn/blob/master/scripts/install_honeymap.sh">Go-based websocket server</a> to push the honeypot hits out to the map. (You can probably see where this is going, but it takes that turn first).</p>
<h3 id="episode-2-hacking-the-anti-hackers">Episode 2 : Hacking the Anti-Hackers</h3>
<p>The <em>other</em> thing we noticed is that—unlike the <span class="caps">MHN</span>-server proper—the websocket component <em>does not require authentication</em>. Now, to be fair, it’s also not really spitting out seekrit data, just (pretty useless) geocoded attack source/dest and type of honeypot involved.</p>
<p>Still, this got us wondering if we could find other <span class="caps">MHN</span> servers out there in the cold, dark internet. So, we fired up RStudio again and took a look using the <a href="http://github.com/hrbrmstr/shodan">shodan package</a>:</p>
<div class="highlight"><pre>library<span class="p">(</span>shodan<span class="p">)</span>
<span class="c1"># the most obvious way to look for <span class="caps">MHN</span> servers is to </span>
<span class="c1"># scour port 3000 looking for content that is <span class="caps">HTML</span></span>
<span class="c1"># then look for "HoneyMap" in the <title></span>
<span class="c1"># See how many (if any) there are</span>
host_count<span class="p">(</span><span class="s">'port:3000 title:HoneyMap'</span><span class="p">)</span><span class="o">$</span>total
<span class="c1">## [1] 141</span>
<span class="c1"># Grab the first 100</span>
hm_1 <span class="o"><-</span> shodan_search<span class="p">(</span><span class="s">'port:3000 title:HoneyMap'</span><span class="p">)</span>
<span class="c1"># Grab the last 41</span>
hm_2 <span class="o"><-</span> shodan_search<span class="p">(</span><span class="s">'port:3000 title:HoneyMap'</span><span class="p">,</span> page<span class="o">=</span><span class="m">2</span><span class="p">)</span>
head<span class="p">(</span>hm_1<span class="p">)</span>
<span class="c1">## hostnames title</span>
<span class="c1">## 1 HoneyMap</span>
<span class="c1">## 2 hb.c2hosting.com HoneyMap</span>
<span class="c1">## 3 HoneyMap</span>
<span class="c1">## 4 fxxx.you HoneyMap</span>
<span class="c1">## 5 ip-192-169-234-171.ip.secureserver.net HoneyMap</span>
<span class="c1">## 6 ec2-54-148-80-241.us-west-2.compute.amazonaws.com HoneyMap</span>
<span class="c1">## timestamp isp transport</span>
<span class="c1">## 1 2015-08-22T17:14:25.173291 <<span class="caps">NA</span>> tcp</span>
<span class="c1">## 2 2015-08-22T17:00:12.872171 Hosting Consulting tcp</span>
<span class="c1">## 3 2015-08-22T16:49:40.392523 Digital Ocean tcp</span>
<span class="c1">## 4 2015-08-22T15:27:29.661104 <span class="caps">KW</span> Datacenter tcp</span>
<span class="c1">## 5 2015-08-22T14:01:21.014893 GoDaddy.com, <span class="caps">LLC</span> tcp</span>
<span class="c1">## 6 2015-08-22T12:01:52.207879 Amazon tcp</span>
<span class="c1">## data</span>
<span class="c1">## 1 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Sun, 02 Nov 2014 21:16:17 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 17:14:22 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 2 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Wed, 12 Nov 2014 18:52:21 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 17:01:25 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 3 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Mon, 04 Aug 2014 18:07:00 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 16:49:38 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 4 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nDate: Sat, 22 Aug 2015 15:22:23 <span class="caps">GMT</span>\r\nLast-Modified: Sun, 27 Jul 2014 01:04:41 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 5 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Wed, 29 Oct 2014 17:12:22 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 14:01:20 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 6 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 1572\r\nContent-Type: text/html; charset=utf-8\r\nDate: Sat, 22 Aug 2015 12:06:15 <span class="caps">GMT</span>\r\nLast-Modified: Mon, 08 Dec 2014 21:25:26 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## port location.city location.region_code location.area_code location.longitude</span>
<span class="c1">## 1 3000 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span> <span class="caps">NA</span></span>
<span class="c1">## 2 3000 Miami Beach <span class="caps">FL</span> 305 -80.1300</span>
<span class="c1">## 3 3000 San Francisco <span class="caps">CA</span> 415 -122.3826</span>
<span class="c1">## 4 3000 Kitchener <span class="caps">ON</span> <span class="caps">NA</span> -80.4800</span>
<span class="c1">## 5 3000 Scottsdale <span class="caps">AZ</span> 480 -111.8906</span>
<span class="c1">## 6 3000 Boardman <span class="caps">OR</span> 541 -119.5290</span>
<span class="c1">## location.country_code3 location.latitude location.postal_code location.dma_code</span>
<span class="c1">## 1 <<span class="caps">NA</span>> <span class="caps">NA</span> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 2 <span class="caps">USA</span> 25.7906 33109 528</span>
<span class="c1">## 3 <span class="caps">USA</span> 37.7312 94124 807</span>
<span class="c1">## 4 <span class="caps">CAN</span> 43.4236 <span class="caps">N2E</span> <span class="caps">NA</span></span>
<span class="c1">## 5 <span class="caps">USA</span> 33.6119 85260 753</span>
<span class="c1">## 6 <span class="caps">USA</span> 45.7788 97818 810</span>
<span class="c1">## location.country_code location.country_name ipv6</span>
<span class="c1">## 1 <<span class="caps">NA</span>> <<span class="caps">NA</span>> 2600:3c02::f03c:91ff:fe73:4d8b</span>
<span class="c1">## 2 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## 3 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## 4 <span class="caps">CA</span> Canada <<span class="caps">NA</span>></span>
<span class="c1">## 5 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## 6 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## domains org os module ip_str</span>
<span class="c1">## 1 <<span class="caps">NA</span>> <<span class="caps">NA</span>> http 2600:3c02::f03c:91ff:fe73:4d8b</span>
<span class="c1">## 2 c2hosting.com Hosting Consulting <<span class="caps">NA</span>> http 199.88.60.245</span>
<span class="c1">## 3 Digital Ocean <<span class="caps">NA</span>> http 104.131.142.171</span>
<span class="c1">## 4 fxxx.you <span class="caps">KW</span> Datacenter <<span class="caps">NA</span>> http 162.244.29.65</span>
<span class="c1">## 5 secureserver.net GoDaddy.com, <span class="caps">LLC</span> <<span class="caps">NA</span>> http 192.169.234.171</span>
<span class="c1">## 6 amazonaws.com Amazon <<span class="caps">NA</span>> http 54.148.80.241</span>
<span class="c1">## ip asn link uptime</span>
<span class="c1">## 1 <span class="caps">NA</span> <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 2 3344448757 <span class="caps">AS40539</span> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 3 1753452203 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 4 2733907265 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 5 3232361131 <span class="caps">AS26496</span> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 6 915689713 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
</pre></div>
<p>Yikes! 141 servers just on the default port (3000) alone! While these systems may be shown as existing in Shodan, we really needed to confirm that they were, indeed, live <span class="caps">MHN</span> HoneyMap [websocket] servers. </p>
<h3 id="episode-3-picture-imperfect">Episode 3 : Picture [Im]Perfect</h3>
<p>Rather than just test for existence of the websocket/data feed we decided to take a screen shot of every server, which is pretty easy to do with a crude-but-effective mashup of R and <code>phantomjs</code>. For this, we made a script which is just a call—for each of the websocket URLs—to the “built-in” phantomjs <a href="https://gist.github.com/hrbrmstr/6b119648739cd275a69e#file-ourrasterize-js-L45">rasterize.js script</a> that we’ve slightly modified to wait 30 seconds from page open to snapshot creation. We did that in the hopes that we’d see live attacks in the captures.</p>
<div class="highlight"><pre>cat<span class="p">(</span>sprintf<span class="p">(</span><span class="s">"phantomjs rasterize.js http://%s:%s %s.png 800px*600px\n"</span><span class="p">,</span>
hm_1<span class="o">$</span>matches<span class="o">$</span>ip_str<span class="p">,</span>
hm_1<span class="o">$</span>matches<span class="o">$</span>port<span class="p">,</span>
hm_1<span class="o">$</span>matches<span class="o">$</span>ip_str<span class="p">),</span> file<span class="o">=</span><span class="s">"capture.sh"</span><span class="p">)</span>
</pre></div>
<p>That makes <code>capture.sh</code> look something like:</p>
<div class="highlight"><pre>phantomjs rasterize.js http://199.88.60.245:3000 199.88.60.245.png 800px*600px
phantomjs rasterize.js http://104.131.142.171:3000 104.131.142.171.png 800px*600px
phantomjs rasterize.js http://162.244.29.65:3000 162.244.29.65.png 800px*600px
phantomjs rasterize.js http://192.169.234.171:3000 192.169.234.171.png 800px*600px
phantomjs rasterize.js http://54.148.80.241:3000 54.148.80.241.png 800px*600px
phantomjs rasterize.js http://95.97.211.86:3000 95.97.211.86.png 800px*600px
</pre></div>
<p>Yes, there <em>are</em> far more elegant ways to do this, but the number of URLs was small and we had no time constraints. We could have used a
pure phantomjs solution (list of URLs in phantomjs JavaScript) or used
<span class="caps">GNU</span> parallel to speed up the image captures as well.</p>
<p>Sifting through ~140 images manually to see if any had “hits” would not have been <em>too</em> bad, but a glance at the directory listing showed that many had the exact same size, meaning those were probably showing a default/blank map. We <code>uniq</code>’d them by <span class="caps">MD5</span> hash and made an image gallery of them:</p>
<p><center>
<iframe style="max-width:100%"
src="/iframes/mhn.html"
sandbox="allow-same-origin
allow-scripts" width="100%"
height="500"
scrolling="no"
seamless="seamless"
frameBorder="0"></iframe>
</center></p>
<p>It was interesting to see Mexico <span class="caps">CERT</span> and OpenDNS in the mix.</p>
<p>Most of the 141 were active/live <span class="caps">MHN</span> HoneyMap sites. We can only imagine what a full Shodan search for HoneyMaps on other ports would come back with (mostly since we only have the basic <span class="caps">API</span> access and don’t want to burn the credits).</p>
<h3 id="episode-3-with-meh-data-comes-great-irresponsibility">Episode 4 : With “Meh” Data Comes Great Irresponsibility</h3>
<p>For those who may not have been with DDSec for its entirety, you may not be aware that we have our <em>own</em> <a href="http://ocularwarfare.com/ipew/">attack map</a> (<a href="https://github.com/hrbrmstr/pewpew">github</a>).</p>
<p>We thought it would be interesting to see if we could mash up <span class="caps">MHN</span> HoneyMap data with our creation. We first had to see what the websocket returned. Here’s a bit of Python to do that (the R <code>websockets</code> package was abandoned by its creator, but keep an eye out for another @hrbrmstr resurrection):</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">websocket</span>
<span class="kn">import</span> <span class="nn">thread</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="k">def</span> <span class="nf">on_message</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">message</span><span class="p">):</span>
    <span class="k">print</span> <span class="n">message</span>
<span class="k">def</span> <span class="nf">on_error</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">error</span><span class="p">):</span>
    <span class="k">print</span> <span class="n">error</span>
<span class="k">def</span> <span class="nf">on_close</span><span class="p">(</span><span class="n">ws</span><span class="p">):</span>
    <span class="k">print</span> <span class="s">"### closed ###"</span>
<span class="n">websocket</span><span class="o">.</span><span class="n">enableTrace</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ws</span> <span class="o">=</span> <span class="n">websocket</span><span class="o">.</span><span class="n">WebSocketApp</span><span class="p">(</span><span class="s">"ws://128.199.121.95:3000/data/websocket"</span><span class="p">,</span>
<span class="n">on_message</span> <span class="o">=</span> <span class="n">on_message</span><span class="p">,</span>
<span class="n">on_error</span> <span class="o">=</span> <span class="n">on_error</span><span class="p">,</span>
<span class="n">on_close</span> <span class="o">=</span> <span class="n">on_close</span><span class="p">)</span>
<span class="n">ws</span><span class="o">.</span><span class="n">run_forever</span><span class="p">()</span>
</pre></div>
<p>That particular server is <em>very</em> active, which is why we chose to use it.</p>
<p>The output should look something like:</p>
<div class="highlight"><pre><span class="nv">$ </span>python ws.py
--- request header ---
<span class="caps">GET</span> /data/websocket <span class="caps">HTTP</span>/1.1
Upgrade: websocket
Connection: Upgrade
Host: 128.199.121.95:3000
Origin: http://128.199.121.95:3000
Sec-WebSocket-Key: <span class="nv">07EFbUtTS4ubl2mmHS1ntQ</span><span class="o">==</span>
Sec-WebSocket-Version: 13
-----------------------
--- response header ---
<span class="caps">HTTP</span>/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: nvTKSyCh+k1Rl5HzxkVNAZjZZUA<span class="o">=</span>
-----------------------
<span class="o">{</span><span class="s2">"city"</span>:<span class="s2">"Clarks Summit"</span>,<span class="s2">"city2"</span>:<span class="s2">"San Francisco"</span>,<span class="s2">"countrycode"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"countrycode2"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"latitude"</span>:41.44860076904297,<span class="s2">"latitude2"</span>:37.774898529052734,<span class="s2">"longitude"</span>:-75.72799682617188,<span class="s2">"longitude2"</span>:-122.41940307617188,<span class="s2">"type"</span>:<span class="s2">"p0f.events"</span><span class="o">}</span>
<span class="o">{</span><span class="s2">"city"</span>:<span class="s2">"Clarks Summit"</span>,<span class="s2">"city2"</span>:<span class="s2">"San Francisco"</span>,<span class="s2">"countrycode"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"countrycode2"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"latitude"</span>:41.44860076904297,<span class="s2">"latitude2"</span>:37.774898529052734,<span class="s2">"longitude"</span>:-75.72799682617188,<span class="s2">"longitude2"</span>:-122.41940307617188,<span class="s2">"type"</span>:<span class="s2">"p0f.events"</span><span class="o">}</span>
<span class="o">{</span><span class="s2">"city"</span>:null,<span class="s2">"city2"</span>:<span class="s2">"Singapore"</span>,<span class="s2">"countrycode"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"countrycode2"</span>:<span class="s2">"<span class="caps">SG</span>"</span>,<span class="s2">"latitude"</span>:32.78310012817383,<span class="s2">"latitude2"</span>:1.2930999994277954,<span class="s2">"longitude"</span>:-96.80670166015625,<span class="s2">"longitude2"</span>:103.85579681396484,<span class="s2">"type"</span>:<span class="s2">"p0f.events"</span><span class="o">}</span>
</pre></div>
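<p>Each line the socket pushes is a standalone <span class="caps">JSON</span> object, so turning one into a map-ready record takes nothing beyond the standard library. Here’s a sketch using one record copied from the stream above (the unsuffixed fields are the attack source, the <code>…2</code> fields the honeypot):</p>

```python
import json

# One record copied verbatim from the websocket stream shown above.
raw = ('{"city":"Clarks Summit","city2":"San Francisco",'
       '"countrycode":"US","countrycode2":"US",'
       '"latitude":41.44860076904297,"latitude2":37.774898529052734,'
       '"longitude":-75.72799682617188,"longitude2":-122.41940307617188,'
       '"type":"p0f.events"}')

rec = json.loads(raw)
src = (rec["latitude"], rec["longitude"])    # attack source geolocation
dst = (rec["latitude2"], rec["longitude2"])  # honeypot geolocation
```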
<p>Those are near-perfect <span class="caps">JSON</span> records for our map, so we figured out a way to tell iPew/PewPew (whatever folks are calling it these days) to take any accessible <span class="caps">MHN</span> HoneyMap as a live data source. For example, to plug this highly active HoneyMap into iPew all you need to do is <a href="http://ocularwarfare.com/ipew/?mhnsource=http://128.199.121.95:3000/data/">this</a>:</p>
<blockquote>
<p><code>http://ocularwarfare.com/ipew/?mhnsource=http://128.199.121.95:3000/data/</code></p>
</blockquote>
<p>Once we make the websockets component of the iPew map a bit more resilient we’ll post it to GitHub (you can just view the source to try it on your own now).</p>
<h3 id="fin">Fin</h3>
<p>As we stated up front, the main goal of this post is to introduce the <a href="http://github.com/hrbrmstr/mhn">mhn package</a>. But, our diversion has us curious. Are the open instances of HoneyMap deliberate or accidental? If any of them are “real” honeypot research or actual production environments, does such an open presence of the <span class="caps">MHN</span> controller reduce the utility of the honeypot nodes? Is Greenland paying ThreatStream to use that map projection instead of a better one?</p>
<p>If you use the new package, found this post helpful (or, at least, amusing) or know the answers to any of those questions, drop a note in the comments.</p>New R Package - domaintools (access the DomainTools.com WHOIS API)2015-08-09T15:11:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-08-09:posts/2015/Aug/new-r-package-domaintools/<p>We just did a <a href="https://github.com/hrbrmstr/domaintools">github release</a> for an R package that provides an interface to the <a href="http://www.domaintools.com/resources/api-documentation/">DomainTools <span class="caps">API</span></a>. It provides access to the core <span class="caps">API</span> functions that aren’t restricted (i.e. the ones we have access to):</p>
<ul>
<li><code>domaintools_api_key</code>: Get or set <code>DOMAINTOOLS_API_KEY</code> value</li>
<li><code>domaintools_username</code>: Get or set <code>DOMAINTOOLS_API_USERNAME</code> value</li>
<li><code>domain_profile</code>: Domain Profile</li>
<li><code>hosting_history</code>: Hosting History</li>
<li><code>parsed_whois</code>: Parsed Whois</li>
<li><code>reverse_ip</code>: Reverse <span class="caps">IP</span></li>
<li><code>reverse_ns</code>: Reverse Nameserver</li>
<li><code>shared_ips</code>: Shared IPs</li>
<li><code>whois</code>: Whois Lookup</li>
<li><code>whois_history</code>: Whois History</li>
</ul>
<p>Each function has a full description and sample call, so feel free to kick the tires and provide feedback on github.</p>
<p>If you have access to the <span class="caps">API</span> elements we do not, please either contribute a <span class="caps">PR</span> or help us out with some testing.</p>
<p>This is one more package on our path towards a complete set of “cybersecurity” R packages to help information security folk get their (hopefully) data-driven jobs done in R. I believe @<a href="twitter.com/quominus">quominus</a> <em>may</em> be working on a macro “whois” package to unify access to all the various <span class="caps">WHOIS</span> services, too.</p>The New and Improved R Shodan Package2015-08-07T11:30:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-08-07:posts/2015/Aug/the-new-and-improved-r-shodan-package/<p>For those not involved with all things “cyber”, let me start with a description of what <a href="http://shodan.io/">Shodan</a> is (though visiting the site is probably the best introduction to what secrets it holds).</p>
<p>Shodan is—at its core—a search engine. Unlike Google, Shodan indexes what I’ll call “cyber” metadata and content about everything accessible via a public <span class="caps">IP</span> address. This means things like</p>
<ul>
<li>routers, switches and cable/<span class="caps">DSL</span>/FiOS modems (which are the underpinnings of our internet access)</li>
<li>internet web, ftp, mail, etc servers</li>
<li>public (protected or otherwise) <span class="caps">CCTV</span> <span class="amp">&</span> home surveillance <span class="amp">&</span> web cameras</li>
<li>desktops, printers and other things that may end up in public <span class="caps">IP</span> space</li>
<li>gas station pumps and industrial control systems</li>
<li>VoIP phones <span class="amp">&</span> more</li>
</ul>
<p>Shodan contacts the <span class="caps">IP</span> addresses associated with all the devices, sees what <a href="https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers">ports</a> and <a href="https://en.wikipedia.org/wiki/Internet_Protocol">protocols</a> might be in use and then tries to retrieve content from those ports and protocols (which could be anything from webcam snapshots to web server <span class="caps">HTML</span> to actual header responses from internet servers to banners from routers and switches). It indexes all that metadata and content and makes it available in a search engine and <span class="caps">API</span> for security researchers (I was <em>so</em> tempted to put that word in quotes).</p>
<p>To give you an idea what it can do, take a look at <a href="https://www.shodan.io/search?query=Server%3A+SQ-WEBCAM">this query for webcams</a> and/or read this <a href="http://null-byte.wonderhowto.com/how-to/hack-like-pro-find-vulnerable-webcams-across-globe-using-shodan-0154830/">full explanation of what you can do with that data</a>.</p>
<p>While you can have fun with Shodan, it does have real value to security folk and R needed a real <span class="caps">API</span> interface to it (I did a half-hearted one a couple years ago). Hence the rebirth of the <a href="https://github.com/hrbrmstr/shodan">shodan package</a>.</p>
<p>The package is brand-new, but it has basic, full coverage of the <a href="https://developer.shodan.io/api">Shodan <span class="caps">API</span></a> <em>except</em> for the streaming functions. But, a line of code is worth a thousand blatherings, so let’s find all the <span class="caps">IIS</span> servers in Maine.</p>
<div class="highlight"><pre><span class="c1"># devtools::install_github("hrbrmstr/shodan")</span>
library<span class="p">(</span>shodan<span class="p">)</span>
<span class="c1"># perform the query for <span class="caps">IIS</span> servers in Maine</span>
maine_iis <span class="o"><-</span> shodan_search<span class="p">(</span><span class="s">"iis state:me"</span><span class="p">)</span>
<span class="c1"># get the total number of <span class="caps">IIS</span> servers in Maine that Shodan found</span>
print<span class="p">(</span>maine_iis<span class="o">$</span>total<span class="p">)</span>
<span class="c1">## [1] 2948</span>
<span class="c1"># how many did it return in this page of the query?</span>
print<span class="p">(</span>nrow<span class="p">(</span>maine_iis<span class="o">$</span>matches<span class="p">))</span>
<span class="c1">## [1] 100</span>
<span class="c1"># what else does it know about these servers?</span>
print<span class="p">(</span>colnames<span class="p">(</span>maine_iis<span class="o">$</span>matches<span class="p">))</span>
<span class="c1">## [1] "product" "hostnames" "version" "title" "ip" "org" </span>
<span class="c1">## [7] "isp" "cpe" "data" "asn" "port" "transport"</span>
<span class="c1">## [13] "timestamp" "domains" "ip_str" "os" "_shodan" "location" </span>
<span class="c1">## [19] "ssl" "link"</span>
</pre></div>
<p>Now, the data frame in <code>maine_iis$matches</code> is somewhat ugly for the moment. Some columns have lists and data frames since the Shodan <span class="caps">REST</span> <span class="caps">API</span> returns (like many APIs do) nested <span class="caps">JSON</span>. I’m actually looking for collaboration on what would be the most useful format for the returned data structures so hit me up if you have ideas that would benefit your use of it.</p>
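<p>One candidate shape for those nested columns, sketched in Python rather than R: recursively flatten each nested object into dotted column names, mirroring the <code>location.city</code>-style names already visible in the Shodan output earlier in this post. (This is an illustration of the idea, not code from the package.)</p>

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one level with dotted keys, e.g.
    {"location": {"city": ...}} becomes {"location.city": ...}."""
    out = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out
```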
<p>I’ll violate my own rule about mapping <span class="caps">IP</span> addresses just to show you Shodan also does geolocation for you (and, hey, y’all seem to like maps). We’ll make it a <em>bit</em> more useful and add some metadata about what it found to the location popups:</p>
<div class="highlight"><pre>library<span class="p">(</span>leaflet<span class="p">)</span>
library<span class="p">(</span>htmltools<span class="p">)</span>
for_map <span class="o"><-</span> cbind.data.frame<span class="p">(</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>location<span class="p">,</span>
ip<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>ip<span class="p">,</span>
isp<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>isp<span class="p">,</span>
title<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>title<span class="p">,</span>
org<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>org<span class="p">,</span>
data<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>data<span class="p">,</span>
stringsAsFactors<span class="o">=</span><span class="kc"><span class="caps">FALSE</span></span><span class="p">)</span>
leaflet<span class="p">(</span>for_map<span class="p">,</span> width<span class="o">=</span><span class="s">"600"</span><span class="p">,</span> height<span class="o">=</span><span class="s">"600"</span><span class="p">)</span> <span class="o">%>%</span>
addTiles<span class="p">()</span> <span class="o">%>%</span>
setView<span class="p">(</span><span class="m">-69.233328</span><span class="p">,</span> <span class="m">45.250556</span><span class="p">,</span> <span class="m">7</span><span class="p">)</span> <span class="o">%>%</span>
addCircles<span class="p">(</span>data<span class="o">=</span>for_map<span class="p">,</span> lng<span class="o">=~</span>longitude <span class="p">,</span> lat<span class="o">=~</span>latitude<span class="p">,</span>
popup<span class="o">=~</span>sprintf<span class="p">(</span><span class="s">"<b>%s</b><br/>%s, Maine</b><br/><span class="caps">ISP</span>: %s<br/><hr noshade size='1'/><pre>%s\n\n%s"</span><span class="p">,</span>
htmlEscape<span class="p">(</span>org<span class="p">),</span> htmlEscape<span class="p">(</span>city<span class="p">),</span> htmlEscape<span class="p">(</span>isp<span class="p">),</span>
htmlEscape<span class="p">(</span>title<span class="p">),</span> htmlEscape<span class="p">(</span>data<span class="p">)))</span>
</pre></div>
<p><center>
<b><span class="caps">IIS</span> Servers in Maine</b>
<iframe style="max-width:100%"
src="/widgets/2015-08-08-shodan-01.html"
sandbox="allow-same-origin
allow-scripts" width="600"
height="600"
scrolling="no"
seamless="seamless"
frameBorder="0"></iframe>
</center></p>
<p>Remember that’s only 100 of ~3,000 servers, but it should give you an idea of the types of data Shodan can return.</p>
<p>The package is <a href="https://github.com/hrbrmstr/shodan">up on github</a> for now, and here’s a list of functions it makes available:</p>
<ul>
<li><code>account_profile</code>: Account Profile</li>
<li><code>api_info</code>: <span class="caps">API</span> Plan Information</li>
<li><code>host_count</code>: Search Shodan without Results</li>
<li><code>host_info</code>: Host Information</li>
<li><code>my_ip</code>: My <span class="caps">IP</span> Address</li>
<li><code>query_tags</code>: List the most popular tags</li>
<li><code>resolve</code>: <span class="caps">DNS</span> Lookup</li>
<li><code>reverse</code>: Reverse <span class="caps">DNS</span> Lookup</li>
<li><code>shodan_api_key</code>: Get or set SHODAN_API_KEY value</li>
<li><code>shodan_exploit_search</code>: Search for Exploits</li>
<li><code>shodan_exploit_search_count</code>: Search for Exploits without Results</li>
<li><code>shodan_ports</code>: List all ports that Shodan is crawling on the Internet.</li>
<li><code>shodan_protocols</code>: List all protocols that can be used when performing on-demand Internet scans via Shodan.</li>
<li><code>shodan_query_list</code>: List the saved search queries</li>
<li><code>shodan_query_search</code>: Search the directory of saved search queries.</li>
<li><code>shodan_scan</code>: Request Shodan to crawl an <span class="caps">IP</span>/ netblock</li>
<li><code>shodan_scan_internet</code>: Crawl the Internet for a specific port and protocol using Shodan</li>
<li><code>shodan_search</code>: Search Shodan</li>
<li><code>shodan_search_tokens</code>: Break the search query into tokens</li>
<li><code>shodan_services</code>: List all services that Shodan crawls</li>
</ul>
<p>Each of those maps to the <a href="https://developer.shodan.io/api"><span class="caps">API</span> endpoints</a> described on the official Shodan site.</p>
<p>You are invited to tag along on this package as much or as little as you like. Drop a note in the comments if you find it useful or have suggestions! Please file all feature requests or problems on github. Have fun exploring the <span class="caps">API</span> in R!</p>RBerkeley Was Just Pining For The Fjords2015-07-27T11:13:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-07-27:posts/2015/Jul/rberkeley-was-just-pining-for-the-fjords/<p><b><span class="caps">UPDATE</span>:</b> <code>RBerkeley</code> is now <a href="https://cran.r-project.org/web/packages/RBerkeley/index.html">on <span class="caps">CRAN</span></a></p>
<p>If you made it to Chapter 8 of <a href="http://datadrivensecurity.info/amzn">Data-Driven Security</a> after ~October 2014 and tried to run the BerkeleyDB R example, you were greeted with:</p>
<div class="highlight"><pre>Warning in install.packages :
package ‘RBerkeley’ is not available (for R version [YOUR_R_VERSION])
</pre></div>
<p>That’s due to the fact that it was removed from <span class="caps">CRAN</span> at the end of September, 2014 because the package author <span class="amp">&</span> maintainer did not respond to requests from the <span class="caps">CRAN</span> team to update the package to conform to new requirements (specifically the way package vignettes are handled).</p>
<p>Sharon Machlis (@<a href="https://twitter.com/sharon000">sharon000</a> on Twitter) let me know about this recently. Since then I’ve had a few more pings about it (thank you all for reading the book! :-). So, I <a href="https://github.com/hrbrmstr/RBerkeley">resurrected the package</a>. <strike>It’s not on <span class="caps">CRAN</span> yet, but I did submit an update to it, so we’ll see how that goes.</strike></p>
<p>I did a bit more than move the vignette. It has a proper <code>autoconf</code> setup now and I fixed some of the warnings it was throwing on compilation. I also tweaked the configuration so it should work without whining on <code>libdb</code> 4+. </p>
<p>I highly doubt there were many other packages or projects relying on this package, but it seemed only fair to try to keep it alive while the book is still going strong (either that or I would have had to write a new example for that chapter, which <em>may</em> have been easier now that I’ve mucked with the package innards).</p>
<p>Post all issues/etc on github as usual: <a href="https://github.com/hrbrmstr/RBerkeley">https://github.com/hrbrmstr/RBerkeley</a></p>
<iframe width="640" height="390" src="https://www.youtube.com/embed/npjOSLCR2hE" frameborder="0" allowfullscreen></iframe>Introducing the cymruservices R Package2015-07-22T17:19:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-07-22:posts/2015/Jul/introducing-the-cymruservices-r-package/<p>The R world has come a long way since Jay <span class="amp">&</span> I wrote <a href="http://datadrivensecurity.info/amzn">Data-Driven Security</a>. We had to make a conscious decision to stick with R 2.14.0 (R is at version 3.2.1 now) and packages such as knitr and dplyr either didn’t exist or were in their infancy.</p>
<p>In Chapter 4, we showed some very basic exploratory data analysis and visualization. One of those examples showed how to do a basic network visualization of the ZeuS botnet nodes, clustered by country of origin.</p>
<p>We turned some of the functions that collected metadata on the ZeuS <span class="caps">IP</span> addresses into a new R package - <a href="https://github.com/hrbrmstr/cymruservices">cymruservices</a> which will be on <span class="caps">CRAN</span> soon. If you’re new to installing from github, you’ll need to install and load the <code>devtools</code> package then do a <code>devtools::install_github("hrbrmstr/cymruservices")</code> to work with that package until it gets on <span class="caps">CRAN</span>. (<span class="caps">UPDATE</span>: It’s <a href="http://cran.r-project.org/web/packages/cymruservices/index.html">on <span class="caps">CRAN</span></a>.)</p>
<p>We’ll re-create the first network visualization from listing 4-12 (page 94) using this package and also modify the code to use <code>dplyr</code> functions and visualize the graph with <code>networkD3</code>, a super-spiffy <code>htmlwidget</code> package. You’ll be able to pan <span class="amp">&</span> zoom the visualization and hopefully get some inspiration to “Try This At Home”.</p>
<p>We’ve placed the ZeuS botnet data used in the book on our website to make it easier to replicate the example. The code is (unsurprisingly) similar to the listing in the book:</p>
<div class="highlight"><pre>library<span class="p">(</span>igraph<span class="p">)</span>
library<span class="p">(</span>dplyr<span class="p">)</span>
library<span class="p">(</span>cymruservices<span class="p">)</span>
library<span class="p">(</span>networkD3<span class="p">)</span>
<span class="c1"># reading the <span class="caps">IP</span> list in a slightly different way</span>
ips <span class="o"><-</span> grep<span class="p">(</span><span class="s">"^#|^$"</span><span class="p">,</span> readLines<span class="p">(</span><span class="s">"http://datadrivensecurity.info/data/zeus-book.csv"</span><span class="p">),</span>
value<span class="o">=</span><span class="kc"><span class="caps">TRUE</span></span><span class="p">,</span> invert<span class="o">=</span><span class="kc"><span class="caps">TRUE</span></span><span class="p">)</span>
<span class="c1"># get metadata</span>
origin <span class="o"><-</span> bulk_origin<span class="p">(</span>ips<span class="p">)</span>
<span class="c1"># build graph</span>
g <span class="o"><-</span> graph.empty<span class="p">()</span>
g <span class="o"><-</span> g <span class="o">+</span> vertices<span class="p">(</span>ips<span class="p">,</span> group<span class="o">=</span><span class="m">1</span><span class="p">)</span>
g <span class="o"><-</span> g <span class="o">+</span> vertices<span class="p">(</span>origin<span class="o">$</span>cc<span class="p">,</span> group<span class="o">=</span><span class="m">2</span><span class="p">)</span>
<span class="c1"># there are other ways to build this edgelist, but I'm keeping with </span>
<span class="c1"># the example in the book for consistency</span>
ip_cc_edges <span class="o"><-</span> lapply<span class="p">(</span>ips<span class="p">,</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span>
i_cc <span class="o"><-</span> filter<span class="p">(</span>origin<span class="p">,</span> ip<span class="o">==</span>x<span class="p">)</span> <span class="o">%>%</span> .<span class="o">$</span>cc
lapply<span class="p">(</span>i_cc<span class="p">,</span> <span class="kr">function</span><span class="p">(</span>y<span class="p">)</span> <span class="p">{</span>
c<span class="p">(</span>x<span class="p">,</span> y<span class="p">)</span>
<span class="p">})</span>
<span class="p">})</span>
g <span class="o"><-</span> g <span class="o">+</span> edges<span class="p">(</span>unlist<span class="p">(</span>ip_cc_edges<span class="p">))</span>
<span class="c1"># simplify graph</span>
g <span class="o"><-</span> simplify<span class="p">(</span>g<span class="p">,</span> edge.attr.comb<span class="o">=</span>list<span class="p">(</span>weight<span class="o">=</span><span class="s">"sum"</span><span class="p">))</span>
g <span class="o"><-</span> delete.vertices<span class="p">(</span>g<span class="p">,</span> which<span class="p">(</span>degree<span class="p">(</span>g<span class="p">)</span> <span class="o"><</span> <span class="m">1</span><span class="p">))</span>
<span class="c1"># get ready to make javascript vis</span>
gd <span class="o"><-</span> get.data.frame<span class="p">(</span>g<span class="p">,</span> what <span class="o">=</span> <span class="s">"edges"</span><span class="p">)</span>
simpleNetwork<span class="p">(</span>gd<span class="p">,</span> linkDistance<span class="o">=</span><span class="m">20</span><span class="p">,</span> charge<span class="o">=</span><span class="m">-100</span><span class="p">,</span>
nodeColour<span class="o">=</span><span class="s">"#377eb8"</span><span class="p">,</span> textColour<span class="o">=</span><span class="s">"black"</span><span class="p">,</span>
fontSize<span class="o">=</span><span class="m">7</span><span class="p">,</span> fontFamily<span class="o">=</span><span class="s">"sans-serif"</span><span class="p">,</span>
height<span class="o">=</span><span class="m">600</span><span class="p">,</span> width<span class="o">=</span><span class="m">600</span><span class="p">,</span> zoom<span class="o">=</span><span class="kc"><span class="caps">TRUE</span></span><span class="p">)</span>
</pre></div>
<p>If you have the book, take a look at some of the subtle changes and also see how easy it is to make existing, static R visualizations dynamic.</p>
<p><center><iframe height=600 width=600 style="width:600px;height:600px" frameborder=0 seamless src="http://datadrivensecurity.info/frames/201507cymru.html"></iframe></center></p>
<p>There are a few more interesting functions in that package that will get you tons of useful metadata for your security data science projects. The package should be helpful when creating features for classification or just for building relationships between objects that you may never have known existed. Plus, you now have a new visualization toy to play with!</p>
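<p>For a taste of those other functions, here is a hedged sketch — <code>bulk_origin()</code> is the one used in the listing above, while the other function names and their exact signatures are assumptions you should verify against the package documentation:</p>
<div class="highlight"><pre>library(cymruservices)

# origin AS details for a couple of addresses (used in the listing above)
bulk_origin(c("68.22.187.5", "207.229.165.18"))

# AS metadata by ASN — function name assumed; check ?cymruservices
bulk_origin_asn(c(701, 23028))

# query Team Cymru's Malware Hash Registry — function name assumed
malware_hash("1250ac278944a0737707cf40a0fbecd4b5a17c9d")
</pre></div>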