Data Driven Security (http://datadrivensecurity.info/blog/)

Mike Sconzo’s Ten Commandments of Python Data Science (2016-03-08, Mike Sconzo, @sooshie)

<p>Straight from the Book of <span class="caps">PEP</span></p>
<ol>
<li>thou shalt have no other languages before me</li>
<li>thou shalt not compare me to R</li>
<li>thou shalt not take the name of python or scikit-learn in vain</li>
<li>keep holy the jupyter notebook</li>
<li>honour thy pip and thy modules</li>
<li>thou shalt not ^C any running program, but shall exit cleanly</li>
<li>thou shalt not “experiment” with R</li>
<li>thou shalt utilize the whole <span class="caps">CPU</span> for thine is a single thread</li>
<li>thou shalt not defame lesser languages (e.g. all of them)</li>
<li>thou shalt not attempt to reproduce in python what others do in R</li>
</ol>

Your Data-Driven Guide To 2016 RSA Conference (U.S. Edition) (2016-01-23, Bob Rudis, @hrbrmstr)

<p>While I may not be able to attend the 2016 <span class="caps">RSA</span> Conference, I <em>can</em> provide some recommendations for those seeking a more data-driven schedule between parties and recovery breakfasts.</p>
<ul>
<li>There is a high likelihood that <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2343/advancing-information-risk-practices-seminar">Advancing Information Risk Practices Seminar</a> will have sage <span class="amp">&</span> practical advice on how to use data to best manage risk in your organization.</li>
<li>The always amazing Anton Chuvakin’s session on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2452/demystifying-security-analytics-data-methods-use">Demystifying Security Analytics: Data, Methods, Use Cases</a> will be a great primer for those who have struggled to get a successful analytics practice off the ground.</li>
<li>I’ve been assured no IPv4 addresses, malware hashes or crafty URLs were harmed in the making of Wade Baker’s talk on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2364/bridging-the-gap-between-threat-intelligence-and">Bridging the Gap between Threat Intelligence and Risk Management</a>. If there’s anyone who is more data-driven than Jay <span class="amp">&</span> me, it’s Wade.</li>
<li><span class="dquo">“</span>Maturity models” always terrify me as they are prone to simplicity. But, if you’re starting from scratch, they <em>can</em> be an effective gateway drug into more advanced data-driven security practices. Give <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2509/grow-up-a-maturity-model-and-roadmap-for">Grow Up: A Maturity Model and Roadmap for Vulnerability Management (Core Security)</a> a listen if you’re just starting on the path.</li>
<li>I also wince at the mere hint of “big data”, but <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2771/security-monitoring-in-the-real-world-with">Security Monitoring in the Real World with Petabytes of Data</a> may be worth a listen if you’re in a large org and are tired of fighting (and paying for) Splunk.</li>
<li>If data-driven devops is your thing, Scott Kennedy’s <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2297/devsecops-the-tao-of-security-science">DevSecOps—The Tao of Security Science</a> was spot-instanced just for you.</li>
<li>Moar “big data” at this one, but at-scale data classification is a real issue in large orgs. <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2610/applying-auto-data-classification-techniques-for">Applying Auto-Data Classification Techniques for Large Data Sets</a> by Anchit Arora may help you carve your towering data peaks down to size.</li>
<li>The economics of security go beyond security department budgets. Destabilizing the cybercrime economy is an approach orgs don’t often think about. You may find key elements of how to do that at <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2302/malware-as-a-service-kill-the-supply-chain">Malware as a Service: Kill the Supply Chain</a>.</li>
<li>Jack Jones seems to be arguing against maturity models in his talk: <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2354/how-infosec-maturity-models-are-missing-the-point">How Infosec Maturity Models Are Missing the Point</a>. Go to both and decide for yourself!</li>
<li><a href="https://www.rsaconference.com/events/us16/agenda/sessions/2790/data-driven-app-sec">Data-Driven App Sec</a>. With a title like that, it has to be on the list, no?</li>
<li>Hubbard wrote the book on measuring anything and his new book on doing so in cyber is sure to be a hit with the data-driven community. Get a preview of it at <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2384/how-to-measure-anything-in-cybersecurity-risk">How to Measure Anything in Cybersecurity Risk</a>.</li>
<li>Lance<sup>2</sup> will definitely be including the use of data in their talk on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2308/transforming-your-security-culture-from-awareness">Transforming Your Security Culture: From Awareness to Practice to Maturity</a>.</li>
<li>I don’t know Clay and “best practices” terrify me more than <span class="caps">FBI</span> iOS hacking, but <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2759/building-security-data-science-capability">Building Security Data Science Capability</a> may be chock full of sage advice.</li>
<li>One more where the title alone seems to mandate inclusion: <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2358/data-science-transforming-security-operations">Data Science Transforming Security Operations</a>. It’s by an RSAer at an <span class="caps">RSA</span> conference, so <em>caveat spectator</em>.</li>
<li>You might want to check out <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2357/effectively-measuring-cybersecurity-improvement-a">Effectively Measuring Cybersecurity Improvement: A <span class="caps">CSF</span> Use Case</a> for good-er-ah-<em>measure</em>?</li>
<li>Despite now working for a router company, the former OpenDNS folks always have interesting talks. While there’s yet moar “big data” in <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2336/using-large-scale-data-to-provide-attacker">Using Large Scale Data to Provide Attacker Attribution for Unknown IoCs</a> it will most likely be a fun and informative session.</li>
<li>There seems to be a whole lotta <em>measuring</em> going on this year at <span class="caps">RSA</span> and Lisa’s talk on <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2524/measuring-what-matters">Measuring What Matters</a> may help you focus on asking the right questions so you can get your metrics program back on track (or start one!).</li>
<li>I don’t know how data-driven <a href="https://www.rsaconference.com/events/us16/agenda/sessions/2479/this-doesnt-end-well-the-tld-explosion">This Doesn’t End Well: The <span class="caps">TLD</span> Explosion</a> will be, but I despise these new silly TLDs, and if you walk away from this talk hating them, too, then it’s Mission: Accomplished for me.</li>
<li><a href="https://www.rsaconference.com/events/us16/agenda/sessions/2412/leveraging-analytics-for-data-protection-decisions">Leveraging Analytics for Data Protection Decisions</a> is a guaranteed 5-star talk (<span class="caps">NOTE</span>: David did not pay me in pastries to say that).</li>
</ul>
<p>Did I miss any? Disagree with my choices? Drop me a note in the comments or on Twitter!</p>
<p>If you do attend any or all of these and would like to be on the podcast to give us your first-person review, drop us a note or find Jay at <span class="caps">RSA</span> and get us your contact info.</p>

Data-Driven Security Podcast & Book Update (2016-01-11, Bob Rudis, @hrbrmstr)

<p>We’re starting off the new year with two new ways to listen to the <a href="http://podcast.datadrivensecurity.info">Data-Driven Security Podcast</a>!</p>
<p>First, we have our own <a href="https://overcast.fm/itunes791001982/data-driven-security">Overcast station</a> fully loaded with the previous two seasons of shows. You can listen to them online right on Overcast.fm or use their minimalist but highly functional app for <a href="https://overcast.fm/itunes791001982/data-driven-security">iOS</a>.</p>
<p>You can also find and add the podcast on <a href="http://tunein.com/radio/Data-Driven-Security-p824534/">TuneIn</a>! It was crazy-cool to be able to tell the Amazon Echo: “<code>Alexa, tune in to the Data-Driven Security podcast</code>” and have it actually work. You can also play it directly from the TuneIn site or with their <a href="https://play.google.com/store/apps/details?id=tunein.player&hl=en&referrer=utm_medium%3dreferral%26utm_source%3dtunein.com%26utm_campaign%3dweb_redirect%26utm_content%3dp824534%26sourceGuideId%3dp824534">Android</a> <span class="amp">&</span> <a href="https://itunes.apple.com/us/app/tunein-radio/id418987775?mt=8&uo=4&at=1l3v4iy&referrer=utm_medium%3dreferral%26utm_source%3dtunein.com%26utm_campaign%3dweb_redirect%26utm_content%3dp824534%26sourceGuideId%3dp824534">iOS</a> apps.</p>
<p>We’re working hard to showcase great and useful work on the podcast and welcome <a href="http://datadrivensecurity.info/blog/pages/topic-request.html">topic requests</a>. If you’re not crazy about Google Forms, just drop us a note on Twitter or long-form in e-mail <a href="mailto:bob@datadrivensecurity.info">bob@datadrivensecurity.info</a>. Tell us about cool work you’re involved with or point us in the direction of other work you’ve found helpful or interesting.</p>
<p>In case you missed our holiday, “Tools” special, complete with <em>“The Grinch Who Stole Data Science”</em>, give it a listen below.</p>
<p><center><iframe src="http://tunein.com/embed/player/t102877112/" style="width:100%;height:110px;" scrolling="no" frameborder="no"></iframe></center></p>
<p>Data-Driven Security, <a href="http://datadrivensecurity.info/amzn">The Book</a> (which started this whole thing) is back in stock at Amazon (and other online booksellers). We received some complaints about poor-quality versions shipping in November/December 2015 but Wiley <span class="amp">&</span> Amazon have rectified the issue. If you were a recipient of a non-color copy of the printed version, please contact Bob or Jay for information on how to get a color copy.</p>
<p>Both Jay <span class="amp">&</span> Bob will be hitting up various conferences this year and look forward to meeting listeners and readers.</p>

The Fallacy of Sample Size (2015-11-07, Jay Jacobs, @jayjacobs)

<p>There is a lot of misperception around sample sizes and the confusion
happens on both sides of the research. A common question when
researchers are starting out is, “<a href="http://stats.stackexchange.com/search?q=sample+size">How big should my sample size
be?</a>” To help with
that, there are handy calculators all over the Internet. But the more
troubling part of misunderstanding sample size happens when people
consume research and attempt to dismiss it claiming <a href="http://www.csoonline.com/article/2931839/data-breach/154-or-58-cents-whats-the-real-cost-of-a-breached-data-record.html">the sample size is
too
small</a>.
To make matters worse, we are in the age of big data where millions of
samples are the norm, and so seeing a study with “just” 500 samples
seems easy to dismiss. But the data just don’t work that way and I
wanted to provide some context around sample size and experimentation.</p>
<p><em>What’s a good sample size? How many samples should a study have?</em>
Unfortunately, the answer depends on how much confidence or accuracy the
research needs and the size of the effect being measured. Additionally,
these are generally balanced against the cost of additional data. It is
impossible to look at any sample size and determine if it’s
“statistically significant”. Let me repeat and rephrase that
differently: <em>You can never say a sample size is too small if you just
know the sample size.</em> And if the researcher is working with a
convenience sample (where they take all the data they can get), they
should include estimations of uncertainty in their inferences that
account for the sample size, even if the sample isn’t small.</p>
<h3 id="small-samples-can-easily-detect-large-differences">Small samples can easily detect large differences</h3>
<p>Another way to say this, is that as the experimenter increases the
number of samples, they are able to detect smaller and smaller
differences. If an experimenter is looking at two things that are vastly
different (such as perhaps opinions between “experts” and non-experts),
the large difference should be obvious even with a small sample.
However, if the experimenter is trying to compare two samples that are
very similar (yet still different), it may take a larger sample to find
that difference. These are factored into sample size calculations. As a
thought experiment, imagine flipping a novelty coin that produced heads
90% of the time. How many flips would it take before you (even
intuitively) raised an eyebrow on the difference between heads and
tails? It’d be weird (that’s a technical term) if you flipped a coin ten
times and only got one tails. Maybe you wouldn’t make any claims about
the coin after ten flips, but as you continue to flip the coin, your
confidence to say something is wrong would increase, right? And with a
hugely unfair coin (that flips heads 90% of the time), it wouldn’t take
too many flips before you are convinced. Sometimes, just a handful of
samples is still enough to detect a difference.</p>
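That intuition can be made precise with a quick binomial tail calculation. This is a Python sketch (not from the original post): under a genuinely fair coin, how surprising is seeing at least nine heads in ten flips?

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability a *fair* coin gives 9 or more heads in 10 flips
p_value = tail_prob(10, 9)
print(round(p_value, 4))  # 0.0107 -- already eyebrow-raising after ten flips
```

About a 1% chance, which is why even a handful of flips of a hugely unfair coin is enough to make you suspicious.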
<h3 id="samples-size-dictates-the-amount-of-confidence-in-an-estimate">Sample size dictates the amount of confidence in an estimate</h3>
<p>Let’s continue the coin flip thought experiment and say we don’t want
to test whether it’s fair or not (we know it’s not). Instead, we want to
estimate the probability of flipping a heads with this coin. Let’s say
we flip it 10 times and get 9 heads, can we say the probability is 90%?
Perhaps, but it’d be reckless. Because with <a href="http://www.danielsoper.com/statcalc3/calc.aspx?id=85">a little
math</a>, we find
that the actual probability of getting a heads could be anywhere between
55% and 99% given 9 heads out of 10 flips. If we doubled that to 20
flips and got 18 heads, we could still only say the range is
66% to 99%. We could even run a simulation and make a picture of what
the number of flips does to the confidence we have in the estimate (with
90% probability of heads).</p>
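The “little math” behind intervals like that 55%–99% one can be approximated in a few lines. This Python sketch uses the Wilson score interval (an approximation; the exact Clopper-Pearson interval that calculators like the one linked above use gives somewhat wider bounds):

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion (95% default)."""
    phat = successes / n
    denom = 1 + z**2 / n
    center = phat + z**2 / (2 * n)
    margin = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_interval(9, 10)
print(f"{lo:.0%} to {hi:.0%}")  # roughly 60% to 98%
```

Doubling to 18 heads out of 20 flips narrows the interval, but not by as much as you might hope.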
<p><img alt="Sample size and confidence interval for an unfair coin" src="/blog/images/2015/11/confidence-sample-size.png" /></p>
<p>Look at the left side of that plot, look how our confidence increases
rapidly as we add a few more samples. Then look at the rate of
improvement between 100 and 200 samples. Statisticians refer to the
amount of confidence in an experiment as the
“<a href="http://www.statmethods.net/stats/power.html">power</a>” of that
experiment. Power is defined (in simple terms) as the “<a href="http://effectsizefaq.com/2010/05/31/what-is-statistical-power/">likelihood that
a study will detect an effect when there is an effect there to be
detected</a>.”</p>
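That definition can be turned into a small calculation. This Python sketch (an illustrative choice, not from the post: a one-sided binomial test at α = 0.05) asks how often n flips of the 90%-heads coin would let us reject the “fair coin” hypothesis:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power_of_flips(n, true_p=0.9, alpha=0.05):
    """Power of a one-sided test for 'more heads than a fair coin would give'."""
    # smallest head count a fair coin reaches with probability <= alpha
    k_crit = next(k for k in range(n + 1) if binom_tail(n, k, 0.5) <= alpha)
    # chance the unfair coin reaches that threshold
    return binom_tail(n, k_crit, true_p)

print(round(power_of_flips(10), 3))  # 0.736 -- ten flips usually suffice
```

With a coin that extreme, ten flips already detect the effect about three times out of four; more flips push the power toward 1.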
<h3 id="nobody-turns-down-more-data">Nobody turns down more data</h3>
<p>Okay, there are cases where someone would turn down more data, but my
point here is that sample size is never limited by a casual decision.
Collecting data has real costs associated with it. There are either
direct costs (such as paying participants, salaries, etc.) or indirect
costs of time and effort to gather and clean the data. At some point, it
becomes infeasible (perhaps even impossible) to get more data. The cost
of that data must be balanced with the benefit of more data. But keep in
mind that the benefit of getting more data isn’t linear. To reduce the
uncertainty (confidence interval) by half, the sample size must
quadruple. So, if you collect 30 samples you can double your precision
by adding 90 more samples, but if you are at 500 samples, you’d have to
collect and clean 1,500 more samples to have the same proportional
benefit in the effect.</p>
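The quadrupling rule falls out of the fact that a confidence interval’s width shrinks like 1/√n. A quick check, as a Python sketch using the normal-approximation margin of error:

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% normal-approximation CI for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# Quadrupling the sample size halves the margin of error
print(round(margin_of_error(30) / margin_of_error(120), 6))    # 2.0
print(round(margin_of_error(500) / margin_of_error(2000), 6))  # 2.0
```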
<h3 id="some-points-of-reference">Some points of reference</h3>
<ul>
<li><span class="caps">R.A.</span> Fisher, who developed the <a href="https://en.wikipedia.org/wiki/The_Design_of_Experiments">design of
experiments</a>
and whose techniques are used in almost every modern experiment,
designed his famous “<a href="https://en.wikipedia.org/wiki/Lady_tasting_tea">Lady Tasting
Tea</a>” experiment
with just 8 cups of tea.</li>
<li>Anyone who’s researched risk analysis has undoubtedly come across
<a href="http://www.simplypsychology.org/loftus-palmer.html">Kahneman and Tversky’s Prospect
theory</a>. Their
initial study
<a href="http://www.princeton.edu/~kahneman/docs/Publications/prospect_theory.pdf">pdf</a>
had a sample size of 95 students.</li>
<li>Ivan Pavlov had 40 dogs (“Pavlov’s Dogs”) from which he developed
his <a href="https://en.wikipedia.org/wiki/Classical_conditioning">Classical
Conditioning</a> work.</li>
<li><a href="https://en.wikipedia.org/wiki/Asch_conformity_experiments">Asch’s conformity
experiments</a>,
influential research on social and peer pressure, used 50 subjects.</li>
</ul>

Getting into the zone(s) with R + jsonlite (2015-10-07, Bob Rudis, @hrbrmstr)

<p>We have some <em>strange</em> data in cybersecurity. One of the (<span class="caps">IMO</span>) stranger data files is a Domain Name System (<span class="caps">DNS</span>) <a href="https://en.wikipedia.org/wiki/Zone_file">zone file</a>. This file contains mappings between domain names and <span class="caps">IP</span> addresses (and other things) represented by “resource records”.</p>
<p>Here’s an example for the dummy/example domain <code>example.com</code>:</p>
<div class="highlight"><pre>$ORIGIN example.com. ; designates the start of this zone file in the namespace
$TTL 1h ; default expiration time of all resource records without their own TTL value
example.com. IN SOA ns.example.com. username.example.com. ( 2007120710 1d 2h 4w 1h )
example.com. IN NS ns ; ns.example.com is a nameserver for example.com
example.com. IN NS ns.somewhere.example. ; ns.somewhere.example is a backup nameserver for example.com
example.com. IN MX 10 mail.example.com. ; mail.example.com is the mailserver for example.com
@ IN MX 20 mail2.example.com. ; equivalent to above line, "@" represents zone origin
@ IN MX 50 mail3 ; equivalent to above line, but using a relative host name
example.com. IN A 192.0.2.1 ; IPv4 address for example.com
IN AAAA 2001:db8:10::1 ; IPv6 address for example.com
ns IN A 192.0.2.2 ; IPv4 address for ns.example.com
IN AAAA 2001:db8:10::2 ; IPv6 address for ns.example.com
www IN CNAME example.com. ; www.example.com is an alias for example.com
wwwtest IN CNAME www ; wwwtest.example.com is another alias for www.example.com
mail IN A 192.0.2.3 ; IPv4 address for mail.example.com
mail2 IN A 192.0.2.4 ; IPv4 address for mail2.example.com
mail3 IN A 192.0.2.5 ; IPv4 address for mail3.example.com
</pre></div>
<p>(that came from the Wikipedia link above).</p>
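Before reaching for a full parser, it helps to see how little structure a single resource record has. Here’s a toy Python regex sketch that handles only simple MX lines (it ignores comments, TTLs, and multi-line records, which is exactly why a real parser like the dns-zonefile module is worth shimming in):

```python
import re

zone_snippet = """\
example.com. IN MX 10 mail.example.com.
@ IN MX 20 mail2.example.com.
@ IN MX 50 mail3
"""

# owner name, class (IN), type (MX), preference, exchange host
mx_re = re.compile(r"^(\S+)\s+IN\s+MX\s+(\d+)\s+(\S+)", re.MULTILINE)

records = [{"name": name, "preference": int(pref), "host": host}
           for name, pref, host in mx_re.findall(zone_snippet)]
print(records[0])  # {'name': 'example.com.', 'preference': 10, 'host': 'mail.example.com.'}
```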
<p><span class="caps">DNS</span> is a hierarchical, distributed service and companies reel in the Benjamins by parsing these files from the <a href="https://en.wikipedia.org/wiki/Top-level_domain">top level domains</a> (TLDs) and providing data in a more structured format. Some also capture passive <span class="caps">DNS</span> data (i.e. data obtained from the queries to—usually—large-scale <span class="caps">DNS</span> server deployments) and integrate it into the massive data set.</p>
<p>The <span class="caps">TLD</span> zones are really what make the internet “go”. They provide pointers to everything below them so the entire system knows where to route requests. Monitoring these <span class="caps">TLD</span> zone files for changes can reveal many things both operationally benign and malicious. Thankfully, you can get access to some of the (now <em>hundreds</em> of) <span class="caps">TLD</span> zones by filling out a form over <a href="https://czds.icann.org/">at <span class="caps">ICANN</span></a>. You won’t get approval for all of the <span class="caps">TLD</span> zone files and you’ll need to go to other sites to try to get the big guns like <code>.com</code>, <code>.net</code> <span class="amp">&</span> <code>.org</code>.</p>
<p>Once you have a zone file you need to be able to do something with it. R did not have a zone file parser, but <a href="https://github.com/hrbrmstr/zoneparser">now it does</a> thanks to the <a href="https://cran.rstudio.com/web/packages/V8/index.html">V8 package</a> and a modified version of the Node.js <a href="https://github.com/elgs/dns-zonefile">dns-zonefile module</a>.</p>
<h3 id="why-v8">Why V8?</h3>
<p>I had a dual purpose for this post. One was to introduce the <code>zoneparser</code> package, but the other was to show how you can add missing functionality to R with V8. Shimming JavaScript (or even Java or other languages for that matter) won’t necessarily get you the bare-metal performance of implementing something in R or Rcpp, but it <em>will</em> get you functional <em>quickly</em> and you can focus on Getting Things Done now and performance later. This recently happened with the package <a href="https://github.com/hrbrmstr/humanparser"><code>humanparser</code></a> that I wrote to answer a question on Stack Overflow. It’s based on a Node.js module of the same name and is written using V8. Oliver Keyes spun that into the Rcpp-backed <a href="https://github.com/hrbrmstr/humaniformat"><code>humaniformat</code></a> package (and added some functionality) that is <em>much</em> faster.</p>
<p>For these <span class="caps">TLD</span> zone files, I only need to process them once a day and there aren’t thousands or tens of thousands of them. Rather than code up a parser in R or munge some existing C/C++ domain parser code into an R package, all I had to do was this:</p>
<div class="highlight"><pre><span class="c1">#' Parse a Domain Name System (<span class="caps">DNS</span>) zone file</span>
<span class="c1">#'</span>
<span class="c1">#' @param path path to <span class="caps">DNS</span> zone file to parse</span>
<span class="c1">#' @return \code{list} with <span class="caps">DNS</span> zone parsed</span>
<span class="c1">#' @export</span>
<span class="c1">#' @examples</span>
<span class="c1">#' parse_zone(system.file("zones/20151001-wtf-zone-data.txt", package="zoneparser"))</span>
parse_zone <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>path<span class="p">)</span> <span class="p">{</span>
  ct<span class="o">$</span>call<span class="p">(</span><span class="s">"zonefile.parse"</span><span class="p">,</span> paste<span class="p">(</span>readLines<span class="p">(</span>path<span class="p">),</span> collapse<span class="o">=</span><span class="s">"\n"</span><span class="p">))</span>
<span class="p">}</span>

.onAttach <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>libname<span class="p">,</span> pkgname<span class="p">)</span> <span class="p">{</span>
  ct <span class="o"><<-</span> V8<span class="o">::</span>new_context<span class="p">()</span>
  ct<span class="o">$</span>source<span class="p">(</span>system.file<span class="p">(</span><span class="s">"js/zoneparser.js"</span><span class="p">,</span> package<span class="o">=</span><span class="s">"zoneparser"</span><span class="p">))</span>
<span class="p">}</span>
</pre></div>
<p>Those are the only two functions in the package. The <code>.onAttach</code> sets up a V8 JavaScript context for <code>parse_zone</code> to use and loads the slightly modified <code>zoneparser.js</code> <a href="https://cran.rstudio.org/web/packages/V8/vignettes/npm.html">browserified</a> JavaScript file which makes the function <code>zonefile.parse()</code> available to the context.</p>
<p>The <code>parse_zone</code> function takes in a file path to a zone file and returns a parsed structure. And, it’s as easy to use as:</p>
<div class="highlight"><pre>library<span class="p">(</span>zoneparser<span class="p">)</span>
example <span class="o"><-</span> parse_zone<span class="p">(</span><span class="s">"example-tld.txt"</span><span class="p">)</span>
<span class="c1"># see all the resource records types that were parsed</span>
<span class="p">(</span>names<span class="p">(</span>example<span class="p">))</span>
</pre></div>
<div class="highlight"><pre><span class="c">## [1] "$origin" "$ttl" "soa" "ns" "mx" "a" "aaaa" </span>
<span class="c">## [8] "cname"</span>
</pre></div>
<div class="highlight"><pre><span class="c1"># look at the mail exchangers</span>
<span class="p">(</span>example<span class="o">$</span>mx<span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span class="c">## name preference host</span>
<span class="c">## 1 example.com. 10 mail.example.com.</span>
<span class="c">## 2 @ 20 mail2.example.com.</span>
<span class="c">## 3 @ 50 mail3</span>
</pre></div>
<p>Those can be easily exported into a database or structured plain text files for further data science-y processing.</p>
<h3 id="fin">Fin</h3>
<p>As of this post, there are ~198,000 Node.js modules out there and tons of browser-oriented JavaScript libraries. Many of these can be easily made to work in V8 (some cannot due to lack of functionality in the V8 engine).</p>
<p>If you have a “plumbing” task missing from R that needs implementing, try the V8/JavaScript route first since it took me less than 10 minutes to code up that package (I tweaked documentation, etc. afterwards, though). You don’t want to be three days into an Rcpp implementation when you “could have just used V8”!</p>
<p><center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/PUPdW3ba6F4" frameborder="0" allowfullscreen>
</iframe>
</center></p>

Modern Honey Network Machinations with R, Python, phantomjs, HTML & JavaScript (2015-08-23, Bob Rudis, @hrbrmstr)

<p>This was (initially) going to be a blog post announcing the new <a href="http://github.com/hrbrmstr/mhn">mhn R package</a> (more on what that is in a bit) but somewhere along the way we ended up taking a left turn at Albuquerque (as we often do here at ddsec hq) and had an adventure in a twisty maze of <a href="http://threatstream.github.io/mhn/">Modern Honey Network</a> passages that we thought we’d relate to everyone.</p>
<h3 id="episode-0-the-quest">Episode 0 : The Quest!</h3>
<p>We find our <strike>intrepid heroes</strike> data scientists finally getting around to playing with the Modern Honey Network (<span class="caps">MHN</span>) software that they promised <a href="https://twitter.com/jason_trost">Jason Trost</a> they’d do <em>ages</em> ago. <span class="caps">MHN</span> makes it easy to [freely] centrally setup, control, monitor and collect data from one or more <a href="https://en.wikipedia.org/wiki/Honeypot_(computing)">honeypots</a>. Once you have this data you can generate threat indicator feeds from it and also do analysis on it (which is what we’re interested in eventually doing and what <a href="https://www.threatstream.com/">ThreatStream</a> <em>does do</em> with their global network of <span class="caps">MHN</span> contributors).</p>
<p>Jason has a <a href="https://www.vagrantup.com/">Vagrant</a> <a href="https://github.com/threatstream/mhn/wiki/Getting-up-and-running-using-Vagrant">quickstart</a> version of <span class="caps">MHN</span> which lets you kick the tyres locally, safely and securely before venturing out into the enterprise (or internet). You stand up the server (mostly Python-y things), then tell it what type of honeypot you want to deploy. You get a handy cut-and-paste-able string which you paste-and-execute on a system that will become an actual honeypot (which can be a “real” box, a <span class="caps">VM</span> or even a RaspberryPi!). When the honeypot is finished installing the necessary components it registers with your <span class="caps">MHN</span> server and you’re ready to start catching cyber bad guys.</p>
<p><center><img src="https://farm8.staticflickr.com/7035/6437570877_cf5b1a35de_o_d.jpg"/><br/>(cyber bad guy)</center></p>
<h3 id="episode-1-live-r-package">Episode 1 : Live! R! Package!</h3>
<p>We decided to deploy a test <span class="caps">MHN</span> server and series of honeypots on <a href="https://www.digitalocean.com/?refcode=4bb3577c3b73">Digital Ocean</a> since they work <em><span class="caps">OK</span></em> on the smallest droplet size (not recommended for a production <span class="caps">MHN</span> setup).</p>
<p>While it’s great to peruse the incoming attacks:</p>
<p><center><a href="attacks.png"><img style="max-width:100%" src="http://datadrivensecurity.info/blog/images/2015/08/attacks.png"/></a></center></p>
<p>we wanted programmatic access to the data, so we took a look at all the <a href="https://github.com/threatstream/mhn/blob/master/server/mhn/api/views.py">routes in their <span class="caps">API</span></a> and threw together an <a href="https://github.com/hrbrmstr/mhn">R package</a> to let us work with it.</p>
<div class="highlight"><pre>library<span class="p">(</span>mhn<span class="p">)</span>
attacks <span class="o"><-</span> sessions<span class="p">(</span>hours_ago<span class="o">=</span><span class="m">24</span><span class="p">)</span><span class="o">$</span>data
tail<span class="p">(</span>attacks<span class="p">)</span>
<span class="c1">## _id destination_ip destination_port honeypot</span>
<span class="c1">## 3325 55d93cb8b5b9843e9bb34c75 111.222.33.111 22 p0f</span>
<span class="c1">## 3326 55d93cb8b5b9843e9bb34c74 111.222.33.111 22 p0f</span>
<span class="c1">## 3327 55d93d30b5b9843e9bb34c77 111.222.33.111 22 p0f</span>
<span class="c1">## 3328 55d93da9b5b9843e9bb34c79 <<span class="caps">NA</span>> 6379 dionaea</span>
<span class="c1">## 3329 55d93f1db5b9843e9bb34c7b <<span class="caps">NA</span>> 9200 dionaea</span>
<span class="c1">## 3330 55d94062b5b9843e9bb34c7d <<span class="caps">NA</span>> 23 dionaea</span>
<span class="c1">## identifier protocol source_ip source_port</span>
<span class="c1">## 3325 bf7a3c5e-48e7-11e5-9fcf-040166a73101 pcap 45.114.11.23 58621</span>
<span class="c1">## 3326 bf7a3c5e-48e7-11e5-9fcf-040166a73101 pcap 45.114.11.23 58621</span>
<span class="c1">## 3327 bf7a3c5e-48e7-11e5-9fcf-040166a73101 pcap 93.174.95.81 44784</span>
<span class="c1">## 3328 83e2f4e0-4876-11e5-9fcf-040166a73101 pcap 184.105.139.108 43000</span>
<span class="c1">## 3329 83e2f4e0-4876-11e5-9fcf-040166a73101 pcap 222.186.34.160 6000</span>
<span class="c1">## 3330 83e2f4e0-4876-11e5-9fcf-040166a73101 pcap 113.89.184.24 44028</span>
<span class="c1">## timestamp</span>
<span class="c1">## 3325 2015-08-23T03:23:34.671000</span>
<span class="c1">## 3326 2015-08-23T03:23:34.681000</span>
<span class="c1">## 3327 2015-08-23T03:25:33.975000</span>
<span class="c1">## 3328 2015-08-23T03:27:36.810000</span>
<span class="c1">## 3329 2015-08-23T03:33:48.665000</span>
<span class="c1">## 3330 2015-08-23T03:39:13.899000</span>
</pre></div>
<p><span class="caps">NOTE</span>: that’s not the real <code>destination_ip</code> so don’t go poking since it’s probably someone else’s real system (if it’s even up).</p>
<p>You can also get details about the attackers (this is just one example):</p>
<div class="highlight"><pre>attacker_stats<span class="p">(</span><span class="s">"45.114.11.23"</span><span class="p">)</span><span class="o">$</span>data
<span class="c1">## $count</span>
<span class="c1">## [1] 1861</span>
<span class="c1">## </span>
<span class="c1">## $first_seen</span>
<span class="c1">## [1] "2015-08-22T16:43:59.654000"</span>
<span class="c1">## </span>
<span class="c1">## $honeypots</span>
<span class="c1">## [1] "p0f"</span>
<span class="c1">## </span>
<span class="c1">## $last_seen</span>
<span class="c1">## [1] "2015-08-23T03:23:34.681000"</span>
<span class="c1">## </span>
<span class="c1">## $num_sensors</span>
<span class="c1">## [1] 1</span>
<span class="c1">## </span>
<span class="c1">## $ports</span>
<span class="c1">## [1] 22</span>
</pre></div>
<p>The package makes it really easy (<span class="caps">OK</span>, we’re probably a <em>bit</em> biased) to grab giant chunks of time series and associated metadata for further analysis.</p>
<p>While cranking out the <span class="caps">API</span> package we noticed that there were no endpoints for the <span class="caps">MHN</span> HoneyMap. <em>Yes</em>, they do the “attacks on a map” thing but don’t think too badly of them since most of you seem to want them.</p>
<p><center><a href="map.png"><img style="max-width:100%" src="http://datadrivensecurity.info/blog/images/2015/08/map.png"/></a></center></p>
<p>After poking around the <span class="caps">MHN</span> source a bit more (and navigating the <code>view-source</code> of the map page) we discovered that they use a <a href="https://github.com/threatstream/mhn/blob/master/scripts/install_honeymap.sh">Go-based websocket server</a> to push the honeypot hits out to the map. (You can probably see where this is going, but it takes that turn first).</p>
<h3 id="episode-2-hacking-the-anti-hackers">Episode 2 : Hacking the Anti-Hackers</h3>
<p>The <em>other</em> thing we noticed is that—unlike the <span class="caps">MHN</span>-server proper—the websocket component <em>does not require authentication</em>. Now, to be fair, it’s also not really spitting out seekrit data, just (pretty useless) geocoded attack source/dest and type of honeypot involved.</p>
<p>Still, this got us wondering if we could find other <span class="caps">MHN</span> servers out there in the cold, dark internet. So, we fired up RStudio again and took a look using the <a href="http://github.com/hrbrmstr/shodan">shodan package</a>:</p>
<div class="highlight"><pre>library<span class="p">(</span>shodan<span class="p">)</span>
<span class="c1"># the most obvious way to look for <span class="caps">MHN</span> servers is to </span>
<span class="c1"># scour port 3000 looking for content that is <span class="caps">HTML</span></span>
<span class="c1"># then look for "HoneyMap" in the <title></span>
<span class="c1"># See how many (if any) there are</span>
host_count<span class="p">(</span><span class="s">'port:3000 title:HoneyMap'</span><span class="p">)</span><span class="o">$</span>total
<span class="c1">## [1] 141</span>
<span class="c1"># Grab the first 100</span>
hm_1 <span class="o"><-</span> shodan_search<span class="p">(</span><span class="s">'port:3000 title:HoneyMap'</span><span class="p">)</span>
<span class="c1"># Grab the last 41</span>
hm_2 <span class="o"><-</span> shodan_search<span class="p">(</span><span class="s">'port:3000 title:HoneyMap'</span><span class="p">,</span> page<span class="o">=</span><span class="m">2</span><span class="p">)</span>
head<span class="p">(</span>hm_1<span class="p">)</span>
<span class="c1">## hostnames title</span>
<span class="c1">## 1 HoneyMap</span>
<span class="c1">## 2 hb.c2hosting.com HoneyMap</span>
<span class="c1">## 3 HoneyMap</span>
<span class="c1">## 4 fxxx.you HoneyMap</span>
<span class="c1">## 5 ip-192-169-234-171.ip.secureserver.net HoneyMap</span>
<span class="c1">## 6 ec2-54-148-80-241.us-west-2.compute.amazonaws.com HoneyMap</span>
<span class="c1">## timestamp isp transport</span>
<span class="c1">## 1 2015-08-22T17:14:25.173291 <<span class="caps">NA</span>> tcp</span>
<span class="c1">## 2 2015-08-22T17:00:12.872171 Hosting Consulting tcp</span>
<span class="c1">## 3 2015-08-22T16:49:40.392523 Digital Ocean tcp</span>
<span class="c1">## 4 2015-08-22T15:27:29.661104 <span class="caps">KW</span> Datacenter tcp</span>
<span class="c1">## 5 2015-08-22T14:01:21.014893 GoDaddy.com, <span class="caps">LLC</span> tcp</span>
<span class="c1">## 6 2015-08-22T12:01:52.207879 Amazon tcp</span>
<span class="c1">## data</span>
<span class="c1">## 1 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Sun, 02 Nov 2014 21:16:17 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 17:14:22 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 2 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Wed, 12 Nov 2014 18:52:21 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 17:01:25 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 3 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Mon, 04 Aug 2014 18:07:00 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 16:49:38 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 4 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nDate: Sat, 22 Aug 2015 15:22:23 <span class="caps">GMT</span>\r\nLast-Modified: Sun, 27 Jul 2014 01:04:41 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 5 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 2278\r\nContent-Type: text/html; charset=utf-8\r\nLast-Modified: Wed, 29 Oct 2014 17:12:22 <span class="caps">GMT</span>\r\nDate: Sat, 22 Aug 2015 14:01:20 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## 6 <span class="caps">HTTP</span>/1.1 200 <span class="caps">OK</span>\r\nAccept-Ranges: bytes\r\nContent-Length: 1572\r\nContent-Type: text/html; charset=utf-8\r\nDate: Sat, 22 Aug 2015 12:06:15 <span class="caps">GMT</span>\r\nLast-Modified: Mon, 08 Dec 2014 21:25:26 <span class="caps">GMT</span>\r\n\r\n</span>
<span class="c1">## port location.city location.region_code location.area_code location.longitude</span>
<span class="c1">## 1 3000 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span> <span class="caps">NA</span></span>
<span class="c1">## 2 3000 Miami Beach <span class="caps">FL</span> 305 -80.1300</span>
<span class="c1">## 3 3000 San Francisco <span class="caps">CA</span> 415 -122.3826</span>
<span class="c1">## 4 3000 Kitchener <span class="caps">ON</span> <span class="caps">NA</span> -80.4800</span>
<span class="c1">## 5 3000 Scottsdale <span class="caps">AZ</span> 480 -111.8906</span>
<span class="c1">## 6 3000 Boardman <span class="caps">OR</span> 541 -119.5290</span>
<span class="c1">## location.country_code3 location.latitude location.postal_code location.dma_code</span>
<span class="c1">## 1 <<span class="caps">NA</span>> <span class="caps">NA</span> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 2 <span class="caps">USA</span> 25.7906 33109 528</span>
<span class="c1">## 3 <span class="caps">USA</span> 37.7312 94124 807</span>
<span class="c1">## 4 <span class="caps">CAN</span> 43.4236 <span class="caps">N2E</span> <span class="caps">NA</span></span>
<span class="c1">## 5 <span class="caps">USA</span> 33.6119 85260 753</span>
<span class="c1">## 6 <span class="caps">USA</span> 45.7788 97818 810</span>
<span class="c1">## location.country_code location.country_name ipv6</span>
<span class="c1">## 1 <<span class="caps">NA</span>> <<span class="caps">NA</span>> 2600:3c02::f03c:91ff:fe73:4d8b</span>
<span class="c1">## 2 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## 3 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## 4 <span class="caps">CA</span> Canada <<span class="caps">NA</span>></span>
<span class="c1">## 5 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## 6 <span class="caps">US</span> United States <<span class="caps">NA</span>></span>
<span class="c1">## domains org os module ip_str</span>
<span class="c1">## 1 <<span class="caps">NA</span>> <<span class="caps">NA</span>> http 2600:3c02::f03c:91ff:fe73:4d8b</span>
<span class="c1">## 2 c2hosting.com Hosting Consulting <<span class="caps">NA</span>> http 199.88.60.245</span>
<span class="c1">## 3 Digital Ocean <<span class="caps">NA</span>> http 104.131.142.171</span>
<span class="c1">## 4 fxxx.you <span class="caps">KW</span> Datacenter <<span class="caps">NA</span>> http 162.244.29.65</span>
<span class="c1">## 5 secureserver.net GoDaddy.com, <span class="caps">LLC</span> <<span class="caps">NA</span>> http 192.169.234.171</span>
<span class="c1">## 6 amazonaws.com Amazon <<span class="caps">NA</span>> http 54.148.80.241</span>
<span class="c1">## ip asn link uptime</span>
<span class="c1">## 1 <span class="caps">NA</span> <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 2 3344448757 <span class="caps">AS40539</span> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 3 1753452203 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 4 2733907265 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 5 3232361131 <span class="caps">AS26496</span> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
<span class="c1">## 6 915689713 <<span class="caps">NA</span>> <<span class="caps">NA</span>> <span class="caps">NA</span></span>
</pre></div>
<p>Yikes! 141 servers just on the default port (3000) alone! While these systems may be shown as existing in Shodan, we really needed to confirm that they were, indeed, live <span class="caps">MHN</span> HoneyMap [websocket] servers. </p>
<h3 id="episode-3-picture-imperfect">Episode 3 : Picture [Im]Perfect</h3>
<p>Rather than just test for existence of the websocket/data feed we decided to take a screen shot of every server, which is pretty easy to do with a crude-but-effective mashup of R and <code>phantomjs</code>. For this, we made a script which is just a call—for each of the websocket URLs—to the “built-in” phantomjs <a href="https://gist.github.com/hrbrmstr/6b119648739cd275a69e#file-ourrasterize-js-L45">rasterize.js script</a> that we’ve slightly modified to wait 30 seconds from page open to snapshot creation. We did that in the hopes that we’d see live attacks in the captures.</p>
<div class="highlight"><pre>cat<span class="p">(</span>sprintf<span class="p">(</span><span class="s">"phantomjs rasterize.js http://%s:%s %s.png 800px*600px\n"</span><span class="p">,</span>
hm_1<span class="o">$</span>matches<span class="o">$</span>ip_str<span class="p">,</span>
hm_1<span class="o">$</span>matches<span class="o">$</span>port<span class="p">,</span>
hm_1<span class="o">$</span>matches<span class="o">$</span>ip_str<span class="p">),</span> file<span class="o">=</span><span class="s">"capture.sh"</span><span class="p">)</span>
</pre></div>
<p>That makes <code>capture.sh</code> look something like:</p>
<div class="highlight"><pre>phantomjs rasterize.js http://199.88.60.245:3000 199.88.60.245.png 800px*600px
phantomjs rasterize.js http://104.131.142.171:3000 104.131.142.171.png 800px*600px
phantomjs rasterize.js http://162.244.29.65:3000 162.244.29.65.png 800px*600px
phantomjs rasterize.js http://192.169.234.171:3000 192.169.234.171.png 800px*600px
phantomjs rasterize.js http://54.148.80.241:3000 54.148.80.241.png 800px*600px
phantomjs rasterize.js http://95.97.211.86:3000 95.97.211.86.png 800px*600px
</pre></div>
<p>Yes, there <em>are</em> far more elegant ways to do this, but the number of URLs was small and we had no time constraints. We could have used a
pure phantomjs solution (list of URLs in phantomjs JavaScript) or used
<span class="caps">GNU</span> parallel to speed up the image captures as well.</p>
<p>Sifting through ~140 images manually to see if any had “hits” would not have been <em>too</em> bad, but a glance at the directory listing showed that many had the exact same size, meaning those were probably showing a default/blank map. We <code>uniq</code>’d them by <span class="caps">MD5</span> hash and made an image gallery of them:</p>
<p><center>
<iframe style="max-width:100%"
src="/iframes/mhn.html"
sandbox="allow-same-origin
allow-scripts" width="100%"
height="500"
scrolling="no"
seamless="seamless"
frameBorder="0"></iframe>
</center></p>
<p>It was interesting to see Mexico <span class="caps">CERT</span> and OpenDNS in the mix.</p>
<p>Most of the 141 were active/live <span class="caps">MHN</span> HoneyMap sites. We can only imagine what a full Shodan search for HoneyMaps on other ports would come back with (mostly since we only have the basic <span class="caps">API</span> access and don’t want to burn the credits).</p>
<h3 id="episode-3-with-meh-data-comes-great-irresponsibility">Episode 4 : With “Meh” Data Comes Great Irresponsibility</h3>
<p>For those who may not have been with DDSec for its entirety, you may not be aware that we have our <em>own</em> <a href="http://ocularwarfare.com/ipew/">attack map</a> (<a href="https://github.com/hrbrmstr/pewpew">github</a>).</p>
<p>We thought it would be interesting to see if we could mash up <span class="caps">MHN</span> HoneyMap data with our creation. We first had to see what the websocket returned. Here’s a bit of Python to do that (the R <code>websockets</code> package was abandoned by its creator, but keep an eye out for another @hrbrmstr resurrection):</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">websocket</span>
<span class="kn">import</span> <span class="nn">thread</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="k">def</span> <span class="nf">on_message</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">message</span><span class="p">):</span>
    <span class="k">print</span> <span class="n">message</span>
<span class="k">def</span> <span class="nf">on_error</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">error</span><span class="p">):</span>
    <span class="k">print</span> <span class="n">error</span>
<span class="k">def</span> <span class="nf">on_close</span><span class="p">(</span><span class="n">ws</span><span class="p">):</span>
    <span class="k">print</span> <span class="s">"### closed ###"</span>
<span class="n">websocket</span><span class="o">.</span><span class="n">enableTrace</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ws</span> <span class="o">=</span> <span class="n">websocket</span><span class="o">.</span><span class="n">WebSocketApp</span><span class="p">(</span><span class="s">"ws://128.199.121.95:3000/data/websocket"</span><span class="p">,</span>
<span class="n">on_message</span> <span class="o">=</span> <span class="n">on_message</span><span class="p">,</span>
<span class="n">on_error</span> <span class="o">=</span> <span class="n">on_error</span><span class="p">,</span>
<span class="n">on_close</span> <span class="o">=</span> <span class="n">on_close</span><span class="p">)</span>
<span class="n">ws</span><span class="o">.</span><span class="n">run_forever</span><span class="p">()</span>
</pre></div>
<p>That particular server is <em>very</em> active, which is why we chose to use it.</p>
<p>The output should look something like:</p>
<div class="highlight"><pre><span class="nv">$ </span>python ws.py
--- request header ---
<span class="caps">GET</span> /data/websocket <span class="caps">HTTP</span>/1.1
Upgrade: websocket
Connection: Upgrade
Host: 128.199.121.95:3000
Origin: http://128.199.121.95:3000
Sec-WebSocket-Key: <span class="nv">07EFbUtTS4ubl2mmHS1ntQ</span><span class="o">==</span>
Sec-WebSocket-Version: 13
-----------------------
--- response header ---
<span class="caps">HTTP</span>/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: nvTKSyCh+k1Rl5HzxkVNAZjZZUA<span class="o">=</span>
-----------------------
<span class="o">{</span><span class="s2">"city"</span>:<span class="s2">"Clarks Summit"</span>,<span class="s2">"city2"</span>:<span class="s2">"San Francisco"</span>,<span class="s2">"countrycode"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"countrycode2"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"latitude"</span>:41.44860076904297,<span class="s2">"latitude2"</span>:37.774898529052734,<span class="s2">"longitude"</span>:-75.72799682617188,<span class="s2">"longitude2"</span>:-122.41940307617188,<span class="s2">"type"</span>:<span class="s2">"p0f.events"</span><span class="o">}</span>
<span class="o">{</span><span class="s2">"city"</span>:<span class="s2">"Clarks Summit"</span>,<span class="s2">"city2"</span>:<span class="s2">"San Francisco"</span>,<span class="s2">"countrycode"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"countrycode2"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"latitude"</span>:41.44860076904297,<span class="s2">"latitude2"</span>:37.774898529052734,<span class="s2">"longitude"</span>:-75.72799682617188,<span class="s2">"longitude2"</span>:-122.41940307617188,<span class="s2">"type"</span>:<span class="s2">"p0f.events"</span><span class="o">}</span>
<span class="o">{</span><span class="s2">"city"</span>:null,<span class="s2">"city2"</span>:<span class="s2">"Singapore"</span>,<span class="s2">"countrycode"</span>:<span class="s2">"<span class="caps">US</span>"</span>,<span class="s2">"countrycode2"</span>:<span class="s2">"<span class="caps">SG</span>"</span>,<span class="s2">"latitude"</span>:32.78310012817383,<span class="s2">"latitude2"</span>:1.2930999994277954,<span class="s2">"longitude"</span>:-96.80670166015625,<span class="s2">"longitude2"</span>:103.85579681396484,<span class="s2">"type"</span>:<span class="s2">"p0f.events"</span><span class="o">}</span>
</pre></div>
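<p>Each line the socket pushes is a standalone <span class="caps">JSON</span> object, so turning one into a map-ready record takes nothing beyond the standard library. Here’s a sketch using one record copied from the stream above (the unsuffixed fields are the attack source, the <code>…2</code> fields the honeypot):</p>

```python
import json

# One record copied verbatim from the websocket stream shown above.
raw = ('{"city":"Clarks Summit","city2":"San Francisco",'
       '"countrycode":"US","countrycode2":"US",'
       '"latitude":41.44860076904297,"latitude2":37.774898529052734,'
       '"longitude":-75.72799682617188,"longitude2":-122.41940307617188,'
       '"type":"p0f.events"}')

rec = json.loads(raw)
src = (rec["latitude"], rec["longitude"])    # attack source geolocation
dst = (rec["latitude2"], rec["longitude2"])  # honeypot geolocation
```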
<p>Those are near-perfect <span class="caps">JSON</span> records for our map, so we figured out a way to tell iPew/PewPew (whatever folks are calling it these days) to take any accessible <span class="caps">MHN</span> HoneyMap as a live data source. For example, to plug this highly active HoneyMap into iPew all you need to do is <a href="http://ocularwarfare.com/ipew/?mhnsource=http://128.199.121.95:3000/data/">this</a>:</p>
<blockquote>
<p><code>http://ocularwarfare.com/ipew/?mhnsource=http://128.199.121.95:3000/data/</code></p>
</blockquote>
<p>Once we make the websockets component of the iPew map a bit more resilient we’ll post it to GitHub (you can just view the source to try it on your own now).</p>
<h3 id="fin">Fin</h3>
<p>As we stated up front, the main goal of this post is to introduce the <a href="http://github.com/hrbrmstr/mhn">mhn package</a>. But, our diversion has us curious. Are the open instances of HoneyMap deliberate or accidental? If any of them are “real” honeypot research or actual production environments, does such an open presence of the <span class="caps">MHN</span> controller reduce the utility of the honeypot nodes? Is Greenland paying ThreatStream to use that map projection instead of a better one?</p>
<p>If you use the new package, found this post helpful (or, at least, amusing) or know the answers to any of those questions, drop a note in the comments.</p>New R Package - domaintools (access the DomainTools.com WHOIS API)2015-08-09T15:11:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-08-09:posts/2015/Aug/new-r-package-domaintools/<p>We just did a <a href="https://github.com/hrbrmstr/domaintools">github release</a> for an R package that provides an interface to the <a href="http://www.domaintools.com/resources/api-documentation/">DomainTools <span class="caps">API</span></a>. It provides access to the core <span class="caps">API</span> functions that aren’t restricted (i.e. the ones we have access to):</p>
<ul>
<li><code>domaintools_api_key</code>: Get or set <code>DOMAINTOOLS_API_KEY</code> value</li>
<li><code>domaintools_username</code>: Get or set <code>DOMAINTOOLS_API_USERNAME</code> value</li>
<li><code>domain_profile</code>: Domain Profile</li>
<li><code>hosting_history</code>: Hosting History</li>
<li><code>parsed_whois</code>: Parsed Whois</li>
<li><code>reverse_ip</code>: Reverse <span class="caps">IP</span></li>
<li><code>reverse_ns</code>: Reverse Nameserver</li>
<li><code>shared_ips</code>: Shared IPs</li>
<li><code>whois</code>: Whois Lookup</li>
<li><code>whois_history</code>: Whois History</li>
</ul>
<p>Each function has a full description and sample call, so feel free to kick the tires and provide feedback on github.</p>
<p>If you have access to the <span class="caps">API</span> elements we do not, please either contribute a <span class="caps">PR</span> or help us out with some testing.</p>
<p>This is one more package on our path towards a complete set of “cybersecurity” R packages to help information security folk get their (hopefully) data-driven jobs done in R. I believe @<a href="twitter.com/quominus">quominus</a> <em>may</em> be working on a macro “whois” package to unify access to all the various <span class="caps">WHOIS</span> services, too.</p>The New and Improved R Shodan Package2015-08-07T11:30:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-08-07:posts/2015/Aug/the-new-and-improved-r-shodan-package/<p>For those not involved with all things “cyber”, let me start with a description of what <a href="http://shodan.io/">Shodan</a> is (though visiting the site is probably the best introduction to what secrets it holds).</p>
<p>Shodan is—at its core—a search engine. Unlike Google, Shodan indexes what I’ll call “cyber” metadata and content about everything accessible via a public <span class="caps">IP</span> address. This means things like</p>
<ul>
<li>routers, switches and cable/<span class="caps">DSL</span>/FiOS modems (which are the underpinnings of our internet access)</li>
<li>internet web, ftp, mail, etc servers</li>
<li>public (protected or otherwise) <span class="caps">CCTV</span> <span class="amp">&</span> home surveillance <span class="amp">&</span> web cameras</li>
<li>desktops, printers and other things that may end up in public <span class="caps">IP</span> space</li>
<li>gas station pumps and industrial control systems</li>
<li>VoIP phones <span class="amp">&</span> more</li>
</ul>
<p>Shodan contacts the <span class="caps">IP</span> addresses associated with all the devices, sees what <a href="https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers">ports</a> and <a href="https://en.wikipedia.org/wiki/Internet_Protocol">protocols</a> might be in use and then tries to retrieve content from those ports and protocols (which could be anything from webcam snapshots to web server <span class="caps">HTML</span> to actual header responses from internet servers to banners from routers and switches). It indexes all that metadata and content and makes it available in a search engine and <span class="caps">API</span> for security researchers (I was <em>so</em> tempted to put that word in quotes).</p>
<p>To give you an idea what it can do, take a look at <a href="https://www.shodan.io/search?query=Server%3A+SQ-WEBCAM">this query for webcams</a> and/or read this <a href="http://null-byte.wonderhowto.com/how-to/hack-like-pro-find-vulnerable-webcams-across-globe-using-shodan-0154830/">full explanation of what you can do with that data</a>.</p>
<p>While you can have fun with Shodan, it does have real value to security folk and R needed a real <span class="caps">API</span> interface to it (I did a half-hearted one a couple years ago). Hence the rebirth of the <a href="https://github.com/hrbrmstr/shodan">shodan package</a>.</p>
<p>The package is brand-new, but it has basic, full coverage of the <a href="https://developer.shodan.io/api">Shodan <span class="caps">API</span></a> <em>except</em> for the streaming functions. But, a line of code is worth a thousand blatherings, so let’s find all the <span class="caps">IIS</span> servers in Maine.</p>
<div class="highlight"><pre><span class="c1"># devtools::install_github("hrbrmstr/shodan")</span>
library<span class="p">(</span>shodan<span class="p">)</span>
<span class="c1"># perform the query for <span class="caps">IIS</span> servers in Maine</span>
maine_iis <span class="o"><-</span> shodan_search<span class="p">(</span><span class="s">"iis state:me"</span><span class="p">)</span>
<span class="c1"># get the total number of <span class="caps">IIS</span> servers in Maine that Shodan found</span>
print<span class="p">(</span>maine_iis<span class="o">$</span>total<span class="p">)</span>
<span class="c1">## [1] 2948</span>
<span class="c1"># how many did it return in this page of the query?</span>
print<span class="p">(</span>nrow<span class="p">(</span>maine_iis<span class="o">$</span>matches<span class="p">))</span>
<span class="c1">## [1] 100</span>
<span class="c1"># what else does it know about these servers?</span>
print<span class="p">(</span>colnames<span class="p">(</span>maine_iis<span class="o">$</span>matches<span class="p">))</span>
<span class="c1">## [1] "product" "hostnames" "version" "title" "ip" "org" </span>
<span class="c1">## [7] "isp" "cpe" "data" "asn" "port" "transport"</span>
<span class="c1">## [13] "timestamp" "domains" "ip_str" "os" "_shodan" "location" </span>
<span class="c1">## [19] "ssl" "link"</span>
</pre></div>
<p>Now, the data frame in <code>maine_iis$matches</code> is somewhat ugly for the moment. Some columns have lists and data frames since the Shodan <span class="caps">REST</span> <span class="caps">API</span> returns (like many APIs do) nested <span class="caps">JSON</span>. I’m actually looking for collaboration on what would be the most useful format for the returned data structures so hit me up if you have ideas that would benefit your use of it.</p>
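<p>One candidate shape for those nested columns, sketched in Python rather than R: recursively flatten each nested object into dotted column names, mirroring the <code>location.city</code>-style names already visible in the Shodan output earlier in this post. (This is an illustration of the idea, not code from the package.)</p>

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one level with dotted keys, e.g.
    {"location": {"city": ...}} becomes {"location.city": ...}."""
    out = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out
```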
<p>I’ll violate my own rule about mapping <span class="caps">IP</span> addresses just to show you Shodan also does geolocation for you (and, hey, y’all seem to like maps). We’ll make it a <em>bit</em> more useful and add some metadata about what it found to the location popups:</p>
<div class="highlight"><pre>library<span class="p">(</span>leaflet<span class="p">)</span>
library<span class="p">(</span>htmltools<span class="p">)</span>
for_map <span class="o"><-</span> cbind.data.frame<span class="p">(</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>location<span class="p">,</span>
ip<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>ip<span class="p">,</span>
isp<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>isp<span class="p">,</span>
title<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>title<span class="p">,</span>
org<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>org<span class="p">,</span>
data<span class="o">=</span>maine_iis<span class="o">$</span>matches<span class="o">$</span>data<span class="p">,</span>
stringsAsFactors<span class="o">=</span><span class="kc"><span class="caps">FALSE</span></span><span class="p">)</span>
leaflet<span class="p">(</span>for_map<span class="p">,</span> width<span class="o">=</span><span class="s">"600"</span><span class="p">,</span> height<span class="o">=</span><span class="s">"600"</span><span class="p">)</span> <span class="o">%>%</span>
addTiles<span class="p">()</span> <span class="o">%>%</span>
setView<span class="p">(</span><span class="m">-69.233328</span><span class="p">,</span> <span class="m">45.250556</span><span class="p">,</span> <span class="m">7</span><span class="p">)</span> <span class="o">%>%</span>
addCircles<span class="p">(</span>data<span class="o">=</span>for_map<span class="p">,</span> lng<span class="o">=~</span>longitude <span class="p">,</span> lat<span class="o">=~</span>latitude<span class="p">,</span>
popup<span class="o">=~</span>sprintf<span class="p">(</span><span class="s">"<b>%s</b><br/>%s, Maine</b><br/><span class="caps">ISP</span>: %s<br/><hr noshade size='1'/><pre>%s\n\n%s"</span><span class="p">,</span>
htmlEscape<span class="p">(</span>org<span class="p">),</span> htmlEscape<span class="p">(</span>city<span class="p">),</span> htmlEscape<span class="p">(</span>isp<span class="p">),</span>
htmlEscape<span class="p">(</span>title<span class="p">),</span> htmlEscape<span class="p">(</span>data<span class="p">)))</span>
</pre></div>
<p><center>
<b><span class="caps">IIS</span> Servers in Maine</b>
<iframe style="max-width:100%"
src="/widgets/2015-08-08-shodan-01.html"
sandbox="allow-same-origin
allow-scripts" width="600"
height="600"
scrolling="no"
seamless="seamless"
frameBorder="0"></iframe>
</center></p>
<p>Remember that’s only 100 of ~3,000 servers, but it should give you an idea of the types of data Shodan can return.</p>
<p>The package is <a href="https://github.com/hrbrmstr/shodan">up on github</a> for now, and here’s a list of functions it makes available:</p>
<ul>
<li><code>account_profile</code>: Account Profile</li>
<li><code>api_info</code>: <span class="caps">API</span> Plan Information</li>
<li><code>host_count</code>: Search Shodan without Results</li>
<li><code>host_info</code>: Host Information</li>
<li><code>my_ip</code>: My <span class="caps">IP</span> Address</li>
<li><code>query_tags</code>: List the most popular tags</li>
<li><code>resolve</code>: <span class="caps">DNS</span> Lookup</li>
<li><code>reverse</code>: Reverse <span class="caps">DNS</span> Lookup</li>
<li><code>shodan_api_key</code>: Get or set SHODAN_API_KEY value</li>
<li><code>shodan_exploit_search</code>: Search for Exploits</li>
<li><code>shodan_exploit_search_count</code>: Search for Exploits without Results</li>
<li><code>shodan_ports</code>: List all ports that Shodan is crawling on the Internet.</li>
<li><code>shodan_protocols</code>: List all protocols that can be used when performing on-demand Internet scans via Shodan.</li>
<li><code>shodan_query_list</code>: List the saved search queries</li>
<li><code>shodan_query_search</code>: Search the directory of saved search queries.</li>
<li><code>shodan_scan</code>: Request Shodan to crawl an <span class="caps">IP</span>/ netblock</li>
<li><code>shodan_scan_internet</code>: Crawl the Internet for a specific port and protocol using Shodan</li>
<li><code>shodan_search</code>: Search Shodan</li>
<li><code>shodan_search_tokens</code>: Break the search query into tokens</li>
<li><code>shodan_services</code>: List all services that Shodan crawls</li>
</ul>
<p>Each of those maps to the <a href="https://developer.shodan.io/api"><span class="caps">API</span> endpoints</a> described on the official Shodan site.</p>
<p>You are invited to tag along on this package as much or as little as you like. Drop a note in the comments if you find it useful or have suggestions! Please file all feature requests or problems on github. Have fun exploring the <span class="caps">API</span> in R!</p>RBerkeley Was Just Pining For The Fjords2015-07-27T11:13:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-07-27:posts/2015/Jul/rberkeley-was-just-pining-for-the-fjords/<p><b><span class="caps">UPDATE</span>:</b> <code>RBerkeley</code> is now <a href="https://cran.r-project.org/web/packages/RBerkeley/index.html">on <span class="caps">CRAN</span></a></p>
<p>If you made it to Chapter 8 of <a href="http://datadrivensecurity.info/amzn">Data-Driven Security</a> after ~October 2014 and tried to run the BerkeleyDB R example, you were greeted with:</p>
<div class="highlight"><pre>Warning in install.packages :
package ‘RBerkeley’ is not available (for R version [YOUR_R_VERSION])
</pre></div>
<p>That’s due to the fact that it was removed from <span class="caps">CRAN</span> at the end of September, 2014 because the package author <span class="amp">&</span> maintainer did not respond to requests from the <span class="caps">CRAN</span> team to update the package to conform to new requirements (specifically the way package vignettes are handled).</p>
<p>Sharon Machlis (@<a href="https://twitter.com/sharon000">sharon000</a> on Twitter) let me know about this recently. Since then I’ve had a few more pings about it (thank you all for reading the book! :-). So, I <a href="https://github.com/hrbrmstr/RBerkeley">resurrected the package</a>. <strike>It’s not on <span class="caps">CRAN</span> yet, but I did submit an update to it, so we’ll see how that goes.</strike></p>
<p>I did a bit more than move the vignette. It has a proper <code>autoconf</code> setup now and I fixed some of the warnings it was throwing on compilation. I also tweaked the configuration so it should work without whining on <code>libdb</code> 4+. </p>
<p>I highly doubt there were many other packages or projects relying on this package, but it seemed only fair to try to keep it alive while the book is still going strong (either that or I would have had to write a new example for that chapter, which <em>may</em> have been easier now that I’ve mucked with the package innards).</p>
<p>Post all issues/etc on github as usual: <a href="https://github.com/hrbrmstr/RBerkeley">https://github.com/hrbrmstr/RBerkeley</a></p>
<iframe width="640" height="390" src="https://www.youtube.com/embed/npjOSLCR2hE" frameborder="0" allowfullscreen></iframe>Introducing the cymruservices R Package2015-07-22T17:19:00-04:00Bob Rudis (@hrbrmstr)tag:datadrivensecurity.info/blog,2015-07-22:posts/2015/Jul/introducing-the-cymruservices-r-package/<p>The R world has come a long way since Jay <span class="amp">&</span> I wrote <a href="http://datadrivensecurity.info/amzn">Data-Driven Security</a>. We had to make a conscious decision to stick with R 2.14.0 (R is at version 3.2.1 now) and packages such as knitr and dplyr either didn’t exist or were in their infancy.</p>
<p>In Chapter 4, we showed some very basic exploratory data analysis and visualization. One of those examples showed how to do a basic network visualization of the ZeuS botnet nodes, clustered by country of origin.</p>
<p>We turned some of the functions that collected metadata on the ZeuS <span class="caps">IP</span> addresses into a new R package - <a href="https://github.com/hrbrmstr/cymruservices">cymruservices</a> which will be on <span class="caps">CRAN</span> soon. If you’re new to installing from github, you’ll need to install and load the <code>devtools</code> package then do a <code>devtools::install_github("hrbrmstr/cymruservices")</code> to work with that package until it gets on <span class="caps">CRAN</span>. (<span class="caps">UPDATE</span>: It’s <a href="http://cran.r-project.org/web/packages/cymruservices/index.html">on <span class="caps">CRAN</span></a>.)</p>
<p>We’ll re-create the first network visualization from listing 4-12 (page 94) using this package and also modify the code to use <code>dplyr</code> functions and visualize the graph with <code>networkD3</code>, a super-spiffy <code>htmlwidget</code> package. You’ll be able to pan <span class="amp">&</span> zoom the visualization and hopefully get some inspiration to “Try This At Home”.</p>
<p>We’ve placed the ZeuS botnet data used in the book on our website to make it easier to replicate the example. The code is (unsurprisingly) similar to the listing in the book:</p>
<div class="highlight"><pre>library<span class="p">(</span>igraph<span class="p">)</span>
library<span class="p">(</span>dplyr<span class="p">)</span>
library<span class="p">(</span>cymruservices<span class="p">)</span>
library<span class="p">(</span>networkD3<span class="p">)</span>
<span class="c1"># reading the <span class="caps">IP</span> list in a slightly different way</span>
ips <span class="o"><-</span> grep<span class="p">(</span><span class="s">"^#|^$"</span><span class="p">,</span> readLines<span class="p">(</span><span class="s">"http://datadrivensecurity.info/data/zeus-book.csv"</span><span class="p">),</span>
value<span class="o">=</span><span class="kc"><span class="caps">TRUE</span></span><span class="p">,</span> invert<span class="o">=</span><span class="kc"><span class="caps">TRUE</span></span><span class="p">)</span>
<span class="c1"># get metadata</span>
origin <span class="o"><-</span> bulk_origin<span class="p">(</span>ips<span class="p">)</span>
<span class="c1"># build graph</span>
g <span class="o"><-</span> graph.empty<span class="p">()</span>
g <span class="o"><-</span> g <span class="o">+</span> vertices<span class="p">(</span>ips<span class="p">,</span> group<span class="o">=</span><span class="m">1</span><span class="p">)</span>
g <span class="o"><-</span> g <span class="o">+</span> vertices<span class="p">(</span>origin<span class="o">$</span>cc<span class="p">,</span> group<span class="o">=</span><span class="m">2</span><span class="p">)</span>
<span class="c1"># there are other ways to build this edgelist, but I'm keeping with </span>
<span class="c1"># the example in the book for consistency</span>
ip_cc_edges <span class="o"><-</span> lapply<span class="p">(</span>ips<span class="p">,</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span>
i_cc <span class="o"><-</span> filter<span class="p">(</span>origin<span class="p">,</span> ip<span class="o">==</span>x<span class="p">)</span> <span class="o">%>%</span> .<span class="o">$</span>cc
lapply<span class="p">(</span>i_cc<span class="p">,</span> <span class="kr">function</span><span class="p">(</span>y<span class="p">)</span> <span class="p">{</span>
c<span class="p">(</span>x<span class="p">,</span> y<span class="p">)</span>
<span class="p">})</span>
<span class="p">})</span>
g <span class="o"><-</span> g <span class="o">+</span> edges<span class="p">(</span>unlist<span class="p">(</span>ip_cc_edges<span class="p">))</span>
<span class="c1"># simplify graph</span>
g <span class="o"><-</span> simplify<span class="p">(</span>g<span class="p">,</span> edge.attr.comb<span class="o">=</span>list<span class="p">(</span>weight<span class="o">=</span><span class="s">"sum"</span><span class="p">))</span>
g <span class="o"><-</span> delete.vertices<span class="p">(</span>g<span class="p">,</span> which<span class="p">(</span>degree<span class="p">(</span>g<span class="p">)</span> <span class="o"><</span> <span class="m">1</span><span class="p">))</span>
<span class="c1"># get ready to make javascript vis</span>
gd <span class="o"><-</span> get.data.frame<span class="p">(</span>g<span class="p">,</span> what <span class="o">=</span> <span class="s">"edges"</span><span class="p">)</span>
simpleNetwork<span class="p">(</span>gd<span class="p">,</span> linkDistance<span class="o">=</span><span class="m">20</span><span class="p">,</span> charge<span class="o">=</span><span class="m">-100</span><span class="p">,</span>
nodeColour<span class="o">=</span><span class="s">"#377eb8"</span><span class="p">,</span> textColour<span class="o">=</span><span class="s">"black"</span><span class="p">,</span>
fontSize<span class="o">=</span><span class="m">7</span><span class="p">,</span> fontFamily<span class="o">=</span><span class="s">"sans-serif"</span><span class="p">,</span>
height<span class="o">=</span><span class="m">600</span><span class="p">,</span> width<span class="o">=</span><span class="m">600</span><span class="p">,</span> zoom<span class="o">=</span><span class="kc"><span class="caps">TRUE</span></span><span class="p">)</span>
</pre></div>
<p>If you have the book, take a look at some of the subtle changes and also see how easy it is to make existing, static R visualizations dynamic.</p>
<p><center><iframe height=600 width=600 style="width:600px;height:600px" frameborder=0 seamless src="http://datadrivensecurity.info/frames/201507cymru.html"></iframe></center></p>
<p>There are a few more interesting functions in that package that will get you tons of useful metadata for your security data science projects. The package should be helpful when creating features for classification or just for building relationships between objects that you may never have known existed. Plus, you now have a new visualization toy to play with!</p>
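<p>For a taste of those other functions, here is a hedged sketch — <code>bulk_origin()</code> is the one used in the listing above, while the other function names and their exact signatures are assumptions you should verify against the package documentation:</p>
<div class="highlight"><pre>library(cymruservices)

# origin AS details for a couple of addresses (used in the listing above)
bulk_origin(c("68.22.187.5", "207.229.165.18"))

# AS metadata by ASN — function name assumed; check ?cymruservices
bulk_origin_asn(c(701, 23028))

# query Team Cymru's Malware Hash Registry — function name assumed
malware_hash("1250ac278944a0737707cf40a0fbecd4b5a17c9d")
</pre></div>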