Data Driven Security: The Podcast

A podcast on the journey to discovery and decision making through data in information security by Bob Rudis and Jay Jacobs. In this bi-weekly show, Jay & Bob cover all levels and aspects of data analysis & visualization, plus speak with pioneers in the field of security data science and seasoned veterans in information security.

The show is a companion resource to Data Driven Security: The Book & Data Driven Security: The Blog.

Data Driven Security - Episode 10

Episode 10: Community Discussion with John Langton & Alex Baker of VisiTrend

In this episode, Jay & Bob have a community discussion with John Langton & Alex Baker about their security data analysis & visualization startup, VisiTrend, and take a look at what's made the headlines in the data science community since the last show.

Resources / people featured in the show:

Link Insights from VisiTrend

VERIS/VCDB general vis - we have a tree map version of the actors, actions, assets, and attributes breakdown which better shows the distribution of events (description on snapshot).
Snapshot - can be posted and viewed without logging in
Actual analysis and data you can load after signing up and logging in

VERIS/VCDB clustering - each square is an event in the data set. Squares are first grouped based on # of employees (e.g. companies with 1k employees will be grouped together), and then based on industry. Squares are colored based on clustering output - we found 7 clusters. We will provide more detail on what defines these clusters in a blog post. It’s interesting to see that particular industries do have particular attack types according to clustering, shown by blocks of similar color.
Snapshot - Actual analysis and data

Honeypot overview - this is really cool (I think). Black, square nodes are the honeypots. Node size is based on the # of packets they’re sending. Computers that use more distinct ports are colored red (the big red guy doing a massive port scan drowns out the others). The force-directed layout clusters nodes if they hit the same honeypots. For instance, click a node in an “outer ring” twice to highlight the honeypot it’s hitting, and there will be just one. All other nodes in that ring hit the same one. Double-click one of the center nodes and you’ll see they’re hitting all of the honeypots. The treemap groups nodes according to subnet addressing. The timeline view shows a time-based histogram of incoming packets colored by destination port. The red guy is selected in the snapshot, so you can see that he blasts all the honeypots at roughly the same time.
Snapshot - Actual analysis and data
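For anyone who wants to poke at the same ideas outside the tool, here is a minimal Python sketch of the bookkeeping behind that view. It is not VisiTrend's implementation: the packet records, honeypot names, and the /24 prefix are made-up assumptions, and grouping sources by the exact set of honeypots they hit is only a rough stand-in for what the force-directed layout shows visually.

```python
from collections import defaultdict
import ipaddress

# Hypothetical packet records: (source IP, honeypot name, destination port).
packets = [
    ("10.0.1.5", "hp-a", 22),
    ("10.0.1.5", "hp-a", 22),
    ("10.0.1.9", "hp-a", 1433),
    ("10.0.2.7", "hp-a", 80),
    ("10.0.2.7", "hp-b", 443),
    ("10.0.2.7", "hp-c", 1433),
    ("10.0.2.7", "hp-c", 3389),
]


def summarize(records):
    """Per source: packet count (node size), distinct ports (color), honeypots hit."""
    stats = defaultdict(lambda: {"packets": 0, "ports": set(), "honeypots": set()})
    for src, honeypot, port in records:
        entry = stats[src]
        entry["packets"] += 1
        entry["ports"].add(port)
        entry["honeypots"].add(honeypot)
    return dict(stats)


def group_by_honeypots(stats):
    """Sources that hit exactly the same honeypots land in the same group."""
    groups = defaultdict(list)
    for src, entry in stats.items():
        groups[frozenset(entry["honeypots"])].append(src)
    return {key: sorted(members) for key, members in groups.items()}


def group_by_subnet(stats, prefix=24):
    """Group sources by subnet, as the treemap does (the /24 prefix is an assumption)."""
    groups = defaultdict(list)
    for src in stats:
        net = ipaddress.ip_network(f"{src}/{prefix}", strict=False)
        groups[str(net)].append(src)
    return {key: sorted(members) for key, members in groups.items()}


stats = summarize(packets)
rings = group_by_honeypots(stats)
subnets = group_by_subnet(stats)
```

In this toy data, the two sources that only touch `hp-a` fall into one group (an "outer ring"), while the source hitting all three honeypots sits alone, much like the center nodes in the snapshot.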

Honeypot port highlighting - Square nodes are attackers, and circle nodes are ports. The size of a port node is how many times packets were sent to that port. Mouse over the big purple circle and you see port 1433 is the most popular. You could double-click it to see all machines hitting that port. There are two color layers for the node-link graph, and you can toggle between them. They both show a version of variability over time (more red = more variable port usage). The treemap shows subnet addressing again but colors a green heat map based on the # of different ports each machine uses. Size is based on the # of packets they send.
Snapshot - Actual analysis and data
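The port view boils down to a few simple counts, sketched below in Python. The records are invented for illustration (they are not the honeypot data set from the show), and this is only an assumption about how such counts could be computed, not how VisiTrend does it.

```python
from collections import Counter, defaultdict

# Hypothetical records: (attacker IP, destination port).
packets = [
    ("10.0.1.5", 1433),
    ("10.0.1.5", 1433),
    ("10.0.1.9", 1433),
    ("10.0.1.9", 22),
    ("10.0.2.7", 80),
    ("10.0.2.7", 22),
    ("10.0.2.7", 3389),
]

# Port node size: how many packets were sent to each port.
port_hits = Counter(port for _, port in packets)
top_port, top_count = port_hits.most_common(1)[0]

# "Double click a port": every machine that hit that port.
attackers_by_port = defaultdict(set)
# Treemap heat map: number of different ports each machine uses.
ports_by_attacker = defaultdict(set)
for src, port in packets:
    attackers_by_port[port].add(src)
    ports_by_attacker[src].add(port)

distinct_ports = {src: len(ports) for src, ports in ports_by_attacker.items()}
```

Here `top_port` plays the role of the big purple circle, and `distinct_ports` is the value that would drive the green heat map.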

Finally, a great mentor and visionary pioneer of InfoVis named Matt Ward passed away last weekend. He wrote the most recent, comprehensive InfoVis book with some other really big names in the field, including Keim and Grinstein. Link to the book.

In This Episode

Time Index / Title
Data Driven Security
Intro + SIRAcon recap
Introducing John Langton and Alex Baker of VisiTrend
What is VisiTrend?
What types of analytics back-ends does VisiTrend support?
Were you surprised at the state of infosec-specific data exploration tools?
Building the VisiTrend Community
The science behind visualization
Comparing visualizations on the RSA floor
Balance of visualization for analysis and visualization for communication
How do you handle concerns over putting production data in the cloud?
How can we improve visualization literacy?
Do folks need to be "Hadoop experts" to work with your tool?
Data Science Headlines
Building a DGA Classifier three-fer
Data Science Can't be Point & Click
Machine Learning in 15 hours
Data Playlists
Docker for Data Science
Wrapping up