By Steve Patton (@spttnnh)
Mon 17 February 2014
tags: datavis, dataviz, AlienVault, data analysis, data management
Chapter 3 of Data-Driven Security explores one download of the AlienVault reputation database. As you can see from the book, the reputation database has many interesting aspects to explore, including country profiles, risk versus reliability, and a variety of threat types. Of course, one download represents a simple snapshot in time. Yet we know threats are continually changing, moving, intensifying and waning. How could we expand our analysis to look at the reputation database over time?
In this series of occasional posts, we will take a time series of database snapshots and conduct a brief longitudinal study of the reputation database and its changes. One of the first challenges is capturing a picture of change over time and managing the resulting dataset. For our purposes, a cron job captured the reputation database hourly for three months (October through December 2013). The resulting files are the primary dataset for our study, and their scale can be overwhelming:
- 250,000 records
- 8 fields
- 24 times daily
- 92 days
Multiplied together, those dimensions produce a truckload of elements: 4.4 billion, to be somewhat precise.
While all this data needs to be processed for our analysis, there are certain simplifications that can reduce the amount of data needed to draw meaningful conclusions. For example, out of approximately 885,000 unique IP addresses found over the study period, less than 4% of the addresses had any change to their description. Additionally, less than one tenth of one percent of records had a change to the reliability or risk rating of an address. We can use this relative stability of records to responsibly prune the number of records we track. The raw number of records is roughly 250,000 times 24 hours times 92 days, or 552 million rows. But by choosing our smallest unit of analysis to be a day, we can smooth out any gaps in our hourly collection, and reduce our record count to about 20 million.
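For the record, here is the arithmetic behind those figures, expressed as a quick shell check:
# total elements: records x fields x hourly snapshots x days
echo $((250000 * 8 * 24 * 92))    # 4416000000 -- the ~4.4 billion above
# raw rows at hourly granularity
echo $((250000 * 24 * 92))        # 552000000 -- the 552 million above
# rows at daily granularity
echo $((250000 * 92))             # 23000000 -- the "about 20 million" above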
For all the wonderful features of open source R, there are some tasks it is not well suited for. When dealing with millions of records, a relational database is an ideal tool for managing and aggregating records. Here is a brief overview of the foundational scripts used to make the primary dataset for our study. The bash script running hourly in cron is below:
#!/bin/bash
YMDHM=`date +%Y%m%d%H%M`
cd /data/avrep
# move previous version file
mv reputation.rev reputation.prev
# get current version file
wget https://reputation.alienvault.com/reputation.rev
# count lines that differ between the two version files
UPDATE=`diff reputation.prev reputation.rev | wc -l`
# if version files do not match, download reputation file
if [ $UPDATE -gt 0 ]; then
    echo get
    wget -O avrep${YMDHM}.data https://reputation.alienvault.com/reputation.data
    bzip2 avrep${YMDHM}.data
fi
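For completeness, a minimal crontab entry to run this script hourly might look like the following (the script and log file names are hypothetical, not from the original setup):
# run at the top of every hour
0 * * * * /data/avrep/get_avrep.sh >> /data/avrep/get_avrep.log 2>&1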
Here is the table definition script for MariaDB:
DROP TABLE IF EXISTS avip;
-- one row per distinct (ip, type, cc, city, latlon) combination seen
CREATE TABLE avip (
    iind INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ip INT UNSIGNED NOT NULL,
    type VARCHAR(50),
    cc VARCHAR(2),
    city VARCHAR(30),
    latlon VARCHAR(30),
    PRIMARY KEY (iind),
    UNIQUE INDEX avip_ind (ip, type, cc, city, latlon)
);
DROP TABLE IF EXISTS avtrack;
-- one row per day per avip entry: latest risk/reliability plus running min/max
CREATE TABLE avtrack (
    d DATE NOT NULL,
    iind INT UNSIGNED NOT NULL,
    risk TINYINT UNSIGNED,
    rel TINYINT UNSIGNED,
    minrisk TINYINT UNSIGNED,
    minrel TINYINT UNSIGNED,
    maxrisk TINYINT UNSIGNED,
    maxrel TINYINT UNSIGNED,
    PRIMARY KEY (d, iind)
);
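Splitting the rarely-changing description of an address (avip) from its daily risk and reliability readings (avtrack) is what keeps the row count near 20 million. As a sketch of how the two tables come back together, a query along these lines reassembles the daily history of one address (the IP shown is a hypothetical documentation-range example):
SELECT t.d, INET_NTOA(i.ip) AS ip, i.type, i.cc, i.city,
       t.risk, t.rel, t.minrisk, t.maxrisk
FROM avtrack t
JOIN avip i ON i.iind = t.iind
WHERE i.ip = INET_ATON('203.0.113.7')  -- hypothetical example address
ORDER BY t.d;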
Here is the bash script for loading the files into a database:
#!/bin/bash
cd /data/avrep
ls avrep*.data.bz2 > avrep.lst
while read line
do
    # decompress the snapshot to the path the SQL script reads from
    bzcat $line > /data/avrep_tmp.data
    # extract YYYYMMDD from filenames of the form avrepYYYYMMDDHHMM.data.bz2
    DATESTR=`echo ${line} | cut -c6-13`
    sed -e "s/<DAY>/${DATESTR}/g" avtrack.mysql.template > avtrack.mysql
    mysql avrep -u aaa --password=bbb < avtrack.mysql
done < avrep.lst
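To make the template step concrete: for a hypothetical file named avrep201310150005.data.bz2, DATESTR would be 20131015, and the sed substitution rewrites the template's <DAY> placeholder accordingly:
DATESTR=20131015    # as cut from avrep201310150005.data.bz2
sed -e "s/<DAY>/${DATESTR}/g" avtrack.mysql.template | grep "@day"
# prints: SET @day = '20131015';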
Here is the processing script (the avtrack.mysql.template file, with its <DAY> placeholder) run for each AlienVault download:
DROP TABLE IF EXISTS avrep;
-- staging table matching the reputation.data field order:
-- IP#Reliability#Risk#Type#Country#City#Lat,Lon#x
CREATE TABLE avrep (
    ip VARCHAR(20),
    rel TINYINT UNSIGNED,
    risk TINYINT UNSIGNED,
    type VARCHAR(50),
    cc VARCHAR(2),
    city VARCHAR(30),
    latlon VARCHAR(30),
    x VARCHAR(10)
);
LOAD DATA LOCAL INFILE '/data/avrep_tmp.data' INTO TABLE avrep
FIELDS TERMINATED BY '#';
SHOW WARNINGS;
-- register any new (ip, description) combinations; existing ones are
-- skipped via the unique index on avip
INSERT IGNORE INTO avip
SELECT NULL, INET_ATON(ip), type, cc, city, latlon
FROM avrep;
SET @day = '<DAY>';
-- the first snapshot of a day inserts the row; later snapshots hit the
-- (d, iind) primary key and fall through to the update clause, which keeps
-- the latest risk/rel values and maintains the running min/max
INSERT INTO avtrack
SELECT @day AS d,
       (SELECT iind FROM avip WHERE INET_ATON(avrep.ip)=avip.ip AND
               avrep.type=avip.type AND avrep.cc=avip.cc AND
               avrep.city=avip.city AND avrep.latlon=avip.latlon),
       risk, rel, risk, rel, risk, rel
FROM avrep
ON DUPLICATE KEY UPDATE
    avtrack.risk=avrep.risk, avtrack.rel=avrep.rel,
    avtrack.minrisk=IF(avrep.risk<avtrack.minrisk, avrep.risk, avtrack.minrisk),
    avtrack.minrel=IF(avrep.rel<avtrack.minrel, avrep.rel, avtrack.minrel),
    avtrack.maxrisk=IF(avrep.risk>avtrack.maxrisk, avrep.risk, avtrack.maxrisk),
    avtrack.maxrel=IF(avrep.rel>avtrack.maxrel, avrep.rel, avtrack.maxrel);
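Once the loads have run, avtrack supports the kind of sanity checks quoted earlier. For instance, a query along these lines (a sketch, not part of the original scripts) counts how many avip entries ever had a change in risk or reliability over the whole study period:
SELECT COUNT(*) AS changed
FROM (
    SELECT iind
    FROM avtrack
    GROUP BY iind
    HAVING MIN(minrisk) <> MAX(maxrisk)
        OR MIN(minrel) <> MAX(maxrel)
) AS c;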
With these scripts, we can collect AlienVault downloads and load them into the two tables, avip and avtrack. In our next installment, we’ll start looking at the data we have aggregated. With a dataset that spans a period of time, in contrast to a single fixed-time sample, we can finally look at changes over time.