By Jay Jacobs (@jayjacobs)
Mon 15 September 2014
|
tags:
blog,
r,
rstats,
-- (permalink)
During July I was working with a commercial data source that provides extra data around IP addresses and it dawned on me: rather than pinging billions of IP addresses and creating map, I could create a map from all the geolocation data I had at my finger tips. At a high level I could answer “Where are all the IPv4 addresses worldwide?” But in reality what I created was a map communicating “Where does the geo-location services think all the IPv4 address are worldwide?” And at the end of July I put together a plot in about an hour and tossed it onto twitter. It is still getting retweets over a month later in spite of the redundancy in the title.
Bob and I have talked quite a bit before about the (questionable) value of maps and how they can be eye-catching, but they often lack the substance to communicate a clear message. The problem may be compounded when IP geolocation is the data source for maps. Hopefully I can point out some of the issues in this post as we walk through how to gather and map every IPv4 address in the world.
Step 2: Get the data
I already did step 1 by defining our goal and as a question it is,
“Where does the geo-location service think all the ipv4 addresses are
worldwide?” Step 2 then is getting data to support our research. When I
created the original map I used data from a commercial geolocation
service. Since most readers won’t have a subscription, we can reference
Maxmind and their free geolocation
data. Start by
downloading the “GeoLite City” database in
CSV/zip
format (28meg download) and unzip it to get the
“GeoLiteCity-Location.csv” file. Since the first line of the CSV is a
copyright statement, you have to read it in and skip 1 line. Because
this is quite a bit file, you should leverage the data.table command
fread()
library(data.table)
geo <- fread("data/GeoLiteCity-Location.csv", header=T, skip=1)
# how many rows?
geoRows <- nrow(geo)
print(geoRows)
## [1] 557986
Right away here, you can see some challenges with IP geolocation. There are around 4.2 billion total IP address, 3.7 billion are routable (half a billion are reserved) and yet the data only has a total of 557,986 unique rows. It’s probably a safe bet to say some of these may be aggregated together.
You can jump right to a map here and plot the latitude/longitude in that file, but to save processing time, you can remove duplicate points with the unique function. Then load up a world map, and plot the points on it.
geomap1 <- unique(geo, by=c("latitude", "longitude"))
library(maps)
library(ggplot2)
# load the world
world_map<-map_data("world")
# strip off antartica for aesthetics
world_map <- subset(world_map, region != "Antarctica") # sorry penguins
# set up the plot with the map data
gg <- ggplot(world_map)
# now add a map layer
gg <- gg + geom_map(dat=world_map, map = world_map,
aes(map_id=region), fill="white", color="gray70")
# and the ip geolocation points
gg <- gg + geom_point(data=geomap1, aes(longitude, latitude),
colour="#AA3333", alpha=1/10, size=0.5)
# basic theme
gg <- gg + theme_bw()
# show the map
print(gg)
That’s interesting, and if you notice the alpha on the points is set to 1/10th, meaning it will take ten point on top of one another to make the color solid (red in this case). One thing we didn’t do though is account for the density of the IP addresses. Some of those points may have thousands while others may have just a handful and the map doesn’t show that. In order to account for that you have to load up the other file in the zip file, the GeoLiteCity-Blocks file and merge it with the first file loaded.
blocks <- fread("data/GeoLiteCity-Blocks.csv", header=T, skip=1)
# these columns are read is as character for some reason
# make them numeric
blocks <- blocks[, lapply(.SD, as.numeric)]
fullgeo <- merge(blocks, geo, all=TRUE, by="locId") # "all" is important here
# trim out the columns we need
fullgeo <- fullgeo[ ,c(2,3,8,9), with=FALSE]
# set column names that are easier to type
setnames(fullgeo, c("begin", "end", "latitude", "longitude"))
# look at the data
print(fullgeo)
## begin end latitude longitude
## 1: NA NA 0.00 0.000
## 2: 2.655e+08 2.655e+08 35.00 105.000
## 3: 2.655e+08 2.655e+08 35.00 105.000
## 4: 5.408e+08 5.408e+08 35.00 105.000
## 5: 5.870e+08 5.870e+08 35.00 105.000
## ---
## 2428090: 3.646e+09 3.646e+09 52.52 13.400
## 2428091: 3.647e+09 3.647e+09 52.88 9.683
## 2428092: 3.735e+09 3.735e+09 -39.93 175.050
## 2428093: 3.735e+09 3.735e+09 -41.30 174.783
## 2428094: 3.730e+09 3.730e+09 24.67 118.458
What you are looking at here is four columns, the begining and ending
address in an IP block with the latitude and longitude of that block.
The IP addresses are stored in long format, which is both easier to work
with and smaller for memory/storage of the data. We’ll make use of the
long format in a bit, for now we still have more clean up to do. Notice
the first line where begin and end are both NA
? That either means
there were empty values in the CSV or the merge command didn’t have a
matching record for that location ID and because you set all
to true
in the merge command above, it filled in the row with NA’s. The default
behavior is to drop any rows that aren’t in both tables, but we overrode
that by setting all=TRUE
. We could take care of these NA’s but
removing the all
from the merge command and accept the default of
FALSE for all. But this is interesting, because in our first plot we
just took all the latitude and longitude and plotted them… how many
don’t have corresponding IP address blocks?
sum(is.na(fullgeo$begin))
## [1] 430051
430 thousand orphaned locations? That seems like a lot of unassociated lat/long pairs, doesn’t it?
But keep going, you’ll want to do two more things with this data: 1)
count the number of IP’s in each block and 2) total up the number of
IP’s for each location. In order to do that efficiently from both a code
and time perspective we’ll leverage dplyr
. Let’s clean up the NA’s
while we are at it.
library(dplyr)
# tbl_dt loads a data.table object into dplyr
# and the %>% is the "pipe" command to send the output
# to the next command.
finalgeo <- tbl_dt(fullgeo) %>%
filter(!is.na(begin)) %>% # remove the NA's.
mutate(count = end - begin + 1) %>% # count the # of IPs
group_by(latitude, longitude) %>% # aggregate by unique lat and long
summarise(ipcount=sum(count)) # add up all counts
# what do we have?
print(finalgeo)
## Source: local data table [105,304 x 3]
## Groups: latitude
##
## latitude longitude ipcount
## 1 -90.00 0.00 4419
## 2 -54.80 -68.30 6560
## 3 -54.16 -36.72 5
## 4 -53.87 -67.78 1280
## 5 -53.79 -67.71 8960
## 6 -53.67 -68.47 8
## 7 -53.33 -68.95 24
## 8 -53.15 -70.92 6152
## 9 -51.75 -59.00 1154
## 10 -51.73 -72.52 256
## .. ... ... ...
Notice how we have 105,304 rows? That’s a far cry from the 557,986 rows we had in the original latitude/longitude pairings you mapped.
Explore the data
What does the distribution of the counts look like? Chances are good there is a heavy skew to the data. To create a plot where you can see the distribution, you’ll have to change the axis showing the distribution of addresses per lat/long pair to a logorithmic scale.
library(scales)
gg <- ggplot(finalgeo, aes(x=ipcount))
gg <- gg + geom_density(fill="slateblue")
gg <- gg + scale_x_log10("Addresses per Block", expand=c(0,0),
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
gg <- gg + scale_y_continuous("Density", expand=c(0,0))
gg <- gg + ggtitle("Density of Lat/Long IP Blocks")
gg <- gg + theme(axis.ticks=element_blank(),
panel.grid=element_blank(),
panel.background=element_rect(color=NA,
fill=NA))
print(gg)
I would guess that the spikes are around and we can check by converting
the count field to a factor and running summary
against it.
summary(factor(finalgeo$ipcount), maxsum=10)
## 256 512 128 768 1024 384 1280 640 1536
## 21905 8744 5142 4583 3543 3019 2154 1936 1545
## (Other)
## 52733
While that’s interesting, it’s not surprising or all that informative.
Back to the map, right now you have three variables, the latitude, longitude and count of addreses at the location. Lat and long are easy enough, those are points on a map. How do represent density at that point? I think there are three viable options: color (hue), size or opacity (color brightness). In my original plot, I leverage the alpha setting on the points. Trying to use hue would just get jumbled together since at the world view many of the points overlap and individual colors would be impossible to see. with a hundred thousand+ points, size also will overlap and be indistinguishable.
Since all we want to see is the relative density over the entire map, the reader won’t care if there is a lot of IP addresses at a point or a whole lot of IP addresses at point. Showing density is what’s important, so let’s use the alpha (opacity) setting of the point to show density. The alpha setting is a value between 0 and 1, and our counts are large numbers with a heavy skew. To wrangle the range into an alpha setting we should first take the log of the count and then scale it between 0 and 1. I chose to apply log twice to shift the skew the distribution so most of the values are less than 0.5. Since the points overlap, this should make a nice range of opacity for the points.
temp <- log(log(finalgeo$ipcount+1)+1)
finalgeo$alpha <- (max(temp)-temp)/max(temp)
hist(finalgeo$alpha)
And now let’s map those!
world_map<-map_data("world")
world_map <- subset(world_map, region != "Antarctica") # inteRcouRse AntaRctica
gg <- ggplot(world_map)
gg <- gg + geom_map(dat=world_map, map = world_map,
aes(map_id=region), fill="white", color="gray70")
gg <- gg + geom_point(data=finalgeo, aes(longitude, latitude),
colour="#AA3333", alpha=finalgeo$alpha, size=0.3)
gg <- gg + scale_x_continuous(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + expand_limits(x = world_map$long, y = world_map$lat)
gg <- gg + ggtitle("IPv4 addresses worldwide")
gg <- gg + theme(axis.text=element_blank(),
axis.title=element_blank(),
legend.position="none",
axis.ticks=element_blank(),
panel.grid=element_blank(),
panel.background=element_rect(color=NA,
fill=NA),
plot.background=element_rect(fill="#A0CFEC66"))
gg
And there you have it! There are several tweaks that could be done to this. If you notice in this final map, I set the point size to be 0.3. If you raise that up you can create a map that is very dense with color and the size of the point is relative to the size of the output plot. If you export at 4x6 a point size of 0.3 may be huge, but they may barely show up if you export at 15x20. There is no set formula and you can play around with the values, but just be sure the final product stays as close to the data as possible.
A Final Thought
We talked about the “Potwin effect” in our book and Bob mentioned it in his Statebins blog post as well. But if you notice some of the lat/long pairs are rounded off to whole integers. That may be a good indication that the only thing known about the geolocation of the IP address is the country. Further work may be to remove or otherwise account for the uncertain points by matching. Chances are good they are the only lat/long pairs that are both whole numbers throughout the data.
Tweet