By Bob Rudis (@hrbrmstr)
Tue 04 February 2014

Tags: R, passwords
The `pipal` utility is one of the standard, "go-to" tools when analyzing cracked password dumps. It's a command-line program written in Ruby, and I thought it would be interesting to port the base functionality to R and then build upon that base over time (R has some really handy advanced textual-analysis tools).

This first release duplicates most of `pipal`'s functionality and will hopefully serve as an extended introduction to R for those just approaching the language. Our book provides a solid, basic introduction to R, but this example incorporates more complex data structures and packages, plus some additional `*apply()` function machinations that we didn't really cover in detail in the text (due to lack of time/space).
We'll start by loading two packages for assistance. The `data.table` package gives us the wicked-fast `fread()` function, and `stringr` makes working with strings a bit less messy.
library(data.table)
library(stringr)
First, we'll read in a sample file (the infamous "phpbb" hack dump). The `fread()` function performs just like the built-in `read.delim()` function but operates much faster and produces a `data.table`, which is an optimized `data.frame`.

NOTE: there are some multi-byte characters towards the end of the list, so make sure to convert the file to UTF-8 before trying this at home:

passwords <- fread("phpbb.txt")
setnames(passwords, "V1", "orig") # decent label for the column
tot <- nrow(passwords)            # we compute many ratios with this
So, it turns out `fread()` is not optimal for this use case since (as the docs clearly state, but I obviously ignored) it will stop reading at the first blank line. The following is updated code (changes posted to GitHub as well) for reading in the data file and converting it to a `data.table`:

passwords <- data.table(orig=readLines("phpbb.txt"))
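To see why `readLines()` works here where `fread()` stumbled, here's a quick self-contained check (using a temporary file) showing that `readLines()` keeps reading past a blank line:

```r
# write a tiny file with a blank line in the middle
tf <- tempfile()
writeLines(c("password1", "", "dragon"), tf)

# readLines() returns all three lines, blank included
readLines(tf)
length(readLines(tf))  # 3
```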
We can get the "top 10" passwords pretty easily in R using the `summary()` function on a `factor()` (a function that creates a special enumerated reference type from a set of input values) created from the original passwords, but the phpbb file was pre-`uniq`'d, so it isn't very interesting in this case:
top.10 <- factor(passwords$orig)
summary(top.10, maxsum=11)
## __rob_rao _1phpbb _7114461 _87698 _amber_
## 1 1 1 1 1
## _apr1l1a _babytje_ _bean _blubber_ _disablemenow_
## 1 1 1 1 1
## (Other)
## 184379
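If the `factor()`/`summary()` combination is new to you, here's a minimal example on a toy vector showing how `summary()` on a factor yields counts per level:

```r
# a factor with three levels: bird, cat, dog
f <- factor(c("cat", "dog", "cat", "bird", "cat"))
summary(f)
# bird  cat  dog
#    1    3    1
```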
But we want a tad nicer output, so we need to transform the data just a bit to build a `data.frame` and also include the percentages:
top.10 <- as.data.frame(head(sort(table(passwords$orig), decreasing=TRUE),10))
top.10$Password <- rownames(top.10)
rownames(top.10) <- NULL
top.10 <- top.10[,c(2,1)]
colnames(top.10) <- c("Password","Count")
top.10$Percent <- sprintf("%3.2f%%", ((top.10$Count / tot) * 100))
print(top.10, row.names=FALSE)
## Password Count Percent
## __rob_rao 1 0.00%
## _1phpbb 1 0.00%
## _7114461 1 0.00%
## _87698 1 0.00%
## _amber_ 1 0.00%
## _apr1l1a 1 0.00%
## _babytje_ 1 0.00%
## _bean 1 0.00%
## _blubber_ 1 0.00%
## _disablemenow_ 1 0.00%
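The core of that transformation is the `table()`/`sort()`/`head()` idiom, which is worth seeing in isolation on a toy vector:

```r
x <- c("a", "b", "a", "c", "a", "b")
# contingency table, sorted by count descending, top 2 kept
head(sort(table(x), decreasing = TRUE), 2)
# x
# a b
# 3 2
```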
Now, we check for commonality among "base" words, which (according to `pipal`) are "words with any non-alpha character stripped from the start and end." The `gsub()` function does most of the heavy lifting here, taking in a regex and removing whatever it matches from the original strings. The `gsub()` function is vectorized, so we can pass in the entire `data.table` column and it will perform the substitution on each element without a loop (the same is true for `tolower()`).
passwords$basewords <- gsub("^[^a-z]*", "", passwords$orig, ignore.case=TRUE)
passwords$basewords <- gsub("[^a-z]*$", "", passwords$basewords, ignore.case=TRUE)
passwords$basewords <- tolower(passwords$basewords)
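To make the vectorization concrete, the same two `gsub()` calls applied to a toy character vector strip the leading and trailing non-alpha characters from every element at once, no loop required:

```r
x <- c("123abc!!", "##Dragon99", "pass")
x <- gsub("^[^a-z]*", "", x, ignore.case = TRUE)  # strip leading non-alpha
x <- gsub("[^a-z]*$", "", x, ignore.case = TRUE)  # strip trailing non-alpha
tolower(x)
# [1] "abc"    "dragon" "pass"
```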
We can then use the same `factor()`/`summary()` combination to get the statistics we're looking for, and filter the `data.frame` down to only basewords that have more than three characters.
basewords <- factor(passwords[nchar(passwords$basewords)>3,]$basewords)
summary(basewords, maxsum=11)
## phpbb password dragon pass mike blue test qwerty
## 332 89 76 70 69 67 66 59
## alex alpha (Other)
## 58 53 149917
And, again we’ll make a nice table:
basewords <- as.data.frame(head(sort(table(passwords[nchar(passwords$basewords)>3,]$basewords), decreasing=TRUE),10))
basewords$Password <- rownames(basewords)
rownames(basewords) <- NULL
basewords <- basewords[,c(2,1)]
colnames(basewords) <- c("Password","Count")
basewords$Percent <- sprintf("%3.2f%%", ((basewords$Count / tot) * 100))
print(basewords, row.names=FALSE)
## Password Count Percent
## phpbb 332 0.18%
## password 89 0.05%
## dragon 76 0.04%
## pass 70 0.04%
## mike 69 0.04%
## blue 67 0.04%
## test 66 0.04%
## qwerty 59 0.03%
## alex 58 0.03%
## alpha 53 0.03%
To get a breakdown by overall password length, we run `nchar()` over the original password column and create a new column that holds the length of each password. Then we build a contingency table from that new column and show it first ordered by length, then create an ordered factor so we can also view it by frequency.
passwords$len <- nchar(passwords$orig)
# length ordered
summary(factor(passwords$len))
by.length <- as.data.frame(table(passwords$len))
colnames(by.length) <- c("Length","Count")
by.length$Percent <- sprintf("%3.2f%%", ((by.length$Count / tot) * 100))
print(by.length, row.names=FALSE)
# freq ordered
length.tab <- table(passwords$len) # contingency table
summary(factor(passwords$len,
levels = names(length.tab[order(length.tab, decreasing = TRUE)])))
by.freq <- as.data.frame(table(factor(passwords$len,
levels = names(length.tab[order(length.tab, decreasing = TRUE)]))))
colnames(by.freq) <- c("Length","Count")
by.freq$Percent <- sprintf("%3.2f%%", ((by.freq$Count / tot) * 100))
print(by.freq, row.names=FALSE)
## Length Count Percent
## 8 55338 30.01%
## 6 42070 22.82%
## 7 32731 17.75%
## 9 19188 10.41%
## 10 11896 6.45%
## 5 8198 4.45%
## 11 4933 2.68%
## 4 4598 2.49%
## 12 2505 1.36%
## 13 1018 0.55%
## 3 776 0.42%
## 14 515 0.28%
## 15 232 0.13%
## 2 137 0.07%
## 16 125 0.07%
## 17 36 0.02%
## 1 32 0.02%
## 18 27 0.01%
## 19 9 0.00%
## 20 8 0.00%
## 21 5 0.00%
## 23 3 0.00%
## 32 3 0.00%
## 22 2 0.00%
## 27 2 0.00%
## 25 1 0.00%
## 28 1 0.00%
plot(length.tab, col="steelblue", main="Password Length Frequency", xlab="Password Length", ylab="Count")
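The frequency-ordering trick above, building factor levels from the sorted contingency table, is easier to see on a small vector. A minimal sketch of the same idiom:

```r
x <- c(3, 6, 6, 8, 8, 8)
tab <- table(x)
# levels ordered by descending frequency: "8", "6", "3"
names(tab[order(tab, decreasing = TRUE)])
```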
Next we break down the composition a bit more, seeing how many passwords had 1-6 characters, 1-8 characters, and 9 or more characters. This is basic `data.frame` filtering (we end up just counting the resulting rows).
one.to.six <- nrow(passwords[passwords$len>=1 & passwords$len<=6,])
one.to.eight <- nrow(passwords[passwords$len>=1 & passwords$len<=8,])
nine.plus <- nrow(passwords[passwords$len>8,])
print(sprintf("One to six characters = %d, (%3.3f%%)", one.to.six, 100*(one.to.six/tot)))
print(sprintf("One to eight characters = %d, (%3.3f%%)", one.to.eight, 100*(one.to.eight/tot)))
print(sprintf("More than eight characters = %d, (%3.3f%%)", nine.plus, 100*(nine.plus/tot)))
One to six characters = 55811, (30.268%)
One to eight characters = 143880, (78.031%)
More than eight characters = 40509, (21.969%)
To examine other bits of alphanumeric composition, we use `grepl()`, which returns `TRUE` if a regex matches, and rely on a sneaky bit of functionality in the `sum()` function: `FALSE` values coerce to 0 and `TRUE` values to 1, so summing the logical vector returned by `grepl()` counts the matches.
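The `sum()`-over-logicals trick is easy to demonstrate in isolation:

```r
v <- c("abc", "ABC", "a1c")
grepl("^[a-z]+$", v)       # TRUE FALSE FALSE
sum(grepl("^[a-z]+$", v))  # 1 -- TRUE counts as 1, FALSE as 0
```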
only.lower.alpha <- sum(grepl("^[a-z]+$",passwords$orig))
only.upper.alpha <- sum(grepl("^[A-Z]+$",passwords$orig))
only.alpha <- only.lower.alpha + only.upper.alpha
only.numeric <- sum(grepl("^[0-9]+$",passwords$orig))
first.cap.last.sym <- sum(grepl("^[A-Z].*[[:punct:]]$",passwords$orig))
first.cap.last.num <- sum(grepl("^[A-Z].*[0-9]$",passwords$orig))
print(sprintf("Only lowercase alpha = %d, (%3.3f%%)", only.lower.alpha, 100*(only.lower.alpha/tot)))
print(sprintf("Only uppercase alpha = %d, (%3.3f%%)", only.upper.alpha, 100*(only.upper.alpha/tot)))
print(sprintf("Only alpha = %d, (%3.3f%%)", only.alpha, 100*(only.alpha/tot)))
print(sprintf("Only numeric = %d, (%3.3f%%)", only.numeric, 100*(only.numeric/tot)))
print(sprintf("First capital last symbol = %d, (%3.3f%%)", first.cap.last.sym, 100*(first.cap.last.sym/tot)))
print(sprintf("First capital last number = %d, (%3.3f%%)", first.cap.last.num, 100*(first.cap.last.num/tot)))
Only lowercase alpha = 76041, (41.239%)
Only uppercase alpha = 1706, (0.925%)
Only alpha = 77747, (42.165%)
Only numeric = 20728, (11.241%)
First capital last symbol = 225, (0.122%)
First capital last number = 4749, (2.576%)
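The `^` and `$` anchors do the real work in those composition checks; for example, the "first capital, last symbol" pattern behaves like this on a few test strings:

```r
# must start with an uppercase letter and end with a punctuation character
grepl("^[A-Z].*[[:punct:]]$", c("Password!", "password!", "Pass1"))
# [1]  TRUE FALSE FALSE
```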
We move next to comparing against password lists. The `pipal` tool lets you pipe in lists (which this will eventually let you do), but we can start with the "25 worst passwords on the internet". We'll use the same basic pattern for all of these list-based comparisons:

- Put the list of words into a vector
- Create a "list of lists" (basically a nested data structure) that holds the search term and the count of times it appeared in the password dump
- Output the search term, count and percentage

We'll be using `sapply()` to execute a new function, `makeCounts()`, which will do the grepping and build the nested data structure, then use another new function, `printCounts()`, which will generate a familiar table output.
makeCounts <- function(x) {
return(x=list("count"=sum(grepl(x, passwords$orig, ignore.case=TRUE))))
}
printCounts <- function(ct) {
tmp <- data.frame(Term=names(ct), Count=as.numeric(unlist(ct)))
tmp$Percent <- sprintf("%3.2f%%", ((tmp$Count / tot) * 100))
print(tmp[order(-tmp$Count),], row.names=FALSE)
}
# setup the "worst passwords" vector
worst.pass <- c("password", "123456", "12345678", "qwerty", "abc123",
"monkey", "1234567", "letmein", "trustno1", "dragon",
"baseball", "111111", "iloveyou", "master", "sunshine",
"ashley", "bailey", "passw0rd", "shadow", "123123",
"654321", "superman", "qazwsx", "michael", "football")
worst.ct <- sapply(worst.pass, makeCounts, simplify=FALSE)
printCounts(worst.ct)
## Term Count Percent
## master 229 0.12%
## 123456 208 0.11%
## dragon 185 0.10%
## password 164 0.09%
## monkey 118 0.06%
## shadow 105 0.06%
## qwerty 95 0.05%
## 1234567 72 0.04%
## 12345678 47 0.03%
## letmein 44 0.02%
## michael 39 0.02%
## 123123 27 0.01%
## abc123 26 0.01%
## 654321 26 0.01%
## superman 18 0.01%
## qazwsx 17 0.01%
## 111111 15 0.01%
## ashley 15 0.01%
## bailey 15 0.01%
## baseball 13 0.01%
## sunshine 13 0.01%
## football 12 0.01%
## iloveyou 11 0.01%
## passw0rd 9 0.00%
## trustno1 7 0.00%
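To see the nested structure that the `makeCounts()` pattern builds without needing the full dump, here's the same idiom on a tiny toy password vector (the vector, terms and `countTerm` name are all made up for illustration):

```r
pw <- c("password1", "Password!", "dragon", "letmein")

# same shape as makeCounts(): a one-element list holding the match count
countTerm <- function(x) list(count = sum(grepl(x, pw, ignore.case = TRUE)))

# simplify=FALSE keeps the result as a named list of lists
ct <- sapply(c("password", "dragon"), countTerm, simplify = FALSE)
ct$password$count  # 2 ("password1" and "Password!")
ct$dragon$count    # 1
```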
Now, we'll do the same for weekdays (full & abbreviated), month names (full & abbreviated) and years (1975-2030). This will demonstrate some of R's built-in constants and sequence-generation capabilities.
weekdays.full <- c("sunday", "monday", "tuesday", "wednesday",
"thursday", "friday", "saturday")
weekdays.abbrev <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")
months.full <- tolower(month.name)
months.abbrev <- tolower(month.abb)
yrs <- as.character(1975:2030)
printCounts(sapply(weekdays.full, makeCounts, simplify=FALSE))
## Term Count Percent
## monday 12 0.01%
## friday 11 0.01%
## sunday 5 0.00%
## thursday 3 0.00%
## tuesday 2 0.00%
## wednesday 1 0.00%
## saturday 1 0.00%
printCounts(sapply(weekdays.abbrev, makeCounts, simplify=FALSE))
## Term Count Percent
## mon 954 0.52%
## sun 299 0.16%
## sat 187 0.10%
## thu 184 0.10%
## fri 169 0.09%
## wed 69 0.04%
## tue 16 0.01%
printCounts(sapply(months.full, makeCounts, simplify=FALSE))
## Term Count Percent
## may 171 0.09%
## june 56 0.03%
## april 48 0.03%
## july 27 0.01%
## march 23 0.01%
## august 22 0.01%
## october 15 0.01%
## january 8 0.00%
## november 7 0.00%
## december 6 0.00%
## february 3 0.00%
## september 3 0.00%
printCounts(sapply(months.abbrev, makeCounts, simplify=FALSE))
## Term Count Percent
## mar 1406 0.76%
## jan 341 0.18%
## jun 190 0.10%
## may 171 0.09%
## nov 161 0.09%
## jul 158 0.09%
## dec 120 0.07%
## sep 118 0.06%
## apr 108 0.06%
## aug 83 0.05%
## oct 69 0.04%
## feb 42 0.02%
printCounts(sapply(yrs, makeCounts, simplify=FALSE))
## Term Count Percent
## 2000 428 0.23%
## 2002 268 0.15%
## 2001 236 0.13%
## 2003 235 0.13%
## 2005 199 0.11%
## 1987 183 0.10%
## 2004 180 0.10%
## 1984 176 0.10%
## 1985 171 0.09%
## 1983 168 0.09%
## 1988 165 0.09%
## 1986 152 0.08%
## 2006 145 0.08%
## 1979 142 0.08%
## 1982 142 0.08%
## 1981 139 0.08%
## 1989 139 0.08%
## 1980 130 0.07%
## 1990 127 0.07%
## 1978 118 0.06%
## 1991 115 0.06%
## 1977 96 0.05%
## 2007 91 0.05%
## 1975 82 0.04%
## 1992 82 0.04%
## 1976 80 0.04%
## 1999 79 0.04%
## 2010 57 0.03%
## 1997 56 0.03%
## 1993 49 0.03%
## 1998 49 0.03%
## 2011 48 0.03%
## 2020 47 0.03%
## 2012 45 0.02%
## 1994 41 0.02%
## 2021 39 0.02%
## 1996 38 0.02%
## 2030 32 0.02%
## 2008 30 0.02%
## 2013 27 0.01%
## 2022 27 0.01%
## 2009 26 0.01%
## 2019 26 0.01%
## 1995 25 0.01%
## 2028 19 0.01%
## 2025 18 0.01%
## 2027 18 0.01%
## 2017 17 0.01%
## 2015 16 0.01%
## 2018 16 0.01%
## 2023 15 0.01%
## 2024 15 0.01%
## 2026 13 0.01%
## 2016 12 0.01%
## 2014 9 0.00%
## 2029 8 0.00%
As `pipal` points out, "the common assumption is that when people are forced to use passwords with numbers, their general response is to add a single digit on the end. Looking at this next set of stats, in this list people actually preferred to add two digits onto the end. The assumption that the last digit will be '1' does however hold true."

We'll rely on `grepl()` and standard regexes for this part of the analysis:
singles.on.end <- sum(grepl("[^0-9]+([0-9]{1})$", passwords$orig))
doubles.on.end <- sum(grepl("[^0-9]+([0-9]{2})$", passwords$orig))
triples.on.end <- sum(grepl("[^0-9]+([0-9]{3})$", passwords$orig))
print(sprintf("Single digit on the end = %d, (%3.3f%%)", singles.on.end, 100*(singles.on.end/tot)))
print(sprintf("Two digits on the end = %d, (%3.3f%%)", doubles.on.end, 100*(doubles.on.end/tot)))
print(sprintf("Three digits on the end = %d, (%3.3f%%)", triples.on.end, 100*(triples.on.end/tot)))
Single digit on the end = 14447, (7.835%)
Two digits on the end = 18113, (9.823%)
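Those regexes key on a run of digits at the end preceded by at least one non-digit; on toy strings the two-digit pattern behaves like this:

```r
# ends in exactly-matched two digits preceded by at least one non-digit
grepl("[^0-9]+([0-9]{2})$", c("pass12", "pass1", "12"))
# [1]  TRUE FALSE FALSE
```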
passwords$last.num <- as.numeric(str_extract(passwords$orig, "[0-9]$"))
last.num.factor <- factor(na.omit(passwords$last.num))
plot(last.num.factor, col="steelblue", main="Count By Last digit")
summary(last.num.factor)
## 0 1 2 3 4 5 6 7 8 9
## 7753 13572 8735 9313 6279 6409 5992 6472 5726 6728
last.num <- as.data.frame(table(last.num.factor))
colnames(last.num) <- c("Digit","Count")
last.num$Percent <- sprintf("%3.2f%%", ((last.num$Count / tot) * 100))
print(last.num, row.names=FALSE)
## Digit Count Percent
## 0 7753 4.20%
## 1 13572 7.36%
## 2 8735 4.74%
## 3 9313 5.05%
## 4 6279 3.41%
## 5 6409 3.48%
## 6 5992 3.25%
## 7 6472 3.51%
## 8 5726 3.11%
## 9 6728 3.65%
We'll conclude with a final digit-based analysis, this time taking a look at commonality by last n (1-5) digits used. We'll leave the tabular output as an exercise for the reader (rest assured, it'll be there in the final version).
passwords$last.2 <- str_extract(passwords$orig, "[0-9]{2}$")
passwords$last.3 <- str_extract(passwords$orig, "[0-9]{3}$")
passwords$last.4 <- str_extract(passwords$orig, "[0-9]{4}$")
passwords$last.5 <- str_extract(passwords$orig, "[0-9]{5}$")
print(tail(sort(table(na.omit(passwords$last.2))),10))
## 88 69 13 21 99 11 12 01 00 23
## 1028 1052 1095 1150 1341 1620 1817 1992 2185 3027
print(tail(sort(table(na.omit(passwords$last.3))),10))
## 111 002 101 321 666 001 007 234 000 123
## 261 274 284 286 398 430 449 477 708 2164
print(tail(sort(table(na.omit(passwords$last.4))),10))
## 1985 1988 1987 2004 2005 2001 2003 2002 2000 1234
## 132 133 141 153 166 181 202 215 377 424
print(tail(sort(table(na.omit(passwords$last.5))),10))
## 77777 23123 21985 11988 00000 21984 11111 54321 23456 12345
## 13 14 15 16 18 21 23 25 68 110
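One thing to keep in mind: `str_extract()` returns `NA` for strings with no match, which is why `na.omit()` wraps each column before tabulating. The `NA`-dropping behavior, shown with a plain base-R vector:

```r
x <- c("12", NA, "99", NA, "12")
table(na.omit(x))  # NAs dropped before counting
# 12 99
#  2  1
```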
You can follow the development of `ripal` over on GitHub, and stay tuned to the DDSec blog as we incorporate some additional analytics and build a Shiny app around the tool. Use the comments to request features or enhancements, and file issues over at GitHub if things seem wonky.