Statistical Analysis of Password Strength via Gawker’s Leaked Database
This past weekend, Gawker Media was hacked and its user account database was leaked online. The database contained about 1.3 million rows of information containing usernames, e-mail addresses, and passwords (encrypted via DES). This security breach is unfortunate for people whose information is contained within that database, but the silver lining is that it provides a rare opportunity for statistics nerds like me to analyze some otherwise completely unobtainable data.
Because the passwords were encrypted using such an out-of-date scheme (tsk, tsk, Gawker), about 200,000 of the passwords contained in the database have been decrypted. Naturally, the passwords that were cracked were relatively weak. For example, all 2641 accounts that used some trivial modification of “password” or “qwerty” as their password were decrypted. In this post I will look at trends in which users’ passwords were cracked to gain insight into which users do and do not create strong passwords.
It should of course be made clear that, because this data comes from a single database, the results that follow may not be representative of the population as a whole, but rather may be skewed by the fact that people with Gawker accounts are generally more “techy” than the average internet user.
Preliminaries: Cleaning Up the Database
The database had to be cleaned significantly before it could be of much use statistically, so some of the numbers here may differ slightly from the raw numbers you see from news outlets or if you download the raw database yourself. The numbers here are the result of removing any incomplete rows from the database (i.e., rows missing a password, e-mail address, or both) and removing any accounts that were clearly created by spambots (I’m only interested in the password strength of real users).
Also, I will only look at accounts that contain an e-mail address with a domain that was registered in the database at least 50 times. This restriction is in place partly because it is extremely difficult to compute any sort of meaningful statistics on something with a sample size that is much smaller than 50, and it is partly due to the fact that Gawker doesn’t require verified e-mail addresses (so 46993 of the 52593 domain names listed in the database were used by exactly one person, many of which are clearly fake and/or for SPAM).
After making the aforementioned “fixes” to the database, there are 412670 accounts, 157794 (38.2%) of which had their password decrypted.
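The two mechanical filters described above (dropping incomplete rows and keeping only domains that appear at least 50 times) can be sketched roughly as follows. This is only an illustration, not the script actually used for this post: the `(username, email, password_hash)` row layout and the `clean_accounts` name are assumptions, addresses without an `@` are dropped as an extra assumption, and spambot removal is not covered.

```python
from collections import Counter

def clean_accounts(rows, min_domain_count=50):
    """Keep complete rows whose e-mail domain occurs at least min_domain_count times.

    rows is assumed to be an iterable of (username, email, password_hash) tuples;
    that layout is hypothetical, not the real dump format.
    """
    complete = []
    for username, email, password_hash in rows:
        # Drop rows missing a password, an e-mail address, or both.
        if not email or not password_hash:
            continue
        local, sep, domain = email.strip().lower().rpartition("@")
        # Assumption: an address with no "@" has no usable domain, so skip it.
        if not sep or not local or not domain:
            continue
        complete.append((username, domain, password_hash))

    # Count how often each domain occurs among the complete rows.
    counts = Counter(domain for _, domain, _ in complete)
    return [row for row in complete if counts[row[1]] >= min_domain_count]
```

Counting domains after the incomplete rows are removed is a design choice made here for simplicity; counting over the raw database instead would keep a few more borderline domains.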
Password Strength by Domain Name
The following table displays the 10 most frequently-occurring domain names used for e-mail addresses in the database along with how many users of the domain had their password cracked.
Domain | Total Accounts | Decrypted Passwords | Decryption % |
---|---|---|---|
gmail.com | 158031 | 50530 | 32.0% |
yahoo.com | 94147 | 40964 | 43.5% |
hotmail.com | 66752 | 27332 | 40.9% |
aol.com | 17534 | 8151 | 46.5% |
comcast.net | 7222 | 2801 | 38.8% |
msn.com | 5544 | 2250 | 40.6% |
mac.com | 4951 | 1750 | 35.3% |
sbcglobal.net | 3896 | 1667 | 42.8% |
hotmail.co.uk | 3204 | 1476 | 46.1% |
verizon.net | 2211 | 860 | 38.9% |
The following table shows the z-values associated with the statistical test that the two given domains have the same proportion of users with strong passwords. A positive value means that the column domain had a larger share of its passwords decrypted than the row domain. For a two-sided test, differences with |z| above roughly 2.58 are statistically significant at the α = 0.01 level. Notice in particular that gmail.com users have stronger passwords than users of any of the other top-10 domain names, while aol.com and hotmail.co.uk users have the weakest passwords.
Domain | Yahoo | Hotmail | AOL | Comcast | MSN | Mac | SBC | HotmailUK | Verizon |
---|---|---|---|---|---|---|---|---|---|
GMail | 58.28 | 40.84 | 38.65 | 12.10 | 13.48 | 5.00 | 14.27 | 16.89 | 6.92 |
Yahoo | – | -10.26 | 7.29 | -7.81 | -4.27 | -11.31 | -0.89 | 2.87 | -4.33 |
Hotmail | – | – | 13.23 | -3.55 | -0.53 | -7.74 | 2.27 | 5.75 | -1.93 |
AOL | – | – | – | -11.09 | -7.70 | -13.94 | -4.19 | -0.44 | -6.75 |
Comcast | – | – | – | – | 2.06 | -3.85 | 4.11 | 6.98 | 0.09 |
MSN | – | – | – | – | – | -5.52 | 2.14 | 5.00 | -1.37 |
Mac | – | – | – | – | – | – | 7.14 | 9.67 | 2.88 |
SBC | – | – | – | – | – | – | – | 2.77 | -2.97 |
HotmailUK | – | – | – | – | – | – | – | – | -5.24 |
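The z-values in the table are consistent with a pooled two-proportion z-test, taking the column domain's decryption rate minus the row domain's. The sketch below is a minimal version of that calculation using only the standard library; the function names are made up for illustration, and the two-sided p-value is an assumption about how the test was framed.

```python
import math

def two_proportion_z(cracked_row, total_row, cracked_col, total_col):
    """Pooled two-proportion z statistic (column rate minus row rate)."""
    p_row = cracked_row / total_row
    p_col = cracked_col / total_col
    # Under the null hypothesis both domains share one underlying proportion.
    pooled = (cracked_row + cracked_col) / (total_row + total_col)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_row + 1 / total_col))
    return (p_col - p_row) / std_err

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# GMail (row) versus Yahoo (column), using the counts from the first table.
z_gmail_yahoo = two_proportion_z(50530, 158031, 40964, 94147)
# Comcast (row) versus Verizon (column).
z_comcast_verizon = two_proportion_z(2801, 7222, 860, 2211)

for label, z in [("GMail/Yahoo", z_gmail_yahoo), ("Comcast/Verizon", z_comcast_verizon)]:
    significant = abs(z) > 2.576  # two-sided critical value for alpha = 0.01
    print(f"{label}: z = {z:.2f}, p = {two_sided_p(z):.3g}, significant = {significant}")
```

With the counts above, the GMail/Yahoo statistic comes out at approximately 58.28 and the Comcast/Verizon statistic at approximately 0.09, in line with the corresponding cells of the table.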
Educational Institutions
Not surprisingly, users who entered an e-mail address from an educational institution typically had stronger passwords than the general population. Of the 2092 users who provided a college or university-based e-mail address, only 697 (33.3%) had their password decrypted. This proportion is significantly lower than the corresponding proportion for the general population (z = 4.64, p < 0.001).
However, two universities stood out as having particularly weak passwords: of the 56 users who used a University of Texas e-mail address, 27 (48.2%) had their password decrypted, and similarly 101 (45.1%) of 224 New York University passwords were decrypted.
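The post does not spell out exactly how a subgroup is compared with the general population. One plausible reading, sketched below under that assumption, is a pooled two-proportion z-test of the subgroup against every other account in the cleaned database, with the sign taken as the rest's decryption rate minus the subgroup's. The `group_vs_rest_z` name is hypothetical.

```python
import math

def group_vs_rest_z(group_cracked, group_total, all_cracked, all_total):
    """Pooled z statistic comparing a subgroup with all remaining accounts.

    Positive values mean the subgroup had a smaller share of passwords
    decrypted than the rest of the database.
    """
    rest_cracked = all_cracked - group_cracked
    rest_total = all_total - group_total
    p_group = group_cracked / group_total
    p_rest = rest_cracked / rest_total
    # Pooling the subgroup and the rest gives back the overall proportion.
    pooled = all_cracked / all_total
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / group_total + 1 / rest_total))
    return (p_rest - p_group) / std_err

# Educational addresses: 697 of 2092 decrypted, out of 157794 of 412670 overall.
z_edu = group_vs_rest_z(697, 2092, 157794, 412670)
print(f"z = {z_edu:.2f}")
```

For the educational addresses this gives a value close to the reported z = 4.64. Other ways of framing the comparison, such as a one-sample test against the overall 38.2% rate, give slightly different numbers.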
ISP-Provided E-Mail Users
Users who used an e-mail address provided to them by their ISP (such as something@comcast.net) typically had weaker passwords than the general population, perhaps because tech-unsavvy folks are less likely to go out and get a new e-mail address for themselves at a place like GMail. Of the 31667 users who provided an ISP-based e-mail address, 13053 (41.2%) had their password decrypted. This proportion is significantly higher than the corresponding proportion for the general population (z = -11.36, p < 0.001).
E-Mail Addresses with Typos
Also unsurprisingly, users who entered an obvious typo in their e-mail address were much more likely to have a weak password than people who entered their e-mail address correctly (by “obvious typo” I basically mean an e-mail address containing a typo of a common domain name, such as “fred@yahoo,com” or “fred@hotmail”). Of the 530 users with a typo in their e-mail address, 280 (52.8%) had passwords that were decrypted. This proportion is significantly higher than the average (z = -6.87, p < 0.001).
Password Strength by Country
The following table shows the strength of user passwords based on the country associated with their e-mail address. Of course some e-mail addresses provide no information about the user’s country, so domains that serve a largely international market (such as gmail.com, mac.com and aim.com) are excluded from this analysis.
Country | Total Accounts | Decrypted Passwords | Decryption % |
---|---|---|---|
India | 3129 | 1448 | 46.3% |
United Kingdom | 6874 | 3057 | 44.5% |
China | 1411 | 600 | 42.5% |
Canada | 2825 | 1160 | 41.1% |
United States | 30891 | 12507 | 40.5% |
Germany | 1378 | 484 | 35.1% |
Russia | 2223 | 533 | 24.0% |
So Russia and Germany are the big winners when it comes to password strength, while India and the United Kingdom seem to have the weakest passwords. The following table shows the z-values associated with the statistical test that the two given countries have the same proportion of users with strong passwords. A negative value means that the column country had a smaller share of its passwords decrypted than the row country. For a two-sided test, differences with |z| above roughly 2.58 are statistically significant at the α = 0.01 level.
Country | UK | China | Canada | US | Germany | Russia |
---|---|---|---|---|---|---|
India | -1.67 | -2.32 | -4.03 | -6.26 | -6.94 | -16.62 |
UK | – | -1.31 | -3.06 | -6.05 | -6.37 | -17.16 |
China | – | – | -0.88 | -1.49 | -3.97 | -11.72 |
Canada | – | – | – | -0.57 | -3.67 | -12.73 |
United States | – | – | – | – | -3.95 | -15.37 |
Germany | – | – | – | – | – | -7.18 |
Attached below is an Excel spreadsheet containing significantly more detailed information than the snippets contained in this post (though of course all passwords, e-mail addresses and personally identifiable information have been removed).
Download: Gawker Database Statistics [Excel spreadsheet]
Relative password strength by country is likely to have more to do with the cracking methods used than actual strength…
Assuming a dictionary crack was performed, it is likely that an English dictionary was used, and using a German or Russian dictionary might yield different results.
Similarly, brute-force tools such as John the Ripper are not linear: they attempt to guess common words first, and most such tools are typically trained towards English words.
Also, the character set used by the brute-forcing software may not include Russian Cyrillic characters.
What would be interesting, however, is an automated tool to analyse a cracked password database and produce statistics and pretty management-friendly graphs of password strength. Stats like how many passwords were dictionary words, how many were username based, how many were very short, how many could not be cracked at all, etc.
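As a rough idea of what the counting part of such a tool might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `(username, cracked_password_or_None)` input layout, the category names, the six-character cut-off for "very short", and the word list passed in as `dictionary`. Graphing is left out.

```python
from collections import Counter

def classify_password(username, password, dictionary, short_length=6):
    """Put one cracked password into a single coarse category."""
    lowered = password.lower()
    if len(password) < short_length:
        return "very short"
    if username and username.lower() in lowered:
        return "username-based"
    if lowered in dictionary:
        return "dictionary word"
    return "other"

def summarize(accounts, dictionary):
    """Count categories over (username, cracked_password_or_None) pairs."""
    counts = Counter()
    for username, password in accounts:
        if password is None:
            # None marks a password that could not be cracked at all.
            counts["not cracked"] += 1
        else:
            counts[classify_password(username, password, dictionary)] += 1
    return counts

# Made-up sample data, not taken from the leaked database.
sample = [("fred", "fred123"), ("amy", "abc"), ("bob", "password"), ("eve", None)]
summary = summarize(sample, dictionary={"password", "qwerty"})
print(dict(summary))
```

The categories are checked in a fixed order, so a password that is both short and a dictionary word is counted only as "very short"; a real tool would probably want to report overlapping categories separately.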