probable-wordlists's People

Contributors

berzerk0, borekon, jimbergman, spmedia


probable-wordlists's Issues

Suggestion: Statistics about popularity.

Hello,
Maybe I am wrong, but I have a feeling that a large share of the passwords are "seen" only once across the different sources (when those sources are not simply copies or upgrades of each other). It would be useful to have some general guidance such as:

first million - words seen between 200 and 20 times
from 1,000k to 10,000k - words seen between 19 and 4 times
from 10,000k to 100,000k - words seen between 3 and 2 times
from 100,000k to the end - words seen 1 time only

This would give a better understanding of where probability ends and random/alphabetical ordering begins.

For example - even in the 120M wordlist I saw many passwords that are obviously from a random generator, and the chance of their being used by many people or in many places is close to zero.
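If per-password occurrence counts were ever published alongside the lists (a hypothetical two-column `count word` format is assumed here), the bands sketched above could be read straight off a histogram of the counts:

```shell
# Hypothetical sample: one "count word" pair per line.
printf '200 password\n20 letmein\n3 xq9rk2w\n1 zzz111\n1 qwerty9\n' > counts.txt

# Histogram: for each occurrence count, how many words have it.
awk '{ hist[$1]++ } END { for (c in hist) print c, hist[c] }' counts.txt | sort -rn
```

The point where the count-1 bucket starts to dominate is, in the suggestion's terms, where probability stops and noise begins.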

Suggestion: Human passwords only

Hello again :)
I think there is a way to generate a human-generated-only wordlist, though I am sure it will be tricky :).

What I mean is: we already have plenty of human words, names, city names, etc.
We already know how people put 4 instead of A, 1 instead of i, and so on.
If you search (case-insensitively, with number-replacement options) for all human words in the current biggest file and extract all matches, you will find (still ordered by probability) all the passwords that are certainly NOT generated by a random password generator.

I am sure there are people who can provide a good analysis of what constitutes a word in general - there are specific patterns that can be found only in human words, regardless of language. This way all kinds of slang, jargon, and offensive street words can be included, and for some funny reason they make up a HUGE % of all passwords :)

I believe this new list will be far more probable, especially for WPA.
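A minimal sketch of the extraction idea, assuming only a handful of common substitutions (4→a, 1→i, 3→e, 0→o, 5→s) plus trailing-digit stripping; all filenames here are placeholders:

```shell
# Placeholder sample mixing plain and "leet" spellings.
printf 'Password\nP455w0rd\npassword1\n' > wordlist.txt

# Lowercase, strip trailing digits, map digit-for-letter substitutions
# back to letters, and keep only the first occurrence of each result
# so the output stays in popularity order.
tr 'A-Z' 'a-z' < wordlist.txt \
  | sed -e 's/[0-9]*$//' -e 'y/41305/aieos/' \
  | awk '!seen[$0]++'
```

A real normalizer would need a much richer substitution table, but matching the normalized forms against a dictionary of known human words is exactly the extraction step the suggestion describes.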

Seedbox File Switchover

After the release of Version 2 in the next few days, the seedbox will go down briefly as I switch over from the old to the new files.

If you want to get the Rev 1 files, do so ASAP

Why so many trackers?

There are a lot of trackers in the included torrents. I don't have a good way to count them all, but it looks like well over 100, many of them just random IP addresses, not even domains. Is there some reason for that? Could that number come down to something more reasonable (say, 3 or 4)?

SecLists Integration

Great work here!

We'd like to include the content in SecLists. Is that ok with you?

license of password data

"This is released without license, but also without intent for commercial use."

This means that no commercial distribution can ship this password list as part of the default password cracking dictionary. Can you relicense this work under a more acceptable license such as the APL?

These Wordlists Don't Target Specific Individuals

While these lists are representative of the WORLD, they may not be representative of a particular PERSON.

People are more likely to use passwords that include some aspects of their personal lives, things that are important to them.

Is there some kind of tool that can create wordlists that are laser-guided to a specific individual?

Duplicated entries found on WPA-Length wordlists

There are duplicated entries for some words in the Top 31 Million, Top 102 Million and Top 1.8 Billion files. As an example, the word 'password' can be found on both line 1 and line 11,853,466 of the files.

I am not good with Unix commands, but the files can easily be fixed using SQL / MySQL. I already fixed them with the code I am sharing below, which also removes words shorter than 8 characters. For the 31 Million file, 302,363 entries were removed after cleaning. The code below is written for the 31 Million wordlist, but the same code works for the other wordlists by changing the name of the txt file:

/* Creates a Database named 'WPA' */
CREATE DATABASE WPA;
USE WPA;

/* Creates a table named 'Top31MillionWPA'
with two columns: a unique auto_incremental 'id' to keep the
popularity order and 'word' containing the text.
Uses utf8_bin to compare strings case-sensitively */

CREATE TABLE Top31MillionWPA(
id BIGINT NOT NULL AUTO_INCREMENT, Word varchar(255)
, PRIMARY KEY (id), INDEX IX_word (word)
) AUTO_INCREMENT=1 COLLATE utf8_bin;

/* Temporary settings to speed up loading the text file */
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;

/*Loads the text file into the table, into the 'word' column.
 The id column will get automatically populated
////// Change directory and filename accordingly //////
 */ 
LOAD DATA INFILE '/tmp/Top31Million-probable-WPA.txt' INTO TABLE Top31MillionWPA(word);

/* Back to default settings*/
set unique_checks = 1;
set foreign_key_checks = 1;
set sql_log_bin=1;

/* This keeps only the first entry of each set of duplicates, in a new table;
   this is faster than deleting the duplicates in place (at the cost of storage space) */
CREATE TABLE Top31MillionWPAclean SELECT Top31MillionWPA.* FROM Top31MillionWPA
LEFT OUTER JOIN(
	SELECT MIN(id) AS FirstID, word
	FROM Top31MillionWPA
	GROUP BY word
	) AS KeepFirst ON
	Top31MillionWPA.id = KeepFirst.FirstID
	WHERE KeepFirst.FirstID IS NOT NULL;

/* Delete original MySQL table  */
DROP TABLE Top31MillionWPA;	
	
/* CREATE Primary Key on new table to speed up the query */
ALTER TABLE Top31MillionWPAclean
ADD PRIMARY KEY (id);
 
/* Create clean text file keeping the popularity order. 
Also, the output is only words of length >= 8 characters
////// Change directory and filename accordingly //////
 */
SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
FROM Top31MillionWPAclean WHERE LENGTH(word)>=8 ORDER BY id ASC;

/* Delete the MySQL table  */
DROP TABLE Top31MillionWPAclean;
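For anyone who would rather avoid MySQL, the same cleanup (entries shorter than 8 characters dropped, first occurrence wins, popularity order preserved) is a one-line awk filter. Note that it holds one copy of every unique line in memory, so the largest files need a machine with plenty of RAM. A tiny stand-in file is used here:

```shell
# Tiny stand-in for the real wordlist file.
printf 'password\nshort\npassword\nsomething8\n' > wpa-list.txt

# Keep lines of length >= 8 and, of those, only the first occurrence,
# preserving the original (popularity) order. One pass, no sorting.
awk 'length($0) >= 8 && !seen[$0]++' wpa-list.txt > wpa-list-clean.txt
cat wpa-list-clean.txt
```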

Further de-duplication for rules cracking

Great project, thanks for taking the time.

Food for thought .. typically when using hashcat I like to run through and pull out the straight matches, then switch to rules like KoreLogic or the built-in sets. To that end, having various permutations in the file reduces efficiency because the rules will catch them anyway. For example, having "password" in the list would suffice, since "Password0", "p455w0rd" and "Pa55word" would all be generated by the most common mungers. Sure, rules on top of a munged version might produce more words, but there are better ways of layering rules on top of each other more deliberately.

Anyway, as long as you are on the path of creating derivative password lists, one that is normalized for munging rules would be something to think about. For my purposes I just strip out the easy stuff -- tolower it all, strip off leading and trailing single digits, replace mid-stream digits with corresponding letters, etc.

cheers

$ egrep '^[Pp][aA4][sS5]{2}w[oO0]rd[0-9]{0,2}$' Top125Thousand-probable.txt | head
password
password1
passw0rd
Password
Password1
pa55word
password2
pa55w0rd
password12
password01
$ egrep '^[Pp][aA4][sS5]{2}w[oO0]rd[0-9]{0,2}$' Top125Thousand-probable.txt | wc -l
106

Provide password occurrences

Could you please provide how often the passwords occur?

This way one could build adequately weighted probable password masks for hashcat.

Wordlists don't contain Non-ASCII Characters

Americans aren't the only ones with passwords - why not have special wordlists that include non-ASCII Characters?

I'm glad you asked.

As my knowledge level increases so does my ability to sort out lines. I have two methodologies that I will put to use for Rev 2.0

1. Grep out passwords containing characters from different alphabets

If there is an alphabet published in unicode on Wikipedia, I plan to grep for it

  • The Ukrainian alphabet is different from the Russian, which is different from the Belarusian, which is different from Common Cyrillic, which is different from the Serbian, which is different from...
  • This means we could have NATIONALLY targeted lists based on predominant languages
  • This isn't only true for Cyrillic-based alphabets. Dano-Norwegian is a different alphabet than Swedish, English... etc.
  • At the very least by language family
  • My sources still bias towards English, so the ASCII-only lists may simply dwarf the others, but they should still be available.

2. Make Sub-set lists based on source name.

  • I have many sources with "Rus", "ru", and "Russian" in the title. These lists are presumably from Russian sources - so perhaps they should be amalgamated on their own.
  • Some sources are obviously geared towards WPA, etc.
  • Caveat: Since my methodology is based on approximating accuracy using the number of files a given line appears in, these groups made of sub-set sources are likely to be precise, but inaccurate. An analogy would be me throwing darts. I might be landing them within a circle of less than 1", but the target is about 4ft over to the left.

In actuality, I'm awful at darts.

I welcome any suggestions - except on my darts game. I mean suggestions about the wordlists.

Please update

Please update these awesome lists with all the new breaches and etc. Awesome list.

Provide torrents

Instead of Mega links maybe providing torrents would also be nice. E.g. Mega requires an add-on for downloading files >1GB.

Easier Readme Guide

Add a link to an easier-to-follow readme guide, perhaps with a "what not to do" disclaimer.
Also add some scripts to make sure we have all the prerequisites we need, and maybe handle downloading.

Mix of line endings

It seems to me everything under Dictionary-Style has CRLF line endings. IMHO every file should have LF endings, so people don't end up with a mix of line endings after concatenating files.
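Normalizing the endings is a one-liner; tr is used here because it behaves the same everywhere (GNU `sed -i 's/\r$//'` or dos2unix would also work). Filenames are placeholders:

```shell
# Stand-in file with CRLF endings.
printf 'password\r\nletmein\r\n' > crlf-list.txt

# Delete every carriage return, leaving plain LF endings.
tr -d '\r' < crlf-list.txt > lf-list.txt
```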

Compress the files

Please compress the files. .tar.gz, .tar.xz and .zip versions of single files or entire folders (+ #4) would be great! Top35Million-probable.txt is 369 MB uncompressed; compressed with xz it's just 85 MB. One could check their contents with zcat or zgrep -a without uncompressing them first.
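The check-without-uncompressing workflow looks like this. A tiny placeholder file stands in for Top35Million-probable.txt, and gzip/zgrep are used here because they are available almost everywhere (xz with xzgrep compresses considerably better):

```shell
# Placeholder standing in for a large wordlist.
printf 'password\npassword1\nletmein\n' > sample-wordlist.txt

# Compress to a .gz next to the original.
gzip -c sample-wordlist.txt > sample-wordlist.txt.gz

# Grep inside the archive without expanding it to disk first.
zgrep -a '^password' sample-wordlist.txt.gz
```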

So is this all the passwords, or only those that showed up in the analysis twice?

Hello,

Is this all the individual passwords you found, or only those that showed up across the files at least twice?

If the latter, what about passwords that were unique to only one list (only one person had that password), or words from books, Wikipedia, Gutenberg, etc.?

Perhaps I'm just misunderstanding but would like this clarified....

Thanks for your work on this project!

Some duplicates may appear due to newlines - a judgement call.

In some of the Release 2.0 files, a whitespace character sat at the end of every line. In those cases, I removed the final whitespace character from all lines. However, some files were not consistent about beginning or ending lines with whitespace. In those instances, I left the whitespace in place, since I had reason to believe it was part of the data.

This may cause the appearance of duplicates that differ only by the inclusion of a whitespace character.

I am labeling this as "won't fix" since it doesn't appear to be feasible to do so.

Finicky Torrents

As of now, the torrents are finicky.
I can get some people seeding, but not others. Sometimes it stalls out.
I haven't spotted a pattern to how and why, but I suspect it has to do with trackers.

If you have found your torrent has stalled, first try pausing and resuming, or using the "update tracker" option in your client.

Personally, I can get them to leech onto one of my computers using Deluge, but not qBittorrent.
However, I have seen some downloaders that are downloading successfully with qBittorrent, so that seems inconclusive.

Anyone have any ideas?

Passwords without spaces

Hi, I'm new to all of this and I'm using Kali Linux.
I just downloaded these word-lists and opened some of them, and I saw that there aren't any spaces between the passwords. How can I fix this without manually adding spaces?
Or is there no need for spaces when doing a dictionary attack?

I do this for research purposes only of course.

Are passwords for the same mail address deduplicated?

Looking through recent leaks, I found mail:password combos that appear particularly often.
This, however, does not indicate the password is more commonly used; it should still count as a single occurrence.
Is this taken into account?

Full database size

I'm doing some analyses based on the appearances data now added, but two specific numbers would be helpful in characterizing the full dataset that these top X appearances are then extracted from.

(1) How many unique passwords (i.e., >=1 appearance) were present in the full database? I.e., the "nearly 13 billion" value, but I would appreciate the specific number.

(2) What is the total number of password appearances in the full database, i.e., the sum of the appearances column across all nearly 13 billion passwords.
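Given a per-password appearances file (a hypothetical two-column `word count` layout is assumed here), both numbers fall out of one awk pass:

```shell
# Hypothetical sample: word and appearance count per line.
printf 'password 4\nletmein 3\n123456 5\n' > appearances.txt

# (1) unique passwords = number of lines;
# (2) total appearances = sum of the count column.
awk '{ total += $2 } END { print NR, total }' appearances.txt
```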

1.1 repo size

Following the comments about the reduced repo size, I just tried to clone it, but it still appears to be absolutely massive:

Cloning into 'Probable-Wordlists'...
remote: Counting objects: 1649, done.
remote: Compressing objects: 100% (7/7), done.
receiving objects:  31% (522/1649), 1.83 GiB | 6.67 MiB/s

I am assuming this is because the old versions of the files are still in the commit history, could they be removed using BFG?
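Until the history is rewritten on the maintainer's side (with BFG or similar), a shallow clone is a workaround on the user's side: it fetches only the latest commit, so the superseded versions of the large files are never downloaded. The URL matches the repo shown in the clone output above:

```shell
# Fetch only the most recent commit; old versions of the large
# files in earlier history are skipped entirely.
git clone --depth 1 https://github.com/berzerk0/Probable-Wordlists.git
```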

Rev 2 isn't released yet

TLDR;
I said Rev 2 would be out mid-July 2017. That's now.
The new estimate is mid-August.

I am not close to release.

I had a major setback due to a minor script typo that required me to do about 30% of the total Rev 2 work all over again. C'est la vie, or perhaps shikata ga nai is better here.

I'm pretty much back to where I was before the typo, if not farther along.
There's a lot of manual work that needs to be done that isn't script friendly.
I see bash in my sleep, but luckily I am nearing the end of that portion.

Next up is the stage where I just set it up to run and go about my business.
New estimate is Mid-August.

Some questions for v. 1.2

Hi,

I have a few questions. You have a lot of funny things in the list:

  1. Passwords from 1 to 4 characters - brute-force handles these quickly, and dropping them saves approx. 370 MB.

  2. Passwords consisting only of numbers. Up to 9 characters, brute-force is faster; above 10 characters such passwords are rather rare. You could save another 2 GB (or put them in a separate file).

  3. E-mail addresses are rarely used as passwords.

  4. Passwords consisting only of special characters. These are also rather rare.

  5. Very long lines with code fragments and MD5 hashes (32 characters and longer). Most are definitely not passwords but garbage that the original list compilers were too lazy to eliminate.

  6. Special characters a la "&036;" in passwords. Most of these were produced by bad conversions between UTF-8, Windows codepages and UTF-16. Such things need to be converted back.

Regards,

John
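Most of the categories John lists can be filtered mechanically; a rough grep sketch (thresholds and patterns are illustrative, and the e-mail regex is deliberately crude):

```shell
# Sample covering the categories: too short, digits-only, an e-mail
# address, special-characters-only, an MD5 hash, and one keeper.
printf 'ab\n12345678\nuser@example.com\n!!!???\nd41d8cd98f00b204e9800998ecf8427e\npassword\n' > raw-list.txt

# Drop lines of 1-4 chars, digits-only lines, e-mail-shaped lines,
# lines with no letters or digits at all, and 32-hex-char MD5 hashes.
grep -vE '^.{1,4}$|^[0-9]+$|^[^@]+@[^@]+\.[^@]+$|^[^a-zA-Z0-9]+$|^[0-9a-f]{32}$' raw-list.txt
```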

Note/warn about size

You should state the size of the whole repo in the Readme, so people are not surprised when cloning it… 😄
