probable-wordlists's People

Contributors

berzerk0, borekon, jimbergman, spmedia


probable-wordlists's Issues

Suggestion: Statistics about popularity.

Hello,
Maybe I am wrong, but I have a feeling that a large share of the passwords are "seen" only once across the different sources (when those sources are not simply copies or upgrades of each other). It would be useful to have some general guidance such as:

first million - words seen between 200 and 20 times
from 1,000k to 10,000k - words seen between 19 and 4 times
from 10,000k to 100,000k - words seen between 3 and 2 times
from 100,000k to the end - words seen 1 time only

This would give a better understanding of where probability ends and random/alphabetical ordering begins.

For example - even in the 120M wordlist I saw many passwords that are obviously from a random generator, and the chance of their being used by many people or in many places is close to zero.
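If per-password occurrence counts were ever published alongside the lists (a hypothetical two-column `count word` format is assumed here), the bands sketched above could be read straight off a histogram of the counts:

```shell
# Hypothetical sample: one "count word" pair per line.
printf '200 password\n20 letmein\n3 xq9rk2w\n1 zzz111\n1 qwerty9\n' > counts.txt

# Histogram: for each occurrence count, how many words have it.
awk '{ hist[$1]++ } END { for (c in hist) print c, hist[c] }' counts.txt | sort -rn
```

The point where the count-1 bucket starts to dominate is, in the suggestion's terms, where probability stops and noise begins.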

Suggestion: Human passwords only

Hello again :)
I think there is a way to generate a human-generated-only wordlist, though I am sure it will be tricky :).

What I mean is: we already have plenty of human words, names, city names, etc.
We already know how people put 4 instead of A, 1 instead of i, and so on.
If you search (case-insensitively, with number-replacement options) for all human words in the current biggest file and extract all matches, you will find (still ordered by probability) all the passwords that are certainly NOT generated by a random password generator.

I am sure there are people who can provide a good analysis of what constitutes a word in general - there are specific patterns that can be found only in human words, regardless of language. This way all kinds of slang, jargon, and offensive street words can be included, and for some funny reason they make up a HUGE % of all passwords :)

I believe this new list will be far more probable, especially for WPA.
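A minimal sketch of the extraction idea, assuming only a handful of common substitutions (4→a, 1→i, 3→e, 0→o, 5→s) plus trailing-digit stripping; all filenames here are placeholders:

```shell
# Placeholder sample mixing plain and "leet" spellings.
printf 'Password\nP455w0rd\npassword1\n' > wordlist.txt

# Lowercase, strip trailing digits, map digit-for-letter substitutions
# back to letters, and keep only the first occurrence of each result
# so the output stays in popularity order.
tr 'A-Z' 'a-z' < wordlist.txt \
  | sed -e 's/[0-9]*$//' -e 'y/41305/aieos/' \
  | awk '!seen[$0]++'
```

A real normalizer would need a much richer substitution table, but matching the normalized forms against a dictionary of known human words is exactly the extraction step the suggestion describes.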

Seedbox File Switchover

After the release of Version 2 in the next few days, the seedbox will go down briefly as I switch over from the old to the new files.

If you want to get the Rev 1 files, do so ASAP

Why so many trackers?

There are a lot of trackers in the included torrents. I don't have a good way to count them all, but it looks like well over 100, many of them just random IP addresses, not even domains. Is there some reason for that? Could that number come down to something more reasonable (say, 3 or 4)?

SecLists Integration

Great work here!

We'd like to include the content in SecLists. Is that ok with you?

license of password data

"This is released without license, but also without intent for commercial use."

This means that no commercial distribution can ship this password list as part of the default password cracking dictionary. Can you relicense this work under a more acceptable license such as the APL?

These Wordlists Don't Target Specific Individuals

While these lists are representative of the WORLD, they may not be representative of a particular PERSON.

People are more likely to use passwords that include some aspects of their personal lives, things that are important to them.

Is there some kind of tool that can create wordlists that are laser-guided to a specific individual?

Duplicated entries found on WPA-Length wordlists

There are duplicated entries for some words in the Top 31 Million, Top 102 Million and Top 1.8 Billion files. As an example, the word 'password' can be found on both line 1 and line 11,853,466 of the files.

I am not good with Unix commands, but the files can easily be fixed using SQL / MySQL. I already fixed them with the code I am sharing below, which also removes words shorter than 8 characters. For the 31 Million file, 302,363 entries were removed after cleaning. The code below is written for the 31 Million wordlist, but the same code works for the other wordlists by changing the name of the txt file:

/* Creates a Database named 'WPA' */
CREATE DATABASE WPA;
USE WPA;

/* Creates a table named 'Top31MillionWPA'
with two columns: a unique auto_incremental 'id' to keep the
popularity order and 'word' containing the text.
Uses utf8_bin to compare strings case-sensitively */

CREATE TABLE Top31MillionWPA(
id BIGINT NOT NULL AUTO_INCREMENT, Word varchar(255)
, PRIMARY KEY (id), INDEX IX_word (word)
) AUTO_INCREMENT=1 COLLATE utf8_bin;

/* Temporary settings to speed up loading the text file */
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;

/*Loads the text file into the table, into the 'word' column.
 The id column will get automatically populated
////// Change directory and filename accordingly //////
 */ 
LOAD DATA INFILE '/tmp/Top31Million-probable-WPA.txt' INTO TABLE Top31MillionWPA(word);

/* Back to default settings*/
set unique_checks = 1;
set foreign_key_checks = 1;
set sql_log_bin=1;

/* This keeps only the first entry of each set of duplicates, in a new table;
   this is faster than deleting the duplicates in place (at the cost of storage space) */
CREATE TABLE Top31MillionWPAclean SELECT Top31MillionWPA.* FROM Top31MillionWPA
LEFT OUTER JOIN(
	SELECT MIN(id) AS FirstID, word
	FROM Top31MillionWPA
	GROUP BY word
	) AS KeepFirst ON
	Top31MillionWPA.id = KeepFirst.FirstID
	WHERE KeepFirst.FirstID IS NOT NULL;

/* Delete original MySQL table  */
DROP TABLE Top31MillionWPA;	
	
/* CREATE Primary Key on new table to speed up the query */
ALTER TABLE Top31MillionWPAclean
ADD PRIMARY KEY (id);
 
/* Create clean text file keeping the popularity order. 
Also, the output is only words of length >= 8 characters
////// Change directory and filename accordingly //////
 */
SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
FROM Top31MillionWPAclean WHERE LENGTH(word)>=8 ORDER BY id ASC;

/* Delete the MySQL table  */
DROP TABLE Top31MillionWPAclean;
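For anyone who would rather avoid MySQL, the same cleanup (entries shorter than 8 characters dropped, first occurrence wins, popularity order preserved) is a one-line awk filter. Note that it holds one copy of every unique line in memory, so the largest files need a machine with plenty of RAM. A tiny stand-in file is used here:

```shell
# Tiny stand-in for the real wordlist file.
printf 'password\nshort\npassword\nsomething8\n' > wpa-list.txt

# Keep lines of length >= 8 and, of those, only the first occurrence,
# preserving the original (popularity) order. One pass, no sorting.
awk 'length($0) >= 8 && !seen[$0]++' wpa-list.txt > wpa-list-clean.txt
cat wpa-list-clean.txt
```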

Further de-duplication for rules cracking

Great project, thanks for taking the time.

Food for thought .. typically when using hashcat I like to run through and pull out the straight matches, then switch to rules like KoreLogic or the built-in sets. To that end, having various permutations in the file reduces efficiency because the rules will catch them anyway. For example, having "password" in the list would suffice, since "Password0", "p455w0rd" and "Pa55word" would all be generated by the most common mungers. Sure, rules on top of a munged version might produce more words, but there are better ways of layering rules on top of each other more deliberately.

Anyway, as long as you are on the path of creating derivative password lists, one that is normalized for munging rules would be something to think about. For my purposes I just strip out the easy stuff -- tolower it all, strip off leading and trailing single digits, replace mid-stream digits with corresponding letters, etc.

cheers

$ egrep '^[Pp][aA4][sS5]{2}w[oO0]rd[0-9]{0,2}$' Top125Thousand-probable.txt | head
password
password1
passw0rd
Password
Password1
pa55word
password2
pa55w0rd
password12
password01
$ egrep '^[Pp][aA4][sS5]{2}w[oO0]rd[0-9]{0,2}$' Top125Thousand-probable.txt | wc -l
106

Provide password occurrences

Could you please provide how often the passwords occur?

This way one could build adequately weighted probable password masks for hashcat.

Wordlists don't contain Non-ASCII Characters

Americans aren't the only ones with passwords - why not have special wordlists that include non-ASCII Characters?

I'm glad you asked.

As my knowledge level increases so does my ability to sort out lines. I have two methodologies that I will put to use for Rev 2.0

1. Grep out passwords containing characters from different alphabets

If there is an alphabet published in unicode on Wikipedia, I plan to grep for it

  • The Ukrainian alphabet is different from the Russian, which is different from the Belarusian, which is different from Common Cyrillic, which is different from the Serbian, which is different from...
  • This means we could have NATIONALLY targeted lists based on predominant languages
  • This isn't only true for Cyrillic-based alphabets. Dano-Norwegian is a different alphabet than Swedish, English... etc.
  • At the very least by language family
  • My sources still bias towards English, so the ASCII-only lists may simply dwarf the others, but they should still be available.

2. Make Sub-set lists based on source name.

  • I have many sources with "Rus", "ru", and "Russian" in the title. These lists are presumably from Russian sources - so perhaps they should be amalgamated on their own.
  • Some sources are obviously geared towards WPA, etc.
  • Caveat: Since my methodology is based on approximating accuracy using the number of files a given line appears in, these groups made of sub-set sources are likely to be precise, but inaccurate. An analogy would be me throwing darts. I might be landing them within a circle of less than 1", but the target is about 4ft over to the left.

In actuality, I'm awful at darts.

I welcome any suggestions - except on my darts game. I mean suggestions about the wordlists.

Please update

Please update these awesome lists with all the new breaches and etc. Awesome list.

Provide torrents

Instead of Mega links maybe providing torrents would also be nice. E.g. Mega requires an add-on for downloading files >1GB.

Easier Readme Guide

Add a link to an easier-to-follow readme guide, perhaps with a "what not to do" disclaimer.
Also add some scripts to make sure we have all the prerequisites we need, and maybe handle downloading.

Mix of line endings

It seems to me everything under Dictionary-Style has CRLF line endings. IMHO every file should have LF endings, so people don't end up with a mix of line endings after concatenating files.
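Normalizing the endings is a one-liner; tr is used here because it behaves the same everywhere (GNU `sed -i 's/\r$//'` or dos2unix would also work). Filenames are placeholders:

```shell
# Stand-in file with CRLF endings.
printf 'password\r\nletmein\r\n' > crlf-list.txt

# Delete every carriage return, leaving plain LF endings.
tr -d '\r' < crlf-list.txt > lf-list.txt
```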

Compress the files

Please compress the files. .tar.gz, .tar.xz and .zip versions of single files or entire folders (+ #4) would be great! Top35Million-probable.txt is 369 MB uncompressed; compressed with xz it's just 85 MB. One could check their contents with zcat or zgrep -a without uncompressing them first.
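The check-without-uncompressing workflow looks like this. A tiny placeholder file stands in for Top35Million-probable.txt, and gzip/zgrep are used here because they are available almost everywhere (xz with xzgrep compresses considerably better):

```shell
# Placeholder standing in for a large wordlist.
printf 'password\npassword1\nletmein\n' > sample-wordlist.txt

# Compress to a .gz next to the original.
gzip -c sample-wordlist.txt > sample-wordlist.txt.gz

# Grep inside the archive without expanding it to disk first.
zgrep -a '^password' sample-wordlist.txt.gz
```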

So is this all the passwords, or only those that showed up in the analysis twice?

Hello,

Is this all the individual passwords you found, or only those that showed up across the files at least twice?

If the latter, what about passwords that were unique to only one list (only one person had that password), or words from books, Wikipedia, Gutenberg, etc.?

Perhaps I'm just misunderstanding but would like this clarified....

Thanks for your work on this project!

Some duplicates may appear due to newlines - a judgement call.

In some of the Release 2.0 files, a whitespace character sat at the end of every line. In those cases, I removed the final whitespace character from all lines. However, some files were not consistent about beginning or ending lines with whitespace. In those instances, I left the whitespace in place, since I had reason to believe it was part of the data.

This may cause the appearance of duplicates that differ only by the inclusion of a whitespace character.

I am labeling this as "won't fix" since it doesn't appear to be feasible to do so.

Finicky Torrents

As of now, the torrents are finicky.
I can get some people seeding, but not others. Sometimes it stalls out.
I haven't spotted a pattern to how and why, but I suspect it has to do with trackers.

If you have found your torrent has stalled, first try pausing and resuming, or using the "update tracker" option in your client.

Personally, I can get them to leech onto one of my computers using Deluge, but not qBittorrent.
However, I have seen some downloaders that are downloading successfully with qBittorrent, so that seems inconclusive.

Anyone have any ideas?

Passwords without spaces

Hi, I'm new to all of this and I'm using Kali Linux.
I just downloaded these word-lists and opened some of them, and I saw that there aren't any spaces between the passwords. How can I fix this without manually adding spaces?
Or is there no need for spaces when doing a dictionary attack?

I do this for research purposes only of course.

Are passwords for the same mail address deduplicated?

Looking through recent leaks, I found mail:password combos that appear particularly often.
This, however, does not indicate the password is more commonly used; it should still count as a single occurrence.
Is this taken into account?

Full database size

I'm doing some analyses based on the appearances data now added, but two specific numbers would be helpful in characterizing the full dataset that these top X appearances are then extracted from.

(1) How many unique passwords (i.e., >=1 appearance) were present in the full database? I.e., the "nearly 13 billion" value, but I would appreciate the specific number.

(2) What is the total number of password appearances in the full database, i.e., the sum of the appearances column across all nearly 13 billion passwords.
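Given a per-password appearances file (a hypothetical two-column `word count` layout is assumed here), both numbers fall out of one awk pass:

```shell
# Hypothetical sample: word and appearance count per line.
printf 'password 4\nletmein 3\n123456 5\n' > appearances.txt

# (1) unique passwords = number of lines;
# (2) total appearances = sum of the count column.
awk '{ total += $2 } END { print NR, total }' appearances.txt
```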

1.1 repo size

Following the comments about the reduced repo size, I just tried to clone it, but it still appears to be absolutely massive:

Cloning into 'Probable-Wordlists'...
remote: Counting objects: 1649, done.
remote: Compressing objects: 100% (7/7), done.
receiving objects:  31% (522/1649), 1.83 GiB | 6.67 MiB/s

I am assuming this is because the old versions of the files are still in the commit history, could they be removed using BFG?
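Until the history is rewritten on the maintainer's side (with BFG or similar), a shallow clone is a workaround on the user's side: it fetches only the latest commit, so the superseded versions of the large files are never downloaded. The URL matches the repo shown in the clone output above:

```shell
# Fetch only the most recent commit; old versions of the large
# files in earlier history are skipped entirely.
git clone --depth 1 https://github.com/berzerk0/Probable-Wordlists.git
```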

Rev 2 isn't released yet

TLDR;
I said Rev 2 would be out mid-July 2017. That's now.
The new estimate is mid-August.

I am not close to release.

I had a major setback due to a minor script typo that required me to do about 30% of the total Rev 2 work all over again. C'est la vie, or perhaps shikata ga nai is better here.

I'm pretty much back to where I was before the typo, if not farther along.
There's a lot of manual work that needs to be done that isn't script friendly.
I see bash in my sleep, but luckily I am nearing the end of that portion.

Next up is the stage where I just set it up to run and go about my business.
New estimate is Mid-August.

Some questions for v. 1.2

Hi,

I have a few questions. You have a lot of funny things in the list:

  1. Passwords from 1 to 4 characters - brute-force handles these quickly, and dropping them saves approx. 370 MB.

  2. Passwords consisting only of numbers. Up to 9 characters, brute-force is faster; above 10 characters such passwords are rather rare. You could save another 2 GB (or put them in a separate file).

  3. E-mail addresses are rarely used as passwords.

  4. Passwords consisting only of special characters. These are also rather rare.

  5. Very long lines with code fragments and MD5 hashes (32 characters and longer). Most are definitely not passwords but garbage that the original list compilers were too lazy to eliminate.

  6. Special characters a la "&036;" in passwords. Most of these were produced by bad conversions between UTF-8, Windows codepages and UTF-16. Such things need to be converted back.

Regards,

John
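Most of the categories John lists can be filtered mechanically; a rough grep sketch (thresholds and patterns are illustrative, and the e-mail regex is deliberately crude):

```shell
# Sample covering the categories: too short, digits-only, an e-mail
# address, special-characters-only, an MD5 hash, and one keeper.
printf 'ab\n12345678\nuser@example.com\n!!!???\nd41d8cd98f00b204e9800998ecf8427e\npassword\n' > raw-list.txt

# Drop lines of 1-4 chars, digits-only lines, e-mail-shaped lines,
# lines with no letters or digits at all, and 32-hex-char MD5 hashes.
grep -vE '^.{1,4}$|^[0-9]+$|^[^@]+@[^@]+\.[^@]+$|^[^a-zA-Z0-9]+$|^[0-9a-f]{32}$' raw-list.txt
```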

Note/warn about size

You should state the size of the whole repo in the Readme, so people are not surprised when cloning it… 😄
