
english-words's Introduction

List Of English Words

A text file containing over 466k English words.

While searching for a list of English words (for an auto-complete tutorial) I found: https://stackoverflow.com/questions/2213607/how-to-get-english-language-word-database which refers to https://www.infochimps.com/datasets/word-list-350000-simple-english-words-excel-readable (archived).

No idea why infochimps put the word list inside an Excel (.xls) file.

I pulled out the words into a simple newline-delimited text file, which is more useful when building apps or importing into databases.

Copyright still belongs to them.

Files you may be interested in:

  • words.txt contains all words.
  • words_alpha.txt contains only [[:alpha:]] words (words that only have letters, no numbers or symbols). If you want a quick solution choose this.
  • words_dictionary.json contains all the words from words_alpha.txt in JSON format. If you are using Python, you can load this file and use it as a dictionary for faster lookups. Every word is assigned the value 1 in the dictionary.

See read_english_dictionary.py for example usage.
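
A minimal sketch of loading the JSON file for membership checks, assuming `words_dictionary.json` is in the working directory. The helper names `load_words` and `is_word` are illustrative and are not taken from `read_english_dictionary.py`.

```python
import json
from pathlib import Path


def load_words(path="words_dictionary.json"):
    """Load the JSON object that maps each word to 1 into a Python dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def is_word(candidate, words):
    """Return True if the lowercased candidate is a key of the dictionary."""
    return candidate.lower() in words
```

Typical use would be `words = load_words()` once at startup, followed by calls such as `is_word("Apple", words)`.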

english-words's People

Contributors

adi-g15, anurag-chauhan-289, arhell, arsho, awe23123, blairg23, bsoyka, chew, computingfreak, dimnikolos, etigerstudio, fbattello, grahampcharles, harshalaxman, innovativeinventor, iteles, jmonty42, joanchirinos, joechen2, kevinzwang, mikestopcontinues, nelsonic, nkamc, simonsteele24, treit


english-words's Issues

Y isn't sorted properly

Some words with the letter Y appear under the letter I instead.

Examples:
NYET appears in between NIESHOUT and NIETZSCHE

ABIRRITATIVE is followed by
ABYS
ABYSM
ABYSMAL
[ etc ]
before continuing with
ABYSSUS
ABISTON
ABIT

XXV -> XXX -> Z -> ZA -> ZABAEAN

Words like YACHT, YACCA, YARTH etc all appear under I

Did you perhaps use the Frisian alphabet when sorting the list?
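
One way to confirm and repair the ordering is sketched below. It assumes plain case-insensitive code-point ordering is the desired order (not a locale-aware collation), and the function names are hypothetical.

```python
def first_unsorted_pair(words):
    """Return the first adjacent pair that breaks case-insensitive order, or None."""
    keys = [w.casefold() for w in words]
    for i in range(len(keys) - 1):
        if keys[i] > keys[i + 1]:
            return words[i], words[i + 1]
    return None


def resort(words):
    """Return the words sorted by their case-folded code points."""
    return sorted(words, key=str.casefold)


sample = ["ABIRRITATIVE", "ABYS", "ABYSM", "ABISTON", "ABIT"]
print(first_unsorted_pair(sample))  # ('ABYSM', 'ABISTON')
print(resort(sample))
```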

2?

WHY DOES IT START WITH 2? 2 IS NOT A WORD. MY FRIEND AND I HAVE BEEN DRIVEN INSANE THESE LAST FEW DAYS TRYING TO FIGURE OUT A DICTIONARY DEFINITION OF 2 THAT WOULD PERMIT IT TO GO ON THIS LIST. PLEASE PLEASE PLEASE ANSWER QUICKLY. WE HAVE SUNK FAR TOO MUCH TIME INTO THIS AND WE NEED TO KNOW.

Megafauna not included

According to merriam-webster, "Megafauna" is a word that is not in here at all, and should be added.

Credits

Hi! Thank you so much for this. It is such a huge help. Just want to know how to properly credit this to them or to you.

Missing word: cryptocurrency

The word "cryptocurrency" exists in the English language (according to Merriam Webster dictionary) and is not in words_alpha.txt. This word should be added as it is a common word now.

Removing CR Saves 370098 bytes and Makes File POSIX Compliant (if you add one at the end)

Just a nit. The file has DOS line endings (Carriage Return [like typewriter] and Line Feed). You can save 370098-bytes by removing the Carriage Returns (e.g. '\r') leaving only '\n'. (you would also need to add a '\n' terminating the final line). This would make the file POSIX compliant (and it would still work just as well in windows for everything other than viewing in Notepad...) Either way works, it's just odd that such a file has DOS line endings.
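
The conversion described above could be sketched in Python as follows. The function name `normalize_newlines` is made up for illustration, and the sketch works on bytes so that no other characters are touched.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF to LF and make sure the final line ends with a newline."""
    data = data.replace(b"\r\n", b"\n")
    if data and not data.endswith(b"\n"):
        data += b"\n"
    return data


# Example on an in-memory sample: two carriage returns are dropped,
# one terminating newline is added.
print(normalize_newlines(b"a\r\nb\r\nc"))  # b'a\nb\nc\n'
```

To apply it to a file, read the file with `Path(...).read_bytes()`, pass the result through `normalize_newlines`, and write it back with `write_bytes`.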

Very long and silly abbreviations?

The list suggests that the words elect. and mineral. are abbreviations, though I cannot verify this. There are also a lot of very long jargon abbreviations like anthrop. and anthropol. and palaeontol. It seems that many of these are borrowed from biology/medical and geology jargons. I'm not sure if they belong.

Words with apostrophe

I see that in the file words_alpha there are the (wrong) words:
isnt
arent
wouldnt

and that these (right) words are not included:
isn't
aren't
wouldn't

Is this intentional?

needs a bit of cleanup

It's very handy to have this list.
It does have a few non-words in it, possibly due to OCR errors. A couple I caught were

  • brainwashjng
  • neritjc

Add CONTRIBUTING.md File to Repo pointing to github.com/dwyl/contributing

As a person who is new to the DWYL Org/Community 🆕
I need to know how to contribute to the project effectively 💭
so that I can start my journey towards Doing What I Love with my Life! ❤️ ✅ 😂

Markdown:

_**Please read** our_
[**contributing guide**](https://github.com/dwyl/contributing)
(_thank you_!)

Note: these are line-separated but in the actual rendered page it's all one line.
see: https://github.com/dwyl/contributing/blob/master/CONTRIBUTING.md

@iteles we are getting quite a few PRs for this repo which I don't want to "reject",
but rather I want people to discuss their proposed "improvements" before submitting ...

Cleanup needed

I was scrolling through the list and I found "Ultra-englishUltra-french". I suppose this is not a word?

Clarify why this repo exists! :-)

The word list was originally used in a proof-of-concept auto-completion mini-project: https://github.com/nelsonic/ac but it's since been used by quite a few people.
Please give examples of how you are using the list in your project(s) so we can add them as suggestions in the readme. thanks!

Words as a trie in JSON?

So I just did a little bit of fooling around: I took the words_alpha.txt file and converted it into a minimal trie that can be JSON-serialized.

The result looks a bit like this:

image

To clarify: "_":1 marks a node that actually represents a word - there are nodes that are part of a larger word but do not represent words themselves - "antidisestablishmentarianism" is a word in this trie, but "antidisestablishmentarianis" (missing the last "m") is not.

Because JSON is verbose, the resulting file is much larger than the original text file, but because it is so repetitive it compresses to a slightly smaller file, which could be relevant when it is served over HTTP with gzip enabled:

image

Combined with a very simple function we can really quickly check if a word is in the dictionary now:

// Yes, it's JavaScript, but it's not that hard to translate to Python code or other languages, right?
function inDictionary(word) {
   let node = words_trie;
   for (const char of word) {
      let nextNode = node[char];
      if (!nextNode) return false;
      node = nextNode;
   }
   return node._ === 1;
}
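
For reference, a Python translation might look like the sketch below. It also shows one way to build the trie from a word list; `build_trie` and `in_dictionary` are illustrative names, and the construction is an assumption about the structure described above (nested objects keyed by character, with `"_": 1` marking a complete word), not the code from the notebook.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "_" with value 1 marks a complete word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node["_"] = 1
    return root


def in_dictionary(word, trie):
    """Walk the trie one character at a time and check the end-of-word marker."""
    node = trie
    for char in word:
        node = node.get(char)
        if not isinstance(node, dict):
            return False
    return node.get("_") == 1


trie = build_trie(["ant", "antidisestablishmentarianism"])
print(in_dictionary("antidisestablishmentarianism", trie))  # True
print(in_dictionary("antidisestablishmentarianis", trie))   # False
```

The resulting dict can be passed to `json.dumps` as is, since it only contains string keys, nested dicts, and the integer 1.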

Here is the compressed file:
words_alpha_trie.tar.gz

Observable notebook where I did it here, plus a demo of the example function to test if a word is in the dictionary: https://observablehq.com/@jobleonard/efficiently-checking-if-a-word-is-in-a-dictionary

If there are no underscores in the regular words.txt file, then I think this approach should work there as well.

I thought it could be useful because, as stated, the compressed file is smaller than the text file, and in certain situations the trie is more useful than the plain text file, so it might be more efficient to ship the trie directly than to build it from the text each time.

EDIT: I got a private email asking if it was OK to use this file. So in case anyone else is curious: since it's based on the words file in this repo, I'll just say it's released under the same unlicense license, and you're free to do whatever with it :)

Adequacy of the list

How adequate is this list for performing a letter frequency analysis? My only concern is the number of words. Can you provide me with a relative size comparison?
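
A letter frequency count over the list could be sketched as below; `letter_frequencies` is a hypothetical helper. Note that counting over a word list weights every word equally, so the result can differ from frequencies measured on running text, where common words repeat.

```python
from collections import Counter


def letter_frequencies(words):
    """Return each letter's share of all letters across the given words."""
    counts = Counter(ch for word in words for ch in word.lower() if ch.isalpha())
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}


# Tiny in-memory sample; pass the lines of words_alpha.txt for a real analysis.
print(letter_frequencies(["ab", "a"]))
```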

Invalid word 'acceleratorh'

The word 'acceleratorh' is in the word list, but I couldn't find it in any dictionary and am skeptical that it is an actual word.

Looking for words within a word [apologize for submitting a PR beforehand]

I am thankful to you (owners and contributors) for this repo. I made an application that uses the words.txt database. I submitted a PR because I wanted to contribute something back. However, I did not start right, because I had not read the contributing guidelines. My apologies to all (owners and contributors) for this.

I recently created a repo to introduce my app, LWW (Looking for Words within a Word). The PR I submitted earlier was a Python script that builds a simple list of words out of a given word using the words.txt database.

Now that I have read the contributing guidelines, I want to start right and re-propose LWW as a contribution to this repo. However, I have already created the LWW repo under my own GitHub account. Can you please advise me on how to contribute LWW in this case?

Here is how LWW works. The application searches words.txt and cross-checks each entry against a given word. It collects the words in words.txt that are made up of letters from the word that was typed. The app was made for fun, for education, and as a source of ideas for word games.
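
The general idea can be sketched as follows. This is one reading of the description above, assuming a candidate qualifies when each of its letters occurs in the given word at least as many times as the candidate needs it; it is not LWW's actual code, and `words_within` is a hypothetical name.

```python
from collections import Counter


def words_within(source, candidates):
    """Return the candidates that can be spelled from the letters of source."""
    available = Counter(source.lower())
    found = []
    for word in candidates:
        needed = Counter(word.lower())
        # Counter returns 0 for letters that source does not contain.
        if all(available[ch] >= n for ch, n in needed.items()):
            found.append(word)
    return found


print(words_within("planet", ["plan", "net", "tent", "lane", "planet"]))
# ['plan', 'net', 'lane', 'planet']
```

Here "tent" is rejected because it needs two t's and "planet" has only one.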

Please visit my repo https://www.githubs.com/kakkarja/LWW for a better understanding of the app's usage.

Once again, thank you for this repo. May you have all the success and a wonderful time in your journey of life. Thank you 👍.

Why is 2 a word?

In your large list of English words, you start with 2, thus sending my friend and me on an absurd journey into linguistics and mathematics, trying to figure out why 2 is a word. There are no other single numbers on the list. Is it a title or tag for the list without formatting? Is it a necessary aspect in coding languages? Is it a mistake? Please get back to me ASAP so we can understand why a list of words begins with 2 and get some sleep. Thank you.

Snailfishes and Preinferred

Two errors I found (at least) in words_alpha.txt and words_dictionary.json:

The two words "preinferred" and "preinferring" are combined into "preinferredpreinferring" which I suspect is not actually a word.
Similarly, they contain the word "snailfishessnailflower". "snailflower" is included as well, so in this case "snailfishes" just has an unnecessary suffix.
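
Entries like these look like two alphabetical neighbours that lost the line break between them. A rough heuristic for flagging similar cases is sketched below; it assumes that the two halves share a prefix and appear in sorted order, and the thresholds `min_prefix` and `min_part` are arbitrary illustrative values. It will miss some merges and may flag legitimate words, so the output is only a list of candidates for manual review.

```python
def common_prefix_len(a, b):
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suspected_merges(words, min_prefix=4, min_part=5):
    """Yield (entry, left, right) for entries that look like two neighbours joined."""
    for entry in words:
        for cut in range(min_part, len(entry) - min_part + 1):
            left, right = entry[:cut], entry[cut:]
            if left < right and common_prefix_len(left, right) >= min_prefix:
                yield entry, left, right
                break


sample = ["preinferredpreinferring", "snailfishessnailflower", "snailflower"]
for entry, left, right in suspected_merges(sample):
    print(entry, "->", left, "+", right)
```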

Slow loading of words_alpha.txt in Python

While working with words_alpha.txt in Python 3.6, I ran into slow loading and slow searching. I think there should be a JSON-formatted dictionary of all the words for better search performance.
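
Slow searching often comes from scanning a list for every lookup. As a sketch of an alternative that works directly on the text file, the words can be read once into a `set`, which gives average constant-time membership tests. The helper name `load_word_set` is an assumption for illustration.

```python
from pathlib import Path


def load_word_set(path="words_alpha.txt"):
    """Read one word per line into a set, ignoring blank lines and line endings."""
    with Path(path).open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}
```

After `words = load_word_set()`, a lookup is simply `"apple" in words`.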

words.txt lacks words that are in words_alpha.txt

Example :

# cat words_alpha.txt|grep ^ned                                        
ned
nedder
neddy
neddies
nederlands
# cat words.txt|grep ^ned
nedder
neddies
#

The documentation states that words_alpha.txt is a subset of words.txt, which apparently is not the case as of now.
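
A check like the `grep` comparison above can be run over the whole file with a set difference. The sketch below uses a hypothetical helper, `missing_from`, with an optional case-insensitive mode. Whether capitalization explains the `ned` example is an assumption that has not been checked against the actual files; the capitalized sample entries are invented for the demonstration.

```python
def missing_from(subset_words, superset_words, ignore_case=False):
    """Return the sorted entries of subset_words that do not occur in superset_words."""
    if ignore_case:
        known = {w.casefold() for w in superset_words}
        return sorted(w for w in set(subset_words) if w.casefold() not in known)
    return sorted(set(subset_words) - set(superset_words))


alpha = ["ned", "nedder", "neddy", "neddies", "nederlands"]
full = ["Ned", "nedder", "Neddy", "neddies", "Nederlands"]  # hypothetical capitalization
print(missing_from(alpha, full))                    # ['ned', 'neddy', 'nederlands']
print(missing_from(alpha, full, ignore_case=True))  # []
```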

How was "words_alpha.txt" generated?

Clearly, "words_alpha.txt" is missing a lot of words from "words.txt" that only have letters, no numbers or symbols.
Why is that?
How was "words_alpha.txt" generated?
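
For comparison, one way such a file could be derived from words.txt is sketched below. This is not known to be how words_alpha.txt was actually generated; it assumes ASCII letters only, lowercasing, and de-duplication, and `alpha_only` is a hypothetical name.

```python
import re

# Entries made up entirely of ASCII letters.
ALPHA_ONLY = re.compile(r"[A-Za-z]+")


def alpha_only(words):
    """Lowercase the letter-only entries, drop duplicates, and sort them."""
    return sorted({w.lower() for w in words if ALPHA_ONLY.fullmatch(w)})


print(alpha_only(["2", "Ned", "isn't", "nedder", "A.M.", "ned"]))
# ['ned', 'nedder']
```

Diffing the output of such a filter against the published words_alpha.txt would show which entries are missing from either side.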

Should words_dictionary.json be a json array ?

Is there any reason why the content of words_dictionary.json isn't a JSON array ?

For example, instead of being

{ "aalii": 1,
"aaliis": 1,
"aals": 1 }

it could be

["aalii", "aaliis", "aals"]

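The conversion itself is small, as the sketch below shows; `dictionary_to_array` is a hypothetical helper. The trade-off is on the consumer side: a JSON object loads into a hash-based dictionary that supports fast lookups directly, while an array loads into a list that needs a linear scan unless it is first converted to a set.

```python
import json


def dictionary_to_array(json_text):
    """Convert JSON text of the form {"word": 1, ...} into a JSON array of its keys."""
    return json.dumps(list(json.loads(json_text)))


print(dictionary_to_array('{"aalii": 1, "aaliis": 1, "aals": 1}'))
# ["aalii", "aaliis", "aals"]
```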