
wordlists's Introduction

WORD LIST

This project aims to create a set of free-to-use word lists for open source games.

Starting from a number of open word repositories (or even dictionaries in some cases), we add, remove and filter entries to create a list of suitable words. These words are then stored in a format that makes searching simple and fast.

License

This project itself is licensed under GPLv2, but some words come from sources that retain their own licenses.

See the corresponding data/<language>.LICENSE file.

Building

The word repositories, together with their configuration files, are under data/. Under src/ you will find some short programs to manage them.

To build all lists execute:

make update

The results will be in the build/ directory. To create a ZIP file with all lists:

make dist

To search the list for one of the languages execute:

make test LANGUAGE=en

To do this in Java:

make javatest LANGUAGE=en
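
For readers who want to search a list from their own code, the FAQ below describes the built file as a sorted list of zero-terminated words. The following is a minimal Python sketch of a binary search over such a block of bytes. It is not the project's implementation (see wordlist.c and Wordlist.java for that): the header and footer are ignored because their layout is not described here, and the names build_blob and contains are made up for this illustration.

```python
def build_blob(words):
    """Pack words as a sorted run of zero-terminated byte strings.

    Stand-in for the word section of a built list; the real file also
    has a header and footer, which this sketch does not model.
    """
    encoded = sorted({w.encode("ascii") for w in words})
    return b"".join(w + b"\0" for w in encoded)


def contains(blob, word):
    """Binary search over byte offsets in a zero-terminated word block."""
    target = word.encode("ascii")
    lo, hi = 0, len(blob)  # window [lo, hi); lo is always a word start
    while lo < hi:
        mid = (lo + hi) // 2
        # mid may land in the middle of a word: step back to its start.
        prev = blob.rfind(b"\0", lo, mid)
        start = lo if prev < 0 else prev + 1
        end = blob.index(b"\0", start)  # terminator of that word
        candidate = blob[start:end]
        if candidate == target:
            return True
        if candidate < target:
            lo = end + 1  # continue after this word
        else:
            hi = start    # continue before this word
    return False


blob = build_blob(["pear", "apple", "fig", "banana"])
print(contains(blob, "fig"), contains(blob, "grape"))
```

Because the words have different lengths, each probe has to be realigned to a word boundary before comparing. This keeps the data as one flat block with no index table, at the cost of a short scan per probe.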

FAQ

  • Q: I demand you add support for <my language>!!!
  • A: This is an open source project. You can add it yourself.
  • Q: These words are not good! They are unsuitable for my game! Fix them, or else!!
  • A: This is an open source project and you have the chance to improve it by submitting your own changes. Note, however, that the main word list is meant to contain all possible words, while some games may require certain words to be excluded (e.g. words ending with a plural 's). In that case, use our lists as a starting point and then filter out any words that are not suitable for your particular need.
  • Q: How do I limit the word list to words of a certain length?
  • A: Change the data/<lang>.config file to your liking and rebuild the list.
  • Q: Why does the word list also contain language statistics?
  • A: This information can be used to help create better randomly generated game levels.
  • Q: What is the format of the produced word list?
  • A: The build/<lang>.bin file contains a sorted list of zero-terminated words plus a small header and footer. The sorted list allows the use of a quasi-binary search, which is a good compromise between memory usage, speed and code complexity.
  • Q: I don't understand the file format. Can you help me add it to my game?
  • A: Use the wordlist.c and test.c files as an example. For an example in Java, see Wordlist.java.
  • Q: With one byte per character, how do you handle non-ASCII characters?
  • A: The data/<lang>.map file can be used to map characters outside a-z to something else, such as 1-9. In your game you will have to translate them back to the original charset. To help you with this, the charset at the end of the file contains both ASCII and Unicode representations.
  • Q: I want to create a new list and my data is not UTF-8, how do I convert it?
  • A: Try:

    iconv --from-code=ISO-8859-1 --to-code=UTF-8 input > output

  • Q: I am creating a new list and upper-case Unicode characters are not recognized.
  • A: The idea is to only include lower-case. A tool used internally cannot handle Unicode case conversion. Open your list and manually replace those letters with lower-case.
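
As an alternative to editing by hand, the last two answers (converting to UTF-8 and lower-casing) could be combined in a small script run before a new list is added. The following is a minimal Python sketch, not part of this project's tooling. It assumes the raw list has one word per line, and the function name normalize_words is made up for this illustration.

```python
def normalize_words(raw, source_encoding="iso-8859-1"):
    """Decode raw bytes, lower-case each word, and return UTF-8 bytes.

    Assumes one word per line; blank lines are dropped.
    """
    text = raw.decode(source_encoding)
    # str.lower() is Unicode-aware, so letters such as Ä or Ö are handled.
    words = [line.strip().lower() for line in text.splitlines()]
    words = [w for w in words if w]
    return "".join(w + "\n" for w in words).encode("utf-8")


# Example with an in-memory ISO-8859-1 list.
sample = "Ärger\nÖL\n\nzebra\n".encode("iso-8859-1")
print(normalize_words(sample).decode("utf-8"))
```

To use this on a file, read the input in binary mode, pass the bytes through normalize_words, and write the result back in binary mode. It is worth checking the output by eye, since a few characters do not lower-case to a single letter in every language.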

wordlists's People

Contributors

tube42


wordlists's Issues

Add some german word list?

(manually importing this issue from gitlab since it may be useful for reference later)


Martin Quinson
Add some german word list?

Hello,

you may want to add some german word list, for example one listed in lexica/lexica#69 (comment)

Thanks, Mt


tube42 @tube42 · 3 months ago
Maintainer

Waiting for this issue to resolve

enz/german-wordlist#1
tube42 @tube42 · 3 months ago
Maintainer

No idea about the quality of the words, but German is now added.

Note that the file format and the way unicode is handled has changed in this release.
tube42 @tube42 closed 3 months ago
Martin Quinson @mquinson · 3 months ago

Thanks a lot. I can tell you that almost all words containing a œ in their list are actually French words. I guess that they are usable in German too, but I'm not fluent enough to say for sure. So I kinda agree with you even if I'd prefer a native German speaker to comment on the bug you opened on their repo.

weird words

I'm a native German speaker and I notice really odd words in the German word list, which I am certain do not exist in the German language. For example:

Aa
lormest
losbräch
Lunden
lustrier
Maa
macklich
MacGuffin

Also I find words that are obscure / very specialised, so that no average speaker would know them without being an expert in a very narrow field. They indeed exist (e.g. found on de.wikipedia.org) but it is questionable whether they should appear in a game. For example:

Aalhamen
Kyu
kyanisieren
Labantzen
Machorka

Where did the words come from?
Any ambitions to curate suitable words?

What games currently use these lists?

While searching for a Swedish word list to add to https://github.com/lexica/lexica, this is the best collection of dictionaries I have found. See for example this: lexica/lexica#179

What games currently use these lists? Collaborating on keeping them up to date seems like a very reasonable thing to do. For Swedish, there are regular updates of words to add and remove, as collected for example here: http://scrabbleforbundet.se/ordlistor/

Here are some categories that could be useful for some games, to allow the player to choose what types of games to play with:
Beginner words (200)
Common words (1000)
Expanded common words (10 000)
City, country, continent, area names
Flora names
Fauna names
Celebrity names
Hard/unusual words
Abbreviations
Scrabble
Massive list of everything

But to do something like this, more restructuring would be needed. It looks like you want to keep the original word lists as they are, without changes. But is that really the best idea? Language is changing, and keeping the lists more up to date is likely better. At least for Swedish, the included list stopped being updated 9 years ago, when there was heavy discussion about how it was used in a popular app game. Some aspects of how the list was set up were useful for its original purpose, but are not suitable for use in a word game. As the stated purpose of this repo is to maintain word lists for use in word games, it might make sense to adapt the list for this use.
