
kindle-dict's Introduction

Since Kindle ebook readers unfortunately don't come with any Norwegian (Bokmål) dictionaries, here is a simple way to create one based on dict.cc data. The resulting dictionary can be used like any other Kindle dictionary: in-document word look-up (including inflected forms), the vocabulary trainer, and browsing the dictionary. It contains approximately 24,800 uninflected NB > DE entries plus regularly and irregularly inflected forms for most verbs, nouns and adjectives.

With slight changes, these files can be used to create bilingual dictionaries based on other dict.cc language pairs.

Creating and Installing the Dictionary

  1. Get the dictionary source data from dict.cc's download page and save it as data/dict.cc/dict.cc.tsv.

  2. Get the files lemma.txt and fullformsliste.txt from Språkbankens ressurskatalog and save them in data/spraakbanken/.

  3. Get a list of Bokmål stop words (for instance via ranks.nl) and save it as data/stopwords/stopwords.txt (one word per line).

  4. Convert the TSV file into an appropriately formatted HTML file:

python transform.py > NB_DE_dict.html

  5. Install KindleGen and use it to convert the dictionary into a MOBI file. The conversion requires the following files:
    • NB_DE_dict.opf: Contains information on the files used for MOBI conversion and general metadata about the dictionary.
    • NB_DE_dict.html: Contains the actual dictionary entries.
    • NB_DE_dict.jpeg: The cover image (useless, but required for creating the MOBI file).

kindlegen.exe NB_DE_dict.opf -c2 -verbose -dont_append_source

  6. (Optional) Use the Kindle Previewer to preview the dictionary. Note that this only allows you to view the dictionary as if it were a regular book, but you unfortunately cannot try it out on an actual book in preview mode.

  7. Copy the MOBI file to the directory documents/dictionaries/ on your Kindle. You may need to restart the device afterwards (especially if you are updating the dictionary).
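To give a rough idea of what the conversion in step 4 involves, here is a minimal sketch of turning one TSV row into one dictionary entry. It assumes that each data row holds the Norwegian term, the German term and a word class separated by tabs, and that comment lines start with #. The idx: tags are Kindle's dictionary markup; the function names parse_tsv_line and format_entry are hypothetical, and the exact HTML that transform.py emits may well differ.

```python
import html


def parse_tsv_line(line):
    """Return (source, target, word_class), or None for comments and bad rows.

    Assumed row layout: source term, target term, word class (tab-separated).
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cells = line.split("\t")
    if len(cells) < 2:
        return None
    word_class = cells[2] if len(cells) > 2 else ""
    return cells[0].strip(), cells[1].strip(), word_class.strip()


def format_entry(headword, translation, inflected_forms=()):
    """Render one entry with Kindle's idx: dictionary markup (hypothetical helper)."""
    iforms = "".join(
        '<idx:iform value="{}"/>'.format(html.escape(form))
        for form in inflected_forms
    )
    # Only emit the inflection group when there is at least one inflected form.
    infl = "<idx:infl>{}</idx:infl>".format(iforms) if iforms else ""
    return (
        '<idx:entry name="default" scriptable="yes" spell="yes">'
        '<idx:orth value="{hw}"><b>{hw}</b>{infl}</idx:orth>'
        "<p>{tr}</p>"
        "</idx:entry>"
    ).format(hw=html.escape(headword), infl=infl, tr=html.escape(translation))


if __name__ == "__main__":
    row = parse_tsv_line("blomst\tBlume\tnoun\n")
    if row is not None:
        print(format_entry(row[0], row[1], ["blomsten", "blomster", "blomstene"]))
```

The inflected forms listed in idx:iform are what allow the Kindle to find the entry when an inflected word is selected in a book.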

If you are using Windows, you can run steps 4 and 5 in one go with run.bat.
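On other platforms, the same two steps can be chained with a small Python script. This is only a sketch of what such a wrapper might look like, not a port of run.bat: the function names are hypothetical, and it assumes that transform.py sits in the current directory and that a kindlegen executable is on the PATH.

```python
import subprocess
import sys


def build_commands(python=sys.executable, kindlegen="kindlegen"):
    """Return the command lines for steps 4 and 5 as argument lists."""
    transform = [python, "transform.py"]
    convert = [kindlegen, "NB_DE_dict.opf", "-c2", "-verbose", "-dont_append_source"]
    return transform, convert


def build(html_path="NB_DE_dict.html", kindlegen="kindlegen"):
    """Run step 4 (TSV to HTML), then step 5 (HTML to MOBI)."""
    transform, convert = build_commands(kindlegen=kindlegen)
    # Equivalent of the shell redirection "python transform.py > NB_DE_dict.html".
    with open(html_path, "wb") as out:
        subprocess.run(transform, stdout=out, check=True)
    # The exit status is returned rather than raised, so the caller can decide
    # how to treat it.
    return subprocess.run(convert).returncode


if __name__ == "__main__":
    sys.exit(build())
```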

To uninstall, go to documents/dictionaries/ and delete NB_DE_dict.mobi as well as NB_DE_dict.sdr/.

Building Dictionaries for Other Languages

  1. In the OPF file, update the dictionary title, languages and all relevant file names.

  2. If the dictionary data is not in the dict.cc format, either re-format it accordingly or change the way the file is parsed in transform.py.

  3. Create a class that extends the Inflector class (inflector.py) and generates inflected forms for your language. Use it as the Inflector class in transform.py.

  4. Follow the steps above for creating and installing a new dictionary.
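As an illustration of step 3, a language-specific inflector could look roughly like the sketch below. The interface of the real Inflector class is not described here, so the base class is a stand-in and the inflect(lemma, pos) method is an assumption; check inflector.py for the actual method names before adapting it. The suffix table is a toy paradigm for regular nouns, not a full grammar.

```python
class Inflector:
    """Stand-in for the base class in inflector.py (interface assumed)."""

    def inflect(self, lemma, pos):
        return set()


class SuffixInflector(Inflector):
    """Hypothetical inflector that appends fixed suffixes per part of speech."""

    def __init__(self, suffixes):
        # Maps a part of speech to its suffixes, e.g. {"noun": ["en", "er", "ene"]}.
        self.suffixes = suffixes

    def inflect(self, lemma, pos):
        forms = set()
        for suffix in self.suffixes.get(pos, ()):
            if lemma.endswith("e") and suffix.startswith("e"):
                # Avoid a doubled vowel: "jente" + "er" gives "jenter".
                forms.add(lemma + suffix[1:])
            else:
                forms.add(lemma + suffix)
        forms.discard(lemma)
        return forms


if __name__ == "__main__":
    inflector = SuffixInflector({"noun": ["en", "er", "ene"]})
    print(sorted(inflector.inflect("bil", "noun")))
```

Irregular forms would need a lookup table consulted before the suffix rules, in the same way the Bokmål dictionary prefers the Språkbanken lists over generated forms.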

Features / To Do

  • Generate inflections (nouns, adjectives, verbs).
    • Regular inflections (from Språkbanken where available, otherwise generated according to regular inflection paradigms)
    • Irregular inflections (from Språkbanken's list)
    • Genitive forms
    • Multi-token entries (in particular: phrasal verbs)
  • Deal with parentheses and ellipses in Norwegian entries.
  • Merge entries for identical Norwegian words (e.g. blomsterbutikk).
    • Extend this to [kvinnelig] entries.
  • Show relevant multi-token entries when looking up single-token entries (e.g. the entry for blå (blue) also contains information on the phrase å være i det blå (to be in the dark), which is also a distinct entry).
    • I don't check for POS tags when creating these references; therefore, there are some false positives here. Since I find them quite interesting, I don't plan on refining this.
  • Extend the dictionary.
    • Note: Unless compound nouns are in the dictionary, it's not possible to look them (or their constituents) up. Since I cannot change the way the dictionary is used to look up entries, there is not much I can do.
    • Look into adding Wiktionary data, specifically from the English or Norwegian editions of Wiktionary.
    • The best (monolingual) Norwegian dictionary I know is https://ordbok.uib.no/, whose database I unfortunately cannot download and use. But maybe there are other good monolingual dictionaries out there that I can use?
    • Written Danish and Bokmål are very similar. If I can find a large DA>EN or DA>DE dictionary, it could be worth looking into adding these entries where no Norwegian entries are present.
    • What about Norsk Ordvev (Norwegian WordNet) for (monolingual) thesaurus-like information?
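Two of the features above, merging entries for identical headwords and linking single-token entries to the phrases that contain them, can be sketched as follows. This is a simplified illustration rather than the code in transform.py: the function names are hypothetical, the example pairs are illustrative, and using the stop word list to suppress references for very common words is an assumption. Because the matching is purely string-based and ignores POS tags, it produces the kind of false positives mentioned above.

```python
from collections import defaultdict


def merge_entries(pairs):
    """Group translations by headword, keeping first-seen order without duplicates."""
    merged = defaultdict(list)
    for headword, translation in pairs:
        if translation not in merged[headword]:
            merged[headword].append(translation)
    return dict(merged)


def find_related_phrases(headwords, stopwords):
    """Map each single-token headword to the multi-token headwords containing it."""
    singles = {h for h in headwords if " " not in h}
    related = defaultdict(list)
    for phrase in headwords:
        tokens = phrase.split()
        if len(tokens) < 2:
            continue
        # dict.fromkeys removes repeated tokens while preserving their order.
        for token in dict.fromkeys(tokens):
            if token in singles and token not in stopwords:
                related[token].append(phrase)
    return dict(related)


if __name__ == "__main__":
    pairs = [
        ("blomsterbutikk", "Blumenladen"),
        ("blomsterbutikk", "Blumengeschäft"),
    ]
    print(merge_entries(pairs))
    headwords = ["blå", "å være i det blå", "være", "i"]
    print(find_related_phrases(headwords, {"å", "i", "det", "være"}))
```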

References and Data
