nltk_data's Introduction

Natural Language Toolkit (NLTK)

NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.8, 3.9, 3.10, 3.11 or 3.12.

For documentation, please visit nltk.org.

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK.

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book, as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2023 NLTK Project

For license information, see LICENSE.txt.

AUTHORS.md contains a list of everyone who has contributed to NLTK.

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License.
  • NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license.
  • NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable and available for non-commercial use.
  • NLTK may be freely redistributed, subject to the provisions of these licenses.

nltk_data's People

Contributors

alvations, avitalp, djokester, ekaf, ewan-klein, explorerfreda, fcbond, gdemelo, glowskir, ihulub, jacksonllee, letuananh, martymacgyver, nimbusaeta, reedloden, sahutd, simonrichard, stevenbird, theredpea, tomaarsen


nltk_data's Issues

Some punctuation doubled in Brown corpus

This seems to consistently affect ?, ! and ;. Example:

How effective have Kennedy administration first foreign policy decisions been in dealing with Communist aggression ? ?

This shows some more examples:

from nltk.corpus import brown

n = 1000
for sent in brown.sents()[:n]:
    if any(punc in sent for punc in '!?;'):
        print(' '.join(sent))

New pretrained Punkt model for Polish

This contributes a new pretrained Punkt model for nltk.tokenize.punkt for Polish, trained on the Polish National Corpus by Krzysztof Langner, along with an updated README file for the collection of pretrained Punkt models.
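
For reference, a pretrained Punkt model is loaded with nltk.data.load() and then exposes a tokenize() method; a minimal sketch, assuming the contributed model is installed under the illustrative path tokenizers/punkt/polish.pickle:

import nltk

# Illustrative path; assumes the contributed model ships in the punkt package
# as tokenizers/punkt/polish.pickle.
tokenizer = nltk.data.load('tokenizers/punkt/polish.pickle')
print(tokenizer.tokenize('Ala ma kota. Ona ma psa.'))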

Data as PyPI packages

It would be nicer, and arguably more usable, to have each data package available as a PyPI package, so that it could be installed with pip instead of through the interactive nltk.download() installer.
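
For comparison, the current downloader can already be driven non-interactively from Python; a minimal sketch of the approach this issue proposes to replace (the package ids are illustrative):

import nltk

# Fetch specific data packages without the interactive installer.
for pkg in ('brown', 'punkt'):
    nltk.download(pkg, quiet=True)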

Inclusion of MULTEXT-East Corpus

Hi,
@jwacalex, @toydarian and I are currently working on a corpus reader and POS tagger for the MULTEXT-East corpus, as part of a project in a text-mining course at the University of Passau.
It would be great if our work could be integrated into NLTK. The first step would therefore be to add the corpus to your repository. It is available at https://www.clarin.si/repository/xmlui/handle/11356/1043; the license is CC BY-NC-SA 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0/).

MULTEXT-East would be a valuable multilingual, annotated corpus for NLTK, which currently has few such resources.

We propose the name mte_teip5 as a short string for the corpus inside NLTK.

Regards,
Alex, Tommy and Thomas

panlex_lite is out of date

panlex_lite was recently updated upstream, so the NLTK downloader (both GUI and command line) always reports it as out of date. The correct current data for index.xml is as follows:

<package author="David Kamholz" checksum="255bbe2c06d8e1acfe57cab42052d6cb" id="panlex_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2235214249" subdir="corpora" unzip="1" unzipped_size="5917599729" url="http://dev.panlex.org/db/panlex_lite.zip" webpage="http://panlex.org/" />
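
The downloader flags a package as out of date by comparing the installed copy against the index metadata, including the checksum attribute above; a minimal sketch that recomputes the MD5 of an already downloaded zip for comparison:

import hashlib

def md5sum(path, blocksize=1 << 20):
    """Compute the MD5 of a file in chunks (large corpora don't fit in memory)."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum attribute in index.xml.
print(md5sum('panlex_lite.zip') == '255bbe2c06d8e1acfe57cab42052d6cb')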

Can't download cmudict

I keep getting an HTTP 503 error when trying to download cmudict, either through the NLTK interface or directly from GitHub.

Cannot download latest nltk_data

Not long ago we had a problem downloading nltk data described in the nltk repo under this issue: nltk/nltk#882 where an erroneous extension caused issues downloading nltk data using nltk v2.

It seems the problem has been re-introduced with the following commit: c9887b5 where the .xml extension is causing download failures in nltk. Would it be possible to have the same fix applied to remove the erroneous extension?

gh-pages index.xml broken

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 2267, in <module>
    halt_on_error=options.halt_on_error)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 664, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 534, in incr_download
    try: info = self._info_or_id(info_or_id)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 508, in _info_or_id
    return self.info(info_or_id)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 875, in info
    self._update_index()
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 825, in _update_index
    ElementTree.parse(compat.urlopen(self._url)).getroot())
  File "/usr/local/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/usr/local/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 23, column 143
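
A quick way to catch this kind of breakage before publishing is to parse the generated index locally with the same standard-library parser the downloader uses; a minimal sketch:

import xml.etree.ElementTree as ElementTree

# Parsing the generated index locally raises the same ParseError the
# downloader hits if index.xml is not well-formed.
root = ElementTree.parse('index.xml').getroot()
print(root.tag, len(root))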

Re-train maxent_treebank_pos_tagger

It currently doesn't unpickle under Python 3.x. I guess this is because of http://bugs.python.org/issue6784: the Treebank corpus reader returned bytestrings under Python 2.x and the pickled classifier was trained on them; Python 3.x tries to decode them to unicode and fails because the encoding is unknown. I think the way to fix this is to re-train the classifier on Python 2.x but with unicode strings as features; this should be backwards-compatible if I'm not mistaken.
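
A possible workaround on the loading side (not a fix for the published pickle itself) is to pass an explicit encoding when unpickling a Python-2-era file under Python 3; a hedged sketch, assuming the pickled bytestrings are Latin-1-compatible and using a placeholder filename:

import pickle

# 'treebank.tagger.pickle' is a placeholder filename for the problematic pickle.
with open('treebank.tagger.pickle', 'rb') as f:
    # encoding='latin-1' maps Python 2 bytestrings onto str without decode
    # errors; whether the resulting tagger then behaves correctly still needs
    # to be verified.
    tagger = pickle.load(f, encoding='latin-1')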

Type inconsistency between NLTK Wordnet and OMW for all_lemma_names()

NLTK's Wordnet and Open Multilingual Wordnet ("OMW") share a common function, but produce output of differing types:

import nltk
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn
for lang in sorted(wn.langs()):
    print lang, type(wn.all_lemma_names(lang=lang))

produces:

als <type 'list'>
arb <type 'list'>
bul <type 'list'>
cat <type 'list'>
cmn <type 'list'>
dan <type 'list'>
ell <type 'list'>
eng <type 'dictionary-keyiterator'>
eus <type 'list'>
fas <type 'list'>
fin <type 'list'>
fra <type 'list'>
glg <type 'list'>
heb <type 'list'>
hrv <type 'list'>
ind <type 'list'>
ita <type 'list'>
jpn <type 'list'>
nno <type 'list'>
nob <type 'list'>
pol <type 'list'>
por <type 'list'>
qcn <type 'list'>
slv <type 'list'>
spa <type 'list'>
swe <type 'list'>
tha <type 'list'>
zsm <type 'list'>

This inconsistency forces users into workarounds that make code less comprehensible than it needs to be. For example, until this bug is fixed, developers may need to wrap the result in Python's built-in list() to ensure that a list is returned regardless of which OMW language code is passed to all_lemma_names(), as in the sketch below.
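
A minimal sketch of that workaround:

from nltk.corpus import wordnet as wn

# Wrapping the result in list() gives the same type for every language code.
eng = list(wn.all_lemma_names(lang='eng'))
fra = list(wn.all_lemma_names(lang='fra'))
print(type(eng) is type(fra))  # True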

This inconsistency also partly explains the zero result for English (eng) in #42 .

Crubadan corpus replacement/addition

The crubadan site (http://crubadan.org/) explains most of the project; the corpora are released under a Creative Commons license, and the site links to the numerous corpora available under that license. There are over 2100 collections and they are updated frequently. I understand that a typical user would probably want one language at a time, so if there is a specific way you would all like this handled, please contact me on GitHub.

We also have a rough reader written that works off the crubadan language codes. It is located at https://github.com/BrennanG/nltk/blob/develop/nltk/corpus/reader/crubadan.py. If you have any suggestions for it, please do not hesitate to let me know.

Thanks,

Dustin Joosten

brown.zip (possible others) corrupt

When I attempt to download the brown.zip corpus and extract with the nltk tools I get the message "Error with downloaded zip file".

When I attempt to download it manually it seems to download just fine, but then when I go to extract manually:

$ unzip brown.zip
Archive:  brown.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of brown.zip or
        brown.zip.zip, and cannot find brown.zip.ZIP, period.
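
A quick way to distinguish a truncated download from genuine corruption is to test the archive with Python's zipfile module before extracting; a minimal sketch:

import zipfile

if not zipfile.is_zipfile('brown.zip'):
    print('Not a valid zip archive (likely truncated or an error page).')
else:
    with zipfile.ZipFile('brown.zip') as zf:
        bad = zf.testzip()  # returns the first corrupt member name, or None
        print('OK' if bad is None else 'Corrupt member: %s' % bad)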

hmm_treebank_pos_tagger doesn't load

In [7]: _POS_TAGGER = 'taggers/hmm_treebank_pos_tagger/treebank.tagger.pickle.gz'

In [8]: pickle.load(nltk.data.find(_POS_TAGGER).open())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-3e64e4dab40b> in <module>()
----> 1 pickle.load(nltk.data.find(_POS_TAGGER).open())

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load(file)
   1376 
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379 
   1380 def loads(str):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load_reduce(self)
   1131         args = stack.pop()
   1132         func = stack[-1]
-> 1133         value = func(*args)
   1134         stack[-1] = value
   1135     dispatch[REDUCE] = load_reduce

/Users/kmike/envs/nltk/lib/python2.7/copy_reg.pyc in _reconstructor(cls, base, state)
     46 def _reconstructor(cls, base, state):
     47     if base is object:
---> 48         obj = object.__new__(cls)
     49     else:
     50         obj = base.__new__(cls, state)

TypeError: object.__new__(ConditionalProbDist) is not safe, use collections.defaultdict.__new__()

Make gh-pages branch the default

You can make the gh-pages branch the GitHub default. Not a big deal, but might help some people browsing around and unfamiliar with the GitHub UI. Within the Settings tab on the right, there's a dropdown menu for Default Branch.

Update WordNet data files to 3.1

WordNet 3.1 provides updated data files in the same format as 3.0, plus a host of additional files. However, the lexnames file is gone.

Stemming irregular verbs and nouns

Greetings. I intend to work on the Reuters dataset. In the first phase I need to normalize the dataset and perform stemming. Given that my dataset is in Latin script, how can I stem irregular verbs and nouns?
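
A rule-based stemmer will not handle irregular forms; a dictionary-based lemmatizer is usually needed. A minimal sketch using NLTK's WordNet lemmatizer (the example words are illustrative, not from the Reuters data):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Irregular forms need the right part-of-speech hint ('n' for nouns, 'v' for verbs).
print(lemmatizer.lemmatize('geese', pos='n'))  # goose
print(lemmatizer.lemmatize('ran', pos='v'))    # run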

Common contracted forms are missing from the English stop word list

While the list contains s and t (most likely because they can occur after an apostrophe as part of a contraction, e.g. in dog's and can't), other common forms are missing:

  • d as in she'd,
  • ll as in we'll,
  • m as in I'm,
  • o as in o'clock,
  • re as in you're,
  • ve as in they've,
  • y as in y'all

Also missing are the parts of these contractions that fall to the left of the apostrophe, e.g. ain (but don is there).

Of course, the absence of these forms could be justified by pointing out that if the tokenizer does not split at apostrophes, these forms will never occur in the tokenized text. However, that is a strong assumption, especially since nltk's own Punkt tokenizer, for instance, does split at apostrophes. Also, some of the contractions seem to be handled (don't, can't, the possessive s), so it makes little sense to leave out the rest.
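
Until the list is updated, one workaround is simply to extend the stopword set with the pieces listed above; a minimal sketch:

from nltk.corpus import stopwords

# Contraction pieces reported missing above.
extra = {'d', 'll', 'm', 'o', 're', 've', 'y', 'ain'}
english_stopwords = set(stopwords.words('english')) | extra
print('ll' in english_stopwords)  # True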

Store PanLex corpus in Git LFS

The PanLex corpus is too large to store directly in git, so it should be stored in Git LFS and pulled from there instead. This also would take the load off of PanLex's dev server and permit secure downloads of the corpus (rather than downloading via insecure/unencrypted http://).

panlex_lite.xml causes build failure

Trying to build index.xml, but I keep getting this error. Any suggestions?

$ python tools/build_pkg_index.py . https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages index.xml
Traceback (most recent call last):
  File "C:\Users\XXX\Anaconda3\lib\site-packages\nltk\downloader.py", line 2200, in _find_packages
    try: zf = zipfile.ZipFile(zipfilename)
  File "C:\Users\XXX\Anaconda3\lib\zipfile.py", line 1009, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '.\\packages\\corpora\\panlex_lite.zip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/build_pkg_index.py", line 24, in <module>
    index = build_index(ROOT, BASE_URL)
  File "C:\Users\XXX\Anaconda3\lib\site-packages\nltk\downloader.py", line 2085, in build_index
    for pkg_xml, zf, subdir in _find_packages(os.path.join(root, 'packages')):
  File "C:\Users\XXX\Anaconda3\lib\site-packages\nltk\downloader.py", line 2203, in _find_packages
    (zipfilename, e))
ValueError: Error reading file '.\\packages\\corpora\\panlex_lite.zip'!
[Errno 2] No such file or directory: '.\\packages\\corpora\\panlex_lite.zip'

>>> import nltk;
>>> print('The nltk version is {}.'.format(nltk.__version__))
The nltk version is 3.2.1.

How can I contribute a corpus?

I'd like to contribute a corpus. What format does the corpus need to be in? Does it need to be POS tagged?

It's a corpus of about 65K books from the British Library. Currently, they're only XML files, but I'm working on getting them in plaintext, as well. You can see a few sample files in https://github.com/Git-Lit/git-lit/tree/master/data2. It's about 1TB, or 250GB compressed, so it won't fit in this GH repo. However, I'm making github repositories for each text in the corpus. So all that would be needed is a way for nltk.corpus.download() to grab each text in this corpus, given a URL for each one. What would be the best way of doing that?
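
In the meantime, individual texts could be fetched by URL and read with a standard corpus reader; a minimal sketch, where the URL, directory, and filename are placeholders:

import os
import urllib.request

from nltk.corpus.reader import PlaintextCorpusReader

# Placeholder URL and filename for one plaintext volume of the proposed corpus.
url = 'https://example.org/british-library/volume-0001.txt'
os.makedirs('bl_corpus', exist_ok=True)
urllib.request.urlretrieve(url, 'bl_corpus/volume-0001.txt')

reader = PlaintextCorpusReader('bl_corpus', r'.*\.txt')
print(len(reader.words('volume-0001.txt')))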

Different API calls for obtaining all lemma names in NLTK's Open Multilingual Wordnet produce inconsistent results

This bug appears to be related to #42, but is of a more general character.

import nltk
from tabulate import tabulate
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

table = list()

for lang in sorted(wn.langs()):
    my_set_of_all_lemma_names = set()
    from nltk.corpus import wordnet as wn
    for aln_term in list(wn.all_lemma_names(lang=lang)):
        for synset in wn.synsets(aln_term):
            for lemma in synset.lemma_names():
                my_set_of_all_lemma_names.add(lemma)
    table.append([lang,
        len(set(wn.all_lemma_names(lang=lang))),
        len(my_set_of_all_lemma_names)])

print(tabulate(table,
    headers=["Language code",
        "all_lemma_names()",
        "lemma_name.synset.lemma.lemma_names()"]))

produces (with headers condensed onto multiple lines, and column markers added):

Language | all_lemma_names() | lemma_name.synset
code     |                   | .lemma.lemma_names()
-------- | ----------------- | --------------------
als      |              5988 |                 2477
arb      |             17785 |                   54
bul      |              6720 |                    0
cat      |             46534 |                24368
cmn      |             61532 |                   13
dan      |              4468 |                 4336
ell      |             18229 |                  800
eng      |            147306 |               148730
eus      |             26242 |                 6055
fas      |             17560 |                    0
fin      |            129839 |                49042
fra      |             55350 |                45367
glg      |             23125 |                12893
heb      |              5325 |                    0
hrv      |             29010 |                 8596
ind      |             36954 |                21780
ita      |             41855 |                13225
jpn      |             89637 |                 1028
nno      |              3387 |                 3255
nob      |              4186 |                 3678
pol      |             45387 |                10844
por      |             54069 |                21889
qcn      |              3206 |                    0
slv      |             40236 |                25363
spa      |             36681 |                20922
swe      |              5824 |                 4640
tha      |             80508 |                  622
zsm      |             33932 |                19253

As with #42, it is interesting that sometimes the first API call finds more lemma names; and sometimes the second API call finds more. That again suggests to me that this behaviour does indeed represent a bug (or perhaps a series of bugs), and is not intentional.

Help importing the Reuters dataset

Please explain how to import the Reuters dataset. I want to load the Reuters corpus in Python, but I get an error.

I have installed the related corpora; every corpus loads except Reuters.

from nltk.corpus import reuters
reuters.fileids()

Traceback (most recent call last):
  File "D:\Python34kk\lib\site-packages\nltk\corpus\util.py", line 63, in __load
    try: root = nltk.data.find('corpora/%s' % zip_name)
  File "D:\Python34kk\lib\site-packages\nltk\data.py", line 618, in find
    raise LookupError(resource_not_found)

Problem loading machado corpus with python 3.5

The following code, which used to work a while ago, now fails with Python 3.5:

from nltk.corpus import machado

textos = [machado.raw(id) for id in machado.fileids()]
len(textos)

This yields the following error:

AssertionError                            Traceback (most recent call last)
<ipython-input-23-3d2409a86302> in <module>()
----> 1 textos = [machado.raw(id) for id in machado.fileids()]
      2 len(textos)

<ipython-input-23-3d2409a86302> in <listcomp>(.0)
----> 1 textos = [machado.raw(id) for id in machado.fileids()]
      2 len(textos)

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/plaintext.py in raw(self, fileids, categories)
    158     def raw(self, fileids=None, categories=None):
    159         return PlaintextCorpusReader.raw(
--> 160             self, self._resolve(fileids, categories))
    161     def words(self, fileids=None, categories=None):
    162         return PlaintextCorpusReader.words(

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/plaintext.py in raw(self, fileids)
     72         if fileids is None: fileids = self._fileids
     73         elif isinstance(fileids, string_types): fileids = [fileids]
---> 74         return concat([self.open(f).read() for f in fileids])
     75 
     76     def words(self, fileids=None):

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/plaintext.py in <listcomp>(.0)
     72         if fileids is None: fileids = self._fileids
     73         elif isinstance(fileids, string_types): fileids = [fileids]
---> 74         return concat([self.open(f).read() for f in fileids])
     75 
     76     def words(self, fileids=None):

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/api.py in open(self, file)
    208         """
    209         encoding = self.encoding(file)
--> 210         stream = self._root.join(file).open(encoding)
    211         return stream
    212 

/usr/local/lib/python3.5/dist-packages/nltk/data.py in open(self, encoding)
    503 
    504     def open(self, encoding=None):
--> 505         data = self._zipfile.read(self._entry)
    506         stream = BytesIO(data)
    507         if self._entry.endswith('.gz'):

/usr/local/lib/python3.5/dist-packages/nltk/data.py in read(self, name)
    980         self.fp = open(self.filename, 'rb')
    981         value = zipfile.ZipFile.read(self, name)
--> 982         self.close()
    983         return value
    984 

/usr/lib/python3.5/zipfile.py in close(self)
   1610             fp = self.fp
   1611             self.fp = None
-> 1612             self._fpclose(fp)
   1613 
   1614     def _write_end_record(self):

/usr/lib/python3.5/zipfile.py in _fpclose(self, fp)
   1714 
   1715     def _fpclose(self, fp):
-> 1716         assert self._fileRefCnt > 0
   1717         self._fileRefCnt -= 1
   1718         if not self._fileRefCnt and not self._filePassed:

AssertionError: 

Additional categories for different NLTK usages

We have all-corpora and all, but it would be nice to add several new categories, such as:

  • popular

    • punkt
    • stopwords
    • wordnet
    • averaged_perceptron_tagger
    • brown
    • movie_reviews
    • words
  • tokenizers

    • punkt
    • snowball
    • perluniprops
    • nonbreaking_prefixes

That way, I think it would be easier to advise users to do the following to install NLTK:

pip install -U nltk
python -m nltk.downloader popular
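
The same collection could also be fetched from Python once the proposed category exists in index.xml; a minimal sketch:

import nltk

# Python equivalent of `python -m nltk.downloader popular`.
nltk.download('popular')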

More importantly, I suggest adding all-no-third-party and all-third-party, so that we can separate the issues that arise when third-party datasets/models refresh their data without updating their checksums in nltk.

@stevenbird Are the suggestions okay? How should we go about adding these categories?

Duplicates in words.words() dictionary

The words.words() word list contains 844 duplicate entries, which might as well be eliminated. I discovered this because some of them were out of alphabetical order.

Here is some Python to illustrate this:

>>> from nltk.corpus import words
>>> len(words.words())
236736
>>> len(set(words.words()))
235892
>>> 236736 - 235892
844
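
The duplicated entries themselves can be listed with a Counter; a minimal sketch:

from collections import Counter

from nltk.corpus import words

counts = Counter(words.words())
duplicates = sorted(w for w, c in counts.items() if c > 1)
print(len(duplicates), duplicates[:10])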

Thanks.

Word missing in words

'children' is missing from the English word list.
I understand that some plurals are missing, but some irregular plurals are included, such as 'hippopotami' and 'corpora'. It would be difficult to derive 'children' from 'child' programmatically.
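
As a workaround, a dictionary-based lemmatizer can map the irregular plural back to a headword that is in the list; a minimal sketch:

from nltk.corpus import words
from nltk.stem import WordNetLemmatizer

vocab = set(words.words())
lemmatizer = WordNetLemmatizer()

print('children' in vocab)                         # False, per this report
lemma = lemmatizer.lemmatize('children', pos='n')  # 'child'
print(lemma, lemma in vocab)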

Stopwords for Kazakh language (kazakh)

Suggested NLTK name for the stopwords: kazakh
There is no open stopword list available for the Kazakh language, so this list was compiled by our team (Almaty, Kazakhstan) and is freely redistributable.

panlex_lite.zip is broken

The 1.7 GB file is broken and causes the downloader to fail.
Proof:

$ wget http://dev.panlex.org/db/panlex_lite.zip
$ unzip panlex_lite.zip
Archive:  panlex_lite.zip
warning [panlex_lite.zip]:  76 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [panlex_lite.zip]:  reported length of central directory is
  -76 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1
  zipfile?).  Compensating...
   skipping: panlex_lite/db.sqlite   need PK compat. v4.5 (can do v2.1)
   creating: panlex_lite/
  inflating: panlex_lite/README.txt

note:  didn't find end-of-central-dir signature at end of central dir.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

Capitalization inconsistencies in NLTK's Open Multilingual Wordnet

NLTK's Open Multilingual Wordnet ("OMW") corpus data violates the principle of least surprise, in the following respect.

A user can reasonably expect that any NLTK function that:

  • produces a list of all lemmas given a language, and
  • in that list, retains the language's typical capitalisation of those lemmas

will do the same thing for every other language in OMW (if it is a language that uses capital letters).

However, NLTK violates that expectation:

import nltk
import re
capital = re.compile(r'[A-Z]')
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn
for lang in sorted(wn.langs()):
    print(lang, len(list(filter(capital.match, wn.all_lemma_names(lang=lang)))))

produces:

als 14
arb 6
bul 1
cat 4139
cmn 55
dan 3
ell 314
eng 0
eus 722
fas 0
fin 30861
fra 11818
glg 4272
heb 2
hrv 2505
ind 15072
ita 2416
jpn 168
nno 1
nob 3
pol 2492
por 21850
qcn 0
slv 2888
spa 6493
swe 5
tha 0
zsm 11193

Some of the zero or very low results above are probably due to the language not generally using the letters A-Z. However, not all of the results can be accounted for in this way; those that cannot represent inconsistencies in OMW.

I have not yet investigated which kinds of inconsistency they represent. They may well represent more than one kind of inconsistency.

Stopwords for Slovene

I have a list of stopwords for the Slovene language and I'd like to include it in NLTK so that we can use it from within the library. I've tried to submit a PR, but you seem to package everything as .zip files. What should be in them? A .txt file? Please provide slightly more detailed contribution guidelines.

Thanks!
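
For illustration only (this is an assumption about the packaging, not official guidance): the stopwords package appears to be a zip containing a top-level stopwords/ directory with one plain-text word list per language, one word per line. A minimal sketch of producing such a file and archive, with a hypothetical handful of Slovene words:

import os
import zipfile

# Hypothetical handful of Slovene stopwords; the real list goes here,
# one word per line.
slovene_stopwords = ['in', 'na', 'za', 'je', 'se']

os.makedirs('stopwords', exist_ok=True)
with open('stopwords/slovene', 'w', encoding='utf-8') as f:
    f.write('\n'.join(slovene_stopwords) + '\n')

with zipfile.ZipFile('stopwords.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('stopwords/slovene')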

Konkani corpus contribution

Konkani (Kōṅkaṇī) is an Indo-Aryan language belonging to the Indo-European family of languages and is spoken along the western coast of India. It is one of the 22 scheduled languages mentioned in the 8th schedule of the Indian Constitution and the official language of the Indian state of Goa. The first Konkani inscription is dated 1187 A.D. It is a minority language in Maharashtra, Karnataka, northern Kerala (Kasaragod district), Dadra and Nagar Haveli, and Daman and Diu.

Konkani is a member of the southern Indo-Aryan language group. It retains elements of Old Indo-Aryan structures and shows similarities with both western and eastern Indo-Aryan languages. [Reference: https://en.wikipedia.org/wiki/Konkani_language]

I would like to contribute a part-of-speech-tagged corpus in Konkani.

NLTK name: konkani
Corpus reader: TaggedCorpusReader
Source: I have collected the corpus from various sources such as newspapers, magazines, periodicals, and academic texts.
The corpus is freely redistributable.
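
For reference, TaggedCorpusReader reads whitespace-separated word/TAG tokens by default; a minimal sketch with a hypothetical directory name and file pattern:

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical layout: a directory of .pos files whose whitespace-separated
# tokens look like word/TAG, e.g. a line such as "word1/NN word2/VB ./SYM".
reader = TaggedCorpusReader('konkani_corpus', r'.*\.pos')
print(reader.tagged_words()[:5])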
