nltk_data's Introduction

Natural Language Toolkit (NLTK)

NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.8, 3.9, 3.10, 3.11 or 3.12.

For documentation, please visit nltk.org.

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK.

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book, as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2023 NLTK Project

For license information, see LICENSE.txt.

AUTHORS.md contains a list of everyone who has contributed to NLTK.

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License.
  • NLTK documentation is distributed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States license.
  • NLTK corpora are provided under the terms given in the README file for each corpus; all are redistributable and available for non-commercial use.
  • NLTK may be freely redistributed, subject to the provisions of these licenses.

nltk_data's People

Contributors

alvations, avitalp, djokester, ekaf, ewan-klein, explorerfreda, fcbond, gdemelo, glowskir, ihulub, jacksonllee, letuananh, martymacgyver, nimbusaeta, reedloden, sahutd, simonrichard, stevenbird, theredpea, tomaarsen


nltk_data's Issues

Some punctuation doubled in Brown corpus

This seems to consistently affect ?, ! and ;. Example:

How effective have Kennedy administration first foreign policy decisions been in dealing with Communist aggression ? ?

This shows some more examples:

from nltk.corpus import brown

n = 1000
for sent in brown.sents()[:n]:
    if any(punc in sent for punc in '!?;'):
        print(' '.join(sent))

New pretrained Punkt model for Polish

This contributes a new pretrained Punkt model for nltk.tokenize.punkt for Polish, trained on the Polish National Corpus by Krzysztof Langner, along with an updated README file for the collection of pretrained Punkt models.
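
For reference, a pretrained Punkt model is loaded with nltk.data.load() and then exposes a tokenize() method; a minimal sketch, assuming the contributed model is installed under the illustrative path tokenizers/punkt/polish.pickle:

import nltk

# Illustrative path; assumes the contributed model ships in the punkt package
# as tokenizers/punkt/polish.pickle.
tokenizer = nltk.data.load('tokenizers/punkt/polish.pickle')
print(tokenizer.tokenize('Ala ma kota. Ona ma psa.'))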

Data as PyPI packages

It would be nicer, and arguably more usable, to have each data package available as a PyPI package, so that it could be installed with pip instead of through the interactive nltk.download() installer.
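
For comparison, the current downloader can already be driven non-interactively from Python; a minimal sketch of the approach this issue proposes to replace (the package ids are illustrative):

import nltk

# Fetch specific data packages without the interactive installer.
for pkg in ('brown', 'punkt'):
    nltk.download(pkg, quiet=True)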

Inclusion of MULTEXT-East Corpus

Hi,
@jwacalex, @toydarian and I are currently working on a corpus reader and POS tagger for the MULTEXT-East corpus, as part of a project in a text-mining course at the University of Passau.
It would be great if our work could be integrated into NLTK. The first step would therefore be to add the corpus to your repository. It is available at https://www.clarin.si/repository/xmlui/handle/11356/1043; the license is CC BY-NC-SA 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0/).

MULTEXT-East would be a valuable multilingual, annotated corpus for NLTK, which currently has few such resources.

We propose the name mte_teip5 as a short string for the corpus inside NLTK.

Regards,
Alex, Tommy and Thomas

panlex_lite is out of date

panlex_lite was recently updated upstream, so the NLTK downloader (both GUI and command line) always reports it as out of date. The correct current data for index.xml is as follows:

<package author="David Kamholz" checksum="255bbe2c06d8e1acfe57cab42052d6cb" id="panlex_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2235214249" subdir="corpora" unzip="1" unzipped_size="5917599729" url="http://dev.panlex.org/db/panlex_lite.zip" webpage="http://panlex.org/" />
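
The downloader flags a package as out of date by comparing the installed copy against the index metadata, including the checksum attribute above; a minimal sketch that recomputes the MD5 of an already downloaded zip for comparison:

import hashlib

def md5sum(path, blocksize=1 << 20):
    """Compute the MD5 of a file in chunks (large corpora don't fit in memory)."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum attribute in index.xml.
print(md5sum('panlex_lite.zip') == '255bbe2c06d8e1acfe57cab42052d6cb')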

Can't download cmudict

I keep getting an HTTP 503 error when trying to download cmudict, either through the NLTK interface or directly from GitHub.

Cannot download latest nltk_data

Not long ago we had a problem downloading nltk data described in the nltk repo under this issue: nltk/nltk#882 where an erroneous extension caused issues downloading nltk data using nltk v2.

It seems the problem has been re-introduced with the following commit: c9887b5 where the .xml extension is causing download failures in nltk. Would it be possible to have the same fix applied to remove the erroneous extension?

gh-pages index.xml broken

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 2267, in <module>
    halt_on_error=options.halt_on_error)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 664, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 534, in incr_download
    try: info = self._info_or_id(info_or_id)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 508, in _info_or_id
    return self.info(info_or_id)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 875, in info
    self._update_index()
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 825, in _update_index
    ElementTree.parse(compat.urlopen(self._url)).getroot())
  File "/usr/local/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/usr/local/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 23, column 143
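
A quick way to catch this kind of breakage before publishing is to parse the generated index locally with the same standard-library parser the downloader uses; a minimal sketch:

import xml.etree.ElementTree as ElementTree

# Parsing the generated index locally raises the same ParseError the
# downloader hits if index.xml is not well-formed.
root = ElementTree.parse('index.xml').getroot()
print(root.tag, len(root))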

Re-train maxent_treebank_pos_tagger

It currently doesn't unpickle under Python 3.x. I guess this is because of http://bugs.python.org/issue6784: the Treebank corpus reader returned bytestrings under Python 2.x and the pickled classifier was trained on them; Python 3.x tries to decode them to unicode and fails because the encoding is unknown. I think the way to fix this is to re-train the classifier on Python 2.x but with unicode strings as features; this should be backwards-compatible if I'm not mistaken.
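
A possible workaround on the loading side (not a fix for the published pickle itself) is to pass an explicit encoding when unpickling a Python-2-era file under Python 3; a hedged sketch, assuming the pickled bytestrings are Latin-1-compatible and using a placeholder filename:

import pickle

# 'treebank.tagger.pickle' is a placeholder filename for the problematic pickle.
with open('treebank.tagger.pickle', 'rb') as f:
    # encoding='latin-1' maps Python 2 bytestrings onto str without decode
    # errors; whether the resulting tagger then behaves correctly still needs
    # to be verified.
    tagger = pickle.load(f, encoding='latin-1')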

Type inconsistency between NLTK Wordnet and OMW for all_lemma_names()

NLTK's Wordnet and Open Multilingual Wordnet ("OMW") share a common function, but produce output of differing types:

import nltk
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn
for lang in sorted(wn.langs()):
    print lang, type(wn.all_lemma_names(lang=lang))

produces:

als <type 'list'>
arb <type 'list'>
bul <type 'list'>
cat <type 'list'>
cmn <type 'list'>
dan <type 'list'>
ell <type 'list'>
eng <type 'dictionary-keyiterator'>
eus <type 'list'>
fas <type 'list'>
fin <type 'list'>
fra <type 'list'>
glg <type 'list'>
heb <type 'list'>
hrv <type 'list'>
ind <type 'list'>
ita <type 'list'>
jpn <type 'list'>
nno <type 'list'>
nob <type 'list'>
pol <type 'list'>
por <type 'list'>
qcn <type 'list'>
slv <type 'list'>
spa <type 'list'>
swe <type 'list'>
tha <type 'list'>
zsm <type 'list'>

This inconsistency forces users into workarounds that make code less comprehensible than it needs to be. For example, until this bug is fixed, developers may need to wrap the result in Python's built-in list() to ensure that a list is returned regardless of which OMW language code is passed to all_lemma_names(), as in the sketch below.
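
A minimal sketch of that workaround:

from nltk.corpus import wordnet as wn

# Wrapping the result in list() gives the same type for every language code.
eng = list(wn.all_lemma_names(lang='eng'))
fra = list(wn.all_lemma_names(lang='fra'))
print(type(eng) is type(fra))  # True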

This inconsistency also partly explains the zero result for English (eng) in #42 .

Crubadan corpus replacement/addition

The crubadan site (http://crubadan.org/) explains most of the project; the corpora are released under a Creative Commons license, and the site links to the numerous corpora available under that license. There are over 2100 collections and they are updated frequently. I understand that a typical user would probably want one language at a time, so if there is a specific way you would all like this handled, please contact me on GitHub.

We also have a rough reader written that works off the crubadan language codes. It is located at https://github.com/BrennanG/nltk/blob/develop/nltk/corpus/reader/crubadan.py. If you have any suggestions for it, please do not hesitate to let me know.

Thanks,

Dustin Joosten

brown.zip (possible others) corrupt

When I attempt to download the brown.zip corpus and extract with the nltk tools I get the message "Error with downloaded zip file".

When I attempt to download it manually it seems to download just fine, but then when I go to extract manually:

$ unzip brown.zip
Archive:  brown.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of brown.zip or
        brown.zip.zip, and cannot find brown.zip.ZIP, period.
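
A quick way to distinguish a truncated download from genuine corruption is to test the archive with Python's zipfile module before extracting; a minimal sketch:

import zipfile

if not zipfile.is_zipfile('brown.zip'):
    print('Not a valid zip archive (likely truncated or an error page).')
else:
    with zipfile.ZipFile('brown.zip') as zf:
        bad = zf.testzip()  # returns the first corrupt member name, or None
        print('OK' if bad is None else 'Corrupt member: %s' % bad)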

hmm_treebank_pos_tagger doesn't load

In [7]: _POS_TAGGER = 'taggers/hmm_treebank_pos_tagger/treebank.tagger.pickle.gz'

In [8]: pickle.load(nltk.data.find(_POS_TAGGER).open())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-3e64e4dab40b> in <module>()
----> 1 pickle.load(nltk.data.find(_POS_TAGGER).open())

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load(file)
   1376 
   1377 def load(file):
-> 1378     return Unpickler(file).load()
   1379 
   1380 def loads(str):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in load_reduce(self)
   1131         args = stack.pop()
   1132         func = stack[-1]
-> 1133         value = func(*args)
   1134         stack[-1] = value
   1135     dispatch[REDUCE] = load_reduce

/Users/kmike/envs/nltk/lib/python2.7/copy_reg.pyc in _reconstructor(cls, base, state)
     46 def _reconstructor(cls, base, state):
     47     if base is object:
---> 48         obj = object.__new__(cls)
     49     else:
     50         obj = base.__new__(cls, state)

TypeError: object.__new__(ConditionalProbDist) is not safe, use collections.defaultdict.__new__()

Make gh-pages branch the default

You can make the gh-pages branch the GitHub default. Not a big deal, but might help some people browsing around and unfamiliar with the GitHub UI. Within the Settings tab on the right, there's a dropdown menu for Default Branch.

Update WordNet data files to 3.1

WordNet 3.1 provides updated data files in the same format as 3.0, plus a host of additional files. However, the lexnames file is gone.

Stemming irregular verbs and nouns

Greetings. I intend to work on the Reuters dataset. In the first phase I need to normalize the dataset and perform stemming. Given that my dataset is in Latin script, how can I stem irregular verbs and nouns?
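
A rule-based stemmer will not handle irregular forms; a dictionary-based lemmatizer is usually needed. A minimal sketch using NLTK's WordNet lemmatizer (the example words are illustrative, not from the Reuters data):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Irregular forms need the right part-of-speech hint ('n' for nouns, 'v' for verbs).
print(lemmatizer.lemmatize('geese', pos='n'))  # goose
print(lemmatizer.lemmatize('ran', pos='v'))    # run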

Common contracted forms are missing from the English stop word list

While the list contains s and t (most likely because they can occur after an apostrophe as part of a contraction, e.g. in dog's and can't), other common forms are missing:

  • d as in she'd,
  • ll as in we'll,
  • m as in I'm,
  • o as in o'clock,
  • re as in you're,
  • ve as in they've,
  • y as in y'all

Also missing are the parts of these contractions that fall to the left of the apostrophe, e.g. ain (but don is there).

Of course, the absence of these forms could be justified by pointing out that if the tokenizer does not split at apostrophes, these forms will never occur in the tokenized text. However, that is a strong assumption, especially since nltk's own Punkt tokenizer, for instance, does split at apostrophes. Also, some of the contractions seem to be handled (don't, can't, the possessive s), so it makes little sense to leave out the rest.
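
Until the list is updated, one workaround is simply to extend the stopword set with the pieces listed above; a minimal sketch:

from nltk.corpus import stopwords

# Contraction pieces reported missing above.
extra = {'d', 'll', 'm', 'o', 're', 've', 'y', 'ain'}
english_stopwords = set(stopwords.words('english')) | extra
print('ll' in english_stopwords)  # True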

Store PanLex corpus in Git LFS

The PanLex corpus is too large to store directly in git, so it should be stored in Git LFS and pulled from there instead. This also would take the load off of PanLex's dev server and permit secure downloads of the corpus (rather than downloading via insecure/unencrypted http://).

panlex_lite.xml causes build failure

Trying to build index.xml, but I keep getting this error. Any suggestions?

$ python tools/build_pkg_index.py . https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages index.xml
Traceback (most recent call last):
  File "C:\Users\XXX\Anaconda3\lib\site-packages\nltk\downloader.py", line 2200, in _find_packages
    try: zf = zipfile.ZipFile(zipfilename)
  File "C:\Users\XXX\Anaconda3\lib\zipfile.py", line 1009, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '.\\packages\\corpora\\panlex_lite.zip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/build_pkg_index.py", line 24, in <module>
    index = build_index(ROOT, BASE_URL)
  File "C:\Users\XXX\Anaconda3\lib\site-packages\nltk\downloader.py", line 2085, in build_index
    for pkg_xml, zf, subdir in _find_packages(os.path.join(root, 'packages')):
  File "C:\Users\XXX\Anaconda3\lib\site-packages\nltk\downloader.py", line 2203, in _find_packages
    (zipfilename, e))
ValueError: Error reading file '.\\packages\\corpora\\panlex_lite.zip'!
[Errno 2] No such file or directory: '.\\packages\\corpora\\panlex_lite.zip'

>>> import nltk;
>>> print('The nltk version is {}.'.format(nltk.__version__))
The nltk version is 3.2.1.

How can I contribute a corpus?

I'd like to contribute a corpus. What format does the corpus need to be in? Does it need to be POS tagged?

It's a corpus of about 65K books from the British Library. Currently, they're only XML files, but I'm working on getting them in plaintext, as well. You can see a few sample files in https://github.com/Git-Lit/git-lit/tree/master/data2. It's about 1TB, or 250GB compressed, so it won't fit in this GH repo. However, I'm making github repositories for each text in the corpus. So all that would be needed is a way for nltk.corpus.download() to grab each text in this corpus, given a URL for each one. What would be the best way of doing that?
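
In the meantime, individual texts could be fetched by URL and read with a standard corpus reader; a minimal sketch, where the URL, directory, and filename are placeholders:

import os
import urllib.request

from nltk.corpus.reader import PlaintextCorpusReader

# Placeholder URL and filename for one plaintext volume of the proposed corpus.
url = 'https://example.org/british-library/volume-0001.txt'
os.makedirs('bl_corpus', exist_ok=True)
urllib.request.urlretrieve(url, 'bl_corpus/volume-0001.txt')

reader = PlaintextCorpusReader('bl_corpus', r'.*\.txt')
print(len(reader.words('volume-0001.txt')))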

Different API calls for obtaining all lemma names in NLTK's Open Multilingual Wordnet produce inconsistent results

This bug appears to be related to #42, but is of a more general character.

import nltk
from tabulate import tabulate
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

table = list()

for lang in sorted(wn.langs()):
    my_set_of_all_lemma_names = set()
    from nltk.corpus import wordnet as wn
    for aln_term in list(wn.all_lemma_names(lang=lang)):
        for synset in wn.synsets(aln_term):
            for lemma in synset.lemma_names():
                my_set_of_all_lemma_names.add(lemma)
    table.append([lang,
        len(set(wn.all_lemma_names(lang=lang))),
        len(my_set_of_all_lemma_names)])

print(tabulate(table,
    headers=["Language code",
        "all_lemma_names()",
        "lemma_name.synset.lemma.lemma_names()"]))

produces (with headers condensed onto multiple lines, and column markers added):

Language | all_lemma_names() | lemma_name.synset
code     |                   | .lemma.lemma_names()
-------- | ----------------- | --------------------
als      |              5988 |                 2477
arb      |             17785 |                   54
bul      |              6720 |                    0
cat      |             46534 |                24368
cmn      |             61532 |                   13
dan      |              4468 |                 4336
ell      |             18229 |                  800
eng      |            147306 |               148730
eus      |             26242 |                 6055
fas      |             17560 |                    0
fin      |            129839 |                49042
fra      |             55350 |                45367
glg      |             23125 |                12893
heb      |              5325 |                    0
hrv      |             29010 |                 8596
ind      |             36954 |                21780
ita      |             41855 |                13225
jpn      |             89637 |                 1028
nno      |              3387 |                 3255
nob      |              4186 |                 3678
pol      |             45387 |                10844
por      |             54069 |                21889
qcn      |              3206 |                    0
slv      |             40236 |                25363
spa      |             36681 |                20922
swe      |              5824 |                 4640
tha      |             80508 |                  622
zsm      |             33932 |                19253

As with #42, it is interesting that sometimes the first API call finds more lemma names; and sometimes the second API call finds more. That again suggests to me that this behaviour does indeed represent a bug (or perhaps a series of bugs), and is not intentional.

Help importing the Reuters dataset

Please explain how to import the Reuters dataset. I want to load the Reuters corpus in Python, but I get an error.

I have installed the related corpora; every corpus loads except Reuters.

from nltk.corpus import reuters
reuters.fileids()

Traceback (most recent call last):
  File "D:\Python34kk\lib\site-packages\nltk\corpus\util.py", line 63, in __load
    try: root = nltk.data.find('corpora/%s' % zip_name)
  File "D:\Python34kk\lib\site-packages\nltk\data.py", line 618, in find
    raise LookupError(resource_not_found)

Problem loading machado corpus with python 3.5

The following code, which used to work a while ago, now fails with Python 3.5:

from nltk.corpus import machado

textos = [machado.raw(id) for id in machado.fileids()]
len(textos)

This yields the following error:

AssertionError                            Traceback (most recent call last)
<ipython-input-23-3d2409a86302> in <module>()
----> 1 textos = [machado.raw(id) for id in machado.fileids()]
      2 len(textos)

<ipython-input-23-3d2409a86302> in <listcomp>(.0)
----> 1 textos = [machado.raw(id) for id in machado.fileids()]
      2 len(textos)

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/plaintext.py in raw(self, fileids, categories)
    158     def raw(self, fileids=None, categories=None):
    159         return PlaintextCorpusReader.raw(
--> 160             self, self._resolve(fileids, categories))
    161     def words(self, fileids=None, categories=None):
    162         return PlaintextCorpusReader.words(

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/plaintext.py in raw(self, fileids)
     72         if fileids is None: fileids = self._fileids
     73         elif isinstance(fileids, string_types): fileids = [fileids]
---> 74         return concat([self.open(f).read() for f in fileids])
     75 
     76     def words(self, fileids=None):

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/plaintext.py in <listcomp>(.0)
     72         if fileids is None: fileids = self._fileids
     73         elif isinstance(fileids, string_types): fileids = [fileids]
---> 74         return concat([self.open(f).read() for f in fileids])
     75 
     76     def words(self, fileids=None):

/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/api.py in open(self, file)
    208         """
    209         encoding = self.encoding(file)
--> 210         stream = self._root.join(file).open(encoding)
    211         return stream
    212 

/usr/local/lib/python3.5/dist-packages/nltk/data.py in open(self, encoding)
    503 
    504     def open(self, encoding=None):
--> 505         data = self._zipfile.read(self._entry)
    506         stream = BytesIO(data)
    507         if self._entry.endswith('.gz'):

/usr/local/lib/python3.5/dist-packages/nltk/data.py in read(self, name)
    980         self.fp = open(self.filename, 'rb')
    981         value = zipfile.ZipFile.read(self, name)
--> 982         self.close()
    983         return value
    984 

/usr/lib/python3.5/zipfile.py in close(self)
   1610             fp = self.fp
   1611             self.fp = None
-> 1612             self._fpclose(fp)
   1613 
   1614     def _write_end_record(self):

/usr/lib/python3.5/zipfile.py in _fpclose(self, fp)
   1714 
   1715     def _fpclose(self, fp):
-> 1716         assert self._fileRefCnt > 0
   1717         self._fileRefCnt -= 1
   1718         if not self._fileRefCnt and not self._filePassed:

AssertionError: 

Additional categories for different NLTK usages

We have all-corpora and all, but it would be nice to add several new categories, such as:

  • popular

    • punkt
    • stopwords
    • wordnet
    • averaged_perceptron_tagger
    • brown
    • movie_reviews
    • words
  • tokenizers

    • punkt
    • snowball
    • perluniprops
    • nonbreaking_prefixes

That way, I think it would be easier to advise users to do the following to install NLTK:

pip install -U nltk
python -m nltk.downloader popular
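
The same collection could also be fetched from Python once the proposed category exists in index.xml; a minimal sketch:

import nltk

# Python equivalent of `python -m nltk.downloader popular`.
nltk.download('popular')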

More importantly, I suggest adding all-no-third-party and all-third-party, so that we can separate the issues that arise when third-party datasets/models refresh their data without updating their checksums in nltk.

@stevenbird Are the suggestions okay? How should we go about adding these categories?

Duplicates in words.words() dictionary

The words.words() word list contains 844 duplicate entries, which might as well be eliminated. I discovered this because some of them were out of alphabetical order.

Here is some Python to illustrate this:

>>> from nltk.corpus import words
>>> len(words.words())
236736
>>> len(set(words.words()))
235892
>>> 236736 - 235892
844
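
The duplicated entries themselves can be listed with a Counter; a minimal sketch:

from collections import Counter

from nltk.corpus import words

counts = Counter(words.words())
duplicates = sorted(w for w, c in counts.items() if c > 1)
print(len(duplicates), duplicates[:10])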

Thanks.

Word missing in words

'children' is missing from the English word list.
I understand that some plurals are missing, but some irregular plurals are included, such as 'hippopotami' and 'corpora'. It would be difficult to derive 'children' from 'child' programmatically.
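
As a workaround, a dictionary-based lemmatizer can map the irregular plural back to a headword that is in the list; a minimal sketch:

from nltk.corpus import words
from nltk.stem import WordNetLemmatizer

vocab = set(words.words())
lemmatizer = WordNetLemmatizer()

print('children' in vocab)                         # False, per this report
lemma = lemmatizer.lemmatize('children', pos='n')  # 'child'
print(lemma, lemma in vocab)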

Stopwords for Kazakh language (kazakh)

Suggested NLTK name for the stopwords: kazakh
There is no open stopword list available for the Kazakh language, so this list was compiled by our team (Almaty, Kazakhstan) and is freely redistributable.

panlex_lite.zip is broken

The 1.7 GB file is broken and causes the downloader to fail.
Proof:

$ wget http://dev.panlex.org/db/panlex_lite.zip
$ unzip panlex_lite.zip
Archive:  panlex_lite.zip
warning [panlex_lite.zip]:  76 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [panlex_lite.zip]:  reported length of central directory is
  -76 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1
  zipfile?).  Compensating...
   skipping: panlex_lite/db.sqlite   need PK compat. v4.5 (can do v2.1)
   creating: panlex_lite/
  inflating: panlex_lite/README.txt

note:  didn't find end-of-central-dir signature at end of central dir.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

Capitalization inconsistencies in NLTK's Open Multilingual Wordnet

NLTK's Open Multilingual Wordnet ("OMW") corpus data violates the principle of least surprise, in the following respect.

A user can reasonably expect that any NLTK function that:

  • produces a list of all lemmas given a language, and
  • in that list, retains the language's typical capitalisation of those lemmas

will do the same thing for every other language in OMW (if it is a language that uses capital letters).

However, NLTK violates that expectation:

import nltk
import re
capital = re.compile(r'[A-Z]')
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn
for lang in sorted(wn.langs()):
    print(lang, len(list(filter(capital.match, wn.all_lemma_names(lang=lang)))))

produces:

als 14
arb 6
bul 1
cat 4139
cmn 55
dan 3
ell 314
eng 0
eus 722
fas 0
fin 30861
fra 11818
glg 4272
heb 2
hrv 2505
ind 15072
ita 2416
jpn 168
nno 1
nob 3
pol 2492
por 21850
qcn 0
slv 2888
spa 6493
swe 5
tha 0
zsm 11193

Some of the zero or very low results above are probably due to the language not generally using the letters A-Z. However, not all of the results can be accounted for in this way; those that cannot represent inconsistencies in OMW.

I have not yet investigated which kinds of inconsistency they represent. They may well represent more than one kind of inconsistency.

Stopwords for Slovene

I have a list of stopwords for the Slovene language and I'd like to include it in NLTK so that we can use it from within the library. I've tried to submit a PR, but you seem to package everything as .zip files. What should be in them? A .txt file? Please provide slightly more detailed contribution guidelines.

Thanks!
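
For illustration only (this is an assumption about the packaging, not official guidance): the stopwords package appears to be a zip containing a top-level stopwords/ directory with one plain-text word list per language, one word per line. A minimal sketch of producing such a file and archive, with a hypothetical handful of Slovene words:

import os
import zipfile

# Hypothetical handful of Slovene stopwords; the real list goes here,
# one word per line.
slovene_stopwords = ['in', 'na', 'za', 'je', 'se']

os.makedirs('stopwords', exist_ok=True)
with open('stopwords/slovene', 'w', encoding='utf-8') as f:
    f.write('\n'.join(slovene_stopwords) + '\n')

with zipfile.ZipFile('stopwords.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('stopwords/slovene')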

Konkani corpus contribution

Konkani (Kōṅkaṇī) is an Indo-Aryan language belonging to the Indo-European family of languages and is spoken along the western coast of India. It is one of the 22 scheduled languages mentioned in the 8th schedule of the Indian Constitution and the official language of the Indian state of Goa. The first Konkani inscription is dated 1187 A.D. It is a minority language in Maharashtra, Karnataka, northern Kerala (Kasaragod district), Dadra and Nagar Haveli, and Daman and Diu.

Konkani is a member of the southern Indo-Aryan language group. It retains elements of Old Indo-Aryan structures and shows similarities with both western and eastern Indo-Aryan languages. [Reference: https://en.wikipedia.org/wiki/Konkani_language]

I would like to contribute a part-of-speech-tagged corpus in Konkani.

NLTK name: konkani
Corpus reader: TaggedCorpusReader
Source: I have collected the corpus from various sources such as newspapers, magazines, periodicals, and academic texts.
The corpus is freely redistributable.
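
For reference, TaggedCorpusReader reads whitespace-separated word/TAG tokens by default; a minimal sketch with a hypothetical directory name and file pattern:

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical layout: a directory of .pos files whose whitespace-separated
# tokens look like word/TAG, e.g. a line such as "word1/NN word2/VB ./SYM".
reader = TaggedCorpusReader('konkani_corpus', r'.*\.pos')
print(reader.tagged_words()[:5])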
