dkpro / dkpro-jwktl

Java Wiktionary Library

Home Page: http://dkpro.org/dkpro-jwktl/

License: Apache License 2.0

Java 99.99% Perl 0.01%

dkpro-jwktl's Introduction

JWKTL

Summary

JWKTL (Java-based Wiktionary Library) is an application programming interface for the free multilingual online dictionary Wiktionary (https://www.wiktionary.org). Wiktionary is collaboratively constructed by volunteers and continually growing. JWKTL enables efficient and structured access to the information encoded in the English and German Wiktionary language editions, including sense definitions, part of speech tags, etymology, example sentences, translations, semantic relations, and many other lexical information types. The API was first described in an LREC 2008 paper.

Further information and documentation are available from the project homepage:

https://dkpro.github.io/dkpro-jwktl/

License

JWKTL is available as open source software under the Apache License 2.0 (ASL). The software thus comes "as is" without any warranty (see the license text for more details). JWKTL makes use of Berkeley DB Java Edition 5.0.73 (Sleepycat License), Apache Ant 1.7.1 (ASL), JUnit 4.12 (CPL), and Wikokit (New BSD license). For the respective third party licenses, see NOTICE.txt.

Publications

A more detailed description of Wiktionary and JWKTL is available in our scientific articles:

Christian M. Meyer and Iryna Gurevych: Wiktionary: A new rival for expert-built lexicons? Exploring the possibilities of collaborative lexicography, Chapter 13 in S. Granger & M. Paquot (Eds.): Electronic Lexicography, pp. 259–291, Oxford: Oxford University Press, November 2012. http://ukcatalogue.oup.com/product/9780199654864.do
Christian M. Meyer and Iryna Gurevych: OntoWiktionary – Constructing an Ontology from the Collaborative Online Dictionary Wiktionary, Chapter 6 in M. T. Pazienza and A. Stellato (Eds.): Semi-Automatic Ontology Development: Processes and Resources, pp. 131–161, Hershey, PA: IGI Global, February 2012. https://www.ukp.tu-darmstadt.de/data/lexical-resources/wiktionary/ontowiktionary/
Torsten Zesch, Christof Müller, and Iryna Gurevych: Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary, in: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pp. 1646–1652, May 2008. Marrakech, Morocco. http://lrec-conf.org/proceedings/lrec2008/summaries/420.html

Please cite a JWKTL-related article if you use the software in your scientific work.

Project Background

Prior to being available as open source software, JWKTL was a research project at the Ubiquitous Knowledge Processing (UKP) Lab of Technische Universität Darmstadt, Germany, under the auspices of Prof. Iryna Gurevych. Since becoming open source software, JWKTL has been developed by multiple contributors (see CONTRIBUTORS.txt for details).

Contact

In case of any questions, please contact Christian M. Meyer.

dkpro-jwktl's Issues

Ability to access SubSenses separately

I would like to get subsenses and be able to iterate over those. For example:
https://en.wiktionary.org/wiki/train
one subsense could be "A sequence of events or ideas which are interconnected; a course or procedure of something"
Currently, all subsenses are merged into a single sense.

Dump parser is timezone dependent

Originally reported on Google Code with ID 9

Timestamp parsing in WiktionaryDumpParser is currently timezone dependent. When the
code is executed on a non-UTC system the output will be wrong.

The attached patch fixes this and also makes the tests timezone independent.

cf. https://bitbucket.org/jberkel/jwktl/commits/3bf75f7fd67c619c027f7fb575e34f997a2254b6

Reported by jan.berkel on 2014-11-18 01:06:04

- _Attachment: [timestamp.patch](https://storage.googleapis.com/google-code-attachments/jwktl/issue-9/comment-0/timestamp.patch)_
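
The idea behind the fix can be illustrated with a minimal sketch. It is not the attached patch: the class and method names are hypothetical, and it assumes the dump timestamps have the form 2014-11-18T01:06:04Z. Pinning the formatter to UTC makes the result independent of the JVM's default time zone.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of JWKTL.
final class DumpTimestamps {
    // Parses a timestamp such as "2014-11-18T01:06:04Z" as UTC,
    // regardless of the default time zone of the JVM.
    static Date parseUtc(String timestamp) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        // Without this line the formatter would interpret the fields in local time.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(timestamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unexpected timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Tokyo"));
        Date inTokyo = parseUtc("2014-11-18T01:06:04Z");
        TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));
        Date inNewYork = parseUtc("2014-11-18T01:06:04Z");
        // Both parses denote the same instant.
        System.out.println(inTokyo.equals(inNewYork));
    }
}
```

A test that switches the default time zone, as in main above, is one way to keep such code timezone independent.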

ArrayIndexOutOfBoundsException when parsing invalid translation template Üt

When parsing the current wiktionary dump I get the following exception:

java.lang.ArrayIndexOutOfBoundsException: 2
	at de.tudarmstadt.ukp.jwktl.parser.de.components.DETranslationHandler.processBody(DETranslationHandler.java:150)
	at de.tudarmstadt.ukp.jwktl.parser.WiktionaryEntryParser.parse(WiktionaryEntryParser.java:149)
	at de.tudarmstadt.ukp.jwktl.parser.WiktionaryArticleParser.setText(WiktionaryArticleParser.java:133)
	at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.setText(WiktionaryDumpParser.java:247)
	at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.onElementEnd(WiktionaryDumpParser.java:175)
	at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser$XMLDumpHandler.endElement(XMLDumpParser.java:83)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1783)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2970)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parseStream(XMLDumpParser.java:130)
	at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:121)
	at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:78)
	at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:140)
	at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:114)

The problem appears here:

							translationText = fields[2].trim();

If the template Üt is used incorrectly, like {{Üt|}}, the array fields probably has only 0 or 1 elements, so accessing fields[2] without a length check produces the exception above.

Suggested fix: check for the length of the array, if <3, use null as translationText.

I'll fix this and provide a PR.
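
A minimal sketch of the suggested length check follows. How DETranslationHandler actually builds the fields array is not shown in this report, so splitting on the pipe character and the helper name are assumptions made for illustration.

```java
// Hypothetical helper, not the actual DETranslationHandler code.
final class TranslationFields {
    // Returns the trimmed third pipe-separated field of a template body,
    // or null if the template has fewer than three fields.
    static String translationText(String templateBody) {
        // The limit -1 keeps trailing empty fields, so "a|b|" yields three fields.
        String[] fields = templateBody.split("\\|", -1);
        if (fields.length < 3) {
            return null;
        }
        return fields[2].trim();
    }

    public static void main(String[] args) {
        // "\u00DCt|" stands for the malformed template body with an empty argument.
        System.out.println(translationText("\u00DCt|"));
        System.out.println(translationText("\u00DCt|en| word |word"));
    }
}
```

Callers then have to treat a null translation text as "no translation" instead of relying on an exception.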

Getting audio

Originally reported on Google Code with ID 19

Is there any way to get audio?
I tried to get audio through
IWiktionaryPage page = wkt.getPageForWord("boat");
System.out.println(page.getEntry(0).getPronunciations().get(0).getText());

but I always get the same type (IPA).

Is there any way to get the AUDIO type?

Reported by rilot77 on 2015-05-15 19:27:03

XML parse error

Originally reported on Google Code with ID 6

=> What steps will reproduce the problem?
1. Parse the data with the following code:
public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;

    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}

2. Use the two latest dump files from Wiktionary:
     enwiktionary-20140504-pages-articles.xml
     enwiktionary-latest-pages-articles.xml


=> What is the expected output? What do you see instead?
INFO: Parsed 775000 pages
Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: XML parse error
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:140)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:74)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
    at ParsingData.main(ParsingData.java:12)
Caused by: org.xml.sax.SAXParseException; lineNumber: 34869191; columnNumber: 5; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:131)
    ... 4 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    ... 14 more

=> What version of the product are you using? On what operating system?
   <dependency>
     <groupId>de.tudarmstadt.ukp.jwktl</groupId>
     <artifactId>jwktl</artifactId>
     <version>1.0.0</version>
   </dependency>

=> Please provide any additional information below.
However, the parsing went through successfully with the dump file "enwiktionary-20140415-pages-articles.xml"

Reported by [email protected] on 2014-05-21 01:34:15

Add inflection group property to the word form

As discussed in #57, add the int inflectionGroup property to IWiktionaryWordForm.

The purpose of this property is to identify which "column" of the inflection table a word form belongs to.

Consider the word Fels for example:

{{Deutsch Substantiv Übersicht
|Genus 1=m
|Genus 2=m
|Nominativ Singular 1=Fels
|Nominativ Singular 2=Fels
|Nominativ Plural=Felsen
|Genitiv Singular 1=Fels
|Genitiv Singular 1*=Felses
|Genitiv Singular 1**=Felsens
|Genitiv Singular 2=Felsen
|Genitiv Plural=Felsen
|Dativ Singular 1=Fels
|Dativ Singular 2=Felsen
|Dativ Plural=Felsen
|Akkusativ Singular 1=Fels
|Akkusativ Singular 2=Felsen
|Akkusativ Plural=Felsen
}}

Here we would generate 14 word forms with the following properties:

form=Fels, gender=MASC, num=SING, case=NOM, inflectionGroup=1
form=Fels, gender=MASC, num=SING, case=NOM, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=NOM, inflectionGroup=0
form=Fels, gender=MASC, num=SING, case=GEN, inflectionGroup=1
form=Felses, gender=MASC, num=SING, case=GEN, inflectionGroup=1
form=Felsens, gender=MASC, num=SING, case=GEN, inflectionGroup=1
form=Felsen, gender=MASC, num=SING, case=GEN, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=GEN, inflectionGroup=0
form=Fels, gender=MASC, num=SING, case=DAT, inflectionGroup=1
form=Felsen, gender=MASC, num=SING, case=DAT, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=DAT, inflectionGroup=0
form=Fels, gender=MASC, num=SING, case=ACC, inflectionGroup=1
form=Felsen, gender=MASC, num=SING, case=ACC, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=ACC, inflectionGroup=0
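
The mapping from a template label to its inflection group could be sketched as below. This is an illustration under assumptions, not the JWKTL implementation: the class name is hypothetical, and the convention "0 when the label carries no index" is taken from the plural rows in the list above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JWKTL.
final class InflectionLabel {
    // Matches labels such as "Nominativ Plural", "Genitiv Singular 2" or "Genitiv Singular 1**".
    private static final Pattern LABEL = Pattern.compile(
            "(Nominativ|Genitiv|Dativ|Akkusativ) (Singular|Plural)(?: (\\d+))?\\**");

    // Returns the column index of the label, 0 if the label has no index,
    // or -1 if the label is not a case/number label at all (e.g. "Genus 1").
    static int inflectionGroup(String label) {
        Matcher m = LABEL.matcher(label.trim());
        if (!m.matches()) {
            return -1;
        }
        return m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
    }

    public static void main(String[] args) {
        System.out.println(inflectionGroup("Genitiv Singular 1**"));
        System.out.println(inflectionGroup("Nominativ Plural"));
    }
}
```

The trailing asterisks only distinguish variants within one column, so they are matched but do not affect the returned index.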

Parse sub-senses

Originally reported on Google Code with ID 18

Looks like these don't get parsed correctly. For an example see https://en.wiktionary.org/wiki/ter#Portuguese

Reported by jan.berkel on 2015-05-07 19:25:59

Release version 1.1.0

Is a pre-built JWKTL database dump available?

Hi, sorry to post this as an issue, but I figured this would be the best form of communication. I'm wondering whether there are any dumps of the scraped data available which I can play around with? I'd really love to use this project, but am hesitant to learn Java and troubleshoot my way to creating my own database without knowing whether it'll be useful for my purposes.

Thanks for your team's work on this!

Upgrade to Java 8

Originally reported on Google Code with ID 16

This is essentially possible by upgrading to the newest parent POM.

Reported by chmeyer.de on 2015-04-22 13:07:53

NullPointerException in stream.close() when parsing

Originally reported on Google Code with ID 8

What steps will reproduce the problem?
1. Run the following code:
public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;

    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}

2. Use the latest file, enwiktionary_20140908_pages_articles.xml.
The same exception occurs with enwiktionary_20140415_pages_articles.


What is the expected output? What do you see instead?

Info: Parsed 4025411 pages
Exception in thread "main" java.lang.NullPointerException
    at de.tudarmstadt.ukp.jwktl.JWKTL.getVersion(JWKTL.java:60)
    at de.tudarmstadt.ukp.jwktl.parser.WritableBerkeleyDBWiktionaryEdition.saveProperties(WritableBerkeleyDBWiktionaryEdition.java:160)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryArticleParser.onClose(WiktionaryArticleParser.java:115)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.onClose(WiktionaryDumpParser.java:102)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:75)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:151)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:125)
    at de.tudarmstadt.ukp.jwktl.examples.Example1_ParseWiktionaryDump.main(Example1_ParseWiktionaryDump.java:52)

=> stream is null when stream.close() is called, which leads to the NullPointerException in JWKTL.getVersion()


What version of the product are you using? On what operating system?

jwktl 1.0.0 
OS:WIN8

Please provide any additional information below.

The latest file contains 4038950 pages, but the NullPointerException always occurs after 4025411 pages have been parsed.

Reported by ss61418 on 2014-09-16 14:14:14
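
A defensive way to read a version string from a classpath resource is sketched below. It is an assumption-laden illustration rather than the actual JWKTL.getVersion() code: the class name, the resource name, and the property key "version" are hypothetical. The relevant fact is that Class.getResourceAsStream returns null when the resource is missing, so the stream must be checked before use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper, not the actual JWKTL.getVersion() implementation.
final class VersionInfo {
    // Reads the "version" property from a classpath resource and falls back
    // to "unknown" if the resource is missing or unreadable.
    static String readVersion(String resourceName) {
        // try-with-resources skips close() when the resource is null.
        try (InputStream stream = VersionInfo.class.getResourceAsStream(resourceName)) {
            if (stream == null) {
                return "unknown";
            }
            Properties properties = new Properties();
            properties.load(stream);
            return properties.getProperty("version", "unknown");
        } catch (IOException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        // The resource name is an assumption; it does not exist here.
        System.out.println(readVersion("/no-such-file.properties"));
    }
}
```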

Italic definition null

Originally reported on Google Code with ID 15

What steps will reproduce the problem?
1.entries = dict.getEntriesForWord("the", filter, true);
2.entry.getSenses()
3.sense.getGloss().getPlainText()

What is the expected output? 
Definite grammatical article that implies necessarily that an entity it articulates
is presupposed; something already mentioned, or completely specified later in that
same sentence, or assumed already completely specified. 

What do you see instead?
null

What version of the product are you using? On what operating system?
1.0.1, OS X 10.9.5

Please provide any additional information below.
It seems that the italic font is not parsed.

Reported by aronicafrancesco on 2015-04-21 10:28:24

Create word forms for "Deutsch Substantiv Übersicht -sch"

See: https://de.wiktionary.org/wiki/Niederl%C3%A4ndisch

Contains:

{{Deutsch Substantiv Übersicht -sch
}}

Displayed as:

I think we should generate appropriate word forms in this case.

About 2017 dump English Wiktionary

Hi there,
I am trying the library with the latest Wiktionary dump from 2017 and I get an XML parse error. I see that one user reported working with dumps from 2013 (Google Groups) without problems. Do you keep a record of which Wiktionary dumps can be parsed with your library without errors?

Thank you in advance for any help.

Context information for alternative meanings

Originally reported on Google Code with ID 20

Hello, thanks for an awesome library.
I want to propose a new feature: for words with multiple senses, Wiktionary often provides context information that is missing from your API.
Please take a look at the page https://en.wiktionary.org/wiki/fly: the word has a number of different senses with contexts such as (obsolete), (baseball), and (weightlifting).

Could you please add this functionality or point me to where in the sources I could add it?

Reported by acc4konstantin on 2015-05-21 13:04:29

Fix Javadoc warnings

There are a few hundred Javadoc warnings to be seen during the build.

I would like to fix them to have clean Javadoc.

I will work on this issue on a branch of my fork.

Consider `mn` and `mfn` when parsing gender

In some cases the gender of a noun is specified as mn or mfn to indicate multiple genders.

Example for an earlier version of Tetragraph:

=== {{Wortart|Substantiv|Deutsch}}, {{mn}} ===

{{mn}} is actually not correct, as it resolves to Mongolisch. The correct version should be {{mn.}}

Example for {{mfn}} is Flipchart.

Neither mn nor mfn is currently considered by the DEPartOfSpeechHandler; in both cases this results in an empty list of genders, which is obviously incorrect.

I'll fix it and provide a PR.
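
One possible way to handle combined markers is to read them letter by letter, as in the sketch below. This is not the DEPartOfSpeechHandler code: the class is hypothetical and the local Gender enum is only a stand-in for whatever gender type the library uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the actual DEPartOfSpeechHandler code.
final class GenderMarkers {
    // Stand-in for the gender type used by the library.
    enum Gender { MASCULINE, FEMININE, NEUTER }

    // Splits a marker such as "m", "mn", "mn." or "mfn" into its genders.
    // Returns an empty list if the marker contains any other character.
    static List<Gender> parse(String marker) {
        String letters = marker.trim();
        if (letters.endsWith(".")) {
            letters = letters.substring(0, letters.length() - 1);
        }
        List<Gender> genders = new ArrayList<>();
        for (char c : letters.toCharArray()) {
            switch (c) {
                case 'm': genders.add(Gender.MASCULINE); break;
                case 'f': genders.add(Gender.FEMININE); break;
                case 'n': genders.add(Gender.NEUTER); break;
                default: return new ArrayList<>(); // unknown marker
            }
        }
        return genders;
    }

    public static void main(String[] args) {
        System.out.println(parse("mn."));
        System.out.println(parse("mfn"));
    }
}
```

A real handler would also have to decide whether to accept the incorrect {{mn}} spelling, since that template resolves to Mongolisch.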

NullPointerException when filtering by language

Originally reported on Google Code with ID 1

What steps will reproduce the problem?
1. download the dump 'dewiktionary-20130818-pages-articles.xml.bz2'
2. parse it with 'JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, true);'
3. run the snippet you gave as an example on http://code.google.com/p/jwktl/wiki/JWKTLUseCases

  IWiktionaryEdition wkt = JWKTL.openEdition(WIKTIONARY_DIRECTORY);
  WiktionaryEntryFilter filter = new WiktionaryEntryFilter();
  filter.setAllowedWordLanguages(Language.GERMAN);
  filter.setAllowedPartsOfSpeech(PartOfSpeech.ADJECTIVE);
  int deAdjectiveCount = 0;
  for (IWiktionaryEntry entry : wkt.getAllEntries(filter)) {
    System.out.println(entry.getWord());
    deAdjectiveCount++;
  }
  System.out.println("German adjectives: " + deAdjectiveCount);
  wkt.close();

What is the expected output? What do you see instead?

I expect all German adjectives to be printed, followed by their count.
Instead I see a NullPointerException. Here is the stack trace:

Exception in thread "main" java.lang.NullPointerException
    at java.util.TreeMap.getEntry(TreeMap.java:342)
    at java.util.TreeMap.containsKey(TreeMap.java:227)
    at java.util.TreeSet.contains(TreeSet.java:234)
    at de.tudarmstadt.ukp.jwktl.api.filter.WiktionaryEntryFilter.acceptWordLanguage(WiktionaryEntryFilter.java:103)
    at de.tudarmstadt.ukp.jwktl.api.filter.WiktionaryEntryFilter.accept(WiktionaryEntryFilter.java:149)
    at de.tudarmstadt.ukp.jwktl.api.entry.WiktionaryEdition$1.fetchNext(WiktionaryEdition.java:97)
    at de.tudarmstadt.ukp.jwktl.api.entry.WiktionaryEdition$1.fetchNext(WiktionaryEdition.java:81)
    at de.tudarmstadt.ukp.jwktl.api.util.WiktionaryIterator.hasNext(WiktionaryIterator.java:43)
    at WiktionarySearcher.main(WiktionarySearcher.java:24)



What version of the product are you using? On what operating system?

I am using version 1.0.0 on Ubuntu.

Best

Abou

Reported by [email protected] on 2013-08-21 15:06:15
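
The stack trace ends in TreeSet.contains, which throws a NullPointerException for a null element when the set uses natural ordering. A plausible reading, though not confirmed in this report, is that some entries have no word language. The sketch below shows a null-safe membership check; the class is hypothetical, uses plain strings instead of the library's language type, and is not the actual WiktionaryEntryFilter.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a language filter; not the JWKTL class.
final class LanguageFilter {
    private final Set<String> allowed = new TreeSet<>();

    void allow(String languageCode) {
        allowed.add(languageCode);
    }

    // An empty set means "no restriction". Once a restriction exists,
    // entries without a language are rejected instead of causing an exception.
    boolean accept(String entryLanguage) {
        if (allowed.isEmpty()) {
            return true;
        }
        return entryLanguage != null && allowed.contains(entryLanguage);
    }

    public static void main(String[] args) {
        LanguageFilter filter = new LanguageFilter();
        filter.allow("deu");
        System.out.println(filter.accept("deu"));
        System.out.println(filter.accept(null));
    }
}
```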

Support Plural* in noun table

For instance see the word Gelb:

{{Deutsch Substantiv Übersicht
|Genus=n
|Nominativ Singular=Gelb
|Nominativ Plural=Gelbs
|Nominativ Plural*=Gelbtöne
|Genitiv Singular=Gelbs
|Genitiv Plural=Gelbs
|Genitiv Plural*=Gelbtöne
|Dativ Singular=Gelb
|Dativ Plural=Gelbs
|Dativ Plural*=Gelbtönen
|Akkusativ Singular=Gelb
|Akkusativ Plural=Gelbs
|Akkusativ Plural*=Gelbtöne
|Bild 1=Zeichen_306_-_Vorfahrtstraße,_StVO_1970.svg|200px|1|ein Verkehrszeichen mit Gelbanteilen
|Bild 2=Andwil Oberarnegg Briefkasten.jpg|200px|1|Gelb ist die Farbe vieler Postbetriebe, hier ein Schweizer Briefkasten
}}

This uses Plural* for a second plural option.

getting Caused by: com.sleepycat.je.UniqueConstraintException: (JE 5.0.73) Unique secondary key is already present

I get the following stack trace when trying to parse the enwiki-20180420-pages-articles.xml.bz2 dump file. I am using the 1.1.0 version. I will try to parse some older dump files and see what happens.

INFO: Parsed 1975000 pages
Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: Unable to save page Busser
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryArticleParser.saveParsedWiktionaryPage(WiktionaryArticleParser.java:156)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryArticleParser.onPageEnd(WiktionaryArticleParser.java:105)
at java.lang.Iterable.forEach(Iterable.java:75)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.onPageEnd(WiktionaryDumpParser.java:190)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.onElementEnd(WiktionaryDumpParser.java:151)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser$XMLDumpHandler.endElement(XMLDumpParser.java:83)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1783)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2970)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parseStream(XMLDumpParser.java:130)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:121)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:78)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:140)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:114)
at com.lmco.textanalysis.filereader.WiktionaryDB.main(WiktionaryDB.java:19)
Caused by: com.sleepycat.je.UniqueConstraintException: (JE 5.0.73) Unique secondary key is already present
at com.sleepycat.je.SecondaryDatabase.insertKey(SecondaryDatabase.java:997)
at com.sleepycat.je.SecondaryDatabase.updateSecondary(SecondaryDatabase.java:857)
at com.sleepycat.je.SecondaryTrigger.databaseUpdated(SecondaryTrigger.java:41)
at com.sleepycat.je.Database.notifyTriggers(Database.java:2122)
at com.sleepycat.je.Cursor.putNotify(Cursor.java:2136)
at com.sleepycat.je.Cursor.putNoDups(Cursor.java:2052)
at com.sleepycat.je.Cursor.putInternal(Cursor.java:2020)
at com.sleepycat.je.Cursor.putNoOverwrite(Cursor.java:792)
at com.sleepycat.persist.PrimaryIndex.put(PrimaryIndex.java:402)
at com.sleepycat.persist.PrimaryIndex.put(PrimaryIndex.java:335)
at de.tudarmstadt.ukp.jwktl.parser.WritableBerkeleyDBWiktionaryEdition.savePage(WritableBerkeleyDBWiktionaryEdition.java:219)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryArticleParser.saveParsedWiktionaryPage(WiktionaryArticleParser.java:149)
... 23 more

Multiple genders in (German) wiktionary result in empty gender in jwktl

Despite being defined as Substantiv, n, m in the German Wiktionary, the resulting entry for the word 'Liter' has an empty gender field (NULL). This defect is reproducible for all noun pages that have a single entry with multiple genders (e.g. Cola). However, in cases of multiple entries per page with (at least) one unambiguous and one ambiguous gender definition (e.g. Spezi), a gender is assigned to all entries.

The bug results from the insufficient specification of possible gender matches in parser/de/components/DEPartOfSpeechHandler.java.

Build fails under Java 8 due to Javadoc errors

I'm trying to build the project under Java 8. The build fails due to Javadoc errors.

Maven logs:

std.txt
err.txt

PR with the fix follows.

Entry link missing for some German entries

Originally reported on Google Code with ID 21

What steps will reproduce the problem?
1. query entry for 'Eingaben' with current de dump.

What is the expected output?
an entry with an entry link to 'Eingabe'

What do you see instead?
link is missing;

What version of the product are you using?
current trunk

Adding 'Grundformverweis Dekl' and 'Grundformverweis Konj' as entry link template solves
this issue -- I've attached a patch.

Reported by kgschwebke on 2015-06-28 06:10:11

- _Attachment: [DEEntryLinkHandler.java.patch](https://storage.googleapis.com/google-code-attachments/jwktl/issue-21/comment-0/DEEntryLinkHandler.java.patch)_

Add gender to singular word forms in German

Please see the discussion in #57.

Scope is only German language.
Add GrammaticalGender getGender() to the IWiktionaryWordForm.
When parsing the noun table, consider labels:
- Genus
- Genus 1
- Genus 2
- Genus 3
- Genus 4
- In case a label does not have the value m, n or f, log a warning.
When parsing the noun table, consider labels:
- Singular
- Singular 1, Singular 1*, Singular 1**
- Singular 2, Singular 2*, Singular 2**
- Singular 3, Singular 3*, Singular 3**
- Singular 4, Singular 4*, Singular 4**
Assign the gender with the corresponding index in the word form. If there is no gender with the corresponding index, log a warning and assign null as gender to the word form.
For word forms with "plural" label assign null as gender to the word form.
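
The rules above could be sketched as follows. This is only an illustration under assumptions: the class is hypothetical, genders are kept as the raw strings m, n and f, unindexed labels are stored under index 0, and the warnings mentioned above are omitted for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JWKTL.
final class GenderByIndex {
    private static final Pattern GENUS = Pattern.compile("Genus(?: (\\d+))?");
    private static final Pattern FORM =
            Pattern.compile("\\p{L}+ (Singular|Plural)(?: (\\d+))?\\**");

    private final Map<Integer, String> genders = new HashMap<>();

    // Records "Genus", "Genus 1", ... parameters; other labels and values are ignored here.
    void addParameter(String label, String value) {
        Matcher m = GENUS.matcher(label);
        if (m.matches() && value.matches("[mnf]")) {
            genders.put(index(m.group(1)), value);
        }
    }

    // Returns the gender for a form label such as "Genitiv Singular 1**",
    // or null for plural labels and for singular labels without a matching Genus.
    String genderFor(String formLabel) {
        Matcher m = FORM.matcher(formLabel);
        if (!m.matches() || m.group(1).equals("Plural")) {
            return null;
        }
        return genders.get(index(m.group(2)));
    }

    private static int index(String digits) {
        return digits == null ? 0 : Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        GenderByIndex table = new GenderByIndex();
        table.addParameter("Genus 1", "m");
        table.addParameter("Genus 2", "n");
        System.out.println(table.genderFor("Genitiv Singular 1**"));
        System.out.println(table.genderFor("Nominativ Plural 1"));
    }
}
```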

Problem compiling from Maven central

Originally reported on Google Code with ID 7

I have to admit that I am rather new to Maven. I added 

<dependency>
   <groupId>de.tudarmstadt.ukp.jwktl</groupId>
   <artifactId>jwktl</artifactId>
   <version>1.0.0</version>
</dependency>

to my pom.xml file, but when I tried to compile I get an error caused by xerces. The
solution seems to be to add

<dependency>
   <groupId>xerces</groupId>
   <artifactId>xercesImpl</artifactId>
   <version>2.11.0</version>
</dependency>

in the pom.xml file. After that the importing the dump file worked without a problem.

Reported by c.orasan on 2014-06-05 12:49:40

Replace Xerces

I'm not sure if I've run into bug #6 again, but the current wiktionary dump (20151102) does not parse correctly, failing with

org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.

After some investigation I think I fixed the underlying issue in Xerces (XERCES-J-1668).

However the UTF-8 handling there is quite messy and has no good test coverage. I propose to replace Xerces with something else unless there are objections.

Java's default XMLStreamReader could be a good option; it will probably be more performant as well.
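
To give an impression of the StAX pull-parsing style, here is a minimal sketch that collects page titles from dump-like XML. The class name and the simplified document structure are assumptions, and the sketch makes no claim about whether the JDK's StAX implementation avoids the UTF-8 problem described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example, not part of JWKTL.
final class StaxTitles {
    // Returns the text of every <title> element in document order.
    static List<String> pageTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() consumes the element up to its end tag.
                    titles.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>boat</title></page>"
                + "<page><title>train</title></page></mediawiki>";
        System.out.println(pageTitles(xml));
    }
}
```

In contrast to SAX callbacks, the application pulls events from the reader, which tends to make per-page state easier to follow.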

Model different versions of grammatical number in word forms

Context: I am using JWKTL to work with declension tables for German nouns.

There's a feature I need (and can implement), but I'd like to first discuss what would be the best way to model it.

Basically, I want to be able to produce the declension of a given German noun.
Input: Antwort, output: die Antwort, Genitiv der Antwort, Dativ der Antwort, Akkusativ die Antwort, or something along those lines. So essentially this boils down to generating the full declension table or its columns.

For most cases (ca. 90%) this is pretty straightforward. Two grammatical numbers, four grammatical cases: 8 word forms. Sometimes a few forms are missing, sometimes there are two versions for one number/case, but it's pretty trivial.

But in some cases it gets more complicated. Some words may have several genders and sometimes there are different singular and plural forms. The most extreme example is Eponym with two genders (m, n), two singular and two plural declensions and up to 3 variations per number/case, giving a total of 28 word forms.
But apart from that extreme example, the case with several grammatical numbers is rare, around 4%.

To process such cases, I need to know which words belong to the same "number". Let us take Dschungel for example:

{{Deutsch Substantiv Übersicht
|Genus 1=m
|Genus 2=n
|Genus 3=f
|Nominativ Singular 1=Dschungel
|Nominativ Singular 2=Dschungel
|Nominativ Singular 3=Dschungel
|Nominativ Plural 1=Dschungel
|Nominativ Plural 2=Dschungeln
|Genitiv Singular 1=Dschungels
|Genitiv Singular 2=Dschungels
|Genitiv Singular 3=Dschungel
|Genitiv Plural 1=Dschungel
|Genitiv Plural 2=Dschungeln
|Dativ Singular 1=Dschungel
|Dativ Singular 2=Dschungel
|Dativ Singular 3=Dschungel
|Dativ Plural 1=Dschungeln
|Dativ Plural 2=Dschungeln
|Akkusativ Singular 1=Dschungel
|Akkusativ Singular 2=Dschungel
|Akkusativ Singular 3=Dschungel
|Akkusativ Plural 1=Dschungel
|Akkusativ Plural 2=Dschungeln
|Bild=Hopetoun falls.jpg|200px|1|ein Wasserfall im ''Dschungel''
}}

To create this declension table I have to know not just the basic grammatical number (SINGULAR or PLURAL). I have to know whether it's Singular 1 or Singular 2 etc. Then I can group word forms into a column of the declension table.

However, at the moment JWKTL (quite logically) only models grammatical number SINGULAR or PLURAL. At the moment I can't know if it was Singular 1 or Singular 2 which is my problem.

I would like to add this information to IWiktionaryWordForm, but I am not sure which would be the preferred way to model this. My suggestion would be to simply add the string rawGrammaticalNumber property. There's already something similar in IWiktionaryEntry.getRawHeadwordLine(), so the concept should not be completely out of its way.

Still, I'd like to hear your opinion on this before I actually implement this.

The number of parsed senses seems very small

Hi dkpro-jwktl team,

I cloned the dkpro-jwktl project and was able to parse the following Wiktionary dump, enwiki-20170301-pages-articles.xml.bz2, without a problem after adding two instructions to the private SAXParserFactory getParserFactory() method in XMLDumpParser that increase the number of entries that can be parsed. Before adding these instructions, the library threw an exception after parsing 650,000 entries. Here are the additional instructions that resolve this problem (I found this solution in another thread):

//Original instruction
//return SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);

System.setProperty("jdk.xml.totalEntitySizeLimit", "1500000000");
SAXParserFactory spf = SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
return spf;

After modifying the code, the parsing of the dump finished correctly with no Exceptions or Errors. However, after executing one of the examples, Example3_IterateEntries.java the output showed the following results,

Pages: 10117574
Entries: 3776520
Senses: 986

The number of senses seems too small, since I presume this number should be the largest of the three, or at least equal to the number of pages/entries. I also tried the examples from Word Senses suggested in

https://dkpro.github.io/dkpro-jwktl/documentation/architecture/

with the word "Boat" (which certainly has many more instances) and I got an IndexOutOfBoundsException. When I used "boat" instead, I got a NullPointerException. Do you have any idea why the library is not capturing more senses? Thank you in advance for any help.
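
For anyone who wants to try the two settings from the workaround in isolation, here is a minimal sketch. It uses the JDK's default SAXParserFactory instead of the explicitly named Xerces implementation class, and a tiny in-memory document instead of a real dump; the class and method names are made up for illustration. Turning off secure processing lifts the JDK's built-in XML limits, so this is only reasonable for trusted input such as an official dump.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class RelaxedSaxLimitsSketch {

    /** Builds a factory with the two settings from the workaround applied. */
    static SAXParserFactory newRelaxedFactory() throws Exception {
        // Same property and value as in the workaround above.
        System.setProperty("jdk.xml.totalEntitySizeLimit", "1500000000");
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
        return factory;
    }

    /** Counts page elements in a small in-memory document. */
    static int countPages(String xml) throws Exception {
        final int[] pages = {0};
        SAXParser parser = newRelaxedFactory().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("page".equals(qName)) {
                    pages[0]++;
                }
            }
        });
        return pages[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countPages("<mediawiki><page/><page/></mediawiki>"));
    }
}
```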

Retrieve data in different language than query string?

It would come in handy if one could specify a different language for the definitions than the one used for the query string, a bit like a translator.

For instance, when I want a translation of a word, I go to Wiktionary, search for that word's definition, and then click the language I'm interested in.

Is this supported by JWKTL, or by the Wiktionary API for that matter?

Polish word forms

When analyzing the English Wiktionary, getWordForm() always returns null for IWiktionaryEntry objects with a language of "Polish", even if the corresponding entry has a declension/conjugation table. Is this the expected behavior?

(When trying to analyze the Polish wiktionary, I got "Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: Language Polish is not supported", so I assume that behavior is expected.)

Extract domain labels from the sense definition

Originally reported on Google Code with ID 2

Hi, I have two questions about querying the knowledge domain and the synonyms of entries through the API:

1- Can I also get the domain of an entry using your API? For example, find out that "Diabetes" is a word of the domain "Medizin".

When you enter "Diabetes" in Wiktionary, you will indeed see its "Bedeutung" prefixed with the word "Medizin". That's why I wonder if your API also covers this information.

2- I have the same question concerning synonyms: is there a method like "entry.getSynonyms()"? I could not find a method that returns the synonyms of an entry.

Thank you in advance.
Best

Abou

Reported by [email protected] on 2013-08-21 15:44:00

Can't connect to Database

Hello,

I can't figure out where I'm going wrong. I followed the "tutorial" at https://dkpro.github.io/dkpro-jwktl/documentation/getting-started/

My code is very simple:

package main;

import java.io.File;
import java.util.List;

import de.tudarmstadt.ukp.jwktl.JWKTL;
import de.tudarmstadt.ukp.jwktl.api.IWiktionaryEdition;
import de.tudarmstadt.ukp.jwktl.api.IWiktionaryEntry;
import de.tudarmstadt.ukp.jwktl.api.IWiktionaryPage;
import de.tudarmstadt.ukp.jwktl.api.IWiktionaryRelation;
import de.tudarmstadt.ukp.jwktl.api.PartOfSpeech;
import de.tudarmstadt.ukp.jwktl.api.RelationType;


public class Main {

    final static String PATH_TO_DUMP_FILE = "/GetWords/enwiktionary-20160601-pages-articles-multistream.xml";
    final static String TARGET_DIRECTORY = "/GetWords/";
    final static boolean OVERWRITE_EXISTING_FILES = true;
    /**
     * Simple example which parses an English dump file and prints the entries for the word <i>Wiktionary</i>
     * @param args name of the dump file, output directory for parsed data, ISO language code of the Wiktionary entry language (en/de), boolean value that specifies if existing parsed data should be deleted
     */
    public static void main(String[] args) throws Exception {
        File dumpFile = new File(PATH_TO_DUMP_FILE);
        File outputDirectory = new File(TARGET_DIRECTORY);
        boolean overwriteExisting = OVERWRITE_EXISTING_FILES;

        JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);

        IWiktionaryEdition wkt = JWKTL.openEdition(TARGET_DIRECTORY);

        //TODO: Query the data you need.

        // Close the database connection.
        wkt.close();
    }
}

But the line IWiktionaryEdition wkt = JWKTL.openEdition(TARGET_DIRECTORY); fails to compile with the error: The method openEdition(File) in the type JWKTL is not applicable for the arguments (String). When I pass the argument as a java.io.File instead, the program throws the following error:

Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: Unable to establish a db connection
    at de.tudarmstadt.ukp.jwktl.api.entry.BerkeleyDBWiktionaryEdition.<init>(BerkeleyDBWiktionaryEdition.java:228)
    at de.tudarmstadt.ukp.jwktl.api.entry.BerkeleyDBWiktionaryEdition.<init>(BerkeleyDBWiktionaryEdition.java:205)
    at de.tudarmstadt.ukp.jwktl.JWKTL.openEdition(JWKTL.java:98)
    at de.tudarmstadt.ukp.jwktl.JWKTL.openEdition(JWKTL.java:89)
    at main.Main.main(Main.java:31)
Caused by: java.lang.IllegalArgumentException: Malformed \uxxxx encoding.
    at java.util.Properties.loadConvert(Unknown Source)
    at java.util.Properties.load0(Unknown Source)
    at java.util.Properties.load(Unknown Source)
    at com.sleepycat.je.dbi.DbConfigManager.applyFileConfig(DbConfigManager.java:388)
    at com.sleepycat.je.Environment.setupHandleConfig(Environment.java:323)
    at com.sleepycat.je.Environment.<init>(Environment.java:260)
    at com.sleepycat.je.Environment.<init>(Environment.java:212)
    at de.tudarmstadt.ukp.jwktl.api.entry.BerkeleyDBWiktionaryEdition.connect(BerkeleyDBWiktionaryEdition.java:241)
    at de.tudarmstadt.ukp.jwktl.api.entry.BerkeleyDBWiktionaryEdition.<init>(BerkeleyDBWiktionaryEdition.java:224)
    ... 4 more


Can someone show me where I'm going wrong, or treat this as an issue?
I use these libs:

- jwktl-1.0.1.jar
- je-6.4.25.jar
- apache-ant-1.8.2.jar

Thank you!
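
The "Caused by" part of the trace comes from java.util.Properties.load, which Berkeley DB calls in DbConfigManager.applyFileConfig. That method throws this exact exception when the text it reads contains a backslash followed by "u" that is not a valid Unicode escape. Which file is read and what it contains is not visible in the report, so the input below (an unescaped Windows-style path under a made-up key) is only an assumption about one way to trigger the same message. A minimal sketch:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class MalformedEscapeSketch {

    /** Returns the message of the load failure, or null if loading succeeds. */
    static String loadError(String propertiesText) throws IOException {
        try {
            new Properties().load(new StringReader(propertiesText));
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file content: single backslashes, so the one before
        // "users" is read as the start of a Unicode escape and rejected.
        System.out.println(loadError("some.key=C:\\users\\me\\data"));

        // Same content with the backslashes doubled in the file: loads fine.
        System.out.println(loadError("some.key=C:\\\\users\\\\me\\\\data"));
    }
}
```

If a properties file in the target directory turns out to contain such a path, doubling the backslashes or using forward slashes avoids the escape problem.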

Change contributor attribution

As for the main DKPro projects, remove @author tags, since they are no longer up to date anyway, and introduce a contributor list.

Parse trans-see information

Originally reported on Google Code with ID 14

Some translation data might be missing because of redirects in the form of 

{{trans-see|other-headword}}

https://en.wiktionary.org/wiki/Template:trans-see

Reported by jan.berkel on 2015-03-26 13:35:33
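
As a starting point, the redirect target can be pulled out of the wikitext with a regular expression. This is only a sketch with made-up names, not JWKTL's parser: it handles just the single-parameter form shown above and treats the first parameter as the headword to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TransSeeSketch {

    // Captures the first positional parameter of a trans-see template.
    private static final Pattern TRANS_SEE =
            Pattern.compile("\\{\\{trans-see\\|([^|}]+)");

    /** Returns the headwords referenced by trans-see templates, in order. */
    static List<String> findTargets(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = TRANS_SEE.matcher(wikitext);
        while (matcher.find()) {
            targets.add(matcher.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(findTargets("{{trans-see|other-headword}}"));
    }
}
```

A translation extractor could then look up the returned headword and merge its translations into the redirecting entry.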

Translations from the German Wiktionary cannot be extracted anymore

Originally reported on Google Code with ID 12

In current dump files for dewiktionary, translations are formatted in a different way.
Extraction code should be changed to recognize the changed format.

Reported by chmeyer.de on 2015-03-16 13:43:09

Label for every group of translations

Originally reported on Google Code with ID 3

Hello.

Is there a way with the API to retrieve the label of every group of translations for
a word?

I looked through the API but couldn't find anything.

Thanks in advance.
Best regards.



Gianpiero Venditti

Reported by gianpiero.venditti on 2013-09-02 20:35:46

Move project to GitHub and switch to new DKPro naming conventions

Originally reported on Google Code with ID 17

Next major release should be at the new GitHub site and comply with the new naming conventions:

  <groupId>org.dkpro.jwktl</groupId>
  <artifactId>dkpro-jwktl</artifactId>
  <version>2.0.0</version>

New package root: org.dkpro.jwktl.*

Reported by chmeyer.de on 2015-04-22 13:26:43

German inflections of verbs

Hi,
I'm trying to get the inflections of verbs, for example "reden": https://de.wiktionary.org/wiki/reden
I want to know the first-person simple present form of "reden". The table on that page shows that it should be "ich rede".

I expected the method getWordForms() to do this, but unfortunately I get null for all German verbs. For nouns (for example "Auto", https://de.wiktionary.org/wiki/Auto) I receive the different forms, such as "des Autos". Which method do I have to call to get the inflection of a verb?

Missing POS tags

Interfix

Reference: WT:EL

Update languages_codes.txt

Should be updated with information from Module:languages, since it contains some additional (private) codes.

Add component for hyphenation extraction

Originally reported on Google Code with ID 22

German: separate section "Worttrennung"
English: part of "Pronunciation"

Caveat: maybe different forms (e.g., Plural), use commented list?

Reported by chmeyer.de on 2015-07-02 16:17:39

Etymology paragraph stripped of word hyperlinks

Originally reported on Google Code with ID 11

What steps will reproduce the problem?

1. Output the etymological information for a word.

What is the expected output? What do you see instead?

I did not know what to expect, so I output the etymology information for all words
in the db. These are the first lines from the file created (most entries are like this):

English:dictionary::    , from , from , from , perfect past participle of + .
English:dictionary::    , from , from , from , perfect past participle of + .
English:free::  From , from , , from , . Compare West Frisian , Dutch , German , Danish
.The verb comes from .
English:free::  From , from , , from , . Compare West Frisian , Dutch , German , Danish
.The verb comes from .


What version of the product are you using? On what operating system?

I am using jwktl-1.0.1 as a Maven artifact on an Ubuntu 12.04 machine, with the Wiktionary
dump enwiktionary-20141004-pages-articles.xml

Please provide any additional information below.

The process runs smoothly, all other output information seems fine. This looks like
an "overly eager" clean-up of the paragraphs, since etymological information is given
in a slightly non-standard format. I am not sure if this format changed over time,
or the etymological information was always provided like this.

Reported by [email protected] on 2015-01-26 11:28:43
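
The output above is what one would expect if every template in the etymology paragraph is removed outright. Whether JWKTL does exactly that is an assumption; the sketch below only reproduces the symptom on a hypothetical sample and contrasts it with a clean-up that keeps the first unnamed template parameter as plain text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateStripSketch {

    // Matches one non-nested template such as {{term|word|lang=la}}.
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{([^{}]*)\\}\\}");

    /** Drops every template entirely, which loses the linked words. */
    static String stripTemplates(String wikitext) {
        return TEMPLATE.matcher(wikitext).replaceAll("");
    }

    /** Replaces each template with its first unnamed parameter, if any. */
    static String keepFirstParameter(String wikitext) {
        Matcher matcher = TEMPLATE.matcher(wikitext);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String[] parts = matcher.group(1).split("\\|");
            String replacement = "";
            // parts[0] is the template name; skip named parameters like lang=la.
            for (int i = 1; i < parts.length; i++) {
                if (!parts[i].contains("=")) {
                    replacement = parts[i];
                    break;
                }
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample, not taken from the dump.
        String sample = "From {{term|dictionarium|lang=la}}, from {{term|dictio|lang=la}}.";
        System.out.println(stripTemplates(sample));
        System.out.println(keepFirstParameter(sample));
    }
}
```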

Descendant relationship parsing bugs

Originally reported on Google Code with ID 10

More complex descendant relationships fail to parse currently. The attached patch fixes
this and adds a few missing tests.

cf. https://bitbucket.org/jberkel/jwktl/commits/2d34b15d182e558c090d2ea18eb964d0014d1258

Reported by jan.berkel on 2014-11-18 01:47:00

- _Attachment: [descendant-parsing.patch](https://storage.googleapis.com/google-code-attachments/jwktl/issue-10/comment-0/descendant-parsing.patch)_

Link in README is broken

http://dkpro.org/jwktl/ => 404

Latest version 1.1 is not available on Maven

When I add the Maven dependency

        <dependency>
            <groupId>de.tudarmstadt.ukp.jwktl</groupId>
            <artifactId>jwktl</artifactId>
            <version>1.1.0</version>
        </dependency>

I get the error: Missing artifact de.tudarmstadt.ukp.jwktl:jwktl:jar: 1.1.0

What is the problem?

Extreme memory consumption while iterating over multiple entries

Originally reported on Google Code with ID 5

Trying to iterate over a whole Wiktionary database, or even a small portion of it, fails with an OutOfMemoryError (Java heap space). This occurs even when just writing some of the data to the console (so there is no memory consumption besides the library usage). There seems to be a severe memory leak somewhere, either in JWKTL or in BerkeleyDB.

Of course, a quick fix would be to increase the memory available to the application (using the VM options -Xms and -Xmx). But first, the OutOfMemoryError occurs after iterating over only a small part of a Wiktionary, which suggests that iterating over a whole Wiktionary would need an unreasonably large heap. Second, it still looks like a memory leak, so it should be possible to iterate over the whole Wiktionary without increasing the heap size. Or is there a way in JWKTL to clear cached data while iterating?

What steps will reproduce the problem?
1. Here is an example test method that tries to extract all German example sentences
from the German and the English Wiktionary.

    @Test
    public void testGetAllExampleSentences() throws Exception {
        int counter = 0;
        IWiktionary wkt = JWKTL.openCollection(german, english);
        IWiktionaryIterator<IWiktionaryEntry> allEntries = wkt.getAllEntries();
        for (IWiktionaryEntry entry : allEntries) {
            ILanguage language = entry.getWordLanguage();
            if (language != null && language.getName().equals("German")) {
                List<IWikiString> examples = entry.getExamples();
                for (IWikiString example : examples) {
                    String plainText = example.getPlainText();
                    System.out.println(plainText);
                    counter++;
                }
            }
        }
        wkt.close();
        System.out.println(counter);
    }

What is the expected output? What do you see instead?

After 3300 sentences, the output slows down dramatically, and shortly afterwards the OutOfMemoryError
mentioned above is thrown.

What version of the product are you using? On what operating system?

I use the official JWKTL version 1.0.0 from the Central Maven Repository, using this
dependency:
        <dependency>
            <groupId>de.tudarmstadt.ukp.jwktl</groupId>
            <artifactId>jwktl</artifactId>
            <version>1.0.0</version>
        </dependency>

Thank you for your help. Besides that, thank you very much for providing this library.
It helps tremendously! Please continue providing such great libraries and tools.
Best wishes,
Andreas

Reported by andreas.schulz.de on 2013-12-04 11:54:41

Build fails under Java 9 due to java.lang.StringIndexOutOfBoundsException in maven-javadoc-plugin

When trying to build under Java 9 I get the following error:

[INFO] --- maven-javadoc-plugin:2.10.3:javadoc (generate-javadoc) @ dkpro-jwktl ---
[WARNING] Error injecting: org.apache.maven.plugin.javadoc.JavadocReport
java.lang.ExceptionInInitializerError
	at org.apache.maven.plugin.javadoc.AbstractJavadocMojo.<clinit>(AbstractJavadocMojo.java:195)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
	at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:86)
	at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:105)
	at com.google.inject.internal.ConstructorInjector.access$000(ConstructorInjector.java:32)
	at com.google.inject.internal.ConstructorInjector$1.call(ConstructorInjector.java:89)
	at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:115)
	at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:133)
	at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:68)
	at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:87)
	at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:267)
	at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103)
	at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012)
	at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051)
	at org.eclipse.sisu.space.AbstractDeferredClass.get(AbstractDeferredClass.java:48)
	at com.google.inject.internal.ProviderInternalFactory.provision(ProviderInternalFactory.java:81)
	at com.google.inject.internal.InternalFactoryToInitializableAdapter.provision(InternalFactoryToInitializableAdapter.java:53)
	at com.google.inject.internal.ProviderInternalFactory$1.call(ProviderInternalFactory.java:65)
	at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:115)
	at org.eclipse.sisu.bean.BeanScheduler$Activator.onProvision(BeanScheduler.java:176)
	at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:126)
	at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:68)
	at com.google.inject.internal.ProviderInternalFactory.circularGet(ProviderInternalFactory.java:63)
	at com.google.inject.internal.InternalFactoryToInitializableAdapter.get(InternalFactoryToInitializableAdapter.java:45)
	at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1092)
	at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012)
	at org.eclipse.sisu.inject.Guice4$1.get(Guice4.java:162)
	at org.eclipse.sisu.inject.LazyBeanEntry.getValue(LazyBeanEntry.java:81)
	at org.eclipse.sisu.plexus.LazyPlexusBean.getValue(LazyPlexusBean.java:51)
	at org.codehaus.plexus.DefaultPlexusContainer.lookup(DefaultPlexusContainer.java:263)
	at org.codehaus.plexus.DefaultPlexusContainer.lookup(DefaultPlexusContainer.java:255)
	at org.apache.maven.plugin.internal.DefaultMavenPluginManager.getConfiguredMojo(DefaultMavenPluginManager.java:517)
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:121)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 1
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
	at java.base/java.lang.String.substring(String.java:1885)
	at org.apache.commons.lang.SystemUtils.getJavaVersionAsFloat(SystemUtils.java:1133)
	at org.apache.commons.lang.SystemUtils.<clinit>(SystemUtils.java:818)
	... 59 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

This happens with maven-javadoc-plugin version 2.10.3. Same story with 2.10.4.
Upgrading to 3.0.1 solves this issue.
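
The bottom of the trace shows String.substring being called from SystemUtils.getJavaVersionAsFloat with "begin 0, end 3, length 1", i.e. the first three characters are taken from a one-character string. A minimal sketch of that failure mode follows; it is not the actual commons-lang source, and the version string "9" is an assumption consistent with the reported length.

```java
final class JavaVersionPrefixSketch {

    /** Takes the first three characters, as the trace suggests the old code does. */
    static String naivePrefix(String javaVersion) {
        return javaVersion.substring(0, 3);
    }

    /** Defensive variant that never reads past the end of the string. */
    static String safePrefix(String javaVersion) {
        return javaVersion.substring(0, Math.min(3, javaVersion.length()));
    }

    public static void main(String[] args) {
        // A Java 8 style version string is long enough for the naive approach.
        System.out.println(naivePrefix("1.8.0_151"));
        System.out.println(safePrefix("9"));
        try {
            naivePrefix("9"); // one-character version string
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naivePrefix failed: " + e.getMessage());
        }
    }
}
```

This is consistent with the observation that a newer plugin version builds fine under Java 9.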

English word forms

I am using the EnglishWordFormHandler to generate the word forms for verb lemmas.
However, I am confused about how to use the Template object that is required for this task.
I am setting the name of the template object to "en-verb", but I don't understand what the parameters of this template object should be. Can you please help me here?
Thank you.

Support for alternative databases / Fork a BerkeleyDB-free version

Originally reported on Google Code with ID 4

I wanted to note that although 'jwktl' is covered by an Apache license, it has a deep
dependency on Berkeley DB. Oracle dual licenses Berkeley DB; its open source license
is now AGPL (see e.g. http://www.infoworld.com/d/open-source-software/oracle-switches-berkeley-db-license-222097)
 This makes it impractical to use jwktl in any commercial application, unless a very
costly commercial license (on a per-processor basis) is acquired from Oracle.

I have not dug into the source code yet, but I would like to have some informed opinion
on the feasibility of moving from Berkeley DB to a more permissively licensed database
system, either a purely relational one such as the excellent H2, or the powerful XML database
system BaseX.

Reported by newintellectual on 2013-10-28 14:58:40

Parse "Alternative forms"

Originally reported on Google Code with ID 13

Would be nice to make this available via the API.

https://en.wiktionary.org/wiki/Wiktionary:Votes/pl-2010-07/Alternative_forms_header

Reported by jan.berkel on 2015-03-21 19:14:41