
LanguageTool's Introduction

LanguageTool

LanguageTool is Open Source proofreading software for English, Spanish, French, German, Portuguese, Polish, Dutch, and more than 20 other languages. It finds many errors that a simple spell checker cannot detect.

For more information, please see our homepage at https://languagetool.org, this README, and CHANGES.

The LanguageTool core (this repo) is freely available under the LGPL 2.1 or later.

Docker

Try one of the following projects for a community-contributed Docker file:

Contributions

The development overview describes how you can contribute error detection rules.

For more technical details, see our dev pages.

Scripted installation and building

To install or build using a script, simply type:

curl -L https://raw.githubusercontent.com/languagetool-org/languagetool/master/install.sh | sudo bash <options>

If you want more options, download the install.sh script and run it directly. Its usage options follow:

sudo bash install.sh <options>

Usage: install.sh <option> <package>
Options:
   -h --help                   Show help
   -b --build                  Builds packages from the bleeding edge development copy of LanguageTool
   -c --command <command>      Specifies post-installation command to run (default gui when screen is detected)
   -q --quiet                  Shut up LanguageTool installer! Only tell me important stuff!
   -t --text <file>            Specifies what text to be spellchecked by LanguageTool command line (default spellcheck.txt)
   -d --depth <value>          Specifies the depth to clone when building LanguageTool yourself (default 1).
   -p --package <package>      Specifies package to install when building (default all)
   -o --override <OS>          Override automatic OS detection with <OS>
   -a --accept                 Accept the oracle license at http://java.com/license. Only run this if you have seen the license and agree to its terms!
   -r --remove <all/partial>   Removes LanguageTool install. <all> uninstalls the dependencies that were auto-installed. (default partial)

Packages (only if -b is specified):
   standalone                  Installs standalone package
   wikipedia                   Installs Wikipedia package
   office-extension            Installs the LibreOffice/OpenOffice extension package

Commands:
   GUI                         Runs GUI version of LanguageTool
   commandline                 Runs command line version of LanguageTool
   server                      Runs server version of LanguageTool

Alternate way to build from source

Before you start, you will need to clone the repository from GitHub and install Java 8 and Apache Maven.

Warning: a complete clone requires downloading more than 500 MB and needs more than 1500 MB on disk. This can be reduced if you only need the last few revisions of the master branch by creating a shallow clone:

git clone --depth 5 https://github.com/languagetool-org/languagetool.git

A shallow clone downloads less than 60 MB and needs less than 200 MB on disk.

In the root project folder, run:

mvn clean test

(for repeated builds, you can sometimes skip this Maven step)

./build.sh languagetool-standalone package -DskipTests

Test the result in languagetool-standalone/target/.

./build.sh languagetool-wikipedia package -DskipTests

Test the result in languagetool-wikipedia/target.

./build.sh languagetool-office-extension package -DskipTests

Test the result in languagetool-office-extension/target, rename the *.zip to *.oxt to install it in LibreOffice/OpenOffice.

Now you can use the bleeding-edge development copy of the LanguageTool *.jar files; be aware that it might contain regressions.

How to run under Mac M1 or M2

  1. Install Brew for Rosetta: arch -x86_64 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
  2. Install openjdk for Rosetta: arch -x86_64 brew install openjdk
  3. Install Maven for Rosetta: arch -x86_64 brew install maven
  4. Now run the build scripts

License

Unless otherwise noted, this software - the LanguageTool core - is distributed under the LGPL, see file COPYING.txt.

LanguageTool's People

Contributors

affemitkaraffe, agneskleinhans, arysin, azadehsafakish, danielnaber, dpelle, evan-defran-lt, f-knorr, fabrichter, fredkruse, gilloult, gulp21, janschreiber, jaumeortola, luciesteib, luisa-lt, marcoagpinto, mark-baas, mikeunwalla, milekpl, rebecca-auque, st-ac-y, stevio89, susanaboatto, taaltik, tatigf20, tiagosantos81, tiff, udomai, yakovru


LanguageTool's Issues

clickable link in About dialog

Add a real clickable link to LanguageTool's About dialog. Probably easy if you know Swing programming.

Requires: knowledge of Java/Swing
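
One common Swing approach, sketched below under the assumption that the dialog can simply embed a JLabel; the helper class and its name are hypothetical, not existing LanguageTool code. It renders the URL as an HTML link and opens it with Desktop.browse() on click.

```
import java.awt.Cursor;
import java.awt.Desktop;
import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;
import java.net.URI;
import javax.swing.JLabel;

// Hypothetical helper; not part of the existing LanguageTool code base.
final class LinkLabelFactory {

  static JLabel createLinkLabel(final String url) {
    JLabel label = new JLabel("<html><a href=\"" + url + "\">" + url + "</a></html>");
    label.setCursor(Cursor.getPredefinedCursor(Cursor.HAND_CURSOR));
    label.addMouseListener(new MouseAdapter() {
      @Override
      public void mouseClicked(MouseEvent e) {
        try {
          // Open the URL in the system browser, if supported on this platform.
          if (Desktop.isDesktopSupported()) {
            Desktop.getDesktop().browse(new URI(url));
          }
        } catch (Exception ex) {
          ex.printStackTrace();
        }
      }
    });
    return label;
  }
}
```

In the dialog this could be used e.g. as panel.add(LinkLabelFactory.createLinkLabel("https://languagetool.org")).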

Bug report: LT for Slovenian

Hello!

I worked with LT for Slovenian in my master's thesis and was advised to file a bug report about some of its issues, so here it is:

  • the rules for missing commas in Slovenian should be improved by counting the finite verb forms (too many false alarms were reported)
  • the spell checker needs to be improved, because it does not detect words that violate the word-division rules (e.g. "post-komunističen" should be written as "postkomunističen" in Slovenian, but the checker does not recognize this as an error, even though the word was not split at the end of a line)
  • word repetition is not detected if there are two or more words between the given word and its repetition (e.g. "pričakoval od bivše komunistične države pričakoval")

Regards

mario

Message area shows message for previous text

[screenshot: text-area-and-message-area-conflict]

Stand-alone LT, version 2.3. AutoCheck not selected.

The text area has 1 line of text: Test the data.

Select 'CheckText' (either the icon, or from Text Checking).

The message area shows 3 messages, which are for text that was previously in the text area.

Select 'CheckText' again, and the correct messages appear in the message area.

Ability to navigate to issues in standalone GUI

It would be nice if the standalone GUI had options to navigate from an issue (in the lower part of the window) to the issue in the text. Also, the ability to go to the next/previous issues in the text would be very nice to have.

'Ignore' deactivates the 'Possible spelling mistake' rule

LT 2.4 and 2.4.1

In the GUI, when I click 'Ignore' on a misspelled word, the 'Possible spelling mistake' rule is deactivated. I expected 'Ignore' to apply only to the misspelled word on which I right-click. I suggest that if possible, you change 'Ignore' to 'Deactivate spelling check' or something similar.

Improve TO_NON_BASE rule

Include sentences like:

She wants you to goes there.
I've been trying to called you.
She is manipulating her father to gets her way.
He tried very hard to lifted the rock.

But there are similar correct sentences:

The calendar her eyes kept straying to said it was December sixth.
But everyone I talked to said it was too risky.
Whatever college he wants to go to is fine with me.

Custom resource loaders

All code that loads resources currently uses the classloader as the resource loader. This allows all resources to be packaged within the JARs or somewhere else on the classpath. That is good for standalone applications or plugins, because they can be delivered as packages without any external references. It would be good to abstract the resource loading away from the classloader into a small API/interface of its own, so that API users can supply custom loading mechanisms, e.g. from a database or a network filesystem instead of the classpath. Additionally, the API user could cache the content of resources, e.g. dictionaries or other huge files, and reduce disk I/O.

The benefit is that in such cases, most likely in server environments, it is possible to change resources, like the grammar.xml or dictionaries, at runtime without a server restart or an update of the LanguageTool JARs.

Idea
A new interface ResourceLoader will be introduced and provided to both Language and JLanguageTool via a separate constructor. All default constructors will create a default implementation, ClasspathResourceLoader, that loads via the classloader. All code that needs to load resources must have access to the resource loader, most likely by requiring it in a constructor. There should be no static references to a default implementation, to keep things as flexible as possible (a rough sketch follows after the constraints below).

Constraints

  • the default should still be the classpath resource loader
  • no additional configurations must be required when the default mechanism is used (backward compatibility)
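
A minimal sketch of what the proposed interface and its classpath-based default could look like; ResourceLoader and ClasspathResourceLoader are the names from the idea above, while the method name and signature are assumptions for illustration only, not existing LanguageTool API.

```
import java.io.InputStream;

// Illustrative sketch only; the method name and signature are assumptions.
public interface ResourceLoader {
  /** Opens a resource such as a grammar.xml or a dictionary by its path. */
  InputStream openResource(String path);
}

/** Default implementation that keeps today's behavior: load from the classpath. */
class ClasspathResourceLoader implements ResourceLoader {
  @Override
  public InputStream openResource(String path) {
    InputStream in = getClass().getClassLoader().getResourceAsStream(path);
    if (in == null) {
      throw new RuntimeException("Resource not found in classpath: " + path);
    }
    return in;
  }
}
```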

Visualisation of false friends per language

I need a visualisation of false friends in the same way as
https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/rules/false-friends.css
but showing only the rule groups in which a certain language is used.

The list of false friends is getting longer and longer, so being able to filter for the rule groups related to the language you are working on would be practical.

Ideas for implementation are (one or more):

  • a parameter attached to the URL when viewing false-friends.xml in a web browser, such as ?lang=all (default), ?lang=en, ?lang=de, etc.
  • adding an HTML 1-of-n select with the values all (default), en, de, nl, etc. on top of the view

add phrase support for disambiguation rules

The disambiguation rules do not support phrases; phrases are simply discarded silently, so users may be quite surprised to see that they don't work. What is required is to bring PatternRule and DisambiguationPatternRule in sync (and PatternRuleMatcher and DisambiguationPatternReplacer as well). One needs to use the element list to ensure that markers are supported correctly in disambiguation rules.

Question on Language Tool Rules

Using the rule system, I'm under the impression that LanguageTool could be used for other purposes. By scaffolding XML rules in the right way, could LanguageTool parse any sort of string and apply all sorts of other rules to it? Is this correct?

For example, could LanguageTool use XML rules to check code style, sentence structure, basic mathematics, etc.?

Make sure JUnit tests for disambiguation rules support rulegroups

Right now, the tests for disambiguation rules run all the rules that precede the tested rule on the test sentence. But there is a bug: if the rule is part of a rule group, the previous members of that rule group are not run, so any changes to tokens are not visible in the test (though they would be visible in the LanguageTool output). Fix DisambiguationRuleTest so that it also runs the preceding rules from the same rule group.

[de] Missing rule: missing verb

Ich ein neue Fehler gefunden ==> was the text to check. It showed no errors.
The correct sentence:
Ich (1) einen neuen Fehler gefunden

(1) could be: habe or hatte
Second error: neue: because "der Fehler" is masculine, it should be:
Ich habe einen neuen Fehler gefunden

Another example:
Ich habe einen neue Feder gefunden
Correct: Ich habe eine neue Feder gefunden.
As "Feder" is feminine.

"Unsupported major.minor version" exception when trying to install LanguageTool

Hi,

I get the following error when I try to install LanguageTool:

(com.sun.star.uno.RuntimeException) { { Message = "[jni_uno bridge error]
UNO calling Java method writeRegistryInfo: non-UNO exception occurred:
java.lang.UnsupportedClassVersionError: org/languagetool/openoffice/Main :
Unsupported major.minor version 51.0\X000ajava stack
trace:\X000ajava.lang.UnsupportedClassVersionError:
org/languagetool/openoffice/Main : Unsupported major.minor version
51.0\X000a\X0009at java.lang.ClassLoader.defineClass1(Native
Method)\X000a\X0009at
java.lang.ClassLoader.defineClass(ClassLoader.java:643)\X000a\X0009at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)\X000a\X0009at
java.net.URLClassLoader.defineClass(URLClassLoader.java:277)\X000a\X0009at
java.net.URLClassLoader.access$000(URLClassLoader.java:73)\X000a\X0009at
java.net.URLClassLoader$1.run(URLClassLoader.java:212)\X000a\X0009at
java.security.AccessController.doPrivileged(Native Method)\X000a\X0009at
java.net.URLClassLoader.findClass(URLClassLoader.java:205)\X000a\X0009at
java.lang.ClassLoader.loadClass(ClassLoader.java:323)\X000a\X0009at
java.lang.ClassLoader.loadClass(ClassLoader.java:316)\X000a\X0009at
java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:615)\X000a\X0009at
java.lang.ClassLoader.loadClass(ClassLoader.java:268)\X000a\X0009at
com.sun.star.comp.loader.RegistrationClassFinder.find(RegistrationClassFinder.java:55)\X000a\X0009at
com.sun.star.comp.loader.JavaLoader.writeRegistryInfo(JavaLoader.java:399)\X000a",
Context = (com.sun.star.uno.XInterface) @0 } }

My versions:

Ubuntu Saucy 13.10
LibreOffice 4.1.4

libreoffice-java-common is installed

XML Rules in individual files

Perhaps it would be possible to split the XML rules out from a single large file into individual XML files?

We regularly create new XML rules, and this would speed up the process of submitting new rules, as each could be submitted through GitHub individually.
It would also mean people could host their own custom XML rule repositories that can be bolted onto LanguageTool ad hoc.

Does not work: Dump + Encode with frequency + Dump again

Jaume, I dumped the Polish dictionary and used the frequency list to encode it. But then I cannot dump the dictionary again, as there is an error:

d:\download\LanguageTool-2.4-SNAPSHOT>java -cp languagetool.jar org.languagetool.dev.DictionaryExporter pl_PL.dict >pl_PL.src

Unhandled program error occurred.
Invoke with '--help' for help.
java.lang.RuntimeException: Invalid dictionary entry format (missing separator).
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:59)
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:15)
        at morfologik.tools.FSADumpTool.dump(FSADumpTool.java:171)
        at morfologik.tools.FSADumpTool.go(FSADumpTool.java:75)
        at morfologik.tools.Tool.go(Tool.java:45)
        at morfologik.tools.FSADumpTool.main(FSADumpTool.java:285)
        at org.languagetool.dev.DictionaryExporter.main(DictionaryExporter.java:40)
I think this is an omission on our part in morfologik-speller, but it also shows up in the LT code.

allow to use tokens with max > 1 in suggestions

Right now, only the first token is matched in suggestions. For example, if the tokens are:

<token max="2">abc</token>
<token>dfg</token>

Then the suggestion tag:

<suggestion>\1\2</suggestion>

will return:

abcdfg

for the input text "abc abc dfg" (note that only one abc is returned: the first occurrence).

We should have a way to return both matched tokens. In the first case, we need both tokens with whitespace, as they appear in the text (the same way it would work with):

<suggestion><match include_skipped="yes" no="1"/><match no="2"/></suggestion>

In such a case, the rule should format the suggestion:

abc abcdfg

In the second case, we need to have a new attribute for the match element, for example:

<suggestion><match whitespace="no" include_skipped="yes" no="1"/><match no="2"/></suggestion>

The text of the suggestion would be:

abcabcdfg

Requires: knowledge of Java

check form: no popup when there's no error

If you check a text on http://languagetool.org and there's no error in it (or LT doesn't find one), you get a popup that you need to click away. Instead, the text saying that no error was found should be displayed somewhere near the check form using HTML, not as a popup that needs to be clicked.

Requires: knowledge of JavaScript/HTML

find a way to easily disable and enable whole categories on the command line

Right now, only individual rules or rule groups can be enabled and disabled on the command line. Find a way to disable and enable whole categories: this is more difficult because categories have only names and types (which could also be used). But names may contain spaces, which means that they need to be escaped on the command line, or we need to introduce IDs.

In verbose mode, the disabled/enabled categories and types should also display the rule IDs that were enabled/disabled.

Multivalue tokens

I've noticed that quite a few tokens in the grammar.xml (at least in the German one) use regexp tokens but just define a list of possible values, e.g. like this:

<token regexp="yes">an|auf|durch|für|gegen|über</token>

It would be nice to introduce multivalue-tokens for such cases, maybe like this:

<token multivalue="yes">an|auf|durch|für|gegen|über</token>

Either multivalue or regexp may be defined, but not both.

I think it would be best to store the token values as a Set, because Element.isStringTokenMatched() can then simply use Set.contains(), which is much faster than a regular expression that merely holds a list of possible values (a rough sketch follows after the constraints below).

It's not clear if existing rules should be changed or if there's a mechanism that checks the token value and interprets it as a multivalue token instead of a regular expression.

It's already known that most of the time is spent in the method for matching rules (see New Pattern Matching @ http://wiki.languagetool.org/missing-features), so this improvement will not change the world, but will slightly improve the performance.

Constraints:

  • the old way must still work
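
A minimal sketch of the Set-based matching idea described above, using the preposition example from this issue; the class and method names are hypothetical and do not reflect the real Element implementation.

```
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch; not the real Element class.
class MultivalueToken {
  private final Set<String> values;

  MultivalueToken(String tokenText) {
    // "an|auf|durch|für|gegen|über" -> a plain set of alternatives
    values = new HashSet<>(Arrays.asList(tokenText.split("\\|")));
  }

  /** Constant-time lookup instead of evaluating a regular expression. */
  boolean matches(String token) {
    return values.contains(token);
  }
}
```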

No suggest for misspelling in French

On your demo site, there are no suggestions for French misspellings :'(

Example web request:
https://languagetool.org:8081/?language=fr&text=saluut%20comment%20vass%20tu%20?

The response is:
...
<error fromy="0" fromx="0" toy="0" tox="6" ruleId="HUNSPELL_NO_SUGGEST_RULE" msg="Faute de frappe possible trouvée" replacements="" context="saluut comment vass tu ?" contextoffset="0" offset="0" errorlength="6" category="Faute de frappe possible" locqualityissuetype="misspelling"/>
...

And when we try https://languagetool.org:8081/?language=fr-FR&text=saluut%20comment%20vass%20tu%20?
we get an error saying the language is not valid :s

The command-line documentation says: -l, --language  Text language as a character code (e.g. en for English). Note that you need to specify a variant (e.g. en-US) if you want spell checking to be active (spell checking is supported since LanguageTool 1.8).
But for French it seems impossible to specify the variant.

Sorry for my bad English!

min, max parameters not included in the disambiguator

Daniel, you forgot to add new functionality to the DisambiguatorPatternRule. I consider this a bug because the format of both files is kept in sync...

I guess skipMaxTokens should go to AbstractPatternRulePerformer?

detect UTF-8 text files on commandline and in GUI

In the GUI and our command-line interface, we assume the platform's encoding. On Windows, that may not be UTF-8, but UTF-8 files are easy to detect (with or without a BOM). Add a simple fix to detect them properly.
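
A minimal sketch of one possible detection approach, making no assumption about where in the GUI or command-line code it would be hooked in: check for the UTF-8 BOM first, and otherwise accept the file as UTF-8 only if a strict decoder can decode all of its bytes.

```
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch only; not the fix that was (or will be) applied in LanguageTool.
class Utf8Detector {

  static boolean looksLikeUtf8(String filename) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(filename));
    // A UTF-8 BOM (EF BB BF) is a sure sign.
    if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
        && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
      return true;
    }
    // Without a BOM: the file is valid UTF-8 if a strict decoder accepts every byte.
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```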

Re-use loaded rules

Rules loaded by JLanguageTool.activateDefaultPatternRules() and JLanguageTool.activateDefaultFalseFriendRules() should be cached outside the JLanguageTool and be provided to it on creation.

The goal of this issue is to increase the speed and reduce the memory consumption of creating new JLanguageTools.

One choice is to move both methods to Language. The caller can then cache the different variants of a Language, e.g. "English with pattern rules enabled" and "English without pattern rules" (a rough caching sketch from the API user's perspective follows after the constraints below).

Various things must be taken care of:

  • where are user rules (JLanguageTool.addRule, JLanguageTool.disableRule, etc.) stored?
  • where are ignored words stored (JLanguageTool.addIgnoreWords adds them on the rule)?
  • any other things?

Constraints:

  • no changes on the XML structure
  • all the rule-loading and rule-management methods from JLanguageTool must still work. The new mechanism just enables more possibilities to cache for an API-user.
  • no changes on existing testing code (backward-compatibility)
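
A minimal sketch of how an API user could cache loaded pattern rules outside JLanguageTool today. It assumes the methods mentioned in this tracker, JLanguageTool.loadPatternRules(String) and JLanguageTool.addRule(Rule), with signatures as in the 2.x API, and leaves the exact grammar-file path to the caller; it is a sketch, not a proposal for the final design.

```
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.languagetool.JLanguageTool;
import org.languagetool.Language;
import org.languagetool.rules.patterns.PatternRule;

// Illustrative sketch; assumes 2.x-era loadPatternRules(String) and addRule(Rule).
class PatternRuleCache {

  private final Map<String, List<PatternRule>> cache = new ConcurrentHashMap<>();

  /**
   * Creates a JLanguageTool and adds the pattern rules from the given grammar file,
   * loading and parsing that file only once per language.
   */
  JLanguageTool createToolWithCachedRules(Language language, String grammarFile) throws IOException {
    JLanguageTool tool = new JLanguageTool(language);
    List<PatternRule> rules = cache.get(language.getShortName());
    if (rules == null) {
      rules = tool.loadPatternRules(grammarFile);  // expensive XML parsing happens only once
      cache.put(language.getShortName(), rules);
    }
    for (PatternRule rule : rules) {
      tool.addRule(rule);
    }
    return tool;
  }
}
```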

Adding "-e APOS_TYP" in command line to enable a rule makes LT very slow

I just noticed that the command line is somehow very slow when I use the -e option. I wonder why. The following two commands take:

  • 4.58 sec without "-e APOS_TYP"
  • 15.69 sec with "-e APOS_TYP"

That looks like a bug to me. I can't think why it would take so long:

 $ echo "C'est-à-dire." | \
   time java -jar /home/pel/sb/languagetool/languagetool-standalone/target/LanguageTool-2.4-SNAPSHOT/LanguageTool-2.4-SNAPSHOT/languagetool-commandline.jar \
   -l fr -
 Expected text language: French
 Working on STDIN...
 Time: 524ms for 1 sentences (1.9 sentences/sec)
 4.58user 0.11system 0:01.84elapsed 254%CPU (0avgtext+0avgdata 97836maxresident)k
 0inputs+0outputs (0major+37347minor)pagefaults 0swaps
 $ echo "C'est-à-dire." | \
   time java -jar /home/pel/sb/languagetool/languagetool-standalone/target/LanguageTool-2.4-SNAPSHOT/LanguageTool-2.4-SNAPSHOT/languagetool-commandline.jar \
   -l fr -e APOS_TYP -
 Expected text language: French
 Working on STDIN...
 1.) Line 1, column 2, Rule ID: APOS_TYP[1]
 Message: Employez l’apostrophe typographique '’'.
 Suggestion: ’
 C'est-à-dire.
  ^           
 Time: 76ms for 1 sentences (13.2 sentences/sec)
 15.69user 0.26system 0:12.62elapsed 126%CPU (0avgtext+0avgdata 371400maxresident)k
 0inputs+0outputs (0major+106806minor)pagefaults 0swaps

Besides the difference with and without -e, the timing given by LT is very different from the timing given by the "time" command (15.69 s vs 76 ms in the last example). Obviously they measure something different.

HTTPServerLoadTest hanging

I just had another case of HTTPServerLoadTest hanging, like #13. Unfortunately I made a mistake, so I don't have a stack trace this time. I'm now running the test again for a longer time; hopefully I can reproduce the issue.

add URLs to rules

Any error detection rule in LanguageTool (i.e. the rules in grammar.xml) can have a URL attached which provides more information about the error. Find a rule that could benefit from such a link and add the URL. It should point to some reputable site where we can assume the URL will still be there in a few years.

Requires: knowledge of XML

Wrong spelling suggestion for "Trash"

Trivial but irritating.
LibreOffice grammar check: 'trash' with UK English selected produces the following message:
"trash is a common American expression, in British English it is more common to use: rubbis."

The final 'h' is missing.

make sure that only the longest match is returned in PatternRules with token with maxOccurrence > 1

For a rule:

<token max="2">abc</token>
<token>dfg</token>

there will be two matches in the following sequence:

abc abc dfg

The first match:
abc abc dfg

The second match:
abc dfg

The second match should not occur; we should stop matching the rule as soon as the longest match has been found. The file that needs to be changed is PatternRuleMatcher.java, and there are tests that should then pass in languagetool-core, under the resources for the demo (xx) language in grammar.xml (the test with the first FIXME).

Requires: knowledge of Java

HTTPServerLoadTest can get stuck

It happened to me once that the test got stuck with 200% CPU load. The stacktrace was this (quoting only the part that matters):

"pool-3-thread-6" prio=10 tid=0x00007f790c2d4800 nid=0x6a93 runnable [0x00007f7950c87000]
   java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.getEntry(HashMap.java:446)
    at java.util.HashMap.containsKey(HashMap.java:434)
    at org.languagetool.rules.patterns.UnifierConfiguration.setEquivalence(UnifierConfiguration.java:60)
    at org.languagetool.rules.patterns.XMLRuleHandler.finalizeTokens(XMLRuleHandler.java:569)
    at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:275)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:606)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2054)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1742)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2900)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:71)
    at org.languagetool.JLanguageTool.loadPatternRules(JLanguageTool.java:311)
    at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:345)
    at org.languagetool.server.LanguageToolHttpHandler.getLanguageToolInstance(LanguageToolHttpHandler.java:296)
    at org.languagetool.server.LanguageToolHttpHandler.checkText(LanguageToolHttpHandler.java:230)
    at org.languagetool.server.LanguageToolHttpHandler.handle(LanguageToolHttpHandler.java:108)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:677)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:649)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
"pool-3-thread-5" prio=10 tid=0x00007f790c2d2800 nid=0x6a92 runnable [0x00007f7950d88000]
   java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.getEntry(HashMap.java:446)
    at java.util.HashMap.containsKey(HashMap.java:434)
    at org.languagetool.rules.patterns.UnifierConfiguration.setEquivalence(UnifierConfiguration.java:60)
    at org.languagetool.rules.patterns.XMLRuleHandler.finalizeTokens(XMLRuleHandler.java:569)
    at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:275)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:606)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2054)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
    at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1742)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2900)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:71)
    at org.languagetool.JLanguageTool.loadPatternRules(JLanguageTool.java:311)
    at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:345)
    at org.languagetool.server.LanguageToolHttpHandler.getLanguageToolInstance(LanguageToolHttpHandler.java:296)
    at org.languagetool.server.LanguageToolHttpHandler.checkText(LanguageToolHttpHandler.java:230)
    at org.languagetool.server.LanguageToolHttpHandler.handle(LanguageToolHttpHandler.java:108)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:677)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:649)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

'Standalone' jar is not standalone

I interpreted this to mean I could use the standalone jar anywhere and could forget about everything else extracted from the zipfile because the jar is, well, standalone. However, when moving the jar elsewhere I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/languagetool/language/RuleFilenameException
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
    at java.lang.Class.getMethod0(Class.java:2764)
    at java.lang.Class.getMethod(Class.java:1653)
    at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: org.languagetool.language.RuleFilenameException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

Is this expected?

pattern rule test for short messages

Extend the pattern rule test to make sure that the short message (<short> in the grammar XML file), if any, is actually shorter than the other error message (<message>).

Requires: knowledge of Java
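
A minimal sketch of the assertion such a test could make; how the two strings are obtained from the parsed rule is left open, and the helper below is purely illustrative.

```
import static org.junit.Assert.assertTrue;

// Illustrative helper for such a test; how shortMessage and message are read from the
// parsed rule is left open (any accessor names would be an assumption).
class ShortMessageCheck {
  static void assertShortMessageIsShorter(String ruleId, String shortMessage, String message) {
    // Only check rules that actually define a <short> element.
    if (shortMessage != null && !shortMessage.isEmpty() && message != null) {
      assertTrue("<short> of rule " + ruleId + " should be shorter than its <message>",
          shortMessage.length() < message.length());
    }
  }
}
```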

implement a user dictionary for GUI and commandline

It should be possible to add words to a dictionary stored in the same directory where the configuration resides (i.e., the user's home directory). The words would be added to the ignored words when initializing LT, also in command-line mode. The file should be a simple text file, just like ignore.txt, and the same reading routine can be used.
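
A minimal sketch of the reading side, under the assumption that the file lives in the user's home directory and contains one word per line; the file name used here is made up, and how the words are then fed into the ignore-word handling is left open.

```
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; the file name "languagetool-user-dictionary.txt" is an assumption.
class UserDictionaryLoader {

  static List<String> loadUserWords() throws IOException {
    File file = new File(System.getProperty("user.home"), "languagetool-user-dictionary.txt");
    List<String> words = new ArrayList<>();
    if (!file.exists()) {
      return words;   // no user dictionary yet
    }
    for (String line : Files.readAllLines(file.toPath(), StandardCharsets.UTF_8)) {
      String word = line.trim();
      // skip blank lines and comments, like a simple ignore.txt
      if (!word.isEmpty() && !word.startsWith("#")) {
        words.add(word);
      }
    }
    return words;
  }
}
```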

add proper <token min="0"> support for disambiguation rules

The code for <token min="0"> is not complete in disambiguation rules, and they support this construct only partially. Bring PatternRule and DisambiguationRule in sync (move some methods from PatternRuleMatcher to AbstractPatternRulePerformer so they can be used in PatternRuleReplacer).

Combining Unicode characters not correctly handled by LanguageTool

Combining Unicode characters are not correctly handled by LanguageTool, at least not by the command line version of LT.

For example, the accentuated character in the French word "Café" can be written as:

  • Unicode character U+00E9 (utf8 sequence 0xc3 0xa9) (é). That's the most usual way of writing it.
  • or with the combining character sequence U+0065 + U+0301 (utf8 sequence 0x65 + 0xcc 0x81) (é)

They should both be treated as equivalent. However, the Hunspell rule of LT only accepts the U+00E9 character and not the U+0065 + U+0301.

Example:

  $ echo "Café. Café." | 
    java -jar LanguageTool-2.4-SNAPSHOT/languagetool-commandline.jar -l fr -
  Expected text language: French
  Working on STDIN...
  1.) Line 1, column 7, Rule ID: HUNSPELL_NO_SUGGEST_RULE
  Message: Faute de frappe possible trouvée
  Café. Café. 
        ^^^^   

Notice that the first word "Café" is recognized by LT, but the second word "Café" is signaled as a typo, yet it should also be correct just like the first word.

Furthermore, words highlighted after combining characters are highlighted at the wrong location:

Example:

  $ echo "Café. Café. Foobar." | 
    java -jar ~/sb/languagetool/languagetool-standalone/target/LanguageTool-2.4-SNAPSHOT/LanguageTool-2.4-SNAPSHOT/languagetool-commandline.jar  -l fr -
  Expected text language: French
  Working on STDIN...
  1.) Line 1, column 7, Rule ID: HUNSPELL_NO_SUGGEST_RULE
  Message: Faute de frappe possible trouvée
  Café. Café. Foobar. 
        ^^^^           

  2.) Line 1, column 14, Rule ID: HUNSPELL_NO_SUGGEST_RULE
  Message: Faute de frappe possible trouvée
  Café. Café. Foobar. 
               ^^^^^^  

Notice that the underline ^^^^^ under the word "Foobar" is at the wrong location, and LT also indicates the error at line 1, column 14, but it should be at line 1, column 13.

I'm using LT-2.4-SNAPSHOT (git rev dedc30d, from Fri Oct 11, 2013)

I suppose that this could be fixed by normalizing Unicode text before
checking it.
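
A minimal sketch of that normalization step using java.text.Normalizer (NFC composes U+0065 + U+0301 back into the single character U+00E9); where exactly LanguageTool should apply it is the open question of this issue.

```
import java.text.Normalizer;

// Illustrative sketch of the suggested normalization fix.
class UnicodeNormalization {

  /** Returns the NFC-composed form, so "Cafe" + U+0301 becomes "Café" with U+00E9. */
  static String toNfc(String text) {
    if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
      return text;   // already composed, nothing to do
    }
    return Normalizer.normalize(text, Normalizer.Form.NFC);
  }
}
```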

More info on combining Unicode characters and normalization:

http://en.wikipedia.org/wiki/Combining_characters
http://www.fileformat.info/info/unicode/char/301/index.htm
http://en.wikipedia.org/wiki/Unicode_normalization
