synhershko / hebmorph

This is an open-source effort to make Hebrew properly searchable by various IR software libraries while maintaining decent recall, precision and relevancy in retrievals. It includes a Hebrew analyzer for Lucene that already produces much better results for Hebrew texts than the default Lucene implementation. Available for Java and .NET / Mono.

Home Page: http://code972.com/hebmorph

License: Other

Languages: Java 60.19%, C# 38.17%, C 1.64%

hebmorph's Introduction

HebMorph is an open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. All code and files are released under the GNU Affero General Public License version 3.

More details at http://code972.com/HebMorph

Lucene / Elasticsearch compatibility

Since March 2017, hebmorph-lucene has been released for every Lucene/Solr version, with a matching major version number (and, most of the time, a matching minor version as well). Matching Elasticsearch plugin versions are also available; see https://github.com/synhershko/elasticsearch-analysis-hebrew.

hebmorph-lucene version	Lucene version	Elasticsearch version	Release date
6.2.x	6.1.x -> 6.2.x	5.1.x -> 5.2.x	3/2017
6.0.0	6.0.x	5.0.x	1/2017
2.4.0	5.5.x	2.4.x
2.3.x	5.4.x	2.2.x -> 2.3.x	4/2/2016
2.2.x	5.3.x	2.0.x -> 2.1.x	4/2/2016
2.1.x	4.10.4	1.6 -> 1.7.x	4/2/2016
2.0.x	4.10.x	1.4.x, 1.5.x	24/3/2015
1.5.0	4.9.0	1.3.x	9/9/2014
1.4.x	4.8.x	1.x -> 1.2.x	August 2014
1.3.x	4.6.x	0.90.8 -> 0.90.13	June 2014
1.2.0	4.5.x	0.90.6, 0.90.7	10/11/2013
1.1.0	4.4.0	0.90.3 -> 0.90.5
1.0.0	<= 4.3.0	<= 0.90.2

A tutorial for integrating HebMorph with Elasticsearch is available at http://code972.com/blog/2013/08/129-hebrew-search-with-elasticsearch-and-hebmorph

Get it from Maven Central

For the analyzer support, get hebmorph-lucene:

        <dependency>
            <groupId>com.code972.hebmorph</groupId>
            <artifactId>hebmorph-lucene</artifactId>
            <version>6.6.0</version>
            <scope>compile</scope>
        </dependency>

Lucene.NET compatibility

The .NET version of the library is compatible with Lucene.NET version 3.0.3, but has some known bugs that were fixed in the Java version and haven't been ported back yet.

License

HebMorph is released to the public under the GNU Affero General Public License v3. See the LICENSE file included in this distribution. Note that not only the programs in the distribution, but also the dictionary files and the generated word lists, are licensed under the AGPL. There is no warranty of any kind for the contents of this distribution.

hebmorph's Issues

Lucene 4.10 / ElasticSearch 1.4

Is there any plan to support Lucene 4.10 / ElasticSearch 1.4?

Changing readPrefixesFromFile to Support Solr Resource Loading

BTW, Nice analyzer!

Would it be possible to change this method to take an InputStream object, so that it can be used by Solr in the same way as any other Solr plugin? For example, in SolrHSpellLoader.java:

public static HashMap<String, Integer> readPrefixesFromFile(InputStream inputStream)

Then in Solr one could provide the hspellFolder anywhere under the core conf directory:

E.g.:

@Override
public void inform(ResourceLoader loader) throws IOException {
    String hspellFolder = "lang/he/hspell-data-files/";
    InputStream sizeFile = loader.openResource(hspellFolder + SolrHSpellLoader.sizesFile);
    InputStream dmaskFile = loader.openResource(hspellFolder + SolrHSpellLoader.dmaskFile);
    InputStream dictionaryFile = loader.openResource(hspellFolder + SolrHSpellLoader.dictionaryFile);
    InputStream prefixesFile = loader.openResource(hspellFolder + SolrHSpellLoader.prefixesFile);
    InputStream descFile = loader.openResource(hspellFolder + SolrHSpellLoader.descFile);
    InputStream stemsFile = loader.openResource(hspellFolder + SolrHSpellLoader.stemsFile);
    InputStream prefixHFile = loader.openResource(hspellFolder + SolrHSpellLoader.PREFIX_H);
    SolrHSpellLoader solrHSpellLoader = new SolrHSpellLoader(sizeFile, dmaskFile, dictionaryFile, prefixesFile, descFile, stemsFile, loadMorphData);
    dictionary = solrHSpellLoader.loadDictionaryFromHSpellData(prefixHFile);
}

If possible, can this change be made also for hebmorph-2.1.0?
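
A minimal sketch of the requested shape, under loud assumptions: the class name PrefixStreamSketch is hypothetical and not part of HebMorph, and the line format ("prefix, whitespace, integer mask") is invented for illustration, since the real hspell data files use their own format. It only shows that a reader taking an InputStream does not care whether the stream comes from a file on disk or from a resource loader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

// Hypothetical helper for illustration only; not part of HebMorph.
final class PrefixStreamSketch {
    // Reads "prefix<whitespace>mask" lines (an assumed format) from any stream.
    static HashMap<String, Integer> readPrefixes(InputStream in) throws IOException {
        HashMap<String, Integer> prefixes = new HashMap<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\\s+");
                if (parts.length != 2) {
                    throw new IOException("Malformed prefix line: " + line);
                }
                prefixes.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        return prefixes;
    }
}
```

With this shape, the caller decides where the bytes come from, which is what lets a Solr plugin keep the data files under the core conf directory.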

Offsets in TermPositionVector are not populated correctly

Term start and end offsets in TermPositionVector are populated with zeros; only the end offsets at the end of the text are populated correctly, with the position of the last character in the text.

I think the issue is in the HebMorph tokenizer, where the offset attribute is set only when the end of the text is reached. If the NextToken method returns because of a space character or the like, the offset property isn't updated.

*I assume that the start and end offsets are the positions of the first and last letter of the token
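
To make the expected behavior concrete, here is a simplified sketch of a whitespace tokenizer that records offsets for every token it emits, not only for the last one. It is plain Java and is not HebMorph's tokenizer; OffsetTokenizerSketch and its Token class are hypothetical names. One detail worth noting: as far as I know, Lucene defines the end offset as one greater than the position of the token's last character, so the sketch follows that convention rather than the "last letter" assumption above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration; this is not HebMorph's tokenizer.
final class OffsetTokenizerSketch {
    static final class Token {
        final String text;
        final int startOffset; // index of the token's first character
        final int endOffset;   // index one past the token's last character

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= input.length(); i++) {
            boolean boundary =
                i == input.length() || Character.isWhitespace(input.charAt(i));
            if (!boundary) {
                if (start < 0) {
                    start = i; // first character of a new token
                }
            } else if (start >= 0) {
                // Offsets are recorded whenever a token ends,
                // whether at a space or at the end of the text.
                tokens.add(new Token(input.substring(start, i), start, i));
                start = -1;
            }
        }
        return tokens;
    }
}
```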

מב"ל

Should be treated as a word / acronym
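
One possible heuristic, sketched under assumptions: treat a token as an acronym when it consists of Hebrew letters with a gershayim (either the ASCII double quote or U+05F4) immediately before the final letter. HebrewAcronymSketch is a hypothetical name, and this is not how HebMorph decides; real text has more cases than this pattern covers.

```java
import java.util.regex.Pattern;

// Heuristic illustration only; not HebMorph's tokenization rule.
final class HebrewAcronymSketch {
    // One or more Hebrew letters, a gershayim (ASCII '"' or U+05F4),
    // then exactly one final Hebrew letter.
    private static final Pattern ACRONYM =
        Pattern.compile("[\\u05D0-\\u05EA]+[\"\\u05F4][\\u05D0-\\u05EA]");

    static boolean looksLikeAcronym(String token) {
        return ACRONYM.matcher(token).matches();
    }
}
```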

Support Stopwords (Java Impl)

There is no stopword support in the Java implementation. Would be really nice to have that.
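
For illustration, a bare-bones stopword filter over an already tokenized list might look like the sketch below. StopwordFilterSketch and its three-word list are hypothetical and only show the idea; a real Lucene filter would also need to account for the positions of the removed tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only; not part of HebMorph.
final class StopwordFilterSketch {
    // Tiny illustrative list written with Unicode escapes:
    // "shel" (of), "al" (on), "et" (direct object marker).
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "\u05E9\u05DC", "\u05E2\u05DC", "\u05D0\u05EA"));

    // Returns the tokens that are not in the stopword set, in original order.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```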

Bug when using the optimized version for loading HSpell's dictionary files

When I use the optimized version for loading HSpell's dictionary files (passing false in HSpellLoader loader = new HSpellLoader(new File(HSpellLoader.getHspellPath()), false);) I get this message:
Exception in thread "main" java.lang.NullPointerException.

java.lang.NullPointerException: Attempt to invoke virtual method 'int java.util.ArrayList.size()' on a null object reference

It happens when I call the function parser.parse.

Full Stack Trace:
Exception in thread "main" java.lang.NullPointerException
    at com.code972.hebmorph.MorphData.getLemmas(MorphData.java:100)
    at com.code972.hebmorph.Lemmatizer.lemmatizeTolerant(Lemmatizer.java:154)
    at org.apache.lucene.analysis.hebrew.TokenFilters.HebrewLemmatizerTokenFilter.incrementToken(HebrewLemmatizerTokenFilter.java:99)
    at org.apache.lucene.analysis.hebrew.TokenFilters.AddSuffixTokenFilter.incrementToken(AddSuffixTokenFilter.java:42)
    at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:91)
    at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:223)
    at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:480)
    at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:472)
    at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:857)
    at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:348)
    at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:247)
    at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:171)
    at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:160)
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
    at org.apache.lucene.queryparsers.HebrewQueryParser.parse(HebrewQueryParser.java:43)
    at test.beur.MySearchFiles.main(MySearchFiles.java:134)

I hope the bug gets fixed soon!

Collapse alternate Hebrew characters

FB20..FB28

See http://www.unicode.org/reports/tr30/tr30-4.html
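
A minimal sketch of such a folding step in plain Java, with the mapping written out explicitly. To my understanding, U+FB20..U+FB28 are the "alternative ayin" and the "wide" letter forms, each of which has a compatibility mapping to a single standard Hebrew letter; HebrewPresentationFoldSketch is a hypothetical name and not part of HebMorph.

```java
// Illustration only; not part of HebMorph.
final class HebrewPresentationFoldSketch {
    // Standard letters for U+FB20..U+FB28, in code point order:
    // ayin, alef, dalet, he, kaf, lamed, final mem, resh, tav.
    private static final char[] TARGETS = {
        '\u05E2', '\u05D0', '\u05D3', '\u05D4', '\u05DB',
        '\u05DC', '\u05DD', '\u05E8', '\u05EA'
    };

    // Replaces each character in U+FB20..U+FB28 with its standard letter
    // and leaves every other character unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(c >= '\uFB20' && c <= '\uFB28' ? TARGETS[c - '\uFB20'] : c);
        }
        return out.toString();
    }
}
```

An explicit table keeps the step limited to exactly this range; a broader compatibility normalization would also touch characters outside it.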

List of applications using HebMorph in production?

Hi (-:
We are evaluating full-text search solutions for our mobile app. The app is in Hebrew, so we ended up looking at HebMorph.
It would be great to know of a production app using HebMorph.
Also, is it actively maintained?

Term positions in the TermPositionVector are not populated properly

I've attached a code sample that reproduces the issue - http://gist.github.com/656589

TermPositionVector stores position values which indicate the place of the word in the text, starting with 0 for the first word in the text. Right now all the positions are shown to be 0 instead of the real word positions. For example in the sample code the position of the word is shown as 0, not 5 as expected.

Lucene.NET Version 2.9.2
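
The expected numbering can be sketched in a few lines of plain Java: each token advances the position by one, starting at 0 for the first word. TermPositionSketch is a hypothetical helper using whitespace tokenization, not code from HebMorph or Lucene.NET.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the expected position numbering; not HebMorph code.
final class TermPositionSketch {
    // Returns the zero-based word positions at which `term` occurs in `text`.
    static List<Integer> positionsOf(String text, String term) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (String token : text.trim().split("\\s+")) {
            position += 1; // each token advances the position by one
            if (token.equals(term)) {
                positions.add(position);
            }
        }
        return positions;
    }
}
```

For a text such as "a b c d e target f", the word "target" is the sixth token, so its position is 5, which matches the behavior expected above.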

support elasticsearch 5

please!

Hebmorph for android

I'm able to run HebMorph on an Android device.
But loading the dictionary takes a very long time: about 3 minutes.

Is it possible to reduce the time it takes to load the dictionary?

Loaders for custom words and special tokenization cases

Stronger MorphData structure

Assuming this new data structure:

{  
   "Word":"foo",
   "Prefixes": 111,
   "Lemmas":[  
      {  
         "Lemma":"foo",
         "DescFlags":222
      },
      {  
         "Lemma":"bar",
         "DescFlags":222
      }
   ]
}

This way we can add additional data per lemma as we go.
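
As a rough sketch, the structure above could map to Java classes like the following. MorphEntrySketch and LemmaInfo are hypothetical names chosen for illustration; they are not the existing MorphData class, and the field types are assumptions based on the JSON.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical mirror of the proposed JSON; not the existing MorphData class.
final class MorphEntrySketch {
    // One lemma with its own flags; new per-lemma fields would be added here.
    static final class LemmaInfo {
        final String lemma;
        final int descFlags;

        LemmaInfo(String lemma, int descFlags) {
            this.lemma = lemma;
            this.descFlags = descFlags;
        }
    }

    final String word;
    final int prefixes;
    final List<LemmaInfo> lemmas;

    MorphEntrySketch(String word, int prefixes, List<LemmaInfo> lemmas) {
        this.word = word;
        this.prefixes = prefixes;
        // Defensive copy so the entry cannot be modified after construction.
        this.lemmas = Collections.unmodifiableList(new ArrayList<>(lemmas));
    }
}
```

Keeping the lemma and its flags together in one object avoids parallel arrays that have to stay in sync by index.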

Sample project?

It would be great to be able to clone a simple project and just replace the backing text to search against with my own. Is this feasible? I am not sure I have the skills to set it up on my own.

Is the project supported?

Hi. I cannot download any builds from links such as:
~/elasticsearch-1.5.2$ bin/plugin --install analysis-hebrew --url https://bintray.com/artifact/download/synhershko/elasticsearch-analysis-hebrew/elasticsearch-analysis-hebrew-1.7.zip

If lowercasing in StreamLemmFilter, do ASCIIFoldingFilter as well

Update to Hspell 1.4

Would you please update to Hspell 1.4 (released on June 24th, 2017)?

Is this still working with recent Solr versions?

Hi.

I see that there wasn't any commit for a long time.
Is HebMorph still working with recent Solr versions, such as 9.1?

Thanks

Support fallback analyzer for NonHebrew tokens

Auto-expand some words

מס' -> מספר
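
One simple way to picture this is a lookup table consulted per token, as in the sketch below. AbbreviationExpanderSketch is a hypothetical name and the table holds only the single pair from this issue, written with Unicode escapes; whether the expansion should replace the token or be indexed alongside it is left open.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only; not part of HebMorph.
final class AbbreviationExpanderSketch {
    private static final Map<String, String> EXPANSIONS = new HashMap<>();

    static {
        // mem, samekh, apostrophe (abbreviation) -> "mispar" (number)
        EXPANSIONS.put("\u05DE\u05E1'", "\u05DE\u05E1\u05E4\u05E8");
    }

    // Returns the expansion for a known abbreviation, or the token unchanged.
    static String expand(String token) {
        return EXPANSIONS.getOrDefault(token, token);
    }
}
```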

More lemma filters

To remove or rank down pronominal verbs, infinitive nouns, and similar forms.

Hebrew analyzers to support special tokenization cases

Treat \uFF07 as '

In accordance with Lucene's standard package behavior
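
A minimal sketch of that normalization as a pre-tokenization step in plain Java; ApostropheNormalizerSketch is a hypothetical name, and where such a step would live in the analyzer chain is not decided here.

```java
// Illustration only; not part of HebMorph.
final class ApostropheNormalizerSketch {
    // Replaces the fullwidth apostrophe (U+FF07) with the ASCII apostrophe.
    static String normalize(String input) {
        return input.replace('\uFF07', '\'');
    }
}
```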

jar files for elastic search

Hi,
Do the jar files (hebmorph core and lucene) for Elasticsearch 5.2 already exist?

Thanks.

VisualHebMorph not working + usability of project

I couldn't get VisualHebMorph to work.

There are no real instructions on how to use this project in general. The API is not documented, so anyone trying to use it needs to do a lot of digging.

Recommendation:

Create a more focused structure so that users can implement it easily. A breakdown of the functions is required, such as:

Class for stopwords
Class for synonyms
Class for lemmatizing
Simple API that integrates them all