synhershko / hebmorph

This is an open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. It includes a Hebrew analyzer for Lucene, which already produces much better results for Hebrew texts than the default Lucene implementation. Available for Java and .NET / Mono.

Home Page: http://code972.com/hebmorph

License: Other

C# 38.17% C 1.64% Java 60.19%

hebmorph's Introduction

HebMorph is an open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. All code and files are released under the GNU Affero General Public License version 3.

More details at http://code972.com/HebMorph

Lucene / Elasticsearch compatibility

Since March 2017, hebmorph-lucene has been released for every Lucene/Solr version, with a matching major version number (and, most of the time, a matching minor version as well). Matching Elasticsearch plugin versions are also available; see https://github.com/synhershko/elasticsearch-analysis-hebrew.

hebmorph-lucene version | Lucene version | Elasticsearch version | Release date
6.2.x                   | 6.1.x -> 6.2.x | 5.1.x -> 5.2.x        | 3/2017
6.0.0                   | 6.0.x          | 5.0.x                 | 1/2017
2.4.0                   | 5.5.x          | 2.4.x                 |
2.3.x                   | 5.4.x          | 2.2.x -> 2.3.x        | 4/2/2016
2.2.x                   | 5.3.x          | 2.0.x -> 2.1.x        | 4/2/2016
2.1.x                   | 4.10.4         | 1.6 -> 1.7.x          | 4/2/2016
2.0.x                   | 4.10.x         | 1.4.x, 1.5.x          | 24/3/2015
1.5.0                   | 4.9.0          | 1.3.x                 | 9/9/2014
1.4.x                   | 4.8.x          | 1.x -> 1.2.x          | August 2014
1.3.x                   | 4.6.x          | 0.90.8 -> 0.90.13     | June 2014
1.2.0                   | 4.5.x          | 0.90.6, 0.90.7        | 10/11/2013
1.1.0                   | 4.4.0          | 0.90.3 -> 0.90.5      |
1.0.0                   | <= 4.3.0       | <= 0.90.2             |

A tutorial for integrating HebMorph with Elasticsearch can be found at http://code972.com/blog/2013/08/129-hebrew-search-with-elasticsearch-and-hebmorph

Get it from Maven Central

For the analyzer support, get hebmorph-lucene:

        <dependency>
            <groupId>com.code972.hebmorph</groupId>
            <artifactId>hebmorph-lucene</artifactId>
            <version>6.6.0</version>
            <scope>compile</scope>
        </dependency>

Lucene.NET compatibility

The .NET version of the library is compatible with Lucene.NET version 3.0.3, but has some known bugs that were fixed in the Java version and haven't been ported back yet.

License

HebMorph is copyright (C) 2010-2015, Itamar Syn-Hershko. HebMorph currently relies on Hspell, copyright (C) 2000-2013, Nadav Har'El and Dan Kenigsberg (http://hspell.ivrix.org.il/).

It is released to the public licensed under the GNU Affero General Public License v3. See the LICENSE file included in this distribution. Note that not only the programs in the distribution, but also the dictionary files and the generated word lists, are licensed under the AGPL. There is no warranty of any kind for the contents of this distribution.

hebmorph's People

Contributors

egozyn, iprovalo, itaifrenkel, itsadok, kirillkh, oferfort, roman4git, synhershko

hebmorph's Issues

Changing readPrefixesFromFile to Support Solr Resource Loading

BTW, Nice analyzer!

Would it be possible to change this method to take an InputStream, so that Solr can use it in the same way as any other Solr plugin? In SolrHSpellLoader.java:

public static HashMap<String, Integer> readPrefixesFromFile(InputStream inputStream)

Then in Solr one could provide the hspellFolder anywhere under the core conf directory:

E.g.:

@Override
public void inform(ResourceLoader loader) throws IOException {
    String hspellFolder = "lang/he/hspell-data-files/";
    InputStream sizeFile = loader.openResource(hspellFolder + SolrHSpellLoader.sizesFile);
    InputStream dmaskFile = loader.openResource(hspellFolder + SolrHSpellLoader.dmaskFile);
    InputStream dictionaryFile = loader.openResource(hspellFolder + SolrHSpellLoader.dictionaryFile);
    InputStream prefixesFile = loader.openResource(hspellFolder + SolrHSpellLoader.prefixesFile);
    InputStream descFile = loader.openResource(hspellFolder + SolrHSpellLoader.descFile);
    InputStream stemsFile = loader.openResource(hspellFolder + SolrHSpellLoader.stemsFile);
    InputStream prefixHFile = loader.openResource(hspellFolder + SolrHSpellLoader.PREFIX_H);
    SolrHSpellLoader solrHSpellLoader = new SolrHSpellLoader(sizeFile, dmaskFile, dictionaryFile, prefixesFile, descFile, stemsFile, loadMorphData);
    dictionary = solrHSpellLoader.loadDictionaryFromHSpellData(prefixHFile);
}

If possible, could this change also be made for hebmorph-2.1.0?
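As a rough illustration of the request above, here is a minimal sketch of a prefix reader that accepts an InputStream, so the caller decides whether the bytes come from a file, the classpath, or a Solr resource loader. The class name PrefixReaderSketch and the tab-separated "prefix, mask" line format are assumptions made for this example only; they are not HebMorph's actual loader or the real hspell file layout.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

// Hypothetical sketch: not HebMorph's real loader and not the real hspell format.
final class PrefixReaderSketch {

    // Reads lines of the assumed form "prefix<TAB>mask" from any stream.
    static HashMap<String, Integer> readPrefixes(InputStream in) throws IOException {
        HashMap<String, Integer> prefixes = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\t");
                if (parts.length != 2) {
                    throw new IOException("Malformed prefix line: " + line);
                }
                prefixes.put(parts[0], Integer.parseInt(parts[1].trim()));
            }
        }
        return prefixes;
    }

    public static void main(String[] args) throws IOException {
        // In-memory data stands in for a stream opened by a resource loader.
        byte[] data = "\u05D4\t4\n\u05D5\u05D4\t6\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readPrefixes(new ByteArrayInputStream(data)).size());
    }
}
```

Because the method only sees a stream, the same code path would serve both a File-based loader and a Solr plugin that opens resources under the core conf directory.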

Offsets in TermPositionVector are not populated correctly

Term start and end offsets in TermPositionVector are populated with zeros. Only the end offsets at the end of the text are populated correctly, with the position of the last character in the text.

I think the issue is in the HebMorph tokenizer, where the offset attribute is set only when the end of the text is reached. If the NextToken method returns because of a space character or the like, the offset property isn't updated.

*I assume that the start and end offsets are the positions of the first and last letter of the token
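To make the expected behavior concrete, here is a minimal, self-contained sketch of a whitespace tokenizer that records offsets for every token, not only for the one that ends at the end of the text. It is not HebMorph's tokenizer; the names OffsetTokenizerSketch and Token are hypothetical. It follows Lucene's documented convention that the end offset is one greater than the index of the token's last character, which differs slightly from the assumption stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a plain whitespace tokenizer, not HebMorph's.
final class OffsetTokenizerSketch {

    static final class Token {
        final String text;
        final int startOffset; // index of the first character
        final int endOffset;   // index one past the last character

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= input.length(); i++) {
            boolean boundary = i == input.length()
                    || Character.isWhitespace(input.charAt(i));
            if (!boundary) {
                if (start < 0) {
                    start = i; // first character of a new token
                }
            } else if (start >= 0) {
                // Offsets are recorded at every boundary, including spaces.
                tokens.add(new Token(input.substring(start, i), start, i));
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("ab cd")) {
            System.out.println(t.text + " " + t.startOffset + " " + t.endOffset);
        }
    }
}
```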

מב"ל

Should be treated as a word / acronym

Bug when using the optimized version for loading HSpell's dictionary files

Hi

When I use the optimized version for loading HSpell's dictionary files (passing false in HSpellLoader loader = new HSpellLoader(new File(HSpellLoader.getHspellPath()), false);), I get this message:
Exception in thread "main" java.lang.NullPointerException.

java.lang.NullPointerException: Attempt to invoke virtual method 'int java.util.ArrayList.size()' on a null object reference

It happens when I call the function parser.parse.

Full Stack Trace:
Exception in thread "main" java.lang.NullPointerException
    at com.code972.hebmorph.MorphData.getLemmas(MorphData.java:100)
    at com.code972.hebmorph.Lemmatizer.lemmatizeTolerant(Lemmatizer.java:154)
    at org.apache.lucene.analysis.hebrew.TokenFilters.HebrewLemmatizerTokenFilter.incrementToken(HebrewLemmatizerTokenFilter.java:99)
    at org.apache.lucene.analysis.hebrew.TokenFilters.AddSuffixTokenFilter.incrementToken(AddSuffixTokenFilter.java:42)
    at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:91)
    at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70)
    at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:223)
    at org.apache.lucene.queryparser.classic.QueryParserBase.newFieldQuery(QueryParserBase.java:480)
    at org.apache.lucene.queryparser.classic.QueryParserBase.getFieldQuery(QueryParserBase.java:472)
    at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:857)
    at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:348)
    at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:247)
    at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:171)
    at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:160)
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
    at org.apache.lucene.queryparsers.HebrewQueryParser.parse(HebrewQueryParser.java:43)
    at test.beur.MySearchFiles.main(MySearchFiles.java:134)

I hope the bug gets fixed soon!

List of applications using HebMorph in production?

Hi (-:
We are evaluating full-text search solutions for our mobile app. The app is in Hebrew, so we ended up looking at HebMorph.
It would be great to know of a production app using HebMorph.
Also, is it actively maintained?

Term positions in the TermPositionVector are not populated properly

I've attached a code sample that reproduces the issue - http://gist.github.com/656589

TermPositionVector stores position values which indicate the place of the word in the text, starting with 0 for the first word in the text. Right now all the positions are shown to be 0 instead of the real word positions. For example in the sample code the position of the word is shown as 0, not 5 as expected.

Lucene.NET Version 2.9.2
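For context, Lucene derives a token's absolute position from per-token position increments: the running position starts at -1 and each token adds its increment. The sketch below (the hypothetical class PositionSketch, written in Java and not taken from HebMorph) shows that arithmetic. One pattern that would yield all-zero positions is every token after the first being emitted with an increment of 0; whether that is the actual cause here is not established by the report.

```java
import java.util.Arrays;

// Hypothetical sketch of Lucene-style position arithmetic.
final class PositionSketch {

    // The position starts at -1; each token advances it by its increment.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Normal stream: every token has increment 1.
        System.out.println(Arrays.toString(absolutePositions(new int[] {1, 1, 1})));
        // Suspect stream: increments of 0 stack every token on position 0.
        System.out.println(Arrays.toString(absolutePositions(new int[] {1, 0, 0})));
    }
}
```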

Hebmorph for android

Hi

I'm able to run HebMorph on an Android device, but loading the dictionary takes a very long time: about 3 minutes.

Is it possible to reduce the time it takes to load the dictionary?

Stronger MorphData structure

Assuming this new data structure:

{  
   "Word":"foo",
   "Prefixes": 111,
   "Lemmas":[  
      {  
         "Lemma":"foo",
         "DescFlags":222
      },
      {  
         "Lemma":"bar",
         "DescFlags":222
      }
   ]
}

This way we can add more data per lemma as we go.
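One way the proposed shape could map onto Java types is sketched below. The class names MorphEntry and LemmaEntry are hypothetical and chosen to avoid implying anything about the existing MorphData class; only the field names come from the JSON above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch mirroring the proposed JSON; not HebMorph's MorphData.
final class LemmaEntry {
    final String lemma;
    final int descFlags;
    // Further per-lemma fields could be added here later.

    LemmaEntry(String lemma, int descFlags) {
        this.lemma = lemma;
        this.descFlags = descFlags;
    }
}

final class MorphEntry {
    final String word;
    final int prefixes;
    final List<LemmaEntry> lemmas;

    MorphEntry(String word, int prefixes, List<LemmaEntry> lemmas) {
        this.word = word;
        this.prefixes = prefixes;
        // Defensive copy so the entry cannot change after construction.
        this.lemmas = Collections.unmodifiableList(new ArrayList<>(lemmas));
    }

    public static void main(String[] args) {
        MorphEntry entry = new MorphEntry("foo", 111, Arrays.asList(
                new LemmaEntry("foo", 222),
                new LemmaEntry("bar", 222)));
        System.out.println(entry.word + " has " + entry.lemmas.size() + " lemmas");
    }
}
```

Grouping the lemma text and its flags in one object keeps them from drifting apart, which parallel arrays would not guarantee.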

Sample project?

It would be great to be able to clone a simple project and just swap in my own text to search against. Is this feasible? I am not sure I have the skills to set it up on my own.

More lemma filters

Add filters to remove or rank down pronominal verbs, infinitive nouns, and similar forms.

VisualHebMorph not working + usability of project

I couldn't get VisualHebMorph to work.

There are no real instructions on how to use this project in general. The API is not documented, so anyone trying to use it needs to do a lot of digging.

Recommendation:

Create a more focused structure so that users can implement it easily. A breakdown of the functions is required, such as:

Class for stopwords
Class for synonyms
Class for lemmatizing
Simple API that integrates them all

HebMorph morphological

Hi.
How can I check your morphology?
I tried to clone your project and it failed.
I want to write a word and get a few sentences containing my input word.

Thanks.
