coady / lupyne

Pythonic search engine based on PyLucene.

Home Page: https://coady.github.io/lupyne/

License: Other

Languages: Makefile 0.23%, Python 99.77%

Topics: pylucene, lucene, search-engine, fastapi, strawberry-graphql

lupyne's Issues

Example Failed

It seems that the example fails:

import lucene
from lupyne import engine

lucene.initVM()
indexer = engine.Indexer()
indexer.set('name', stored=True)
indexer.set('text')
indexer.add(name='sample', text='hello world')
indexer.commit()

This raises:

Traceback (most recent call last):
  File "index.py", line 78, in <module>
    indexer.set('text')
  File "/Users/Nasy/.pyenv/versions/3.8.0/lib/python3.8/site-packages/lupyne/engine/indexers.py", line 546, in set
    field = self.fields[name] = cls(name, **settings)
  File "/Users/Nasy/.pyenv/versions/3.8.0/lib/python3.8/site-packages/lupyne/engine/documents.py", line 61, in __init__
    assert self.stored or self.indexed or self.docvalues or self.dimensions
AssertionError

Why Lupyne?

While looking around for how to run PyLucene, I stumbled upon your Docker image for PyLucene and eventually ended up here. I'm curious why you wrote Lupyne. Is it to provide a more Pythonic interface to Lucene? Why should one use Lupyne over PyLucene? The lack of documentation on PyLucene makes me feel like only a handful of people are actually using PyLucene...

Thanks,
Sep

Could anyone help with a simple example?

Hello,
I am new to search engines and to lupyne. My goal is simple: given a query, I want to search through all documents and return the relevant ones ranked by BM25 score. How can I do that?
I tried the examples in the docs:

How can I set BM25 as the scoring function? And should I use different settings, such as tokenization, when searching in different languages?
Sorry for taking up your time, and thanks!
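
For reference, here is a minimal sketch of what "ranking by BM25 score" computes, written in plain Python. It does not use lupyne or PyLucene, and it is not how either library implements scoring. The function name bm25_rank, the naive whitespace tokenizer, the sample documents, and the defaults k1=1.2 and b=0.75 are all assumptions made for illustration, using the classic BM25 formula.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score.

    Illustrative only: tokenization is a lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = ["hello world", "hello there general", "lucene search engine"]
ranking = bm25_rank("hello world", docs)
for index, score in ranking:
    print(index, round(score, 3))
```

Different languages would mainly change the tokenization step, which is a plain split here; a real index would use a language-appropriate analyzer instead.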

Overriding dict.keys() with Hit.keys breaks Hit object displaying in IPython

First of all, thanks for your efforts in providing a high-level Lucene Python library! I really appreciate that I can almost completely omit Java-related code in my library.

I'm experimenting with the library in IPython and have a problem displaying a Hit object:

In [101]: print(type(h))
<class 'lupyne.engine.documents.Hit'>

In [102]: print(repr(h))
{'LEMMA': ['кошка'], 'LEMMA_LANGUAGE': ['RU'], 'POS': ['n']}

In [103]: h
Out[103]: ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    383                 if cls in self.type_pprinters:
    384                     # printer registered in self.type_pprinters
--> 385                     return self.type_pprinters[cls](obj, self, cycle)
    386                 else:
    387                     # deferred printer

~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    606         step = len(start)
    607         p.begin_group(step, start)
--> 608         keys = obj.keys()
    609         # if dict isn't large enough to be truncated, sort keys before displaying
    610         # From Python 3.7, dicts preserve order by definition, so we don't sort.

TypeError: 'tuple' object is not callable

I think the problem is that the Hit object is an instance of dict, so IPython tries to pretty-print it as a dict. When it calls Hit.keys(), a TypeError occurs because you've overridden keys with a tuple. I suggest renaming Hit.keys to Hit.keys_ to fix this and to follow the principle of least astonishment.
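
The failure mode can be reproduced without lupyne or IPython. The sketch below uses two hypothetical stand-in classes, ShadowedHit and RenamedHit, which are not lupyne's actual Hit implementation; they only illustrate how storing a tuple under the name keys breaks any caller that expects dict.keys() to be a method, and how the suggested rename avoids it.

```python
class ShadowedHit(dict):
    """Hypothetical dict subclass that stores a tuple under the name `keys`."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.keys = tuple(self)  # instance attribute hides the dict.keys method


class RenamedHit(dict):
    """Same idea, but the tuple lives under `keys_`, leaving dict.keys intact."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.keys_ = tuple(self)


def call_keys(mapping):
    """Do what a generic dict consumer (such as a pretty-printer) would do."""
    try:
        return sorted(mapping.keys())
    except TypeError as exc:
        return f"TypeError: {exc}"


shadowed = call_keys(ShadowedHit(LEMMA=['кошка'], POS=['n']))
renamed = call_keys(RenamedHit(LEMMA=['кошка'], POS=['n']))
print(shadowed)
print(renamed)
```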

Python Analyzer for lucene

I'm using classic Solr/Lucene. I'd like to write a Python analyzer (using a part-of-speech NLP library). I couldn't find a good way to do this with Solr/Lucene. I came across this file, which makes me think lupyne might be a solution:

https://github.com/coady/lupyne/blob/main/lupyne/engine/analyzers.py

Or should I instead process the text before inserting it into Lucene? E.g., replace <word> with <pos>|<word>, then tell Lucene to split at |?
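
The preprocessing alternative could look roughly like the sketch below. It covers only the Python side of the idea: the helper names annotate and split_token are hypothetical, and the hard-coded tagged list stands in for the output of whichever part-of-speech tagger is used. How Lucene or Solr would then be configured to split on | is not shown.

```python
def annotate(tagged_tokens, sep="|"):
    """Join (word, pos) pairs into space-separated 'pos|word' tokens."""
    return " ".join(f"{pos}{sep}{word}" for word, pos in tagged_tokens)


def split_token(token, sep="|"):
    """Recover (pos, word) from a single 'pos|word' token."""
    pos, _, word = token.partition(sep)
    return pos, word


# Stand-in for real tagger output; the tags are illustrative.
tagged = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
text = annotate(tagged)
print(text)
print([split_token(token) for token in text.split()])
```

Putting the tag first means partition splits on the first separator, so a word that itself contains | still round-trips.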

indexer.commit gets stuck when using multiprocessing

When indexer.commit() is run in a separate process (multiprocessing), the commit tends to get stuck.
I've tried attachCurrentThread() as well, but it doesn't seem to work.

Is there any way to use multiprocessing along with lupyne?

The code is as follows:

import lucene
from lupyne import engine
lucene.initVM()
#assert lucene.getVMEnv() or lucene.initVM()
from multiprocessing import Process

#vm_env = lucene.initVM(vmargs=['-Djava.awt.headless=true'])
#from org.apache.lucene import analysis, document, index, queryparser, search, store, util
class testd:
    def idx(self):
        #lucene.getVMEnv().attachCurrentThread()
        print("init")
        indexer = engine.Indexer()
        indexer.set('fieldname', stored=True)  # settings for all documents of indexer; indexed and tokenized is the default
        indexer.add(fieldname="sample_test")
        print("Trying to commit")
        indexer.commit()
        print("done")

if __name__ == '__main__':
    #testd().idx()
    p = Process(target=testd().idx)
    p.start()
    p.join()

ImportError: cannot import name 'spans' from 'org.apache.lucene.search' (unknown location)

I'm using the coady/pylucene:latest Docker image, and importing lupyne (from lupyne import engine) gives me this error:

Traceback (most recent call last):
  File "/opt/project/main.py", line 1, in <module>
    from lupyne import engine
  File "/usr/local/lib/python3.11/site-packages/lupyne/engine/__init__.py", line 10, in <module>
    from .queries import Query # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/lupyne/engine/queries.py", line 6, in <module>
    from org.apache.lucene.search import spans
ImportError: cannot import name 'spans' from 'org.apache.lucene.search' (unknown location)

Combining Querys with BooleanQuerys

Hi @coady, thanks for all your hard work on lupyne, it's been super helpful for me! I used your Dockerfile as a basis for compiling JCC and PyLucene to wheel files in my own non-Docker environment. I've now been able to run some of the examples, set up my own 14 GB corpus, index it to a directory, and do some basic searches based on the examples you provided in the docs.

Right now I'm trying to write a slightly more complex query, but was having some trouble and hoping you might be able to point me in the right direction.

I have a fairly simple index with 4 stored fields: a text field containing the article text, a text field containing the name of the company (the list of company names is finite and each document is associated with exactly one company), a datetime field containing the date the article was published, and an article id.

I'm trying to write a query that finds all documents that contain the phrase "lupyne is great", were published within some arbitrary date range, and have a company_name field value of 'company a', 'company b', or 'company c'.

I've tried the following:

import lucene
from lupyne import engine
from datetime import date

assert lucene.getVMEnv() or lucene.initVM()

index_path: str = r'myindexdir'

query_str: str = 'lupyne is great'
start_date: date = date(year=2020, month=2, day=14)
companies: [str] = ['company a', 'company b', 'company c']

indexer = engine.Indexer(index_path, mode='r', nrt=True)

indexer.set('article_id', stored=True)
indexer.set('company_name', stored=True)
indexer.set('date', engine.DateTimeField, stored=True)
indexer.set('text', engine.Field.Text, stored=True)

query_engine = engine.Query

# The following works with the query string 'lupyne'
query_str: str = 'lupyne'
query = indexer.fields['date'].range(start_date, None) & query_engine.term('text', query_str)

# This does not work with the query string 'lupyne is great':
query_str: str = 'lupyne is great'
query = indexer.fields['date'].range(start_date, None) & query_engine.phrase('text', query_str)
# TypeError: unsupported operand type(s) for &: 'Query' and 'MultiPhraseQuery'

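# This also does not work
# (date_field is not defined in the original snippet; assumed here to be the indexer's 'date' field)
date_field = indexer.fields['date']
range_query = query_engine.range('date', date_field.timestamp(start_date), None)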
# java.lang.IncompatibleClassChangeError
#        at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)

# This will also break
range_query = query_engine.range('date', start_date, None)
# lucene.InvalidArgsError: (<class 'org.apache.lucene.util.BytesRef'>, '__init__', (datetime.date(2021, 2, 2),))

Any suggestions on how I might go about this? Thanks again for all the hard work!

EDIT: So, it looks like this might be because Query.ranges() doesn't return a lupyne Query object as seen here, but instead directly returns a pylucene query object. Any good way to get around this?
