
lupyne's Issues

Why Lupyne?

Hello,

While looking for a way to run PyLucene, I stumbled upon your Docker image for PyLucene and eventually ended up here. I'm curious why you wrote Lupyne. Is it to provide a more Pythonic interface to Lucene? Why should one use Lupyne over PyLucene? The lack of documentation for PyLucene makes me feel that only a handful of people are actually using it.

Thanks,
Sep

Could anyone help with a simple example?

Hello,
I am new to search engines and to lupyne. I want to use a search engine for a simple task: given a query, search through all documents and return the relevant ones ranked by BM25 score. How can I do that?
I tried the examples in the docs:
[image not preserved]
How can I set BM25 as the scoring function? And should I use different settings, such as tokenization, when searching different languages?
Sorry for taking your time, and thanks!
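
For context on the scoring question: as far as I know, Lucene has used BM25 as its default similarity since version 6, so results from a default searcher should already be ordered by BM25. The sketch below is not lupyne or Lucene code; it is a textbook Okapi BM25 in plain Python, with whitespace tokenization and the helper name bm25_scores as assumptions, only to show what the score rewards. The defaults k1=1.2 and b=0.75 match Lucene's documented defaults, but the exact numbers will not match Lucene's output.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Long documents are penalized relative to the average length.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


texts = ["hello world", "hello hello search engine", "completely unrelated text"]
docs = [text.lower().split() for text in texts]
scores = bm25_scores(["hello"], docs)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
print(ranking, scores)
```

For other languages, the tokenization step is the part that changes; in Lucene that is the analyzer's job, so the choice of analyzer matters more than the scoring function.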

ImportError: cannot import name 'spans' from 'org.apache.lucene.search' (unknown location)

I'm using the coady/pylucene:latest Docker image and trying to import the engine module (from lupyne import engine), but it fails with this error:

Traceback (most recent call last):
  File "/opt/project/main.py", line 1, in <module>
    from lupyne import engine
  File "/usr/local/lib/python3.11/site-packages/lupyne/engine/__init__.py", line 10, in <module>
    from .queries import Query  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/lupyne/engine/queries.py", line 6, in <module>
    from org.apache.lucene.search import spans
ImportError: cannot import name 'spans' from 'org.apache.lucene.search' (unknown location)
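
An error like this usually points to a version mismatch between lupyne and the PyLucene in the image. As far as I know, Lucene 9 moved the span queries from org.apache.lucene.search.spans to org.apache.lucene.queries.spans, so an older lupyne running against a newer PyLucene would fail at exactly this import; installing a lupyne release that matches the PyLucene version is probably the real fix. To check which layout an environment provides, a small probe can help. The helper name first_importable is hypothetical, and the demo uses standard-library modules so it runs without a JVM.

```python
import importlib


def first_importable(*module_names):
    """Return the first module in module_names that imports successfully."""
    failures = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            failures.append(f"{name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(failures))


# Demo with stdlib names; the first candidate does not exist, the second does.
module = first_importable("no_such_module_for_demo", "json")
print(module.__name__)
```

After lucene.initVM(), the same call with "org.apache.lucene.queries.spans" and "org.apache.lucene.search.spans" as candidates (both package names are my assumption about the two layouts) would show which one the installed PyLucene exposes.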

Combining Querys with BooleanQuerys

Hi @coady, thanks for all your hard work on lupyne; it's been super helpful for me! I used your Dockerfile as a basis for compiling JCC and PyLucene into wheel files in my own non-Docker environment. I've now been able to run some of the examples, set up my own 14 GB corpus, index it to a directory, and do some basic searches based on the examples in the docs.

Right now I'm trying to write a slightly more complex query, but I'm having some trouble and hope you might be able to point me in the right direction.

I have a fairly simple index with four stored fields: a text field containing the article text, a text field containing the company name (the list of company names is finite, and each document is associated with exactly one company), a datetime field containing the date the article was published, and an article ID.

I'm trying to write a query that finds all documents that contain the phrase "lupyne is great", were published within some arbitrary date range, and have a company_name field value of 'company a', 'company b', or 'company c'.

I've tried the following:

import lucene
from lupyne import engine
from datetime import date

assert lucene.getVMEnv() or lucene.initVM()

index_path: str = r'myindexdir'

query_str: str = 'lupyne is great'
start_date: date = date(year=2020, month=2, day=14)
companies: list[str] = ['company a', 'company b', 'company c']

indexer = engine.Indexer(index_path, mode='r', nrt=True)

indexer.set('article_id', stored=True)
indexer.set('company_name', stored=True)
indexer.set('date', engine.DateTimeField, stored=True)
indexer.set('text', engine.Field.Text, stored=True)

query_engine = engine.Query

# The following works with the query string 'lupyne'
query_str: str = 'lupyne'
query = indexer.fields['date'].range(start_date, None) & query_engine.term('text', query_str)

# This does not work with the query string 'lupyne is great':
query_str: str = 'lupyne is great'
query = indexer.fields['date'].range(start_date, None) & query_engine.phrase('text', query_str)
# TypeError: unsupported operand type(s) for &: 'Query' and 'MultiPhraseQuery'

# This also does not work
# (date_field is not defined in this snippet; presumably indexer.fields['date'])
range_query = query_engine.range('date', date_field.timestamp(start_date), None)
# java.lang.IncompatibleClassChangeError
#        at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)

# This will also break
range_query = query_engine.range('date', start_date, None)
# lucene.InvalidArgsError: (<class 'org.apache.lucene.util.BytesRef'>, '__init__', (datetime.date(2021, 2, 2),))

Any suggestions on how I might go about this? Thanks again for all the hard work!

EDIT: It looks like this might be because Query.ranges() doesn't return a lupyne Query object, as seen here, but instead directly returns a PyLucene query object. Is there a good way to get around this?

Example Failed

It seems that the example fails:

lucene.initVM()
indexer = engine.Indexer()
indexer.set('name', stored=True)
indexer.set('text')
indexer.add(name='sample', text='hello world')
indexer.commit()

It raises:

Traceback (most recent call last):
  File "index.py", line 78, in <module>
    indexer.set('text')
  File "/Users/Nasy/.pyenv/versions/3.8.0/lib/python3.8/site-packages/lupyne/engine/indexers.py", line 546, in set
    field = self.fields[name] = cls(name, **settings)
  File "/Users/Nasy/.pyenv/versions/3.8.0/lib/python3.8/site-packages/lupyne/engine/documents.py", line 61, in __init__
    assert self.stored or self.indexed or self.docvalues or self.dimensions
AssertionError

Overriding dict.keys() with Hit.keys breaks Hit object displaying in IPython

First of all, thanks for your efforts in providing a high-level Lucene Python library! I really appreciate that I can almost completely omit Java-related code in my library.

I'm experimenting with the library in IPython and have a problem displaying a Hit object:

In [101]: print(type(h))
<class 'lupyne.engine.documents.Hit'>

In [102]: print(repr(h))
{'LEMMA': ['кошка'], 'LEMMA_LANGUAGE': ['RU'], 'POS': ['n']}

In [103]: h
Out[103]: ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    383                 if cls in self.type_pprinters:
    384                     # printer registered in self.type_pprinters
--> 385                     return self.type_pprinters[cls](obj, self, cycle)
    386                 else:
    387                     # deferred printer

~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    606         step = len(start)
    607         p.begin_group(step, start)
--> 608         keys = obj.keys()
    609         # if dict isn't large enough to be truncated, sort keys before displaying
    610         # From Python 3.7, dicts preserve order by definition, so we don't sort.

TypeError: 'tuple' object is not callable

I think the problem is that the Hit object is an instance of dict, so IPython tries to pretty-print it as a dict. When it calls Hit.keys(), a TypeError occurs because keys has been overridden with a tuple. I suggest renaming Hit.keys to Hit.keys_ to fix this and to follow the principle of least astonishment.
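
The failure can be reproduced without lupyne or IPython. The two classes below are stand-ins written for this illustration, not lupyne's actual Hit; the attribute name sort_keys in the second class is likewise just one possible non-clashing name.

```python
class ShadowingHit(dict):
    """Stand-in for a dict subclass whose 'keys' attribute is a tuple."""

    def __init__(self, mapping, keys=()):
        super().__init__(mapping)
        self.keys = keys  # instance attribute now hides the dict.keys method


class RenamedHit(dict):
    """Same data, but the tuple lives under a name that dict does not use."""

    def __init__(self, mapping, sort_keys=()):
        super().__init__(mapping)
        self.sort_keys = sort_keys


broken = ShadowingHit({"POS": ["n"]}, keys=(1.0,))
try:
    broken.keys()  # what a dict pretty-printer does internally
except TypeError as exc:
    error_message = str(exc)

fixed = RenamedHit({"POS": ["n"]}, sort_keys=(1.0,))
field_names = list(fixed.keys())  # the mapping protocol works again
print(error_message, field_names)
```

Any code that treats the object as a mapping and calls keys() hits the same error, so the problem is not specific to IPython's pretty-printer.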

indexer.commit gets stuck when using multiprocessing

When indexer.commit() is run in a separate process (multiprocessing), the commit tends to get stuck.
I've tried attachCurrentThread() as well, but it doesn't seem to work.

Is there any way to use multiprocessing along with lupyne?

Here is the code:

import lucene
from lupyne import engine
lucene.initVM()
# assert lucene.getVMEnv() or lucene.initVM()
from multiprocessing import Process

# vm_env = lucene.initVM(vmargs=['-Djava.awt.headless=true'])
# from org.apache.lucene import analysis, document, index, queryparser, search, store, util
class testd:
    def idx(self):
        # lucene.getVMEnv().attachCurrentThread()
        print("init")
        indexer = engine.Indexer()
        # settings for all documents of indexer; indexed and tokenized is the default
        indexer.set('fieldname', stored=True)
        indexer.add(fieldname="sample_test")
        print("Trying to commit")
        indexer.commit()
        print("done")

if __name__ == '__main__':
    # testd().idx()
    p = Process(target=testd().idx)
    p.start()
    p.join()
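
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: the script starts the JVM in the parent and then creates a child process, and a JVM started before a fork is generally not usable in the forked child because its background threads are not copied. A common pattern is to start the runtime inside the child instead. The sketch below shows only that pattern in plain Python; start_runtime is a hypothetical stand-in for per-process setup such as lucene.initVM(), and no lupyne call is made.

```python
import multiprocessing
import os


def start_runtime():
    """Hypothetical stand-in for per-process setup such as lucene.initVM()."""
    return {"pid": os.getpid()}


def index_in_child(results):
    # Start the runtime inside the function the child executes, so the child
    # owns its runtime instead of inheriting state from the parent.
    runtime = start_runtime()
    # ... create the indexer, add documents, and commit here ...
    results.put(runtime["pid"])


def main():
    results = multiprocessing.Queue()
    child = multiprocessing.Process(target=index_in_child, args=(results,))
    child.start()
    child_pid = results.get()  # read the result before joining the child
    child.join()
    print("runtime started in process", child_pid, "parent is", os.getpid())
```

As in the original script, main() would be called under an if __name__ == '__main__': guard, and the parent would not call the runtime setup itself before starting the child.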
