lupyne's People

Contributors

coady, dependabot[bot]

lupyne's Issues

Example Failed

It seems that the example fails:

import lucene
from lupyne import engine

lucene.initVM()
indexer = engine.Indexer()
indexer.set('name', stored=True)
indexer.set('text')
indexer.add(name='sample', text='hello world')
indexer.commit()

It raises:

Traceback (most recent call last):
  File "index.py", line 78, in <module>
    indexer.set('text')
  File "/Users/Nasy/.pyenv/versions/3.8.0/lib/python3.8/site-packages/lupyne/engine/indexers.py", line 546, in set
    field = self.fields[name] = cls(name, **settings)
  File "/Users/Nasy/.pyenv/versions/3.8.0/lib/python3.8/site-packages/lupyne/engine/documents.py", line 61, in __init__
    assert self.stored or self.indexed or self.docvalues or self.dimensions
AssertionError
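
The assertion in the traceback can be reproduced without Lucene. The sketch below uses a hypothetical stand-in class, FieldSketch, that copies only the check shown at documents.py line 61; it is not lupyne's actual Field class, and the default values are assumptions. It suggests why set('name', stored=True) passes while set('text') with no options does not. Another report on this page registers its text field as indexer.set('text', engine.Field.Text, stored=True), which may be the form the installed version expects.

```python
class FieldSketch:
    """Hypothetical stand-in that mirrors only the assertion from the traceback."""

    def __init__(self, name, stored=False, indexed=False, docvalues=False, dimensions=0):
        self.name = name
        self.stored = stored
        self.indexed = indexed
        self.docvalues = docvalues
        self.dimensions = dimensions
        # Same check as documents.py line 61 in the traceback.
        assert self.stored or self.indexed or self.docvalues or self.dimensions


FieldSketch('name', stored=True)  # passes: the field is stored

try:
    FieldSketch('text')  # no option enabled, like indexer.set('text')
    failed = False
except AssertionError:
    failed = True
```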

Why Lupyne?

Hello,

While looking into how to run PyLucene, I came across your Docker image for PyLucene and eventually ended up here. I'm curious why you wrote Lupyne. Is it to provide a more Pythonic interface to Lucene? Why should one use Lupyne over PyLucene? The lack of documentation on PyLucene makes me feel like only a handful of people are actually using it...

Thanks,
Sep

Could anyone help with a simple example?

Hello,
I am new to search engines and lupyne. I want to use a search engine for a simple task: given a query, search through all documents and return the relevant ones ranked by BM25 score. How can I do that?
I tried the examples in the docs:
[image not preserved]
How can I set BM25 as the scoring function? And should I use different settings, such as tokenization, when searching in different languages?
Sorry for taking your time! Thanks!
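
For reference, the ranking being asked about can be sketched in plain Python. This is not lupyne or Lucene code: bm25_scores, its parameters k1 and b, and the whitespace tokenization are illustrative assumptions, and the formula is the classic BM25 with a Lucene-style idf. Lucene releases from 6.0 onward use BM25 as their default similarity, so explicit configuration may be unnecessary; that is worth confirming against the Lucene version in use.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with classic BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Naive whitespace tokenization; real analyzers are language-dependent.
docs = [text.split() for text in ("hello world", "hello there", "goodbye")]
scores = bm25_scores(["world"], docs)
```

Only the first document contains the query term, so it is the only one with a non-zero score.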

Overriding dict.keys() with Hit.keys breaks Hit object displaying in IPython

First of all, thanks for your efforts in providing a high-level Lucene Python library! I really appreciate that I can almost completely omit Java-related code in my library.

I'm experimenting with the library in IPython and have trouble displaying a Hit object:

In [101]: print(type(h))
<class 'lupyne.engine.documents.Hit'>

In [102]: print(repr(h))
{'LEMMA': ['кошка'], 'LEMMA_LANGUAGE': ['RU'], 'POS': ['n']}

In [103]: h
Out[103]: ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    383                 if cls in self.type_pprinters:
    384                     # printer registered in self.type_pprinters
--> 385                     return self.type_pprinters[cls](obj, self, cycle)
    386                 else:
    387                     # deferred printer

~/.pyenv/versions/3.6.9/envs/babelnet-lite-3.6.9/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    606         step = len(start)
    607         p.begin_group(step, start)
--> 608         keys = obj.keys()
    609         # if dict isn't large enough to be truncated, sort keys before displaying
    610         # From Python 3.7, dicts preserve order by definition, so we don't sort.

TypeError: 'tuple' object is not callable

I think the problem is that a Hit object is an instance of dict, so IPython tries to pretty-print it as a dict. When it calls Hit.keys(), a TypeError occurs because keys has been overridden with a tuple. I suggest renaming Hit.keys to Hit.keys_ to fix this and to follow the principle of least astonishment.
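
The failure described above can be reproduced without lupyne. The sketch below uses a hypothetical stand-in, HitSketch, that shadows dict.keys with a tuple; how the real Hit class stores its keys is an assumption here. It also shows one way for callers to reach the original method.

```python
class HitSketch(dict):
    """Hypothetical stand-in for a dict subclass whose `keys` is data, not a method."""

    def __init__(self, fields):
        super().__init__(fields)
        # Shadow dict.keys with a tuple, as the traceback suggests Hit does.
        self.keys = tuple(fields)


hit = HitSketch({'LEMMA': ['кошка'], 'POS': ['n']})

print(repr(hit))  # dict.__repr__ does not go through self.keys, so this works

try:
    hit.keys()  # what a dict pretty-printer does
    error = None
except TypeError as exc:
    error = str(exc)

# Workaround for callers: bypass the shadowing attribute explicitly.
field_names = list(dict.keys(hit))
```

Calling dict.keys(hit) sidesteps the instance attribute, but generic dict consumers such as IPython's pretty-printer still call obj.keys() directly, which is why a rename on the library side is the more robust fix.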

indexer.commit gets stuck when using multiprocessing

When indexer.commit() is run in a separate process (multiprocessing), the commit tends to get stuck.
I've tried attachCurrentThread() as well, but it doesn't seem to work.

Is there any way to use multiprocessing along with lupyne?

Following is the code:

import lucene
from lupyne import engine
lucene.initVM()
#assert lucene.getVMEnv() or lucene.initVM()
from multiprocessing import Process

#vm_env = lucene.initVM(vmargs=['-Djava.awt.headless=true'])
#from org.apache.lucene import analysis, document, index, queryparser, search, store, util
class testd:
    def idx(self):
        #lucene.getVMEnv().attachCurrentThread()
        print("init")
        indexer = engine.Indexer()
        indexer.set('fieldname', stored=True) # settings for all documents of indexer; indexed and tokenized is the default
        indexer.add(fieldname="sample_test")
        print("Trying to commit")
        indexer.commit()
        print("done")

if __name__ == '__main__':
    #testd().idx()
    p = Process(target=testd().idx)
    p.start()
    p.join()

ImportError: cannot import name 'spans' from 'org.apache.lucene.search' (unknown location)

I'm using the coady/pylucene:latest Docker image and trying to import the engine module (from lupyne import engine), but it gives me this error:

Traceback (most recent call last):
  File "/opt/project/main.py", line 1, in <module>
    from lupyne import engine
  File "/usr/local/lib/python3.11/site-packages/lupyne/engine/__init__.py", line 10, in <module>
    from .queries import Query  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/lupyne/engine/queries.py", line 6, in <module>
    from org.apache.lucene.search import spans
ImportError: cannot import name 'spans' from 'org.apache.lucene.search' (unknown location)

Combining Querys with BooleanQuerys

Hi @coady, thanks for all your hard work on lupyne, it's been super helpful for me! I used your Dockerfile as a basis for compiling JCC & PyLucene to wheel files in my own non-Docker environment. I've now been able to run some of the examples, set up my own 14 GB corpus, index it to a directory, and do some basic searches based on the examples you provided in the docs.

Right now I'm trying to write a slightly more complex query, but I'm having some trouble and am hoping you might be able to point me in the right direction.

I have a fairly simple index with 4 stored fields: a text field containing the article text, a text field containing the name of the company (the list of company names is finite and each document is associated with exactly one company), a datetime field containing the date the article was published, and an article id.

I'm trying to write a query that does the following: find all documents that contain the phrase "lupyne is great", fall within some arbitrary date range, and have a company_name field value of 'company a', 'company b', or 'company c'.

I've tried the following:

import lucene
from lupyne import engine
from datetime import date

assert lucene.getVMEnv() or lucene.initVM()

index_path: str = r'myindexdir'

query_str: str = 'lupyne is great'
start_date: date = date(year=2020, month=2, day=14)
companies: list[str] = ['company a', 'company b', 'company c']

indexer = engine.Indexer(index_path, mode='r', nrt=True)

indexer.set('article_id', stored=True)
indexer.set('company_name', stored=True)
indexer.set('date', engine.DateTimeField, stored=True)
indexer.set('text', engine.Field.Text, stored=True)

query_engine = engine.Query

# The following works with the query string 'lupyne'
query_str: str = 'lupyne'
query = indexer.fields['date'].range(start_date, None) & query_engine.term('text', query_str)

# This does not work with the query string 'lupyne is great':
query_str: str = 'lupyne is great'
query = indexer.fields['date'].range(start_date, None) & query_engine.phrase('text', query_str)
# TypeError: unsupported operand type(s) for &: 'Query' and 'MultiPhraseQuery'

# This also does not work
date_field = indexer.fields['date']  # assumed definition; not shown in the original report
range_query = query_engine.range('date', date_field.timestamp(start_date), None)
# java.lang.IncompatibleClassChangeError
#        at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)

# This will also break
range_query = query_engine.range('date', start_date, None)
# lucene.InvalidArgsError: (<class 'org.apache.lucene.util.BytesRef'>, '__init__', (datetime.date(2021, 2, 2),))

Any suggestions on how I might go about this? Thanks again for all the hard work!

EDIT: So, it looks like this might be because Query.ranges() doesn't return a lupyne Query object as seen here, but instead directly returns a pylucene query object. Any good way to get around this?
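
The difference between the working and failing expressions is consistent with how Python dispatches the & operator. The sketch below uses hypothetical stand-ins, RawQuery and WrappedQuery, and no lupyne or PyLucene code; whether lupyne's classes behave this way is an assumption based on the error message above. It shows that an operand without & support can still be combined when the other operand defines __rand__, and that the TypeError appears only when neither side implements the operator.

```python
class RawQuery:
    """Hypothetical stand-in for a query object that defines no `&` support."""


class WrappedQuery:
    """Hypothetical stand-in for a wrapper that supports `&` on either side."""

    def __init__(self, *clauses):
        self.clauses = clauses

    def __and__(self, other):
        return WrappedQuery(self, other)

    def __rand__(self, other):
        # Called for `other & self` when `other` has no usable __and__.
        return WrappedQuery(other, self)


combined = RawQuery() & WrappedQuery()  # works through WrappedQuery.__rand__

try:
    RawQuery() & RawQuery()  # neither side implements the operator
    message = None
except TypeError as exc:
    message = str(exc)
```

Under that assumption, making sure at least one operand is a wrapper object before applying & would avoid the error, at the cost of one extra construction step.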
