
pysolr's Introduction


Haystack

author: Daniel Lindsley
date: 2013/07/28

Haystack provides modular search for Django. It features a unified, familiar API that allows you to plug in different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.) without having to modify your code.

Haystack is BSD licensed, plays nicely with third-party apps without needing to modify their source, and supports advanced features like faceting, More Like This, highlighting, spatial search and spelling suggestions.

You can find more information at http://haystacksearch.org/.

Getting Help

There is a mailing list (http://groups.google.com/group/django-haystack/) available for general discussion and an IRC channel (#haystack on irc.freenode.net).

Documentation

See the changelog

Requirements

Haystack has a relatively easily-met set of requirements.

  • Python 3.8+
  • Django 3-5

Additionally, each backend has its own requirements. You should refer to https://django-haystack.readthedocs.io/en/latest/installing_search_engines.html for more details.

pysolr's People

Contributors

acdha, ahankinson, alexwlchan, anti-social, atuljangra, cclauss, ch2ohch2oh, dcramer, dependabot-preview[bot], dependabot[bot], dwvisser, efagerberg, gryphius, gthb, hornn, ke4roh, mitchelljkotler, mitchellrj, mjumbewu, pabluk, phill-tornroth, pre-commit-ci[bot], ramayer, samj1912, skirsdeda, soypunk, timgates42, toastdriven, tongwang, tuky


pysolr's Issues

Getting an error with pysolr - with python 3.3

The error is below:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import pysolr
from pysolr import Solr
conn = Solr('http://127.0.0.1:8983/solr/')
conn.delete(q='*:*')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".\pysolr.py", line 780, in delete
return self._update(m, commit=commit, waitFlush=waitFlush, waitSearcher=waitSearcher)
File ".\pysolr.py", line 359, in _update
return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
File ".\pysolr.py", line 293, in _send_request
raise SolrError(error_message)
pysolr.SolrError: [Reason: Error 404 Not Found]

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Traceback (most recent call last):
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django/core/management/base.py", line 222, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django/core/management/base.py", line 255, in execute
    output = self.handle(*args, **options)
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/management/commands/update_index.py", line 170, in handle
    return super(Command, self).handle(*items, **options)
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django/core/management/base.py", line 355, in handle
    label_output = self.handle_label(label, **options)
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/management/commands/update_index.py", line 229, in handle_label
    do_update(self.backend, index, qs, start, end, total, self.verbosity)
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/management/commands/update_index.py", line 68, in do_update
    backend.update(index, current_qs, commit=True)
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/backends/solr_backend.py", line 70, in update
    self.conn.add(docs, commit=commit, boost=index.get_field_weights())
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/pysolr.py", line 740, in add
    message.append(self._build_doc(doc, boost=boost))
  File "/Users/mike/Virtualenv/lib/python2.7/site-packages/pysolr.py", line 696, in _build_doc
    field.text = self._from_python(bit)
  File "lxml.etree.pyx", line 922, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:40737)
  File "apihelpers.pxi", line 656, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18467)
  File "apihelpers.pxi", line 1339, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24233)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

search query escape function

Would it make sense to add an escape function for raw data used in searches and include it in the library itself?

Something like this:

def solr_escape(value):
    # Escape the backslash first, or the backslashes added for the
    # other characters would themselves be escaped again.
    special_chars = ['\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
                     '[', ']', '^', '"', '~', '*', '?', ':']
    for char in special_chars:
        value = value.replace(char, '\\' + char)
    return value
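For comparison, Lucene's own QueryParser.escape escapes each special character individually, so the two-character operators && and || come out as \&\&, \|\|. A minimal sketch of that behaviour (the character set here follows the Lucene query-syntax docs; adjust it to your Solr version):

```python
def lucene_escape(value):
    # Characters treated as special by the Lucene/Solr query parser.
    # Escaping each one individually also covers the two-character
    # operators && and ||, since both characters get a backslash.
    special = set('\\+-&|!(){}[]^"~*?:/')
    return ''.join('\\' + ch if ch in special else ch for ch in value)
```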

Issue when converting a multi-valued Solr field to native python type

I'm on Solr 4.2 and I'm observing a strange behavior when converting a multi-valued Solr field to native python type.
The issue seems to originate in the method _to_python (of the python class Solr), specifically line number 525-526:

if isinstance(value, (list, tuple)):
    value = value[0]

I'm not sure why we just pick and return the first value from the list. In previous pysolr versions this worked fine, because we picked and returned the entire value.

Had this been a real bug, I don't think it would have remained undetected for this long, so this makes me question what I'm doing wrong. Filing this issue on the off-chance there are others who are also experiencing it.
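If you are stuck on an affected version, one workaround is to subclass Solr and override _to_python so it converts every element instead of discarding all but the first. A standalone sketch of the corrected logic (the scalar branch is simplified here, not pysolr's full implementation, which also parses dates, booleans, etc.):

```python
def to_python_preserving_lists(value):
    # Recurse into lists/tuples rather than returning only value[0].
    if isinstance(value, (list, tuple)):
        return [to_python_preserving_lists(v) for v in value]
    # Scalars pass through unchanged in this simplified sketch.
    return value
```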

TypeError: Element() keywords must be strings

Traceback (most recent call last):
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/management/commands/update_index.py", line 210, in handle_
label
    self.update_backend(label, using)
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/management/commands/update_index.py", line 256, in update_
backend
    do_update(backend, index, qs, start, end, total, self.verbosity)
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/management/commands/update_index.py", line 78, in do_updat
e
    backend.update(index, current_qs)
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/backends/solr_backend.py", line 66, in update
    self.conn.add(docs, commit=commit, boost=index.get_field_weights())
  File "/storage/pydev/feelfree-v4/feelfree/../lib/pysolr.py", line 740, in add
    message.append(self._build_doc(doc, boost=boost))
  File "/storage/pydev/feelfree-v4/feelfree/../lib/pysolr.py", line 695, in _build_doc
    field = ET.Element('field', **attrs)
  File "lxml.etree.pyx", line 2560, in lxml.etree.Element (src/lxml/lxml.etree.c:52924)
TypeError: Element() keywords must be strings

attrs looks like {u'name': 'django_ct'}

env:

  • Linux 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • python 2.7.3
  • pysolr 3.0.5
  • lxml 2.3.5

Possible fix (works for me)

@@ -687,10 +687,10 @@ class Solr(object):
                 if self._is_null_value(bit):
                     continue

-                attrs = {'name': key}
+                attrs = {str('name'): key}

                 if boost and key in boost:
-                    attrs['boost'] = force_unicode(boost[key])
+                    attrs[str('boost')] = force_unicode(boost[key])

                 field = ET.Element('field', **attrs)
                 field.text = self._from_python(bit)

UnicodeDecodeError in _scrape_response

There is this bit in the code:

            except SyntaxError as err:
                full_html = "%s" % response

It's trying to get the response into a str when the xml parsing fails.

But if that response contains unicode, that line will error.

The response it's trying to parse is:

'{"responseHeader":{"status":400,"QTime":281,"params":{"q":"*:*","facet.field":["facet_keywords_exact","grade","{!ex=category_exact}category_exact","hierarchy4"],"fl":"* score","start":"0","fq":["django_ct:(campsite.campsite)","{!geofilt pt=45.2065182704,0.826721191406 sfield=point d=25.749504}"],"f.facet_keywords_exact.facet.limit":"1000","rows":"1","facet":"on","wt":"json"}},"response":{"numFound":5,"start":0,"maxScore":1.0,"docs":[{"rating":"0","point":"45.4129972217,0.931974828309","grade":3,"text":"Camping le Repaire\\nCamping Le Repaire is in the beautiful region of P\xc3\xa9rigord, surrounded by trees, nature and rivers. Located only a short walk from the small village of Thiviers, this family-run campsite is ideal for a relaxing holiday with family and friends.\\r\\n\\r\\nOur facilities include:\\r\\n- two toilet/shower blocks\\r\\n- disabled access facilities\\r\\n- swimming pool\\r\\n- playground\\r\\n- bar and takeaway\\r\\n- laundry room\\r\\n\\r\\nPlay p\xc3\xa9tanque, play table football, ping pong or pool - or simply relax by the pool with an ice cream or an aperitif.\\r\\n\\r\\nOur friendly barbecues are not to be missed!\\nThiviers, Dordogne, Aquitaine, France\\nThiviers\\nDordogne\\nFrance\\nDordogne\\nfrance/aquitaine/dordogne/thiviers/camping-le-repaire\\ncamping-le-repaire\\n aa-pennant-none bar-nearby bar-or-club-house barbecues-allowed beach-excellent-water-quality beaches book-pitchup canoeingkayaking-nearby car-parking-by-pitch cycle-hire-nearby cycling-nearby disabled-facilities dogs-allowed dogs-allowed-all-year drainage-hook-up-points-for-tourers electrical-hook-up-points-for-tourers english family-friendly fishing free-wifi games-room gastronomic-delight hard-standings horse-riding-nearby is_balance_of_payment_taken_on_arrival jumbo-tent-pitches large-51-200-pitches launderette leisuretheme-park-nearby motorcycle-friendly motorhome-service-point nearby-farmers-market on-site-restaurantcafe outdoor-pool-nearby outdoor-swimming-pool 
peaceful play-area public-telephone rallies-welcome recycling-available restaurant-nearby shop-nearby shower-available take-away tennis-nearby toilet-block tv-room washing-up-area water-hook-up-points-for-tourers wifi \\nOur campsite is a few minutes from Thiviers, the capital of foie gras and truffles, so make sure you taste the exquisite local cuisine.\\r\\n\\r\\nA must visit is the Grottes de Villars (10km), a cave with 17,000-year-old prehistoric paintings, stalactites and other cave formations. \\r\\n\\r\\nKids will love the Chateau de Puyguilhem (11km), a sixteenth century castle listed as a historic monument by UNESCO.\\r\\n\\r\\nOr go to one of the many summer concerts in the ruins of Abbey de Boschaud, a twelfth century Cistercian monastery.\\n\\n    Barbecues allowed\\n\\n    Games room\\n\\n    Water hook-up points for tourers\\n\\n    On-site restaurant/cafe\\n\\n    Drainage hook-up points for tourers\\n\\n    Electrical hook-up points for tourers\\n\\n    Shower available\\n\\n    Cycling nearby\\n\\n    Nearby farmers' market\\n\\n    Fishing\\n\\n    Leisure/theme park nearby\\n\\n    Canoeing/kayaking nearby\\n\\n    Take away\\n\\n    Cycle hire nearby\\n\\n    Dogs allowed all year\\n\\n    Recycling available\\n\\n    Car parking by pitch\\n\\n    Launderette\\n\\n    Bar or club house\\n\\n    Public telephone\\n\\n    Horse riding nearby\\n\\n    Outdoor swimming pool\\n\\n    Play area\\n\\n    Toilet block\\n\\n    TV room\\n\\n    Washing-up area\\n\\n    Bar nearby\\n\\n    Outdoor pool nearby\\n\\n    Restaurant nearby\\n\\n    Shop nearby\\n\\n    Tennis nearby\\n\\n    Wifi\\n\\n    Hard standings\\n\\n    Motorhome service point\\n\\n    Disabled facilities\\n\\n    Dogs allowed\\n\\n    Family friendly\\n\\n    Jumbo tent pitches\\n\\n    Free wifi\\n\\n    Large (51-200 pitches)\\n\\n    Rallies welcome\\n\\n    Motorcycle friendly\\n\\n    Peaceful\\n\\n    Gastronomic 
delight\\n","has_primary_photo":true,"django_ct":"campsite.campsite","facilities":["barbecues-allowed","games-room","water-hook-up-points-for-tourers","on-site-restaurantcafe","drainage-hook-up-points-for-tourers","electrical-hook-up-points-for-tourers","shower-available","cycling-nearby","nearby-farmers-market","fishing","leisuretheme-park-nearby","canoeingkayaking-nearby","take-away","cycle-hire-nearby","dogs-allowed-all-year","recycling-available","car-parking-by-pitch","launderette","bar-or-club-house","public-telephone","horse-riding-nearby","outdoor-swimming-pool","play-area","toilet-block","tv-room","washing-up-area","bar-nearby","outdoor-pool-nearby","restaurant-nearby","shop-nearby","tennis-nearby","wifi","hard-standings","motorhome-service-point","disabled-facilities","dogs-allowed","family-friendly","jumbo-tent-pitches","free-wifi","large-51-200-pitches","rallies-welcome","motorcycle-friendly","peaceful","gastronomic-delight"],"lead_price_one_night":0.0,"hierarchy_tree_exact":["france/aquitaine/dordogne/thiviers/","france/aquitaine/dordogne/","france/aquitaine/","france/"],"id":"campsite.campsite.12273","bookable":true,"category":["tent-pitches","touring-pitches","motorhomes"],"django_id":"12273","content_autocomplete_text":"\\n{\\n    \\"value\\": \\"Camping le Repaire, Thiviers\\",\\n    \\"tokens\\": [\\"Thiviers\\", \\"france/aquitaine/dordogne/thiviers/camping\\\\u002Dle\\\\u002Drepaire\\"],\\n    \\"categories\\": [\\n                    {\\"sprite_class\\": \\"tents\\", \\"id\\": 4, \\"name\\": \\"Tent pitches\\"}\\n                  \\n                    ,{\\"sprite_class\\": \\"tourers\\", \\"id\\": 3, \\"name\\": \\"Touring pitches\\"}\\n                  \\n                    ,{\\"sprite_class\\": \\"motorhomes\\", \\"id\\": 10, \\"name\\": \\"Motorhomes\\"}\\n                  ],\\n    \\"name\\": \\"Camping le Repaire, Thiviers\\",\\n    \\"url\\": \\"france/aquitaine/dordogne/thiviers/camping\\\\u002Dle\\\\u002Drepaire\\",\\n    
\\"bookable\\": true,\\n    \\"thumb\\": \\"https://media.pitchup.co.uk/images/2/image/upload/t_thumb_v2/v1383148631/camping\\\\u002Dle\\\\u002Drepaire/camping\\\\u002Dle\\\\u002Drepaire\\\\u002Dthe\\\\u002Dpond.jpg\\"\\n}","hierarchy0_exact":"france/","expected_value":6.488E-5,"hosted_online_booking":true,"category_ids":"4,3,10","campsite_id":12273,"parent_hierarchy_name":"Thiviers","content_autocomplete":"Camping le Repaire","rate_count":0,"hierarchy3_exact":"france/aquitaine/dordogne/thiviers/","name_sortable":"CampingleRepaire","available_pitches_tents":false,"category_exact":["tent-pitches","touring-pitches","motorhomes"],"available_pitches_lodges":false,"hierarchy1":"france/aquitaine/","enable_availability":true,"has_tagged_url":false,"available_to_search":true,"path":"france/aquitaine/dordogne/thiviers/camping-le-repaire","available_pitches_motorhomes":false,"hierarchy2_exact":"france/aquitaine/dordogne/","facet_keywords":["is_balance_of_payment_taken_on_arrival","barbecues-allowed","aa-pennant-none","nearby-farmers-market","beach-excellent-water-quality","fishing","recycling-available","outdoor-swimming-pool","tv-room","outdoor-pool-nearby","washing-up-area","dogs-allowed","car-parking-by-pitch","play-area","leisuretheme-park-nearby","public-telephone","on-site-restaurantcafe","peaceful","large-51-200-pitches","tennis-nearby","canoeingkayaking-nearby","cycle-hire-nearby","jumbo-tent-pitches","horse-riding-nearby","cycling-nearby","hard-standings","drainage-hook-up-points-for-tourers","take-away","free-wifi","shower-available","water-hook-up-points-for-tourers","dogs-allowed-all-year","gastronomic-delight","shop-nearby","bar-nearby","beaches","disabled-facilities","electrical-hook-up-points-for-tourers","motorhome-service-point","restaurant-nearby","wifi","bar-or-club-house","book-pitchup","rallies-welcome","family-friendly","launderette","motorcycle-friendly","english","games-room","toilet-block"],"facet_keywords_exact":["is_balance_of_payment_taken_on_arriv
al","barbecues-allowed","aa-pennant-none","nearby-farmers-market","beach-excellent-water-quality","fishing","recycling-available","outdoor-swimming-pool","tv-room","outdoor-pool-nearby","washing-up-area","dogs-allowed","car-parking-by-pitch","play-area","leisuretheme-park-nearby","public-telephone","on-site-restaurantcafe","peaceful","large-51-200-pitches","tennis-nearby","canoeingkayaking-nearby","cycle-hire-nearby","jumbo-tent-pitches","horse-riding-nearby","cycling-nearby","hard-standings","drainage-hook-up-points-for-tourers","take-away","free-wifi","shower-available","water-hook-up-points-for-tourers","dogs-allowed-all-year","gastronomic-delight","shop-nearby","bar-nearby","beaches","disabled-facilities","electrical-hook-up-points-for-tourers","motorhome-service-point","restaurant-nearby","wifi","bar-or-club-house","book-pitchup","rallies-welcome","family-friendly","launderette","motorcycle-friendly","english","games-room","toilet-block"],"name":"Camping le Repaire","hierarchy1_exact":"france/aquitaine/","available_pitches_caravans":false,"hierarchy_tree":["france/aquitaine/dordogne/thiviers/","france/aquitaine/dordogne/","france/aquitaine/","france/"],"available_pitches_rent_a_tent":false,"has_availability":true,"pitchtype_ids":[3915],"hierarchy0":"france/","hierarchy3":"france/aquitaine/dordogne/thiviers/","hierarchy2":"france/aquitaine/dordogne/","available_pitches_tourers":false,"last_booking_time":"2014-03-08T09:14:53.327Z","api_type":-1,"_version_":1489653176005033984,"score":1.0}]},"error":{"msg":"undefined field: \\"hierarchy4\\"","code":400}}\n'

I am getting a TypeError message when running "rebuild_index" with django-haystack

I am getting this error:

File "/Users/aaron/Documents/virtualenvs/three/lib/python3.4/site-packages/pysolr.py", line 467, in _scrape_response
    full_html = full_html.replace('\n', '')
TypeError: expected bytes, bytearray or buffer compatible object

I saw this same issue in issue #102. Has this been resolved? Or should I change some of my configuration?

I am running:

  • Python 3.4.0
  • Django 1.6.2
  • Solr 4.8.1
  • pysolr 3.2.0

Thank you,
Aaron

Updating an existing document

I don't have much experience with Solr, so I might be missing something obvious; sorry in advance if that's the case.

I'm just trying to do the following thing:

  • adding a document
  • when the document is modified, read the dirty fields (with Django) and generate another dictionary to update the existing document.

However, the problem is that my document always gets completely overwritten and I lose the previous fields.
I'm basically doing something like this:

    data = ProductSerializer(self).data
    if fields is not None:
        data = {key: data[key] for key in fields}
        data['id'] = self.pk

    solr_connection().add([data])

I found that "overwrite" is true by default and I think I need to set it to False, but there is no way to do that with pysolr from what I can see.
Any ideas?
thanks
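For what it's worth, Solr itself supports atomic updates, where only the supplied fields are modified: each changed field value is wrapped in a modifier dict such as {"set": ...}. Whether your pysolr version exposes this directly varies (recent releases have a fieldUpdates argument on add); as a sketch under that assumption, this builds the atomic-update document that would need to reach Solr:

```python
def build_atomic_update(doc_id, dirty_fields):
    """Wrap each changed field in a Solr 'set' modifier so fields
    already stored on the document are left untouched."""
    update = {"id": doc_id}
    for field, value in dirty_fields.items():
        update[field] = {"set": value}
    return update

# e.g. build_atomic_update(42, {"name": "new"})
# -> {"id": 42, "name": {"set": "new"}}
```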

Documentation for running tests is too complex

I reduced it for myself:

  1. Download and unpack Solr:
curl -O http://apache.osuosl.org/lucene/solr/4.2.0/solr-4.2.0.tgz
tar xvzf solr-4.2.0.tgz
cd solr-4.2.0/example
  2. Change solr.xml:
nano solr/solr.xml
  <cores adminPath="/admin/cores" defaultCoreName="core0" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="core0" instanceDir="collection1" />
  </cores>
  3. Add a request handler for /mlt to solrconfig.xml (this is missing from the README):
nano solr/collection1/conf/solrconfig.xml
  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  </requestHandler>
  4. Start Solr:
java -jar start.jar

We can get rid of step 2 if we change the Solr URL in the tests, but that breaks the tests for Solr versions prior to 4.0.

Maybe we should include the Solr configs for the tests in the repo and create a Makefile to set up Solr?
I think running the tests should be as simple as possible.

Add logging

From the original issue on Google Code by sjaday:

PySolr needs a basic logging mechanism to support debugging and performance
monitoring.
The other option is to raise explicit Exceptions for error conditions, so that
clients can log problems as needed, then provide debug level log messages to print
things like the created URL. I'll see if I can provide a patch.

Better support for arbitrary query params

If I just don't understand existing support for this I apologize in advance.

It looks like when a select query is formed, the only way to add additional parameters to your query is by using kwargs. This doesn't work for params with periods (`.`) in their names, because the Python interpreter treats the period as part of an expression rather than a keyword name. An example would be facet.field.

A simple modification to solve this might be to allow another argument to search() that is a dictionary of param/argument pairs. It could have a default value so that omitting it wouldn't raise an exception. This way arbitrary select params could be supplied, and if you don't get rid of **kwargs, existing code should keep working. The arbitrary param arguments might also need to be encoded as UTF-8, as you're currently doing with the query.

What do you think?
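As an aside, CPython already accepts non-identifier string keys when unpacking a dict into **kwargs, so dotted parameter names can be smuggled through without any API change. A sketch using a stand-in for search() (the real pysolr method talks to Solr, which this deliberately avoids):

```python
def search(q, **kwargs):
    # Stand-in for pysolr's Solr.search: collects extra query params.
    params = {"q": q}
    params.update(kwargs)
    return params

# Dict unpacking lets 'facet.field' through even though it is not a
# valid Python identifier, so it could never be a literal keyword.
result = search("*:*", **{"facet.field": "category", "facet": "true"})
```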

Use connection pooling (Session object in requests).

HTTP connections are expensive to create. For best performance using solr, you need to use the Keep-Alive header and connection pooling to keep http connections open. pysolr uses the requests library which includes support for connection pooling through urllib3 if you use a Session. When testing locally I found that without connection pooling consecutive requests all took around 1 second. With connection pooling the response time was barely larger than what solr reported as the query time.
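A minimal sketch of the idea using requests directly; the session keyword on pysolr's Solr constructor exists in recent releases, but check your installed version before relying on it:

```python
import requests
from requests.adapters import HTTPAdapter

# One Session reuses TCP connections across requests via urllib3's
# connection pool, avoiding a fresh handshake per query.
session = requests.Session()
session.mount("http://", HTTPAdapter(pool_connections=10, pool_maxsize=10))

# Hypothetical usage, assuming a pysolr release with the session kwarg:
# solr = pysolr.Solr("http://localhost:8983/solr/core0", session=session)
```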

Extract method comes with no specified headers

I may be being silly, or maybe I just don't understand the implementation, but it seems the extract method does not pass proper parameters to _send_request.

No headers are given. With the current _send_request implementation, when no headers are provided, it falls back to application/xml:

if not 'content-type' in [key.lower() for key in headers.keys()]:
    headers['Content-type'] = 'application/xml; charset=UTF-8'

Using pysolr with django-haystack, I was unable to correctly implement the extract_contents_file method from SolrBackend because of this. Every submitted file was treated as XML instead of the content type detected by Tika, resulting in a continuous ParseError...

Commenting out that part of the code solves my case, but I'm sure there is a better option.

Do you have any clue?

Dead repo?

Pull requests stacking up a bit here. Anyone still active?

UnicodeDecodeError

On line 468 in pysolr.py, the full_html needs to be decoded with utf-8 in order to handle the Swedish language.

Fix: full_html = full_html.decode('utf-8')

See comment in this commit: 2ea84356
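A slightly more defensive variant of the proposed fix, a sketch assuming full_html can arrive as either bytes or str depending on the code path:

```python
def ensure_text(full_html):
    # Decode bytes as UTF-8, replacing undecodable sequences rather
    # than raising UnicodeDecodeError, and pass str through untouched.
    if isinstance(full_html, bytes):
        return full_html.decode("utf-8", errors="replace")
    return full_html
```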

safe_urlencode() is not as Unicode safe as it claims to be

The safe_urlencode() is intended to provide a Unicode-friendly urlencode function on top of urllib's url_encode() function. The implementation of safe_urlencode() contains this snippet:

if isinstance(v, basestring):
    new_params.append((k, v.encode("utf-8")))
elif isinstance(v, (list, tuple)):
    new_params.append((k, [i.encode("utf-8") for i in v]))
else:
    new_params.append((k, unicode(v)))
  • In the first case, a byte string in UTF-8 encoding is appended to the list.
  • In the second case a list of byte strings in UTF-8 encoding is appended to the list.
  • In the third case (the else clause), a unicode string is appended without being encoded to bytes.

The third case is incorrect, since this is exactly what urllib's urlencode() function chokes on – and the sole reason safe_urlencode() exists in the first place:

>>> urllib.urlencode([('ellipsis', u'\u2026')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib.py", line 1269, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 0: ordinal not in range(128)

In contrast, encoding the unicode string into UTF-8 bytes produces the correct output (0xE2 0x80 0xA6 is the UTF-8 representation of U+2026 HORIZONTAL ELLIPSIS):

>>> urllib.urlencode([('ellipsis', u'\u2026'.encode('UTF-8'))])
'ellipsis=%E2%80%A6'

The else clause should therefore read:

else:
    new_params.append((k, unicode(v).encode('UTF-8')))

Note: this bug is only triggered for non-string objects (since those are handled in the first "if") with a unicode() result yielding characters outside the ASCII range.

(Oh, and while we're at it: params.items() near the top of the function should be params.iteritems() for more efficient looping.)
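On Python 3 this whole helper becomes unnecessary: urllib.parse.urlencode percent-encodes the UTF-8 bytes of non-ASCII values by default. For illustration:

```python
from urllib.parse import urlencode

# Python 3 percent-encodes the UTF-8 bytes of U+2026 directly.
encoded = urlencode([("ellipsis", "\u2026")])
# -> 'ellipsis=%E2%80%A6'
```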

extract could use ExtractingRequestHandler better

The comments for the "extract" method state that ExtractingRequestHandler "is rarely useful as it allows no way to store additional data or otherwise customize the record. Instead, by default we'll use the extract-only mode..."

This is out of date (if it was ever true): the documentation at http://wiki.apache.org/solr/ExtractingRequestHandler clearly states that you can submit additional fields by prefixing their names with "literal.".

The current implementation leads to odd, inefficient code that first sends the document, then re-posts the result to get the extracted text into the index.

Am I missing something?

Now that extract allows for additional keyword arguments, is it just a case of updating the docstring, or could a better interface be made?
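To illustrate what the wiki page describes: the extra stored fields travel as literal.-prefixed request parameters alongside the file upload. A sketch of building them (the parameter names follow the Solr docs; the helper function itself is hypothetical):

```python
def extracting_params(fields, extract_only=False):
    # Map stored-field values to ExtractingRequestHandler's
    # 'literal.<fieldname>' request parameters.
    params = {"literal.%s" % name: value for name, value in fields.items()}
    if extract_only:
        params["extractOnly"] = "true"
    return params
```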

Long queries are broken.

Need to flip body = urllib.urlencode(params, False) to body = urllib.urlencode(params, True) on line 297-ish.

Add more background to run tests.

Hi,
I want to submit a patch to pysolr, but I'm experiencing some issues when running the test suite.
To run the test suite I'm using:

$ python -m unittest tests
[...]
SolrError: [Reason: ERROR:unknown field 'price']
----------------------------------------------------------------------
Ran 36 tests in 1.481s

The tests are failing because of an unknown field 'price'.
Can you tell me which schema.xml you are using for the tests, as well as which solrconfig.xml and Solr version(s)?
Regards,

ElementTree using ascii encoding

I am trying to add unicode data to Solr and having some strange errors returned by ElementTree (as utilized by pysolr). The traceback:

File "test.py", line 44, in
conn.add(docs)
File "/usr/local/lib/python2.6/dist-packages/pysolr-2.0.11-py2.6.egg/pysolr.py", line 394, in add
m = ET.tostring(message)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1009, in tostring
ElementTree(element).write(file, encoding)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 663, in write
self._write(file, self._root, encoding, {})
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
self._write(file, n, encoding, namespaces)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
self._write(file, n, encoding, namespaces)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 705, in _write
file.write(_escape_cdata(node.text, encoding))
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 807, in _escape_cdata
return _encode_entity(text)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 792, in _encode_entity
return _encode(pattern.sub(escape_entities, text), "ascii")
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 751, in _encode
return s.encode(encoding)

UnicodeEncodeError: 'ascii' codec can't encode character u'\U000b78b2' in position 76: ordinal not in range(128)

I was surprised to see ElementTree converting unicode to ascii here (doesn't Solr accept Unicode?). The simplest solution was to modify pysolr at line 394 by adding an encoding.

m = ET.tostring(message, encoding='utf-8')

Am I missing something obvious here? I can't be the first person to try to use unicode....

-b
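The proposed fix matches how ElementTree behaves: with an explicit encoding argument, tostring serializes to bytes in that encoding instead of attempting an ASCII encode of the text nodes. A quick illustration (Python 3 syntax; the original report is against Python 2.6, where the default was ASCII):

```python
import xml.etree.ElementTree as ET

field = ET.Element("field")
field.text = "\u00e9t\u00e9"  # non-ASCII text ("été")

# Serializes to UTF-8 bytes (with an XML declaration) rather than
# failing on characters outside the ASCII range.
data = ET.tostring(field, encoding="utf-8")
```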

Error when running tests with Solr 4.3.1

$ nosetests tests                                      
..........................E...........
======================================================================
ERROR: test_extract (tests.SolrTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/alexk/projects/pysolr/tests/client.py", line 425, in test_extract
    extracted = self.solr.extract(fake_f)
  File "/home/alexk/projects/pysolr/pysolr.py", line 871, in extract
    files={'file': (file_obj.name, file_obj)})
  File "/home/alexk/projects/pysolr/pysolr.py", line 296, in _send_request
    raise SolrError(error_message)
SolrError: [Reason: None]
<response><lst name="responseHeader"><int name="status">400</int><int name="QTime">10</int></lst><lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></lst></response>

UnicodeDecodeError for JSON in Solr._scrape_response

Hello,

I'm getting UnicodeDecodeError's originating from the _scrape_response method of the Solr class. The line which causes the error can be found here: https://github.com/toastdriven/pysolr/blob/fee48b6401664c746d24f0abd9955761f5d362dd/pysolr.py#L465

My response is in JSON, so it can't be decoded as XML and falls into the except. It seems this problem only occurred when Solr returns an error. In my case, Solr returns a 500 error, but that fact is hidden by pysolr this way. This is the last part of my stacktrace:

  File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 578, in search
    response = self._select(params)
  File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 315, in _select
    return self._send_request('post', path, body=params_encoded, headers=headers)
  File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 293, in _send_request
    error_message = self._extract_error(resp)
  File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 372, in _extract_error
    reason, full_html = self._scrape_response(resp.headers, resp.content)
  File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 440, in _scrape_response
    full_html = "%s" % response

To be honest, I don't know why that line is even trying to transform it into a bytestring, because the response is already a bytestring as far as I can see. All I know is that it originated in 2ea8435.

pysolr should support custom handlers

Solr allows the creation of various handlers for different purposes in solrconfig.xml; so one could have a /solr/select handler for default searching and a /solr/articlesearch handler for searching with different filters and settings.

In pysolr, however, only the default handlers (/solr/select, /solr/update, etc.) can be used.

Multiple facet.query

When I want to send a request with multiple facet.query parameters, like:

solr/select/?q=*:*&facet=true&facet.query=ONE_FACET&facet.query=TWO_FACET&facet.query=ANOTHER_FACET&facet.field=USR&rows=0&start=0&facet.mincount=1000

I can't do this with this lib, because I can't add two 'facet.query' keys to the params dict. And 'facet.query': 'ONE_FACET&facet.query=TWO_FACET' does not work either, of course, because urllib.urlencode escapes all the '&' characters.
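The standard way to encode a repeated parameter is to pass a list as the value and let the encoder emit one key=value pair per element; a minimal illustration with the stdlib encoder (shown with Python 3's urllib.parse for clarity):

```python
from urllib.parse import urlencode

# doseq=True expands a list value into one key=value pair per element,
# instead of escaping the '&' characters in a single joined string.
params = {
    "q": "*:*",
    "facet": "true",
    "facet.query": ["ONE_FACET", "TWO_FACET", "ANOTHER_FACET"],
}
query_string = urlencode(params, doseq=True)
```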

socket timeout exception handler: TypeError: not enough arguments for format string

conn.optimize()
  File "C:\tools\Python27\lib\site-packages\pysolr.py", line 825, in optimize
    return self._update(msg, waitFlush=waitFlush, waitSearcher=waitSearcher)
  File "C:\tools\Python27\lib\site-packages\pysolr.py", line 359, in _update
    return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
  File "C:\tools\Python27\lib\site-packages\pysolr.py", line 278, in _send_request
    raise SolrError(error_message % [url, self.timeout])
TypeError: not enough arguments for format string

Bug fix: line 278 of pysolr.py should be changed as follows:
patch here:
-raise SolrError(error_message % [url, self.timeout])
+raise SolrError(error_message % (url, self.timeout))
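The root cause is that %-formatting treats a list as a single argument, while a tuple supplies one value per placeholder; a minimal demonstration (URL and timeout values are illustrative):

```python
url = "http://localhost:8983/solr"
timeout = 10
error_message = "Connection to server %s timed out: %s seconds."

# A list counts as ONE argument for the first %s, leaving the second
# placeholder unfilled; a tuple supplies one value per placeholder.
msg = error_message % (url, timeout)
```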

Field datatype options?

I want to tell Solr not to index a field. Are you planning to add this functionality to pysolr?

Something like being able to pass in a dict of field type options (in my case, indexed=False)? Or adding them as attributes on a field object?

Error in "extract" method, if spaces or special characters in the filename of the provided file object

Hi all,

pysolr is a good library for communicating between Python and a Solr server! But I found a little bug in the "extract" method:
if there are spaces or special characters in the filename of the provided file object, the following error occurs:
pysolr ERROR [Reason: None]
{"responseHeader":{"status":400,"QTime":1},"error":{"msg":"missing content stream","code":400}}
pysolr ERROR Failed to extract document metadata: [Reason: None]
{"responseHeader":{"status":400,"QTime":11},"error":{"msg":"missing content stream","code":400}}
Traceback (most recent call last):
  File "pysolr.py", line 905, in extract
    files={'file': (file_obj.name, file_obj)})
  File "/home/cw/Aptana Studio 3 Workspace/arcdoc/pysolr.py", line 321, in _send_request
    raise SolrError(error_message)
SolrError: [Reason: None]
{"responseHeader":{"status":400,"QTime":11},"error":{"msg":"missing content stream","code":400}}

To fix this bug, I added the following lines:

in line 891: filename = urllib.quote(file_obj.name)

in line 897: files={'file': (filename, file_obj)})

Here is an overview of the extract method with the bug fixed:
def extract(self, file_obj, extractOnly=True, **kwargs):
    """
    POSTs a file to the Solr ExtractingRequestHandler so rich content can
    be processed using Apache Tika. See the Solr wiki for details:

        http://wiki.apache.org/solr/ExtractingRequestHandler

    The ExtractingRequestHandler has a very simple model: it extracts
    contents and metadata from the uploaded file and inserts it directly
    into the index. This is rarely useful as it allows no way to store
    additional data or otherwise customize the record. Instead, by default
    we'll use the extract-only mode to extract the data without indexing it
    so the caller has the opportunity to process it as appropriate; call
    with ``extractOnly=False`` if you want to insert with no additional
    processing.

    Returns None if metadata cannot be extracted; otherwise returns a
    dictionary containing at least two keys:

        :contents:
                    Extracted full-text content, if applicable
        :metadata:
                    key:value pairs of text strings
    """
    if not hasattr(file_obj, "name"):
        raise ValueError("extract() requires file-like objects which have a defined name property")

    params = {
        "extractOnly": "true" if extractOnly else "false",
        "lowernames": "true",
        "wt": "json",
    }
    params.update(kwargs)
    filename = urllib.quote(file_obj.name)
    try:
        # We'll provide the file using its true name as Tika may use that
        # as a file type hint:
        resp = self._send_request('post', 'update/extract',
                                  body=params,
                                  files={'file': (filename, file_obj)})
    except (IOError, SolrError) as err:
        self.log.error("Failed to extract document metadata: %s", err,
                       exc_info=True)
        raise

    try:
        data = json.loads(resp)
    except ValueError as err:
        self.log.error("Failed to load JSON response: %s", err,
                       exc_info=True)
        raise

    data['contents'] = data.pop(filename, None)
    data['metadata'] = metadata = {}

    raw_metadata = data.pop("%s_metadata" % filename, None)

    if raw_metadata:
        # The raw format is somewhat annoying: it's a flat list of
        # alternating keys and value lists
        while raw_metadata:
            metadata[raw_metadata.pop()] = raw_metadata.pop()

    return data

Please let me know if this bug will be fixed in the next release!
Thanks!

Long queries crash

For very long queries I get this:

pysolr.pyc in _select(self, params)
    240         else:
    241             # Handles very long queries by submitting as a POST.

--> 242             path = '%s/select/?%s' % (self.path,)
    243             headers = {
    244                 'Content-type': 'application/x-www-form-urlencoded',

TypeError: not enough arguments for format string

Unexpected behavior of boosting with multivalued fields

The boost keyword argument behaves in a strange way when used in combination with multivalued fields.

solr = Solr(solr_url)
solr.add([{'foo':'bar', 'bar':['foo', 'baz']}], boost={'foo':5})

Resulting message:

<add>
  <doc>
    <field boost="5" name="foo">bar</field>
    <field boost="5" name="bar">foo</field>
    <field name="bar">baz</field>
  </doc>
</add>

Is this the expected behavior?

Latest lxml.etree.tostring throws exception when using 'encoding' keyword arg positionally

lxml versions 2.0 and greater require keyword arguments to be used explicitly in some methods, one of which is lxml.etree.tostring. Currently pysolr.py (2.0.13) calls this function with the encoding parameter as a positional argument, which works fine with older versions of lxml. By passing encoding explicitly as a keyword, pysolr will work with later versions of lxml as well.

unicode_literals does not work with lxml

When I want to add documents through pysolr with lxml installed I get this error:

TypeError: Element() keywords must be strings

This is due to the unicode_literals import that convert the keys of the attrs dictionary to unicode and lxml (2.3.2) does not like unicode keys. The xml.etree version seems to work.

pysolr doesn't support blocking until index changes are flushed on add/delete call

As indicated by http://wiki.apache.org/solr/UpdateXmlMessages , commit takes an optional waitFlush attribute which blocks until index changes have been flushed to disk. This attribute can be passed in the querystring just like commit can. pysolr should support this attribute as a kwarg to its add and delete functions.

This feature is particularly useful for unit tests which perform large updates and then run searches based on those updates.
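One way such kwargs could be folded into the update querystring; this is a hypothetical sketch, not pysolr's actual implementation:

```python
from urllib.parse import urlencode

def update_querystring(commit=True, **kwargs):
    # Hypothetical sketch: forward extra kwargs such as waitFlush into
    # the update handler's querystring alongside commit, lowercasing
    # booleans to match Solr's 'true'/'false' convention.
    params = {"commit": "true" if commit else "false"}
    for key, value in kwargs.items():
        if isinstance(value, bool):
            params[key] = "true" if value else "false"
        else:
            params[key] = str(value)
    return "update/?" + urlencode(sorted(params.items()))
```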

in case of solr error in python 3.4, pysolr tries to replace strings in bytes type

I've tried to index a list of values in a field that wasn't set as multiValued.
On parsing the Solr error, pysolr does this:

full_html = full_html.replace('\n', '')

but full_html is bytes, so it fails.

File "../venv/lib/python3.4/site-packages/pysolr.py", line 467, in _scrape_response
full_html = full_html.replace('\n', '')
TypeError: expected bytes, bytearray or buffer compatible object
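The fix is to use bytes arguments when the response body is bytes; a minimal demonstration (the HTML content is illustrative):

```python
full_html = b"<html>\n<body>error 500</body>\n</html>"

# On Python 3, bytes.replace requires bytes arguments; a str argument
# raises the TypeError shown in the traceback above.
cleaned = full_html.replace(b"\n", b"")
```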

add() should not commit by default

In Solr, a "commit" is a heavy duty operation and shouldn't be taken lightly. A Solr client API like PySolr should NOT tell Solr to commit without the user code's deliberate request for it.

Of course, PySolr has been around for a while now and you can't simply change it without breaking user expectations; I'm not sure what the solution is.

Content-Type header is bytes - requests wants str

The content type provided to the requests library is b'Content-type' -- requests wants a str. The POST goes out with multiple content type headers as a result:

POST /solr/update/?commit=true HTTP/1.1
Host: localhost:8983
Accept-Encoding: gzip, deflate, compress
Content-Type: application/x-www-form-urlencoded
Content-type: text/xml; charset=utf-8
Accept: */*
Content-Length: 89
User-Agent: python-requests/1.1.0 CPython/3.3.0 Linux/3.4.9

If Solr sees the text/xml header first, the document gets indexed. If x-www-form-urlencoded wins, nothing is indexed. Solr reports {"responseHeader":{"status":0,"QTime":67}} in both cases.

Applying this patch provides str headers instead, correcting the issue for me: https://gist.github.com/4683244
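The duplicate arises because a bytes key is a distinct dictionary key from a str key, so it cannot replace a default header stored under a str key; a minimal illustration (assuming, as a simplification, that the defaults live in a plain dict):

```python
defaults = {"Content-Type": "application/x-www-form-urlencoded"}

# A bytes key never collides with the str default, so after merging
# both headers survive and both are sent on the wire.
merged = dict(defaults)
merged[b"Content-type"] = "text/xml; charset=utf-8"
```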

charset=utf-8

When I add documents in Russian I get strings like "С�С, лв�". To make pysolr2 work correctly I changed line 272:

return self._send_request('POST', path, message, {'Content-type': 'text/xml; charset=utf-8'})

Solr Authentication

First time submitting a feature request on github! Hopefully this is the right place...

My solr service comes with a username and password. It would be cool to be able to take advantage of it. I'd try and modify the source myself but I've only been programming for 6 months and haven't the slightest clue where I'd begin!

Thanks!
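For HTTP Basic auth specifically, the credentials end up as a base64-encoded Authorization header; a self-contained sketch of what a client library would send (the helper name is hypothetical):

```python
import base64

def basic_auth_header(username, password):
    # Build an HTTP Basic Authorization header by hand; HTTP client
    # libraries such as requests do the same thing internally.
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}
```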

commitWithin gives serializing error

An extract from our error logs (running under mod_wsgi):

    solr.add([data], commit=None, commitWithin=60 * 1000)
  File "/opt/s4uadmin/eggs/pysolr-2.0.15-py2.7.egg/pysolr.py", line 686, in add
    m = ET.tostring(message, encoding='utf-8')
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1121, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 927, in _serialize_xml
    v = _escape_attrib(v, encoding)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1087, in _escape_attrib
    _raise_serialization_error(text)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1047, in _raise_serialization_error
    "cannot serialize %r (type %s)" % (text, type(text).__name__)
TypeError: cannot serialize 60000 (type int)
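ElementTree can only serialize string attribute values, so converting the integer with str() before building the element avoids the error; a minimal demonstration:

```python
import xml.etree.ElementTree as ET

# XML attribute values must be strings: passing the int 60000 directly
# raises "TypeError: cannot serialize 60000 (type int)" at tostring time.
message = ET.Element("add", {"commitWithin": str(60 * 1000)})
xml_bytes = ET.tostring(message, encoding="utf-8")
```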
