Pysolr — Python Solr client
License: BSD 3-Clause "New" or "Revised" License
I want to tell Solr to not index a field. Are you planning to add this functionality to pysolr?
Something like being able to pass in a dict of field type options (in my case, indexed=False)? Or adding them as attributes on a field object?
This is my first time submitting a feature request on GitHub! Hopefully this is the right place...
My Solr service comes with a username and password. It would be cool to be able to take advantage of that. I'd try to modify the source myself, but I've only been programming for six months and haven't the slightest clue where I'd begin!
Thanks!
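For anyone hitting this, the feature would boil down to attaching an HTTP Basic Authorization header to every request pysolr sends. A stdlib sketch of that header follows; `basic_auth_header` is a hypothetical helper, not part of pysolr, and where to hook it in (e.g. the headers passed to `_send_request`) is only a suggestion:

```python
import base64

def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header value that a
    password-protected Solr instance expects."""
    token = base64.b64encode("{0}:{1}".format(username, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")

# The resulting header would then be merged into every request pysolr
# sends, e.g. alongside the Content-type header it already sets.
header = basic_auth_header("solr", "secret")
```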
The comments for the "extract" method state that ExtractingRequestHandler "is rarely useful as it allows no way to store additional data or otherwise customize the record. Instead, by default we'll use the extract-only mode..."
This is out of date (if it was ever true), as the documentation at http://wiki.apache.org/solr/ExtractingRequestHandler clearly states that you can submit additional fields by prefixing their names with "literal.".
The current implementation leads to odd, inefficient code that first sends the document, then re-posts the result to get the extracted text into the index.
Am I missing something?
Now that extract allows for additional keyword arguments, is it just a case of updating the docstring, or could a better interface be made?
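Since extract() forwards extra keyword arguments as request parameters (per the comment above), the dotted literal.* names can be supplied via dict unpacking. A sketch, where `literal_params` is a hypothetical helper and the commented-out call assumes a running Solr:

```python
def literal_params(fields):
    """Prefix each field name with 'literal.' so the
    ExtractingRequestHandler stores it alongside the extracted text."""
    return {"literal." + name: value for name, value in fields.items()}

params = literal_params({"id": "doc-1", "title": "Quarterly report"})
# solr.extract(file_obj, extractOnly=False, **params)  # hypothetical usage
```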
As indicated by http://wiki.apache.org/solr/UpdateXmlMessages , commit takes an optional waitFlush attribute which will block until index changes have been flushed to disk. This attribute can be passed in the querystring just like commit can. Pysolr should support this attribute as a kwarg to its add and delete functions.
This feature is particularly useful for unit tests which include large updates and then perform searches on based on those updates.
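A hedged sketch of what that could look like on the wire, using only the stdlib (parameter names taken from the UpdateXmlMessages page above; whether pysolr forwards them as kwargs is the feature being requested):

```python
from urllib.parse import urlencode  # Python 3 spelling of urllib.urlencode

# Hypothetical: if add()/delete() grew waitFlush/waitSearcher kwargs, they
# would simply be forwarded as query-string parameters on the update URL.
params = {"commit": "true", "waitFlush": "true", "waitSearcher": "true"}
query = urlencode(sorted(params.items()))
update_url = "solr/update?" + query
```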
When I want to add documents through pysolr with lxml installed I get this error:
TypeError: Element() keywords must be strings
This is due to the unicode_literals import that convert the keys of the attrs
dictionary to unicode and lxml (2.3.2) does not like unicode keys. The xml.etree
version seems to work.
I don't have much experience with Solr, so I might be missing something obvious; sorry in advance if that's the case.
I'm just trying to do the following thing:
However, the problem is that my document always gets completely overwritten and I lose the previous fields.
I'm basically doing something like this:
data = ProductSerializer(self).data
if fields is not None:
    data = {key: data[key] for key in fields}
data['id'] = self.pk
solr_connection().add([data])
I found that "overwrite" is true by default and I think I need to set it to False, but there is no way to do that with pysolr from what I can see...
Any ideas?
Thanks
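For what it's worth, Solr 4+ supports atomic updates, where changed fields are wrapped in update modifiers such as `set` so the rest of the stored document survives. Whether your pysolr version can send these varies by release, but the payload shape can be sketched with plain dicts (`atomic_update_doc` is a hypothetical helper):

```python
def atomic_update_doc(doc_id, changes):
    """Build a Solr 4+ atomic-update document: each changed field is
    wrapped in a {'set': value} modifier so untouched fields survive.
    Requires a uniqueKey field and an updateLog in the Solr config."""
    doc = {"id": doc_id}
    for field, value in changes.items():
        doc[field] = {"set": value}
    return doc

payload = atomic_update_doc(42, {"price": 9.99})
# solr_connection().add([payload])  # hypothetical usage against a live Solr
```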
I may be being silly, or maybe I just don't understand the implementation, but it seems like the extract method is not implemented in a way that gives proper parameters to _send_request.
No headers are given. With the current _send_request implementation, when no headers are provided, it uses application/xml instead:
if not 'content-type' in [key.lower() for key in headers.keys()]:
    headers['Content-type'] = 'application/xml; charset=UTF-8'
Using pysolr with django-haystack, I was unable to correctly implement the extract_contents_file method from SolrBackend due to this fact. Every submitted file was treated as XML instead of the ContentType found by Tika, resulting in a continuous ParseError...
Commenting out that part of the code solves my case, but I'm sure there may be a better option.
Do you have any clue?
In Solr, a "commit" is a heavy duty operation and shouldn't be taken lightly. A Solr client API like PySolr should NOT tell Solr to commit without the user code's deliberate request for it.
Of course, PySolr has been around for a while now and you can't simply change it without breaking user expectations; I'm not sure what the solution is.
I'm on Solr 4.2 and I'm observing a strange behavior when converting a multi-valued Solr field to native python type.
The issue seems to originate in the method _to_python (of the python class Solr), specifically line number 525-526:
if isinstance(value, (list, tuple)):
value = value[0]
Not sure why we just pick and return the first value from the list. In previous pysolr versions this worked fine, because we returned the entire "value".
Had this been a real issue, I don't think it would have remained undetected for this long, which makes me question what I'm doing wrong. Filing this issue on the off-chance there are others who are also experiencing it.
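A sketch of the elementwise behaviour the reporter expected, converting each member of a multi-valued field instead of collapsing to the first element (`to_python` is a hypothetical stand-in, not pysolr's actual `_to_python`):

```python
def to_python(value):
    """Convert a Solr field value to a native Python type, preserving
    multi-valued fields by converting each element recursively
    (hypothetical sketch of the behaviour described above)."""
    if isinstance(value, (list, tuple)):
        return [to_python(v) for v in value]
    if isinstance(value, str) and value.isdigit():
        return int(value)
    return value
```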
Looking at v3.1.0...master, it looks like we're ready for a bug-fix release, as other projects (e.g. django-haystack) would benefit from a PyPI release which includes @tongwang's fix for content extraction (#104) with recent versions of Solr.
@toastdriven @anti-social Anything else we might want to wait for?
I am getting this error:
File "/Users/aaron/Documents/virtualenvs/three/lib/python3.4/site-packages/pysolr.py", line 467, in _scrape_response
    full_html = full_html.replace('\n', '')
TypeError: expected bytes, bytearray or buffer compatible object
I saw this same issue in issue #102. Has this been resolved? Or should I change some of my configuration?
I am running:
Thank you,
Aaron
In notanumber's fork of haystack with field weight support, Django's manage.py rebuild_index fails when using pysolr 80ad7b7. There's an UnboundLocalError for the variable v on line 629.
Although the setup.py does not specify it, pysolr doesn't seem to run without simplejson installed.
As mentioned in the original issue on GoogleCode, the previous patch I submitted only worked for Jetty and had a fallback for other servers: it would display the full HTML message.
I'm working on properly handling Tomcat as well, and I'll submit a patch pretty soon.
Traceback (most recent call last):
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/management/commands/update_index.py", line 210, in handle_label
    self.update_backend(label, using)
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/management/commands/update_index.py", line 256, in update_backend
    do_update(backend, index, qs, start, end, total, self.verbosity)
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/management/commands/update_index.py", line 78, in do_update
    backend.update(index, current_qs)
  File "/storage/pydev/feelfree-v4/feelfree/../lib/haystack/backends/solr_backend.py", line 66, in update
    self.conn.add(docs, commit=commit, boost=index.get_field_weights())
  File "/storage/pydev/feelfree-v4/feelfree/../lib/pysolr.py", line 740, in add
    message.append(self._build_doc(doc, boost=boost))
  File "/storage/pydev/feelfree-v4/feelfree/../lib/pysolr.py", line 695, in _build_doc
    field = ET.Element('field', **attrs)
  File "lxml.etree.pyx", line 2560, in lxml.etree.Element (src/lxml/lxml.etree.c:52924)
TypeError: Element() keywords must be strings
attrs looks like {u'name': 'django_ct'}
env:
Possible fix (works for me)
@@ -687,10 +687,10 @@ class Solr(object):
                 if self._is_null_value(bit):
                     continue
-                attrs = {'name': key}
+                attrs = {str('name'): key}
                 if boost and key in boost:
-                    attrs['boost'] = force_unicode(boost[key])
+                    attrs[str('boost')] = force_unicode(boost[key])
                 field = ET.Element('field', **attrs)
                 field.text = self._from_python(bit)
When I want to send a request with multiple facet.query parameters, like:
solr/select/?q=*:*&facet=true&facet.query=ONE_FACET&facet.query=TWO_FACET&facet.query=ANOTHER_FACET&facet.field=USR&rows=0&start=0&facet.mincount=1000
I can't do this with this lib, because I can't add two 'facet.query' keys to the params dict. And 'facet.query': 'ONE_FACET&facet.query=TWO_FACET' doesn't work either, of course, because urllib.urlencode escapes all the '&' characters.
The error is as below... :(
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
import pysolr
from pysolr import Solr
conn = Solr('http://127.0.0.1:8983/solr/')
conn.delete(q='*:*')
Traceback (most recent call last):
File "", line 1, in
File ".\pysolr.py", line 780, in delete
return self._update(m, commit=commit, waitFlush=waitFlush, waitSearcher=waitSearcher)
File ".\pysolr.py", line 359, in _update
return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
File ".\pysolr.py", line 293, in _send_request
raise SolrError(error_message)
pysolr.SolrError: [Reason: Error 404 Not Found]
I tried to index a list of values in a field that wasn't set as multiValued.
When parsing a Solr error, pysolr does this:
full_html = full_html.replace('\n', '')
but full_html is bytes, so it fails.
File "../venv/lib/python3.4/site-packages/pysolr.py", line 467, in _scrape_response
full_html = full_html.replace('\n', '')
TypeError: expected bytes, bytearray or buffer compatible object
Traceback (most recent call last):
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django/core/management/base.py", line 222, in run_from_argv
self.execute(*args, **options.__dict__)
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django/core/management/base.py", line 255, in execute
output = self.handle(*args, **options)
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/management/commands/update_index.py", line 170, in handle
return super(Command, self).handle(*items, **options)
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django/core/management/base.py", line 355, in handle
label_output = self.handle_label(label, **options)
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/management/commands/update_index.py", line 229, in handle_label
do_update(self.backend, index, qs, start, end, total, self.verbosity)
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/management/commands/update_index.py", line 68, in do_update
backend.update(index, current_qs, commit=True)
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/django_haystack-2.0.0_facets-py2.7.egg/haystack/backends/solr_backend.py", line 70, in update
self.conn.add(docs, commit=commit, boost=index.get_field_weights())
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/pysolr.py", line 740, in add
message.append(self._build_doc(doc, boost=boost))
File "/Users/mike/Virtualenv/lib/python2.7/site-packages/pysolr.py", line 696, in _build_doc
field.text = self._from_python(bit)
File "lxml.etree.pyx", line 922, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:40737)
File "apihelpers.pxi", line 656, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18467)
File "apihelpers.pxi", line 1339, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24233)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
Currently it's not possible to do the following:
conn.search(q="*", fq="year:2009", fq="author:foo")
Workaround:
multidict['fq'] = " AND ".join(multidict.getall('fq'))
Need to flip body = urllib.urlencode(params, False)
to body = urllib.urlencode(params, True)
on line 297-ish.
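That flip corresponds to urlencode's doseq flag, which emits a repeated key for sequence values (shown here in its Python 3 spelling):

```python
from urllib.parse import urlencode  # urllib.urlencode in Python 2

params = {"q": "*:*", "fq": ["year:2009", "author:foo"]}

# doseq=False would url-quote the list's repr as a single value;
# doseq=True repeats the key, which is what Solr expects for
# multi-valued parameters such as fq or facet.query.
flat = urlencode(params, doseq=True)
```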
HTTP connections are expensive to create. For best performance with Solr, you need to use the Keep-Alive header and connection pooling to keep HTTP connections open. pysolr uses the requests library, which includes support for connection pooling through urllib3 if you use a Session. When testing locally, I found that without connection pooling consecutive requests all took around 1 second. With connection pooling, the response time was barely larger than what Solr reported as the query time.
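The keep-alive idea can be illustrated with a toy pool using only the stdlib; requests.Session/urllib3 implement the real thing (retries, thread safety, per-host limits), so this is only a sketch of the concept:

```python
import http.client

class TinyPool:
    """Toy illustration of connection reuse: one persistent
    HTTPConnection per (host, port) instead of a fresh TCP
    handshake for every request."""
    def __init__(self):
        self._conns = {}

    def get(self, host, port=8983):
        key = (host, port)
        if key not in self._conns:
            self._conns[key] = http.client.HTTPConnection(host, port)
        return self._conns[key]

pool = TinyPool()
# Repeated lookups reuse the same underlying connection object:
same = pool.get("localhost") is pool.get("localhost")
```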
$ nosetests tests
..........................E...........
======================================================================
ERROR: test_extract (tests.SolrTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alexk/projects/pysolr/tests/client.py", line 425, in test_extract
extracted = self.solr.extract(fake_f)
File "/home/alexk/projects/pysolr/pysolr.py", line 871, in extract
files={'file': (file_obj.name, file_obj)})
File "/home/alexk/projects/pysolr/pysolr.py", line 296, in _send_request
raise SolrError(error_message)
SolrError: [Reason: None]
<response><lst name="responseHeader"><int name="status">400</int><int name="QTime">10</int></lst><lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></ls$
></response>
From the original issue on Google Code by sjaday:
PySolr needs a basic logging mechanism to support debugging and performance
monitoring.
The other option is to raise explicit Exceptions for error conditions, so that
clients can log problems as needed, then provide debug level log messages to print
things like the created URL. I'll see if I can provide a patch.
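A sketch of what opting in could look like for client code, assuming pysolr logs through a module-level logger named "pysolr" (as current sources do):

```python
import logging

# Client code opts in to pysolr's diagnostics by attaching a handler
# to the library's logger and lowering its level.
handler = logging.StreamHandler()  # or any handler your app already uses
logger = logging.getLogger("pysolr")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)  # surfaces per-request debug messages
```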
I reduced it for myself:
curl -O http://apache.osuosl.org/lucene/solr/4.2.0/solr-4.2.0.tgz
tar xvzf solr-4.2.0.tgz
cd solr-4.2.0/example
nano solr/solr.xml
<cores adminPath="/admin/cores" defaultCoreName="core0" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
<core name="core0" instanceDir="collection1" />
</cores>
nano solr/collection1/conf/solrconfig.xml
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
java -jar start.jar
We could get rid of step 2 if we changed the Solr URL in the tests, but that breaks tests for Solr prior to 4.0.
Maybe we should include Solr configs for the tests in the repo and create a Makefile to set up Solr?
I think running the tests should be as simple as possible.
Would it make sense to add an escape function for raw data used in searches and include it in the lib itself?
Something like this:
def solr_escape(input):
    special_chars = ["\\", '+', '-', '&&', '||', '!', '(', ')', '{', '}', '[', ']',
                     '^', '"', '~', '*', '?', ':']
    for char in special_chars:
        if char in input:
            input = input.replace(char, '\\' + char)
    return input
An extract from our errors logs (running under mod_wsgi):
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] solr.add([data], commit=None, commitWithin=60 * 1000)
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] File "/opt/s4uadmin/eggs/pysolr-2.0.15-py2.7.egg/pysolr.py", line 686, in add
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] m = ET.tostring(message, encoding='utf-8')
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1121, in tostring
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] ElementTree(element).write(file, encoding, method=method)
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] serialize(write, self._root, encoding, qnames, namespaces)
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 927, in _serialize_xml
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] v = _escape_attrib(v, encoding)
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1087, in _escape_attrib
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] _raise_serialization_error(text)
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1047, in _raise_serialization_error
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] "cannot serialize %r (type %s)" % (text, type(text).__name__)
[Tue Mar 06 10:11:10 2012] [error] [client 127.0.0.1] TypeError: cannot serialize 60000 (type int)
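ElementTree refuses to serialize non-string attribute values, which is why the int commitWithin above blows up. A sketch of a possible fix, coercing attribute values to str before building the element (`element_with_str_attrs` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

def element_with_str_attrs(tag, attrs):
    """Coerce every attribute value (e.g. an int commitWithin such as
    60 * 1000) to str, since ElementTree only serializes strings."""
    return ET.Element(tag, {key: str(value) for key, value in attrs.items()})

elem = element_with_str_attrs("add", {"commitWithin": 60 * 1000})
serialized = ET.tostring(elem)
```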
Hi,
I want to submit a patch to pysolr, but I'm experiencing some issue when running the test suite.
To run the test suite I'm using:
$ python -m unittest tests
[...]
SolrError: [Reason: ERROR:unknown field 'price']
----------------------------------------------------------------------
Ran 36 tests in 1.481s
The tests are failing because of an unknown field 'price'.
Can you tell me which schema.xml you are using for the tests, as well as which solrconfig.xml and Solr version(s)?
Regards,
Causes IOError because of https://github.com/toastdriven/pysolr/blob/master/setup.py#L13
There is this bit in the code:
except SyntaxError as err:
    full_html = "%s" % response
It's trying to get the response into a str when the xml parsing fails.
But if that response contains unicode, that line will error.
The response it's trying to parse is:
'{"responseHeader":{"status":400,"QTime":281,"params":{"q":"*:*","facet.field":["facet_keywords_exact","grade","{!ex=category_exact}category_exact","hierarchy4"],"fl":"* score","start":"0","fq":["django_ct:(campsite.campsite)","{!geofilt pt=45.2065182704,0.826721191406 sfield=point d=25.749504}"],"f.facet_keywords_exact.facet.limit":"1000","rows":"1","facet":"on","wt":"json"}},"response":{"numFound":5,"start":0,"maxScore":1.0,"docs":[{"rating":"0","point":"45.4129972217,0.931974828309","grade":3,"text":"Camping le Repaire\\nCamping Le Repaire is in the beautiful region of P\xc3\xa9rigord, surrounded by trees, nature and rivers. Located only a short walk from the small village of Thiviers, this family-run campsite is ideal for a relaxing holiday with family and friends.\\r\\n\\r\\nOur facilities include:\\r\\n- two toilet/shower blocks\\r\\n- disabled access facilities\\r\\n- swimming pool\\r\\n- playground\\r\\n- bar and takeaway\\r\\n- laundry room\\r\\n\\r\\nPlay p\xc3\xa9tanque, play table football, ping pong or pool - or simply relax by the pool with an ice cream or an aperitif.\\r\\n\\r\\nOur friendly barbecues are not to be missed!\\nThiviers, Dordogne, Aquitaine, France\\nThiviers\\nDordogne\\nFrance\\nDordogne\\nfrance/aquitaine/dordogne/thiviers/camping-le-repaire\\ncamping-le-repaire\\n aa-pennant-none bar-nearby bar-or-club-house barbecues-allowed beach-excellent-water-quality beaches book-pitchup canoeingkayaking-nearby car-parking-by-pitch cycle-hire-nearby cycling-nearby disabled-facilities dogs-allowed dogs-allowed-all-year drainage-hook-up-points-for-tourers electrical-hook-up-points-for-tourers english family-friendly fishing free-wifi games-room gastronomic-delight hard-standings horse-riding-nearby is_balance_of_payment_taken_on_arrival jumbo-tent-pitches large-51-200-pitches launderette leisuretheme-park-nearby motorcycle-friendly motorhome-service-point nearby-farmers-market on-site-restaurantcafe outdoor-pool-nearby outdoor-swimming-pool 
peaceful play-area public-telephone rallies-welcome recycling-available restaurant-nearby shop-nearby shower-available take-away tennis-nearby toilet-block tv-room washing-up-area water-hook-up-points-for-tourers wifi \\nOur campsite is a few minutes from Thiviers, the capital of foie gras and truffles, so make sure you taste the exquisite local cuisine.\\r\\n\\r\\nA must visit is the Grottes de Villars (10km), a cave with 17,000-year-old prehistoric paintings, stalactites and other cave formations. \\r\\n\\r\\nKids will love the Chateau de Puyguilhem (11km), a sixteenth century castle listed as a historic monument by UNESCO.\\r\\n\\r\\nOr go to one of the many summer concerts in the ruins of Abbey de Boschaud, a twelfth century Cistercian monastery.\\n\\n Barbecues allowed\\n\\n Games room\\n\\n Water hook-up points for tourers\\n\\n On-site restaurant/cafe\\n\\n Drainage hook-up points for tourers\\n\\n Electrical hook-up points for tourers\\n\\n Shower available\\n\\n Cycling nearby\\n\\n Nearby farmers' market\\n\\n Fishing\\n\\n Leisure/theme park nearby\\n\\n Canoeing/kayaking nearby\\n\\n Take away\\n\\n Cycle hire nearby\\n\\n Dogs allowed all year\\n\\n Recycling available\\n\\n Car parking by pitch\\n\\n Launderette\\n\\n Bar or club house\\n\\n Public telephone\\n\\n Horse riding nearby\\n\\n Outdoor swimming pool\\n\\n Play area\\n\\n Toilet block\\n\\n TV room\\n\\n Washing-up area\\n\\n Bar nearby\\n\\n Outdoor pool nearby\\n\\n Restaurant nearby\\n\\n Shop nearby\\n\\n Tennis nearby\\n\\n Wifi\\n\\n Hard standings\\n\\n Motorhome service point\\n\\n Disabled facilities\\n\\n Dogs allowed\\n\\n Family friendly\\n\\n Jumbo tent pitches\\n\\n Free wifi\\n\\n Large (51-200 pitches)\\n\\n Rallies welcome\\n\\n Motorcycle friendly\\n\\n Peaceful\\n\\n Gastronomic 
delight\\n","has_primary_photo":true,"django_ct":"campsite.campsite","facilities":["barbecues-allowed","games-room","water-hook-up-points-for-tourers","on-site-restaurantcafe","drainage-hook-up-points-for-tourers","electrical-hook-up-points-for-tourers","shower-available","cycling-nearby","nearby-farmers-market","fishing","leisuretheme-park-nearby","canoeingkayaking-nearby","take-away","cycle-hire-nearby","dogs-allowed-all-year","recycling-available","car-parking-by-pitch","launderette","bar-or-club-house","public-telephone","horse-riding-nearby","outdoor-swimming-pool","play-area","toilet-block","tv-room","washing-up-area","bar-nearby","outdoor-pool-nearby","restaurant-nearby","shop-nearby","tennis-nearby","wifi","hard-standings","motorhome-service-point","disabled-facilities","dogs-allowed","family-friendly","jumbo-tent-pitches","free-wifi","large-51-200-pitches","rallies-welcome","motorcycle-friendly","peaceful","gastronomic-delight"],"lead_price_one_night":0.0,"hierarchy_tree_exact":["france/aquitaine/dordogne/thiviers/","france/aquitaine/dordogne/","france/aquitaine/","france/"],"id":"campsite.campsite.12273","bookable":true,"category":["tent-pitches","touring-pitches","motorhomes"],"django_id":"12273","content_autocomplete_text":"\\n{\\n \\"value\\": \\"Camping le Repaire, Thiviers\\",\\n \\"tokens\\": [\\"Thiviers\\", \\"france/aquitaine/dordogne/thiviers/camping\\\\u002Dle\\\\u002Drepaire\\"],\\n \\"categories\\": [\\n {\\"sprite_class\\": \\"tents\\", \\"id\\": 4, \\"name\\": \\"Tent pitches\\"}\\n \\n ,{\\"sprite_class\\": \\"tourers\\", \\"id\\": 3, \\"name\\": \\"Touring pitches\\"}\\n \\n ,{\\"sprite_class\\": \\"motorhomes\\", \\"id\\": 10, \\"name\\": \\"Motorhomes\\"}\\n ],\\n \\"name\\": \\"Camping le Repaire, Thiviers\\",\\n \\"url\\": \\"france/aquitaine/dordogne/thiviers/camping\\\\u002Dle\\\\u002Drepaire\\",\\n \\"bookable\\": true,\\n \\"thumb\\": 
\\"https://media.pitchup.co.uk/images/2/image/upload/t_thumb_v2/v1383148631/camping\\\\u002Dle\\\\u002Drepaire/camping\\\\u002Dle\\\\u002Drepaire\\\\u002Dthe\\\\u002Dpond.jpg\\"\\n}","hierarchy0_exact":"france/","expected_value":6.488E-5,"hosted_online_booking":true,"category_ids":"4,3,10","campsite_id":12273,"parent_hierarchy_name":"Thiviers","content_autocomplete":"Camping le Repaire","rate_count":0,"hierarchy3_exact":"france/aquitaine/dordogne/thiviers/","name_sortable":"CampingleRepaire","available_pitches_tents":false,"category_exact":["tent-pitches","touring-pitches","motorhomes"],"available_pitches_lodges":false,"hierarchy1":"france/aquitaine/","enable_availability":true,"has_tagged_url":false,"available_to_search":true,"path":"france/aquitaine/dordogne/thiviers/camping-le-repaire","available_pitches_motorhomes":false,"hierarchy2_exact":"france/aquitaine/dordogne/","facet_keywords":["is_balance_of_payment_taken_on_arrival","barbecues-allowed","aa-pennant-none","nearby-farmers-market","beach-excellent-water-quality","fishing","recycling-available","outdoor-swimming-pool","tv-room","outdoor-pool-nearby","washing-up-area","dogs-allowed","car-parking-by-pitch","play-area","leisuretheme-park-nearby","public-telephone","on-site-restaurantcafe","peaceful","large-51-200-pitches","tennis-nearby","canoeingkayaking-nearby","cycle-hire-nearby","jumbo-tent-pitches","horse-riding-nearby","cycling-nearby","hard-standings","drainage-hook-up-points-for-tourers","take-away","free-wifi","shower-available","water-hook-up-points-for-tourers","dogs-allowed-all-year","gastronomic-delight","shop-nearby","bar-nearby","beaches","disabled-facilities","electrical-hook-up-points-for-tourers","motorhome-service-point","restaurant-nearby","wifi","bar-or-club-house","book-pitchup","rallies-welcome","family-friendly","launderette","motorcycle-friendly","english","games-room","toilet-block"],"facet_keywords_exact":["is_balance_of_payment_taken_on_arrival","barbecues-allowed","aa-pennant-none"
,"nearby-farmers-market","beach-excellent-water-quality","fishing","recycling-available","outdoor-swimming-pool","tv-room","outdoor-pool-nearby","washing-up-area","dogs-allowed","car-parking-by-pitch","play-area","leisuretheme-park-nearby","public-telephone","on-site-restaurantcafe","peaceful","large-51-200-pitches","tennis-nearby","canoeingkayaking-nearby","cycle-hire-nearby","jumbo-tent-pitches","horse-riding-nearby","cycling-nearby","hard-standings","drainage-hook-up-points-for-tourers","take-away","free-wifi","shower-available","water-hook-up-points-for-tourers","dogs-allowed-all-year","gastronomic-delight","shop-nearby","bar-nearby","beaches","disabled-facilities","electrical-hook-up-points-for-tourers","motorhome-service-point","restaurant-nearby","wifi","bar-or-club-house","book-pitchup","rallies-welcome","family-friendly","launderette","motorcycle-friendly","english","games-room","toilet-block"],"name":"Camping le Repaire","hierarchy1_exact":"france/aquitaine/","available_pitches_caravans":false,"hierarchy_tree":["france/aquitaine/dordogne/thiviers/","france/aquitaine/dordogne/","france/aquitaine/","france/"],"available_pitches_rent_a_tent":false,"has_availability":true,"pitchtype_ids":[3915],"hierarchy0":"france/","hierarchy3":"france/aquitaine/dordogne/thiviers/","hierarchy2":"france/aquitaine/dordogne/","available_pitches_tourers":false,"last_booking_time":"2014-03-08T09:14:53.327Z","api_type":-1,"_version_":1489653176005033984,"score":1.0}]},"error":{"msg":"undefined field: \\"hierarchy4\\"","code":400}}\n'
Pull requests stacking up a bit here. Anyone still active?
The content type provided to the requests library is b'Content-type' -- requests wants a str. The POST goes out with multiple content type headers as a result:
POST /solr/update/?commit=true HTTP/1.1
Host: localhost:8983
Accept-Encoding: gzip, deflate, compress
Content-Type: application/x-www-form-urlencoded
Content-type: text/xml; charset=utf-8
Accept: */*
Content-Length: 89
User-Agent: python-requests/1.1.0 CPython/3.3.0 Linux/3.4.9
If Solr sees the text/xml header first, the document gets indexed. If x-www-form-urlencoded wins, nothing is indexed. Solr reports {"responseHeader":{"status":0,"QTime":67}} in both cases.
Applying this patch provides str headers instead, correcting the issue for me: https://gist.github.com/4683244
File "test.py", line 44, in
conn.add(docs)
File "/usr/local/lib/python2.6/dist-packages/pysolr-2.0.11-py2.6.egg/pysolr.py", line 394, in add
m = ET.tostring(message)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1009, in tostring
ElementTree(element).write(file, encoding)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 663, in write
self._write(file, self._root, encoding, {})
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
self._write(file, n, encoding, namespaces)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
self._write(file, n, encoding, namespaces)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 705, in _write
file.write(_escape_cdata(node.text, encoding))
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 807, in _escape_cdata
return _encode_entity(text)
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 792, in _encode_entity
return _encode(pattern.sub(escape_entities, text), "ascii")
File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 751, in _encode
return s.encode(encoding)
I was surprised to see ElementTree converting unicode to ASCII here (doesn't Solr accept Unicode?). The simplest solution was to modify pysolr at line 394 by adding an encoding:
m = ET.tostring(message, encoding='utf8')
Am I missing something obvious here? I can't be the first person to try to use unicode...
-b
If I just don't understand existing support for this I apologize in advance.
It looks like when a select query is formed, the only way to add additional parameters to your query is by using kwargs. This doesn't work for params with periods ('.') in their names, because the Python parser rejects such a keyword as an invalid expression. An example would be facet.field.
It seems a simple modification to solve this problem would be to allow another argument to search() that is a dictionary of param:argument pairs. It could have a default value so that not providing it wouldn't raise an exception. This way arbitrary select params could be supplied, and if you don't get rid of **kwargs, existing code should keep working. The arbitrary param arguments might also need to be encoded as UTF-8, like you're currently doing with the query.
What do you think?
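For what it's worth, Python already accepts non-identifier string keys through **-unpacking at the call site, so a separate dict argument may not be strictly necessary. A sketch with a stand-in function (not the real Solr.search):

```python
def fake_search(q, **kwargs):
    """Stand-in for Solr.search, just echoing what it would receive."""
    return dict(kwargs, q=q)

# `facet.field` is not a valid identifier, so it cannot appear as a bare
# keyword -- but dict unpacking at the call site side-steps that:
params = fake_search("*:*", **{"facet": "true", "facet.field": "author"})
```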
Hello,
I'm getting UnicodeDecodeErrors originating from the _scrape_response method of the Solr class. The line which causes the error can be found here: https://github.com/toastdriven/pysolr/blob/fee48b6401664c746d24f0abd9955761f5d362dd/pysolr.py#L465
My response is in JSON, and thus it can't be decoded as XML and goes into the except branch. It seems like this problem only occurred when Solr returned an error. In my case, Solr returns a 500 error, but that fact is hidden by pysolr this way. This is the last part of my stacktrace:
File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 578, in search
response = self._select(params)
File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 315, in _select
return self._send_request('post', path, body=params_encoded, headers=headers)
File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 293, in _send_request
error_message = self._extract_error(resp)
File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 372, in _extract_error
reason, full_html = self._scrape_response(resp.headers, resp.content)
File "pysolr-3.1.0-py2.7.egg/pysolr.py", line 440, in _scrape_response
full_html = "%s" % response
To be honest, I don't know why that line is even trying to transform it into a bytestring, because the response is already a bytestring as far as I can see. All I know is that it originated in 2ea8435.
results = solr.search('query', **{'fq': 'title', 'cursorMark': '*', 'sort': 'id desc'})
but I can't see nextCursorMark in the results. For more details on using cursorMark, see http://heliosearch.org/solr/paging-and-deep-paging/
Any solution?
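For reference, the deep-paging loop itself is mechanical once the client exposes nextCursorMark (recent pysolr versions put it on the Results object). A sketch of the loop against a stubbed search function, so it runs without a Solr server; `fetch_all` and the stub are hypothetical:

```python
def fetch_all(search, q, rows=2):
    """Generic cursorMark loop: keep passing the returned nextCursorMark
    back until it stops changing, per Solr's deep-paging contract.
    Requires a sort that includes the uniqueKey field."""
    docs, cursor = [], "*"
    while True:
        page = search(q, cursorMark=cursor, rows=rows, sort="id asc")
        docs.extend(page["docs"])
        if page["nextCursorMark"] == cursor:
            return docs
        cursor = page["nextCursorMark"]

# Stub backend standing in for Solr, paging over five fake docs:
DATA = list(range(5))

def stub_search(q, cursorMark, rows, sort):
    start = 0 if cursorMark == "*" else int(cursorMark)
    chunk = DATA[start:start + rows]
    nxt = str(start + len(chunk)) if chunk else cursorMark
    return {"docs": chunk, "nextCursorMark": nxt}
```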
On line 468 in pysolr.py, the full_html needs to be decoded with utf-8 in order to handle the Swedish language.
Fix: full_html = full_html.decode('utf-8')
See comment in this commit: 2ea84356
BeautifulSoup isn't declared as a requirement so it's not automatically installed with pysolr.
lxml version 2.0 and greater require keyword arguments to be used explicitly in some methods. One of these is lxml.etree.tostring. Currently pysolr.py (2.0.13) is using this function with the encoding parameter as a positional argument which works fine with older versions of lxml. By explicitly using the keyword 'encoding' with this argument, pysolr will work with later versions of lxml as well.
The boost keyword argument behaves in a strange way when used in combination with multivalued fields.
solr = Solr(solr_url)
solr.add([{'foo':'bar', 'bar':['foo', 'baz']}], boost={'foo':5})
Resulting message:
<add>
<doc>
<field boost="5" name="foo">bar</field>
<field boost="5" name="bar">foo</field>
<field name="bar">baz</field>
</doc>
</add>
Is this the expected behavior?
When I add documents in russian I get strings like "С�С, лв�". To make pysolr2 work correctly I changed line 272:
return self._send_request('POST', path, message, {'Content-type': 'text/xml; charset=utf-8'})
Hi all,
pysolr is a good library for communicating between Python and a Solr server! But I found a little bug in the "extract" method:
If there are spaces or special characters in the filename of the provided file object, the following error occurs:
pysolr ERROR [Reason: None]
{"responseHeader":{"status":400,"QTime":1},"error":{"msg":"missing content stream","code":400}}
pysolr ERROR Failed to extract document metadata: [Reason: None]
{"responseHeader":{"status":400,"QTime":11},"error":{"msg":"missing content stream","code":400}}
Traceback (most recent call last):
File "pysolr.py", line 905, in extract
files={'file': (file_obj.name, file_obj)})
File "/home/cw/Aptana Studio 3 Workspace/arcdoc/pysolr.py", line 321, in _send_request
raise SolrError(error_message)
SolrError: [Reason: None]
{"responseHeader":{"status":400,"QTime":11},"error":{"msg":"missing content stream","code":400}}
To fix this bug, I added the following lines:
in line 891: filename = urllib.quote(file_obj.name)
in line 897: files={'file': (filename, file_obj)})
Here is an overview of the extract method with the bug fixed:
def extract(self, file_obj, extractOnly=True, **kwargs):
"""
POSTs a file to the Solr ExtractingRequestHandler so rich content can
be processed using Apache Tika. See the Solr wiki for details:
http://wiki.apache.org/solr/ExtractingRequestHandler
The ExtractingRequestHandler has a very simple model: it extracts
contents and metadata from the uploaded file and inserts it directly
into the index. This is rarely useful as it allows no way to store
additional data or otherwise customize the record. Instead, by default
we'll use the extract-only mode to extract the data without indexing it
so the caller has the opportunity to process it as appropriate; call
with ``extractOnly=False`` if you want to insert with no additional
processing.
Returns None if metadata cannot be extracted; otherwise returns a
dictionary containing at least two keys:
:contents:
Extracted full-text content, if applicable
:metadata:
key:value pairs of text strings
"""
if not hasattr(file_obj, "name"):
raise ValueError("extract() requires file-like objects which have a defined name property")
params = {
"extractOnly": "true" if extractOnly else "false",
"lowernames": "true",
"wt": "json",
}
params.update(kwargs)
filename = urllib.quote(file_obj.name)
try:
# We'll provide the file using its true name as Tika may use that
# as a file type hint:
resp = self._send_request('post', 'update/extract',
body=params,
files={'file': (filename, file_obj)})
except (IOError, SolrError) as err:
self.log.error("Failed to extract document metadata: %s", err,
exc_info=True)
raise
try:
data = json.loads(resp)
except ValueError as err:
self.log.error("Failed to load JSON response: %s", err,
exc_info=True)
raise
data['contents'] = data.pop(filename, None)
data['metadata'] = metadata = {}
raw_metadata = data.pop("%s_metadata" % filename, None)
if raw_metadata:
# The raw format is somewhat annoying: it's a flat list of
# alternating keys and value lists
while raw_metadata:
metadata[raw_metadata.pop()] = raw_metadata.pop()
return data
Please let me know whether this bug will be fixed in the next release!
Thanks!
Solr 3.3 has grouping (collapsing) support: http://wiki.apache.org/solr/FieldCollapsing
So it would be nice to add this feature to pysolr. I've added grouping support that works for me.
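Until first-class support lands, grouping can at least be requested through plain query parameters, since `search()` forwards extra kwargs into the querystring. A hedged sketch (`site_id` is a made-up field name, and parsing the grouped response shape may still need custom handling):

```python
# Standard Solr FieldCollapsing options, passed straight through the
# querystring. 'site_id' is a hypothetical field name.
group_params = {
    'group': 'true',
    'group.field': 'site_id',
    'group.limit': 3,
}

# Against a live server, roughly:
#   results = solr.search('query', **group_params)
#   # the grouped docs arrive under 'grouped' in the raw JSON response

print(group_params)
```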
conn.optimize()
File "C:\tools\Python27\lib\site-packages\pysolr.py", line 825, in optimize
return self._update(msg, waitFlush=waitFlush, waitSearcher=waitSearcher)
File "C:\tools\Python27\lib\site-packages\pysolr.py", line 359, in _update
return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
File "C:\tools\Python27\lib\site-packages\pysolr.py", line 278, in _send_request
raise SolrError(error_message % [url, self.timeout])
TypeError: not enough arguments for format string
Bug fix: line 278 of pysolr.py should change as follows.
Patch:
-raise SolrError(error_message % [url, self.timeout])
+raise SolrError(error_message % (url, self.timeout))
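The root cause is easy to demonstrate: the `%` operator only unpacks a tuple of arguments, while a list is treated as one single value, leaving the second placeholder with nothing to consume:

```python
template = "url: %s, timeout: %s"

# A tuple supplies one argument per placeholder:
ok = template % ('http://localhost:8983/solr', 10)

# A list is a single value, so the second %s goes unfilled:
try:
    template % ['http://localhost:8983/solr', 10]
except TypeError as exc:
    print(exc)  # not enough arguments for format string

print(ok)
```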
Solr allows various handlers to be defined for different purposes in solrconfig.xml; for example, one could have a "/solr/select" handler for default searching and a "/solr/articlesearch" handler for searching with different filters and settings.
In pysolr, however, only the default handlers (/solr/select, /solr/update, etc.) can be used.
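As a sketch of what a fix might look like (everything below is hypothetical; stock pysolr hard-codes paths like "select" and "update" when it builds request URLs), the handler could become a constructor argument:

```python
class ConfigurableSolr(object):
    """Toy stand-in for pysolr.Solr with a pluggable request handler."""

    def __init__(self, url, search_handler='select'):
        self.url = url.rstrip('/')
        self.search_handler = search_handler

    def query_url(self, querystring):
        # e.g. http://localhost:8983/solr/articlesearch?q=foo
        return '%s/%s?%s' % (self.url, self.search_handler, querystring)

solr = ConfigurableSolr('http://localhost:8983/solr/',
                        search_handler='articlesearch')
print(solr.query_url('q=foo'))
```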
For very long queries I get this:
pysolr.pyc in _select(self, params)
240 else:
241 # Handles very long queries by submitting as a POST.
--> 242 path = '%s/select/?%s' % (self.path,)
243 headers = {
244 'Content-type': 'application/x-www-form-urlencoded',
TypeError: not enough arguments for format string
safe_urlencode() is intended to provide a Unicode-friendly urlencode on top of urllib's urlencode() function. The implementation of safe_urlencode() contains this snippet:
if isinstance(v, basestring):
new_params.append((k, v.encode("utf-8")))
elif isinstance(v, (list, tuple)):
new_params.append((k, [i.encode("utf-8") for i in v]))
else:
new_params.append((k, unicode(v)))
The third case is incorrect, since this is exactly what urllib's urlencode() function chokes on – and the sole reason safe_urlencode() exists in the first place:
>>> urllib.urlencode([('ellipsis', u'\u2026')])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/urllib.py", line 1269, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 0: ordinal not in range(128)
In contrast, encoding the unicode string into UTF-8 bytes produces the correct output (0xE2 0x80 0xA6 is the UTF-8 representation of U+2026 HORIZONTAL ELLIPSIS):
>>> urllib.urlencode([('ellipsis', u'\u2026'.encode('UTF-8'))])
'ellipsis=%E2%80%A6'
The part in the else clause should therefore read:
else:
new_params.append((k, unicode(v).encode('UTF-8')))
Note: this bug is only triggered for non-string objects (since those are handled in the first "if") with a unicode() result yielding characters outside the ASCII range.
(Oh, and while we're at it: params.items() near the top of the function should be params.iteritems() for more efficient looping.)
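Putting the report's fixes together, a corrected safe_urlencode could look like this (a sketch only: the report targets Python 2, so the text type and urlencode import are shimmed here to keep the example runnable on Python 3 as well):

```python
try:                                        # Python 2
    from urllib import urlencode
    text_type = unicode  # noqa: F821
except ImportError:                         # Python 3
    from urllib.parse import urlencode
    text_type = str


def safe_urlencode(params, doseq=True):
    """UTF-8-safe wrapper around urlencode, per the fix described above."""
    if hasattr(params, 'items'):
        params = params.items()

    new_params = []
    for k, v in params:
        if isinstance(v, text_type):
            new_params.append((k, v.encode('utf-8')))
        elif isinstance(v, (list, tuple)):
            new_params.append(
                (k, [i.encode('utf-8') if isinstance(i, text_type) else i
                     for i in v]))
        else:
            # The fix: coerce to text *and* encode, so urlencode never
            # chokes on non-ASCII characters in the coerced result.
            new_params.append((k, text_type(v).encode('utf-8')))
    return urlencode(new_params, doseq)


print(safe_urlencode([('ellipsis', u'\u2026')]))  # ellipsis=%E2%80%A6
```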