mozilla / elasticutils
[deprecated] A friendly chainable ElasticSearch interface for python
Home Page: http://elasticutils.rtfd.org
License: BSD 3-Clause "New" or "Revised" License
elasticutils currently doesn't do much on the indexing side of things. There are a few possible answers to this that have already been implemented and are lying around in the various projects that use elasticutils.
django-haystack has an approach that I like. In 2.0, they create a SearchIndex class that works much like Django models. This has two interesting effects:
Rob is working on this right now in https://github.com/robhudson/elasticutils/tree/declarative-mapping . He's using pyes to do it because it's got a lot of the bits implemented already.
I think we should go through the django-haystack SearchIndex and figure out what bits we want for the elasticutils SearchIndex.
This bug is for continuing and finalizing that work.
Consider an index with a boolean value 'happy'.
>>> S().facet('happy').facet_counts()
{u'happy': [{u'count': 600, u'term': u'T'}, {u'count': 600, u'term': u'F'}]}
Notice that term is "T" or "F". It would make a lot more sense for it to be True or False, considering this is a boolean field. This happens because, for some reason, ElasticSearch does it this way. When interacting with this field it gives back the proper JSON true and false, but in facets it does not.
Can we smooth over this oddity?
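One possible way to smooth this over, as a rough sketch only: post-process the facet entries for fields we already know are boolean. The helper name normalize_boolean_facet is hypothetical and not part of elasticutils, and the sketch assumes the facet_counts() shape shown above.

```python
# Hypothetical helper; not part of elasticutils.
# Assumes facet entries look like {'count': ..., 'term': 'T' or 'F'}.
BOOLEAN_TERMS = {'T': True, 'F': False}


def normalize_boolean_facet(entries):
    """Return a copy of the facet entries with 'T'/'F' terms mapped to bools."""
    normalized = []
    for entry in entries:
        fixed = dict(entry)  # copy so the original result is left alone
        fixed['term'] = BOOLEAN_TERMS.get(entry['term'], entry['term'])
        normalized.append(fixed)
    return normalized


counts = {'happy': [{'count': 600, 'term': 'T'}, {'count': 600, 'term': 'F'}]}
happy = normalize_boolean_facet(counts['happy'])
```

This only works if the caller knows which fields are boolean (from the mapping, for instance); a string field that legitimately contains "T" would otherwise be converted by mistake.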
I was looking for statsd support in ElasticUtils, and found it in contrib.django, which is great :)
However, is there any reason why this is not directly a part of ElasticUtils? (could be useful to folks using EU outside of django)
The test code is in tests/tests.py, which is kind of silly. Because there are other files in the distribution with the word "test" in the filename, that creates a series of problems.
First, it forces us to run the tests like this:
DJANGO_SETTINGS_MODULE=es_settings nosetests -w tests
Second, one big file with all the tests in it isn't necessary. Plus we really should have more test coverage for various cases. I think it's time the monolithic tests.py got broken up into smaller parts.
If you set your ES_HOSTS to a port not in the 9200-9299 range, say 9300, you will get an annoying failure in every single unit test.
..[snip]
File "/Users/andy/sandboxes/zamboni/vendor/lib/python/pyes/es.py", line 157, in init
self._init_connection()
File "/Users/andy/sandboxes/zamboni/vendor/lib/python/pyes/es.py", line 176, in _init_connection
raise RuntimeError("If you want to use thrift, please install pythrift")
RuntimeError: If you want to use thrift, please install pythrift
The cause is this:
https://github.com/mozilla/zamboni-lib/blob/master/lib/python/pyes/es.py#L172
I'm thinking that since pyes pretty much requires ports to start with 9200 unless you have thrift enabled we should pretty much blow up if that's the case.
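A minimal sketch of what "blowing up" early could look like. The function name check_es_hosts is hypothetical, and the 9200-9299 range is taken from the description above rather than checked against pyes itself.

```python
# Assumption (from the report above): without thrift, only ports in
# 9200-9299 are treated as HTTP ports.
HTTP_PORTS = range(9200, 9300)


def check_es_hosts(hosts):
    """Raise a clear error if any 'host:port' entry uses a non-HTTP port."""
    for host in hosts:
        port = int(host.rsplit(':', 1)[1])
        if port not in HTTP_PORTS:
            raise ValueError(
                'ES_HOSTS entry %r uses port %d; expected 9200-9299 '
                'unless thrift is installed' % (host, port))
    return hosts
```

Calling something like this once at startup would turn the confusing thrift traceback into a single configuration error.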
Rob threw this together:
It'd be super to get that to a stable state (for whatever that means) and then toss it in contrib.django.
From Rob's email:
Kumar found this, and I was curious if you knew about it?
>>> S(UserProfile).filter(F(email__fuzzy='robh') | F(username__fuzzy='robh'))
------------------------------------------------------------
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "/Users/rob/git/zamboni/vendor/src/elasticutils/elasticutils/__init__.py", line 174, in __init__
self.filters = items[0]
IndexError: list index out of range
Almost matches a similar example in the docs.
Add support for text_phrase queries.
http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html
I'm on 0.19.8 and am running the tests on current elasticutils and get a lot of exceptions.
http://www.elasticsearch.org/guide/reference/api/search/explain.html
We should support that. Srsly.
When explain = True, the returned hit has an additional _explanation field, which is this crazy-ass data structure. We want to look at search results for SUMO, so I want to search for or write a parser that turns that crazy-ass data structure into something that humans can read. That probably doesn't need to be part of this bug.
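As a starting point for such a parser, here is a small sketch. It assumes each explanation node is a dict with value, description and an optional details list of child nodes; the sample data is made up for illustration and is not real ES output.

```python
def format_explanation(explanation, depth=0):
    """Render an explanation node as indented 'value  description' lines."""
    lines = ['%s%s  %s' % ('  ' * depth,
                           explanation.get('value'),
                           explanation.get('description', ''))]
    # 'details' may be absent or empty on leaf nodes.
    for detail in explanation.get('details') or []:
        lines.extend(format_explanation(detail, depth + 1))
    return lines


# Made-up sample with the assumed shape.
sample = {
    'value': 0.5,
    'description': 'sum of:',
    'details': [
        {'value': 0.5, 'description': 'weight(title:foo)'},
    ],
}
text = '\n'.join(format_explanation(sample))
```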
Add a link to this in the docs:
https://github.com/karmi/elasticsearch-paramedic
There's a section for debugging/seeing-into-elasticsearch. This should probably go there.
pyes keeps changing in non-trivial ways and the API isn't particularly stable. Additionally, it's pretty intense.
I think we should do one of three things:
This is pretty far future stuff. Probably shouldn't spend the time on this until there's a compelling reason to do so. We can hang out on pyes 0.15 for now.
Text queries and term queries are nice, but it'd be super nice to support query_string. That allows elasticutils to do search queries using the query parser syntax.
Relevant documentation:
Mozilla projects use a load balancer which sometimes in its infinite wisdom returns HTML responses along the lines of "oh noes! something is wrong! service unavailable!" That would cause elasticutils/pyes to raise an ElasticSearchException (not very helpful).
Need to find out what elasticutils/pyelasticsearch does in these situations and document it somewhere.
Probably the best thing to do is to write a mocked test for it.
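The shape of such a mocked test might look like the sketch below. It does not describe what elasticutils or pyelasticsearch actually do: run_search, SearchUnavailable and the transport object are all hypothetical stand-ins, and the mock simply returns the kind of HTML a load balancer might send.

```python
import json
from unittest import mock


class SearchUnavailable(Exception):
    """Hypothetical error for non-JSON responses (e.g. a load balancer page)."""


def run_search(transport, query):
    """Send the query through transport and decode the body, failing clearly."""
    body = transport.send(query)
    try:
        return json.loads(body)
    except ValueError:
        raise SearchUnavailable('non-JSON response: %r' % body[:60])


# Stand-in for the HTTP layer; returns what the load balancer might send.
transport = mock.Mock()
transport.send.return_value = '<html>503 Service Unavailable</html>'

try:
    run_search(transport, {'query': {'match_all': {}}})
    outcome = 'no error'
except SearchUnavailable as exc:
    outcome = str(exc)
```

Swapping the stand-in for the real client would show which exception actually comes out, which is the thing that needs documenting.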
Currently, .facets() takes raw ES that's sort of keyword driven, and it also automatically applies facet_filtered to all the facets.
It'd be better if it was args-driven and handled the obvious use case with additional global and filtered flags.
If the filtered flag is set to True, then we copy all the filters over and the filters thus affect the facets. If it's False, we don't, and the facets are only affected by the query.
If the global flag is set to True, then we set global=True on the facets and they apply to the entire corpus.
Examples of usage:
searcher = S().facet('style')
searcher = S().facet('style', 'color')
searcher = S().facet('style', 'color', filtered=True)
searcher = S().facet('style', 'color', global=True)
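To make the proposal concrete, here is a sketch of how the facet dict could be built from those arguments. build_facets is a hypothetical standalone function, not the S.facet() implementation. Since global is a reserved word in Python, the sketch spells the flag global_ (an assumption), and the filters argument stands in for whatever filters the S object currently holds.

```python
def build_facets(*fields, filtered=False, global_=False, filters=None):
    """Build a terms-facet dict for each field name.

    filtered copies the current filters onto each facet as facet_filter;
    global_ marks each facet as global so it covers the entire corpus.
    """
    facets = {}
    for field in fields:
        facet = {'terms': {'field': field}}
        if filtered and filters:
            facet['facet_filter'] = filters
        if global_:
            facet['global'] = True
        facets[field] = facet
    return facets


plain = build_facets('style', 'color')
with_filters = build_facets('style', filtered=True,
                            filters={'term': {'size': 'large'}})
whole_corpus = build_facets('style', global_=True)
```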
Currently .facet() only does terms facets. ElasticSearch supports other facet types:
We want to use date histogram in Input, so we have a current need for implementing that. @rlr expressed a deep-seated yearning for histogram as well.
It'd be nice if ElasticUtils supported more facet types.
It's hard to see the list of all the transforms without wading through all the documentation. There are enough of them now that we should solve this in two ways:
https://github.com/mozilla/elasticutils/blob/master/elasticutils/__init__.py#L582
That should raise an exception if the action doesn't exist. Should do the same thing that _process_filters does:
https://github.com/mozilla/elasticutils/blob/master/elasticutils/__init__.py#L102
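A sketch of the kind of check being asked for. The action list here is illustrative only, and both split_field_action and the exception class are defined locally for the example rather than taken from the elasticutils source.

```python
# Illustrative subset only; the real list of actions lives in elasticutils.
ILLUSTRATIVE_ACTIONS = ('text', 'text_phrase', 'fuzzy', 'in')


class InvalidFieldActionError(Exception):
    """Raised when a field__action key uses an unknown action (local stand-in)."""


def split_field_action(key, known_actions=ILLUSTRATIVE_ACTIONS):
    """Split 'field__action' into (field, action), rejecting unknown actions."""
    if '__' not in key:
        return key, None
    field, action = key.rsplit('__', 1)
    if action not in known_actions:
        raise InvalidFieldActionError(
            '%r is not a valid field action (from %r)' % (action, key))
    return field, action
```

Failing loudly here means a typo like title__txt surfaces as an error instead of silently producing the wrong query.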
Add highlighting/excerpting support on a par with what we have in oedipus. We put enough thought in up front that the API should work equally well with ES.
There's another issue for ditching pyes, but for now, let's keep it.
This issue covers fixing elasticutils code and documentation to work with a more recent version of pyes.
One thing we have to be wary of is looking at all the pyes changes in previous versions and noting API changes. I remember Dave talking about indexes -> indices (or something similar). If there are a lot of those, then we need to note that explicitly in release notes of elasticutils so people upgrading have a clue that they're walking into a beehive of angry bees.
Get us listed on http://www.elasticsearch.org/guide/appendix/clients.html
In some cases, combining filter objects using the & or | operator will cause the original filters to be modified. This happens when the left-hand filter already uses the same combination, i.e. &ing a filter that is already an and filter.
>>> from elasticutils import F
>>> f1 = F(one=1, uno=1)
>>> f2 = F(two=1, dos=2)
>>> f1.filters
{'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]}
>>> f2.filters
{'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]}
>>> f1 & f2
<elasticutils.F object at 0x7f8160dda450>
>>> f1.filters
{'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}, {'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]}]}
Notice that the last line shows that f1 has been modified, when the expected result is that it is not (in other words, f1.filters should return {'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]} again).
@willkg pointed out that this is because the F._combine method causes the filters attribute to be shared.
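One way to avoid the sharing, sketched with a minimal stand-in class: copy both operands before combining. This F is a simplified model written for the example (it takes a ready-made filter dict instead of keyword arguments) and is not the elasticutils implementation.

```python
import copy


class F(object):
    """Minimal stand-in for elasticutils.F; only & and | are modelled."""

    def __init__(self, filters=None):
        self.filters = filters or {}

    def _combine(self, other, conn):
        # Deep-copy both sides so the result shares no state with the operands.
        left = copy.deepcopy(self.filters)
        right = copy.deepcopy(other.filters)
        if conn in left:
            # Already an 'and'/'or' filter: extend the copy, not the original.
            left[conn].append(right)
            return F(left)
        return F({conn: [left, right]})

    def __and__(self, other):
        return self._combine(other, 'and')

    def __or__(self, other):
        return self._combine(other, 'or')


f1 = F({'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]})
f2 = F({'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]})
before = copy.deepcopy(f1.filters)
combined = f1 & f2
```

With the copies in place, f1.filters is unchanged after f1 & f2, and later changes to combined do not leak back into f1 or f2.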
Currently, there is no way to get all the results. You have to slice with a large enough number, which is kind of hacky.
This would be a nice convenience. We use this in oedipus to express not__in constraints, since we didn't bother implementing full F objects.
MLT expects get_indexes and get_doctypes to always return lists. Further, the names suggest they should always return lists.
I mostly fixed this a few months ago, but didn't finish up the fixes in the django contrib code.
That needs to get fixed. Also, should probably get fixed in the branch-0.5 branch where @robhudson noticed the problem.
The cron task doesn't work as expected... well the registration bit just doesn't work.
So we should either re-write it or scrap it.
The docs for boost should be improved. At a minimum, they should link to the relevant ES docs.
If you do a facet_raw with a date_histogram type, then elasticutils ignores the data. For example:
Python 2.7.3rc2 (default, Apr 22 2012, 22:35:38)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import fjord.feedback.models
>>> s = fjord.feedback.models.SimpleIndex.search()
>>> s.facet_raw(histo1={'date_histogram': {'interval': 'day', 'field': 'created'}}).facet_counts()
histo1 {u'_type': u'date_histogram', u'entries': [{u'count': 10, u'time': 1346889600000L}]}
{}
>>>
This should get fixed in two ways:
_type it doesn't recognize
There's no test for __in with filters.
I know it works because we use it heavily in Kitsune, but we really should have a test for it in the suite.
If you say .filter(some_key=None), EU tries to make a filter with a null value. ES throws an error about this. I believe this is because in ES you can't filter by null; instead you have to use a "missing field filter", but I can't access the ES docs right now to figure out exactly what this is.
This gist is a script that demonstrates the problem. There is also a full stack trace. https://gist.github.com/cdd8c58fab1294503261
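A sketch of the translation this would need, assuming the "missing field filter" mentioned above takes the form {'missing': {'field': name}}. build_filter is a hypothetical standalone function and only handles plain equality and None.

```python
def build_filter(**kwargs):
    """Translate key=value pairs into filter dicts, mapping None to 'missing'."""
    filters = []
    for key, value in sorted(kwargs.items()):
        if value is None:
            # Assumed shape of the ES "missing" filter.
            filters.append({'missing': {'field': key}})
        else:
            filters.append({'term': {key: value}})
    if len(filters) == 1:
        return filters[0]
    return {'and': filters}


only_missing = build_filter(some_key=None)
mixed = build_filter(some_key=None, other='x')
```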
It'd be interesting to pull that in. It catches ElasticSearch errors and throws up a "search unavailable" page instead of getting all "oh noes! i'm dying!"
https://github.com/davedash/elasticutils/blob/64a55ebaed0973a0027f3a0fffdfd8b7d083a903/elasticutils/__init__.py#L21 initializes the attr on _local only for the thread importing the module. Any other thread won't see that attr, and es_required() will throw an AttributeError the first time it's used on that thread.
values() in Django returns a dict, and values_list() returns a list. We should do that. At the very least, we should add a values_list() that returns a list (if nobody wants to update their client code).
elasticutils.rtfd.org is pointing to the davedash/elasticutils version of things and needs to point to the mozilla/elasticutils version of things.
I have vague memories of having this problem when we moved kitsune from jsocol/kitsune to mozilla/kitsune and what we ended up doing was deleting the kitsune docs project (or whatever it's called on rtfd) and then re-creating it later with the correct github url.
http://www.elasticsearch.org/guide/reference/query-dsl/terms-query.html
.query(foo__in=some_list)
Should be pretty straightforward.
@rlr went through and tested Kitsune (which uses elasticutils) with 0.19.8, 0.19.9, 0.19.10, and 0.19.11:
https://bugzilla.mozilla.org/show_bug.cgi?id=811300
Conclusion is that Kitsune tests fail with 0.19.9. It's probably the case that elasticutils doesn't work with 0.19.9, too.
This issue is about verifying that statement and noting incompatibility in the docs.
get_es() needs a way to:
This fixes the problem where we really want a 30s timeout for indexing operations. Plus it might make other get_es() usages easier, too.
This issue covers moving all the Django bits to elasticutils.contrib.django and also fixing the core so that it works without Django.
Rob and I worked out some of the kinks this causes in https://etherpad.mozilla.org/elasticutils-s
I'll follow that guide and rework things.
The docs lack an "advanced queries" section similar to the "advanced filters" section. Namely, this should document: or_, setting up boolean queries, ...
Add support for boosting query:
http://www.elasticsearch.org/guide/reference/query-dsl/boosting-query.html
I'm not really sure how to do this because the boosting query has a bunch of stuff you need to specify so it doesn't nicely fit in with our field__action=query motif.
I wrote up the issue here:
pyelasticsearch/pyelasticsearch#59
Bulk indexing is pretty key. I'm pretty sure there are Mozilla projects that are still using ES pre-0.18.0, so I think we need to wait until this is fixed before we do a release so we can depend on the correct version of pyelasticsearch.
except Exception:
    log.error(qs)
    raise
I think it's better if we just let the application log whatever it wants since it has to handle the exception already. We hooked up SUMO to Sentry and added a proper root handler and now are getting flooded with these useless log error messages: https://errormill.mozilla.org/support/
Does that make sense?
Rob mentioned https://gist.github.com/2895421
Adding more docs along those lines would be super helpful especially in showing how elasticutils API translates to equivalent elasticsearch REST calls. If someone knows elasticsearch already, that makes it much easier to use elasticutils. If someone doesn't, then it'll make it easier to learn elasticsearch.
Add a weight() method to S as in http://github.com/erikrose/oedipus so we can have a portable (and shorter) way to weight things. This will also let us put default weights on an S object rather than having to always set them at query time.
The sumo branch has highlighting/excerpting code. We should port that over, elasticsearch-ify it, and possibly change it to match django-haystack's syntax a bit more closely.
please :)
# This errors out.
import elasticutils
elasticutils.get_es().index({'object_id': 'id2'}, 'some_object_index', 'object', id='query_hash_1')
elasticutils.get_es().index({'object_id': 'some_object_id'}, 'some_object_index', 'object', id='query_hash_2')
[a for a in elasticutils.S().filter(object_id='some_object_id').indexes('some_object_index')]
# This works.
import elasticutils
elasticutils.get_es().index({'id': 1, 'object_id': 'id2'}, 'some_object_index', 'object', id='query_hash_1')
elasticutils.get_es().index({'id': 2, 'object_id': 'some_object_id'}, 'some_object_index', 'object', id='query_hash_2')
[a for a in elasticutils.S().filter(object_id='some_object_id').indexes('some_object_index')]
What's going on is that we default to asking for at least the id field, but there is no id field. In the case where none of the requested fields exist, ES doesn't return a fields key in the result dict. Need to handle that case better.
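One possible way to handle it, as a sketch: treat a missing fields key as an empty dict and fill in None for anything that was asked for. extract_fields is a hypothetical helper, and the sample hit only mimics the shape described above.

```python
def extract_fields(hit, requested=('id',)):
    """Return the requested fields, tolerating a hit with no 'fields' key."""
    fields = hit.get('fields') or {}
    return dict((name, fields.get(name)) for name in requested)


# Shape of a hit when none of the requested fields exist (made-up values).
hit_without_fields = {
    '_index': 'some_object_index',
    '_type': 'object',
    '_id': 'query_hash_2',
}
hit_with_fields = dict(hit_without_fields, fields={'id': 2})

missing = extract_fields(hit_without_fields)
present = extract_fields(hit_with_fields)
```

Whether a missing field should come back as None or be dropped from the result entirely is a separate design decision; the point is only that the lookup should not raise.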
This issue covers deciding whether elasticutils should be a Django library or not, and the work that results from that decision.
If it is to be a Django library, then we need to update the docs, remove the "if we're not using Django, then ..." code, and update the requirements files.
If it is not to be a Django library, then we need to clean up the codebase so that it works in non-Django contexts better. Personally, I think we should change this around so that it's a library with a Django-library-shim where the latter is what makes it a Django library. It's possible the two could come together in the same package---maybe as a separate djangolib module or something.
elasticsearch allows you to apply boosts in the query. I would like to add support for that.
The sumo branch has a .weight() transform that allows you to specify weights. We could use that, but I'd be interested in seeing other approaches.