mozilla / elasticutils

[deprecated] A friendly chainable ElasticSearch interface for python

Home Page: http://elasticutils.rtfd.org

License: BSD 3-Clause "New" or "Revised" License

Python 99.51% HTML 0.13% Shell 0.36%

elasticutils's Issues

support declarative mappings

elasticutils currently doesn't do much on the indexing side of things. There are a few possible answers on how to solve this that have been implemented and are lying around in the various elasticutils uses.

django-haystack has an approach that I like. In 2.0, they create a SearchIndex class that works much like Django models. This has two interesting effects:

It allows us to nix the implicit relationship between Django models and the documents that are indexed.
It allows us to more easily specify a mapping with defaults that are sane.

Rob is working on this right now in https://github.com/robhudson/elasticutils/tree/declarative-mapping . He's using pyes to do it because it's got a lot of the bits implemented already.

I think we should go through the django-haystack SearchIndex and figure out what bits we want for the elasticutils SearchIndex.

This bug is for continuing and finalizing that work.

Make faceting on booleans return True and False instead of "T" and "F"

Consider an index with a boolean value 'happy'.

>>> S().facet('happy').facet_counts()
{u'happy': [{u'count': 600, u'term': u'T'}, {u'count': 600, u'term': u'F'}]}

Notice that term is "T" or "F". It would make a lot more sense for it to be True or False, considering this is a boolean field. This is because, for some reason, ElasticSearch does it this way. When interacting with this field it gives back the proper json true and false, but in facets it does not.

Can we smooth over this oddity?
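One way to smooth this over is to post-process the facet entries for fields known to be boolean. The following is a minimal sketch, not elasticutils code: the helper name normalize_boolean_facet is hypothetical, and it assumes the caller already knows the field is boolean (a string field whose value is literally "T" would be converted too).

```python
def normalize_boolean_facet(facet_entries):
    """Return a copy of the facet entries with "T"/"F" terms mapped to booleans."""
    mapping = {"T": True, "F": False}
    return [
        # Terms other than "T"/"F" are passed through unchanged.
        dict(entry, term=mapping.get(entry["term"], entry["term"]))
        for entry in facet_entries
    ]


# Same shape as the facet_counts() output shown in the issue.
raw = {"happy": [{"count": 600, "term": "T"}, {"count": 600, "term": "F"}]}
smoothed = {name: normalize_boolean_facet(entries) for name, entries in raw.items()}
```

The original entries are left untouched, so the raw ES output stays available for debugging.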

remove statsd call from django contrib code

I was looking for statsd support in ElasticUtils, and found it contrib.django, which is great :)

However, is there any reason why this is not directly a part of ElasticUtils? (could be useful to folks using EU outside of django)

clean up the test code

The test code is in tests/tests.py, which is kind of silly. Because there are other files in the distribution with the word "test" in the filename, that creates a series of problems.

First, it forces us to run the tests like this:

DJANGO_SETTINGS_MODULE=es_settings nosetests -w tests

Second, one big file with all the tests in it isn't necessary. Plus we really should have more test coverage for various cases. I think it's time that the monolithic tests.py should get broken up into smaller parts.

Strange errors if you don't use port 92xx

If you set your ES_HOSTS to a port outside the 9200-9299 range, say 9300, you will get an annoying failure in every single unit test.

..[snip]
File "/Users/andy/sandboxes/zamboni/vendor/lib/python/pyes/es.py", line 157, in __init__
self._init_connection()
File "/Users/andy/sandboxes/zamboni/vendor/lib/python/pyes/es.py", line 176, in _init_connection
raise RuntimeError("If you want to use thrift, please install pythrift")
RuntimeError: If you want to use thrift, please install pythrift

The cause is this:

https://github.com/mozilla/zamboni-lib/blob/master/lib/python/pyes/es.py#L172

I'm thinking that, since pyes pretty much requires ports in the 9200-9299 range unless you have thrift enabled, we should blow up early with a clear error when ES_HOSTS uses a port outside that range.
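A fail-fast check along those lines could look like the sketch below. The function name check_es_hosts and the "host:port" string format are assumptions made for illustration; the 9200-9299 range is the one described in this issue.

```python
def check_es_hosts(hosts, http_ports=range(9200, 9300)):
    """Raise ValueError early if any host uses a port outside the HTTP range."""
    for host in hosts:
        # "localhost:9200" -> port "9200"; a host without a port fails the check.
        _, _, port = host.rpartition(":")
        if not port.isdigit() or int(port) not in http_ports:
            raise ValueError(
                "ES host %r must use a port in %d-%d unless thrift is installed"
                % (host, http_ports[0], http_ports[-1])
            )
    return hosts


checked = check_es_hosts(["localhost:9200", "127.0.0.1:9201"])
```

Calling this when settings are loaded would replace the confusing thrift RuntimeError with a message that names the offending host.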

django debug toolbar awesomeness

Rob threw this together:

robhudson@3eae871

It'd be super to get that to a stable state (for whatever that means) and then toss it in contrib.django.

IndexError when using F() (aka F is for fail)

From Rob's email:

Kumar found this, and I was curious if you knew about it?

>>> S(UserProfile).filter(F(email__fuzzy='robh') | F(username__fuzzy='robh'))
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
   File "/Users/rob/git/zamboni/vendor/src/elasticutils/elasticutils/__init__.py", line 174, in __init__
     self.filters = items[0]
IndexError: list index out of range

Almost matches a similar example in the docs.

support text_phrase query

Add support for text_phrase queries.

http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

Implement "more like this"

Please? :)

http://www.elasticsearch.org/guide/reference/api/more-like-this.html

IndexMissingException when running the tests

I'm on 0.19.8 and am running the tests on current elasticutils and get a lot of exceptions.

support explain

http://www.elasticsearch.org/guide/reference/api/search/explain.html

We should support that. Srsly.

When explain = True, then the returned hit has an additional _explanation field which is this crazy-ass data structure. We want to look at search results for SUMO, so I want to search for or write a parser that turns that crazy-ass data structure into something that humans can read. That probably doesn't need to be part of this bug.
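As a rough idea of what such a parser might do, here is a sketch that flattens an explanation into indented lines. It assumes each node is a dict with "value", "description" and an optional "details" list of child nodes; the function name and the sample data are made up for illustration.

```python
def format_explanation(explanation, depth=0):
    """Return the explanation tree as a list of indented, human-readable lines."""
    lines = [
        "%s%s  (%s)" % ("  " * depth, explanation["value"], explanation["description"])
    ]
    # "details" may be missing or None on leaf nodes.
    for detail in explanation.get("details") or []:
        lines.extend(format_explanation(detail, depth + 1))
    return lines


# Made-up explanation with one nested detail.
sample = {
    "value": 0.5,
    "description": "sum of:",
    "details": [{"value": 0.5, "description": "weight(title:foo)", "details": []}],
}
readable = format_explanation(sample)
```

Printing "\n".join(readable) gives one line per scoring component, indented by nesting depth.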

add link to elasticsearch-paramedic to docs

Add a link to this in the docs:

https://github.com/karmi/elasticsearch-paramedic

There's a section for debugging/seeing-into-elasticsearch. This should probably go there.

switch to pyelasticsearch

pyes keeps changing in non-trivial ways and the API isn't particularly stable. Additionally, it's pretty intense.

I think we should do one of three things:

fork pyes and stabilize it
switch to pyelasticsearch--probably the fork that the django-haystack folks maintain
bundle the version that the django-haystack folks maintain into elasticutils and in that way maintain our own fork

This is pretty far future stuff. Probably shouldn't spend the time on this until there's a compelling reason to do so. We can hang out on pyes 0.15 for now.

support query_string

Text queries and term queries are nice, but it'd be super nice to support query_string. That allows elasticutils to do search queries using the query parser syntax.

Relevant documentation:

what does elasticutils/pyelasticsearch raise with non-JSON ES responses

Mozilla projects use a load balancer which sometimes in its infinite wisdom returns HTML responses along the lines of "oh noes! something is wrong! service unavailable!" That would cause elasticutils/pyes to raise an ElasticSearchException (not very helpful).

Need to find out what elasticutils/pyelasticsearch does in these situations and document it somewhere.

Probably best thing to do is to write a mocked test for it.
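To make the failure mode concrete, the sketch below shows one way a client wrapper could turn an HTML body into a descriptive error. This is not what elasticutils or pyelasticsearch actually do (that is exactly what this issue asks to find out); NonJSONResponseError and decode_es_response are hypothetical names.

```python
import json


class NonJSONResponseError(Exception):
    """Hypothetical error for a response body that is not valid JSON."""

    def __init__(self, status, body):
        super().__init__("non-JSON response (HTTP %s): %r" % (status, body[:60]))
        self.status = status
        self.body = body


def decode_es_response(status, body):
    """Decode a response body, keeping the status and body if it is not JSON."""
    try:
        return json.loads(body)
    except ValueError:
        # json raises a ValueError subclass on malformed input.
        raise NonJSONResponseError(status, body)


ok = decode_es_response(200, '{"ok": true}')
```

A mocked test would feed a canned "service unavailable" HTML page through the real client and assert on whatever exception comes out.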

fix facets

Currently, .facets() takes raw ES that's sort of keyword driven and also it automatically applies facet_filtered to all the facets.

It'd be better if it was args-driven and handled the obvious use case with additional global and filtered flags.

Each arg specifies a field name.
A "terms" facet is created using the field name as the facet name.
If the filtered flag is set to True, then we copy all the filters over and the filters thus affect the facets. If it's False, we don't and the facets are only affected by the query.
If the global flag is set to True, then we set global=True on the facets and they apply to the entire corpus.

Examples of usage:

searcher = S().facet('style')
searcher = S().facet('style', 'color')
searcher = S().facet('style', 'color', filtered=True)
searcher = S().facet('style', 'color', global_=True)  # "global" is a reserved word in Python, hence the trailing underscore
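The rules above can be sketched as a plain function that builds the "facets" part of a search body. This is an illustration under assumptions, not the elasticutils implementation: build_facets and its filters argument are hypothetical, the keyword is spelled global_ because global is a reserved word in Python, and the output uses the facet_filter and global keys of the ElasticSearch facets API.

```python
import copy


def build_facets(*fields, filtered=False, global_=False, filters=None):
    """Build one terms facet per field name, named after the field."""
    facets = {}
    for field in fields:
        facet = {"terms": {"field": field}}
        if filtered and filters:
            # Copy so each facet owns its filter instead of sharing one dict.
            facet["facet_filter"] = copy.deepcopy(filters)
        if global_:
            facet["global"] = True
        facets[field] = facet
    return facets


plain = build_facets("style", "color")
everything = build_facets("style", global_=True)
narrowed = build_facets("style", filtered=True, filters={"term": {"size": "L"}})
```

With neither flag set, the facets are affected only by the query, which matches the default described above.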

implement handling for other facet types

Currently .facet() only does terms facet. ElasticSearch supports other facet types:

We want to use date histogram in Input, so we have a current need for implementing that. @rlr expressed a deep-seated yearning for histogram as well.

It'd be nice if ElasticUtils supported more facet types.

add transforms to index and toc in querying docs

It's hard to see the list of all the transforms without wading through all the documentation. There are enough of them now that we should solve this in two ways:

add another document to the docs that's an API autodoc of the transforms
add these things to the index (though I think autodocing actually adds them to the index automatically)

raise exception if action doesn't exist

https://github.com/mozilla/elasticutils/blob/master/elasticutils/__init__.py#L582

That should raise an exception if the action doesn't exist. Should do the same thing that _process_filters does:

https://github.com/mozilla/elasticutils/blob/master/elasticutils/__init__.py#L102
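A sketch of the kind of check meant here is below. The exception class, the function name and the list of actions are all hypothetical stand-ins; the real action list and the error raised by _process_filters live in elasticutils itself.

```python
class UnknownActionError(ValueError):
    """Hypothetical exception for an unrecognized field action."""


# Illustrative subset only, not the library's actual list of actions.
KNOWN_ACTIONS = frozenset(
    ["text", "text_phrase", "prefix", "fuzzy", "in", "gt", "gte", "lt", "lte"]
)


def split_field_action(key, known_actions=KNOWN_ACTIONS):
    """Split "field__action" and refuse actions that are not recognized."""
    field, sep, action = key.rpartition("__")
    if not sep:
        # No "__" in the key, so there is no action to validate.
        return key, None
    if action not in known_actions:
        raise UnknownActionError("unknown action %r in %r" % (action, key))
    return field, action


parsed = split_field_action("email__fuzzy")
```

Raising here means a typo such as title__txt fails loudly instead of being silently ignored.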

highlight() support

Add highlighting/excerpting support on a par with what we have in oedipus. We put enough thought in up front that the API should work equally well with ES.

update and peg to a more recent version of pyes

There's another issue for ditching pyes, but for now, let's keep it.

This issue covers fixing elasticutils code and documentation to work with a more recent version of pyes.

One thing we have to be wary of is looking at all the pyes changes in previous versions and noting API changes. I remember Dave talking about indexes -> indices (or something similar). If there are a lot of those, then we need to note that explicitly in release notes of elasticutils so people upgrading have a clue that they're walking into a beehive of angry bees.

get listed on elasticsearch clients list

Get us listed on http://www.elasticsearch.org/guide/appendix/clients.html

Combining filter objects can modify the source filters

In some cases, combining filter objects using the & or | operator will cause the original filters to be modified. This happens when the left parent filter already uses the same combination, i.e. &ing a filter that is already an and filter.

>>> from elasticutils import F
>>> f1 = F(one=1, uno=1)
>>> f2 = F(two=1, dos=2)
>>> f1.filters
{'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]}
>>> f2.filters
{'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]}
>>> f1 & f2
<elasticutils.F object at 0x7f8160dda450>
>>> f1.filters
{'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}, {'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]}]}

Notice that the last line shows that f1 has been modified, when the expected result is that it is not (in other words, f1.filters should return {'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]} again).

@willkg pointed out that this is because the F._combine method causes the filter attribute to be shared.
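A minimal sketch of a non-mutating combine is below. It is a simplified stand-in for elasticutils.F, not the real class: it only handles plain term filters, and it deep-copies both operands so the result never shares lists with its parents.

```python
import copy


class F:
    """Simplified stand-in for elasticutils.F (term filters only)."""

    def __init__(self, **terms):
        items = [{"term": {key: value}} for key, value in sorted(terms.items())]
        if len(items) > 1:
            self.filters = {"and": items}
        elif items:
            self.filters = items[0]
        else:
            self.filters = {}

    def _combine(self, other, conn):
        combined = F()
        # Copy both sides so neither operand is modified by the combination.
        left = copy.deepcopy(self.filters)
        right = copy.deepcopy(other.filters)
        if not left or not right:
            combined.filters = left or right
        elif conn in left:
            left[conn].append(right)
            combined.filters = left
        else:
            combined.filters = {conn: [left, right]}
        return combined

    def __and__(self, other):
        return self._combine(other, "and")

    def __or__(self, other):
        return self._combine(other, "or")


f1 = F(one=1, uno=1)
f2 = F(two=1, dos=2)
before = copy.deepcopy(f1.filters)
both = f1 & f2
```

Deep copies cost a little on large filters, but they make the "combining never changes the operands" guarantee easy to reason about.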

Add a way to get all results back.

Currently, there is no way to get all the results. You have to slice with a large enough number, which is kind of hacky.

Add exclude()

This would be a nice convenience. We use this in oedipus to express not__in constraints, since we didn't bother implementing full F objects.

get_indexes and get_doctypes should always return lists

MLT expects get_indexes and get_doctypes to always return lists. Further, the names suggest they should always return lists.

I mostly fixed this a few months ago, but didn't finish up the fixes in the django contrib code.

That needs to get fixed. Also, should probably get fixed in the branch-0.5 branch where @robhudson noticed the problem.

Magic cron task.. not so magic

The cron task doesn't work as expected... well the registration bit just doesn't work.

So we should either re-write it or scrap it.

improve docs for boost

The docs for boost should be improved. At a minimum, they should link to the relevant ES docs.

facet_counts ditches date_histogram data

If you do a facet_raw with a date_histogram type, then elasticutils ignores the data. For example:

Python 2.7.3rc2 (default, Apr 22 2012, 22:35:38) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import fjord.feedback.models
>>> s = fjord.feedback.models.SimpleIndex.search()
>>> s.facet_raw(histo1={'date_histogram': {'interval': 'day', 'field': 'created'}}).facet_counts()
histo1 {u'_type': u'date_histogram', u'entries': [{u'count': 10, u'time': 1346889600000L}]}
{}
>>>

This should get fixed in two ways:

add code to handle date_histogram
add code to raise an error with the relevant details when it sees a _type it doesn't recognize
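Both fixes can be sketched together. The function below is an illustration rather than the elasticutils code: it assumes terms facets keep their data under "terms" and date_histogram facets under "entries" (as in the output above), and the choice of ValueError is arbitrary.

```python
# Maps a facet _type to the key that holds its data in the raw ES response.
FACET_DATA_KEYS = {"terms": "terms", "date_histogram": "entries"}


def facet_counts(raw_facets):
    """Extract the per-facet data, failing loudly on unknown facet types."""
    counts = {}
    for name, data in raw_facets.items():
        facet_type = data.get("_type")
        if facet_type not in FACET_DATA_KEYS:
            raise ValueError(
                "facet %r has unrecognized _type %r" % (name, facet_type)
            )
        counts[name] = data[FACET_DATA_KEYS[facet_type]]
    return counts


# Raw facet data in the shape printed in the session above.
raw = {
    "histo1": {
        "_type": "date_histogram",
        "entries": [{"count": 10, "time": 1346889600000}],
    }
}
result = facet_counts(raw)
```

Supporting another facet type then means adding one entry to the mapping instead of silently returning {}.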

add test for __in with filters

There's no test for __in with filters.

I know it works because we use it heavily in Kitsune, but we really should have a test for it in the suite.

Can't filter by none.

If you say .filter(some_key=None), EU tries to make a filter with a null value. ES throws an error about this. I believe this is because in ES, you can't filter by null, instead you have to use a "missing field filter", but I can't access the ES docs right now to figure out exactly what this is.

This gist is a script that demonstrates the problem. There is also a full stack trace. https://gist.github.com/cdd8c58fab1294503261
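Assuming the "missing" filter is indeed the right tool (the issue is unsure about this), the translation could be sketched as follows. The helper name is hypothetical, and only plain term filters are covered.

```python
def term_or_missing_filter(field, value):
    """Build a filter for field == value, treating None as "field is missing"."""
    if value is None:
        # ES rejects a term filter with a null value, so use a missing filter.
        return {"missing": {"field": field}}
    return {"term": {field: value}}


null_filter = term_or_missing_filter("some_key", None)
plain_filter = term_or_missing_filter("some_key", "abc")
```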

elasticsearch error catcher middleware for django

mozilla/zamboni@016d37c

It'd be interesting to pull that in. It catches ElasticSearch errors and throws up a "search unavailable" page instead of getting all "oh noes! i'm dying!"

es_required will throw an exception the first time it's called on a secondary thread

https://github.com/davedash/elasticutils/blob/64a55ebaed0973a0027f3a0fffdfd8b7d083a903/elasticutils/__init__.py#L21 initializes the attr on _local only for the thread importing the module. Any other thread won't see that attr, and es_required() will throw an AttributeError the first time it's used on that thread.
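The behaviour is easy to reproduce with threading.local alone. In the sketch below the attribute name es and both accessor functions are hypothetical; the point is that an attribute set at import time exists only on the importing thread, while getattr with a default works on every thread.

```python
import threading

_local = threading.local()
_local.es = None  # Set once at import time, so only on the importing thread.


def get_cached_es_unsafe():
    # Raises AttributeError on any thread other than the importing one.
    return _local.es


def get_cached_es():
    # Safe on every thread: falls back to None if the attribute is absent.
    return getattr(_local, "es", None)


errors = []
results = []


def worker():
    try:
        get_cached_es_unsafe()
    except AttributeError as exc:
        errors.append(exc)
    results.append(get_cached_es())


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

After the worker finishes, errors holds the AttributeError from the unsafe accessor and results holds the None returned by the safe one.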

Change values() to do what Django's does

In Django, values() returns dicts and values_list() returns tuples. We should do that. At the very least, we should add a values_list() that returns a list (if nobody wants to update their client code).

update elasticutils.rtfd.org

elasticutils.rtfd.org is pointing to the davedash/elasticutils version of things and needs to point to the mozilla/elasticutils version of things.

I have vague memories of having this problem when we moved kitsune from jsocol/kitsune to mozilla/kitsune and what we ended up doing was deleting the kitsune docs project (or whatever it's called on rtfd) and then re-creating it later with the correct github url.

implement "terms" query

http://www.elasticsearch.org/guide/reference/query-dsl/terms-query.html

.query(foo__in=some_list)

Should be pretty straightforward.

document that it doesn't work with ES 0.19.9

@rlr went through and tested Kitsune (which uses elasticutils) with 0.19.8, 0.19.9, 0.19.10, and 0.19.11:

https://bugzilla.mozilla.org/show_bug.cgi?id=811300

Conclusion is that Kitsune tests fail with 0.19.9. It's probably the case that elasticutils doesn't work with 0.19.9, too.

This issue is about verifying that statement and noting incompatibility in the docs.

fix get_es to be more useful

get_es() needs a way to:

allow us to pass in overrides for how it sets up the ES
if we pass in overrides, it shouldn't thread-local cache it

This fixes the problem where we really want a 30s timeout for indexing operations. Plus it might make other get_es() usages easier, too.
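The two requirements can be sketched like this. Everything here is a stand-in: FakeES replaces the real client class, DEFAULTS replaces whatever the settings provide, and the keyword names are made up for illustration.

```python
import threading

_cache = threading.local()

# Stand-in defaults; a real setup would read these from settings.
DEFAULTS = {"hosts": ["localhost:9200"], "timeout": 5}


class FakeES:
    """Stand-in for the real ES client; it just records its settings."""

    def __init__(self, **settings):
        self.settings = settings


def get_es(client_factory=FakeES, **overrides):
    """Return the cached per-thread client, or a fresh one if overridden."""
    if overrides:
        # Overridden clients are one-offs and are never cached.
        return client_factory(**dict(DEFAULTS, **overrides))
    if not hasattr(_cache, "es"):
        _cache.es = client_factory(**DEFAULTS)
    return _cache.es


default_es = get_es()
indexing_es = get_es(timeout=30)
```

Indexing code would call get_es(timeout=30) while everything else keeps sharing the cached default client.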

move django bits to contrib

This issue covers moving all the Django bits to elasticutils.contrib.django and also fixing the core so that it works without Django.

Rob and I worked out some of the kinks this causes in https://etherpad.mozilla.org/elasticutils-s

I'll follow that guide and rework things.

improve docs for query

The docs lack an "advanced queries" section similar to the "advanced filters" section. Namely this should document: or_, setting up boolean queries, ...

support boosting query

Add support for boosting query:

http://www.elasticsearch.org/guide/reference/query-dsl/boosting-query.html

I'm not really sure how to do this because the boosting query has a bunch of stuff you need to specify so it doesn't nicely fit in with our field__action=query motif.

pyelasticsearch bulk_index doesn't work with ES 0.17.9

I wrote up the issue here:

pyelasticsearch/pyelasticsearch#59

Bulk indexing is pretty key. I'm pretty sure there are Mozilla projects that are still using ES pre-0.18.0, so I think we need to wait until this is fixed before we do a release so we can depend on the correct version of pyelasticsearch.

Nix the catch all exception, log and reraise in raw()

except Exception:
    log.error(qs)
    raise

I think it's better if we just let the application log whatever it wants since it has to handle the exception already. We hooked up SUMO to Sentry and added a proper root handler and now are getting flooded with these useless log error messages: https://errormill.mozilla.org/support/

Does that make sense?

add more docs showing elasticsearch rest examples

Rob mentioned https://gist.github.com/2895421

Adding more docs along those lines would be super helpful especially in showing how elasticutils API translates to equivalent elasticsearch REST calls. If someone knows elasticsearch already, that makes it much easier to use elasticutils. If someone doesn't, then it'll make it easier to learn elasticsearch.

weight() method

Add a weight() method to S as in http://github.com/erikrose/oedipus so we can have a portable (and shorter) way to weight things. This will also let us put default weights on an S object rather than having to always set them at query time.

support highlight

The sumo branch has highlighting/excerpting code. We should port that over, elasticsearch-ify it, and possibly change it to match django-haystack's syntax a bit more closely.

Add order_by() information to the docs

please :)

search results handle no-id case poorly

# This errors out.

import elasticutils

elasticutils.get_es().index({'object_id': 'id2'}, 'some_object_index', 'object', id='query_hash_1')
elasticutils.get_es().index({'object_id': 'some_object_id'}, 'some_object_index', 'object', id='query_hash_2')

[a for a in elasticutils.S().filter(object_id='some_object_id').indexes('some_object_index')]



# This works.

import elasticutils

elasticutils.get_es().index({'id': 1, 'object_id': 'id2'}, 'some_object_index', 'object', id='query_hash_1')
elasticutils.get_es().index({'id': 2, 'object_id': 'some_object_id'}, 'some_object_index', 'object', id='query_hash_2')

[a for a in elasticutils.S().filter(object_id='some_object_id').indexes('some_object_index')]

What's going on is that we default to asking for at least the id field, but there is no id field. When none of the requested fields exist, ES doesn't return a fields key in the result dict. Need to handle that case better.
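Handling that case could be as small as the sketch below. The helper name is hypothetical and the hit dicts are made-up examples in the general shape of an ES search hit.

```python
def extract_fields(hit):
    """Return the fields of a search hit, or an empty dict if ES omitted them."""
    return hit.get("fields") or {}


# A hit for a document without an id field: ES leaves out "fields" entirely.
hit_without_fields = {"_index": "some_object_index", "_type": "object", "_id": "query_hash_2"}
hit_with_fields = dict(hit_without_fields, fields={"id": 2})

empty = extract_fields(hit_without_fields)
present = extract_fields(hit_with_fields)
```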

require django or not

This issue covers deciding whether elasticutils should be a Django library or not, and the work that results from that decision.

If it is to be a Django library, then we need to update the docs, remove the "if we're not using Django, then ..." code, and update the requirements files.

If it is not to be a Django library, then we need to clean up the codebase so that it works in non-Django contexts better. Personally, I think we should change this around so that it's a library with a Django-library-shim where the latter is what makes it a Django library. It's possible the two could come together in the same package---maybe as a separate djangolib module or something.

support weights/boosts

elasticsearch allows you to apply boosts in the query. I would like to add support for that.

The sumo branch has a .weight() transform that allows you to specify weights. We could use that, but I'd be interested in seeing other approaches.
