elasticutils's Issues

support declarative mappings

elasticutils currently doesn't do much on the indexing side of things. There are a few possible answers on how to solve this that have been implemented and are lying around in the various elasticutils uses.

django-haystack has an approach that I like. In 2.0, they create a SearchIndex class that works much like Django models. This has two interesting effects:

  1. It allows us to nix the implicit relationship between Django models and the documents that are indexed.
  2. It allows us to more easily specify a mapping with defaults that are sane.

Rob is working on this right now in https://github.com/robhudson/elasticutils/tree/declarative-mapping . He's using pyes to do it because it's got a lot of the bits implemented already.

I think we should go through the django-haystack SearchIndex and figure out what bits we want for the elasticutils SearchIndex.

This bug is for continuing and finalizing that work.

Make faceting on booleans return True and False instead of "T" and "F"

Consider an index with a boolean value 'happy'.

>>> S().facet('happy').facet_counts()
{u'happy': [{u'count': 600, u'term': u'T'}, {u'count': 600, u'term': u'F'}]}

Notice that term is "T" or "F". It would make a lot more sense for it to be True or False, considering this is a boolean field. This is because, for some reason, ElasticSearch does it this way. When interacting with this field it gives back the proper json true and false, but in facets it does not.

Can we smooth over this oddity?
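One possible way to smooth this over is to post-process the facet counts on the client side. The following is only a sketch: `normalize_boolean_terms` and its `boolean_fields` argument are hypothetical names, not part of elasticutils, and it assumes the caller already knows which faceted fields are boolean.

```python
BOOLEAN_TERMS = {'T': True, 'F': False}


def normalize_boolean_terms(facet_counts, boolean_fields):
    """Return a copy of facet_counts with 'T'/'F' terms mapped to bools.

    Only facets named in boolean_fields are touched, so a string field
    that happens to contain 'T' or 'F' is left alone.
    """
    fixed = {}
    for name, entries in facet_counts.items():
        if name in boolean_fields:
            entries = [
                dict(entry, term=BOOLEAN_TERMS.get(entry['term'], entry['term']))
                for entry in entries
            ]
        fixed[name] = entries
    return fixed


counts = {'happy': [{'count': 600, 'term': 'T'}, {'count': 600, 'term': 'F'}]}
fixed = normalize_boolean_terms(counts, boolean_fields={'happy'})
```

The input dict is not modified; each entry is copied before its term is replaced.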

remove statsd call from django contrib code

I was looking for statsd support in ElasticUtils, and found it in contrib.django, which is great :)

However, is there any reason why this is not directly a part of ElasticUtils? (could be useful to folks using EU outside of django)

clean up the test code

The test code lives in tests/tests.py, which is kind of silly. Because other files in the distribution also have the word "test" in their filenames, this creates a series of problems.

First, it forces us to run the tests like this:

DJANGO_SETTINGS_MODULE=es_settings nosetests -w tests

Second, one big file with all the tests in it isn't necessary. Plus we really should have more test coverage for various cases. I think it's time to break the monolithic tests.py up into smaller parts.

Strange errors if you don't use port 92xx

If you set your ES_HOSTS to a port not in the 9200-9299 range, say 9300, you will get an annoying failure in every single failing unit test.

..[snip]
File "/Users/andy/sandboxes/zamboni/vendor/lib/python/pyes/es.py", line 157, in __init__
self._init_connection()
File "/Users/andy/sandboxes/zamboni/vendor/lib/python/pyes/es.py", line 176, in _init_connection
raise RuntimeError("If you want to use thrift, please install pythrift")
RuntimeError: If you want to use thrift, please install pythrift

The cause is this:

https://github.com/mozilla/zamboni-lib/blob/master/lib/python/pyes/es.py#L172

I'm thinking that since pyes pretty much requires ports to be in the 92xx range unless you have thrift enabled, we should fail loudly with a clear error message when a configured port falls outside that range.
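A rough sketch of what such an early check could look like. `check_http_port` is a hypothetical helper, not an elasticutils or pyes function, and the 9200-9299 bounds are simply the range mentioned above.

```python
def check_http_port(host, low=9200, high=9299):
    """Raise ValueError early if the port in 'hostname:port' is out of range."""
    _, sep, port = host.rpartition(':')
    if not sep or not port.isdigit():
        raise ValueError('expected "hostname:port", got %r' % host)
    if not low <= int(port) <= high:
        raise ValueError(
            'port %s in %r is outside %d-%d, so pyes would try thrift for it'
            % (port, host, low, high))
    return host
```

Running every entry of ES_HOSTS through a check like this at startup would replace the confusing thrift error with one that names the offending host.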

IndexError when using F() (aka F is for fail)

From Rob's email:

Kumar found this, and I was curious if you knew about it?

>>> S(UserProfile).filter(F(email__fuzzy='robh') | F(username__fuzzy='robh'))
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
   File "/Users/rob/git/zamboni/vendor/src/elasticutils/elasticutils/__init__.py", line 174, in __init__
     self.filters = items[0]
IndexError: list index out of range

Almost matches a similar example in the docs.

support explain

http://www.elasticsearch.org/guide/reference/api/search/explain.html

We should support that. Srsly.

When explain = True, then the returned hit has an additional _explanation field which is this crazy-ass data structure. We want to look at search results for SUMO, so I want to search for or write a parser that turns that crazy-ass data structure into something that humans can read. That probably doesn't need to be part of this bug.
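As a starting point for such a parser, here is a minimal sketch. It assumes each explanation node is a dict with `value`, `description`, and an optional `details` list of child nodes; the `format_explanation` name and the sample data are made up for illustration.

```python
def format_explanation(explanation, depth=0):
    """Render a nested explanation dict as indented 'value = description' lines."""
    indent = '  ' * depth
    lines = ['%s%s = %s' % (indent, explanation['value'], explanation['description'])]
    # 'details' may be absent or empty on leaf nodes.
    for detail in explanation.get('details') or []:
        lines.extend(format_explanation(detail, depth + 1))
    return lines


# Made-up sample, not real Elasticsearch output.
sample = {
    'value': 0.5,
    'description': 'sum of:',
    'details': [
        {'value': 0.25, 'description': 'weight(title:cookie)'},
        {'value': 0.25, 'description': 'weight(body:cookie)', 'details': []},
    ],
}
text = '\n'.join(format_explanation(sample))
```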

switch to pyelasticsearch

pyes keeps changing in non-trivial ways and the API isn't particularly stable. Additionally, it's pretty intense.

I think we should do one of three things:

  1. fork pyes and stabilize it
  2. switch to pyelasticsearch--probably the fork that the django-haystack folks maintain
  3. bundle the version that the django-haystack folks maintain into elasticutils and in that way maintain our own fork

This is pretty far future stuff. Probably shouldn't spend the time on this until there's a compelling reason to do so. We can hang out on pyes 0.15 for now.

what does elasticutils/pyelasticsearch raise with non-JSON ES responses

Mozilla projects use a load balancer which sometimes in its infinite wisdom returns HTML responses along the lines of "oh noes! something is wrong! service unavailable!" That would cause elasticutils/pyes to raise an ElasticSearchException (not very helpful).

Need to find out what elasticutils/pyelasticsearch does in these situations and document it somewhere.

Probably best thing to do is to write a mocked test for it.

fix facets

Currently, .facets() takes raw ES syntax that's sort of keyword-driven, and it automatically applies facet_filtered to all the facets.

It'd be better if it was args-driven and handled the obvious use case with additional global and filtered flags.

  • Each arg specifies a field name.
  • A "terms" facet is created using the field name as the facet name.
  • If the filtered flag is set to True, then we copy all the filters over and the filters thus affect the facets. If it's False, we don't and the facets are only affected by the query.
  • If the global flag is set to True, then we set global=True on the facets and they apply to the entire corpus.

Examples of usage:

searcher = S().facet('style')
searcher = S().facet('style', 'color')
searcher = S().facet('style', 'color', filtered=True)
# "global" is a reserved word in Python, so the keyword needs another spelling, e.g. global_
searcher = S().facet('style', 'color', global_=True)
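The rules above could translate into facet definitions roughly as follows. This is a sketch under assumptions: `build_facets` is a hypothetical helper, the `facet_filter` and `global` keys follow the old Elasticsearch facets syntax as I understand it, and the flag is spelled `global_` because `global` is a reserved word in Python.

```python
def build_facets(*fields, filtered=False, global_=False, filters=None):
    """Build one 'terms' facet per field name, named after the field."""
    facets = {}
    for field in fields:
        facet = {'terms': {'field': field}}
        if filtered and filters:
            # Copy the search filters over so they also restrict the facet.
            facet['facet_filter'] = filters
        if global_:
            # Count over the entire corpus rather than just the query results.
            facet['global'] = True
        facets[field] = facet
    return facets


plain = build_facets('style', 'color')
everything = build_facets('style', global_=True)
narrowed = build_facets('style', filtered=True, filters={'term': {'color': 'red'}})
```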

implement handling for other facet types

add transforms to index and toc in querying docs

It's hard to see the list of all the transforms without wading through all the documentation. There are enough of them now that we should solve this in two ways:

  1. add another document to the docs that's an API autodoc of the transforms
  2. add these things to the index (though I think autodocing actually adds them to the index automatically)

highlight() support

Add highlighting/excerpting support on a par with what we have in oedipus. We put enough thought in up front that the API should work equally well with ES.

update and peg to a more recent version of pyes

There's another issue for ditching pyes, but for now, let's keep it.

This issue covers fixing elasticutils code and documentation to work with a more recent version of pyes.

One thing we have to be wary of is looking at all the pyes changes in previous versions and noting API changes. I remember Dave talking about indexes -> indices (or something similar). If there are a lot of those, then we need to note that explicitly in release notes of elasticutils so people upgrading have a clue that they're walking into a beehive of angry bees.

Combining filter objects can modify the source filters

In some cases, combining filter objects using the & or | operator will cause the original filters to be modified. This happens when the left parent filter already uses the same combination, i.e., when and-ing a filter that is already an and filter.

>>> from elasticutils import F
>>> f1 = F(one=1, uno=1)
>>> f2 = F(two=1, dos=2)
>>> f1.filters
{'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]}
>>> f2.filters
{'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]}
>>> f1 & f2
<elasticutils.F object at 0x7f8160dda450>
>>> f1.filters
{'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}, {'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]}]}

Notice that the last line shows that f1 has been modified, when the expected result is that it is not (in other words, f1.filters should return {'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]} again).

@willkg pointed out that this is because the F._combine method causes the filter attribute to be shared.
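One way to avoid the sharing is to copy both operands before combining them. The class below is a simplified stand-in for illustration only: it is not the real elasticutils.F (which takes keyword arguments and has a _combine method), and it models only the `&` case.

```python
import copy


class F(object):
    """Simplified stand-in for elasticutils.F; takes a ready-made filter dict."""

    def __init__(self, filters=None):
        self.filters = filters or {}

    def __and__(self, other):
        # Deep-copy both sides so the result shares no lists with its parents.
        left = copy.deepcopy(self.filters)
        right = copy.deepcopy(other.filters)
        if 'and' in left:
            left['and'].append(right)
            return F(left)
        return F({'and': [left, right]})


f1 = F({'and': [{'term': {'uno': 1}}, {'term': {'one': 1}}]})
f2 = F({'and': [{'term': {'dos': 2}}, {'term': {'two': 1}}]})
combined = f1 & f2
```

After the combination, `f1.filters` still holds only its two original term filters, while `combined.filters` holds the nested result.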

Add exclude()

This would be a nice convenience. We use this in oedipus to express not__in constraints, since we didn't bother implementing full F objects.

get_indexes and get_doctypes should always return lists

MLT expects get_indexes and get_doctypes to always return lists. Further, the names suggest they should always return lists.

I mostly fixed this a few months ago, but didn't finish up the fixes in the django contrib code.

That needs to get fixed. Also, should probably get fixed in the branch-0.5 branch where @robhudson noticed the problem.

Magic cron task.. not so magic

The cron task doesn't work as expected... well the registration bit just doesn't work.

So we should either re-write it or scrap it.

improve docs for boost

The docs for boost should be improved. At a minimum, they should link to the relevant ES docs.

facet_counts ditches date_histogram data

If you do a facet_raw with a date_histogram type, then elasticutils ignores the data. For example:

Python 2.7.3rc2 (default, Apr 22 2012, 22:35:38) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import fjord.feedback.models
>>> s = fjord.feedback.models.SimpleIndex.search()
>>> s.facet_raw(histo1={'date_histogram': {'interval': 'day', 'field': 'created'}}).facet_counts()
histo1 {u'_type': u'date_histogram', u'entries': [{u'count': 10, u'time': 1346889600000L}]}
{}
>>> 

This should get fixed in two ways:

  1. add code to handle date_histogram
  2. add code to raise an error with the relevant details when it sees a _type it doesn't recognize
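Both fixes could be sketched together as a dispatch on `_type`. `extract_facet_counts` is a hypothetical name, and the sketch assumes terms facets keep their rows under `terms` while date_histogram facets keep them under `entries`, as in the output above.

```python
def extract_facet_counts(raw_facets):
    """Pull the per-facet rows out of a raw facets response, by facet type."""
    counts = {}
    for name, data in raw_facets.items():
        facet_type = data.get('_type')
        if facet_type == 'terms':
            counts[name] = data['terms']
        elif facet_type == 'date_histogram':
            counts[name] = data['entries']
        else:
            # Fail loudly instead of silently dropping the data.
            raise ValueError(
                'facet %r has unrecognized _type %r' % (name, facet_type))
    return counts


raw = {'histo1': {'_type': 'date_histogram',
                  'entries': [{'count': 10, 'time': 1346889600000}]}}
counts = extract_facet_counts(raw)
```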

add test for __in with filters

There's no test for __in with filters.

I know it works because we use it heavily in Kitsune, but we really should have a test for it in the suite.

Can't filter by none.

If you say .filter(some_key=None), EU tries to make a filter with a null value. ES throws an error about this. I believe this is because in ES, you can't filter by null, instead you have to use a "missing field filter", but I can't access the ES docs right now to figure out exactly what this is.

This gist is a script that demonstrates the problem. There is also a full stack trace. https://gist.github.com/cdd8c58fab1294503261
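If the "missing field filter" is indeed the right tool, the translation might look like the sketch below. It assumes the older Elasticsearch filter syntax `{'missing': {'field': ...}}`; `term_or_missing_filter` is a hypothetical helper, not existing elasticutils code.

```python
def term_or_missing_filter(field, value):
    """Build a term filter, or a 'missing' filter when the value is None."""
    if value is None:
        # Assumed syntax: matches documents with no value for the field.
        return {'missing': {'field': field}}
    return {'term': {field: value}}
```

With this, `.filter(some_key=None)` would produce `{'missing': {'field': 'some_key'}}` rather than a term filter with a null value.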

Change values() to do what Django's does

In Django, values() returns dicts and values_list() returns tuples of values. We should do that. At the very least, we should add a values_list() that returns a list (if nobody wants to update their client code).

update elasticutils.rtfd.org

elasticutils.rtfd.org is pointing to the davedash/elasticutils version of things and needs to point to the mozilla/elasticutils version of things.

I have vague memories of having this problem when we moved kitsune from jsocol/kitsune to mozilla/kitsune and what we ended up doing was deleting the kitsune docs project (or whatever it's called on rtfd) and then re-creating it later with the correct github url.

fix get_es to be more useful

get_es() needs a way to:

  1. allow us to pass in overrides for how it sets up the ES
  2. if we pass in overrides, it shouldn't thread-local cache it

This fixes the problem where we really want a 30s timeout for indexing operations. Plus it might make other get_es() usages easier, too.
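A minimal sketch of that behavior, assuming a thread-local cache. The `factory` parameter is a hypothetical stand-in for whatever constructs the ES client, and `DEFAULTS` is an invented settings dict; the real get_es() signature may differ.

```python
import threading

_local = threading.local()
DEFAULTS = {'timeout': 5}  # assumed default settings, for illustration only


def get_es(factory, **overrides):
    """Return the cached client, or a fresh uncached one when overrides are given."""
    if overrides:
        # Overridden clients are built on demand and never cached.
        return factory(**dict(DEFAULTS, **overrides))
    if not hasattr(_local, 'es'):
        _local.es = factory(**DEFAULTS)
    return _local.es


# dict stands in for a real client class so the example runs anywhere.
default_client = get_es(dict)
indexing_client = get_es(dict, timeout=30)
```

Indexing code could then ask for a 30s timeout without affecting the cached client that the rest of the thread uses.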

improve docs for query

The docs lack an "advanced queries" section similar to the "advanced filters" section. Namely this should document: or_, setting up boolean queries, ...

add more docs showing elasticsearch rest examples

Rob mentioned https://gist.github.com/2895421

Adding more docs along those lines would be super helpful especially in showing how elasticutils API translates to equivalent elasticsearch REST calls. If someone knows elasticsearch already, that makes it much easier to use elasticutils. If someone doesn't, then it'll make it easier to learn elasticsearch.

weight() method

Add a weight() method to S as in http://github.com/erikrose/oedipus so we can have a portable (and shorter) way to weight things. This will also let us put default weights on an S object rather than having to always set them at query time.

support highlight

The sumo branch has highlighting/excerpting code. We should port that over, elasticsearch-ify it, and possibly change it to match django-haystack's syntax a bit more closely.

search results handle no-id case poorly

# This errors out.

import elasticutils

elasticutils.get_es().index({'object_id': 'id2'}, 'some_object_index', 'object', id='query_hash_1')
elasticutils.get_es().index({'object_id': 'some_object_id'}, 'some_object_index', 'object', id='query_hash_2')

[a for a in elasticutils.S().filter(object_id='some_object_id').indexes('some_object_index')]



# This works.

import elasticutils

elasticutils.get_es().index({'id': 1, 'object_id': 'id2'}, 'some_object_index', 'object', id='query_hash_1')
elasticutils.get_es().index({'id': 2, 'object_id': 'some_object_id'}, 'some_object_index', 'object', id='query_hash_2')

[a for a in elasticutils.S().filter(object_id='some_object_id').indexes('some_object_index')]

What's going on is that we default to asking for at least the id field, but there is no id field. When none of the requested fields exist, ES doesn't return a fields key in the result dict. We need to handle that case better.
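Handling that case could be as small as the following sketch. `hit_to_result` is a hypothetical helper, and falling back to the document's `_id` when no `id` field comes back is one possible design choice, not something the issue settles.

```python
def hit_to_result(hit):
    """Build a result dict from a raw hit, tolerating a missing 'fields' key."""
    result = dict(hit.get('fields') or {})
    # Fall back to the document's _id when no 'id' field was returned.
    result.setdefault('id', hit.get('_id'))
    return result
```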

require django or not

This issue covers the decision and resulting work required for deciding on whether elasticutils should be a Django library or not.

If it is to be a Django library, then we need to update the docs, remove the "if we're not using Django, then ..." code, and update the requirements files.

If it is not to be a Django library, then we need to clean up the codebase so that it works in non-Django contexts better. Personally, I think we should change this around so that it's a library with a Django-library-shim where the latter is what makes it a Django library. It's possible the two could come together in the same package---maybe as a separate djangolib module or something.

support weights/boosts

elasticsearch allows you to apply boosts in the query. I would like to add support for that.

The sumo branch has a .weight() transform that allows you to specify weights. We could use that, but I'd be interested in seeing other approaches.
