This sample app is a PoC for implementing faceted search and autocompletion using elasticsearch. It uses elasticsearch as a REST backend and React for the frontend. Also included is a logstash configuration to load some sample data into elasticsearch.
To run this app you'll need:
- elasticsearch
- logstash
- webpack
- nodejs
- npm
To build and run the code, follow these steps:
- Install dependencies:
npm install
- Build the bundle:
webpack
- Open the app in your browser:
firefox file://./index.html
The faceted search part of this proof-of-concept means that below the search box, next to the result list, a few categories ("facets") are shown; clicking an element filters the results down to that category. For example, there is a list of countries, and clicking one limits the results to companies from that country.
This is implemented in elasticsearch by doing a search and executing aggregations on those categories.
curl -XPOST 'http://localhost:9200/companies/_search?request_cache=true&search_type=count' -d '{
"query": {
"bool": {
"must": []
}
},
"aggs": {
"activities": {
"terms": {
"field": "activity_code"
}
},
"countries": {
"terms": {
"field": "country_code"
}
},
"years": {
"date_range": {
"field": "incorporation_date",
"format": "yyyy",
"ranges": [
{ "to": "1950" },
{ "from": "1950", "to": "1975" },
{ "from": "1975", "to": "2000" },
{ "from": "2000" }
]
}
}
}
}'
The above query will return the total number of search results and the top entries plus counts in each category.
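An abbreviated response is sketched below. The counts are made up for illustration, and with search_type=count the hits array is empty:

```json
{
  "took": 12,
  "hits": {
    "total": 1000000,
    "hits": []
  },
  "aggregations": {
    "activities": {
      "buckets": [
        { "key": "123", "doc_count": 4321 }
      ]
    },
    "countries": {
      "buckets": [
        { "key": "NL", "doc_count": 2500 }
      ]
    },
    "years": {
      "buckets": [
        { "key": "*-1950", "doc_count": 150 }
      ]
    }
  }
}
```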
Please also see the section about caching below. Uncached, the above query is very slow (seconds...) and can only be executed efficiently with caching. The downside of the cache is that it does not store the actual hits, so we need a second query to retrieve them. Luckily the second query needs only hits (no aggregations) and is therefore much, much faster. (Note we could optimise this and execute both requests in one msearch/bulk call.)
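A combined msearch call could look like the sketch below. The bodies are abbreviated (aggs and size as in the queries elsewhere in this document); _msearch expects alternating header and body lines, each on a single line, and an empty header `{}` uses the defaults:

```shell
curl -XPOST 'http://localhost:9200/companies/_msearch' -d '
{"search_type": "count"}
{"query": {"bool": {"must": []}}, "aggs": {"countries": {"terms": {"field": "country_code"}}}}
{}
{"query": {"bool": {"must": []}}, "size": 10}
'
```

Whether the request cache applies per sub-request here may depend on the elasticsearch version and index settings, so this is a sketch rather than a drop-in replacement for the two separate requests.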
Some docs (companies) might not have a value for the field we're calculating the facet on. This can be addressed by specifying how to handle them, see here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_missing_value_12
"activities": {
"terms": {
"field": "activity_code",
"missing": "N/A"
}
}
This means an extra bucket is added for docs missing the field.
There is also a dedicated missing aggregation that counts docs missing a specific field: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-missing-aggregation.html
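For example, to count companies without an activity_code (the aggregation name companies_without_activity is arbitrary):

```shell
curl -XPOST 'http://localhost:9200/companies/_search?search_type=count' -d '{
  "aggs": {
    "companies_without_activity": {
      "missing": { "field": "activity_code" }
    }
  }
}'
```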
To allow cross-origin requests, enable this in your elasticsearch config:
http.cors:
enabled: true
allow-origin: /.*/
Note this is only needed when running the app locally, because the browser accesses elasticsearch directly.
Because the initial faceted-search aggregations run over the full dataset, performance is an important subject! For example, aggregating over 30 million docs can easily take up to a second, which would result in an unacceptable user experience.
To avoid such trouble, the aggregations for faceted search should be cached. See here:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/shard-request-cache.html
Configuration in elasticsearch.yml:
indices.requests.cache.size: 10%
Caching also needs to be enabled for the index, either via the index settings or per request:
index.requests.cache.enable: true
Note that only the aggregations and total hit count are cached, and only for queries with search_type=count; this means the result list can't be fetched in the same query. In this prototype I therefore issue two search requests: one for the aggregations (cached) and one for the top results (uncached, but very fast).
Cache operation can be verified by checking the cache contents. You should see data is being cached:
curl -XGET 'http://localhost:9200/_nodes/stats/indices/request_cache?human'
On my laptop, the initial (uncached) execution of this query takes ~2 seconds; the second (cached) execution takes around 5-10 milliseconds.
For a nicer user experience, autocompletion is performed when typing a city name. This is implemented with the completion suggester in elasticsearch; see this guide for a more complete explanation: https://www.elastic.co/blog/you-complete-me
The auto-complete on the frontend is built with react-autocomplete, see https://github.com/reactjs/react-autocomplete
As the basis for the completion in elasticsearch, a special field for the suggestions was added to the mapping, see logstash/template.json:
"city_suggest": {
"type": "completion"
}
In logstash we fill this field with a copy of the city, see logstash.conf:
mutate {
add_field => {
"city_suggest" => "%{city}"
}
}
In the application we can then query the _suggest endpoint to fetch suggestions:
curl -X POST localhost:9200/companies/_suggest -d '{
"city" : {
"text" : "a",
"completion" : {
"field" : "city_suggest"
}
}
}'
This part describes some of the queries/filters the application uses to filter results. Whenever you enter a query (e.g. a city name) or click one of the filters, it is added to your search query and the results are filtered.
The basic structure of the query is:
"query": {
"bool": {
"must": []
}
}
All required filters and queries go inside the "must" element.
For simple (non-full-text) filtering by a value, e.g. by an activity code (a number), we use a terms query:
{ "terms": { "activity_code": [ 123 ] } }
There are many text-based query options; please read the elasticsearch docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
In this sample app the match query is used:
{ "match": { "city": "Amsterdam" }}
To filter on dates, for example a company's year of incorporation, we use a range filter. For example, to filter on companies incorporated between 1995 and 1998, run the query below:
{ "range": { "incorporation_date": { "gte": "1995", "lte": "1998", "format": "yyyy" }}}
Note the format parameter: it specifies the format of the dates entered in this query and has no effect on how dates are stored or indexed in elasticsearch. It's used here so we only need to type the years.
You can use a simple range query to filter by string range. For example, to filter on all postcodes between 1111AA and 9999ZZ, use the filter below:
{ "range": { "postcode": { "gte": "1111AA", "lte": "9999ZZ" }}}
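Putting these pieces together, a query combining a city match, an activity filter and an incorporation-date range would look like this (values are examples):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "city": "Amsterdam" } },
        { "terms": { "activity_code": [ 123 ] } },
        { "range": { "incorporation_date": { "gte": "1995", "lte": "1998", "format": "yyyy" } } }
      ]
    }
  }
}
```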
Note this is not currently implemented in the application, but documented here for future use! Please see here for a more complete explanation: https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html
An example of querying by geo distance (distance from a specified point), here with a radius of 5 kilometres around coordinate 52.22, 4.54 (Amsterdam central station):
{
"geo_distance" : {
"distance" : "5km",
"pin.location" : {
"lat" : 52.22,
"lon" : 4.54
}
}
}
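If this were added to the application, the geo_distance clause would go into the bool query alongside the other filters. A sketch, assuming the mapping has a geo_point field called pin.location as in the snippet above:

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "5km",
            "pin.location": { "lat": 52.22, "lon": 4.54 }
          }
        }
      ]
    }
  }
}
```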
To be able to search by geo-distance based on address (city) or postcode fields, those will first need to be translated to GPS (lat/long) coordinates. That should happen when loading the data, so it could e.g. be implemented in logstash.
See here for an example api that can be used: https://www.postcodeapi.nu