elastic / elasticsearch-definitive-guide

The Definitive Guide to Elasticsearch

Home Page: https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

License: Other

HTML 42.69% Perl 10.55% CSS 13.43% XSLT 7.03% Python 26.30%

elasticsearch-definitive-guide's Introduction

Elasticsearch: The Definitive Guide

This repository contains the source for the legacy Elasticsearch: The Definitive Guide documentation and is no longer maintained. For the latest information, see the current Elasticsearch documentation.

Building the Definitive Guide

In order to build this project, we rely on our docs infrastructure.

To build the HTML of the complete project, run the following commands:

# clone this repo
git clone git@github.com:elastic/elasticsearch-definitive-guide.git
# clone the docs build infrastructure
git clone git@github.com:elastic/docs.git
# Build HTML and open a browser
cd elasticsearch-definitive-guide
../docs/build_docs.pl --doc book.asciidoc --open

This assumes that you have all necessary prerequisites installed. For a more complete reference, please refer to the README in the docs repo.

The Definitive Guide is written in Asciidoc and the docs repo also contains a short Asciidoc guide.

Supported versions

The Definitive Guide is available for multiple versions of Elasticsearch:

The 1.x branch applies to Elasticsearch 1.x
The 2.x and master branches apply to Elasticsearch 2.x

Contributing

This repository is no longer maintained. Pull requests and issues will not be addressed.

To contribute to the current Elasticsearch docs, refer to the Elasticsearch repository.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

See http://creativecommons.org/licenses/by-nc-nd/3.0/ for the full text of the License.

elasticsearch-definitive-guide's People

Contributors

polyfractal


elasticsearch-definitive-guide's Issues

Hyphenation of keywords

As an example, multi_match on page 279 is split over two pages. I am not a big fan of hyphenation at all, but I think that constant-width words should never be hyphenated.

A Suggestion For The "Retrieving Multiple Documents" Page

Since you've shown the HTTP headers that result from requesting single documents, it would be worthwhile to show the HTTP headers that result from requesting multiple documents when one of the documents doesn't exist.

Missing Word

(Sorry about all these. I'm sending them in as I find them.)

On the "Partial Updates to Documents" page:

"we said that the smaller window between the retrieve and reindex steps, the smaller the opportunity for conflicting changes" ->

"we said that the smaller the window between the retrieve and reindex steps, the smaller the opportunity for conflicting changes"

typo in "navigating this book" page

Came across a typo here:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_navigating_this_book.html

Bullet #6, Line #2:
"...not as easy in a search egine as"

JavaScript does not allow request bodies with a GET request?

Here (in the Tip section) https://github.com/elasticsearch/elasticsearch-definitive-guide/blob/master/054_Query_DSL/55_Request_body_search.asciidoc it reads:

Some languages, such as JavaScript, don’t allow request bodies with a GET request.

Is it really a limitation of the JavaScript language?

Examples at 010_Intro/25_Tutorial_Indexing raise errors

While playing with the tutorial, I found that Employees 2 and 3 raise this error:

{
  "error" : "MapperParsingException[object mapping for [employee] tried to parse as object, but got EOF, has a concrete value been provided to it?]",
  "status" : 400
}

After some digging through the examples, it appears that the issue is related to

"about": {
    "bio":         "Eco-warrior and defender of the weak",
    "age":         25,
    "interests": [ "dolphins", "whales" ]
},

for the first employee, and for the second employee

"about": "I like to collect rock albums",

By converting 'about' to an object, it works as intended:

"about": { "interests": "I like to collect rock albums" }

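The mismatch reported here (an object for 'about' in one document, a plain string in another) can be spotted before indexing. The following is a minimal Python sketch, not taken from the guide: the helper names `json_kind` and `find_type_conflicts` are hypothetical, and the check only looks at top-level fields.

```python
import json

def json_kind(value):
    """Return a coarse JSON type name for a decoded value."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Elasticsearch has no separate array type: an array takes the type of its elements.
        return json_kind(value[0]) if value else "null"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bool):  # checked before numbers because bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "null"

def find_type_conflicts(docs):
    """Map each top-level field to the kinds seen for it, keeping only fields with mixed kinds."""
    seen = {}
    for doc in docs:
        for field, value in doc.items():
            seen.setdefault(field, set()).add(json_kind(value))
    return {field: kinds for field, kinds in seen.items() if len(kinds) > 1}

# The two shapes of "about" quoted in this issue.
employee_1 = json.loads(
    '{"about": {"bio": "Eco-warrior and defender of the weak", '
    '"age": 25, "interests": ["dolphins", "whales"]}}'
)
employee_2 = json.loads('{"about": "I like to collect rock albums"}')

conflicts = find_type_conflicts([employee_1, employee_2])
print(conflicts)
```

A non-empty result means at least one field would be mapped as an object by one document and as a concrete value by another, which is the situation the MapperParsingException complains about.
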
Issues From The Intro

The following are some issues I've found that aren't simple text changes. I've tried to make it clear what file the specific issues are from.

[Preface.asciidoc]

The Preface says:

"[The Elasticsearch documentation] assumes that you are intimately familiar with information retrieval concepts, distributed systems, the query DSL and a host of other topics. This book makes no such assumptions. It has been written so that a complete beginner — to both search and distributed systems — can pick it up and start building a prototype within a few chapters."

and

"We explain concepts from first principles, helping novices gain sure footing in the complex world of search."

I don't think novices will know what distributed scalable real-time search and analytics engine, text search, structured search, analytics, structured and unstructured data, and the like are. I'm also not sure what "first principles" means in this case, but it would be great if you could provide a general description of these concepts somewhere early in the book, maybe in a pre-Preface Preface. As is, there are too many terms that a novice, or even a more experienced reader, might not know.

[05_What_is_it.asciidoc]

I'm bothered by the quick mention of Apache Lucene. It's described as a search
engine library. Is this enough information for new users?

"Document store" is used for the first time. A new user won't know how this
compares to a traditional relational database.

[10_Installing_ES.asciidoc]

"When installing Elasticsearch in production, you can use the method described above, "
What does installing into production have to do with anything? Installing from source
or packages can be done in any tier.

Say whether to install the jre or the jdk

Does it make sense to include "View in Sense" links in the text? Ideally these would
only appear in the online version. I'm not sure how you're planning on handling differences where something will appear online but not on paper.

You mention Elasticsearch cluster in several places, but you don't
define it until after you've used it, e.g.

"You probably don’t want Marvel to monitor your local cluster, so you can disable data collection with this command:"

"This means that your Elasticsearch cluster is up and running, and we can start experimenting with it."

A new reader won't know what the difference is between a single Elasticsearch instance, such as the one they just installed, and an Elasticsearch cluster.
It's little issues like this that cause initial confusion. I know you explain this
later but it's important to bring up terms in the right order.

"echo 'marvel.agent.enabled: false' >> ./config/elasticsearch.yml"

What is "data collection"? If this is really necessary could you explain why. Otherwise this won't make sense to a new user.

"A cluster is a group of nodes with the same cluster.name "
How would a new user know what a cluster name is, or how to set one?

[15_API.asciidoc]
"If you are using Java, then Elasticsearch comes with two built-in clients which you can use in your code:

Node client
The node client joins a local cluster as a non-data node. In other words, it doesn’t hold any data itself, but it knows what data lives on which node in the cluster, and can forward requests directly to the correct node. "

This will be very confusing. A reader will probably think of a client as something
that talks to a server. Yet, in the description of the "node client", you say the client
"joins" a cluster.

Is it correct to say "Elasticsearch provides official clients for several languages, and there are numerous community-provided clients and integrations, all of which can be found in the Guide."? Is Elasticsearch really providing clients, or rather, libraries?

[20_Document.asciidoc]
The term "objects" was being used in too many ways so I changed the wording of the first couple of paragraphs.

Would a novice know what a "serialization format" is?

"Although the original user object was complex"
Where is the original user object shown? This is a confusing reference.

"Converting an object to JSON for indexing in Elasticsearch is much simpler than the equivalent process for a flat table structure."
If you're going to mention this, you should give an example of why this is true.

[25_Tutorial_Indexing.asciidoc]
What does it mean to build "dashboards over the data"? The "over" is what I don't understand.

"Elasticsearch and Lucene use a structure called an inverted index for exactly the same purpose."
Since you said that Lucene is the search engine, does it make any sense to include Elasticsearch in this sentence?

"The request body — the JSON document — contains all the information about this employee. "
Is it correct to call the request body a JSON document rather than the JSON object?

"which allows us to build much more complicated, robust queries"
What's a "robust" query?

"Our query will change a little to accommodate a filter, which allows us to execute structured searches efficiently:"
Saying this adds no value because a new reader won't see how this search is more
or less efficient than what was shown before.

Minor Style Issue

There's a sentence that says "cannot contain commas. Let’s use"

I strongly suggest that you be more consistent in your use of
contractions. So, either this should be

"can't contain commas. Let’s use"
or
"cannot contain commas. Let us use"

If you decide to use the first, then there are many places you'll need to collapse longer forms. If you decide to use the latter, there are many places you will need to expand shorter forms.

Inconsistent Sense examples

I noticed this in /080_Structured_Search/05_term.asciidoc

In other pages (at least from what I've seen in getting started), each Sense link is an isolated example - clicking it opens that specific query in Sense, which makes it easier to follow.

On this page, all the Sense links are a reference to the same, single, complete JSON example // SENSE: 080_Structured_Search/05_Term_text.json, containing all of the sample queries on this page.

I think this should be broken up to replicate the functionality of the other pages (or at least the getting started section).

Retrieving a document examples inconsistent

In the tutorial, an example returns the document directly:

{
    "first_name" :  "John",
    "last_name" :   "Smith",
    "age" :         25,
    "about" :       "I love to go rock climbing",
    "interests":  [ "sports", "music" ]
}

But later on, in the retrieving a document section, it returns the result with the document embedded:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}

It is unclear now what is returned when using GET.
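
If the response has the envelope form of the second listing, the original document sits under `_source`. The following Python sketch shows how a client might unwrap it. It assumes that envelope shape, the helper name `extract_document` is hypothetical, and the body is pasted in as a string rather than fetched from a cluster.

```python
import json

# Response body in the envelope form quoted above, with a comma after the
# "text" value so that it parses as JSON.
response_body = """
{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "123",
  "_version": 1,
  "found":    true,
  "_source":  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}
"""

def extract_document(body):
    """Return the original document from a GET response, or None if it was not found."""
    response = json.loads(body)
    if not response.get("found"):
        return None
    return response["_source"]

document = extract_document(response_body)
print(document["title"])
```
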

Minor Typo and Questions on distrib-read.html

First, the typo:

"The client sends a get request" ->
"The client sends a GET request"

Next, the questions:

A new reader, such as yours truly, will wonder why Node 1 sent the request to Node 2, given that there's a shard replica containing the requested document on Node 1. Why the extra step?
There should be a 3rd step shown here showing where Node 2 returns the document. I think it's returned to Node 1, but I'm not 100% sure. Figure 9, on the previous page, shows the equivalent step.
I'm curious about the reason why the node returning the results is the node receiving the request rather than the node that performed the query/update. Is this because there's already a network connection between the client and the receiving node (e.g. Node 1 in Figure 10) so that the response doesn't involve opening a new connection between the node that performed the query/update (Node 2 in Figure 10) and the client? I could see how a firewall might block this.

Add some refresh API calls in relevant Sense examples?

For example, in http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/snippets/080_Structured_Search/05_Term_number.json, if the user executes the search too quickly there will be no results. It may be a good idea to add refresh calls here and there?

Hunspell only accepts a single .aff file

While Hunspell can accept multiple .dic files, it only accepts a single .aff file.

Typo: it should be 结构化 ("structured")

Gists point to non elasticsearch usernames

Right now, there are a couple of references to gists under Clinton's user account:

# ag -l gist.github.com/clintongormley
050_Search/00_Intro.asciidoc
snippets/050_Search/05_Empty_search.json
snippets/050_Search/15_Pagination.json
snippets/050_Search/20_All_field.json
snippets/050_Search/20_Query_string.json
snippets/052_Mapping_Analysis/25_Data_type_differences.json
snippets/054_Query_DSL/60_Bool_query.json
snippets/054_Query_DSL/60_Empty_query.json

Note about primary shards

On http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/routing-value.html#routing-value add a sidebar about:

"although you can't change number of primary shards, still easy to build a system that can grow dynamically, which we'll learn about in ..."

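The routing page referenced in this note describes shard selection as hash(routing) % number_of_primary_shards. The Python sketch below illustrates why the number of primary shards is fixed under that formula. It makes two assumptions that are not from the guide: `zlib.crc32` is only a stand-in for a stable hash (it is not the hash Elasticsearch uses), and the document IDs are made up.

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    """Pick a shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in for a stable hash function, chosen only so
    that the sketch gives the same answer on every run.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

doc_ids = [str(i) for i in range(1, 101)]  # hypothetical document IDs
with_5_shards = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
with_6_shards = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Documents that would be looked up on a different shard after going from 5 to 6 shards.
moved = [doc_id for doc_id in doc_ids if with_5_shards[doc_id] != with_6_shards[doc_id]]
print(len(moved), "of", len(doc_ids), "documents would be looked up on a different shard")
```

Because the shard is derived from the shard count, changing the count silently changes where existing documents are expected to live, which is the limitation the proposed sidebar would acknowledge.
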
Chapter 3: Type used in example different from what's said in text

In the '_type' section of the 'Document Metadata' topic is the following:

'We shall use user for our type name'

However all the subsequent examples in the chapter use 'blog' as the _type.

Mispointing arrow in Figure for mget

In http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/distrib-multi-doc.html,
for Figure 12. Retrieving multiple documents with mget,
both arrows in step 2 are pointing to shard 0, but shouldn't one of these arrows point to shard 1?

Reversed chapters in sentence reads weird

The sentence is at the end of the section titled "How to read this book".

Later chapters like Chapter 16 and Chapter 15 are more standalone and can be referred to as needed.

Consider rewording to:

... Chapter 15 and Chapter 16 ...

multi-value field documentation somewhat misleading (?)

on http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/complex-core-fields.html , there's a box that says "The elements inside an array are not ordered. You cannot refer to “the first element” or “the last element”. Rather think of an array as a bag of values." to me, this sounded like: the second you index a list of things in ES, it becomes an unordered set, and all order is lost forever. i asked a coworker about this, and he said that the contents of this box are extremely accurate at search time, but that once you've retrieved a specific document, its multi-value fields' values are still in the order they were in when you indexed them. [is there a better way of saying that?] [also, in any case, you've always got the _source field, which should be a relatively unmodified version of the document you originally indexed.]

if that's the case, i have a hard time understanding the sentence "The correlation between {age: 35} and {name: Mary White} has been lost as each multi-value field is just a bag of values, not an ordered array" at the bottom of this page - if multi-value fields are only bags-of-values at search time, why does this constraint have to apply?

apologies if i'm speaking gibberish, i'm still new to ES and trying to get my head around things. could you please enhance this page so that it doesn't cause people to get as confused as i currently am? :)
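
One way to picture both statements together: the stored document keeps its original order, while the searchable view of an array of inner objects is flattened into one multi-value field per key. The Python sketch below is a simplified model of that flattening, not Elasticsearch's indexing code. The function name `flatten_inner_objects` and the second follower are hypothetical, and real indexing would also analyze the name strings into separate terms.

```python
def flatten_inner_objects(field, inner_objects):
    """Flatten an array of inner objects into one multi-value field per key."""
    flat = {}
    for obj in inner_objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    # Sorting mimics the "bag of values" view: position carries no meaning.
    return {name: sorted(values) for name, values in flat.items()}

followers = [
    {"age": 35, "name": "Mary White"},
    {"age": 26, "name": "Alex Jones"},  # hypothetical second follower
]

flat = flatten_inner_objects("followers", followers)
print(flat)
```

After flattening, nothing records that 35 belongs with Mary White rather than with Alex Jones, so a search for "age 35 and name Alex Jones" cannot be ruled out from the flattened fields alone, even though the original `followers` list is unchanged.
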

Minor Typo

On the "document metadata" page,

"The id is a string that, when combined with the _index and type"

should be

"The id is a string that, when combined with the _index and _type"

Running elasticsearch command incorrect if using yum

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/running-elasticsearch.html states "./bin/elasticsearch" to start. This is correct if installed via direct download but incorrect if using "yum install elasticsearch". In that case one should use "/etc/init.d/elasticsearch". Without that one gets a "Failure to configure logging" error. Credit to http://stackoverflow.com/questions/24975895/elasticsearch-cant-write-to-log-files for the fix and verified on CentOS 6.5 x86 install with ES 1.3.2

A Suggestion For The "Coping With Failure" Page

On the "Coping With Failure" page you describe what happens to replica shards when a node containing primary shards goes offline. That is, the replica shards are promoted to primary shards.

You might want to say something about what happens if the host that formerly contained the primary shards comes back online. Is there a resyncing process that brings the previously offline shards up to date? If so, after that happens, do the newly promoted shards get demoted back to replicas? I think adding a few words about this would be a good idea.

Question on _query_phase.html Page

Figure 14 shows that the coordinating node (e.g. Node 3) "forwards the search request to a primary or replica copy of every shard in the index". Yet, Node 3 itself has a primary or replica of every shard in the index. Why does Node 3 send the search request to other nodes when it has all the information it needs to perform the search itself? Is being a coordinating node such a resource intensive task that such nodes never perform the searches that they're being requested to coordinate? Or, is this simply an attempt to balance the work being done to satisfy the search among the nodes in the cluster?

Preface - Code examples. oreillymedia/title_title 404

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_using_code_examples.html

This is the link:
https://github.com/oreillymedia/title_title

I expect it should point to a different title.

Another Minor Typo

On the "retrieving a document" page:

"By default, a get request" ->

"By default, a GET request"

By the way, for editing issues like this, would you rather I did a clone and then send you pull requests? If so, how many issues will you accept in one pull request? (I did some work on a GitHub book where I found hundreds of issues, and I sent them all in in one pull request. This wasn't what the author wanted.)

Remove parent agg

Fix all examples with MVEL scripts

As the targeted ES version for the print version is 1.4.0, all examples that involve scripts and have not yet been ported from MVEL to Groovy should be ported. For example: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/snippets/030_Data/45_Partial_update.json

Arrow is bi-directional

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/images/04-06_bulk.png

Figure 13 in http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/distrib-multi-doc.html (or figure 4-6 in e-book)

The arrow for step 2 is incorrectly bi-directional between nodes 1 and 3.

Clarify create/index/update in bulk

"Cheaper in bulk" section

It would be good if you can explain where 'index' can be used vs 'update'.

The example you specified is actually an update operation (where create fails but an index succeeds because the document already exists)
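
To make the distinction concrete, the Python sketch below builds a bulk request body containing each action type. It only illustrates the newline-delimited layout: the helper name `bulk_body`, the index, type, and ID values, and the document titles are assumptions, and nothing is sent to a cluster.

```python
import json

def bulk_body(actions):
    """Serialize (action, metadata, payload) tuples into newline-delimited JSON."""
    lines = []
    for action, metadata, payload in actions:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:  # "delete" has no payload line
            lines.append(json.dumps(payload))
    return "\n".join(lines) + "\n"  # the body must end with a newline

meta = {"_index": "website", "_type": "blog", "_id": "123"}  # hypothetical target document

body = bulk_body([
    ("create", meta, {"title": "My first blog post"}),             # fails if 123 already exists
    ("index", meta, {"title": "My second blog post"}),             # creates 123 or replaces it
    ("update", meta, {"doc": {"title": "My updated blog post"}}),  # partial update of existing 123
    ("delete", meta, None),
])
print(body)
```

In this layout, 'index' always carries a complete document and overwrites whatever is stored, while 'update' carries only the fields to change, wrapped in a "doc" object.
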

"All about caching" should mention _cache_key ?

If it is "All about caching" should it also mention that _cache_key can be set and how it can be used? If I am not mistaken right now one can spot this option only in filters and caching reference at the general level and then in some specific cases like in terms lookup twitter example where the cache clear API is mentioned.

Invalid Sense example code

I first noticed this in /080_Structured_Search/05_term.asciidoc - The following example is in the docs, but it doesn't work when you try to copy and paste it into Sense, and the link to view in Sense actually sends different content.

Docs (results in an error, text is not passed to the request)

GET /my_store/_analyze?field=productID
XHDK-A-1293-#fJ3

note: the request is not escaped (# vs %23), so even if you fix the syntax, you get different results (the content from # on is dropped) - perhaps this should also be pointed out in the docs.

Sense

GET /my_store/_analyze?field=productID&text=XHDK-A-1293-%23fJ3

This is also an issue with the code in /052_Mapping_Analysis/40_Analysis.asciidoc and I suspect in other parts of the docs as well.

It's possible this is actually a Sense bug, but someone took the time to update the Sense json in the docs without updating the examples, and the docs are technically broken as-is.
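
The escaping point can be checked without a running cluster. The Python sketch below uses only the standard library. The host `localhost:9200` is an assumption, and the URL is parsed rather than requested.

```python
from urllib.parse import quote, urlsplit

text = "XHDK-A-1293-#fJ3"
base = "http://localhost:9200/my_store/_analyze?field=productID&text="

# Unescaped: everything from "#" on is parsed as a URL fragment, not as part of the query.
unescaped = urlsplit(base + text)
print(unescaped.query)     # field=productID&text=XHDK-A-1293-
print(unescaped.fragment)  # fJ3

# Escaped: "#" becomes %23 and stays inside the text parameter.
escaped_text = quote(text, safe="")
print(escaped_text)        # XHDK-A-1293-%23fJ3
```
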

Checking whether a document exists

Curl examples should have http://localhost:9200 in front of /website/blog/...

Another Minor Typo

On the custom-analyzers.html page:

" it will contains HTML tags" ->
" it will contain HTML tags"

Clinton, are you reading these issues?

How to add a failover node

020_Distributed_Cluster/20_Add_failover.asciidoc does not explain how to start a failover node

I tried

elasticsearch -f -D es.config=/usr/local/opt/elasticsearch/config/elasticsearch.yml

but that does not start a second node in the cluster

30_Ngram_intro.asciidoc missing trigram

the trigram uic is missing in the list
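
A quick way to check the list: the Python sketch below enumerates n-grams with a sliding window. It assumes the example word on that page is "quick" (which the missing trigram "uic" suggests), and the function name `ngrams` is hypothetical.

```python
def ngrams(word, n):
    """Return every contiguous substring of length n, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

trigrams = ngrams("quick", 3)
print(trigrams)  # ['qui', 'uic', 'ick']
```
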

Minor: Another Missing Word

On the "queries_and_filters.html" page:

"Queries not only have to find matching documents, but also to calculate how relevant each document is" ->

"Queries not only have to find matching documents, but also have to calculate how relevant each document is"

Incorrect table about Inverted Index

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/inverted-index.html

The first table incorrectly breaks down the tokens for each document. It seems like the tokens were reversed for Doc_1 and Doc_2.

For instance, the row for the Term "Quick" should have an "X" under Doc_2, not Doc_1.
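
The table can be regenerated mechanically. The Python sketch below builds a case-sensitive inverted index by splitting on whitespace. The two sentences are reproduced from memory of the guide's example and should be treated as an assumption, and `build_inverted_index` is a hypothetical helper.

```python
def build_inverted_index(docs):
    """Map each whitespace-separated token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index

# Assumed sample sentences; only Doc_2 contains the capitalized term "Quick".
docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

index = build_inverted_index(docs)
print(sorted(index["Quick"]))  # ['Doc_2']
print(sorted(index["quick"]))  # ['Doc_1']
```
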

node-number error?

on http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/distrib-multi-doc.html , should item 2 instead read "node 1 builds the response and returns it to the client"? i don't follow how node 3 would be responsible for returning the response. if this is a typo, no big deal, please fix; if it's not a typo, please explain why node 3 returns the response to the client, because either i'm dumb or that's not immediately apparent from the text. hopefully it's a typo :)

Minor: Another Missing Word

On the "_queries_and_filters.html" page:

"Queries not only have to find matching documents, but also to calculate how relevant each document is" ->
"Queries not only have to find matching documents, but also have to calculate how relevant each document is"

missing a p

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_installing_elasticsearch.html

Under the section "Installing Marvel", in the first sentence, the word "development" is missing a p.

Thank you soooo much for writing this!

Retrieving a document: second example

Should be

curl -i -XGET http://localhost:9200/website/blog/124?pretty

instead of

curl -i -XGET /website/blog/124?pretty

track_score typo

on http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_sorting.html , "track_score" should be "track_scores", as per http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html .

Match query produces different results from the guide

If you run the code for the single word query match example (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-query.html#_a_single_word_query), the results will differ from those shown in the guide.

Guide shows:

"hits": [
 {
    "_id":      "3",
    "_score":   0.53033006, 
    "_source": {
       "title": "The quick brown fox jumps over the quick dog"
    }
 },
 {
    "_id":      "1",
    "_score":   0.5, 
    "_source": {
       "title": "The quick brown fox"
    }
 },
 {
    "_id":      "2",
    "_score":   0.375, 
    "_source": {
       "title": "The quick brown fox jumps over the lazy dog"
    }
 }
]

But using ES 1.3.2, I get the following results using Sense:

   "hits": {
      "total": 3,
      "max_score": 0.5,
      "hits": [
         {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "1",
            "_score": 0.5,
            "_source": {
               "title": "The quick brown fox"
            }
         },
         {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "3",
            "_score": 0.44194174,
            "_source": {
               "title": "The quick brown fox jumps over the quick dog"
            }
         },
         {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "2",
            "_score": 0.3125,
            "_source": {
               "title": "The quick brown fox jumps over the lazy dog"
            }
         }
      ]
   }

It would be no big deal if the calculated scores merely differed for each entry, but these scores change the order of the returned documents. I think this is due to new scoring factors (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html#query-norm).

I would gladly make changes to the guide myself, but since I only started studying ES 2 days ago, I do not think I'm yet competent enough to change any examples or moreover, try to explain the changes.
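
As an observation from the numbers alone, not a confirmed explanation: both listings are consistent with a score of sqrt(term frequency) times a field-length norm, where the nine-word titles carry a norm of 0.375 in the guide's output and 0.3125 in the ES 1.3.2 output. The Python sketch below reproduces both rankings under that assumption. It also assumes the queried term is "quick", and `simple_score` is a hypothetical helper rather than Elasticsearch's scoring code.

```python
import math

def simple_score(term_frequency, field_norm):
    """Score a single-term match as sqrt(tf) * field-length norm."""
    return math.sqrt(term_frequency) * field_norm

def ranking(scores):
    """Return document ids ordered by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

# Document 3 contains "quick" twice; documents 1 and 2 contain it once.
guide = {
    "3": simple_score(2, 0.375),
    "1": simple_score(1, 0.5),
    "2": simple_score(1, 0.375),
}
es_1_3_2 = {
    "3": simple_score(2, 0.3125),
    "1": simple_score(1, 0.5),
    "2": simple_score(1, 0.3125),
}

print(ranking(guide), guide)
print(ranking(es_1_3_2), es_1_3_2)
```

If that reading is right, the only input that differs between the two outputs is the norm stored for the longer titles, which is enough to swap documents 1 and 3.
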

Document Oriented - JSON example is invalid

The example JSON data here is not valid due to the trailing comma.
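
A strict parser confirms this kind of problem. The Python sketch below uses the standard `json` module. The sample strings are hypothetical stand-ins, not the guide's actual example.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

with_trailing_comma = '{"name": "John Smith", "age": 42,}'    # hypothetical sample
without_trailing_comma = '{"name": "John Smith", "age": 42}'

print(is_valid_json(with_trailing_comma))     # False
print(is_valid_json(without_trailing_comma))  # True
```
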

Very Minor Style Issue

On the "_most_important_queries_and_filters.html" page you say:

"The term filter is used to filter by exact values, be they numbers, dates, booleans, or not_analyzed exact value string fields".

Later on the page you say:

"such as a number, a date, a boolean or a not_analyzed string field"

Note the presence of a comma before the "or" in the first example, and its absence in the second. This issue exists in other places in the book.

I know this is a controversial editing issue. I suggest including the comma before the last "and" or "or". This is what New Yorker magazine does. In any case, whichever approach you choose, you should be consistent.

Node 2 should receive the master star

Shouldn't Node 2 receive its master star in the schema in the Life inside a cluster / Coping with failure page Figure 6 as it is said "the first thing that happened was that the nodes elected a new master: Node 2"?

Preface - Using code examples section

In the Using Code Examples section, the link for downloads points to a nonexistent page: https://github.com/oreillymedia/title_title

45_filter_order.asciidoc bad example

The example with the cached filter that is applied before the uncached filter actually is not equivalent to the initial filter example.

If we assume that now is in the range from 00:00 to 01:00, then the cached filter restricts the results to today, in the time range from midnight to now, while the initial query still allows results from before midnight (e.g. if now is 00:15, the filtered time range would be [2014-04-03T23:15:00 to 2014-04-04T00:15:00]).

So the example with the cached filter restricts the results more than the initial example does.
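
The difference can be worked through with plain date arithmetic. The Python sketch below models the two filters as time ranges, using the 00:15 example from this issue. The function names are hypothetical, and the ranges are a simplified model of the filters rather than Elasticsearch date math.

```python
from datetime import datetime, timedelta

def last_hour(now):
    """Range matched by the initial filter: the hour leading up to now."""
    return now - timedelta(hours=1), now

def last_hour_within_today(now):
    """Range matched once a 'since midnight' filter is applied first."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start, end = last_hour(now)
    return max(start, midnight), end

now = datetime(2014, 4, 4, 0, 15)
print(last_hour(now))               # starts at 2014-04-03 23:15
print(last_hour_within_today(now))  # starts at 2014-04-04 00:00
```

For any time after 01:00 the two ranges coincide, so the discrepancy only shows up during the first hour of the day.
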

dynamic scripting

In the section 'using scripts to make partial updates' it should probably be mentioned that dynamic scripting is disabled by default since version 1.2.0. Probably, one should also add how to enable it (and warn that it is a bad practise?)