elasticsearch-definitive-guide's Introduction

Elasticsearch: The Definitive Guide

This repository contains the source for the legacy Elasticsearch: The Definitive Guide documentation and is no longer maintained. For the latest information, see the current Elasticsearch documentation.

Building the Definitive Guide

In order to build this project, we rely on our docs infrastructure.

To build the HTML of the complete project, run the following commands:

# clone this repo
git clone git@github.com:elastic/elasticsearch-definitive-guide.git
# clone the docs build infrastructure
git clone git@github.com:elastic/docs.git
# Build HTML and open a browser
cd elasticsearch-definitive-guide
../docs/build_docs.pl --doc book.asciidoc --open

This assumes that you have all necessary prerequisites installed. For a more complete reference, please refer to the README in the docs repo.

The Definitive Guide is written in Asciidoc and the docs repo also contains a short Asciidoc guide.

Supported versions

The Definitive Guide is available for multiple versions of Elasticsearch:

Contributing

This repository is no longer maintained. Pull requests and issues will not be addressed.

To contribute to the current Elasticsearch docs, refer to the Elasticsearch repository.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

See http://creativecommons.org/licenses/by-nc-nd/3.0/ for the full text of the License.

elasticsearch-definitive-guide's People

Contributors

ashishthakur, bandersonoreilly, cittatva, clintongormley, dangitoreilly, danielmitterdorfer, debadair, eefi, ericamick, eskibars, glenrsmith, igal-getrailo, joelbourbon, johtani, joshuar, jrodewig, kristenorm, lcawl, markwalkom, mcascallares, mrdnk, nik9000, paikan, peschlowp, polyfractal, rabu3082, sevab, skalapurakkel, tylerjryan, ureimers

elasticsearch-definitive-guide's Issues

Hyphenation of keywords

As an example multi_match on page 279 is split over 2 pages. I am not a big fan of hyphenation at all but I think that all constant width words should not be hyphenated.

Missing Word

(Sorry about all these. I'm sending them in as I find them.)

On the "Partial Updates to Documents" page:

"we said that the smaller window between the retrieve and reindex steps, the smaller the opportunity for conflicting changes" ->

"we said that the smaller the window between the retrieve and reindex steps, the smaller the opportunity for conflicting changes"

Examples at 010_Intro/25_Tutorial_Indexing raises errors

While working through the tutorial, I found that indexing employees 2 and 3 raises this error:

{
  "error" : "MapperParsingException[object mapping for [employee] tried to parse as object, but got EOF, has a concrete value been provided to it?]",
  "status" : 400
}

After some digging through the examples, it appears that the issue is related to

"about": {
    "bio":         "Eco-warrior and defender of the weak",
    "age":         25,
    "interests": [ "dolphins", "whales" ]
},

for the first employee, while the second employee has

"about": "I like to collect rock albums",

By converting 'about' to an object, it works as intended:

"about": { "interests": "I like to collect rock albums" }
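
The conflict can be reproduced in miniature. The sketch below is a toy model in Python, not Elasticsearch's mapper: a hypothetical `index_document` helper records whether each field was first seen as an object or as a plain value, and rejects a later document that uses the other shape. It deliberately ignores arrays and every other detail of real mappings.

```python
def kind(value):
    """Classify a JSON value as 'object' or 'value' (toy distinction)."""
    return "object" if isinstance(value, dict) else "value"


def index_document(mapping, doc):
    """Record the shape of each new field; reject a field whose shape changed.

    `mapping` is a plain dict standing in for the index mapping (hypothetical).
    """
    for field, value in doc.items():
        seen = mapping.setdefault(field, kind(value))
        if seen != kind(value):
            raise ValueError(
                f"field [{field}] is mapped as {seen}, got {kind(value)}"
            )


mapping = {}
employee_1 = {"about": {"bio": "Eco-warrior and defender of the weak", "age": 25}}
employee_2 = {"about": "I like to collect rock albums"}
employee_2_fixed = {"about": {"interests": "I like to collect rock albums"}}

index_document(mapping, employee_1)          # first document fixes the shape
try:
    index_document(mapping, employee_2)      # string where an object is expected
    conflict = None
except ValueError as err:
    conflict = str(err)
index_document(mapping, employee_2_fixed)    # object again, so accepted
```

In this model the first document decides that `about` is an object, which is why the plain-string version is rejected and the wrapped version is accepted.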

Issues From The Intro

The following are some issues I've found that aren't simple text changes. I've tried to make it clear what file the specific issues are from.

[Preface.asciidoc]

The Preface says:

"[The Elasticsearch documentation] assumes that you are intimately familiar with information retrieval concepts, distributed systems, the query DSL and a host of other topics. This book makes no such assumptions. It has been written so that a complete beginner — to both search and distributed systems — can pick it up and start building a prototype within a few chapters."

and

"We explain concepts from first principles, helping novices gain sure footing in the complex world of search."

I don't think novices will know what distributed scalable real-time search and analytics engine, text search, structured search, analytics, structured and unstructured data, and the like are. I'm also not sure what "first principles" means in this case, but it would be great if you could provide a general description of these concepts somewhere early in the book, maybe in a pre-Preface Preface. As is, there are too many terms that a novice, or even more experienced readers, might not know.

[05_What_is_it.asciidoc]

I'm bothered by the quick mention of Apache Lucene. It's described as a search
engine library. Is this enough information for new users?

"Document store" is used for the first time. A new user won't know how this
compares to a traditional relational database.

[10_Installing_ES.asciidoc]

"When installing Elasticsearch in production, you can use the method described above, "
What does installing into production have to do with anything? Installing from source
or packages can be done in any tier.

Say whether to install the JRE or the JDK.

Does it make sense to include "View in Sense" links in the text? Ideally these would
only appear in the online version. I'm not sure how you're planning to handle differences where something will appear online but not on paper.

You mention Elasticsearch cluster in several places, but you don't
define it until after you've used it, e.g.

"You probably don’t want Marvel to monitor your local cluster, so you can disable data collection with this command:"

"This means that your Elasticsearch cluster is up and running, and we can start experimenting with it."

A new reader won't know what the difference is between a single Elasticsearch instance, such as the one they just installed, and an Elasticsearch cluster.
It's little issues like this that cause initial confusion. I know you explain this
later but it's important to bring up terms in the right order.

"echo 'marvel.agent.enabled: false' >> ./config/elasticsearch.yml"

What is "data collection"? If this is really necessary, could you explain why? Otherwise this won't make sense to a new user.

"A cluster is a group of nodes with the same cluster.name "
How would a new user know what a cluster name is, or how to set one?

[15_API.asciidoc]
"If you are using Java, then Elasticsearch comes with two built-in clients which you can use in your code:

Node client
The node client joins a local cluster as a non-data node. In other words, it doesn’t hold any data itself, but it knows what data lives on which node in the cluster, and can forward requests directly to the correct node. "

This will be very confusing. A reader will probably think of a client as something
that talks to a server. Yet, in the description of the "node client", you say the client
"joins" a cluster.

Is it correct to say "Elasticsearch provides official clients for several languages, and there are numerous community-provided clients and integrations, all of which can be found in the Guide."? Is Elasticsearch really providing clients, or rather, libraries?

[20_Document.asciidoc]
The term "objects" was being used in too many ways so I changed the wording of the first couple of paragraphs.

Would a novice know what a "serialization format" is?

"Although the original user object was complex"
Where is the original user object shown? This is a confusing reference.

"Converting an object to JSON for indexing in Elasticsearch is much simpler than the equivalent process for a flat table structure."
If you're going to mention this, you should give an example of why this is true.

[25_Tutorial_Indexing.asciidoc]
What does it mean to build "dashboards over the data"? The "over" is what I don't understand.

"Elasticsearch and Lucene use a structure called an inverted index for exactly the same purpose."
Since you said that Lucene is the search engine, does it make any sense to include Elasticsearch in this sentence?

"The request body — the JSON document — contains all the information about this employee. "
Is it correct to call the request body a JSON document rather than the JSON object?

"which allows us to build much more complicated, robust queries"
What's a "robust" query?

"Our query will change a little to accommodate a filter, which allows us to execute structured searches efficiently:"
Saying this adds no value because a new reader won't see how this search is more
or less efficient than what was shown before.

Minor Style Issue

There's a sentence that says "cannot contain commas. Let’s use"

I strongly suggest that you be more consistent in your use of
contractions. So, either this should be

"can't contain commas. Let’s use"
or
"cannot contain commas. Let us use"

If you decide to use the first, then there are many places you'll need to collapse longer forms. If you decide to use the latter, there are many places you will need to expand shorter forms.

Inconsistent Sense examples

I noticed this in /080_Structured_Search/05_term.asciidoc

In other pages (at least from what I've seen in getting started), each Sense link is an isolated example - clicking it opens that specific query in Sense, which makes it easier to follow.

On this page, all the Sense links are a reference to the same, single, complete JSON example // SENSE: 080_Structured_Search/05_Term_text.json, containing all of the sample queries on this page.

I think this should be broken up to replicate the functionality of the other pages (or at least the getting started section).

Retrieving a document examples inconsistent

In the tutorial, an example returns the document directly:

{
    "first_name" :  "John",
    "last_name" :   "Smith",
    "age" :         25,
    "about" :       "I love to go rock climbing",
    "interests":  [ "sports", "music" ]
}

But later on, in the retrieving a document section, it returns the result with the document embedded:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}

It is unclear now what is returned when using GET.
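
For readers comparing the two outputs: the second response is an envelope of metadata with the document itself stored under `_source`. A minimal Python sketch, using a copy of that response body as a literal string instead of a live request, shows how the two relate:

```python
import json

response_body = """
{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "123",
  "_version": 1,
  "found":    true,
  "_source":  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}
"""

response = json.loads(response_body)

# Metadata fields sit at the top level; the indexed document is under _source.
document = response["_source"] if response["found"] else None
metadata = {key: value for key, value in response.items() if key != "_source"}
```

Here `document` has the same shape as the bare document shown in the tutorial, and `metadata` holds the remaining top-level fields.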

Minor Typo and Questions on distrib-read.html

First, the typo:

"The client sends a get request" ->
"The client sends a GET request"

Next, the questions:

  1. A new reader, such as yours truly, will wonder why Node 1 sent the request to Node 2, given that there's a shard replica containing the requested document on Node 1. Why the extra step?

  2. There should be a 3rd step shown here showing where Node 2 returns the document. I think it's returned to Node 1, but I'm not 100% sure. Figure 9, on the previous page, shows the equivalent step.

  3. I'm curious about the reason why the node returning the results is the node receiving the request rather than the node that performed the query/update. Is this because there's already a network connection between the client and the receiving node (e.g. Node 1 in Figure 10) so that the response doesn't involve opening a new connection between the node that performed the query/update (Node 2 in Figure 10) and the client? I could see how a firewall might block this.

Gists point to non elasticsearch usernames

Right now, there are a couple of references to gists under Clinton's user account:

# ag -l gist.github.com/clintongormley
050_Search/00_Intro.asciidoc
snippets/050_Search/05_Empty_search.json
snippets/050_Search/15_Pagination.json
snippets/050_Search/20_All_field.json
snippets/050_Search/20_Query_string.json
snippets/052_Mapping_Analysis/25_Data_type_differences.json
snippets/054_Query_DSL/60_Bool_query.json
snippets/054_Query_DSL/60_Empty_query.json

Reversed chapters in sentence reads weird

The sentence is at the end of the section titled "How to read this book".

Later chapters like Chapter 16 and Chapter 15 are more standalone and can be referred to as needed.

Consider rewording to:

... Chapter 15 and Chapter 16 ...

multi-value field documentation somewhat misleading (?)

on http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/complex-core-fields.html , there's a box that says The elements inside an array are not ordered. You cannot refer to “the first element” or “the last element”. Rather think of an array as a bag of values. to me, this sounded like: the second you index a list of things in ES, it becomes an unordered set, and all order is lost forever. i asked a coworker about this, and he said that the contents of this box are extremely accurate at search time, but that once you've retrieved a specific document, its multi-value fields' values are still in the order they were in when you indexed them. [is there a better way of saying that?] [also, in any case, you've always got the _source field, which should be a relatively unmodified version of the document you originally indexed.]

if that's the case, i have a hard time understanding the sentence The correlation between {age: 35} and {name: Mary White} has been lost as each multi-value field is just a bag of values, not an ordered array at the bottom of this page - if multi-value fields are only bags-of-values at search time, why does this constraint have to apply?

apologies if i'm speaking gibberish, i'm still new to ES and trying to get my head around things. could you please enhance this page so that it doesn't cause people to get as confused as i currently am? :)
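
One way to picture the distinction raised here, as a toy Python sketch and not Elasticsearch's actual indexing code: the stored document keeps its arrays in order, while the searchable view of an array of inner objects is flattened into one collection of values per sub-field. The `followers` field name, the `flatten_for_search` helper, and the second follower are hypothetical, chosen to match the {age: 35} / {name: Mary White} example.

```python
def flatten_for_search(field, objects):
    """Collapse a list of inner objects into one set of values per sub-field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", set()).add(value)
    return flat


source = {
    "followers": [
        {"age": 35, "name": "Mary White"},
        {"age": 26, "name": "Alex Jones"},   # hypothetical second follower
    ]
}

searchable = flatten_for_search("followers", source["followers"])

# The original document still lists Mary White first.
first_follower = source["followers"][0]["name"]

# In the flattened view, 26 and "Mary White" both match, even though
# no single follower is both 26 and named Mary White.
false_match = (
    26 in searchable["followers.age"]
    and "Mary White" in searchable["followers.name"]
)
```

In this model the ordering survives in `source` but the pairing of age and name does not survive in `searchable`, which is the loss of correlation the page describes.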

Minor Typo

On the "document metadata" page,

"The id is a string that, when combined with the _index and type"

should be

"The id is a string that, when combined with the _index and _type"

Running elasticsearch command incorrect if using yum

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/running-elasticsearch.html states "./bin/elasticsearch" to start. This is correct if installed via direct download but incorrect if using "yum install elasticsearch". In that case one should use "/etc/init.d/elasticsearch". Without that one gets a "Failure to configure logging" error. Credit to http://stackoverflow.com/questions/24975895/elasticsearch-cant-write-to-log-files for the fix and verified on CentOS 6.5 x86 install with ES 1.3.2

A Suggestion For The "Coping With Failure" Page

On the "Coping With Failure" page you describe what happens to replica shards when a node containing primary shards goes offline. That is, the replica shards are promoted to primary shards.

You might want to say something about what happens if the host that formerly contained the primary shards comes back online. Is there a resyncing process that brings the previously offline shards up to date? If so, after that happens, do the newly promoted shards get demoted back to replicas? I think adding a few words about this would be a good idea.

Question on _query_phase.html Page

Figure 14 shows that the coordinating node (e.g. Node 3) "forwards the search request to a primary or replica copy of every shard in the index". Yet, Node 3 itself has a primary or replica of every shard in the index. Why does Node 3 send the search request to other nodes when it has all the information it needs to perform the search itself? Is being a coordinating node such a resource intensive task that such nodes never perform the searches that they're being requested to coordinate? Or, is this simply an attempt to balance the work being done to satisfy the search among the nodes in the cluster?

Another Minor Typo

On the "retrieving a document" page:

"By default, a get request" ->

"By default, a GET request"

By the way, for editing issues like this, would you rather I did a clone and then send you pull requests? If so, how many issues will you accept in one pull request? (I did some work on a GitHub book where I found hundreds of issues, and I sent them all in in one pull request. This wasn't what the author wanted.)

Clarify create/index/update in bulk

"Cheaper in bulk" section

It would be good if you can explain where 'index' can be used vs 'update'.

The example you specified is actually an update operation (where create fails but an index succeeds because the document already exists)
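
As a rough illustration of the distinction being asked for, here is a toy in-memory model in Python. It is not the bulk API, and the `apply_action` helper and its return strings are hypothetical; it only mirrors the behaviour that `create` fails when the document already exists, `index` adds or replaces the whole document, and `update` changes part of an existing document.

```python
def apply_action(store, action, doc_id, body):
    """Apply one bulk-style action to `store`, a dict of id -> document."""
    exists = doc_id in store
    if action == "create":
        if exists:
            return "conflict: document already exists"
        store[doc_id] = dict(body)
        return "created"
    if action == "index":
        store[doc_id] = dict(body)           # whole document is replaced
        return "replaced" if exists else "created"
    if action == "update":
        if not exists:
            return "error: document missing"
        store[doc_id].update(body)           # only the given fields change
        return "updated"
    raise ValueError(f"unknown action: {action}")


store = {}
results = [
    apply_action(store, "create", "1", {"title": "first", "views": 0}),
    apply_action(store, "create", "1", {"title": "again"}),
    apply_action(store, "index", "1", {"title": "second"}),
    apply_action(store, "update", "1", {"views": 5}),
    apply_action(store, "update", "2", {"views": 1}),
]
```

The second `create` is rejected, the `index` action discards the old `views` field along with the rest of the old document, and the final `update` fails because document 2 was never indexed.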

Invalid Sense example code

I first noticed this in /080_Structured_Search/05_term.asciidoc - The following example is in the docs, but it doesn't work when you try to copy and paste it into Sense, and the link to view in Sense actually sends different content.

Docs (results in an error, text is not passed to the request)

GET /my_store/_analyze?field=productID
XHDK-A-1293-#fJ3

note: the request is not escaped (# vs %23), so even if you fix the syntax, you get different results (the content from # on is dropped) - perhaps this should also be pointed out in the docs.

Sense

GET /my_store/_analyze?field=productID&text=XHDK-A-1293-%23fJ3

This is also an issue with the code in /052_Mapping_Analysis/40_Analysis.asciidoc and I suspect in other parts of the docs as well.

It's possible this is actually a Sense bug, but someone took the time to update the Sense json in the docs without updating the examples, and the docs are technically broken as-is.
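
The escaping point can be checked outside Sense. This short Python sketch uses only the standard library's `urllib.parse` and makes no request; it shows how the unescaped `#` turns the rest of the value into a URL fragment, and how percent-encoding keeps it in the query string.

```python
from urllib.parse import quote, urlsplit

text = "XHDK-A-1293-#fJ3"
base = "/my_store/_analyze?field=productID&text="

# Unescaped, everything from '#' onward is parsed as a URL fragment.
raw = urlsplit(base + text)

# Percent-encoding the value keeps '#' inside the query string as %23.
escaped = urlsplit(base + quote(text, safe=""))

print(raw.query, "|", raw.fragment)
print(escaped.query, "|", escaped.fragment)
```

Fragments are normally not sent to the server, which matches the report that the content from `#` on is dropped.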

Another Minor Typo

On the custom-analyzers.html page:

" it will contains HTML tags" ->
" it will contain HTML tags"

Clinton, are you reading these issues?

How to add a failover node

020_Distributed_Cluster/20_Add_failover.asciidoc does not explain how to start a failover node

I tried

elasticsearch -f -D es.config=/usr/local/opt/elasticsearch/config/elasticsearch.yml

but that does not start a second node in the cluster

Minor: Another Missing Word

On the "queries_and_filters.html" page:

"Queries not only have to find matching documents, but also to calculate how relevant each document is" ->

"Queries not only have to find matching documents, but also have to calculate how relevant each document is"

Minor: Another Missing Word

On the "_queries_and_filters.html" page:

"Queries not only have to find matching documents, but also to calculate how relevant each document is" ->
"Queries not only have to find matching documents, but also have to calculate how relevant each document is"

Match query produces different results from the guide

If you run the code for the single word query match (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-query.html#_a_single_word_query), the results will differ from those shown in the guide.

Guide shows:

"hits": [
 {
    "_id":      "3",
    "_score":   0.53033006, 
    "_source": {
       "title": "The quick brown fox jumps over the quick dog"
    }
 },
 {
    "_id":      "1",
    "_score":   0.5, 
    "_source": {
       "title": "The quick brown fox"
    }
 },
 {
    "_id":      "2",
    "_score":   0.375, 
    "_source": {
       "title": "The quick brown fox jumps over the lazy dog"
    }
 }
]

But using ES 1.3.2, I get the following results using Sense:

   "hits": {
      "total": 3,
      "max_score": 0.5,
      "hits": [
         {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "1",
            "_score": 0.5,
            "_source": {
               "title": "The quick brown fox"
            }
         },
         {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "3",
            "_score": 0.44194174,
            "_source": {
               "title": "The quick brown fox jumps over the quick dog"
            }
         },
         {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "2",
            "_score": 0.3125,
            "_source": {
               "title": "The quick brown fox jumps over the lazy dog"
            }
         }
      ]
   }

It would be no big deal if the calculated scores simply differed for each entry, but these scores change the order of the returned documents. I think this is due to new scoring factors (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html#query-norm).

I would gladly make changes to the guide myself, but since I only started studying ES 2 days ago, I do not think I'm yet competent enough to change any examples or moreover, try to explain the changes.
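
One observation that may help whoever investigates, offered as arithmetic and not as a diagnosis: each score quoted above is consistent with the square root of the number of times "quick" occurs in the title, multiplied by a single per-document factor. The Python sketch below divides that square root out of the reported scores. What the remaining factor represents is an assumption left open here; the report does not say how the scores were computed.

```python
import math

# (occurrences of "quick", guide score, ES 1.3.2 score), keyed by _id
reported = {
    "1": (1, 0.5, 0.5),
    "2": (1, 0.375, 0.3125),
    "3": (2, 0.53033006, 0.44194174),
}

# Divide out sqrt(occurrences) to expose the remaining per-document factor.
factors = {
    doc_id: (guide / math.sqrt(count), actual / math.sqrt(count))
    for doc_id, (count, guide, actual) in reported.items()
}

for doc_id, (guide_factor, actual_factor) in sorted(factors.items()):
    print(doc_id, round(guide_factor, 4), round(actual_factor, 4))
```

Documents 2 and 3, the two nine-word titles, share the same factor within each run: 0.375 in the guide's output and 0.3125 in the ES 1.3.2 output. Document 1 stays at 0.5 in both, which is enough to move it ahead of document 3.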

Very Minor Style Issue

On the "_most_important_queries_and_filters.html" page you say:

"The term filter is used to filter by exact values, be they numbers, dates, booleans, or not_analyzed exact value string fields".

Later on the page you say:

"such as a number, a date, a boolean or a not_analyzed string field"

Note the presence of a comma before the "or" in the first example, and its absence in the second. This issue exists in other places in the book.

I know this is a controversial editing issue. I suggest including the comma before the last "and" or "or". This is what New Yorker magazine does. In any case, whichever approach you choose, you should be consistent.

Node 2 should receive the master star

Shouldn't Node 2 receive its master star in Figure 6 on the "Life inside a cluster / Coping with failure" page, since the text says "the first thing that happened was that the nodes elected a new master: Node 2"?

45_filter_order.asciidoc bad example

The example with the cached filter that is applied before the uncached filter actually is not equivalent to the initial filter example.

If we assume that now is between 00:00 and 01:00, then the cached filter restricts the results to today, from midnight to now, while the initial query still allows results from before midnight (e.g. if now is 00:15, the filtered time range would be [2014-04-03T23:15:00 to 2014-04-04T00:15:00]).

So, the example with the cached filter restricts the results more than the initial example
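
The 00:15 case can be worked through with plain date arithmetic. This Python sketch assumes, as the description above does, that the cached filter starts at midnight of the current day and that both filters must match; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

now = datetime(2014, 4, 4, 0, 15)            # the 00:15 case from the text

# Initial example: everything in the last hour.
last_hour_start = now - timedelta(hours=1)

# Cached "since midnight" filter combined with the last-hour filter:
# a document must satisfy both, so the later start time wins.
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
combined_start = max(midnight, last_hour_start)

excluded = combined_start - last_hour_start  # window lost to the cached filter
```

With these values the initial filter starts at 2014-04-03T23:15:00, the combined filter starts at 2014-04-04T00:00:00, and the 45 minutes before midnight are excluded.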

dynamic scripting

In the section 'using scripts to make partial updates' it should probably be mentioned that dynamic scripting is disabled by default since version 1.2.0. Probably, one should also add how to enable it (and warn that it is a bad practise?)
