
esbench's People

Contributors

mkocikowski


esbench's Issues

Add 'mapping' to config file

The bench config file, in addition to the queries which will be executed in each observation, should have a 'mapping' section providing the mapping which will be applied when the test index is created.
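
A minimal sketch of what such a config could look like, written as a Python dict; the key names and field definitions here are assumptions for illustration, not esbench's actual config format:

    # Hypothetical bench config with a 'mapping' section; key names are illustrative.
    config = {
        "queries": {
            "match_body": {"query": {"match": {"body": "%(random)s"}}},
        },
        "mapping": {
            "properties": {
                "title": {"type": "string", "index": "analyzed"},
                "body": {"type": "string", "index": "analyzed"},
            },
        },
    }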

Try to get system-specific temp dir location

As opposed to writing temp files to /tmp. The reason for not using the Python tempfile module is that these files take a while to download, so I want them to be reusable between program runs; yet, since they are large-ish, I also want the OS to be able to easily remove them as needed.
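
A minimal sketch of one way to do this with the standard library; the cached file name below is made up for illustration:

    import os
    import tempfile

    # Use the system temp location instead of hard-coded /tmp, but with a
    # stable file name so a downloaded data file can be reused across runs.
    def cached_data_path(filename="esbench_data.json"):  # name is hypothetical
        return os.path.join(tempfile.gettempdir(), filename)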

Refer to benchmarks by their id, or by their name

Currently the commands which take benchmark ids as their parameters ('dump', 'clear', 'show') accept only benchmark ids. It should also be possible to use benchmark names ('mik_bench01', etc.), which are stored in the bench.benchmark_name field.

Track query variance over all n queries per run

It may be interesting to monitor not just the total time for n queries, but also to track how long each individual query takes. This way, the variance over the set of n queries can be examined (in case variance scales differently for different queries or configs).
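
A minimal sketch of per-query timing and variance, assuming a run_query callable that stands in for however esbench executes a single query:

    import time

    def timed_queries(run_query, queries):
        # Time each query individually so variance can be computed,
        # not just the total for all n queries.
        durations = []
        for q in queries:
            t0 = time.time()
            run_query(q)
            durations.append(time.time() - t0)
        mean = sum(durations) / len(durations)
        variance = sum((d - mean) ** 2 for d in durations) / len(durations)
        return durations, mean, variance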

--no-random flag

The flag would disable insertion of the random string into bench queries via the %(random)s placeholder. This would allow for testing of caching of results.
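
A minimal sketch of the substitution, assuming the queries use the %(random)s placeholder mentioned above; the flag handling is illustrative:

    import random
    import string

    def render_query(template, no_random=False):
        # With --no-random, substitute a fixed token so repeated queries are
        # identical and results can be served from cache; otherwise use a
        # fresh random string each time.
        if no_random:
            token = "fixed"
        else:
            token = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
        return template % {"random": token}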

--sample flag for 'show' command

--sample (default 1) specifies the step at which observations are sampled. So if you have a benchmark with 50 observations and you don't want to look at them all, run 'show --sample 5' to see only 10 of them.
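
A minimal sketch of the sampling itself:

    def sample_observations(observations, step=1):
        # Every step-th observation; 50 observations with --sample 5 yields 10.
        return observations[::step]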

Query disco in 'analyze'

Analyze should be able to 'discover' all the queries (and thus the associated stats groups) for each benchmark, and extract information accordingly.

--append flag

Do not clean existing test data (not benchmarks, but the test data itself). Append new documents to the existing ones.

Use 'requests' instead of the home-grown HTTP API

The decision to use a home-made HTTP request API was made so that there would be no external dependencies. I don't think this is really a very important requirement. There is already a 'soft' dependency on 'tabulate' for output formatting; 'requests' is very solid, so we may as well use it.
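
A minimal sketch of what a 'requests'-based call could look like; the host, port, and endpoint here are illustrative, not esbench's actual internals:

    import requests

    def search(host, index, query_body):
        # Send a search to Elasticsearch and return the parsed JSON response.
        resp = requests.get("http://%s:9200/%s/_search" % (host, index), data=query_body)
        resp.raise_for_status()
        return resp.json()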

Command line options also available in config file

All the command line options should be configurable in the config file; when they are also specified on the command line, the command-line values should override the config file. This means that the config file should be loaded and parsed on every invocation of the command line, so that even if it is just the 'help' command, the defaults can be read from the config file and displayed appropriately.
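
A minimal sketch of the precedence rule with argparse; the config file name and option names below are assumptions for illustration:

    import argparse
    import json

    def parse_args(argv=None, config_path="esbench.json"):
        # Load defaults from the config file first, then let command-line
        # flags override them.
        try:
            with open(config_path) as f:
                defaults = json.load(f)
        except IOError:
            defaults = {}
        parser = argparse.ArgumentParser()
        parser.add_argument("--sample", type=int, default=defaults.get("sample", 1))
        parser.add_argument("--no-random", action="store_true",
                            default=defaults.get("no_random", False))
        return parser.parse_args(argv)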

Basic graphing

Using 'pygal' to generate SVGs. Basic capability, simple charts, but something visual.
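
A minimal sketch of a simple chart with pygal; the series name, title, and output path are placeholders:

    import pygal

    def render_chart(times, path="benchmark.svg"):
        # One line series of per-observation query times, rendered to SVG.
        chart = pygal.Line(title="query time per observation")
        chart.add("total time (s)", times)
        chart.render_to_file(path)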

Linux kernel data source

Clone and index a large repo from GitHub, such as the Linux kernel. Iterate through commit messages, do facets on them, etc. Another large repo is WebKit - 18GB.
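
A minimal sketch of pulling commit messages out of a local clone with plain git; indexing them would go through whatever insert path esbench already has:

    import subprocess

    def commit_messages(repo_path):
        # One commit subject line per commit in the cloned repository.
        out = subprocess.check_output(
            ["git", "-C", repo_path, "log", "--pretty=format:%s"])
        for line in out.decode("utf-8", "replace").splitlines():
            yield line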

Less ambiguous presentation of stats data

Specify magnitude (milliseconds?), and whether results are per query or aggregate. I looked at the results for the first time in a while and had no idea what they meant.

Refactor observations

Query objects should be assigned per-observation; this would get rid of all the weirdness with stats group names.

Named benchmarks

Ability to name benchmarks, so that they are easy to identify. Once you can dump and load benchmarks, you can end up with a lot of different benchmarks in one place, so it is important to be able to tell them apart.

Basic ES capacity planning info

Intro to ES capacity planning, goals of this project. Talk about ES indexes, Lucene indexes, Lucene segments, optimizations.

'rerun' capability

Ability to 'rerun' a benchmark, that is, to run a benchmark with the same parameters as some other benchmark. The command line arguments and the content of the 'config' file are recorded with each benchmark. So let's say a benchmark gets 'dump'-ed and put into a gist, then someone else loads it onto their machine and wants to run the same benchmark there - there could be a --rerun argument to 'run' a benchmark with the same parameters as some other previously recorded benchmark.

Make recording of per-segment stats optional

Lucene segments have unique names, and their stats are recorded in observations as, for example, "obs.segments.segments._vu.deleted_docs". This means that where there are a lot of segments, there may be thousands of separate segment fields, which makes introspection in 'elsec' really slow.

Make per-segment stat recording optional, with a '--record-segments' flag to the 'run' command.
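
A minimal sketch of the pruning, assuming the observation's segments stats nest the per-segment breakdown under a 'segments' key as in the field names quoted above:

    def prune_segments(segments_stats, record_segments=False):
        # Keep the aggregate segment stats, but drop the per-segment breakdown
        # unless the proposed --record-segments flag is set.
        if record_segments:
            return segments_stats
        pruned = dict(segments_stats)
        pruned.pop("segments", None)
        return pruned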

Normalize test data structure

Change field names from 'title', 'abstract', 'body' to something like 'short', 'medium', 'long' - this would allow for other data sources to use the same queries / setup as the default benchmark.

Record the size / count of data being inserted

Inserting 100MB of data into the index doesn't mean the index will be 100MB - it looks like the index ends up being up to 10% smaller. It would be good to capture the actual byte size of the data as it is being inserted.
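
A minimal sketch of tallying input size at insert time; 'insert' stands in for whatever call esbench uses to index a single document:

    def insert_and_count(insert, docs):
        # Record the raw byte size and count of the documents actually fed to
        # the index, so it can be reported next to the resulting index size.
        total_bytes, count = 0, 0
        for doc in docs:
            total_bytes += len(doc.encode("utf-8"))
            count += 1
            insert(doc)
        return total_bytes, count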

Add 'background' inserts

Add low-volume background inserts to be executed as observations (query runs) are being executed. Observations measure search performance; this will add ability to see how search performance varies when there are simultaneous inserts.
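
A minimal sketch of a low-volume background inserter running alongside the observations; 'insert_one' stands in for esbench's existing document insert call:

    import threading
    import time

    def start_background_inserts(insert_one, docs, interval=1.0):
        # Insert one document every 'interval' seconds in a daemon thread
        # until the returned event is set (e.g. when the observation ends).
        stop = threading.Event()

        def loop():
            for doc in docs:
                if stop.is_set():
                    break
                insert_one(doc)
                time.sleep(interval)

        t = threading.Thread(target=loop)
        t.daemon = True
        t.start()
        return stop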

Record the actual command which executed the benchmark

Record the 'command line' as part of benchmark metadata. The reason is that if you dump some benchmarks, then load them, and want to run one more that is similar, it will be much easier if you know the actual command which was used to run the other benchmarks.

Bulk insert option

Default to bulk inserts, as opposed to the current one-by-one inserts. At this time there isn't really any stats gathering around inserts - should there be?
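
A minimal sketch of building a body for Elasticsearch's _bulk endpoint instead of issuing one request per document; the index and doc type names are illustrative:

    import json

    def bulk_body(docs, index="esbench_test", doc_type="doc"):
        # The _bulk format alternates an action/metadata line with a source
        # line, newline-delimited, with a trailing newline.
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
            lines.append(json.dumps(doc))
        return "\n".join(lines) + "\n"

The resulting string would then be POSTed to the cluster's /_bulk endpoint in a single request.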

Tutorial on how to benchmark an existing deployment

Run benchmarks against your current deployment, then use 'esdump' and 'esload' to copy the data over to a new cluster / different config, and run the tests again. Also see #44 - running a benchmark against an existing node.
