
esbench's People

Contributors

mkocikowski


esbench's Issues

Add 'mapping' to config file

The bench config file, in addition to the queries which will be executed in each observation, should have a 'mapping' section providing the mapping which will be applied when the test index is created.
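
A minimal sketch of what such a config could look like, written as a Python dict; the key names and field definitions here are assumptions for illustration, not esbench's actual config format:

    # Hypothetical bench config with a 'mapping' section; key names are illustrative.
    config = {
        "queries": {
            "match_body": {"query": {"match": {"body": "%(random)s"}}},
        },
        "mapping": {
            "properties": {
                "title": {"type": "string", "index": "analyzed"},
                "body": {"type": "string", "index": "analyzed"},
            },
        },
    }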

Try to get system-specific temp dir location

As opposed to writing temp files to /tmp. The reason for not using the Python tempfile module is that these files take a while to download, so I want them to be reusable between program runs; yet, since they are large-ish, I also want the OS to be able to easily remove them as needed.
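
A minimal sketch of one way to do this with the standard library; the cached file name below is made up for illustration:

    import os
    import tempfile

    # Use the system temp location instead of hard-coded /tmp, but with a
    # stable file name so a downloaded data file can be reused across runs.
    def cached_data_path(filename="esbench_data.json"):  # name is hypothetical
        return os.path.join(tempfile.gettempdir(), filename)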

Refer to benchmarks by their id, or by their name

Currently the commands which take benchmark ids as their parameters ('dump', 'clear', 'show') accept only benchmark ids. It should also be possible to use benchmark names ('mik_bench01', etc.), which are stored in the bench.benchmark_name field.

Track query variance over all n queries per run

It may be interesting to monitor not just the total time for n queries, but also to track how long each individual query takes. This way, the variance over the set of n queries can be examined (in case variance scales differently for different queries or configs).
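
A minimal sketch of per-query timing and variance, assuming a run_query callable that stands in for however esbench executes a single query:

    import time

    def timed_queries(run_query, queries):
        # Time each query individually so variance can be computed,
        # not just the total for all n queries.
        durations = []
        for q in queries:
            t0 = time.time()
            run_query(q)
            durations.append(time.time() - t0)
        mean = sum(durations) / len(durations)
        variance = sum((d - mean) ** 2 for d in durations) / len(durations)
        return durations, mean, variance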

--no-random flag

The flag would disable insertion of the random string into bench queries via the %(random)s placeholder. This would allow for testing of caching of results.
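
A minimal sketch of the substitution, assuming the queries use the %(random)s placeholder mentioned above; the flag handling is illustrative:

    import random
    import string

    def render_query(template, no_random=False):
        # With --no-random, substitute a fixed token so repeated queries are
        # identical and results can be served from cache; otherwise use a
        # fresh random string each time.
        if no_random:
            token = "fixed"
        else:
            token = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
        return template % {"random": token}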

--sample flag for 'show' command

--sample (default 1) specifies the step at which observations are sampled. So if you have a benchmark with 50 observations and you don't want to look at them all, run 'show --sample 5' to see only 10 of them.
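
A minimal sketch of the sampling itself:

    def sample_observations(observations, step=1):
        # Every step-th observation; 50 observations with --sample 5 yields 10.
        return observations[::step]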

Query disco in 'analyze'

Analyze should be able to 'discover' all the queries (and thus the associated stats groups) for each benchmark, and extract information accordingly.

--append flag

Do not clean existing test data (not benchmarks, but the test data itself). Append new documents to the existing ones.

Use 'requests' instead of the home-grown HTTP API

The decision to use a home-made HTTP request API was made so that there would be no external dependencies. I don't think this is really a very important requirement. There is already a 'soft' dependency on 'tabulate' for output formatting; 'requests' is very solid, so we may as well use it.
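
A minimal sketch of what a 'requests'-based call could look like; the host, port, and endpoint here are illustrative, not esbench's actual internals:

    import requests

    def search(host, index, query_body):
        # Send a search to Elasticsearch and return the parsed JSON response.
        resp = requests.get("http://%s:9200/%s/_search" % (host, index), data=query_body)
        resp.raise_for_status()
        return resp.json()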

Command line options also available in config file

All the command line options should be configurable in the config file; when they are also specified on the command line, the command-line values should override the config file. This means that the config file should be loaded and parsed on every invocation of the command line, so that even if it is just the 'help' command, the defaults can be read from the config file and displayed appropriately.
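
A minimal sketch of the precedence rule with argparse; the config file name and option names below are assumptions for illustration:

    import argparse
    import json

    def parse_args(argv=None, config_path="esbench.json"):
        # Load defaults from the config file first, then let command-line
        # flags override them.
        try:
            with open(config_path) as f:
                defaults = json.load(f)
        except IOError:
            defaults = {}
        parser = argparse.ArgumentParser()
        parser.add_argument("--sample", type=int, default=defaults.get("sample", 1))
        parser.add_argument("--no-random", action="store_true",
                            default=defaults.get("no_random", False))
        return parser.parse_args(argv)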

Basic graphing

Using 'pygal' to generate SVGs. Basic capability, simple charts, but something visual.
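
A minimal sketch of a simple chart with pygal; the series name, title, and output path are placeholders:

    import pygal

    def render_chart(times, path="benchmark.svg"):
        # One line series of per-observation query times, rendered to SVG.
        chart = pygal.Line(title="query time per observation")
        chart.add("total time (s)", times)
        chart.render_to_file(path)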

Linux kernel data source

Clone and index a large repo from GitHub, such as the Linux kernel. Iterate through commit messages, do facets on them, etc. Another large repo is WebKit - 18GB.
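
A minimal sketch of pulling commit messages out of a local clone with plain git; indexing them would go through whatever insert path esbench already has:

    import subprocess

    def commit_messages(repo_path):
        # One commit subject line per commit in the cloned repository.
        out = subprocess.check_output(
            ["git", "-C", repo_path, "log", "--pretty=format:%s"])
        for line in out.decode("utf-8", "replace").splitlines():
            yield line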

Less ambiguous presentation of stats data

Specify magnitude (milliseconds?), and whether results are per query or aggregate. I looked at the results for the first time in a while and had no idea what they meant.

Refactor observations

Query objects should be assigned per-observation; this would get rid of all the weirdness with stats group names.

Named benchmarks

Ability to name benchmarks, so that they are easy to identify. Once you can dump and load benchmarks, you can end up with a lot of different benchmarks in one place, so it is important to be able to tell them apart.

Basic ES capacity planning info

Intro to ES capacity planning, goals of this project. Talk about ES indexes, Lucene indexes, Lucene segments, optimizations.

'rerun' capability

Ability to 'rerun' a benchmark, that is, to run a benchmark with the same parameters as some other benchmark. The command line arguments and the content of the 'config' file are recorded with each benchmark. So let's say a benchmark gets 'dump'-ed and put into a gist, then someone else loads it onto their machine and wants to run the same benchmark there - there could be a --rerun argument to 'run' a benchmark with the same parameters as some other previously recorded benchmark.

Make recording of per-segment stats optional

Lucene segments have unique names, and their stats are recorded in observations as, for example, "obs.segments.segments._vu.deleted_docs". This means that where there are a lot of segments, there may be thousands of separate segment fields, which makes introspection in 'elsec' really slow.

Make per-segment stat recording optional, with a '--record-segments' flag to the 'run' command.
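
A minimal sketch of the pruning, assuming the observation's segments stats nest the per-segment breakdown under a 'segments' key as in the field names quoted above:

    def prune_segments(segments_stats, record_segments=False):
        # Keep the aggregate segment stats, but drop the per-segment breakdown
        # unless the proposed --record-segments flag is set.
        if record_segments:
            return segments_stats
        pruned = dict(segments_stats)
        pruned.pop("segments", None)
        return pruned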

Normalize test data structure

Change field names from 'title', 'abstract', 'body' to something like 'short', 'medium', 'long' - this would allow for other data sources to use the same queries / setup as the default benchmark.

Record the size / count of data being inserted

Inserting 100MB of data into the index doesn't mean the index will be 100MB - it looks like the index ends up being up to 10% smaller. It would be good to capture the actual byte size of the data as it is being inserted.
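
A minimal sketch of tallying input size at insert time; 'insert' stands in for whatever call esbench uses to index a single document:

    def insert_and_count(insert, docs):
        # Record the raw byte size and count of the documents actually fed to
        # the index, so it can be reported next to the resulting index size.
        total_bytes, count = 0, 0
        for doc in docs:
            total_bytes += len(doc.encode("utf-8"))
            count += 1
            insert(doc)
        return total_bytes, count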

Add 'background' inserts

Add low-volume background inserts to be executed as observations (query runs) are being executed. Observations measure search performance; this will add ability to see how search performance varies when there are simultaneous inserts.
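
A minimal sketch of a low-volume background inserter running alongside the observations; 'insert_one' stands in for esbench's existing document insert call:

    import threading
    import time

    def start_background_inserts(insert_one, docs, interval=1.0):
        # Insert one document every 'interval' seconds in a daemon thread
        # until the returned event is set (e.g. when the observation ends).
        stop = threading.Event()

        def loop():
            for doc in docs:
                if stop.is_set():
                    break
                insert_one(doc)
                time.sleep(interval)

        t = threading.Thread(target=loop)
        t.daemon = True
        t.start()
        return stop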

Record the actual command which executed the benchmark

Record the 'command line' as part of benchmark metadata. The reason is that if you dump some benchmarks, then load them, and want to run one more that is similar, it will be much easier if you know the actual command which was used to run the other benchmarks.

Bulk insert option

Default to bulk inserts, as opposed to the current one-by-one inserts. At this time there isn't really any stats gathering around inserts - should there be?
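
A minimal sketch of building a body for Elasticsearch's _bulk endpoint instead of issuing one request per document; the index and doc type names are illustrative:

    import json

    def bulk_body(docs, index="esbench_test", doc_type="doc"):
        # The _bulk format alternates an action/metadata line with a source
        # line, newline-delimited, with a trailing newline.
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
            lines.append(json.dumps(doc))
        return "\n".join(lines) + "\n"

The resulting string would then be POSTed to the cluster's /_bulk endpoint in a single request.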

Tutorial on how to benchmark an existing deployment

Run benchmarks against your current deployment, then use 'esdump' and 'esload' to copy the data over to a new cluster / different config, and run the tests again. Also see #44 - running a benchmark against an existing node.
