
cetune's Introduction

DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Contact: [email protected]

Functionality Description

  • CeTune is a toolkit/framework to deploy, benchmark, profile, and tune Ceph cluster performance.
  • It aims to speed up the procedure of benchmarking Ceph performance and provides clear data charts of system metrics and latency breakdown data for users to analyze Ceph performance.
  • CeTune evaluates Ceph performance through three interfaces: block, file system, and object.

Maintenance


Prepare

  • One node acts as the CeTune controller (aka head); the other nodes act as CeTune workers (aka workers).
  • The head must be able to autossh (password-less ssh) to all workers, including itself, and must have a 'hosts' file containing the info of all workers; a sketch of such a file follows this list.
  • All nodes must be able to reach a yum/apt-get repository and to wget/curl from ceph.com.
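
A minimal sketch of what the hosts file could look like, assuming a standard /etc/hosts-style hostname-to-IP mapping (the hostnames and addresses below are placeholders):

# hypothetical hosts file on the head node; adjust to your cluster
192.168.1.10    head01      # CeTune controller
192.168.1.11    worker01    # CeTune worker / osd node
192.168.1.12    worker02    # CeTune worker / client node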

Installation

  • Install on the head and the workers:
# the head and workers need working apt-get, wget and pip (configure proxies if required)
apt-get install -y python
  • Install on the head:
git clone https://github.com/01org/CeTune.git

cd /CeTune/deploy/
python controller_dependencies_install.py

# make sure the head is able to autossh to all worker nodes and to 127.0.0.1
cd ${CeTune_PATH}/deploy/prepare-scripts; ./configure_autossh.sh ${host} ${ssh_password}
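
If there are many workers, the same script can be looped over each host; a minimal sketch, assuming all nodes share the same root password (the hostnames are placeholders):

# hypothetical loop over all nodes that the head must reach
cd ${CeTune_PATH}/deploy/prepare-scripts
for host in head01 worker01 worker02 127.0.0.1; do
    ./configure_autossh.sh ${host} ${ssh_password}
done
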
  • Install on the workers:
cd /CeTune/deploy/
python worker_dependencies_install.py

Start CeTune with WebUI

# install webpy python module
cd ${CeTune_PATH}/webui/ 
git clone https://github.com/webpy/webpy.git

cd webpy
python setup.py install

# run the CeTune webui
cd ${CeTune_PATH}/webui/
python webui.py

# you will see output like the below
root@client01:/CeTune/webui# python webui.py
http://0.0.0.0:8080/

Add user for CeTune

cd /CeTune/visualizer/
# show help
python user_Management.py --help

# add a user
cd /CeTune/visualizer/
python user_Management.py -o add --user_name {set username} --passwd {set passwd} --role {set user role[admin|readonly]}

# delete a user
python user_Management.py -o del --user_name {username}

# list all users
python user_Management.py -o list

# update a user role
python user_Management.py -o up --user_name {username} --role {set user role[admin|readonly]}
  • CeTune WebUI

webui.png


Configure

  • On the WebUI 'Test Configuration' page, you can specify all the configuration required for deployment and benchmarking.
  • Users are also able to directly modify conf/all.conf, conf/tuner.yaml and conf/cases.conf to do the configuration.
  • A configuration helper is available both under the 'helper' tab, right after 'User Guide', and on the configuration page itself.
  • Below is a brief intro to the purpose of each configuration file:
    • conf/all.conf
      • This is a configuration file describing the cluster and the benchmark.
    • conf/tuner.yaml
      • This is a configuration file for tuning the ceph cluster, including pool configuration, ceph.conf, disk tuning, etc.
    • conf/cases.conf
      • This is a configuration file deciding which test cases to run.

Deploy Ceph

Assuming ceph is installed on all nodes, this part demonstrates the workflow of using CeTune to deploy osd and mon daemons and bring up a healthy ceph cluster.

  • Configure node info under 'Cluster Configuration'. The keys are listed below; an equivalent conf/all.conf-style sketch follows the screenshots.
    • clean build (true / false): set true to clean the currently deployed ceph and redeploy a new cluster; set false to try to obtain the current cluster layout and add new osds to the existing cluster
    • head (${hostname}): CeTune controller node hostname
    • user (root): only root is supported currently
    • enable_rgw (true / false): set true and CeTune will also deploy radosgw; set false to deploy only osd and rbd nodes
    • list_server (${hostname1},${hostname2},...): list osd nodes here, separated by ','
    • list_client (${hostname1},${hostname2},...): list client (rbd/cosbench worker) nodes here, separated by ','
    • list_mon (${hostname1},${hostname2},...): list mon nodes here, separated by ','
    • ${server_name} (${osd_device1}:${journal_device1},${osd_device2}:${journal_device2},...): after adding nodes to 'list_server', CeTune adds new rows keyed by each server's name; add osd:journal pairs for the corresponding node, separated by ','
  • Uncheck 'Benchmark' and only check 'Deploy', then click 'Execute'

webui_deploy.png

  • The WebUI will jump to 'CeTune Status' and you will see console logs like the below

webui_deploy_detail.png
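
As a minimal sketch of what the same settings could look like when edited directly in conf/all.conf instead of through the WebUI (the key names come from the table above; the exact file syntax, hostnames and device paths here are assumptions and placeholders):

# hypothetical conf/all.conf fragment
head="head01"
user="root"
clean build="true"
enable_rgw="false"
list_server="worker01,worker02"
list_client="client01"
list_mon="head01"
worker01="/dev/sdb:/dev/sde,/dev/sdc:/dev/sde"
worker02="/dev/sdb:/dev/sde,/dev/sdc:/dev/sde"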


Benchmark Ceph

  • Users are able to configure disk_read_ahead, scheduler, etc. under the 'system' settings.
  • Ceph.conf tunings can be added under 'Ceph Tuning'; CeTune will apply them to the ceph cluster at runtime.
  • 'Benchmark Configuration' is how we control the benchmark process; a detailed explanation is given below.
    • There are two parts under 'Benchmark Configuration'.
    • The first table controls basic settings such as where to store result data and which data will be collected.
    • The second table controls which test cases will be run; users can add multiple test cases, and they will be run one by one. A sketch of what one such case corresponds to follows this list.
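
For reference, a result directory name such as 13-3-fiorbd-seqwrite-4k-qd64-2g-100-400-rbd (seen in the logs later on this page) encodes the test case parameters (engine, pattern, block size, queue depth, and so on); the exact mapping of the numeric fields is not documented here. A roughly equivalent standalone fio job for such a fiorbd seqwrite-4k-qd64 case might look like the sketch below; this is an illustration of the workload shape, not the job file CeTune generates, and the pool/image/client names are placeholders:

# hypothetical fio job approximating a fiorbd-seqwrite-4k-qd64 case
[seqwrite-4k-qd64]
ioengine=rbd          # fio's rbd engine must be available
clientname=admin      # placeholder cephx client
pool=rbd              # placeholder pool name
rbdname=testimage     # placeholder rbd image name
rw=write              # sequential write
bs=4k                 # 4 KB block size
iodepth=64            # queue depth 64
direct=1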

Check Benchmark Results

webui_result.png

webui_result_detail.png

webui_result_detail2.png


User Guidance PDF

CeTune Documents Download Url

cetune's People

Contributors

colin582511, hualongfeng, jian-zhang, liuxiaojuan, lixiaoy1, majianpeng, ningli16, pythonfucku, rdower, shaoshian, tanyy, xiaoxichen819, xinxinsh, xuechendi, yonghengdexin735, ywang19, zeroaska, zhouyuan, zhubx007


cetune's Issues

fiorbd test failed

When I start the benchmark test it fails; how do I resolve this problem?
[2017-10-17T08:33:28.541710][LOG]: start to run performance test
[2017-10-17T08:33:28.546344][LOG]: Calculate Difference between Current Ceph Cluster Configuration with tuning
[2017-10-17T08:33:33.438428][LOG]: Tuning[analyzer] is not same with current configuration
[2017-10-17T08:33:33.908256][LOG]: Tuning has applied to ceph cluster, ceph is Healthy now
[2017-10-17T08:33:36.914239][LOG]: ============start deploy============
[2017-10-17T08:33:39.591453][LOG]: Shutting down mon daemon
[2017-10-17T08:33:40.237892][LOG]: Shutting down osd daemon
[2017-10-17T08:33:40.576835][LOG]: Starting mon daemon
[2017-10-17T08:33:40.943608][LOG]: Started mon.node02 daemon on node02
[2017-10-17T08:33:41.319379][LOG]: Started mon.node03 daemon on node03
[2017-10-17T08:33:41.696928][LOG]: Started mon.node01 daemon on node01
[2017-10-17T08:33:41.697034][LOG]: Starting osd daemon
[2017-10-17T08:33:42.005184][LOG]: Started osd.0 daemon on node01
[2017-10-17T08:33:42.317962][LOG]: Started osd.1 daemon on node01
[2017-10-17T08:33:42.634781][LOG]: Started osd.2 daemon on node02
[2017-10-17T08:33:42.954045][LOG]: Started osd.3 daemon on node03
[2017-10-17T08:33:43.432979][LOG]: not need create mgr
[2017-10-17T08:33:43.443442][LOG]: Clean process log file.
[2017-10-17T08:33:43.919403][WARNING]: Applied tuning, waiting ceph to be healthy
[2017-10-17T08:33:47.403901][WARNING]: Applied tuning, waiting ceph to be healthy
[2017-10-17T08:33:50.888564][LOG]: Tuning has applied to ceph cluster, ceph is Healthy now
[2017-10-17T08:33:52.350882][LOG]: RUNID: 13, RESULT_DIR: //mnt/data//13-3-fiorbd-seqwrite-4k-qd64-2g-100-400-rbd
[2017-10-17T08:33:52.351263][LOG]: Prerun_check: check if sysstat installed
[2017-10-17T08:33:52.658104][LOG]: Prerun_check: check if blktrace installed
[2017-10-17T08:33:53.332501][LOG]: check if FIO rbd engine installed
[2017-10-17T08:33:53.720802][LOG]: check if rbd volume fully initialized
[2017-10-17T08:33:54.206562][WARNING]: Ceph cluster used data occupied: 2.698 KB, planned_space: 10485760.0 KB
[2017-10-17T08:33:54.206722][WARNING]: rbd volume initialization has not be done
[2017-10-17T08:33:54.206871][LOG]: Preparing rbd volume
[2017-10-17T08:33:55.164264][LOG]: 1 FIO Jobs starts on node02
[2017-10-17T08:33:55.474774][LOG]: 1 FIO Jobs starts on node03
[2017-10-17T08:33:55.783428][LOG]: 1 FIO Jobs starts on node01
[2017-10-17T08:33:57.122272][WARNING]: 0 fio job still runing
[2017-10-17T08:33:57.122398][ERROR]: Planed to run 0 Fio Job, please check all.conf
[2017-10-17T08:33:57.123074][ERROR]: The test has been stopped, error_log: Traceback (most recent call last):
File "/CeTune/benchmarking/mod/benchmark.py", line 46, in go
self.prerun_check()
File "/CeTune/benchmarking/mod/bblock/fiorbd.py", line 89, in prerun_check
self.prepare_images()
File "/CeTune/benchmarking/mod/bblock/fiorbd.py", line 52, in prepare_images
raise KeyboardInterrupt
KeyboardInterrupt

[ERROR]: Generating history view failed

[2015-11-05T16:56:35.588191][LOG]: Generating ceph view
[2015-11-05T16:56:35.620374][LOG]: generate ceph line chart
[2015-11-05T16:56:37.263420][LOG]: generate ceph line chart
[2015-11-05T16:56:37.300566][LOG]: generate ceph line chart
[2015-11-05T16:56:39.344226][LOG]: generate ceph line chart
[2015-11-05T16:56:41.038648][LOG]: generate ceph line chart
[2015-11-05T16:56:42.281420][LOG]: generate ceph line chart
[2015-11-05T16:56:44.422202][LOG]: generate ceph line chart
[2015-11-05T16:56:47.786397][LOG]: generate ceph line chart
[2015-11-05T16:56:47.817943][LOG]: generate ceph line chart
[2015-11-05T16:56:47.855909][LOG]: generate ceph line chart
[2015-11-05T16:56:53.526524][LOG]: Generating client view
[2015-11-05T16:56:53.559592][LOG]: generate client line chart
[2015-11-05T16:56:55.205861][LOG]: generate client line chart
[2015-11-05T16:56:57.319046][LOG]: generate client line chart
[2015-11-05T16:56:59.110973][LOG]: generate client line chart
[2015-11-05T16:57:00.355822][LOG]: generate client line chart
[2015-11-05T17:00:36.680553][LOG]: Generating history view
[2015-11-05T17:00:36.867727][ERROR]: Generating history view failed

After I run my test, I just see [ERROR]: Generating history view failed and there is no result in the result reports.
I want to know what is wrong and how to find where the problem is.

thank you

how can I use cosbench to test?

After reading CeTune Document.pdf, I am confused about the cosbench config.
Do I need to download cosbench by myself, or will cetune download cosbench itself?
Would you give me an example of how to use cosbench for a test?

Introduce <randrepeat> in fio test

randrepeat=bool 
Seed the random number generator in a predictable way so results are repeatable across runs. Default: true.

I think we would be better off defaulting this to false; every run hitting the same offsets is really not good randomness. See the sketch below.
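
As a sketch, the change amounts to the single fio option below (randrepeat is a standard fio option; where exactly CeTune would set it in its generated job files is an assumption):

# disable the repeatable random seed so offsets differ between runs
randrepeat=0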

cetune ui improvement

  1. help doc
  2. configuration description
  3. result, timestamp
  4. result filter(js)
  5. after run test description
  6. a cetune log to record cetune webui logon and script run

analyzer module refine

  1. provide an arg to decide which data to parse
  2. provide an arg to set the interval of iostat, sar, etc.
  3. fio result handling for mixed read/write

cetune stable release

need to start a branch for a cetune stable release

  1. need to check qemurbd, fiorbd (with mixed read/write), cosbench
  2. verify the result visualization
  3. bash scripts: handle hostnames containing '-'

Due date: 7/15

visualizer module refine

  1. produce an excel/csv file so data can easily be downloaded
  2. accept a schema to parse the result dict -- so we can decide which data we want to show

osd on NVMe Device error

If an NVMe SSD is chosen as the osd device, an error can occur: "cat /sys/block/nvme0n/" reports no such file or directory. I did not do any partition operation on the NVMe, so the device name is "nvme0n1". The error likely comes from tuner/tuner.py, lines 84 and 99.
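
The symptom (nvme0n1 being truncated to nvme0n) suggests the code derives the parent block device by stripping a trailing digit, which works for sda1 but not for NVMe naming. A minimal sketch of a more robust mapping, under that assumption (this is not the actual tuner.py code):

import re

def parent_block_device(dev):
    """Map a device/partition name to its parent disk, handling NVMe naming."""
    # 'nvme0n1p2' -> 'nvme0n1'; a bare namespace such as 'nvme0n1' is already a disk
    m = re.match(r'^(nvme\d+n\d+)(p\d+)?$', dev)
    if m:
        return m.group(1)
    # classic naming: 'sda1' -> 'sda', 'sdb' -> 'sdb'
    return re.sub(r'\d+$', '', dev)

print(parent_block_device('nvme0n1'))    # nvme0n1
print(parent_block_device('nvme0n1p2'))  # nvme0n1
print(parent_block_device('sda1'))       # sda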

Improve History summary

  1. sort the page by run_id and [optional] can sort by OP_TYPE
  2. use double click instead of single click in summary page

CeTune broke at haproxy step for CentOS nodes

Hello guys ,
I am running CeTune on CentOS nodes and got stuck at the haproxy step. On RHEL-based machines there is no /etc/default/haproxy configuration file, so CeTune fails. Is this a known issue?

BTW haproxy rpm is properly installed on these nodes

[2016-05-08T11:51:07.632079][LOG]: Ceph already installed on below nodes
[2016-05-08T11:51:07.632285][LOG]: Install opts is '--release '
[2016-05-08T11:51:08.581212][WARNING]: Found different configuration from conf/ceph_current_conf with your desired config : OrderedDict([('mon', {}), ('osd', {}), ('mds', {}), ('osd_num', 3), ('radosgw', [])])
[2016-05-08T11:51:08.582611][LOG]: Generating rgw ceph.conf parameters
[2016-05-08T11:51:08.582849][LOG]: configure node2-1 in ceph.conf
[2016-05-08T11:51:13.282435][LOG]: deploy radosgw instances
[2016-05-08T11:51:19.644885][LOG]: Creating rgw required pools
[2016-05-08T11:58:03.390218][LOG]: Updating haproxy configuration
[2016-05-08T11:58:05.927822][ERROR]: node2: sed: can't read /etc/default/haproxy: No such file or directory
pdsh@node1: node2: ssh exited with exit code 2

As a temporary fix I created a dummy file /etc/default/haproxy and it worked. A sketch of a distro-aware alternative follows.
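
A sketch of how the haproxy step could pick the right defaults file per distro; /etc/sysconfig/haproxy is where RHEL/CentOS packages usually keep it, but both paths and the sed edit below are assumptions about the local package layout, not CeTune's actual code:

# hypothetical distro-aware selection of the haproxy defaults file
if [ -f /etc/default/haproxy ]; then
    HAPROXY_DEFAULTS=/etc/default/haproxy      # Debian/Ubuntu
elif [ -f /etc/sysconfig/haproxy ]; then
    HAPROXY_DEFAULTS=/etc/sysconfig/haproxy    # RHEL/CentOS
else
    touch /etc/default/haproxy                 # dummy-file workaround described above
    HAPROXY_DEFAULTS=/etc/default/haproxy
fi
sed -i 's/^ENABLED=0/ENABLED=1/' ${HAPROXY_DEFAULTS}   # example edit only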

osd down when benchmark is running

I do not know if this is a problem that should be dealt with.

When the benchmark is running, an osd goes down and Ceph starts recovering, but everything proceeds as normal except for node_ceph_health.log.

In my long test history I have not paid enough attention to this, so I need to check the runs one by one.

Maybe cetune can give some hint if the health is not OK while the benchmark is running.

fio zipf support

It would be better to have fio's zipf distribution in some tests, e.g. when testing with rbd-cache. A sketch of the relevant options is below.
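
For reference, a minimal sketch of the standard fio options that enable a zipf distribution (the theta value 1.2 is just an example):

# skew random offsets with a zipf distribution instead of uniform random
rw=randread
random_distribution=zipf:1.2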

[ERROR]:analyzer Failed

Hi guys,
I tried to run the fiorbd driver using CeTune. The output logs show a small error:
[ERROR]:analyzer Failed,pls try cd analyzer;python analyzer.py --path //mnt/data//1-140-fiorbd-seqwrite-4k-qd64-10g-100-400-rbd process_data
It seems that when the py script gets the path parameter, there is more than one extra '/' symbol.
I ran this step manually and it worked fine.
Can this error be ignored?
B & R, thanks.

Add user role permit to cetune

Basically, the idea is to create an admin account and a read-only account, so a read-only user can only read data reports without changing any cetune data.

I think the trick here is that we should encrypt the data to avoid any tampering.

AttributeError: 'ThreadedDict' object has no attribute 'userrole'

Reproduce steps:

  1. Install CeTune master
  2. Install CeTune webui
  3. Add user:
    python user_Management.py -o add --user_name admin --passwd 123456 --role admin
  4. Run benchmark.

There is an error in the "python webui.py" runtime log:

10.239.44.90:63246 - - [30/Jun/2017 10:05:00] "HTTP/1.1 GET /configuration/user_role" - 500 Internal Server Error
<Storage {'timestamp': u'2017-06-30T10:04:04.898784'}>
10.239.44.90:63246 - - [30/Jun/2017 10:05:01] "HTTP/1.1 POST /monitor/tail_console" - 200 OK
get_param:<Storage {}>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/web.py-0.40.dev0-py2.7.egg/web/application.py", line 257, in process
return self.handle()
File "/usr/local/lib/python2.7/dist-packages/web.py-0.40.dev0-py2.7.egg/web/application.py", line 248, in handle
return self._delegate(fn, self.fvars, args)
File "/usr/local/lib/python2.7/dist-packages/web.py-0.40.dev0-py2.7.egg/web/application.py", line 488, in _delegate
return handle_class(cls)
File "/usr/local/lib/python2.7/dist-packages/web.py-0.40.dev0-py2.7.egg/web/application.py", line 466, in handle_class
return tocall(*args)
File "webui.py", line 75, in GET
return common.eval_args( self, function_name, web.input() )
File "/home/ning/upload/docker/CeTune/conf/common.py", line 605, in eval_args
if function_name != "":
File "webui.py", line 82, in user_role
output = session.userrole
File "/usr/local/lib/python2.7/dist-packages/web.py-0.40.dev0-py2.7.egg/web/session.py", line 68, in getattr
return getattr(self._data, name)
AttributeError: 'ThreadedDict' object has no attribute 'userrole'

Actually, when running python user_Management.py -o list,
the admin user is listed. A possible fix is sketched below.
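
The traceback shows the session object has no 'userrole' attribute when /configuration/user_role is handled before a login has set it. A minimal sketch of one way to avoid that in web.py, by giving the session a default value (whether webui.py builds its session exactly like this is an assumption):

import web

urls = ("/.*", "index")
app = web.application(urls, globals())

# give every new session a default userrole so reads never raise AttributeError
session = web.session.Session(
    app, web.session.DiskStore('sessions'),
    initializer={'userrole': None})

class index:
    def GET(self):
        # getattr with a default is an extra guard on top of the initializer
        role = getattr(session, 'userrole', None)
        return "role=%s" % role

if __name__ == "__main__":
    app.run()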

stable release

need to switch to some release system; either versions or tags would be OK.

cephfs workload

We still lack a stable and reliable cephfs workload; submitting an issue here per some user requests.

Inconsistent problem with cetune_report.db

Data from the db is not consistent with the data in dest_dir.

For example:
In my test environment there are two test results in dest_dir, but I only get one test result from the web UI. I must rm cetune_report.db in dest_dir and reload it to get a consistent result.

custom Ceph repo

Allow deploying ceph from a custom repo instead of the official one on ceph.com.

using fdisk to check whether the rbd image is attached is not reliable

The fdisk output is:

[image: screenshot of fdisk output]

but the output of lsblk is:

root@vclient01:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 20G 0 disk
├─vda1 253:1 0 3.8G 0 part [SWAP]
└─vda2 253:2 0 16.2G 0 part /

This error occurs every time I create two or more benchmark cases.
When the image is detached, the fdisk info still exists until the vclient is rebooted. A sketch of an lsblk-based check follows.
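
A sketch of a check based on lsblk instead of fdisk; lsblk reflects the kernel's current view, so a detached device should disappear from its output (the device name vdb below is a placeholder for the attached rbd-backed disk):

# list only whole-disk names known to the kernel and look for the expected device
if lsblk -d -n -o NAME | grep -qw vdb; then
    echo "vdb is attached"
else
    echo "vdb is not attached"
fi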

unsupported operand type(s) for /: 'str' and 'int', write_SN_Latency/osd_node_count

Traceback (most recent call last):
  File "analyzer.py", line 591, in <module>
    main(sys.argv[1:])
  File "analyzer.py", line 587, in main
    func()
  File "analyzer.py", line 109, in process_data
    result = self.summary_result( result )
  File "analyzer.py", line 265, in summary_result
    tmp_data["SN_Latency(ms)"] = "%.3f" % write_SN_Latency/osd_node_count
TypeError: unsupported operand type(s) for /: 'str' and 'int'
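
The traceback is consistent with a Python operator-precedence problem: % (string formatting) and / bind at the same level and associate left to right, so the expression formats write_SN_Latency into a string first and then tries to divide that string by osd_node_count. A minimal sketch of the distinction (the values are made up):

write_SN_Latency = 12.5
osd_node_count = 4

# what the reported line effectively does: ("%.3f" % write_SN_Latency) / osd_node_count
try:
    bad = "%.3f" % write_SN_Latency / osd_node_count
except TypeError as e:
    print(e)   # unsupported operand type(s) for /: 'str' and 'int'

# likely intent: divide first, then format
good = "%.3f" % (write_SN_Latency / osd_node_count)
print(good)    # 3.125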

Virtual Image example

Hi,

First, congratulations, this tool is perfect for testing. I would like to know if you have an example of vclient.tmp.img. Also, prepare-vm.sh has a path hard coded to your own environment (/home/xuechendi/remote_access/vclient.tmp.img) at line 63.

support ceph-disk when create osd daemon

In recent observation, building an osd daemon via the ceph-osd --mkfs option may miss some important steps and is bad for osd failover detection and non-root support. A new branch will be created to fully support ceph-disk in the cetune deployer module.

cetune will support prepare and activate first. A typical command sequence is sketched after the lists below.

What ceph-disk does:
Prepare:

  • create GPT partition
  • mark the partition with the ceph type uuid
  • create a file system
  • mark the fs as ready for ceph consumption
  • entire data disk is used (one big partition)
  • a new partition is added to the journal disk (so it can be easily shared)
  • triggered by administrator or ceph-deploy, e.g. 'ceph-disk prepare <data disk> [journal disk]'

Activate:

  • if encrypted, map the dmcrypt volume
  • mount the volume in a temp location
  • allocate an osd id (if needed)
  • if deactivated, no-op (activate with the --reactivate flag)
  • remount in the correct location /var/lib/ceph/osd/$cluster-$id
  • remove the deactive flag (with --reactivate flag)
  • start ceph-osd
  • triggered by udev when it sees the OSD gpt partition type
  • triggered by admin: 'ceph-disk activate <data partition>'
  • triggered on ceph service startup with 'ceph-disk activate-all'

Deactivate:

  • check partition type (support dmcrypt, mpath, normal)
  • stop ceph-osd service if needed (make osd out with option --mark-out)
  • remove 'ready', 'active', and INIT-specific files
  • create deactive flag
  • umount device and remove mount point
  • if the partition type is dmcrypt, remove the data dmcrypt map.

Destroy:

  • check partition type (support dmcrypt, mpath, normal)
  • remove OSD from CRUSH map
  • remove OSD cephx key
  • deallocate OSD ID
  • if the partition type is dmcrypt, remove the journal dmcrypt map.
  • destroy data (with --zap option)
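
For reference, a typical prepare/activate sequence with the ceph-disk CLI looks roughly like the sketch below (device paths are placeholders; exact flags depend on the ceph release):

# prepare /dev/sdb as an osd data disk with its journal on /dev/sdc
ceph-disk prepare /dev/sdb /dev/sdc

# activate the newly created data partition (udev normally triggers this automatically)
ceph-disk activate /dev/sdb1

# or activate everything that has been prepared on this host
ceph-disk activate-all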
