
clues's Introduction

CLUES

CLUES is an energy management system for High Performance Computing (HPC) clusters and Cloud infrastructures. Its main function is to power off internal cluster nodes when they are not being used, and conversely to power them on when they are needed. CLUES integrates with the cluster management middleware, such as a batch-queuing system or a cloud infrastructure management system, by means of different connectors.

CLUES also integrates with the physical infrastructure by means of different plug-ins, so that nodes can be powered on and off using the techniques that best suit each particular infrastructure (e.g. Wake-on-LAN, the Intelligent Platform Management Interface (IPMI) or Power Device Units (PDUs)).

Although some batch-queuing systems provide energy-saving mechanisms, some of the most popular choices, such as Torque/PBS, lack this capability. As far as cloud infrastructure management middleware is concerned, none of the options most commonly used in scientific environments provides similar features. The additional advantage of the approach taken by CLUES is that it can be integrated with virtually any resource manager, whether or not that manager provides energy-saving features.

Installing

To install CLUES, follow these steps:

Prerequisites

You need a Python interpreter and the easy_install command-line tool. On Ubuntu, you can install them with:

$ apt-get -y install python python-setuptools

Git is also needed, to get the source code:

$ apt-get -y install git

Now you need to install cpyutils:

$ git clone https://github.com/grycap/cpyutils
$ mv cpyutils /opt
$ cd /opt/cpyutils
$ python setup.py install --record installed-files.txt

If you want, you can safely remove the /opt/cpyutils folder afterwards, but it is recommended to keep the installed-files.txt file so that cpyutils can be uninstalled later.

Finally, you need to install two Python modules, ply and web.py, using easy_install:

$ easy_install ply web.py

Installing CLUES

First of all, get the CLUES source code and install it:

$ git clone https://github.com/grycap/clues
$ mv clues /opt
$ cd /opt/clues
$ python setup.py install --record installed-files.txt

If you want, you can safely remove the /opt/clues folder afterwards, but it is recommended to keep the installed-files.txt file so that CLUES can be uninstalled later.
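Because both cpyutils and CLUES were installed with the --record flag, uninstalling later amounts to deleting every path listed in the record file. A minimal sketch of that pattern, demonstrated with throwaway files under /tmp instead of a real installation:

```shell
# Simulate a recorded installation with two throwaway files
# (all paths here are illustrative).
mkdir -p /tmp/record-demo
touch /tmp/record-demo/module.py /tmp/record-demo/script
printf '%s\n' /tmp/record-demo/module.py /tmp/record-demo/script \
    > /tmp/record-demo/installed-files.txt

# The uninstall itself: remove every path listed in the record file.
# For a real uninstall you would point this at the installed-files.txt
# kept from "python setup.py install --record installed-files.txt".
xargs rm -f < /tmp/record-demo/installed-files.txt
```

Note that rm -f only removes files; any directories created by the installation would have to be cleaned up separately.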

Now you must configure CLUES, as it will not work without a valid configuration.

Configuring CLUES

You need an /etc/clues2/clues2.cfg file. You can start from the provided template:

$ cd /etc/clues2
$ cp clues2.cfg-example clues2.cfg

Now you can edit the /etc/clues2/clues2.cfg and adjust its parameters for your specific deployment.

The most important parameters that you MUST adjust are LRMS_CLASS, POWERMANAGER_CLASS and SCHEDULER_CLASSES.

For the LRMS_CLASS you have different options available (you MUST state one and only one of them):

  • cluesplugins.one, designed to work in an OpenNebula deployment.
  • cluesplugins.pbs, designed to work in a Torque/PBS environment.
  • cluesplugins.sge, designed to work in an SGE-like environment.
  • cluesplugins.slurm, designed to work in a SLURM environment.
  • cluesplugins.mesos, designed to work in a Mesos environment.
  • cluesplugins.kubernetes, designed to work in a Kubernetes environment.
  • cluesplugins.nomad, designed to work in a Nomad environment.

For the POWERMANAGER_CLASS you have different options available (you MUST state one and only one of them):

  • cluesplugins.ipmi, to power working nodes on or off in a physical infrastructure using IPMI calls.
  • cluesplugins.wol, to power on working nodes in a physical infrastructure using Wake-on-LAN calls, and to power them off using password-less SSH connections.
  • cluesplugins.one, to create and destroy virtual machines as working nodes in an OpenNebula IaaS.
  • cluesplugins.onetemplate, to create and destroy virtual machines as working nodes in an OpenNebula IaaS (creating the template inline instead of using existing templates).
  • cluesplugins.im, designed to work in a multi-IaaS environment managed by the Infrastructure Manager (IM).

Finally, you should state the CLUES schedulers that you want to use. SCHEDULER_CLASSES is a comma-separated ordered list; the schedulers are called in the order in which they are listed.

For the SCHEDULER_CLASSES parameter you have the following options available:

  • clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, which reacts upon the requests for resources from the underlying middleware: it takes the requests into account and powers on nodes if needed.
  • clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, which monitors the jobs in the LRMS and powers on resources if jobs stay in the queue for too long.
  • clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, which powers off nodes that have been IDLE for a period of time.
  • clueslib.schedulers.CLUES_Scheduler_PowOn_Free, which keeps extra empty slots or nodes available.

Each LRMS, power manager and scheduler has its own options, which should be properly configured.

Example configuration with SLURM

In this example we integrate CLUES into a working SLURM 16.05.8 deployment, which is prepared to power the working nodes on and off using IPMI. In the following steps we configure CLUES to monitor the SLURM deployment and to intercept the requests for new jobs made with sbatch.

First, we must set the proper values in /etc/clues2/clues2.cfg. The most important ones are:

[general]
CONFIG_DIR=conf.d
LRMS_CLASS=cluesplugins.slurm
POWERMANAGER_CLASS=cluesplugins.ipmi
MAX_WAIT_POWERON=300
...
[monitoring]
COOLDOWN_SERVED_REQUESTS=300
...
[scheduling]
SCHEDULER_CLASSES=clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, clueslib.schedulers.CLUES_Scheduler_PowOn_Free
IDLE_TIME=600
RECONSIDER_JOB_TIME=600
EXTRA_SLOTS_FREE=0
EXTRA_NODES_PERIOD=60
  • CONFIG_DIR is the folder (relative to the CLUES configuration folder, /etc/clues2) whose *.cfg files are considered part of the configuration (e.g. the configuration of the plugins).
  • LRMS_CLASS is set to use the SLURM plugin to monitor the deployment.
  • POWERMANAGER_CLASS is set to use IPMI to power the working nodes on and off.
  • MAX_WAIT_POWERON is an upper bound on the time a working node may take to power on and become ready, counted from the IPMI power-on command (in our case, 5 minutes). If this time is exceeded, CLUES considers that the working node has failed to power on.
  • COOLDOWN_SERVED_REQUESTS is the time during which the resources of a request remain booked by CLUES once the request has been attended (e.g. some working nodes have been powered on). It accounts for the time that passes between a request being served and the job actually being scheduled onto a node.
  • SCHEDULER_CLASSES lists the power-management features that we want for the deployment. In this case we react upon requests, and we also reconsider the resource requests of jobs that stay in the queue for too long. We power off working nodes that have been idle for too long, but keep some slots free.
  • IDLE_TIME relates to CLUES_Scheduler_PowOff_IDLE: the time a working node must be idle before it is considered for power-off.
  • RECONSIDER_JOB_TIME relates to the CLUES_Scheduler_Reconsider_Jobs scheduler and states how long (in seconds) a job has to stay in the queue before its resource requests are reconsidered.
  • EXTRA_SLOTS_FREE relates to the CLUES_Scheduler_PowOn_Free scheduler and states how many slots should be kept free in the platform.
  • EXTRA_NODES_PERIOD also relates to CLUES_Scheduler_PowOn_Free and states how often that scheduler runs; it is not executed continuously, to avoid reacting to transient allocations.

Once this file is configured, we can use the templates in the /etc/clues2/conf.d folder to configure the SLURM and IPMI plugins, creating the corresponding files:

$ cd /etc/clues2/conf.d/
$ cp plugin-slurm.cfg-example plugin-slurm.cfg         
$ cp plugin-ipmi.cfg-example plugin-ipmi.cfg         

You should check the variables in the /etc/clues2/conf.d/plugin-slurm.cfg file so that they match your platform, although the default values may suit you. The expected settings include the commands used to get information about the nodes, queues, jobs, etc.

In /etc/clues2/conf.d/plugin-ipmi.cfg we should check the variables IPMI_HOSTS_FILE, IPMI_CMDLINE_POWON and IPMI_CMDLINE_POWOFF, and set them to the proper values for your deployment.

[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lan -H %%a -P "" power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lan -H %%a -P "" power off

The ipmi.hosts file should be located in the /etc/clues2/ folder and contains the correspondence between the IPMI IP addresses and the names of the hosts as they appear in SLURM, using the well-known /etc/hosts file format. An example of this file follows, where the first column is the IPMI IP address and the second column is the host name:

192.168.1.100   niebla01
192.168.1.102   niebla02
192.168.1.103   niebla03
192.168.1.104   niebla04

Then you should adjust the command lines used for powering the working nodes on and off via IPMI. The default configuration uses the common ipmitool application with a password-less connection to the IPMI interface. When adjusting the command line, you can use %%a as a placeholder for the IP address and %%h for the hostname.
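As an illustration (the substitution is performed internally by CLUES; the sed call below is only a sketch of the same expansion), this is how the default power-on command line would expand for the first host of the ipmi.hosts example above (192.168.1.100, niebla01):

```shell
# %%a is replaced by the IPMI IP address, %%h by the hostname.
CMDLINE='/usr/bin/ipmitool -I lan -H %%a -P "" power on'
echo "$CMDLINE" | sed -e 's/%%a/192.168.1.100/g' -e 's/%%h/niebla01/g'
# /usr/bin/ipmitool -I lan -H 192.168.1.100 -P "" power on
```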

The SLURM add-on is based on substituting the sbatch command with the CLUES sbatch wrapper, which checks whether new nodes are needed and then calls the original SLURM sbatch command to queue the jobs. To set this up, rename the original sbatch command to sbatch.o and then copy the CLUES wrapper in its place:

# In the case of debian based distributions (e.g. ubuntu)
mv /usr/local/bin/sbatch /usr/local/bin/sbatch.o
cp /usr/local/bin/clues-slurm-wrapper /usr/local/bin/sbatch

# In the case of red-hat based distributions (e.g. fedora, scientific linux)
mv /usr/bin/sbatch /usr/bin/sbatch.o
cp /usr/local/bin/clues-slurm-wrapper /usr/bin/sbatch

Take into account that users who run sbatch must be able to read the CLUES configuration.

Example configuration with ONE

In this example we integrate CLUES into an OpenNebula 4.8 deployment, which is prepared to power the working nodes on and off using IPMI. In the following steps we configure CLUES to monitor the ONE deployment and to intercept the requests for new VMs.

First, we must set the proper values in /etc/clues2/clues2.cfg. The most important ones are:

[general]
CONFIG_DIR=conf.d
LRMS_CLASS=cluesplugins.one
POWERMANAGER_CLASS=cluesplugins.ipmi
MAX_WAIT_POWERON=300
...
[monitoring]
COOLDOWN_SERVED_REQUESTS=300
...
[scheduling]
SCHEDULER_CLASSES=clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, clueslib.schedulers.CLUES_Scheduler_PowOn_Free
IDLE_TIME=600
RECONSIDER_JOB_TIME=600
EXTRA_SLOTS_FREE=0
EXTRA_NODES_PERIOD=60
  • CONFIG_DIR is the folder (relative to the CLUES configuration folder, /etc/clues2) whose *.cfg files are considered part of the configuration (e.g. the configuration of the plugins).
  • LRMS_CLASS is set to use the ONE plugin to monitor the deployment.
  • POWERMANAGER_CLASS is set to use IPMI to power the working nodes on and off.
  • MAX_WAIT_POWERON is an upper bound on the time a working node may take to power on and become ready, counted from the IPMI power-on command (in our case, 5 minutes). If this time is exceeded, CLUES considers that the working node has failed to power on.
  • COOLDOWN_SERVED_REQUESTS is the time during which the resources requested for a VM remain booked by CLUES once the request has been attended (e.g. some working nodes have been powered on). It accounts for the time that passes from when a VM is submitted to ONE until the VM is finally deployed onto a working node. In the case of ONE, once the VM is hosted on a node this countdown is cancelled (this does not happen with other LRMSs).
  • SCHEDULER_CLASSES lists the power-management features that we want for the deployment. In this case we react upon requests, and we also reconsider the resource requests of jobs that stay in the queue for too long. We power off working nodes that have been idle for too long, but keep some slots free.
  • IDLE_TIME relates to CLUES_Scheduler_PowOff_IDLE: the time a working node must be idle before it is considered for power-off.
  • RECONSIDER_JOB_TIME relates to the CLUES_Scheduler_Reconsider_Jobs scheduler and states how long (in seconds) a job has to stay in the queue before its resource requests are reconsidered.
  • EXTRA_SLOTS_FREE relates to the CLUES_Scheduler_PowOn_Free scheduler and states how many slots should be kept free in the platform.
  • EXTRA_NODES_PERIOD also relates to CLUES_Scheduler_PowOn_Free and states how often that scheduler runs; it is not executed continuously, to avoid reacting to transient allocations.

Once this file is configured, we can use the templates in the /etc/clues2/conf.d folder to configure the ONE and IPMI plugins, creating the corresponding files:

$ cd /etc/clues2/conf.d/
$ cp plugin-one.cfg-example plugin-one.cfg         
$ cp plugin-ipmi.cfg-example plugin-ipmi.cfg         

In /etc/clues2/conf.d/plugin-one.cfg we should check the variables ONE_XMLRPC and ONE_AUTH, and set them to the proper values for your deployment. The credentials in the ONE_AUTH variable should belong to a user in the oneadmin group (you can use the oneadmin user or create a new user in ONE).

[ONE LRMS]
ONE_XMLRPC=http://localhost:2633/RPC2
ONE_AUTH=clues:cluespass

In /etc/clues2/conf.d/plugin-ipmi.cfg we should check the variables IPMI_HOSTS_FILE, IPMI_CMDLINE_POWON and IPMI_CMDLINE_POWOFF, and set them to the proper values for your deployment.

[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lan -H %%a -P "" power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lan -H %%a -P "" power off

The ipmi.hosts file should be located in the /etc/clues2/ folder and contains the correspondence between the IPMI IP addresses and the names of the hosts as they appear in ONE, using the well-known /etc/hosts file format. An example of this file follows, where the first column is the IPMI IP address and the second column is the host name as it appears in ONE:

192.168.1.100   niebla01
192.168.1.102   niebla02
192.168.1.103   niebla03
192.168.1.104   niebla04

Then you should adjust the command lines used for powering the working nodes on and off via IPMI. The default configuration uses the common ipmitool application with a password-less connection to the IPMI interface. When adjusting the command line, you can use %%a as a placeholder for the IP address and %%h for the hostname.

Hooks system

The hooks mechanism of CLUES makes it possible to call specific applications when different events happen in the system, e.g. when a node is powered on or off. One immediate application of this system is to send an e-mail to the admin when a node has failed to power on.

Hooks are custom external scripts (or applications) that are executed when certain events happen. CLUES allows defining the following hooks:

  • Prior to executing the power_on action: ./PRE_POWERON
  • After the power_on action has been executed: ./POST_POWERON <0: failed | 1: succeeded>
  • Prior to executing the power_off action: ./PRE_POWEROFF
  • After the power_off action has been executed: ./POST_POWEROFF <0: failed | 1: succeeded>
  • The state of the node has unexpectedly changed from OFF to ON: ./UNEXPECTED_POWERON
  • The state of the node has unexpectedly changed from ON to OFF: ./UNEXPECTED_POWEROFF
  • A node was requested to power off, but after a while is still detected as ON: ./ONERR
  • A node was requested to power on, but after a while is still detected as OFF: ./OFFERR
  • A node is finally detected as ON after being requested to power on: ./POWEREDON
  • A node is finally detected as OFF after being requested to power off: ./POWEREDOFF
  • A node has gone missing from the monitoring system: ./UNKNOWN
  • A node transitions from the used state to the idle state: ./IDLE
  • A node transitions from the idle state to the used state: ./USED
  • A request for resources is queued in the system: ./REQUEST <; separated specific requests expressions>
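As an example of the kind of hook mentioned above, here is a minimal POST_POWERON sketch. It assumes that the hook receives the node name followed by the 0/1 success flag, and it logs failed power-ons to a file; the paths are illustrative, and from here you could just as easily send an e-mail to the admin:

```shell
# Create the hook script (location and arguments are assumptions
# for illustration purposes).
cat > /tmp/POST_POWERON <<'EOF'
#!/bin/sh
NODE="$1"        # node name
SUCCEEDED="$2"   # 0: failed | 1: succeeded
if [ "$SUCCEEDED" = "0" ]; then
    # Replace the echo with e.g. "mail -s ..." to alert the admin.
    echo "failed to power on node $NODE" >> /tmp/clues-hook.log
fi
EOF
chmod +x /tmp/POST_POWERON

# Simulate the call CLUES would make after a failed power-on:
/tmp/POST_POWERON wn1 0
tail -1 /tmp/clues-hook.log
# failed to power on node wn1
```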

Reports

CLUES has a report generator, created to help you monitor your infrastructure from the CLUES point of view.

The reports that CLUES generates provide the following information:

  • Graphs that show the state of the nodes during a period of time.
  • Graphs of usage of slots and memory (per node, and accumulated).
  • Details about the usage of each node.
  • Stats about the requests that CLUES has received and attended.

CLUES provides reports in the form of web pages, so you will need a browser to open them. Once opened, the reports web page will look like the following one:

The CLUES reports web page

Refer to the Reports documentation to get more information about how to create the reports.

Troubleshooting

You can get information from the CLUES log file (i.e. /var/log/clues2/clues2.log). You can also set LOG_FILE to an empty value in the /etc/clues2/clues2.cfg file and execute CLUES in the foreground:

$ /usr/bin/python /usr/local/bin/cluesserver

In the logging output you can find useful messages for debugging what is happening. Here we highlight some common issues.

Wrong ONE configuration

Some messages like

[DEBUG] 2015-06-18 09:41:57,551 could not contact to the ONE server
[WARNING] 2015-06-18 09:41:57,551 an error occurred when monitoring hosts (could not get information from ONE; please check ONE_XMLRPC and ONE_AUTH vars)

usually mean that either the URL pointed to by ONE_XMLRPC is wrong (or not reachable) or the ONE_AUTH credentials do not have enough privileges.

In a distributed configuration, the ONE server may not be reachable from outside the localhost.

Lack of permission

When using the client, a message like the following

$ clues status
Could not get the status of CLUES (Error checking the secret key. Please check the configuration file and the CLUES_SECRET_TOKEN setting)

is usually a symptom that the CLUES command line does not have permission to read clues2.cfg. Please check that users are able to read the CLUES configuration.

clues's People

Contributors

alldaudinot, amcaar, dealfonso, micafer, serlophug, srisco


clues's Issues

CLUES host cannot be changed

The CLUES listening port can be changed but the host is always "localhost":

server = cpyutils.rpcweb.XMLRPCServer("localhost", configserver._CONFIGURATION_GENERAL.CLUES_PORT, web_class = clues_web_server)

Minimum number of idle resources

Hi Carlos,

I wonder if CLUES can be extended to guarantee that a minimum number of machines are powered on. I want to guarantee that I can deal with a job with no delay, assuming the cost of always keeping one extra node up.

The rationale would be to keep always one idle node (provided that we do not reach the maximum). If a node becomes used for a given time, CLUES would power on a new node. For the powering off, the behaviour will be quite similar to the current one but ensuring that always one idle node is available. If there is more than one idle node for more than a given period, one node can be powered off.

This would be especially interesting when combined with EC3, due to the delay on the node configuration.

Thanks!

PBS job is detected, but nodes do not power on

Dear team,

I am trying to set up CLUES for an HPC cluster using Torque with PBS as the queue manager. After some configuring, I managed to start the CLUES server successfully with the PowerOn_Requests scheduler enabled. I can also see that the queue is detected, because when I submit a new job, it is shown in the CLUES server output. However, after the job is detected, no nodes are powered on to start working. All nodes stay powered off and the job remains in the queue, waiting for resources. I tested the poweron and poweroff commands via the clues CLI (I'm using IPMI to command the nodes) and that is working properly.

Any idea what the problem could be?
I hope this project is still being monitored, as it would be very useful!

Best regards

send error (connection refused)

Hello CLUES crew,

I have installed and configured the clues service as explained in the instructions. I have SGE (version 8.1.9) and ipmitool (version 1.8.18), both running fine on CentOS 7. When I start the service with "systemctl start cluesd.service", everything looks normal and "systemctl status cluesd.service" shows the service is active, but when I look into /var/log/clues2/clues2.log, I have the following error:

=============
[CLUES]; INFO;2021-12-20 15:04:05,299;1640009045.300;not monitoring jobs due to configuration (var PERIOD_MONITORING_JOBS)
root;ERROR;2021-12-20 15:04:05,306; Error in command "/opt/gridengine/bin/lx-amd64/qconf -shgrpl"
root;ERROR;2021-12-20 15:04:05,306; Return code was: 1
root;ERROR;2021-12-20 15:04:05,306; Error output was:
error: commlib error: got select error (Connection refused)
unable to send message to qmaster using port 6444 on host "carbon.local": got send error

[PLUGIN-SGE];ERROR;2021-12-20 15:04:05,306;1640009045.306;could not get information about the hosts:
root;ERROR;2021-12-20 15:04:05,312; Error in command "/opt/gridengine/bin/lx-amd64/qhost -xml -q"
root;ERROR;2021-12-20 15:04:05,312; Return code was: 1
root;ERROR;2021-12-20 15:04:05,312; Error output was:
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "carbon.local": got send error

Do you have any idea what is wrong or how can I solve this issue?
Thanks in advance mahdi

Error connecting clues to mysql

Can't start clues with a MySQL database. It seems like string parsing is broken in cpyutils, or a wrong sample format is provided in the configuration file.

Connection string in clues2.cfg file:
DB_CONNECTION_STRING=mysql://clues:clpAss@cluesdb/clues

In log I get errors:
...

WARNING: Couldn't write lextab module 'lextab'. [Errno 20] Not a directory: '/usr/local/lib/python2.7/dist-packages/cpyutils-0.24-py2.7.egg/cpyutils/lextab.py'
WARNING: Couldn't create 'parsetab'. [Errno 20] Not a directory: '/usr/local/lib/python2.7/dist-packages/cpyutils-0.24-py2.7.egg/cpyutils/parsetab.py'
...

Traceback (most recent call last):
  File "/usr/bin/cluesserver", line 569, in <module>
    main_loop()
  File "/usr/bin/cluesserver", line 560, in main_loop
    CLUES_DAEMON = CluesDaemon(PLATFORM, active_schedulers)
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 185, in __init__
    self._db_system = DBSystem.create_from_connection_string()
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 57, in create_from_connection_string
    return DBSystem(connection_string)
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 61, in __init__
    self._db = cpyutils.db.DB.create_from_string(connection_string)
  File "build/bdist.linux-x86_64/egg/cpyutils/db.py", line 80, in create_from_string
NameError: global name 'db' is not defined

If I use an IP address for the host in the connection string, I get the last 2 lines of the same error:

  File "build/bdist.linux-x86_64/egg/cpyutils/db.py", line 95, in create_from_string
cpyutils.db.MalformedConnectionString: Connection string format not recognised

Latest master branches of clues and cpyutils on Ubuntu 16.04

Error with command "clues shownode <node_name>"

Hi,

the command "clues shownode <node_name>" is causing the following error:

ubuntu@slurmserverpublic:~$ clues shownode wn1
WARNING: Couldn't write lextab module 'lextab'. [Errno 20] Not a directory: '/usr/local/lib/python2.7/dist-packages/cpyutils-0.24-py2.7.egg/cpyutils/lextab.py'
WARNING: Couldn't create 'parsetab'. [Errno 20] Not a directory: '/usr/local/lib/python2.7/dist-packages/cpyutils-0.24-py2.7.egg/cpyutils/parsetab.py'
Traceback (most recent call last):
  File "/usr/local/bin/clues", line 222, in <module>
    main_function()
  File "/usr/local/bin/clues", line 219, in main_function
    p.self_service(True)
  File "build/bdist.linux-x86_64/egg/cpyutils/parameters.py", line 501, in self_service
  File "build/bdist.linux-x86_64/egg/cpyutils/parameters.py", line 479, in autocall_ops
  File "/usr/local/bin/clues", line 115, in shownode
    node = Node.fromxml(Node("",0,0,0,0), text)
  File "/usr/local/lib/python2.7/dist-packages/clueslib/helpers.py", line 186, in fromxml
    reference_object.__dict__[v] = int(n.__dict__[v])
ValueError: invalid literal for int() with base 10: '1073741824.0'

Any idea of what's happening there? Thank you in advance.

Clues receives infinite requests when a node is idle

How to reproduce the error:

  • Launch a cluster with EC3 (using the ansible role recipes ('devel' branch of ec3 repo)):
./ec3 launch mycluster nfs slurm ubuntu14 -a auth.dat -u http://servproject.i3m.upv.es:8899 -y
  • Launch a node inside the cluster:
clues poweron wn1
  • When the node is idle launch a job:
- cat >> test.sh << EOF
  #!/bin/bash
  date > date.out
  /bin/hostname
  EOF
- sbatch test.sh

If you check the clues log you can see that the clues daemon is receiving infinite requests.
I've attached a complete clues log that has been generated due to this issue.
clues2.txt

Latest monitoring data does not retrieve latest data

In clueslib/cluesd.py, line 112, the following SELECT query does not return the latest data in MySQL:

SELECT MAX(TIMESTAMP), m.*, d.enabled FROM host_monitoring AS m LEFT JOIN hostdata AS d ON m.name=d.name GROUP BY m.name

Only max(timestamp) is actually the latest value; all the other columns show the values of the first row for every node.

Also, this query is very inefficient: there are no indexes over the timestamp or host_monitoring.name columns. The LEFT JOIN isn't really necessary in my opinion.

NameError: global name 'schedulers' is not defined

When cluesd fails to power on node, it exits with error:

[NODE];DEBUG;2016-11-01 15:52:47,850;1478008367.850;failed to power on node vml-vm1 (3 fails)
Traceback (most recent call last):
  File "/usr/bin/cluesserver", line 569, in <module>
    main_loop()
  File "/usr/bin/cluesserver", line 566, in main_loop
    CLUES_DAEMON.loop(options.RT_MODE)
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 557, in loop
    cpyutils.eventloop.get_eventloop().loop()
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 299, in loop
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 138, in call
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 101, in call
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 500, in _schedulers_pipeline
    if not scheduler.schedule(self._requests_queue, monitoring_info, candidates_on, candidates_off):
  File "/usr/local/lib/python2.7/dist-packages/clueslib/schedulers.py", line 606, in schedule
    if node.power_on_operation_failed < schedulers.config_scheduling.RETRIES_POWER_ON:
NameError: global name 'schedulers' is not defined
cluesd.service: Main process exited, code=exited, status=1/FAILURE

Should line 606 in clueslib/schedulers.py just be
if node.power_on_operation_failed < config_scheduling.RETRIES_POWER_ON:
?

Error running cluesserver command

Hi

We are playing around with CLUES in our OpenNebula cluster. We have followed the doc example to use our ONE endpoint (clues and oned are running on the same CentOS 7 machine), but we get this error when running the command:

# /usr/bin/cluesserver

http://0.0.0.0:8000/
Traceback (most recent call last):
  File "/usr/bin/cluesserver", line 639, in <module>
    main_loop()
  File "/usr/bin/cluesserver", line 636, in main_loop
    CLUES_DAEMON.loop(options.RT_MODE)
  File "/usr/lib/python2.7/site-packages/clueslib/cluesd.py", line 638, in loop
    cpyutils.eventloop.get_eventloop().loop()
  File "/usr/lib/python2.7/site-packages/cpyutils/eventloop.py", line 258, in loop
    e = self.execute_next_event()
  File "/usr/lib/python2.7/site-packages/cpyutils/eventloop.py", line 226, in execute_next_event
    pe_execute.call()
  File "/usr/lib/python2.7/site-packages/cpyutils/eventloop.py", line 77, in call
    Event.call(self)
  File "/usr/lib/python2.7/site-packages/cpyutils/eventloop.py", line 55, in call
    self.callback(*self.parameters)
  File "/usr/lib/python2.7/site-packages/clueslib/cluesd.py", line 288, in _monitor_lrms_nodes
    lrms_nodelist = self._platform.get_nodeinfolist()
  File "/usr/lib/python2.7/site-packages/clueslib/platform.py", line 44, in get_nodeinfolist
    return self._lrms.get_nodeinfolist()
  File "/usr/lib/python2.7/site-packages/cluesplugins/one.py", line 322, in get_nodeinfolist
    hosts = self._one.get_hosts()
  File "/usr/lib/python2.7/site-packages/cpyutils/oneconnect.py", line 224, in get_hosts
    return HOST_POOL(str_out).HOST
  File "/usr/lib/python2.7/site-packages/cpyutils/xmlobject.py", line 111, in __init__
    self._parse(xml_str, parameters)
  File "/usr/lib/python2.7/site-packages/cpyutils/xmlobject.py", line 81, in _parse
    newObj = className(obj.toxml())
  File "/usr/lib/python2.7/site-packages/cpyutils/oneconnect.py", line 82, in __init__
    self.keywords.update(self.TEMPLATE.get_kws_dict())
  File "/usr/lib/python2.7/site-packages/cpyutils/xmlobject.py", line 139, in get_kws_dict
    kw_dict[kw] = _to_numeric(self.__dict__[kw])
  File "/usr/lib/python2.7/site-packages/cpyutils/xmlobject.py", line 114, in _to_numeric
    v = float(value)
TypeError: float() argument must be a string or a number

Do you know where the issue is? We are using ONE 4.12.1 in our case.

Thanks in advance!
Alvaro

Hook System

Version 1 of CLUES had a hook system, which is not implemented in CLUES2.

We are including a hook system that will execute external applications when certain events happen. At this time, we are considering:

  • On Pre POWON Node
  • On Post POWON Node
  • On POWONERR Node
  • On Pre POWOFF Node
  • On Post POWOFF Node
  • On POWOFFERR Node
  • On ERR Node
  • On REQUEST
  • On IDLE
  • On USED

Some other events may be considered.

Using gres resources

Hello,

I see that CLUES detects CPU and RAM resources, but I need to know how it can detect GPUs too, because I have 16-core machines with 8 GPUs: CLUES handles GPU jobs as if they only needed CPUs, and SLURM places the jobs accordingly, so the GPU nodes stay free (and down) while the used nodes have CPUs available but no GPUs.

Thanks again

KeyError: 'MinMemoryNode' in SLURM plugin

Traceback (most recent call last):
  File "/bin/cluesserver", line 618, in <module>
    main_loop()
  File "/bin/cluesserver", line 615, in main_loop
    CLUES_DAEMON.loop(options.RT_MODE)
  File "/usr/lib/python2.7/site-packages/clueslib/cluesd.py", line 604, in loop
    cpyutils.eventloop.get_eventloop().loop()
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 299, in loop
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 138, in call
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 101, in call
  File "/usr/lib/python2.7/site-packages/clueslib/cluesd.py", line 382, in _monitor_lrms_nodes_and_jobs
    self._monitor_lrms_jobs()
  File "/usr/lib/python2.7/site-packages/clueslib/cluesd.py", line 332, in _monitor_lrms_jobs
    lrms_jobinfolist = self._platform.get_jobinfolist()
  File "/usr/lib/python2.7/site-packages/clueslib/platform.py", line 50, in get_jobinfolist
    return self._lrms.get_jobinfolist()
  File "/usr/lib/python2.7/site-packages/cluesplugins/slurm.py", line 280, in get_jobinfolist
    memory = _translate_mem_value(job["MinMemoryNode"] + ".MB")
KeyError: 'MinMemoryNode'

Power On nodes

Good morning, I am working on a final-year project (TFG) for the University of Valladolid, and I am having trouble getting CLUES to power on the cluster nodes when they are needed by the jobs submitted to the queue.

clues does not enable a previously disabled node

If you disable a node with the command:

clues disable node1

And then you want to enable it again with:

clues enable node1

There is no error, but the node is not enabled; it remains in the "disabled" state.

Power on/off order

Since servers often do not all have the same capacity, it would be nice if we could define some kind of rank or order in which to power servers on or off.
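One way to express such an ordering would be a per-node priority that CLUES consults when choosing which powered-off nodes to start. This is only a sketch of the idea; the function and the user-assigned weights do not exist in CLUES today.

```python
def pick_nodes_to_power_on(candidates, weights, count):
    """Return up to `count` node names, highest priority first.

    `candidates` is a list of powered-off node names and `weights`
    maps node name -> priority (larger values power on earlier).
    Nodes without an explicit weight default to priority 0."""
    ordered = sorted(candidates, key=lambda n: weights.get(n, 0), reverse=True)
    return ordered[:count]
```

The same ordering, reversed, could be used to decide which nodes to power off first.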

Cluesd will crash / won't start if node is set to offline

In OpenNebula 5.2, if I set a node's state to offline, cluesd crashes and won't start again while the host is offline.
If I set the node to online or disabled, it can be started again.

Here's relevant error in log:

Traceback (most recent call last):
  File "/usr/bin/cluesserver", line 569, in <module>
    main_loop()
  File "/usr/bin/cluesserver", line 566, in main_loop
    CLUES_DAEMON.loop(options.RT_MODE)
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 557, in loop
    cpyutils.eventloop.get_eventloop().loop()
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 299, in loop
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 138, in call
  File "build/bdist.linux-x86_64/egg/cpyutils/eventloop.py", line 101, in call
  File "/usr/local/lib/python2.7/dist-packages/clueslib/cluesd.py", line 226, in _monitor_lrms_nodes
    lrms_nodelist = self._platform.get_nodeinfolist()
  File "/usr/local/lib/python2.7/dist-packages/clueslib/platform.py", line 45, in get_nodeinfolist
    return self._lrms.get_nodeinfolist()
  File "/usr/local/lib/python2.7/dist-packages/cluesplugins/one.py", line 323, in get_nodeinfolist
    hosts = self._one.get_hosts()
  File "build/bdist.linux-x86_64/egg/cpyutils/oneconnect.py", line 224, in get_hosts
  File "build/bdist.linux-x86_64/egg/cpyutils/xmlobject.py", line 111, in __init__
  File "build/bdist.linux-x86_64/egg/cpyutils/xmlobject.py", line 81, in _parse
  File "build/bdist.linux-x86_64/egg/cpyutils/oneconnect.py", line 72, in __init__
KeyError: 8
cluesd.service: Main process exited, code=exited, status=1/FAILURE

Add a new state: USED_ERR

There is an ON_ERR state, but CLUES has no means to inform the user whether the node is being used or not. The user may check whether the resources are in use, but when considering a node for recovery (e.g. trying to power it off again) it is tedious to check again and again whether the node is used or idle. A new state, USED_ERR, should be considered (and ON_ERR renamed to IDLE_ERR).

Error in Nomad Plugin

For each node, the function get_nodeinfolist() returns the real usage of memory and CPU instead of the amount of resources currently reserved for jobs on the node.
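The reserved amounts could instead be derived by summing the resources requested by the allocations placed on the node. The sketch below assumes a simplified version of Nomad's allocation objects (a `Resources` dict with `CPU` in MHz and `MemoryMB`, plus a `ClientStatus` field); it is not the plugin's actual code.

```python
def reserved_resources(allocations):
    """Sum the CPU (MHz) and memory (MB) requested by the running
    allocations on a node, ignoring live usage.

    Each allocation is assumed to carry a 'Resources' dict with
    'CPU' and 'MemoryMB' keys (simplified Nomad allocation shape)."""
    cpu = 0
    mem = 0
    for alloc in allocations:
        # Only allocations that are actually running reserve resources.
        if alloc.get("ClientStatus") != "running":
            continue
        res = alloc.get("Resources", {})
        cpu += res.get("CPU", 0)
        mem += res.get("MemoryMB", 0)
    return cpu, mem
```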

Support for Kubernetes?

Good morning, everyone!
I have just come across this project while looking for something like it for our on-prem Kubernetes cluster. Does CLUES support it?
Apologies if this isn't the right place to ask.
Cheers.

Clues Reports - Error showing the areas of the deployed nodes

This error appears when printing the information about the state of the nodes.
In the image, between 22:03 and 22:38 you can see the purple line but not the purple area that represents an idle node.
Between 22:38 and 23:18 the area should show 10 nodes in use, not 9, and the idle area is also missing one node.

[Screenshot: clues_error]

If you check cluesdata.js, the data is there, so it seems to be an error in the graphs.

The data to reproduce the error can be found in this zip file. reports.zip

Cheers

Slurm mixed state not recognized

Hi. I use CLUES with Slurm. CLUES powers nodes on and off correctly, but when some nodes are in the "mixed" state, CLUES reports them as "off" or "off (err)". Idle nodes work fine.

The problem is that these mixed nodes have free resources, yet CLUES tries to power on other nodes instead.

How can I add support for the mixed state?

I see that the core count is correct:

gpu17 idle enabled 02h32'23" 0,0.0 16,1.38141180625e+14
gpu18 off (err enabled 04h01'53" 8,0.0 16,1.38141180625e+14
gpu19 off (err enabled 03h31'23" 8,0.0 16,1.38141180625e+14

Thanks a lot
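A possible approach is to map Slurm's MIXED state to a "used" state when the plugin translates node states, since a mixed node is partially allocated and must not be treated as off. This is only a sketch: the CLUES-side state constants here are illustrative stand-ins, not the plugin's actual identifiers, and the state names follow `sinfo`/`scontrol` output.

```python
# Illustrative CLUES-side states; the real plugin uses its own
# Node state constants.
IDLE, USED, OFF, ERROR = "idle", "used", "off", "error"

def translate_slurm_state(state):
    """Map a Slurm node state string to a CLUES-side state.

    MIXED means the node is partially allocated, so it is treated
    as used rather than off or in error."""
    # Strip Slurm state flag suffixes such as '*', '~', '#', '+'.
    base = state.strip().upper().rstrip("*~#%$@+")
    if base == "IDLE":
        return IDLE
    if base in ("ALLOCATED", "MIXED", "COMPLETING"):
        return USED
    if base in ("POWER_DOWN", "POWERED_DOWN", "POWERING_DOWN"):
        return OFF
    # DOWN, DRAINED, FAIL and anything unknown count as errors.
    return ERROR
```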
