collectd-cvmfs's Introduction

Collectd Module for CvmFS

Configuration

Example:

TypesDB "/usr/share/collectd/collectd_cvmfs.db"
<Plugin "python">
  Import "collectd_cvmfs"
  <Module "collectd_cvmfs">
    Repo "alice.cern.ch" "atlas.cern.ch"
    Repo "ams.cern.ch"
    MountTime True
    MountTimeout 10
    Memory True
    Attribute ndownload nioerr
    Attribute usedfd
    Verbose False
    Interval "300"
  </Module>
</Plugin>
  • TypesDB: types used by the plugin and shipped with the package.
  • Repo: cvmfs repository to monitor.
  • MountTime: boolean value to specify whether mount time should be reported or not.
  • MountTimeout: timeout in seconds while trying to mount the repositories.
  • Memory: boolean value to specify whether the memory footprint should be reported or not.
  • Attribute: attribute to monitor on the given repositories. The list of valid attributes is in the type db at resources/collectd_cvmfs.db.
  • Interval: interval in seconds to probe the CVMFS repositories.
  • Verbose: boolean value to make the plugin log more verbosely in collectd. It is false by default.

The plugin allows multiple instances for different configurations. This allows probing different repos at different intervals or probing different attributes depending on the repository.

Metrics

The metrics are published in the following structure:

Plugin: cvmfs
PluginInstance: <repo>
Type: {<Attribute>|MountTime|Memory|Mountok}

# Only with Memory:
TypeInstance: [rss|vms]

Example:

lxplus123.cern.ch/cvmfs-lhcb.cern.ch/mounttime values=[0.000999927520751953]
lxplus123.cern.ch/cvmfs-lhcb.cern.ch/nioerr values=[0]
lxplus123.cern.ch/cvmfs-lhcb.cern.ch/memory-rss values=[31760384]
lxplus123.cern.ch/cvmfs-repo.domain.ch/mountok values=[1]
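As an illustration of how the identifiers above map onto the Plugin, PluginInstance, Type and TypeInstance fields, here is a minimal Python 3 sketch that splits one identifier into its parts. The names `Identifier` and `parse_identifier` are hypothetical helpers for this example and are not part of the plugin. It assumes the usual collectd layout of host/plugin-instance/type-instance, with the first hyphen in each field separating the name from its instance.

```python
from collections import namedtuple

# Hypothetical container for the parts of a collectd identifier.
Identifier = namedtuple(
    "Identifier", "host plugin plugin_instance type type_instance"
)


def _split_instance(field):
    """Split 'name-instance' on the first hyphen; instance is None if absent."""
    name, _, instance = field.partition("-")
    return name, instance or None


def parse_identifier(identifier):
    """Split 'host/plugin-instance/type-instance' into its five parts."""
    # Raises ValueError if the identifier does not have exactly three fields.
    host, plugin_field, type_field = identifier.split("/")
    plugin, plugin_instance = _split_instance(plugin_field)
    type_, type_instance = _split_instance(type_field)
    return Identifier(host, plugin, plugin_instance, type_, type_instance)


print(parse_identifier("lxplus123.cern.ch/cvmfs-lhcb.cern.ch/memory-rss"))
```

For the memory example the repository name ends up in `plugin_instance` and `rss` in `type_instance`; for attributes such as nioerr the `type_instance` is `None`.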

collectd-cvmfs's People

Contributors

luisfdez, traylenator

collectd-cvmfs's Issues

The plugin shouldn't redefine the memory type

The current implementation of the plugin defines a type called memory that clashes with collectd's built-in memory type. The redefinition is unnecessary.

It should be removed.

Symptoms: When the type is redefined, the datasource is changed and memory values are reported with a different datasource name.

collectd not able to mount cvmfs if selinux is in place

  • Version python2-collectd_cvmfs-1.0.2-1.el7.1
  • CentOS 7
  • cvmfs-2.4.4-1.el7.centos

We will need some extra SELinux permissions to allow the collectd service to access cvmfs.

# grep avc /var/log/audit/audit.log | audit2allow   -a
#============= collectd_t ==============
allow collectd_t fusefs_t:dir read;

and probably others once mounted.

collectd could get stuck in some cases if autofs hangs while mounting the repository

On SLC6 nodes I have spotted some scenarios where collectd could get stuck while trying to get the MountTime for a cvmfs repository.

In the affected nodes, you would see autofs in a state like this:

root      7065  0.0  0.0 606552  1316 ?        Ssl  Oct10   1:29 automount --pid-file /var/run/autofs.pid
root     31188  0.0  0.0 111756   544 ?        S    Oct17   0:00  \_ /bin/mount -t cvmfs ilc.desy.de /cvmfs/ilc.desy.de
root     31189  0.0  0.0  16224   748 ?        S    Oct17   0:00      \_ /sbin/mount.cvmfs ilc.desy.de /cvmfs/ilc.desy.de -o rw
cvmfs    31209 99.9  0.0  68504  1056 ?        R    Oct17 9685:47          \_ /usr/bin/cvmfs2 -o rw,fsname=cvmfs2,allow_other,grab_mountpoin

In this state, os.listdir will hang forever and not even collectd will be able to kill it after an interval.

I have tried an alternative implementation using a thread to run listdir, but the start() call on the thread hangs as well, so it never reaches the next step: joining the thread with a timeout.

After some tests, the scandir package turned out to do a better job: it can be run in a thread and killed.

Even if collectd is able to kill scandir after a problematic interval, I think a more sensible approach would be to define a timeout for attempts to mount a given repo. Something like this:

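    import threading

    from scandir import scandir  # third-party backport of os.scandir

    def async_scandir(self, repo_mountpoint, timeout):
        contents = []
        # List the mount point in a daemon thread so a hung mount cannot block the caller.
        t = threading.Thread(target=lambda: contents.extend(scandir(repo_mountpoint)))
        t.daemon = True
        t.start()
        # Wait at most `timeout` seconds for the listing to finish.
        t.join(timeout)
        if t.is_alive():
            raise Exception('Scandir timed out')
        return contents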
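The same join-with-timeout pattern can be tried without a hung mount by wrapping an arbitrary callable. The following is only a sketch under assumptions: `run_with_timeout` is a hypothetical helper, not part of the plugin, and it targets Python 3 (it uses the built-in `TimeoutError`). Unlike the snippet above, it also passes on any exception raised inside the worker thread, which would otherwise be lost to the caller. It does not address the case described earlier where start() itself hangs.

```python
import threading
import time


def run_with_timeout(func, timeout):
    """Run func() in a daemon thread and wait at most `timeout` seconds."""
    outcome = {}

    def worker():
        # Record either the return value or the exception for the caller.
        try:
            outcome["result"] = func()
        except Exception as exc:
            outcome["error"] = exc

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The worker is not stopped here; it is only abandoned.
        raise TimeoutError("call did not finish within %s seconds" % timeout)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]


# Stand-in for a hung mount: a call that sleeps longer than the timeout.
try:
    run_with_timeout(lambda: time.sleep(1), 0.1)
except TimeoutError as exc:
    print("timed out:", exc)
```

Note that a timed-out worker keeps running in the background until it returns or the process exits, so repeated timeouts against a permanently hung mount would accumulate threads.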

crashed cvmfs is not noticed

Steps to reproduce:

cd /cvmfs/alice.cern.ch
# Kill the cvmfs process
kill -9 1234 

Results in:

ls /cvmfs/alice.cern.ch
ls: cannot access /cvmfs/alice.cern.ch: Transport endpoint is not connected

Restart collectd just so we know we have fresh results:

# collectdctl getval lxplus790.cern.ch/cvmfs-alice.cern.ch/mounttime
value=1.172495e-02
# collectdctl getval lxplus790.cern.ch/cvmfs-alice.cern.ch/mountok
value=1.000000e+00

So the plugin is not detecting this error state.

In Python, scanning the broken mount point raises:

<ipython-input-3-0950e51420db> in <module>()
----> 1 scandir('/cvmfs/alice.cern.ch')

OSError: [Errno 107] Transport endpoint is not connected: '/cvmfs/alice.cern.ch'

Compare this with a healthy repository:

In [4]: scandir('/cvmfs/cms.cern.ch')
Out[4]: <scandir.ScandirIterator at 0x7fdeb5cc57b0>
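Since the broken mount raises OSError while the healthy one does not, one possible way to surface the error state is to treat any OSError from listing the mount point as a failed mount. The sketch below is an assumption about how such a check could look, not the plugin's actual code: `probe_mount` is a hypothetical name, and it uses the standard library's `os.listdir` rather than the scandir package.

```python
import os
import time


def probe_mount(mountpoint):
    """Return (mountok, mounttime) for a mount point.

    mountok is 1 if the directory can be listed and 0 if listing raises
    OSError (for example errno 107, 'Transport endpoint is not connected').
    mounttime is the elapsed time in seconds, or None on failure.
    """
    start = time.time()
    try:
        os.listdir(mountpoint)
    except OSError:
        return 0, None
    return 1, time.time() - start


# Illustrative call; the path is taken from the report above.
ok, elapsed = probe_mount("/cvmfs/alice.cern.ch")
print("mountok=%d mounttime=%s" % (ok, elapsed))
```

With a check like this, the crashed repository would report mountok 0 instead of 1. A listing that hangs rather than fails would still need a timeout around the call.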
