suse / deepsea Goto Github PK

View Code? Open in Web Editor NEW

161.0 35.0 75.0 6.57 MB

A collection of Salt files for deploying, managing and automating Ceph.

License: GNU General Public License v3.0

Python 74.08% SaltStack 17.65% Shell 0.96% Makefile 5.03% Scheme 0.01% Dockerfile 0.04% Jinja 2.23%

ceph salt python software-defined-storage distributed-storage deployment-system

deepsea's Introduction

DeepSea

A collection of Salt files for deploying, managing and automating Ceph.

The goal is to manage multiple Ceph clusters with a single salt master. At this time, only a single Ceph cluster can be managed.

This diagram should explain the intended flow for the orchestration runners and related salt states.

Status

DeepSea is no longer being actively developed since cephadm was added to the Ceph Octopus release.

The SES6 branch (v0.9.x) contains the most recent release of DeepSea, which supports deploying Ceph Nautilus.

deepsea's People

Contributors

Stargazers

Watchers

Forkers

theanalyst avengermojo lenzgr digideskio p-se sebastian-philipp supriti jschmid1 chekolyn swamireddy rjfd ricardoasmarques martin-weiss hoonetorg smithfarm imthekai oliveiradan shundezhang mkoderer ddiss ebartz mogeb votdev jecluis dzedro carltonf marc-assmann liangxin1300 jasonvoor sisphyus agraul aland-zhang toabctl canarytek callithea fflorens bmemmott71 plexorama seanrickerd qiangqiangsir ubi2go jonsger mallozup domainesia bhdn shyukri cloudxtreme hemna david-z lordfolken y4ns0l0 iwakus tserong leohubery craiggardner azraelxyz dreamsequins bdfoster mgfritch annabelle0218 gekios noelmcloughlin asettle johanneskastl xconnection bk201 tubbz-alt knkonishi yukaripapa aqsamm art-pozdn 0xavi0 samkenxstream spacecase123

deepsea's Issues

nettests do not work on long minion id's.

When minion id's are in the form host.domain.org nettests fail due to minion ids not being found

PR:

#27

Partially fixes this but not completely

Use pylint or any other linter for syntax checks

As we're adding more and more code in the runner, we tend to error, a linter should be able to catch many of them, though at this moment we need a lot more code style fixes before we can have a linter run clean.

We could use salt-pylint and borrow salt's own pylintrc config which should allow us to not report syntax errors on things salt monkey patches in the global namespace (for eg __salt__ and __opts__ builtins to name a few)

cc @rjfd @jschmid1 @swiftgist @jan--f

ceph.conf friendly editor

The request is for an editor that supplies ceph.conf parameters primarily for initial deployments. The current parameters are

journal size
pg_num
pgp_num

I expect more. The friendly part is mild validation and likely helpful descriptions for these parameters.

advertise network configuration

Stage 1 does a best guess of the public/cluster networks. However, if it's wrong the admin may not realize this until Stage 3 is complete. This creates a lot of unnecessary churn.

Find a way to let the admin know what's been selected. I expect a nice message in advise.py and added to Stage 2 would work.

Stage 0: failed: Jinja error: 'validate.setup'

Trying to run "salt-run state.orch ceph.stage.0" command get following error
OS:- CentOS7
Compile:- DeepSea
[CRITICAL] Rendering SLS 'base:ceph.stage.0.master.default' failed: Jinja error: 'validate.setup'
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/salt/utils/templates.py", line 368, in render_jinja_tmpl
output = template.render(**decoded_context)
File "/usr/lib/python2.7/site-packages/jinja2/environment.py", line 969, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/lib/python2.7/site-packages/jinja2/environment.py", line 742, in handle_exception
reraise(exc_type, exc_value, tb)
File "", line 3, in top-level template code
File "/usr/lib/python2.7/site-packages/salt/modules/saltutil.py", line 1181, in runner
return rclient.cmd(name, kwarg=kwargs, print_event=False, full_return=True)
File "/usr/lib/python2.7/site-packages/salt/runner.py", line 144, in cmd
full_return)
File "/usr/lib/python2.7/site-packages/salt/client/mixins.py", line 225, in cmd
self.functions[fun], arglist, pub_data
File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1086, in getitem
func = super(LazyLoader, self).getitem(item)
File "/usr/lib/python2.7/site-packages/salt/utils/lazy.py", line 101, in getitem
raise KeyError(key)
KeyError: 'validate.setup'

; line 3

Any suggestion

ntp: we try to reach the time server even if time_service is set as disabled

          ID: time
    Function: salt.state
      Result: False
     Comment: Run failed on minions: tf-salt-minion-1, tf-salt-master, tf-salt-minion-0, tf-salt-minion-3, tf-salt-minion-2
              Failures:
                  tf-salt-minion-1:
                    Name: ntp - Function: pkg.installed - Result: Clean Started: - 10:56:43.069718 Duration: 427.82 ms
                  ----------
                            ID: sync time
                      Function: cmd.run
                          Name: sntp -S -c tf-salt-master
                        Result: False
                       Comment: Command "sntp -S -c tf-salt-master" run
                       Started: 10:56:43.499383
                      Duration: 32.645 ms
                       Changes:   
                                ----------
                                pid:
                                    3843
                                retcode:
                                    1
                                stderr:
                                    tf-salt-master lookup error Name or service not known
                                stdout:
                                    sntp [email protected] Tue Nov 29 09:15:54 UTC 2016 (1)
                  
                  Summary for tf-salt-minion-1
                  ------------
                  Succeeded: 1 (changed=1)
                  Failed:    1
                  ------------
$ salt "*" pillar.get time_service
tf-salt-minion-0:
    disabled
tf-salt-minion-1:
    disabled
tf-salt-minion-2:
    disabled
tf-salt-minion-3:
    disabled
tf-salt-master:
    disabled

look into setting NTPD_FORCE_SYNC_ON_STARTUP

As @smithfarm pointed out "/etc/sysconfig/ntp has a knob (NTPD_FORCE_SYNC_ON_STARTUP) to make ntpd force clock sync at startup."

Maybe DeepSea setting this option improves a deployment with quicker clock sync.

Custom profiles

The general idea is how and when to present the installer/admin with the ability to specify a unique proposal. Normally, populate.py uses nice_ratios and rounding to create the different hardware profiles for separate journals. However, if the numbers simply do not work (12 spinners, 1 SSD), no suggestion is made.

Rather than go down the path of infinite possibilities and also because the above example should be the uncommon case, give the admin the ability to trivially create their own profiles. One feature is to allow optionally filtering based on model and/or capacity. An example command may look like

salt-run populate.proposals --ratio 6:1 --data-model DELL --data-capacity 1862GB --journal-model Intel --journal-capacity 185GB

We use the word capacity rather than "size". This matches the current output of hwinfo and will hopefully prevent any confusion with "journal size". In Ceph, "journal size" is normally the partition size of a given journal. We need the size or the capacity of the disk.

The behavior of the above would filter all non-journal drives by the DELL model that match 1862GB and likewise with the journals that match Intel and 185GB. Once some multiple of 6:1 is achieved, all remaining drives including those excluded by the filter would be listed as dedicated OSDs.

The filter options are truly optional. The model or capacity is only necessary if the admin wishes to further restrict the application of the ratio. Multiple commands could then be run such as

salt-run populate.proposals --ratio 6:1
salt-run populate.proposals --ratio 5:1
salt-run populate.proposals --ratio 12:1

With 12 spinners and 1 SSD, the first profile would have 6 split journals and 6 indepdendent OSDs. The second would have 5 and 7 respectively. The last isn't recommended at all, but then, maybe the admin knows something about the hardware that I don't. :)

To keep the profiles both obvious and avoid colliding with the current profiles, I believe embedding the ratio as part of the name is an okay approach. That means if you see

profile-6:1-1Intel185GB-12DELL1862GB-1

the admin went out of their way to create a custom profile.

The needed logic should be a duplication of the nice_ratio method but using the provided values and making sure to put all remaining devices as OSDs.

The first pass would keep these are arguments to the runner, but working this backward into providing a configuration(s) file may be useful.

Should this solution prove insufficient for some combination of hardware (e.g. imagine 4 types of drives where the admin wants two different ratios), my instinct says not to go any further for now. The additional complexity would likely not benefit the uncommon or one-off group. Also, the creation of a data structure of both data+journal and osds with the real device names still allow for hand editing in the most extreme scenarios.

Stage 1 should check for minions responding

If minions are not responding for any reason, we need to either issue a warning or error before trying to deduce things such as the public or cluster network.

policy.cfg-generic

The file is missing the stack/default entry.

The ultimate goal is to move this from the pillar entirely, but in the meantime, the examples should be accurate.

nettest does not take ceph.conf into account

As far as I can see, nettest ignores which networks Ceph is actually using. I think it should take that into account.

It should try whether all Ceph member nodes with addresses on the public_network are accessible from all nodes. It should also try to contact to all cluster network addresses from all nodes that need to be able to access them (OSDs, MONs, MDS, ...) For this, it should also set the source address properly to the local address on that network.

Right now, I think nettest leverages all IPv4 addresses, even those that Ceph doesn't use. I don't think we should try that.

Deepsea can't detect if a disk is an SSD/HDD if it sits behind a RAID Controller.

So far I heard from a couple of people wanting to deploy with a RAID setup. That won't work with the current method of detecting solid states..

The only tool that is capable of detecting the correct type is smartctl megaraid.

We either want to change the way we detect the the disktype or add some overwrite capabilities.

salt-api and runner authentication

I recently learned that runner authentication seems to fail. The example from https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html#usage for the runner portion of the example does not work. While I did not find the root cause, the problem starts with /usr/lib/python2.7/site-packages/salt/master.py line 1521 "good = check_fun(..." in Salt 2015.8.12. Considering our current eauth settings allow 'admin' to do anything, I do not believe that the issue is configuration related.

To explore the functionality of executing runners in the salt-api, adding "good = True" directly after this line allows all runners that I tried to work.

sequential reboot

In stage/prep/minion/default.sls, there is a restart call to ceph.updates.restart.

Since this is a minion file targeted from orchestration, these will run in parallel. Now, the reboot only happens when there's a kernel update. During a fresh installation, a parallel update/reboot is desirable.

Once a cluster is running, a sequential reboot is necessary. The preference is to keep all other steps running in parallel and only run the reboots sequentially when something like roles are defined.

My first guess would be create a salt-runner that returns each minion name in the proper order according to role or return '*' and hope that will work as a target. The "proper" order should reflect the same order as the restart orchestration (i.e. ceph/restart/default.sls)

verify the healthiness of roles

Currently we monitor the clusters health based on the information ceph health gives you.
That is alright to a certain extend - We might have wrong information if we reboot monitors in a setup > 3 and something goes wrong. ceph health will tell you that you might have lost a monitor and drop to HEALTH_WARN. So we happily go along an restart the next one ( the exit condition is HEALTH_ERR for reason [0] ). The issue here is that you only notice if something went wrong when you lost the critical amount of monitors already. So the goal here is to look for a better way to look for the status like systemdctl status etc.

[0] HEALTH_WARN is too vague to drop out with error. there are many non-critical cases where ceph gives you a warning.

mon creation + salt's fqdn system is broken depending on the version of salt being used

tf-salt-master:/srv/salt # python
Python 2.7.9 (default, Dec 21 2014, 11:02:59) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import salt.utils.network
>>> import socket
>>> socket.gethostname()
'tf-salt-master.localdomain'
>>> salt.utils.network.get_fqhostname()
'localhost'

And depending on the version of salt used we may have different values as the hueristics used in getting the fqdn, minion_id is generally guaranteed to be unique or salt wouldn't accept the keys, we could use that to create monitors

dmcrypt/bluestore support

the approach how data-at-rest encryption is managed changed some month back. It's now relatively easy as you upload your en/decryption key in the k/v store of the monitor and retrieve this via a separate keyring.

afaict enabling this would just mean to add a --dmcrypt switch to the constructed ceph-disk command.

we might need to add something to the policy in order to mark certain profiles as encrypted

ntp client configuration

DeepSea currently forces synchronization, but does not write an ntp.conf to the minions. The request is to add the client configuration.

stage1 passes silently though populate fails

In a virtual setup I'm getting a success for stage1 even though populate proposals runner has errored out

# salt-run state.orch ceph.stage.1
[WARNING ] All minions are ready
retcode:
    0
7374-deepsea-salt-master_master:
  Name: minions.ready - Function: salt.runner - Result: Changed Started: - 16:37:17.294890 Duration: 1477.662 ms
  Name: populate.proposals - Function: salt.runner - Result: Changed Started: - 16:37:18.772688 Duration: 1767.664 ms

Summary for 7374-deepsea-salt-master_master
------------
Succeeded: 2 (changed=2)
Failed:    0
------------
Total states run:     2
Total run time:   3.245 s

bash-4.3# salt-run populate.proposals                    
Exception occurred in runner populate.proposals: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/client/mixins.py", line 357, in low
    data['return'] = self.functions[fun](*args, **kwargs)
  File "/srv/modules/runners/populate.py", line 829, in proposals
    disk_configuration.generate(hardwareprofile)
  File "/srv/modules/runners/populate.py", line 297, in generate
    self.hardware.add(server, self.storage_nodes[server])
  File "/srv/modules/runners/populate.py", line 178, in add
    label = self._label(drive['Model'], drive['Capacity'])
KeyError: 'Capacity'

hwinfo does list the capacity though, part of output from cephdisks.list mine function

374-deepsea-salt-minion-b:
    |_
      ----------
      Attached to:
          #15 (Storage controller)
      Bytes:
          8589934592
      Capacity:
          8 GB
      Config Status:
          cfg=new, avail=yes, need=no, active=unknown
      Device File:
          /dev/vdb
      Device Number:
          block 254:16-254:31
      Driver:
          virtio-pci, virtio_blk
      Driver Modules:
          virtio_pci, virtio_blk
      Geometry (Logical):
          CHS 16644/16/63
      Hardware Class:
          disk
      Model:
          Disk
      Parent ID:
          sNGd.+FFPFBVXZu6
      Size:
          16777216 sectors a 512 bytes
      SysFS BusID:
          virtio1
      SysFS Device Link:
          /devices/pci0000:00/0000:00:04.0/virtio1
      SysFS ID:
          /class/block/vdb
      Unique ID:
          ndrI.Fxp0d3BezAE
      device:
          vdb
      rotational:
          1

Jijna error with salt version 2016.3.4

This error is not seen with salt version salt 2015.8.10. With salt version 2016.3.4, running deepsea stage 3 gives the following error:

admin.ceph_master:
Data failed to compile:

Rendering SLS 'base:ceph.stage.3.default' failed: Jinja error: runner() got multiple values for keyword argument 'name'

Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/salt/utils/templates.py", line 368, in render_jinja_tmpl
output = template.render(**decoded_context)
File "/usr/lib/python2.7/site-packages/jinja2/environment.py", line 989, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/lib/python2.7/site-packages/jinja2/environment.py", line 754, in handle_exception
reraise(exc_type, exc_value, tb)
File "", line 4, in top-level template code
TypeError: runner() got multiple values for keyword argument 'name'

; line 4

{% set FAIL_ON_WARNING = salt'pillar.get' %}

{% if salt['saltutil.runner']('ready.check', name='ceph', fail_on_warning=FAIL_ON_WARNING) == False %} <======================
ready check failed:
salt.state:
- name: "Fail on Warning is True"
- tgt: {{ salt'pillar.get' }}
- failhard: True
[...]

retcode:
1

ceph mds: Default PG_NUM should be calculated based on user selection

For mds pool creation, srv/salt/ceph/mds/pools/default.sls, currently we use as default 256.
But from previous stages, we already have enough information to calculate a right PG num

We should use the calculated value instead of default 256.

Extend Stage 0 validate

One item that may be suitable is verifying that the salt-master and salt-minions are set to enable (as in the systemctl enable sense). Is this worth stopping a deployment over?

cephfs benchmark: don't use admin keyring

https://github.com/SUSE/DeepSea/blob/master/srv/salt/ceph/cephfs/benchmarks/mount.sls#L17 expects the admin keyring to be present on the client-cephfs nodes. Use a key with less privileges.

ping tests dont run in parallel

Have just spoken to eric and discovered because ping tests don’t run in parallel. He wont be accepting the ping tests.

baseline benchmark: add some clustering to results before reporting

Right now the baseline benchmark reports the average OSD throughput and OSDs that deviate from this average by a certain percentage.
When two (or more) class of OSDs are in a cluster (think one class on spinners, another only on SSDs) and given the right amount of instances of each class this would report all OSDs as outliers. The average is a bad statistical tool here.

Implement some clustering after the test was run and report each cluster (hopefully OSDs of the same class) individually. With some tuning the clustering could be quite good and make the output much more informative on heterogeneous clusters.

nettest only tests IPv4

Customers may be using IPv6.

address_list = local_client.cmd(node, 'grains.get', ['ipv4'])

At the very least, nettest needs to detect if IPv6 addresses are used and abort with an appropriate error message.

add ability to add warnings to validate

Currently we have the ability to only pass fail in validate, sometimes (for eg. in case of fqdn) we'd like to have an option to warn and pass

populate's network detection interferes with role assigment

I want to deploy ceph on an openstack cluster where all nodes have their own /32 network. Routing is managed elsewhere. The proposal runner (rightly) complains about this network situation:

File "/srv/modules/runners/populate.py", line 704, in public_cluster
   raise ValueError("No network exists on at least 4 nodes")
No network exists on at least 4 nodes

The plan was to run stage 2 and then fill in the public_interface addresses in the /srv/pillar/ceph/stack tree. However the proposal runner doesn't create any files in the role-mon directory. Stage 2 complains about this:

role-mon/cluster/target158069079237.sls matched no files
role-mon/stack/default/ceph/minions/target158069079237.yml

Stage 3 then of course complains about the absence of monitors.
I think the proposal runner should complain about the networks, but still create at the role assignments under role-mon/cluster.

stage 2 fails silently if /srv/pillar/ceph/stack/ceph is not writeable

Had the wrong owner/permission on /srv/pillar/ceph/stack/ceph. The push runner fails fairly silently if it can not write to this directory.
With -l debug one gets

[INFO] Exception occurred in runner push.proposal: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/client/mixins.py", line 357, in low
    data['return'] = self.functions[fun](*args, **kwargs)
  File "/srv/modules/runners/push.py", line 72, in proposal
    pillar_data.output(common)
  File "/srv/modules/runners/push.py", line 120, in output
    self._custom(custom)
  File "/srv/modules/runners/push.py", line 152, in _custom
    os.makedirs(path_dir)
  File "/usr/lib64/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/srv/pillar/ceph/stack/ceph/minions'

Maybe the push runner can fail a bit more dramatically. Runing stage 3 after doesn't give much info either.

single node cluster deployment not permitted

In attempting to deploy a single node Ceph cluster, I ran into the following error during stage 1:

658         priorities = []
659         for network in networks:
660             quantity = len(networks[network])
661             # Minimum number of nodes, ignore other networks
662             if quantity > 3:
663                 priorities.append( (len(networks[network]), network) )
664                 log.debug("Including network {}".format(network))
665             else:
666                 log.warn("Ignoring network {}".format(network))
667 
668         if not priorities:
669             raise ValueError("No network exists on at least 4 nodes")
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So it seems that in attempting to automagically determine which network is public and which network is cluster, Deapsea asserts that at least four nodes exist.

I only raise this as a bug because ceph-deploy supported single node cluster deployment. If this isn't seen as a priority to Deepsea then feel free to close this.

I'll gladly take this as bug assignee, but would appreciate an indication of how many other roadblocks I can expect to hit in attempting to support this.

rbd benchmark: check if pool used for benchmarking exists

The benchmark code assumes the rbd pool exists, as it should on a freshly deployed DeepSea cluster. Presence of pool to use should be checked and pool to use should be configurable (probably in the pillar).

Enable deploying collectd on all cluster nodes, to gather performance data on a central node

In order to be able to monitor and observe the cluster nodes' general performance, it would be helpful if DeepSea would deploy collectd on all nodes. These collectd instance should be configured to remotely log to a central (definable) collectd instance. By default, this could be the management node directly.

fqdn and localhost

The validation step checks to see that the minion name matches the fqdn for each salt minion. The objective is to eliminate the confusion between "ping nodeA" and "salt 'minionX.domain' test.ping" when these are the same server. Those new to Salt start support issues of "why doesn't "salt 'nodeA' test.ping" work?"

If your setup includes adding the hostname to /etc/hosts as 127.0.0.1, then the salt minion will list the fqdn as localhost. I believe Salt is relying on python and that a forward and reverse lookup are involved.

The question is "Is localhost a valid fqdn for Salt?". I cannot think of any consequences currently. If that is the case, then validate.py needs to allow "localhost" as a match for any fqdn.

journal partitions remain

When calling ceph.rescind for a storage node, the partitions are not removed. Subsequent installations add more partitions to journal drives.

Stage 0 fails cause of `ceph.updates.default.zypper update` failed.

Hi Eric,

While running stage 0, I got this stack trace:

The minion function caused an exception: Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/salt/minion.py", line 1320, in _thread_return
    return_data = executor.execute()
File "/usr/lib/python2.7/site-packages/salt/executors/direct_call.py", line 28, in execute
    return self.func(*self.args, **self.kwargs)
File "/usr/lib/python2.7/site-packages/salt/modules/state.py", line 942, in sls
    ret = st_.state.call_high(high_)
File "/usr/lib/python2.7/site-packages/salt/state.py", line 2276, in call_high
    ret = dict(list(disabled.items()) + list(self.call_chunks(chunks).items()))
File "/usr/lib/python2.7/site-packages/salt/state.py", line 1799, in call_chunks
    running = self.call_chunk(low, running, chunks)
File "/usr/lib/python2.7/site-packages/salt/state.py", line 2163, in call_chunk
    self.event(running[tag], len(chunks), fire_event=low.get('fire_event'))
File "/usr/lib/python2.7/site-packages/salt/state.py", line 1955, in event
    [self.jid, self.opts['id'], str(chunk_ret['name'])], 'state_result'
KeyError: 'name'

To find the cause, I've modified /usr/lib/python2.7/site-packages/salt/state.py to print chunk_ret:

{
    'comment': 'One or more requisite failed: ceph.updates.default.zypper update',
    '__run_num__': 1, 
    '__sls__': u'ceph.updates.default', 
    'changes': {}, 
    'result': False
}

Then, I enabled the debug log on the admin minion, which revealed that zypper --non-interactive update --replacefiles failed:

admin:/srv # zypper --non-interactive update --replacefiles
Loading repository data...
Reading installed packages...

The following 9 package updates will NOT be installed:
  augeas-lenses ... snip ... p11-kit p11-kit-tools ruby2.1-rubygem-facter rubygem-facter

The following 9 NEW packages are going to be installed:
  ruby2.1-rubygem-bundler ruby2.1-rubygem-chef-zero-4_9 ... snip ...

The following 199 packages are going to be upgraded:
  aaa_base aaa_base-extras .. snip ... rubygem-facter ... snip ...  zypper-log

199 packages to upgrade, 9 new.
Overall download size: 13.5 MiB. Already cached: 188.4 MiB. After the operation, additional 29.9 MiB will be used.
Continue? [y/n/? shows all options] (y): y
In cache desktop-translations-20151007-4.1.noarch.rpm                                             (1/208),   5.1 MiB ( 41.6 MiB unpacked)
... snip ...                                                   (154/208), 109.4 KiB (179.9 KiB unpacked)
In cache rubygem-hiera-1.3.4-17.2.x86_64.rpm                                                    (155/208),   5.7 KiB (  426   B unpacked)
Retrieving package rubygem-facter-2.4.6-1.2.x86_64                                              (156/208),  11.2 KiB (  952   B unpacked)
Retrieving: rubygem-facter-2.4.6-1.2.x86_64.rpm .....................................................[error]
File './x86_64/rubygem-facter-2.4.6-1.2.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/systemsmanagement:/puppet/openSUSE_Leap_42.1/'

Abort, retry, ignore? [a/r/i/? shows all options] (a): a
Problem occured during or after installation or removal of packages:
Installation aborted by user                                                                                                             

Please see the above error message for a hint.

Running a zypper refresh fixed this issue.

See also: saltstack/salt#36687

stage.3 fails if no igw is deployed

If the policy.cfg does not inlcude an igw (and maybe rgw as well) stage 3 fails. This is due to the role key creation has the failhard attribute and not matching a minion seems to be considered a failure.
Temporary work-around is to remove the not deployed roles from the {{ role }} key target array in /srv/salt/ceph/stage/3.sls.

nettests dont support salt-master-2015.8.7-17.1.x86_64

While the work on salt-master-2016 this is due to the sls files using the local include format.

Support for Ubuntu OS

Is it possible to have Ubuntu OS support instead of just SuseLinux?

Or change the formulas to support any operating system, instead of depending on Suse specific commands like zipper.

remove role-storage directory

The storage role is assigned by the profile directories. The role-storage is unnecessary and confusing users, so populate.py should not create it.

nettest does not validate whether ports are open

There could be firewall rules or others interfering.

Maybe running quick iperf3 tests or similar against the ports that we expect the nodes to use (from ceph.conf)?

Move modules

/srv/salt/_modules is really for user defined modules... need to move existing modules. Off hand, I think that's /srv/modules/modules.

stage.0: restart master_minion when running stage on master

There is a problem when executing this step https://github.com/SUSE/DeepSea/blob/master/srv/salt/ceph/stage/prep.sls#L34 if running the stage from the master node.

Don't know what should be the correct solution to this problem.
Should we assume that the master_minion is different from the salt-master node?

stage 0: kernel update requires "kernel-default" package to be installed

I was deploying DeepSea in a SLE12_SP2 image and when running stage 0 it stopped and restarted the node to update the kernel. The problem was that when I ran stage 0 again it tried again to install an updated version of the kernel.

After some investigation I found that the problem is due to the fact that kernel-default package is not installed, instead it is installed the kernel-default-base package. And in this case stage 0 will try to update the kernel-default package without any effect.

My main question to this issue is, do we really require that the host needs to have the kernel-default installed? or we can live with having only the kernel-default-base?

Either way, we need to fix stage 0 to check if kernel-default is installed to decide if it updated or installs, and avoid entering in a r"estart/run stage 0" loop

EauthAuthenticationError: Failed to authenticate!

after a recent rebase atop:

sha: 316c3b9071807a22baa87df5dd1823560c9ef79e
I get $subject error message when running ceph.purge, ceph.restart or almost anything.. just tried ceph.$role.key which seems to work just fine.

verified on virtual machine and bare metal on salt version:

Salt Version:
          Salt: 2015.8.12


Dependency Versions:
        Jinja2: 2.8
      M2Crypto: Not Installed
          Mako: 1.0.1
        PyYAML: 3.10
         PyZMQ: 14.0.0
        Python: 2.7.9 (default, Dec 21 2014, 11:02:59) [GCC]
          RAET: Not Installed
       Tornado: 4.2.1
           ZMQ: 4.0.4
          cffi: 1.5.2
      cherrypy: Not Installed
      dateutil: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
         ioflo: Not Installed
       libgit2: Not Installed
       libnacl: Not Installed
  msgpack-pure: Not Installed
msgpack-python: 0.4.8
  mysql-python: Not Installed
     pycparser: 2.10
      pycrypto: 2.6.1
        pygit2: Not Installed
  python-gnupg: Not Installed
         smmap: Not Installed
       timelib: Not Installed

System Versions:
          dist: SuSE 12 x86_64
       machine: x86_64
       release: 4.4.38-93-default
        system: SUSE Linux Enterprise Server  12 x86_64

full logs for ceph.restart
http://paste.opensuse.org/14432274

full logs for ceph.purge
http://paste.opensuse.org/50996940

could not find any related issues on salt itself yet.. anyone seen this before?

Running tool on remote node, tgz'ing exported files & transferring them to admin node

As discussed with @jan--f on irc, this is mostly meant so that we don't forget we had this discussion.

The whole idea would be to be able to run a given tool on all remote OSD nodes. Said tool would grab given data out of the OSD (e.g., maps), and export them to a predefined (or user-defined) directory; directory would be compressed, and then transferred to admin node.

This could potentially be useful for monitor rebuild, during a disaster recovery event.