Comments (21)

swiftgist commented on May 23, 2024

What if we rethink the original structure? Not having duplicate entries in a structure would be a good thing. As for backward compatibility, we could create a new structure that is better suited to extension and convert the existing cmd.run calls into a module. This Salt module could handle either structure, with simple validation that the two structures are mutually exclusive. That should cover the transition for existing clusters until profiles are migrated. (The profiles would likely migrate to filters.)

If I invert the original relationship and use a new keyword, we could have

storage:
  devices:
  - /dev/vdb:
    format: xfs
    journal_size: 10G
  - /dev/vdc:
    format: xfs
    journal_size: 10G
  - /dev/disk/by-id/real-data-pathname:
    journal: /dev/disk/by-id/real-journal-pathname
    format: bluestore
    journal_size: 10G

Adding attributes for a metadata device for Bluestore would not corrupt the structure. The logic of deciding whether an OSD is stand-alone or has a separate journal would be based on the presence of a journal. The original logic relies on osds and data+journal keys. The new logic would rely on devices.
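
As a rough sketch of that decision (the helper name and the exact argument handling are illustrative only, not DeepSea code):

def prepare_devices(device, attrs):
    """Return the positional arguments for ceph-disk prepare from one
    'devices' entry: the data device alone, or data plus journal."""
    attrs = attrs or {}
    if attrs.get('journal'):
        # A separate journal exists: the presence of the key is the whole decision.
        return [device, attrs['journal']]
    return [device]

# prepare_devices('/dev/disk/by-id/real-data-pathname',
#                 {'journal': '/dev/disk/by-id/real-journal-pathname',
#                  'format': 'bluestore', 'journal_size': '10G'})
# -> ['/dev/disk/by-id/real-data-pathname', '/dev/disk/by-id/real-journal-pathname']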

About my only concern with this path is that the Salt module would contain both the condition check and the command execution. That is likely necessary, since DeepSea has no examples of Jinja calling a module in a conditional. The admin needs the same level of control that an sls file provides. The unless/onlyif requisites could be an option, but those only run shell commands. This will take some thought.

Anyway, does the above structure match what both of you want, and does it give a general path forward on bluestore/dmcrypt?

BlaineEXE commented on May 23, 2024

My only thought is that I don't see (sorry if I missed it) a place where the values can be specified cluster-wide. Does it make sense to have a default value that can be assumed for all devices?

ceph:
  storage:
    devices:
      default:
        format: bluestore
        wal: ''
        wal_size: ''
        db: ''
        db_size: ''
        encryption: dmcrypt
      '/dev/vdb':
        format: xfs
        encryption: none
      '/dev/vdc': #defaults
      '/dev/vdd': #defaults
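
For illustration, merging such a default entry with a per-device override could be as simple as the following (plain Python sketch; the helper name is made up):

def device_settings(devices, device):
    """Start from the cluster-wide 'default' entry and overlay the
    device's own keys; a device listed with no body keeps the defaults."""
    settings = dict(devices.get('default', {}))
    settings.update(devices.get(device) or {})
    return settings

# With the example above:
# device_settings(devices, '/dev/vdb') -> format xfs, encryption none, rest defaulted
# device_settings(devices, '/dev/vdc') -> the defaults unchanged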

jschmid1 commented on May 23, 2024

That will most likely only be applicable for filestore.
Bluestore won't take the same route.

edit: Bluestore will take the same route

swiftgist commented on May 23, 2024

I see #130 so the commands seem straightforward. Thanks for that.

Although I have been asking about only setting one host to the same format (e.g. bluestore, dmcrypt+bluestore) for all OSDs, the management may not be too terrible for each OSD. jan--f and I were chatting about the custom profiles and the filter/search issue is addressed there.

Thinking backwards, I believe adding a dictionary of devices and formats to the existing yaml profiles should be sufficient. For example,

storage:
  data+journals: []
  osds:
  - /dev/vdb
  - /dev/vdc

becomes

storage:
  data+journals: []
  osds:
  - /dev/vdb
  - /dev/vdc
  format:
    /dev/vdb: xfs
    /dev/vdc: xfs

The general idea is a simple device lookup. Extending cephdisks.py with a format command that outputs the correct options would fit nicely into the original command:

"ceph-disk -v prepare {{ salt'cephdisks.format' }} --data-dev --journal-dev --cluster {{ salt'pillar.get' }} --cluster-uuid {{ salt'pillar.get' }} {{ device }}"

The four formats and corresponding options would be

xfs                       --fs-type xfs
dmcrypt+xfs               --dmcrypt --fs-type xfs
bluestore                 --bluestore
dmcrypt+bluestore         --dmcrypt --bluestore

If no format key exists, we can default to xfs.
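
A sketch of what such a cephdisks helper could return, using the mapping above (the function name and signature here stand in for the proposed format command and are hypothetical):

FORMAT_OPTIONS = {
    'xfs': '--fs-type xfs',
    'dmcrypt+xfs': '--dmcrypt --fs-type xfs',
    'bluestore': '--bluestore',
    'dmcrypt+bluestore': '--dmcrypt --bluestore',
}

def format_options(storage, device):
    """Look up the device's format and return the ceph-disk options,
    defaulting to xfs when no format key exists."""
    fmt = storage.get('format', {}).get(device, 'xfs')
    return FORMAT_OPTIONS[fmt]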

How does this sound?

jan--f commented on May 23, 2024

I have two questions:

  1. Can we come up with a more terse format for the yaml profiles? The current proposal seems a bit verbose. Without trying it I'd really like something like
  data+journals: []
  osds:
  - /dev/vdb:
      format: xfs
    /dev/vdc:
      format: xfs

This of course opens up the question about data+journal. This leads to my second question:

  2. From #130 I see we only encrypt the data partition. Is it possible to encrypt the journal too? Maybe even necessary, since we'd otherwise leak data? Format-wise I could imagine a more extensive format too. Maybe also use the data partition as a key for an object with properties containing extra info.

I'm aware that this could be a pretty severe change to the profiles format and has all kinds of repercussions in terms of backwards compatibility. But since I expect the info attached to OSD devices to grow, maybe we should think about a format that is more easily extendable.
closing this can of worms

edit: This is also with the custom ratio proposals in mind. The info about journal sizes would find a good home in this format too.

jschmid1 commented on May 23, 2024

Although I have been asking about only setting one host to the same format (e.g. bluestore, dmcrypt+bluestore) for all OSDs, the management may not be too terrible for each OSD. jan--f and I were chatting about the custom profiles and the filter/search issue is addressed there.

I don't think so either, but it will require, as already pointed out, changes to core structures of DeepSea.

  data+journals: []
  osds:
  - /dev/vdb:
      format: xfs
    /dev/vdc:
      format: xfs

seems favorable over

storage:
  data+journals: []
  osds:
  - /dev/vdb
  - /dev/vdc
  format:
    /dev/vdb: xfs
    /dev/vdc: xfs

From #130 I see we only encrypt the data partition. Is it possible to encrypt the journal too? Maybe even necessary since we'd otherwise leak data? Format wise I could imagine a more extensive format too. Maybe also use the data partition as a key for an object with properties containing extra info.

I honestly don't know the behavior of ceph-disk --dmcrypt with non-colocated data/journals; I'll look that up.

I'm aware that this could be a pretty severe change to the profiles format and has all kinds of repercussions in terms is backwards compatibility. But since I expect the info attached to OSDs devices to grow maybe we should think about a format that is more easily extendable.

Agreed.

swiftgist commented on May 23, 2024

The issue with the embedded format is that the calls become that much more complicated: we would then have to pass 'osd', 'data' or 'journal' along with the device name to find the format. It was my first thought as well, but a list of devices with their preferred format, regardless of how they are used, is a bit easier to work with in terms of lookup and of extending the structure.

I am still uncertain about specifying a unique journal size for every device in a Ceph cluster. We should probably continue that discussion in #76.

jan--f commented on May 23, 2024

@swiftgist Yes, that is exactly the structure @jschmid1 and I came up with too. Planning to implement exactly that.
Sorry for not sharing sooner...

swiftgist commented on May 23, 2024

@jan--f So, you're okay with the word 'devices'? I am looking at starting a branch for additions to osd.py.

jan--f commented on May 23, 2024

swiftgist commented on May 23, 2024

Summarizing our conversation earlier: we will go with explicit keywords with default values and begin migrating the pillar data to a ceph namespace.

For example,

ceph:
  storage:
    devices:
      '/dev/vdb': 
        format: bluestore
        journal: ""
        encryption: dmcrypt 
        journal_size: ""

An empty string means "accept the default". I think with the move to "ceph:storage", osds is probably better than devices.
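
A minimal sketch of that empty-string convention (the default values here are assumptions for illustration, not DeepSea's actual defaults):

DEFAULTS = {
    'format': 'bluestore',   # assumed cluster-wide default
    'encryption': 'none',
    'journal': None,         # None -> journal colocated with data
    'journal_size': None,    # None -> let ceph-disk pick its default
}

def effective_settings(osd):
    """Treat empty strings in the pillar as 'accept the default'."""
    return {key: (osd.get(key) or default) for key, default in DEFAULTS.items()}

# effective_settings({'format': 'bluestore', 'journal': '',
#                     'encryption': 'dmcrypt', 'journal_size': ''})
# -> {'format': 'bluestore', 'encryption': 'dmcrypt', 'journal': None, 'journal_size': None}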

swiftgist commented on May 23, 2024

I think we need to match the actual names for bluestore, so replace journal with wal. I think I would like to leave the db on the main device for the moment. In other words, we start with the left hand side of slide 16 from

http://events.linuxfoundation.org/sites/events/files/slides/20170323%20bluestore.pdf

This mirrors the behavior we currently have with xfs and journals. Adding an extra device will likely complicate both the filter runner/module that Jan is working on as well as the remove/rescind logic. We will need to support it and if it turns out to be easier than expected, that's great.

For the structure, I think this works for now

ceph:
  storage:
    devices:
      '/dev/vdb': 
        format: bluestore
        wal: ""
        wal_size: ""
        encryption: dmcrypt 

and this would be the next iteration:

ceph:
  storage:
    devices:
      '/dev/vdb': 
        format: bluestore
        wal: ""
        wal_size: ""
        db: ""
        db_size: ""
        encryption: dmcrypt 
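
For reference, a rough sketch of how those keys might end up on a ceph-disk command line (the --block.wal/--block.db flags come from Luminous-era ceph-disk and the helper itself is hypothetical; verify against the version in use):

def prepare_command(device, cfg):
    """Assemble a ceph-disk prepare call from one devices entry (sketch)."""
    cmd = ['ceph-disk', '-v', 'prepare']
    if cfg.get('encryption') == 'dmcrypt':
        cmd.append('--dmcrypt')
    if cfg.get('format') == 'bluestore':
        cmd.append('--bluestore')
        if cfg.get('wal'):
            cmd += ['--block.wal', cfg['wal']]
        if cfg.get('db'):
            cmd += ['--block.db', cfg['db']]
    else:
        cmd += ['--fs-type', 'xfs']
    # wal_size / db_size are not command-line flags here; they would be
    # handled elsewhere (e.g. config settings), not shown in this sketch.
    cmd.append(device)
    return cmd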

Any thoughts?

swiftgist commented on May 23, 2024

The issue is with Salt. With the pathname as a key in a dictionary, I can't think of a query that pulls from either one without littering the sls files. If you put the logic into a Salt module to hide the double query, you still have to manage when an OSD gets its own settings vs. the cluster settings. If you merge all of these structures into one file, you still have to manage this from a few different endpoints.

Salt natively did not have global vs. per-minion settings until SaltStack included stack.py. DeepSea has had the global, cluster, role and minion hierarchy with defaults by including stack.py, but that does depend on the structures containing the same keys. With every OSD having a unique pathname, keeping osd.sls simple and creating the necessary runner to manage the files lets everything stay straightforward, if not perfectly ideal.

Take a look at /srv/salt/ceph/osd/default.sls and work backwards. Jinja isn't a programming language, and the more complex that process becomes, the more involved the debugging is for the end user.

Now, one thing I am concerned about is how large the Salt pillar will get. With a few hundred disks, a "salt '*' pillar.items" might be intimidating to some. I will say I am mixed on that.

I just realized you were using a 'default' within one minion and I have been describing one for the entire cluster. If there is a way to get

salt 'data1*' pillar.get ceph:storage:devices:/dev/vdd

to return the merged results, that should be fine. I have not delved into Salt itself to see if there is key globbing there.
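
If Salt can't merge those on its own, a tiny execution module could hide the double lookup (the module and function names here are made up, not part of DeepSea):

# /srv/salt/_modules/osdconfig.py (hypothetical)
def settings(device):
    """Merge ceph:storage:devices:default with the per-device entry."""
    devices = __pillar__.get('ceph', {}).get('storage', {}).get('devices', {})
    merged = dict(devices.get('default', {}))
    merged.update(devices.get(device) or {})
    return merged

# salt 'data1*' osdconfig.settings /dev/vdd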

jan--f commented on May 23, 2024

I think the verbosity of this approach is not too bad. At no point does a user have to write this... it will all be generated. The presentation in the pillar is a different animal. But if I understand correctly, we don't have to put all of this into the pillar, do we?

jan--f commented on May 23, 2024

As for wal and db, I think it makes sense to put it all in there, doesn't it?

Martin-Weiss commented on May 23, 2024

When I look at https://www.sebastien-han.fr/blog/2016/03/21/ceph-a-new-store-is-coming/ I find this idea:

So you can imagine HDD as a data drive, SSD for the RocksDB metadata and one NVRAM for RocksDB WAL

So it would be great if we could specify these parameters per OSD in case of Bluestore.
1 -> where do we want to place the data (device/partition) -> i.e. the HDD
2 -> where the metadata (device/partition) -> i.e. the SSD partition
3 -> where the WAL (device/partition) -> i.e. the NVMe partition

I am not sure what sizes and ratios make sense for 2 and 3 - and I assume that there are scenarios where
1+2+3 = same device (default)
1 = one device, 2+3 second device (similar to XFS / journal in classic OSDs)
1 = HDD, 2 = SSD, 3 = NVMe

swiftgist commented on May 23, 2024

Considering the timeframe and the lack of experimentation, I would like to initially support where the data and wal go, since that is most similar to data and journal with XFS. The db would still reside with the data. BTW, the metadata is actually its own partition on the main device; the db is specifically the RocksDB database.

This will allow some time for experimenting with two scenarios: all on same device, wal on separate device. We currently do not have recommendations for sizes or ratios and will need to investigate.

If db and db_size appear in the configuration, as an admin I would expect the setting to be honored, and I would be confused if it wasn't. So my hope is that this period without the db would be short-lived and that db support is implemented relatively quickly. With it, though, we will need recommendations for four partitions on two or three devices.

swiftgist commented on May 23, 2024

Latest structure working with the wip-osd branch: https://github.com/SUSE/DeepSea/tree/wip-osd

#storage:
#  data+journals: []
#  osds:
#  - /dev/vdb
#  - /dev/vdc

#ceph:
#  storage:
#    osds:
#      /dev/vdb:
#        format: filestore

#ceph:
#  storage:
#    osds:
#      /dev/vdb:
#        format: filestore
#        journal: /dev/vdc

#ceph:
#  storage:
#    osds:
#      /dev/vdb:
#        format: bluestore
#        wal: /dev/vdc
#        wal_size: 200M

#ceph:
#  storage:
#    osds:
#      /dev/vdb:
#        format: bluestore
#        db: /dev/vdc
#        db_size: 200M

ceph:
  storage:
    osds:
      /dev/vdb:
        empty: nothing
      /dev/vdc:
        empty: nothing

Note that format has changed from 'xfs' to 'filestore'. Also, the format defaults to bluestore when using the new structure; however, the dictionary cannot be completely empty currently.
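
For what it's worth, a sketch of how a module might keep the two structures mutually exclusive while supporting both (hypothetical, not taken from the wip-osd branch):

def storage_config(pillar):
    """Prefer the new ceph:storage:osds structure, fall back to the old
    storage:osds/data+journals one, and refuse to mix them."""
    new = pillar.get('ceph', {}).get('storage', {}).get('osds')
    old = pillar.get('storage')
    if new and old:
        raise ValueError("old and new storage structures are mutually exclusive")
    return ('ceph', new) if new else ('legacy', old)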

Martin-Weiss commented on May 23, 2024

Some questions:

  1. Will the old format still work in the future, or do we need to adjust the proposals when upgrading?

storage:
  data+journals:
  - /dev/vdb: /dev/vdd
  - /dev/vdc: /dev/vdd
  osds: []
  2. Will this also work and be supported:

ceph:
  storage:
    osds:
      /dev/vdb:
        format: bluestore
        wal: /dev/vdc
        wal_size: 200M
        db: /dev/vdc
        db_size: 200M

Thanks!

swiftgist commented on May 23, 2024

Yes to both formats being supported.

jan--f commented on May 23, 2024

I think this can be closed now too? Implemented by #275 and #464.
