
xapi-storage's Introduction

  • manages clusters of Xen hosts as single entities
  • allows running VMs to be migrated between hosts (including with storage) with minimal downtime
  • automatically restarts VMs after host failure ("High Availability")
  • facilitates disaster recovery cross-site
  • simplifies maintenance through rolling pool upgrade
  • collects performance statistics for historical analysis and for alerting
  • has a full-featured XML-RPC based API, used by clients such as XenCenter, Xen Orchestra, OpenStack and CloudStack

The xapi toolstack is developed by the xapi project: a sub-project of the Linux Foundation Xen Project.

Contents

  • Architecture: read about how the components of the xapi toolstack work together
  • Features: learn about the features supported by xapi and how they work.
  • Designs: explore designs for cross-cutting features.
  • Xen API documentation: explore the Xen API
  • Futures: find out how the xapi toolstack is likely to change and how you can help.
  • Xapi project: learn about the xapi project on the Xen wiki

Components

  • Xapi: manages a cluster of Xen hosts, co-ordinating access to network and storage.
  • Xenopsd: a low-level "domain manager" which takes care of creating, suspending, resuming, migrating, rebooting domains by interacting with Xen via libxc and libxl.
  • Xcp-rrdd: a performance counter monitoring daemon which aggregates "datasources" defined via a plugin API and records history for each.
  • Xcp-networkd: a host network manager which takes care of configuring interfaces, bridges and OpenVSwitch instances
  • Squeezed: a single host ballooning daemon which "balances" memory between running VMs.
  • SM: Storage Manager plugins which connect Xapi's internal storage interfaces to the control APIs of external storage systems.

xapi-storage's People

Contributors

djs55, edwintorok, euanh, gaborigloi, jonludlam, kc284, krizex, lindig, marksymsctx, mseri, psafont, rck, robhoes, simonjbeaumont


xapi-storage's Issues

Is SMAPIv3 still maintained?

Hello there!

Just asking if SMAPIv3 (or v5 now?) is still actively developed, because there isn't a lot of activity here (plus qemu-dp development is done behind closed doors, so it's hard to tell).

If we (XCP-ng team) can help on anything, let us know :)

Possibility to pass a raw device?

Is it possible to pass/attach a raw device (e.g. a whole disk, or any block device) directly to a VM?

If yes, what would be the best way to do so?

SR.attach probably needs configuration

The URI is always the input to SR.create/SR.attach, so it can't be modified to include configuration. If configuration is passed only to SR.create, it may not be possible to write it to the storage for the attach to read. Additionally, we would like to allow the user to unplug the storage and supply a different configuration.
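
One possible workaround, sketched below in Python, is for the plugin to persist the create-time configuration in a well-known location keyed by the SR URI, so that SR.attach can recover it from the URI alone. The state directory, helper names and JSON format are assumptions for illustration, not part of xapi-storage.

    # Hypothetical sketch: persist the SR configuration at create time so that
    # SR.attach, which only receives the URI, can recover it later.
    # The state directory and helper names are assumptions, not part of xapi-storage.
    import hashlib
    import json
    import os

    STATE_DIR = "/var/lib/myplugin/sr-config"  # assumed location

    def _config_path(uri):
        # Key the stored configuration on a hash of the SR URI.
        digest = hashlib.sha256(uri.encode()).hexdigest()
        return os.path.join(STATE_DIR, digest + ".json")

    def sr_create(uri, configuration):
        os.makedirs(STATE_DIR, exist_ok=True)
        with open(_config_path(uri), "w") as f:
            json.dump(configuration, f)
        # ... create the actual storage here ...

    def sr_attach(uri):
        # Recover the configuration saved by sr_create; an SR.attach that
        # accepted configuration directly would make this unnecessary.
        with open(_config_path(uri)) as f:
            configuration = json.load(f)
        # ... attach using uri + configuration ...
        return configuration

Note that this still does not cover the case above where the user unplugs the storage and wants to supply a different configuration on the next attach; that argues for passing configuration to SR.attach itself.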

Best approach for block based storage

Hi there!

Asking for advice on a block based scenario.

We are successfully using qcow2 in a file-based implementation for SMAPIv3 (BTW, is that still the "common" name of this API?). Basically, you have your device, a filesystem, then your qcow2 files inside. That's fine, and I suppose it's similar to the way you do it for GFS2.

But what would be the best approach for dealing with block devices? In short, where should the SMAPIv3 logic stop and a fully independent storage logic start?

I see 2 categories:

  1. One big block storage that you split into smaller block "zones" (the LVM approach done in SMAPIv1), e.g. one big LUN for all your VMs.
  2. One block storage per VDI, e.g. one LUN per VM disk

Given your experience building a storage stack, what would be the best solution?

I'm not talking about a shared SR, only about a local scenario.

Should datapath.attach return multiple uris?

In particular, if I have a .qcow file, I would probably prefer to use the Qemu qdisk to serve it up. If I want to import data in dom0, I need to fall back to a block device.

Alternatively: the tool in dom0 which needs the block device could run the block attach plugin?

Alternatively: we could ban blkfronts in dom0 (but this causes problems with suspend/resume and pygrub on file SRs)
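
Purely as an illustration of what returning multiple URIs might look like, attach could return a list of access methods ordered by preference, letting a dom0 tool pick a block-device entry while the normal VM datapath picks the qdisk one. The function names and URI schemes below are invented for the sketch and are not the actual xapi-storage Datapath interface.

    # Hypothetical sketch: a datapath attach that returns several access
    # methods, ordered from most to least preferred. Names and URI schemes
    # are invented for illustration.
    def datapath_attach(volume_uri, domain):
        # A qcow file could be served by qemu qdisk for a guest, while a
        # dom0 tool that needs raw data would use the block device fallback.
        return [
            {"implementation": "qdisk", "uri": "qdisk://" + volume_uri},
            {"implementation": "blkback", "uri": "block:///dev/example-device"},
        ]

    # A tool in dom0 that needs a plain block device could then scan for one:
    def pick_block_device(access_methods):
        for method in access_methods:
            if method["uri"].startswith("block://"):
                return method["uri"]
        raise RuntimeError("no block device access method available")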

Variant types are broken

    (** The choice of blkback to use. *)
    type backend = {
      domain_uuid: string; (** UUID of the domain hosting the backend *)
      implementation: string; (** choice of implementation technology *) 
    } with rpc

SMAPIv3 performance

As suggested by @kc284, I'm posting here for a better/more fluid discussion regarding the storage stack 👍 (thank God I prefer GitHub to Jira!)

Since GFS2 started to use SMAPIv3 and qcow2 file format, we decided to do some performance tests.

In order to keep it as simple as possible, only the file-based SR is tested:

xe sr-create type=filebased name-label=test-filebased device-config:file-uri=file:///mnt/ssd

A few minor issues: name-label and name-description aren't pushed correctly to XAPI; the SR is named "SR NAME" with a "FILEBASED SR" description. It's only a small glitch, but at least it's reported. Otherwise, I can confirm the disk file is created and is a valid qcow2 file.

I ran a benchmark on a Samsung 850 EVO SSD, in the same VM. Before benchmarking, I created a local 'ext' SR on the same disk, still with the same VM, so I could compare.

Here are the results:

  • Sequential read then write (queue depth 32, 1 thread): SMAPIv3 is 3 times slower than "ext" SR. Note that with SMAPIv3 the tapdisk process seems to be at 100%.
  • Random Read (4KiB, queue depth 8, 8 threads): SMAPIv3 is 150 times slower than "ext" SR
  • Random Write (4KiB, queue depth 8, 8 threads): SMAPIv3 is 95 times slower than "ext" SR

If you want the detailed numbers, let me know.

Did I miss something during the SR creation? Since GFS2 is basically file-level + the GFS2 FS + a cluster on top, I suppose I should expect roughly the same results.

Add setters for `name_label` `name_description`

We can set a name/description in Volume.create, but we can't (see the sketch after this list):

  • update these later
  • select new values for a clone or snapshot (this is useful to mark the created Volume to avoid having to run a GC later if the client fails between the clone and set calls)
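
A hedged sketch of what such setters might look like on the plugin side, assuming hypothetical Volume.set_name / Volume.set_description entry points and an optional name/description argument to clone; the signatures below are illustrative, not a confirmed part of the interface.

    # Illustrative sketch only: hypothetical setter entry points a volume
    # plugin could implement alongside Volume.create.
    class Volume(object):
        def set_name(self, dbg, sr, key, new_name):
            # Rename the volume identified by 'key' within the attached SR.
            raise NotImplementedError("plugin-specific rename goes here")

        def set_description(self, dbg, sr, key, new_description):
            # Update the human-readable description of the volume.
            raise NotImplementedError("plugin-specific update goes here")

        def clone(self, dbg, sr, key, new_name=None, new_description=None):
            # Accepting a name/description at clone time would let a client
            # label the new volume atomically, avoiding a separate set call
            # (and the GC problem if the client fails between clone and set).
            raise NotImplementedError("plugin-specific clone goes here")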

Make `type sr` abstract?

In the IDL the type sr (for an attached SR) should probably be marked as abstract. We should return it from SR.attach, store it away somewhere and resupply it later.

For the Python boilerplate this probably corresponds to using pickle to store away the result. The xapi-storage-service could persist a mapping of SR.uri to type sr on disk.
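
A minimal sketch of that persistence idea, assuming the boilerplate simply pickles whatever opaque value SR.attach returned, keyed by the SR URI; the file locations and helper names are assumptions.

    # Sketch: persist the opaque 'sr' value returned by SR.attach, keyed by
    # the SR URI, so later calls can resupply it. Paths and names are assumed.
    import os
    import pickle

    ATTACHED_DIR = "/var/run/myplugin/attached"  # assumed location

    def remember_attached(uri, sr):
        os.makedirs(ATTACHED_DIR, exist_ok=True)
        path = os.path.join(ATTACHED_DIR, uri.replace("/", "_"))
        with open(path, "wb") as f:
            pickle.dump(sr, f)  # 'sr' is treated as an opaque token

    def lookup_attached(uri):
        path = os.path.join(ATTACHED_DIR, uri.replace("/", "_"))
        with open(path, "rb") as f:
            return pickle.load(f)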

Enhancement: Support automatic VDI compaction based on Guest TRIM/UNMAP/Discard commands

Background

A common issue with thin-provisioned (a.k.a. dynamic) virtual disk files is that they grow beyond the size of the actual guest data. This is because when a guest deletes data from the filesystem, the files are only unlinked, so the old data remains in the VDI.

With modern SSD drives, operating systems now have the ability to tell the storage layer which blocks are no longer in use, through the ATA TRIM or SCSI UNMAP command sets.

Automatic compaction of VDI files

Using the TRIM support in the guest OS it should be possible to automatically compact the VDI files used in Xen, by intercepting the TRIM or UNMAP commands.

  • This is possible in VirtualBox by emulating an SSD drive in the VM settings (nonrotational="true" discard="true"):
        <AttachedDevice nonrotational="true" discard="true" type="HardDisk" hotpluggable="false" port="1" device="0">
          <Image uuid="{a8763005-3f43-4ba8-843a-eedae46989ff}"/>
        </AttachedDevice>

Of course, the huge advantage is being able to keep thin-provisioned VDI storage thin with low-resource automatic compaction. Today we have to fill a disk with zeroes and then either make the VHD files sparse or use some VHD magic to compact the disks. Very time consuming and neither user- nor admin-friendly.

Looking at the schematics at https://xapi-project.github.io/xapi-storage/#learn-architecture, it may be possible to do this with a volume plugin. Possibly Xapi needs to be extended to forward the TRIM requests to the plugin.

OCaml .mli is missing newlines in record types

This doesn't work very well:

    type volume = { key: string; (** A primary key for this volume. The key must be unique within the enclosing Storage Repository (SR). A typical value would be a filename or an LVM volume name. *)name: string; (** Short, human-readable label for the volume. Names are commonly used by when displaying short lists of volumes. *) description: string; (** Longer, human-readable description of the volume. Descriptions are generally only displayed by clients when the user is examining volumes individually. *) read_write: bool; (** True means the VDI may be written to, false means the volume is read-only. Some storage media is read-only so all volumes are read-only; for example .iso disk images on an NFS share. Some volume are created read-only; for example because they are snapshots of some other VDI. *) virtual_size: int64; (** Size of the volume from the perspective of a VM (in bytes) *) uri: (string list); (** A list of URIs which can be opened and used for I/O. A URI could  reference a local block device, a remote NFS share, iSCSI LUN or  RBD volume. In cases where the data may be accessed over several  protocols, the list should be sorted into descending order of  desirability. Xapi will open the most desirable URI for which it has  an available datapath plugin. *) } with rpc

Take-away lessons of a volume plugin (PoC) for zfs

Hello everyone,

In this issue, I would like to share my experience writing a volume plugin for the zfs filesystem for SMAPIv3. In this implementation, the volume plugin represents Storage Repositories (SRs) as zfs pools and Volumes as zfs volumes. A zfs volume is a dataset that represents a block device and is created in the context of a filesystem. Once created, a volume can be accessed as a raw block device, e.g., /dev/zvol/pepe. Zfs supports operations over volumes like snapshot(), clone() or promote(). This simplifies the driver, but in some cases the actual implementation becomes a bit tricky, because XAPI does some operations in a certain order that can't be done in a zfs filesystem in the same order. I thought this PoC could be interesting for understanding better why those tasks are tricky. This is a summary, and there are still many things that I am not sure about.
First of all, the xe sr-create command ends up invoking zpool create [name] to create a pool. The only required parameter is the block device(s) on which the zfs filesystem will be installed. The xe vdi-create command ends up invoking zfs create -V [pool] [name] [size], which creates a new volume in the pool. The volume can be accessed as a raw block device at /dev/zvol/[name].
In zfs, snapshots are taken with the zfs snapshot [volume] command. When a new snapshot is created, the new volume.id is used as the name for the snapshot. Snapshots are read-only volumes. To access a snapshot, we have to either mount it or create a clone from it; otherwise the snapshot is not accessible. The current PoC therefore always creates a clone when a snapshot is taken. The clone is named like the snapshot but belongs to the pool in which it is created. You can see this in the following output: when the snapshot @2 is created, the cloned volume 2 is created too.

$ zfs list
NAME       USED  AVAIL     REFER  MOUNTPOINT
hola      10,3G  8,58G     48,5K  /hola
hola/1    10,3G  18,9G       12K  -
hola/1@2     0B      -       12K  -
hola/2       0B      -       12K  -

This cloned volume is important when issuing xe vm-copy, i.e., create-vm-from-snapshot, since the command requires accessing the volume to copy the content for the new VM's VDI. This is not possible if the volume is a snapshot.
Another example is the xe snapshot-revert command, which reverts the state of a VM from a snapshot. The first step tries to destroy the current VDI. However, this is not possible, since the current VDI has children, e.g., the snapshot. The correct way to do it is to clone directly from the snapshot, promote the new volume and finally destroy the main VDI. The current PoC only accepts snapshots for the clone method, and it works around this by destroying the parent VDI just after the new volume is promoted. The current implementation of xe snapshot-revert is as follows:

  1. Destroy parent vdi but fail to remove it from db and zfs
  2. Clone from snapshot
    • A volume is created from the snapshot
    • The volume is promoted
    • The parent volume is destroyed
    • The children of the main volume are promoted to the clone.

Note that this works only if reverting is from the latest snapshot. Otherwise, the main volume can't be destroyed because there are still newer snapshots that can't be promoted to the new clone.
The current implementation of volume destroy relies on the zfs destroy command. The method checks whether the target is a snapshot or a volume. If it is a snapshot, the method first builds the correct path to the snapshot and then destroys it. The method also checks whether there is a clone with the same name and destroys it; this aims to remove the clone that the snapshot command creates. Note that when trying to destroy a volume with children, zfs destroy fails but vdi-destroy succeeds.
This is an overall summary of the current PoC. I may be missing some chunks. I may release a design document soon that explains all the implementation details and decisions.
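
For reference, here is a sketch of the clone/promote/destroy sequence described above, written as thin Python wrappers around the zfs CLI. The dataset names are placeholders, and the sequence follows the PoC description rather than a finished implementation.

    # Sketch of the revert-from-snapshot sequence described above, driving
    # the zfs CLI via subprocess. Dataset names are placeholders.
    import subprocess

    def zfs(*args):
        subprocess.check_call(["zfs"] + list(args))

    def revert_from_snapshot(pool, volume, snapshot, clone_name):
        # e.g. pool="hola", volume="1", snapshot="2", clone_name="restored"
        snap = "%s/%s@%s" % (pool, volume, snapshot)
        clone = "%s/%s" % (pool, clone_name)
        zfs("clone", snap, clone)    # 1. create a writable clone of the snapshot
        zfs("promote", clone)        # 2. make the clone independent of its origin
        zfs("destroy", "%s/%s" % (pool, volume))  # 3. drop the old parent volume
        # As noted above, this only works when reverting from the latest
        # snapshot; otherwise newer snapshots keep the parent volume alive.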

Configuration parameter in Volume.create()

Hello, for a volume plugin I need a configuration field through which I can pass some parameters. The spec defines it here; however, I can't find it in the implementation. My plan would be to add it, but I would like to ask whether you have had the same problem and, if so, how you approached it.

Regards,
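
As a purely hypothetical illustration of the kind of extension being asked about, a plugin-side Volume.create could accept an extra dictionary of plugin-specific keys; the parameter name and its handling below are assumptions, not the current interface, and the returned fields simply mirror the volume record documented elsewhere in this repository.

    # Hypothetical sketch: Volume.create extended with a plugin-specific
    # 'configuration' dictionary. Parameter names are assumptions.
    def volume_create(dbg, sr, name, description, size, configuration=None):
        configuration = configuration or {}
        # Example: a zfs-backed plugin might honour a compression hint.
        compression = configuration.get("compression", "off")
        # ... create the volume using 'size' and 'compression' here ...
        return {
            "key": name,
            "name": name,
            "description": description,
            "read_write": True,
            "virtual_size": size,
            "uri": ["block:///dev/zvol/example/" + name],  # placeholder URI
        }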

Unclear relationship between URI and device_config

Currently we have

SR.create sr uri configuration
SR.attach sr uri

and we have per-host "device_config".

Proposal:

  1. set the URI = "device_config:uri"
  2. pass the device_config as 'configuration' (minus 'uri' and 'SRmaster'?)

The documentation should say that 'configuration is passed at create time only, because it affects the creation of the SR. Once an SR is created, the configuration is self-evident. To refer to an existing SR, the URI is always sufficient.'
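
A small sketch of the proposed split, assuming device_config is a flat string dictionary; the helper name is invented, and whether 'SRmaster' should be stripped is left as the same open question as in the proposal.

    # Sketch of the proposed device_config -> (uri, configuration) split.
    # The helper name is invented; stripping 'SRmaster' mirrors the open
    # question in the proposal above.
    def split_device_config(device_config):
        uri = device_config["uri"]               # 1. URI comes from device_config:uri
        configuration = {
            k: v for k, v in device_config.items()
            if k not in ("uri", "SRmaster")      # 2. the rest becomes 'configuration'
        }
        return uri, configuration

    # Example usage:
    # uri, configuration = split_device_config(
    #     {"uri": "file:///mnt/ssd", "SRmaster": "true", "option": "value"})
    # SR.create(sr, uri, configuration)   # configuration at create time only
    # SR.attach(sr, uri)                  # the URI alone suffices afterwards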

Documented feature flags

At the very least we should advertise when behaviour changes. We've done this with 'capabilities' quite successfully already.

Alternatively we could support multiple versions of individual operations, perhaps by adding a version suffix to the name in a standard way?
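
As a toy illustration of the version-suffix idea (entirely hypothetical, showing only the dispatch shape, not anything xapi-storage currently does):

    # Toy sketch: pick the newest advertised variant of an operation, where
    # newer versions carry a "_vN" suffix. Naive lexicographic comparison,
    # fine for single-digit versions only.
    def choose_operation(advertised, base_name):
        versions = [op for op in advertised
                    if op == base_name or op.startswith(base_name + "_v")]
        return sorted(versions)[-1] if versions else None

    print(choose_operation(["Volume.create", "Volume.create_v2"], "Volume.create"))
    # -> "Volume.create_v2"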

Should SR operations take 'uri' instead of 'sr'

Should we insist that clients do

sr <- SR.attach uri
vols <- SR.ls sr
SR.detach uri
<-- now 'sr' is closed/detached/invalid

or should we treat attach/detach as optimisations and allow clients to

vols <- SR.ls uri

Dundee xapi-storage question

Dundee (Beta) question: I'm interested in running a storage driver (SR) in a domain other than dom0. In previous versions of XenServer, this has been possible (with blktap2 and some SMAPIv2 code) by configuring:

xe pbd-param-set uuid=$PBD other-config:storage_driver_domain=

but this does not seem to be the case in Dundee. I read here:

http://djs55.github.io/xapi-storage/datapath.html

that the datapath plugin takes a URI which points to virtual disk data and chooses a Xen datapath implementation: driver domain, blkback implementation and caching strategy. If I am reading this correctly, a properly formatted URI will allow disaggregated storage by specifying a storage driver domain. Is this functionality currently implemented, and if so, can I have a hint on how it would be accomplished?

Development schedule

It has been quiet for a while. Any chance this project will resume active development soon?
