
orchestrator's People

Contributors

berlincount, brianip, calind, cezmunsta, daniel-2647, dgryski, dougfales, dveeden, eagleeyejohn, ejortegau, enisoc, gtowey, laurent-indermuehle, luisyonaldo, mantas-sidlauskas, marcosvm, mateusduboli, maurosr, mcrauwel, orkestrov, rlowe, rluisr, roman-vynar, seeekr, shlomi-noach, shuhaowu, sjmudd, slach, sqcesario, timvaillancourt

orchestrator's Issues

go test ./go/... does not work for me. Does it work for you?

For PRs you ask us to check that go test passes. For me the current code does not pass; this is what I see:

[user@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ go test ./go/...
?   	github.com/github/orchestrator/go/agent	[no test files]
ok  	github.com/github/orchestrator/go/app	0.017s
?   	github.com/github/orchestrator/go/attributes	[no test files]
?   	github.com/github/orchestrator/go/cmd/orchestrator	[no test files]
ok  	github.com/github/orchestrator/go/collection	0.017s
ok  	github.com/github/orchestrator/go/config	0.020s
?   	github.com/github/orchestrator/go/db	[no test files]
?   	github.com/github/orchestrator/go/discovery	[no test files]
ok  	github.com/github/orchestrator/go/http	0.029s
ok  	github.com/github/orchestrator/go/inst	0.023s
?   	github.com/github/orchestrator/go/logic	[no test files]
?   	github.com/github/orchestrator/go/metrics	[no test files]
?   	github.com/github/orchestrator/go/metrics/query	[no test files]
?   	github.com/github/orchestrator/go/os	[no test files]
?   	github.com/github/orchestrator/go/process	[no test files]
?   	github.com/github/orchestrator/go/remote	[no test files]
2017-04-30 23:29:54 INFO MutualTLS requested, client certificates will be verified
--- FAIL: TestNewTLSConfig (0.00s)
	ssl_test.go:40: Could not create new TLS config: No certificates parsed
--- FAIL: TestVerify (0.00s)
	ssl_test.go:94: x509: RSA key missing NULL parameters
2017-04-30 23:29:54 INFO Decrypted /var/folders/v5/6pdpvnbn7pg9h41pnbsr1klscmtspv/T/ssl_test488963850 successfully
--- FAIL: TestAppendKeyPair (0.01s)
	ssl_test.go:154: Failed to append certificate and key to tls config: x509: RSA key missing NULL parameters
2017-04-30 23:29:54 INFO Decrypted /var/folders/v5/6pdpvnbn7pg9h41pnbsr1klscmtspv/T/ssl_test572616158 successfully
--- FAIL: TestAppendKeyPairWithPassword (0.01s)
	ssl_test.go:169: Failed to append certificate and key to tls config: x509: RSA key missing NULL parameters
FAIL
FAIL	github.com/github/orchestrator/go/ssl	0.061s
[user@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$

This is true for me as of commit

commit c61104ea3b9a853affe6c4a6d4f7d37f420ae27d
Merge: 5a9547b1 72a59a23
Author: Shlomi Noach <[email protected]>
Date:   Thu Apr 20 17:51:33 2017 +0300

Shlomi, does this build and test cleanly for you? If so, there's some difference in our environments that is triggering this, and I would like to fix mine so the build works cleanly and I can genuinely verify that go test ./go/... passes. If it makes any difference, this is being run on OSX, but I also see an issue on Linux.

Note: the build checks via Travis do seem to pass, so there is a difference between what you ask us to check when someone creates a PR and what Travis actually checks.

Show active node "active time" and "uptime" in https://myurl.example.com/web/status

The current /web/status url shows the list of members of the orchestrator cluster and which node is active. We also show the version of orchestrator that is running.

A couple of things that are not shown for each orchestrator node:

  • process uptime
  • if an active node: time active

This gives you some idea of the stability of the cluster and also how long it has been running. Without that it's pretty hard to tell.

Why is this useful? It lets you see whether your network is stable and whether the servers have been running for as long as you expect, all at the click of a button, without dropping into the OS or reaching for other monitoring tools.
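A minimal sketch of how these two values could be tracked in Go, using hypothetical names (processStartTime, electedAt) rather than orchestrator's actual internals:

package status

import "time"

var processStartTime = time.Now() // captured once at process startup

// electedAt would be set whenever this node wins the active-node election.
var electedAt time.Time

// statusUptime returns the two values this issue asks to expose on /web/status:
// process uptime and, for the active node, how long it has been active.
func statusUptime(isActiveNode bool) (uptime, activeTime time.Duration) {
    uptime = time.Since(processStartTime)
    if isActiveNode && !electedAt.IsZero() {
        activeTime = time.Since(electedAt)
    }
    return uptime, activeTime
}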

Why does the running log throw an Unknown column error?

hi,

I just downloaded the new version tarball: orchestrator-2.0.2-linux-amd64.tar.gz

My questions are:

  1. When I run ./orchestrator http from the /usr/local/orchestrator directory, the running log throws ERROR Error 1054: Unknown column 'first_seen_active' in 'field list'

  2. When I run ./orchestrator --discovery=false http and access the web site, the running log throws ERROR Error 1054: Unknown column 'candidate_database_instance.promotion_rule' in 'field list'

I looked at my orchestrator database; it has 36 tables. I tried `select promotion_rule from candidate_database_instance;` and MySQL returned the `ERROR 1054 (42S22): Unknown column 'promotion_rule' in 'field list'` message.

Is the SQL schema for this version old, or was it simply not upgraded?

Avoid promotion if replica not up-to-date with relay log (configurable?)

Storyline: #83

(at least configurable?)

On master failover, do not promote a replica if it hasn't consumed its relay logs. This can happen when all replicas are lagging badly at the time of master failure.
In such a case a reset slave all on the promoted replica causes data loss.

Typically replicas are not lagging so much, and by the time the failure detection takes place, they will have consumed their relay logs.
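A minimal sketch of the pre-promotion check being suggested here, assuming the replica's last read and last executed master coordinates are available; the types are illustrative, not orchestrator's own:

package promotion

// BinlogCoordinates is an illustrative stand-in for a master log file/position pair.
type BinlogCoordinates struct {
    LogFile string
    LogPos  int64
}

// Replica holds what the IO thread has fetched versus what the SQL thread has applied.
type Replica struct {
    ReadCoordinates BinlogCoordinates // fetched from the master into relay logs
    ExecCoordinates BinlogCoordinates // executed from the relay logs
}

// HasConsumedRelayLogs is true when everything fetched has also been applied;
// promoting (and issuing RESET SLAVE ALL) before this point risks losing relay log data.
func (r *Replica) HasConsumedRelayLogs() bool {
    return r.ReadCoordinates == r.ExecCoordinates
}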

Binlog saving, parsing and transfer from failed server to new master

Hello,

I'm comparing MHA to Orchestrator and I was wondering if in non-GTID replicated environments in cases where the master fails, does Orchestrator make any effort to save, parse and transfer binlog entries from the failed MySQL master to the new MySQL master to coalesce any data that was not captured on the new master through normal MySQL replication?

If not, is this something that would ever be attempted, or is it on the roadmap for the future?

Binlog servers (BLS) won't move between co-masters

My current replication topology is:

co-master A (writes)    ---> binlog01 --> slave A
     |
     |
co-master B (read-only) ---> binlog02 --> slave B

I have some maintenance I need to do on co-master A and want to point my binlog01 server to co-master B. When I try through the UI it doesn't seem to work; doing it through the CLI gives me the following panic:

[root@inv1 orchestrator]#  ./orchestrator -c move-below -i binlog01.domain:3306 -d masterB.domain:3306 --noop
2017-01-13 14:42:51 ERROR Will not resolve empty hostname
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/outbrain/orchestrator/go/inst.(*Instance).MajorVersion(0xc2080c8000, 0x0, 0x0, 0x0)
	/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance.go:282 +0xa2
github.com/outbrain/orchestrator/go/inst.(*Instance).IsSmallerMajorVersion(0xc2080c8000, 0xc2080c81c0, 0x82b9a0)
	/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance.go:308 +0x28
github.com/outbrain/orchestrator/go/inst.(*Instance).CanReplicateFrom(0xc2080c8000, 0xc2080c81c0, 0x1, 0x0, 0x0)
	/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance.go:437 +0x464
github.com/outbrain/orchestrator/go/inst.MoveBelow(0xc20801f8c0, 0xc20801eb80, 0x0, 0x0, 0x0)
	/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance_topology.go:407 +0x47c
github.com/outbrain/orchestrator/go/app.Cli(0x7ffca576a686, 0xa, 0x0, 0x7ffca576a694, 0x18, 0x7ffca576a6b0, 0x10, 0xc20802af20, 0x4, 0x0, ...)
	/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/app/cli.go:175 +0x1e02
main.main()
	/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/cmd/orchestrator/main.go:646 +0xae4

goroutine 9 [select]:
github.com/pmylund/go-cache.(*janitor).Run(0xc20802b000, 0xc20803def0)
	/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:946 +0x13f
created by github.com/pmylund/go-cache.runJanitor
	/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:964 +0x8c

goroutine 7 [select]:
github.com/pmylund/go-cache.(*janitor).Run(0xc20802af60, 0xc20803dd70)
	/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:946 +0x13f
created by github.com/pmylund/go-cache.runJanitor
	/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:964 +0x8c

goroutine 8 [select]:
github.com/pmylund/go-cache.(*janitor).Run(0xc20802af70, 0xc20803ddd0)
	/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:946 +0x13f
created by github.com/pmylund/go-cache.runJanitor
	/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:964 +0x8c

goroutine 10 [syscall]:
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
	/usr/local/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 11 [chan receive]:
database/sql.(*DB).connectionOpener(0xc208045b80)
	/usr/local/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:452 +0x31c

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 18 [chan receive]:
database/sql.(*DB).connectionOpener(0xc20811c0a0)
	/usr/local/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:452 +0x31c

goroutine 20 [chan receive]:
database/sql.(*DB).connectionOpener(0xc20811c640)
	/usr/local/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:452 +0x31c

is there something I'm missing here?
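The panic is a slice bounds error inside MajorVersion(), right after the "Will not resolve empty hostname" error, which suggests an instance with an empty version string reached the version comparison. A defensive sketch of the kind of guard that would avoid the crash, assuming MajorVersion splits a version such as "5.6.30" on dots (this is not the project's actual code):

package inst

import "strings"

// majorVersion returns the leading "X.Y" tokens of a version string, or nil if the
// version is empty/unparseable, instead of panicking with a slice bounds error.
func majorVersion(version string) []string {
    tokens := strings.Split(version, ".")
    if len(tokens) < 2 {
        return nil
    }
    return tokens[:2]
}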

GTID mode relocation not working with Percona releases

Hey !

I've just enabled Oracle GTID on my Percona 5.7.10 farm (2 masters + slaves). I tried to relocate using GTID mode and got the following error: "Cannot move via GTID as not both instances use GTID".

I first checked the orchestrator database and I see this:

+----------+---------------------+----------------------+-------------+-------------+-------------+
| hostname | last_checked        | supports_oracle_gtid | oracle_gtid | server_uuid | gtid_purged |
+----------+---------------------+----------------------+-------------+-------------+-------------+
| xxxx     | 2017-03-09 17:48:12 |                    0 |           1 |             |             |
| yyyy     | 2017-03-09 17:48:12 |                    0 |           1 |             |             |
| zzzz     | 2017-03-09 17:48:12 |                    0 |           1 |             |             |
....

The problem is that supports_oracle_gtid is not set while having gtid_mode = ON.

I checked the code, and it seems the query that fetches this status is restricted to isOracleMySQL instances.
https://github.com/github/orchestrator/blob/5c34db981e5b91f03fb31135b1f0bb3bfc3c4ed0/go/inst/instance_dao.go#L303

isPercona() should also fetch this information, since the GTID implementation is the same.
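A sketch of the suggested change; isOracleMySQL() and isPercona() are the names referenced in this issue, while the stand-in type and the surrounding query logic are illustrative:

package inst

// Illustrative flavor flag; in the real code this would be derived from version/version_comment.
type instance struct {
    flavor string
}

func (i *instance) isOracleMySQL() bool { return i.flavor == "mysql" }
func (i *instance) isPercona() bool     { return i.flavor == "percona" }

// shouldQueryOracleGTIDStatus reflects the proposed condition: Percona Server implements
// the same GTID mechanism as Oracle MySQL, so it should be queried for gtid_mode as well.
func shouldQueryOracleGTIDStatus(i *instance) bool {
    return i.isOracleMySQL() || i.isPercona()
}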

Thanks !

Failure recovery does not give enough detail on what it did and why

Orchestrator recovers from master or intermediate master failover pretty well and that is great. However, sometimes if you want to look at the details they are hard to find.

I would like to see the following information being shown in these 3 places:

  1. What the failure type was (I think this is already shown)
  2. Which replacement master was chosen and why. e.g. most up to date master in the same DC.
  3. The steps that were taken to adjust the replication topology, and any errors that occurred along the way. This information is available in the logs, but the logs may not be kept for long and are interleaved with the constant discovery activity, so keeping this specific information separately is useful should things go wrong.

This would make reporting problems and later investigation easier, as it wouldn't be necessary to scan through a large number of logs. The information exists, but it is not necessarily logged or kept for later use. Even when orchestrator works fine it's convenient to see these details to confirm correct behaviour; and should there be an issue, it makes it easier to diagnose the cause, whether that's orchestrator failing or the systems responding in some unexpected way.

Add more complete logging for failed RecoverDeadIntermediateMaster

A recent failed recovery shows the recovery steps as:

...
2017-05-07 19:48:37	- RecoverDeadIntermediateMaster: move to candidate intermediate master (hostname-removed:3306) did not complete: <nil>
...

Looking at the code, the condition for emitting this message is if err != nil || len(errs) > 0, so it seems we are not showing the information from errs, which is where the actual error is held.

Consequently the logging should show errs in addition to err.

I'll provide a PR for the one-line change to add this missing information.
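A sketch of the kind of one-line change meant here, using a hypothetical helper rather than the actual call site:

package recovery

import "fmt"

// formatIncompleteMoveMessage includes both the single error and the per-replica errors,
// so a failure carried only in errs no longer shows up as "<nil>".
func formatIncompleteMoveMessage(candidate string, err error, errs []error) string {
    return fmt.Sprintf(
        "- RecoverDeadIntermediateMaster: move to candidate intermediate master (%s) did not complete: err=%+v, errs=%+v",
        candidate, err, errs)
}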

Seen in a version I'm running based on 2.1.1

build.sh -b on OSX still failing

Despite #116 I still see build.sh -b failures on OSX.

[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ git diff github/master
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ git show --summary
commit cdba46216e5cc7fe187057caf414977b53621605
Merge: cbc5e2b ba45adf
Author: Shlomi Noach <[email protected]>
Date:   Mon Apr 3 14:32:31 2017 +0300

    Merge pull request #123 from samveen/fix_116

    Fix issue #116: `precheck()` honours `-b`

[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ ./build.sh -b
Build only; no packaging
Building via go version go1.8 darwin/amd64
go install runtime/internal/sys: mkdir /usr/local/go/pkg/linux_amd64: permission denied
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ uname
Darwin
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$

I am not running as root, yet build.sh is trying to install into /usr/local/go, which is wrong, and the referenced /usr/local/go/pkg/linux_amd64 is also wrong for OSX. Building a patched v2.0.2 version works fine.

Not a big deal right now, as I'm working off my own forked branch, but I do need to get up to date and expect to do that shortly.

RDS support in Orchestrator?

Hi all,

We've been investigating using RDS to replace our MySQL servers, and Orchestrator could end up being great for shuffling topologies around as we set things up.

I've connected Orchestrator to an RDS instance; it got confused regarding hostnames (as the @@HOSTNAME variable is something internal to RDS's infrastructure), and I know that RDS MySQL's topology adjustment commands are different.

Has any work been done into getting Orchestrator to support RDS, and is there any design consideration within Orchestrator that would outright preclude compatibility if work was put into working around the few differences between RDS MySQL and standard MySQL?

Thanks!

graceful-master-takeover and transactions that commit after readonly is set

While comparing orchestrator promotions with an internal alternative, I noticed that the order of operations in graceful-master-takeover allows for data loss if any transactions commit after read_only is set. I thought that maybe I could have PreFailoverProcesses STONITH the old master, but that process happens after the final coordinates are determined for the slave, so there's no way to prevent this race condition.

edit: oops, my test was faulty. I was inadvertently using a SUPER user, which was throwing off the results. This can be closed/deleted.

build.sh issues on OSX with -b

I see the following using the master branch of github/orchestrator at the latest commit (Date: Thu Mar 23 21:06:23 2017 +0200, RELEASE_VERSION 2.1.0):

[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ export GOPATH=~/src/orchestrator
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ ./build.sh -b
Build only; no packaging
Please install fpm and ensure it is in PATH (typically: 'gem install fpm')
rpmbuild not in PATH, rpm will not be built (OS/X: 'brew install rpm')
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$
  • I'm requesting no packaging, yet it looks like build.sh is trying to package. That's wrong.
  • This works in RELEASE_VERSION 2.0.2

So it looks like some of the option handling has broken. I've not had time to look at this further, but I will shortly need to update to 2.1, so it would be good to resolve these issues.

I see there are other issues reported in #101, so I guess some of this breakage may be related.

Crash in inst.moveReplicasViaGTID.func1

Crash in orchestrator based on a version very close to v2.1.

Crash report says:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7f10d5]

goroutine 1925650 [running]:
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1.1(0xc421785aa0, 0xc42000eb80)
        /builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:610 +0x25
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1(0xc421785aa0, 0xc42000eb80, 0xc420378580, 0xc422643730, 0xc422707600, 0xc422707620, 0xc422707640)
        /builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0xe0
created by github.com/github/orchestrator/go/inst.moveReplicasViaGTID
        /builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0x327

The version of go/inst/instance_topology.go is the same as in v2.0.2 (and that is almost identical to v2.1 except for an unrelated change).

I'll try to look at this later but am a bit busy with other things. Fuller anonymised logging shows:

2017-03-31 15:08:15 DEBUG Not elected as active node; active node: orchestrator2.dc; polling
[martini] Started GET /api/relocate-replicas/host5.dc/3306/host6.dc/3306 for 127.0.0.1:37735
2017-03-31 15:08:15 INFO Will move 7 replicas below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host2.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host9.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host10.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host3.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host8.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host4.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host7.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host2.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226526, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host3.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226528, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host10.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226529, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host8.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226530, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host9.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226527, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host7.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226531, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host4.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226532, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 ERROR ReadTopologyInstance(host2.dc:3306) show variables like 'maxscale%': Error 3159: Connections using insecure transport are prohibited while --require_secure_transport=ON.
2017-03-31 15:08:15 ERROR ReadTopologyInstance(host2.dc:3306) show variables like 'maxscale%': Error 3159: Connections using insecure transport are prohibited while --require_secure_transport=ON.
2017-03-31 15:08:15 DEBUG auditType:end-maintenance instance:host2.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226526
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7f10d5]

goroutine 1925650 [running]:
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1.1(0xc421785aa0, 0xc42000eb80)
        /builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:610 +0x25
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1(0xc421785aa0, 0xc42000eb80, 0xc420378580, 0xc422643730, 0xc422707600, 0xc422707620, 0xc422707640)
        /builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0xe0
created by github.com/github/orchestrator/go/inst.moveReplicasViaGTID
        /builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0x327

I am enforcing TLS on some boxes which currently prevents orchestrator reaching them. Maybe there's something related here but I am not sure.

For now just making the issue visible.
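Given the ReadTopologyInstance errors logged just before the panic, one plausible hardening is to skip replicas whose re-read failed instead of handing a nil instance to the per-replica goroutine. A rough sketch with illustrative types, not the project's actual moveReplicasViaGTID code:

package inst

import "sync"

type replicaInstance struct{ key string }

// moveReplicas launches one worker per successfully re-read replica and simply skips
// replicas whose read failed (readFn returning nil), avoiding a nil dereference in the worker.
func moveReplicas(keys []string, readFn func(string) *replicaInstance, moveFn func(*replicaInstance)) {
    var wg sync.WaitGroup
    for _, key := range keys {
        replica := readFn(key)
        if replica == nil {
            continue // read failed (e.g. TLS required); do not hand it to a worker
        }
        wg.Add(1)
        go func(r *replicaInstance) {
            defer wg.Done()
            moveFn(r)
        }(replica)
    }
    wg.Wait()
}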

feature request: timestamps in recovery.log

In troubleshooting, it would be helpful if timestamps were included somewhere in recovery.log.

from the orchestrator log:
2017-04-06 08:50:45 INFO CommandRun(echo 'Will recover from DeadMaster on dev-mysql101:3306' >> /tmp/recovery.log,[])

but then I don't have timestamps in recovery.log. It looks like it only holds the most recent recovery, but I am not fully sure. :) A simple solution would be to prepend a commented timestamp to the dump written to recovery.log.

Anonymize mode leaks the username.

Anonymize mode is great for sharing screenshots of replication topologies without revealing too much information about the exact minor versions running and the hostnames.

However, it leaks the username of the logged-in Orchestrator user.

Maybe the username should be replaced by "anon. user"

Thank you!

Setting read_only=OFF if intermediate master goes down.

If the master goes down, Orchestrator promotes a new master and changes the read_only flag to OFF.
If an intermediate master whose read_only was OFF goes down, Orchestrator promotes a new server but won't change read_only from ON to OFF. For example, if I use ProxySQL it won't know where it should send the write queries anymore.

I can change this variable with a hook, but my question is: is this the expected behaviour, or can Orchestrator change this by default somehow?

Thanks.

Handle Multi-source replication in Orchestrator

This is to discuss multi-source replication and how it can be handled in orchestrator

What?

  • Oracle MySQL provides multi-source replication from MySQL 5.7
  • MariaDB provides multi-source replication from MariaDB 10.0
  • The implementations are not compatible.

Why?

Some people may actually use this, so it's good for orchestrator to at least not do the wrong thing and not break anything.
Better still would be add increasing levels of functionality (according to demand).

Currently orchestrator runs show slave status and records the output; the last row overwrites any data from previous rows. This sort of works, but not really.

How?

Recognise when a slave is using multi-source replication

  • for Oracle MySQL SHOW SLAVE STATUS returns more than one row and a channel_name
  • For MariaDB I'm not sure how to recognise this
  • I would add an entry to Instance struct. Something like NumberOfMasters int.

True support for multi-source would require changes to these fields:

  • MasterKey
  • IsDetachedMaster
  • Slave_SQL_Running
  • Slave_IO_Running
  • HasReplicationFilters
  • SupportsOracleGTID
  • UsingOracleGTID
  • UsingMariaDBGTID
  • UsingPseudoGTID
  • ReadBinlogCoordinates
  • ExecBinlogCoordinates
  • IsDetached
  • RelaylogCoordinates
  • LastSQLError
  • LastIOError
  • SecondsBehindMaster
  • SQLDelay
  • ExecutedGtidSet
  • GtidPurged
  • SlaveLagSeconds
  • SlaveHosts
  • ReplicationDepth
  • IsCoMaster
  • HasReplicationCredentials
  • ReplicationCredentialsAvailable
  • SemiSyncEnforced

This suggests putting them in a different structure which is referenced from the Instance struct.
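A rough sketch of what such a per-channel structure might look like, borrowing a few field names from the list above (illustrative only):

package inst

// ReplicationChannel would hold the per-master state that today lives directly on Instance.
type ReplicationChannel struct {
    ChannelName         string // Oracle MySQL channel_name; MariaDB connection name
    MasterKey           string
    SlaveIORunning      bool
    SlaveSQLRunning     bool
    LastIOError         string
    LastSQLError        string
    SecondsBehindMaster int64
    ExecutedGtidSet     string
    UsingOracleGTID     bool
    UsingMariaDBGTID    bool
}

// A multi-source-aware Instance would then reference one channel per master.
type MultiSourceInstance struct {
    Key      string
    Channels []ReplicationChannel
}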

Show this in some places

  • for orchestrator -c topology output we could add an extra field single-source or multi-source to represent this
  • add a new command orchestrator -c multi-source-topology which would show the masters and slaves of a specific instance.
  • for the HTTP interface we could add a new symbol to indicate the server has more than one master; a mouse-over would show a list of master name, whether SQL/IO replication is working, and delay (similar to the output you get when mousing over the compact views)

Visualise the output of replication

  • not sure how we'd do this; perhaps we could only do it for "simple configurations" centred on a specific server? That is, not show the whole topology, but show the specific server as the centre of connections to its masters and slaves

Allow certain operations to work on multi-source servers

  • when relocating a slave make orchestrator aware of the different masters so that the appropriate slave configuration can be adjusted.

Multiple Build.sh issues

I found that the build has the following issues/deficiencies:

  • Precheck if TOPDIR in GOPATH
    As a complete newbie to go I had a lot of trouble building orchestrator manually because my TOPDIR was not in GOPATH.
    Given that precheck() in build.sh already checks for GOPATH, wouldn't it be a good idea to check that TOPDIR is inside it at the correct place? If need be, I can provide the required logic as a merge request.

  • PREFIX isn't considered for the service script for DAEMON_PATH
    The PREFIX should be correctly updated in the init.d service script. Once again, let me know if you'd prefer to fix it or want me to do the needful.

Add option to forcefully put a 5.6 slave below a 5.7 machine

This should not be allowed by default but should be possible with a -force flag.

The reason is:

  1. a 5.6 chain with a 5.7 branch.
  2. you want to promote the 5.7 branch to be the master
  3. This would result in 5.7 (new master) replicating to a 5.6 machine (old master)

But reversing replication might not work. So it's best to put a 5.6 machine in the 5.7 branch before switching the master to ensure it is replicating.

Replication from 5.7 to 5.6 only works if 5.7 is configured to create a 5.6 compatible binlog stream and is not really supported. However doing this for a short amount of time can be used to provide a rollback option in case things don't work out with the new version.

And of course 5.7→5.6 might also be 8.0→5.7 or 10.2→10.1.
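A sketch of how a forced override could sit around the existing version check; CanReplicateFrom and IsSmallerMajorVersion exist in orchestrator, but this wrapper and its types are hypothetical:

package relocate

import "errors"

type instance struct{ majorVersion int }

func (i *instance) isSmallerMajorVersion(other *instance) bool {
    return i.majorVersion < other.majorVersion
}

// canReplicateFrom refuses a version downgrade (e.g. a 5.6 replica below a 5.7 master)
// unless force is given, matching the temporary-rollback use case described in this issue.
func canReplicateFrom(replica, master *instance, force bool) error {
    if replica.isSmallerMajorVersion(master) && !force {
        return errors.New("replica major version is smaller than master's; pass -force to override")
    }
    return nil
}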

BulkReadInstance does not read database_instance consistently.

// BulkReadInstance returns a list of all instances from the database
// - hostname:port is good enough
func BulkReadInstance() ([](*InstanceKey), error) {

does not work as expected: it pulls data straight out of database_instance without doing the required munging to set all fields correctly.

Thanks to Shlomi for pointing this out. It may not be a big deal now (and it may not actually behave incorrectly), but it needs fixing to be consistent and correct; it may cause issues later.

Basically I need to call readInstancesByCondition() appropriately.
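A sketch of what the fix might look like, assuming readInstancesByCondition takes a SQL condition plus arguments and returns fully populated instances; the stand-in types and the real function's signature may well differ:

package inst

// Illustrative stand-ins for orchestrator's types.
type InstanceKey struct {
    Hostname string
    Port     int
}

type Instance struct {
    Key InstanceKey
}

// readInstancesByCondition is assumed to return instances with all derived fields set,
// which is exactly what BulkReadInstance currently misses by reading raw rows.
func readInstancesByCondition(condition string, args []interface{}, sort string) ([]*Instance, error) {
    return nil, nil // placeholder for the real query
}

// BulkReadInstance, reworked to go through readInstancesByCondition as suggested above.
func BulkReadInstance() ([]*InstanceKey, error) {
    instances, err := readInstancesByCondition("1=1", nil, "")
    if err != nil {
        return nil, err
    }
    keys := []*InstanceKey{}
    for _, instance := range instances {
        key := instance.Key
        keys = append(keys, &key)
    }
    return keys, nil
}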

Meaning of ERROR Mismatching entries

Hello,

I have a MySQL database backend with many servers that all have the same schema, but orchestrator usually fails when I try relocating slaves. I always get this error about mismatching entries, and it isn't clear to me what it means. The data (as far as I can tell) and tables are the same on the master and all the slaves, and I don't understand what causes this. Can you provide any additional information on what might cause this "Mismatching entries"?

2016-12-06 22:12:12 ERROR Mismatching entries, aborting: table_id: ### (mydb.priorities) <-> table_id: ### (mydb.queue)
2016-12-06 22:12:12 INFO Started slave on prddba100:3306
2016-12-06 22:12:13 ERROR Unexpected: 0 events processed while iterating logs. Something went wrong; aborting. nextBinlogCoordinatesToMatch: <nil>
2016-12-06 22:12:13 DEBUG auditType:end-maintenance instance:prddba100:3306 cluster:prddba101:3306 message:maintenanceToken: 1384
2016-12-06 22:12:13 FATAL Unexpected: 0 events processed while iterating logs. Something went wrong; aborting. nextBinlogCoordinatesToMatch: <nil>

Integration between Orchestrator and ProxySQL

Percona Live 2017 - Birds of a Feather Discussion: Integration Between Orchestrator and ProxySQL

Notes on the BoF discussion tonight. (Apologies if I misquoted anyone).

Amsterdam

  • Shlomi - watched 5 presentations on ProxySQL, it was exploding
  • Orchestrator and ProxySQL can complement nicely, with Orchestrator handling the topology and failover and ProxySQL buffering connections from the application, especially in times of failover

Rene - solution that fits most of the stuff

Problem Statement

No overview of entire cluster

  • See effective or not
  • detecting active writer is a problem
    • If you fail over, possibly with a network partition, it is hard to determine the true master if the old master returns
    • Different ProxySQL instances can have different view of who is the master (depending on which side of a network partition)

Solution provided by someone:

Hooked with Consul with Orchestrator

  • User testing Consul template for ProxySQL
    • if it fully works can contribute it to ProxySQL

No knowledge of whole architecture known to individual ProxySQL (Rene)

Jessica (Github)

  • Reference implementation
    • debate - to test
    • can make changes and branch deploy
    • if trying to deploy two complex systems, any attempt to unite them will be fraught with issues
  • Orchestrator
    • not dependent on external systems
    • currently supports pushing to Graphite
    • No way of pushing data to Consul or another system

Consul in production

  • highly available?
    • told not to treat it as highly available (Shlomi)
    • but we need to treat whatever we depend on for this external source of truth (if we use something external) as HA

Chubby - at Google (Sugu)

  • If it is up 100% for 3 months, they actually take it down, so users don't get used to it being up
  • Expect application to cache what they need and react to a system being down

What is the Source of Truth?

  • What are we looking to put into the "external system"?

Orchestrator

  • HA - runs with backend database

  • Service should be high available

  • Github

    • running against Master Master
    • with HAproxy in front
    • generally Orchestrator HA
    • at time of failure
      • should we rely on it to be always HA, to replace Consul?
  • Simon Mudd - always running 2 clusters

    • Don't do failover
      • free to upgrade whatever cluster
      • then you need to ask

We don't need the coupling between these systems to be 100% HA necessarily (Lee)

  • Netflix has often posited this notion/pattern of Circuit Breakers in their infrastructure.
    • If a service is unavailable/times out, the requestor can fail out and drop down to a secondary option/alternative solution/resolution
  • In Elastic stack, the Beats have some notion of throttling or queuing, when Elasticsearch indexing is backed up, so individual beats, like Filebeat or Metricbeat, can hold back the work they need to communicate until the service becomes available again

Potential Solutions

  • dependent on read_only = 0
    • ProxySQL will direct traffic on read_only = 0
      • read_only = 0: MySQL instances will be placed in the master (writer) hostgroup
      • read_only = 1: MySQL instances will be placed in the reader hostgroup
    • Orchestrator doesn't have to notify anything
    • Issues
      • What happens if old master didn't really go away
        • now two writable servers
        • inherently too dangerous
      • Matthias
        • master-slave in master hostgroup - only one gets picked
          • slave also has read_only = 1
    • ProxySQL could be first to identify issues
      • Could be get notification to Orchestrator
        • Orchestrator could check more aggressively
      • This should be the case, since ProxySQL would be receiving connections from the application
  • Orchestrator notifies ProxySQL of failure
    • Issue
      • If 2 of 5 ProxySQL cannot be contacted, what is Orchestrator's reaction?
    • At end of the day, edge cases stop becoming edge cases
    • Only a problem if Orchestrator cannot talk to 2 ProxySQL servers, and master being partitioned and old master coming back
      • set read_only = 1 in my.cnf and ALWAYS make servers writable only dynamically (Shlomi)
  • Clustering ProxySQL together?
    • If network partition, quorum would keep configuration in sync
      • considering it (Rene)

ProxySQL recognition and administration

Are there any plans to recognize and/or administer proxysql?

Orchestrator should be able to receive a ProxySQL IP and port and then read all the servers it connects to. ProxySQL should then show up as a master.

Later options might include administering ProxySQL via the orchestrator interface.

Optionally prevent unintended database schema updates

Some time ago we had 2 configuration settings to optionally avoid database upgrades:

  • SkipOrchestratorDatabaseUpdate
  • SmartOrchestratorDatabaseUpdate

These were removed as the code to handle updates was improved and considered more reliable.
I think that a similar option would be good to reinstate, especially on large environments.

Reasoning is:

  • the current assumption is that all versions of the code in use are identical. If you deploy a large number of servers and also run orchestrator on application nodes, achieving this instantly becomes harder and harder.
  • if you have a cluster of orchestrator nodes (providing an HTTP service) then you probably want to update the database only when no nodes are trying to write to it, so possibly only with a single node active. You really don't want other servers suddenly trying to upgrade the database without prior notice and preparation.

So in a small environment it's probably OK to let this happen automatically, and I think it's fine for that to be the default behaviour. However, an option to prevent it would at least ensure that an unintended run of some orchestrator binary won't generate ALTER TABLE commands which could lock up the main orchestrator node while it tries to write to the same table.

I suggest we add a configuration setting AvoidDatabaseUpdates bool, defaulting to false, and I intend to write a patch to support such a change.
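A sketch of the proposed setting and the guard it would enable; the config field is the one proposed above, while the deployment function is illustrative:

package config

// Configuration shows only the proposed field; orchestrator's real struct has many more settings.
type Configuration struct {
    AvoidDatabaseUpdates bool // proposed: default false, i.e. keep today's automatic behaviour
}

// deploySchemaIfAllowed guards schema changes so that a stray binary of a different
// version cannot issue ALTER TABLE statements against the shared backend.
func deploySchemaIfAllowed(cfg *Configuration, deploy func() error) error {
    if cfg.AvoidDatabaseUpdates {
        return nil // operator will run the upgrade deliberately on a single node
    }
    return deploy()
}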

I have been bitten by this sort of thing a number of times now, and I would prefer to stop everything and allow the update only on a single node, after which the configuration setting would be disabled again.

add API/cli to list potential gh-ost replicas

gh-ost replicas are replicas that:

  • are leaf nodes
  • have binlog_format=ROW
  • have log-slave-updates
    • including all ancestry
  • have binlog_row_image=FULL
  • have no replication filters
    • including all ancestry

Potentially we can also list replicas with STATEMENT binlog format, since gh-ost is able to --switch-to-rbr.
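A sketch of the selection predicate implied by the list above, using an illustrative replica type rather than orchestrator's Instance:

package ghost

// replica captures just the attributes the checklist above cares about.
type replica struct {
    IsLeaf                bool
    BinlogFormat          string // "ROW" or "STATEMENT"
    LogSlaveUpdates       bool   // must hold for the whole ancestry in the real check
    BinlogRowImage        string // "FULL"
    HasReplicationFilters bool   // must be false for the whole ancestry in the real check
}

// isGhostCandidate applies the criteria listed in this issue. allowStatement reflects the
// note that STATEMENT replicas could be listed too, since gh-ost can --switch-to-rbr.
func isGhostCandidate(r replica, allowStatement bool) bool {
    formatOK := r.BinlogFormat == "ROW" || (allowStatement && r.BinlogFormat == "STATEMENT")
    return r.IsLeaf &&
        formatOK &&
        r.LogSlaveUpdates &&
        r.BinlogRowImage == "FULL" &&
        !r.HasReplicationFilters
}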

Who is the new intermediate master?

Hi,

If I have a topology like this (just an example):

              |--> rep3
rep1 --> rep2-|
              |--> rep4

Rep2 is an intermediate master. If rep2 dies Orchestrator processes a DeadIntermediateMaster failover and reorganises the topology like (just an example):

rep1 --> rep4 --> rep3

So rep4 is going to be an intermediate master now. But based on the PostFailoverProcesses placeholders I cannot determine which host is the new intermediate master.

It has the following placeholders:
{failureType}, {failureDescription}, {failedHost}, {failureCluster}, {failureClusterAlias}, {failureClusterDomain}, {failedPort}, {successorHost}, {successorPort}, {successorAlias}, {countSlaves}, {slaveHosts}, {isDowntimed}, {isSuccessful}, {lostSlaves}

I am trying to call an external script when an intermediate master dies, but the script has to know who the new intermediate master is after failover.

Is there any solution/ideas for this?

Thanks.

Relocate does not work between arbitrary points in a replication chain

This is to bring up an issue, not a big problem, but just to make it visible. Other people may bump into this and find the current behaviour does not meet expectations. ("drag and drop is so easy")

If you have a deep replication topology then you may find you have to move things around in several steps. This is because the current orchestrator logic checks up to 2 levels deep.

Ideally it would be possible to move (where appropriate conditions apply) between any levels in the tree, but it looks to me that this would require quite a bit of code refactoring. You basically would need to be aware of the "topology tree", check that the slave to move is in that tree, and check that there are no filters/barriers anywhere between the slave's current position and that of the intended new [intermediate] master.

[screenshot: orchestrator-6-deep-production]

Above you can see a real production setup. It might be desirable to move the bottom server instance-a67d under the primary master instance-390f[*]. I believe that the current logic (v2.0.2) does not allow this.

[*] In this specific case there is a filter on instance-6368, so this relocation should not be allowed; but in a similar case where no filters existed, the relocation should work.

REST /api/status always indicate `"IsActiveNode": false`

REST /api/status not concurring with /web/status

Orchestrator version 1.5.7-1

It looks like a bug. I wanted to find out which of my two Orchestrator nodes was active (in my case node A), and the URI /web/status does reflect that on both nodes. Also, the Available Nodes list doesn't seem to be populating.

However, /api/status returns isActiveNode: false on both nodes, no matter which one is active.

The REST API returns a blank current active node, and always returns false for IsActiveNode for some reason.

Example return values:
Node A

{
  "Code": "OK",
  "Message": "Application node is healthy",
  "Details": {
    "Healthy": true,
    "Hostname": "orc-01.lhrx.somecompany.com",
    "Token": "8df743defc2007211b43375d5e2d4351eef4c49d9d3c678b9617530f71e3b356",
    "IsActiveNode": false,
    "ActiveNode": "",
    "Error": null,
    "AvailableNodes": null
  }
}

Node B

{
  "Code": "OK",
  "Message": "Application node is healthy",
  "Details": {
    "Healthy": true,
    "Hostname": "orc-02.amsx.somecompany.com",
    "Token": "62ce620995be57a5a305c7d94cbfd553491c7e45e2aefbc651234773af55aa21",
    "IsActiveNode": false,
    "ActiveNode": "",
    "Error": null,
    "AvailableNodes": null
  }
}

package building into /tmp/orchestrator-release causes issues if done by more than one developer

The orchestrator build procedure build.sh creates a directory /tmp/orchestrator-release under which files are located for "package building". The problem with this approach is that if two different developers build on the same server, they will both write to the same directory. The default ownership of /tmp/XXXX is usually the user that created it, and usually other users cannot write there. Consequently the second developer will try, and fail, to build orchestrator unless the directory is removed first.

A simple way to avoid this would be to build under /tmp/<username>-orchestrator-release, which would require minimal changes to the existing build scripts, though perhaps the build directory can be located somewhere more standard.

GTID not found properly (5.7) and some graceful-master-takeover issues

Hi,

I am testing orchestrator with 5.7.17: a master and two slaves. I moved one of the slaves to change the topology to A-B-C and then executed orchestrator -c graceful-master-takeover -alias myclusteralias

The issues found are:

  1. GTID appears as disabled on the master; the web interface shows the button to enable it, even though it is obviously enabled across the whole replication chain (GTID_MODE=ON). The slaves are shown with GTID enabled.
  2. This issue causes the takeover not to use GTID (I guess).
  3. Instance B was read-only before the takeover, and after the takeover read_only is not disabled. Is this intended, or something I should handle via hooks? It would be nice to have a parameter to choose the state the process ends in, depending on the takeover reasons/conditions.
  4. Also, for some reason the role change old master -> new slave doesn't work. It executes a CHANGE MASTER, but apparently the replication username on the old master is empty, so the change master operation fails (the orchestrator user has SELECT ON mysql.slave_master_info in the cluster).
  5. Finally, it would be nice to add a feature to force-refactor the topology when you have one master and several slaves below it: moving the slaves below the newly elected master just before the master takeover. The process will take a bit longer, moving the slaves and waiting until they are ready.

Thanks for this amazing tool!
Regards,
Eduardo

Upgrade connections to TLS if needed

If you run in an environment where some servers require TLS connections, it is currently hard to let orchestrator know which servers should be connected to using TLS and which should not. There is no per-server indication.

If you have a configuration to not use TLS and try to connect to a server requiring this you'll get:

$ mysql --ssl-mode=DISABLED -u myuser -h myhost -p
Enter password: 
ERROR 3159 (HY000): Connections using insecure transport are prohibited while --require_secure_transport=ON.

Solved with:

$ mysql --ssl-mode=REQUIRED -u myuser -h myhost -p  # works

See: https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_require_secure_transport

So it looks worthwhile to have a few options of providing support when not all servers are configured the same:

  • some sort of regexp to determine which boxes might be TLS-connectable (ugly and not likely to scale very well)
  • a way to upgrade the connection if you see it requires TLS and you connected without it (see above, and the sketch at the end of this issue). That adds extra latency if done on each connection attempt, but the overhead may not be significant if, on seeing this "problem", you modify the DSN for the whole pool so that subsequent connections use TLS where needed. This may be OK.
  • an API and/or CLI interface to indicate to orchestrator that a server requires TLS. This is likely to be harder to maintain (though an API call could do bulk changes if necessary)

It is not 100% clear to me yet which is the best way forward, but I do think a mixed "TLS connections" / "non-TLS connections" environment is likely to exist in many places, so coming up with a way to resolve this would be good.
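To illustrate the second option (upgrading the DSN after seeing error 3159), here is a rough sketch using the go-sql-driver/mysql error type; the exact integration point in orchestrator's connection pooling would differ:

package dbpool

import (
    "database/sql"
    "errors"
    "strings"

    "github.com/go-sql-driver/mysql"
)

// upgradeDSNToTLS rewrites a DSN so subsequent connections use TLS.
// Real code would use mysql.ParseDSN rather than string surgery.
func upgradeDSNToTLS(dsn string) string {
    if strings.Contains(dsn, "?") {
        return dsn + "&tls=true"
    }
    return dsn + "?tls=true"
}

// openWithTLSFallback tries a plain connection first and, on MySQL error 3159
// ("Connections using insecure transport are prohibited"), retries with TLS enabled.
func openWithTLSFallback(dsn string) (*sql.DB, error) {
    db, err := sql.Open("mysql", dsn)
    if err != nil {
        return nil, err
    }
    err = db.Ping()
    var myErr *mysql.MySQLError
    if errors.As(err, &myErr) && myErr.Number == 3159 {
        db.Close()
        return sql.Open("mysql", upgradeDSNToTLS(dsn))
    }
    return db, err
}

The key point is that the upgrade decision would ideally be remembered for the whole pool, so the extra round trip only happens once per backend rather than per connection.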

Galera support

Code that checks for slaves and co-masters should also check the wsrep variables and recognize Galera nodes as co-masters (that can't be dragged or demoted).

Orchestrator reliance on backend mysql server

According to the docs:

Not shown in this picture (for clarity purposes), but the orchestrator backend database and its replicas are themselves one of those topologies polled by orchestrator. It eats its own dogfood.

Can you elaborate on what the failure modes of Orchestrator are with regard to the backend server? For instance, if the backend MySQL is unavailable, can Orchestrator still do failovers based on some cached state? Can it fail over its own backend (hence dogfood)?

I'm a Vitess user looking to use Orchestrator for automatic failover. I already have automation in place for creating MySQL Kubernetes pods for Vitess, so I'm looking to see whether I can re-use some of that in creating the orchestrator MySQL backend servers.

Collect discovery metrics in a generic fashion

I have been running orchestrator for some time with a custom patch that generates discovery metrics: information, for each poll, on the time it took to get the status of the MySQL server being checked. The information collected showed how long was spent doing "database calls" on the server being discovered/polled and on the orchestrator backend database.

I noticed that when orchestrator was polling a large number of MySQL servers that the metrics could vary significantly depending on the location of the orchestrator server compared to the orchestrator backend database. This information has been used to identify and fix several issues and also to provide a bulk import mechanism all of which has been incorporated into orchestrator via pull requests.

However, the metric collection patches have not been provided as pull requests as they were rather ugly and I had not come up with a mechanism which seemed generic enough to be used by anyone.

This issue is to discuss my ideas on solving this properly.

This comprises two parts:

  • collect the metrics for each discovery
  • making them visible to an external user for adding to their own monitoring system

The first part has been done against outbrain/orchestrator code so needs to be adapted against github/orchestrator code. That should be relatively straightforward.

For the second part I'd like to generate two API endpoints:

  • a JSON array containing raw data for each discovery collected over the last period P (say 60-120 seconds, configurable). This would contain timestamp / hostname:port / metric values. External users would then generate "aggregate metrics" for each monitored time period; they can derive any metrics they want from these raw values.
  • a JSON structure containing a pre-defined set of aggregated data based on the previous values, which could be used directly. This simplifies collection for most users, as the aggregations can be used without needing to do any calculations.

Example aggregated values I use are:

  • success/ failure counts
  • median/95percentile latency total discovery time
  • median/95percentile latency talking to discovered host
  • median/95percentile latency talking to orchestrator backend

[screenshot: discovery latency graphs, 2016-12-27 10:46:35]

The example above shows two different orchestrator systems monitoring some servers. As you can see, a small spike shows a sudden unexpected change in metric timings (probably not important here). The metric times differ because the monitoring orchestrator servers are located in different datacentres.

I think that providing the information in the way described would make it easy for any user to collect the values and incorporate them into their own monitoring or graphing systems.

More specific details can be discussed, but this issue is to discuss the change, which I propose to provide as a pull request in the near future.
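A sketch of the shape a raw per-discovery record could take for the first endpoint; the names are illustrative, not a committed API:

package discovery

import "time"

// Metric is one raw record for the proposed "raw data over the last period P" endpoint.
// Aggregations (success/failure counts, median / 95th percentile latencies) would be
// computed over a slice of these.
type Metric struct {
    Timestamp       time.Time     `json:"timestamp"`
    InstanceKey     string        `json:"instance"`         // hostname:port that was polled
    TotalLatency    time.Duration `json:"total_latency"`    // whole discovery
    InstanceLatency time.Duration `json:"instance_latency"` // time talking to the discovered host
    BackendLatency  time.Duration `json:"backend_latency"`  // time talking to the orchestrator backend
    Success         bool          `json:"success"`
}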

Better handling of buffered writes needed

This is applicable when: BufferInstanceWrites == true.

I recently added some counters to monitor the number of times InstancePollSeconds gets exceeded during discovery. The number seen should normally be quite low, but I've seen that on a busy orchestrator server, especially one talking to an orchestrator backend in a different datacentre, the number of times this happens can jump significantly.

Consequently better management and monitoring of this is needed.

Thoughts involve:

  • ensuring that the configuration parameters used are dynamically configurable via SIGHUP calls and thus do not require orchestrator to be restarted. This affects the 2 variables: InstanceFlushIntervalMilliseconds and InstanceWriteBufferSize.
  • adding extra monitoring of the time taken for flushInstanceWriteBuffer to run. A single metric every minute is useless so I need to collect metrics and then be able to provide aggregate data and percentile timings in a similar way to how the discovery timings are handled.
  • parallelising this function to run against the backend orchestrator server a number of times. (completely serialising this even though the writes are batched is not fully efficient but we should ensure that writes for the same instance are never done through different connections at the same time)

With these changes it should be easier to see where the bottleneck is and to be able to adjust the configuration "dynamically" to ensure the required performance is achieved.
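A rough sketch of the buffer-and-flush idea, where bufferSize and flushInterval would come from InstanceWriteBufferSize and InstanceFlushIntervalMilliseconds; the types and wiring are illustrative:

package writer

import "time"

type instanceUpdate struct {
    Key string // hostname:port; writes for the same key must not run concurrently
}

// flushLoop drains a channel of updates either when the buffer fills or when the flush
// interval elapses, then hands the batch to writeBatch. Parallelising across batches is
// possible as long as two in-flight batches never carry updates for the same instance.
func flushLoop(updates <-chan instanceUpdate, bufferSize int, flushInterval time.Duration, writeBatch func([]instanceUpdate)) {
    buffer := make([]instanceUpdate, 0, bufferSize)
    flush := func() {
        if len(buffer) > 0 {
            writeBatch(buffer)
            buffer = make([]instanceUpdate, 0, bufferSize)
        }
    }
    ticker := time.NewTicker(flushInterval)
    defer ticker.Stop()
    for {
        select {
        case u, ok := <-updates:
            if !ok {
                flush()
                return
            }
            buffer = append(buffer, u)
            if len(buffer) >= bufferSize {
                flush()
            }
        case <-ticker.C:
            flush()
        }
    }
}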

Orchestrator PromotionRule settings appear to not be used (or logic of usage is unclear)

I recently had an intermediate master failure and orchestrator promoted a slave which had a promotion rule of must_not. This was not expected.

  • instance.PromotionRule settings seem to be correctly set and used, and I was using the API calls to ensure that all servers had the PromotionRule I required. This works fine. I double-checked the promoted server's rules and they showed must_not.
  • in my case the analysis showed an issue type of DeadIntermediateMasterWithSingleSlaveFailingToConnect
  • I see that RecoverDeadIntermediateMaster() calls GetCandidateSiblingOfIntermediateMaster() which as far as I can see makes no reference to PromotionRule settings.
  • There has been talk that under some circumstances Orchestrator may do the promotion in a couple of steps, but it would be good if the intention were documented, and if/where this is actually implemented were described, so people know what to expect.
  • as far as I can see the PromotionRule is only used on semi-sync slaves in certain circumstances.

So if PromotionRule is currently used for determining the best alternative intermediate master to promote, can you point me in the right direction? I guess the same goes for primary master election too (though the issue I saw was on a failed intermediate master).

If this functionality is missing and you want help implementing it, please let me know. I really want to use this functionality, and given the CandidatePromotionRule type and description, and the fact that there's an Instance.PromotionRule column, I assumed that this was working now.

Related is the fact that following the exact details of an intermediate master failure (it being noticed as part of the discovery process, the slaves also noticing it, the relocation, and the final termination of the recovery process) is really hard. If you have few servers there is less logging, but as the number of servers increases, finding the relevant lines in the log file is quite tricky. I wonder whether some sort of reference could be generated and passed down through all related logging, so that I could grep for that reference and see the full recovery history. If a failure happens, you could then easily provide a detailed audit of everything related to the recovery process. While there is auditing information, it does not currently seem detailed enough to provide a full trail. It would be good to have something which can explain in detail when the problem happened, what was done to resolve it once it was detected, and whether the process completed successfully or not. Something like this would be useful, and more so for people who are not familiar with all of orchestrator's code details.

Smaller Docker Image

Hi all,

I've just created a PR to update the Dockerfile to produce a smaller base image.
Sorry for not creating an issue beforehand, I've just read that's how it should be done.
Anyway, hope it's useful to you.

Thanks for the amazing tool that Orchestrator is!

Do not run orchestrator as root

The provided init scripts make orchestrator start and run as the root user. That's really not necessary, so it would be good to provide sample init scripts that start orchestrator as a (dedicated?) non-root user.

I would suggest

  • detecting if running as root and issuing a warning that it's better not to
  • making the init scripts try to start orchestrator as the orchestrator user or some specific non-root user
  • add comments that any routines called by orchestrator may need to run via sudo to gain enhanced privileges if moving from the current setup.
  • binding to port 80 (or other low ports) may be troublesome, perhaps requiring orchestrator to drop privileges (I'm not sure whether that's an issue in Go) or the user to be configured with such privileges. That is likely to depend on the OS being used, but as not everyone will be binding to low ports this may not be such an issue.
  • ensure that log writing can write to the appropriate file(s).
  • this is really not a new issue; it has been resolved by many other applications, and it seems to make sense for orchestrator too.

Provide a way to add/see properties for a server

Orchestrator is aware of the replication environment but not of the larger environment a server may be running in. This information is thus not exposed to DBAs or sysadmins who may be using orchestrator.

It may be desirable to add certain properties to a server, ignored by orchestrator but stored by it, so they can be seen by the user, especially in the HTTP interface.

Ideas for such properties or labels may be:

  • it's "role" or "type" used in other parts of configuration management
  • using a specific type of storage, e.g. filer storage
  • testing server
  • using specific types of hardware
  • other site specific property values

If we allow orchestrator to store this information it must be able to also remove and display it.

  • given that HTTP space is somewhat limited, perhaps add another "dot" with a mouse-over which provides a list of properties; that's similar to other usage in the GUI
  • provide a cli interface to extract properties for an instance
  • provide a http api interface to extract all properties from all instances

These properties should be persistent across an orchestrator restart.

The aim of adding these properties is so that it's easier to see them in the GUI but also external tooling around the MySQL servers or orchestrator may find it helpful to query orchestrator to get these values.
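A minimal sketch of a per-instance key/value property store with the HTTP read-out suggested above; the names and endpoint are hypothetical, and a real implementation would persist to the backend database so properties survive restarts:

package attributes

import (
    "encoding/json"
    "net/http"
    "sync"
)

// Store keeps free-form properties per instance ("hostname:port" -> name -> value).
type Store struct {
    mu    sync.RWMutex
    props map[string]map[string]string
}

func NewStore() *Store {
    return &Store{props: map[string]map[string]string{}}
}

// Set adds or overwrites one property for one instance.
func (s *Store) Set(instance, name, value string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.props[instance] == nil {
        s.props[instance] = map[string]string{}
    }
    s.props[instance][name] = value
}

// ServeHTTP dumps all properties for all instances, covering the "HTTP API to extract
// all properties from all instances" bullet above.
func (s *Store) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(s.props)
}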

Aggregate overview mouseover could also show delays

A URL such as: https://orchestrator.mydomain.com/web/cluster/somecluster:3306?compact=true provides a good summary. If I'm in compact mode I don't see all the hosts but a bubble with the number of hosts replicating underneath a master or intermediate master.

If you mouse over that bubble then you get a list of hostnames, and mysql versions.

I'd also like to see the replication delay as seen by orchestrator, as that would give me a quick overview of all the delays and save me opening up a compact list.

This is convenient if you have several slaves replicating from a master.

ClusterAlias Set by Web Interface Won't Persist

The cluster alias set via the web interface does not persist: after a while it reverts to the cluster name.

[screenshot]

The above screenshot was captured on a read-only instance where the aliases were set in the config file, but the problem I'm describing is on a read-write instance.

Suggestions for improvements in auditing of the recovery process

Work has been started on auditing the recovery process for later analysis. Studying the logs on a busy orchestrator server can be really hard, and it's also good to keep this information so you can later explain what happened and why.

After looking at v2.1.1 I triggered a test failure of an intermediate master and noticed the following. The original logging shows:

2017-04-10 13:54:30     searching for the best candidate sibling of dead intermediate master dead-intermediate-master:3306
2017-04-10 13:54:30     found replacement-intermediate-master:3306 as a replacement for dead-intermediate-master:3306 [any sibling]
2017-04-10 13:54:30     - RecoverDeadIntermediateMaster: will next attempt regrouping of replicas
2017-04-10 13:54:30     - RecoverDeadIntermediateMaster: will next attempt relocating to another DC server
2017-04-10 13:54:30     - RecoverDeadIntermediateMaster: will attempt a candidate intermediate master: replacement-intermediate-master:3306
  • It would be good to record the error message encountered when trying to connect to the intermediate master. Several problems may occur, so knowing which one triggered the failure would help.
    • In my case the error I saw was: 2017-04-09 13:03:52 WARNING Discovery failed for host: dead-intermediate-master:3306 in 1.269s (Backend: 0.267s, Instance: 1.002s), error: ReadTopologyInstanceBufferable failed: dial tcp 1.2.3.4:3306: i/o timeout
    • So perhaps something like dead-intermediate-master had error: dial tcp 1.2.3.4:3306: i/o timeout would be useful?
  • Prior to "searching for the best candidate sibling of dead intermediate master" it would be good to list the slaves that are affected by the intermediate master failure.
  • There is a step "will next attempt regrouping of replicas". Unless I'm mistaken, this rearranges the affected slaves, promoting one of them and putting the others underneath it. If that's the case it would be helpful to indicate which slave was promoted and at least how many were successfully moved under that "temporary master".
  • Again, if there are errors in this part of the process, including the error message in the output would help.

The goal in my case here is to have a checklist of the steps taken, to be able to see if everything went ok, to know why certain actions were taken, and to see a high level overview of failures with enough information to understand their cause.
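
As an illustration of the kind of checklist I mean (this is a hypothetical record shape, not orchestrator's actual audit schema; the replica names are made up), a per-recovery audit entry might capture:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// recoveryAudit is a hypothetical per-recovery record.
type recoveryAudit struct {
	FailedInstance   string    `json:"failed_instance"`
	DetectedAt       time.Time `json:"detected_at"`
	AffectedReplicas []string  `json:"affected_replicas"`
	PromotedReplica  string    `json:"promoted_replica"`
	Steps            []string  `json:"steps"`
	Errors           []string  `json:"errors"`
	Successful       bool      `json:"successful"`
}

func main() {
	audit := recoveryAudit{
		FailedInstance:   "dead-intermediate-master:3306",
		DetectedAt:       time.Now(),
		AffectedReplicas: []string{"replica-1:3306", "replica-2:3306"},
		PromotedReplica:  "replacement-intermediate-master:3306",
		Steps:            []string{"regroup replicas", "relocate under candidate intermediate master"},
		Errors:           []string{"dial tcp 1.2.3.4:3306: i/o timeout"},
		Successful:       true,
	}
	out, _ := json.MarshalIndent(audit, "", "  ")
	fmt.Println(string(out))
}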

I am very happy to see the work that has been done so far. This helps a lot.

Orchestrator/orchestrator-agent don't properly handle subscript errors.

The orchestrator-agent repo doesn't have Issues turned on, so I'm putting this here. If there's another place to go with this, let me know.

When doing a seed through Orchestrator, a sub-script invoked by the agent running on the seed node (such as SendSeedDataCommand) can fail and return an error. That error ends up in the orchestrator-agent logs:

2017-04-05 15:16:20 ERROR exit status 1

However, Orchestrator never detects this and continues to try to start up the seed target:

Seed states
State start time	State action	Error message
2017-04-05 15:02:46	Starting MySQL on target: slave-2	Get http://slave-2:3002/api/mysql-start?token=84defa43fd46885f91cf4ede88dd79ea7cfcbfc2debf2e3bfe98205f9e3ac6e1: net/http: timeout awaiting response headers
2017-04-05 15:02:46	Unmounting logical volume: /dev/vg1/mysql-orchestrator-snapshot-1491339949	
2017-04-05 15:02:46	Executing post-copy command on slave-2	
2017-04-05 15:02:46	Copied 292.35 kB / 510.69 MB (0%)	
2017-04-05 15:02:45	slave-1 will now send data to slave-2 in background	
2017-04-05 15:02:43	Waiting some time for slave-2 to start listening for incoming data	
2017-04-05 15:02:43	slave-2 will now receive data in background	
2017-04-05 15:02:43	Aquiring target host datadir free space on slave-2	
2017-04-05 15:02:43	Erasing MySQL data on slave-2	
2017-04-05 15:02:43	MySQL data volume on source host slave-1 is 535498696 bytes	
2017-04-05 15:02:43	Mounting logical volume: /dev/vg1/mysql-orchestrator-snapshot-1491339949	
2017-04-05 15:02:43	Checking mount point on source slave-1	
2017-04-05 15:02:43	Looking up available snapshots on source slave-1	
2017-04-05 15:02:43	Checking MySQL status on target slave-2	
2017-04-05 15:02:43	getting source agent info for slave-1	
2017-04-05 15:02:43	getting target agent info for slave-2

Orchestrator does see this as a failure, but it marks it as such because the MySQL startup on the seed node fails, not because the transfer failed.

Ideally, the erase step would be skipped until the transfer succeeds, but that's difficult to work around without double the required space on the target node. At the very least, though, Orchestrator should properly report that the transfer failed instead of continuing on to start the server up.
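
As an illustration only (this is not the orchestrator-agent code; the handler path and helper names are assumptions), an agent endpoint that runs a sub-command could report the sub-command's failure back to the caller via the HTTP response instead of only writing it to the local log:

package main

import (
	"fmt"
	"net/http"
	"os/exec"
)

// runSeedCommand runs a shell command and returns its combined output and error.
func runSeedCommand(command string) (string, error) {
	out, err := exec.Command("bash", "-c", command).CombinedOutput()
	if err != nil {
		return string(out), fmt.Errorf("seed command failed: %v: %s", err, out)
	}
	return string(out), nil
}

// sendSeedDataHandler propagates failure to the caller (orchestrator) via the
// HTTP status code rather than only logging it locally.
func sendSeedDataHandler(w http.ResponseWriter, r *http.Request) {
	if _, err := runSeedCommand("exit 1" /* placeholder command */); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "OK")
}

func main() {
	http.HandleFunc("/api/send-seed-data", sendSeedDataHandler)
	http.ListenAndServe(":3002", nil)
}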

Optional input validation of orchestrator.conf.json

I noticed some configuration settings in my /etc/orchestrator.conf.json were no longer valid. Orchestrator reads in the file and does not indicate there's an issue of any sort.

While I'm happy with lazy evaluation, and even with potentially invalid settings sitting in the config file, I think it would be most helpful to be able to identify such invalid settings. Two thoughts come to mind:

(A) command line "validate my config file" option

$ orchestrator -c check_validate_config_file

which would indicate if the config contents are good or not. Invalid JSON settings would generate an error.

(B): Optional setting to fail if the current config is invalid

Something like: "ValidateConfigSetting": true,

This is the setting I would prefer but of course setting it now would do nothing until the option is actually implemented, so I probably need to use a combination of both (A) and (B).

Anyway, this issue is simply to record this potential problem. There are a large number of options that can be used in the config file, so a typo is quite easy to make, and rapidly changing code may leave a setting that was expected to do something silently ignored rather than generating an error.
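
As a minimal sketch of what option (A) could look like (assuming a deliberately simplified config struct; orchestrator's real configuration has many more fields, and this needs Go 1.10+ for DisallowUnknownFields), strict JSON decoding can reject both malformed JSON and unknown keys:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// config stands in for orchestrator's configuration struct; only two fields are shown.
type config struct {
	ListenAddress string `json:"ListenAddress"`
	Debug         bool   `json:"Debug"`
}

// validateConfigFile fails on malformed JSON and on any key that does not map
// to a known configuration field.
func validateConfigFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	dec := json.NewDecoder(f)
	dec.DisallowUnknownFields() // unknown or misspelled settings become errors

	var c config
	if err := dec.Decode(&c); err != nil {
		return fmt.Errorf("%s: %v", path, err)
	}
	return nil
}

func main() {
	if err := validateConfigFile("/etc/orchestrator.conf.json"); err != nil {
		fmt.Fprintln(os.Stderr, "config validation failed:", err)
		os.Exit(1)
	}
	fmt.Println("config OK")
}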
