openark / orchestrator
MySQL replication topology management and HA
License: Apache License 2.0
For PRs you ask us to check that go test works. For me the current code does not pass the tests.
That is, I see:
[user@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ go test ./go/...
? github.com/github/orchestrator/go/agent [no test files]
ok github.com/github/orchestrator/go/app 0.017s
? github.com/github/orchestrator/go/attributes [no test files]
? github.com/github/orchestrator/go/cmd/orchestrator [no test files]
ok github.com/github/orchestrator/go/collection 0.017s
ok github.com/github/orchestrator/go/config 0.020s
? github.com/github/orchestrator/go/db [no test files]
? github.com/github/orchestrator/go/discovery [no test files]
ok github.com/github/orchestrator/go/http 0.029s
ok github.com/github/orchestrator/go/inst 0.023s
? github.com/github/orchestrator/go/logic [no test files]
? github.com/github/orchestrator/go/metrics [no test files]
? github.com/github/orchestrator/go/metrics/query [no test files]
? github.com/github/orchestrator/go/os [no test files]
? github.com/github/orchestrator/go/process [no test files]
? github.com/github/orchestrator/go/remote [no test files]
2017-04-30 23:29:54 INFO MutualTLS requested, client certificates will be verified
--- FAIL: TestNewTLSConfig (0.00s)
ssl_test.go:40: Could not create new TLS config: No certificates parsed
--- FAIL: TestVerify (0.00s)
ssl_test.go:94: x509: RSA key missing NULL parameters
2017-04-30 23:29:54 INFO Decrypted /var/folders/v5/6pdpvnbn7pg9h41pnbsr1klscmtspv/T/ssl_test488963850 successfully
--- FAIL: TestAppendKeyPair (0.01s)
ssl_test.go:154: Failed to append certificate and key to tls config: x509: RSA key missing NULL parameters
2017-04-30 23:29:54 INFO Decrypted /var/folders/v5/6pdpvnbn7pg9h41pnbsr1klscmtspv/T/ssl_test572616158 successfully
--- FAIL: TestAppendKeyPairWithPassword (0.01s)
ssl_test.go:169: Failed to append certificate and key to tls config: x509: RSA key missing NULL parameters
FAIL
FAIL github.com/github/orchestrator/go/ssl 0.061s
[user@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$
This is true for me as of commit
commit c61104ea3b9a853affe6c4a6d4f7d37f420ae27d
Merge: 5a9547b1 72a59a23
Author: Shlomi Noach <[email protected]>
Date: Thu Apr 20 17:51:33 2017 +0300
Shlomi, does this build and work for you? If so, there's some difference in our environments which is triggering this, and I would like to fix mine so the build works cleanly and I can genuinely check that go test ./go/...
actually passes. If it makes any difference this is being run on OSX, but I also get an issue on Linux.
Note: the build checks via travis do seem to pass, so there's some difference between what you ask us to check when someone creates a PR and what actually gets checked by travis.
The current /web/status url shows the list of members of the orchestrator cluster and which node is active. We also show the version of orchestrator that is running.
A couple of things that are not shown for each orchestrator node:
This gives you some idea of the stability of the cluster and also how long it has been running. Without that it's pretty hard to tell.
Why is this useful? Basically it allows you to see if your network is stable, if the servers have been running for the amount of time you might expect and you can do all of this at the click of a button and without dropping into the OS or using other monitoring tools to see it.
hi,
I just downloaded the new release tarball: orchestrator-2.0.2-linux-amd64.tar.gz
My questions are:
When I run the command ./orchestrator http
from the /usr/local/orchestrator directory, the running log throws ERROR Error 1054: Unknown column 'first_seen_active' in 'field list'
When I run the command ./orchestrator --discovery=false http
and access the web site, the running log throws ERROR Error 1054: Unknown column 'candidate_database_instance.promotion_rule' in 'field list'
I looked at my orchestrator database; it has 36 tables. When I tried `select promotion_rule from candidate_database_instance;`, MySQL returned the `ERROR 1054 (42S22): Unknown column 'promotion_rule' in 'field list'` message.
So I think this version's SQL schema is old, or was not updated?
Storyline: #83
(at least configurable?)
On master failover, do not promote a replica if it hasn't consumed its relay logs. This can happen when all replicas are lagging greatly at the time of master failure.
In such a case a reset slave all on the promoted replica causes data loss.
Typically replicas are not lagging so much, and by the time the failure detection takes place, they will have consumed their relay logs.
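To illustrate the check being proposed, here is a minimal sketch (invented type and function names, loosely modelled on the Instance fields ReadBinlogCoordinates and ExecBinlogCoordinates): a replica is only a safe promotion candidate once its exec coordinates have caught up with its read coordinates.

package main

import "fmt"

// BinlogCoordinates is a minimal stand-in for orchestrator's coordinate type.
type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// hasFullyConsumedRelayLogs returns true when the SQL thread has executed
// everything the IO thread has fetched, i.e. the exec position has caught up
// with the read position in the (dead) master's binlogs.
func hasFullyConsumedRelayLogs(readCoords, execCoords BinlogCoordinates) bool {
	return readCoords.LogFile == execCoords.LogFile && readCoords.LogPos == execCoords.LogPos
}

func main() {
	read := BinlogCoordinates{LogFile: "mysql-bin.000042", LogPos: 120000}
	exec := BinlogCoordinates{LogFile: "mysql-bin.000042", LogPos: 118500}
	// This replica still has unapplied relay log events; under the rule
	// proposed above it would not yet be a safe promotion candidate.
	fmt.Println("safe to promote:", hasFullyConsumedRelayLogs(read, exec))
}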
They're well advertised somewhere, but not here: https://github.com/github/orchestrator/blob/master/docs/pseudo-gtid.md
In our workflow we use cluster aliases quite a bit, so I figured that orchestrator -c clusters
would show the aliases. Is there a way to do this?
Hello,
I'm comparing MHA to Orchestrator and I was wondering if in non-GTID replicated environments in cases where the master fails, does Orchestrator make any effort to save, parse and transfer binlog entries from the failed MySQL master to the new MySQL master to coalesce any data that was not captured on the new master through normal MySQL replication?
If not, is this something that would ever be attempted, or is it on the roadmap for the future?
My current replication topology is:
co-master A (writes)--->binlog01 --> slave A
|
|
co-master B (read-only) ---> binlog02 --> slave B
I have some maintenance I need to do on co-master A and want to point my binlog01 server to co-master B. When I try through the UI it doesn't seem to work; doing it through the CLI gives me the following exception:
[root@inv1 orchestrator]# ./orchestrator -c move-below -i binlog01.domain:3306 -d masterB.domain:3306 --noop
2017-01-13 14:42:51 ERROR Will not resolve empty hostname
panic: runtime error: slice bounds out of range
goroutine 1 [running]:
github.com/outbrain/orchestrator/go/inst.(*Instance).MajorVersion(0xc2080c8000, 0x0, 0x0, 0x0)
/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance.go:282 +0xa2
github.com/outbrain/orchestrator/go/inst.(*Instance).IsSmallerMajorVersion(0xc2080c8000, 0xc2080c81c0, 0x82b9a0)
/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance.go:308 +0x28
github.com/outbrain/orchestrator/go/inst.(*Instance).CanReplicateFrom(0xc2080c8000, 0xc2080c81c0, 0x1, 0x0, 0x0)
/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance.go:437 +0x464
github.com/outbrain/orchestrator/go/inst.MoveBelow(0xc20801f8c0, 0xc20801eb80, 0x0, 0x0, 0x0)
/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/inst/instance_topology.go:407 +0x47c
github.com/outbrain/orchestrator/go/app.Cli(0x7ffca576a686, 0xa, 0x0, 0x7ffca576a694, 0x18, 0x7ffca576a6b0, 0x10, 0xc20802af20, 0x4, 0x0, ...)
/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/app/cli.go:175 +0x1e02
main.main()
/home/snoach/dev/go/src/github.com/outbrain/orchestrator/go/cmd/orchestrator/main.go:646 +0xae4
goroutine 9 [select]:
github.com/pmylund/go-cache.(*janitor).Run(0xc20802b000, 0xc20803def0)
/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:946 +0x13f
created by github.com/pmylund/go-cache.runJanitor
/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:964 +0x8c
goroutine 7 [select]:
github.com/pmylund/go-cache.(*janitor).Run(0xc20802af60, 0xc20803dd70)
/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:946 +0x13f
created by github.com/pmylund/go-cache.runJanitor
/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:964 +0x8c
goroutine 8 [select]:
github.com/pmylund/go-cache.(*janitor).Run(0xc20802af70, 0xc20803ddd0)
/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:946 +0x13f
created by github.com/pmylund/go-cache.runJanitor
/usr/share/golang/src/github.com/pmylund/go-cache/cache.go:964 +0x8c
goroutine 10 [syscall]:
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
/usr/local/go/src/os/signal/signal_unix.go:27 +0x35
goroutine 11 [chan receive]:
database/sql.(*DB).connectionOpener(0xc208045b80)
/usr/local/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/local/go/src/database/sql/sql.go:452 +0x31c
goroutine 17 [syscall, locked to thread]:
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2232 +0x1
goroutine 18 [chan receive]:
database/sql.(*DB).connectionOpener(0xc20811c0a0)
/usr/local/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/local/go/src/database/sql/sql.go:452 +0x31c
goroutine 20 [chan receive]:
database/sql.(*DB).connectionOpener(0xc20811c640)
/usr/local/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/local/go/src/database/sql/sql.go:452 +0x31c
is there something I'm missing here?
We are seeing instances where slaves with a couple of hours of lag are elected as masters. Is there any configuration to prevent that from happening?
Hey!
I've just enabled Oracle GTID on my Percona 5.7.10 farm (2 masters + slaves). I tried to relocate using GTID mode and got the following error: "Cannot move via GTID as not both instances use GTID".
I first checked the orchestrator database and I see this:
+--------------------------------------+---------------------+----------------------+-------------+-------------+-------------+
| hostname | last_checked | supports_oracle_gtid | oracle_gtid | server_uuid | gtid_purged |
+--------------------------------------+---------------------+----------------------+-------------+-------------+-------------+
| xxxx | 2017-03-09 17:48:12 | 0 | 1 | | |
| yyyy | 2017-03-09 17:48:12 | 0 | 1 | | |
| zzzz | 2017-03-09 17:48:12 | 0 | 1 | | |
....
The problem is that supports_oracle_gtid is not set while having gtid_mode = ON.
I checked the code; it seems that you're restricting the query that fetches this status to isOracleMySQL instances.
https://github.com/github/orchestrator/blob/5c34db981e5b91f03fb31135b1f0bb3bfc3c4ed0/go/inst/instance_dao.go#L303
isPercona() should also fetch this info, as the GTID implementation is the same.
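As a sketch of the kind of change I mean (a hypothetical helper, not the actual instance_dao.go code):

package main

import "fmt"

// Sketch of the proposed condition: include Percona Server when deciding
// whether to read gtid_mode, server_uuid and gtid_purged, since its GTID
// implementation is the same as Oracle MySQL's.
func shouldQueryOracleGTIDVariables(isOracleMySQL, isPercona bool) bool {
	return isOracleMySQL || isPercona
}

func main() {
	// A Percona instance would now have its GTID variables queried.
	fmt.Println(shouldQueryOracleGTIDVariables(false, true))
}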
Thanks !
Orchestrator recovers from master or intermediate master failures pretty well and that is great. However, sometimes if you want to look at the details they are hard to find.
I would like to see the following information being shown in these 3 places:
This would make the reporting of problems and later investigation easier as it wouldn't be necessary to scan through a large number of logs. The information is available but is not necessarily logged or kept for later use. Even when orchestrator works fine it's convenient to see all these details to confirm the correct behaviour, but also should there be an issue it makes it easier to diagnose the cause, whether that's due to orchestrator failing or the systems responding in some unexpected way.
A recent failed recovery shows the recovery steps as:
...
2017-05-07 19:48:37 - RecoverDeadIntermediateMaster: move to candidate intermediate master (hostname-removed:3306) did not complete: <nil>
...
Looking at the code, the condition for giving this message is if err != nil || len(errs) > 0, so it seems we are not showing the information from errs, which is where the actual error is.
Consequently the logging should show errs in addition to err.
I'll provide a PR for the one-line change to add this missing information.
Seen in a version I'm running based on 2.1.1
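To illustrate the point, a self-contained sketch of what the change amounts to (the message text here is simplified; only printing errs alongside err matters):

package main

import (
	"errors"
	"fmt"
)

func main() {
	// err can be nil while errs still carries the per-replica failures,
	// which is why the current message ends in "<nil>".
	var err error
	errs := []error{errors.New("replica host-x:3306: relocation failed")}

	if err != nil || len(errs) > 0 {
		// Proposed: include errs so the real cause is visible in the audit.
		fmt.Printf("RecoverDeadIntermediateMaster: move to candidate intermediate master did not complete: %+v (errs: %+v)\n", err, errs)
	}
}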
Despite #116 I still see build.sh -b failures on OSX.
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ git diff github/master
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ git show --summary
commit cdba46216e5cc7fe187057caf414977b53621605
Merge: cbc5e2b ba45adf
Author: Shlomi Noach <[email protected]>
Date: Mon Apr 3 14:32:31 2017 +0300
Merge pull request #123 from samveen/fix_116
Fix issue #116: `precheck()` honours `-b`
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ ./build.sh -b
Build only; no packaging
Building via go version go1.8 darwin/amd64
go install runtime/internal/sys: mkdir /usr/local/go/pkg/linux_amd64: permission denied
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ uname
Darwin
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$
I am not running as root, yet build.sh is trying to install into /usr/local/go, which is wrong, and the referenced /usr/local/go/pkg/linux_amd64 is also wrong for OSX. Building a patched v2.0.2 version works fine.
Not a big deal right now as I'm working off my own forked branch but I do need to get up to date and expect to do that shortly.
Hi all,
We've been investigating using RDS to replace our MySQL servers, and Orchestrator could end up being great for shuffling topologies around as we set things up.
I've connected Orchestrator to an RDS instance; it got confused regarding hostnames (as the @@HOSTNAME variable is something internal to RDS's infrastructure), and I know that RDS MySQL's topology adjustment commands are different.
Has any work been done into getting Orchestrator to support RDS, and is there any design consideration within Orchestrator that would outright preclude compatibility if work was put into working around the few differences between RDS MySQL and standard MySQL?
Thanks!
While comparing orchestrator promotions with an internal alternative, I noticed that the order of operations in graceful-master-takeover allows for data loss if any transactions commit after read_only is set. I thought that maybe I could have PreFailoverProcesses STONITH the old master, but that process happens after the final coordinates are determined for the slave, so there's no way to prevent this race condition.
edit: oops, my test was faulty. I was inadvertently using a SUPER user which was throwing off the results. This can be closed/deleted.
I see the following using the master branch of github/orchestrator against the latest commit (Date: Thu Mar 23 21:06:23 2017 +0200), RELEASE_VERSION 2.1.0
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ export GOPATH=~/src/orchestrator
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$ ./build.sh -b
Build only; no packaging
Please install fpm and ensure it is in PATH (typically: 'gem install fpm')
rpmbuild not in PATH, rpm will not be built (OS/X: 'brew install rpm')
[myuser@myhost ~/src/orchestrator/src/github.com/github/orchestrator]$
So it looks like some of the option handling has broken. I've not had time to look at this further but will shortly need to update to 2.1, so it would be good to resolve these issues.
I see there are other issues reported in #101 so guess that perhaps some of the breakage is related.
Crash in orchestrator based on a version very close to v2.1.
Crash report says:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7f10d5]
goroutine 1925650 [running]:
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1.1(0xc421785aa0, 0xc42000eb80)
/builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:610 +0x25
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1(0xc421785aa0, 0xc42000eb80, 0xc420378580, 0xc422643730, 0xc422707600, 0xc422707620, 0xc422707640)
/builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0xe0
created by github.com/github/orchestrator/go/inst.moveReplicasViaGTID
/builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0x327
The version of go/inst/instance_topology.go is the same as in v2.0.2 (and that's almost identical except for an unrelated change to v2.1)
I'll try to look at this later but am a bit busy with other things. Fuller anonymised logging shows:
2017-03-31 15:08:15 DEBUG Not elected as active node; active node: orchestrator2.dc; polling
[martini] Started GET /api/relocate-replicas/host5.dc/3306/host6.dc/3306 for 127.0.0.1:37735
2017-03-31 15:08:15 INFO Will move 7 replicas below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host2.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host9.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host10.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host3.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host8.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host4.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 INFO Will move host7.dc:3306 below host6.dc:3306 via GTID
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host2.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226526, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host3.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226528, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host10.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226529, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host8.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226530, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host9.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226527, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host7.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226531, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 DEBUG auditType:begin-maintenance instance:host4.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226532, owner: orchestrator1.dc, reason: move below host6.dc:3306
2017-03-31 15:08:15 ERROR ReadTopologyInstance(host2.dc:3306) show variables like 'maxscale%': Error 3159: Connections using insecure transport are prohibited while --require_secure_transport=ON.
2017-03-31 15:08:15 ERROR ReadTopologyInstance(host2.dc:3306) show variables like 'maxscale%': Error 3159: Connections using insecure transport are prohibited while --require_secure_transport=ON.
2017-03-31 15:08:15 DEBUG auditType:end-maintenance instance:host2.dc:3306 cluster:host1.dc:3306 message:maintenanceToken: 226526
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7f10d5]
goroutine 1925650 [running]:
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1.1(0xc421785aa0, 0xc42000eb80)
/builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:610 +0x25
github.com/github/orchestrator/go/inst.moveReplicasViaGTID.func1(0xc421785aa0, 0xc42000eb80, 0xc420378580, 0xc422643730, 0xc422707600, 0xc422707620, 0xc422707640)
/builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0xe0
created by github.com/github/orchestrator/go/inst.moveReplicasViaGTID
/builddir/build/BUILD/orchestrator-2.0.2.29/GO/src/github.com/github/orchestrator/go/inst/instance_topology.go:630 +0x327
I am enforcing TLS on some boxes which currently prevents orchestrator reaching them. Maybe there's something related here but I am not sure.
For now just making the issue visible.
In troubleshooting, it would be helpful if timestamps were included somewhere in recovery.log.
from the orchestrator log:
2017-04-06 08:50:45 INFO CommandRun(echo 'Will recover from DeadMaster on dev-mysql101:3306' >> /tmp/recovery.log,[])
but then, I don't have timestamps in the recovery.log. It looks like it is just the most recent recovery - but I am not fully sure. :) A simple solution would be to just prefix whatever gets appended to recovery.log with a commented timestamp.
Anonymize mode is great to share screenshot of replication topologies without sharing too much information about the exact minor version running and hostnames.
However, it leaks the username of the logged-in Orchestrator user.
Maybe the username should be replaced by "anon. user"
Thank you!
If the master goes down, Orchestrator promotes a new master and changes the read_only flag to OFF.
If an intermediate master goes down where read_only was OFF, Orchestrator promotes a new server but won't change read_only from ON to OFF. For example, if I use ProxySQL it won't know where it should send the write queries anymore.
I can change this variable with a hook, but my question is: is this the expected behaviour? Or can Orchestrator change this somehow by default?
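For reference, the hook can be a small program along these lines (a sketch only; the go-sql-driver/mysql import, the credentials and passing the promoted host via the {successorHost} placeholder are assumptions):

package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql" // assumed driver
)

func main() {
	// Invoked from a failover hook with the promoted host as the argument,
	// e.g. via the {successorHost} placeholder.
	if len(os.Args) < 2 {
		log.Fatal("usage: unset-read-only <successor-host>")
	}
	host := os.Args[1]
	// Credentials here are placeholders.
	dsn := fmt.Sprintf("orc_hook_user:orc_hook_password@tcp(%s:3306)/", host)
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec("SET GLOBAL read_only = 0"); err != nil {
		log.Fatal(err)
	}
	log.Printf("read_only turned off on %s", host)
}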
Thanks.
This is to discuss multi-source replication and how it can be handled in orchestrator.
Some people may actually use this, so it's good for orchestrator to at least not do the wrong thing and not break anything.
Better still would be to add increasing levels of functionality (according to demand).
Currently orchestrator runs show slave status and records the output; the last row overwrites any data from previous rows. This sort of works, but not really.
In the Instance struct we could add something like NumberOfMasters int.
The following replication-related fields of the Instance struct currently assume a single master:
MasterKey
IsDetachedMaster
Slave_SQL_Running
Slave_IO_Running
HasReplicationFilters
SupportsOracleGTID
UsingOracleGTID
UsingMariaDBGTID
UsingPseudoGTID
ReadBinlogCoordinates
ExecBinlogCoordinates
IsDetached
RelaylogCoordinates
LastSQLError
LastIOError
SecondsBehindMaster
SQLDelay
ExecutedGtidSet
GtidPurged
SlaveLagSeconds
SlaveHosts
ReplicationDepth
IsCoMaster
HasReplicationCredentials
ReplicationCredentialsAvailable
SemiSyncEnforced
This suggests putting them in a different structure which is referenced from the Instance struct.
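As a rough sketch of what that separate structure might look like (the names here are invented for illustration, not actual orchestrator code; the fields are a subset of the list above):

package inst

// InstanceKey and BinlogCoordinates are minimal stand-ins for the existing types.
type InstanceKey struct {
	Hostname string
	Port     int
}

type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// ReplicationChannel sketches a per-source replication structure; an Instance
// would hold one of these per master it replicates from.
type ReplicationChannel struct {
	MasterKey             InstanceKey
	SlaveIORunning        bool
	SlaveSQLRunning       bool
	ReadBinlogCoordinates BinlogCoordinates
	ExecBinlogCoordinates BinlogCoordinates
	LastIOError           string
	LastSQLError          string
	SecondsBehindMaster   int64
	UsingOracleGTID       bool
	ExecutedGtidSet       string
}

// Instance (sketch) keeps its existing single-source fields for compatibility
// and gains a counter plus a slice of channels for multi-source setups.
type Instance struct {
	Key             InstanceKey
	NumberOfMasters int
	Channels        []ReplicationChannel
	// ... existing single-source fields stay as they are ...
}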
In orchestrator -c topology output we could add an extra field, single-source or multi-source, to represent this.
A new command such as orchestrator -c multi-source-topology could show the masters and slaves of a specific instance.
I found that the build has the following issues/deficiencies:
Precheck if TOPDIR is in GOPATH
As a complete newbie to go I had a lot of trouble building orchestrator manually because my TOPDIR was not in GOPATH.
Given that precheck() in build.sh already checks for GOPATH, wouldn't it be a good idea to check that TOPDIR is inside it at the correct place? If need be, I can provide the required logic as a merge request.
PREFIX isn't considered for the service script's DAEMON_PATH
The PREFIX should be correctly updated in the init.d service script. Once again, let me know if you'd prefer to fix it or want me to do the needful.
This should not be allowed by default but should be possible with a -force flag.
The reason is:
But reversing replication might not work. So it's best to put a 5.6 machine in the 5.7 branch before switching the master to ensure it is replicating.
Replication from 5.7 to 5.6 only works if 5.7 is configured to create a 5.6 compatible binlog stream and is not really supported. However doing this for a short amount of time can be used to provide a rollback option in case things don't work out with the new version.
And of course 5.7→5.6 might also be 8.0→5.7 or 10.2→10.1.
// BulkReadInstance returns a list of all instances from the database
// - hostname:port is good enough
func BulkReadInstance() ([](*InstanceKey), error) {
does not work as expected, as it pulls data straight out of database_instance without doing the required munging to set all fields correctly.
Thanks to Shlomi for pointing this out. It may not be a big deal now (and it may not actually misbehave) but it needs fixing to be consistent and correct. It may cause issues later.
Basically I need to call readInstancesByCondition() appropriately.
Hello,
I have a mysql database backend with many servers that all have the same schema but orchestrator usually fails when I try relocating slaves. I always get this error about mismatching entries. It isn't clear to me what it means. The data (as far as I can tell) and tables are the same on the master and all the slaves and I don't understand what causes this. Can you provide any additional information on what might cause this "Mismatching entries"?
2016-12-06 22:12:12 ERROR Mismatching entries, aborting: table_id: ### (mydb.priorities) <-> table_id: ### (mydb.queue)
2016-12-06 22:12:12 INFO Started slave on prddba100:3306
2016-12-06 22:12:13 ERROR Unexpected: 0 events processed while iterating logs. Something went wrong; aborting. nextBinlogCoordinatesToMatch: <nil>
2016-12-06 22:12:13 DEBUG auditType:end-maintenance instance:prddba100:3306 cluster:prddba101:3306 message:maintenanceToken: 1384
2016-12-06 22:12:13 FATAL Unexpected: 0 events processed while iterating logs. Something went wrong; aborting. nextBinlogCoordinatesToMatch: <nil>
precheck() checks if $target equals linux, but $target is not set, so the check for Linux is always skipped.
A snipped trace is below:
+ precheck linux
+ local target
+ local ok=0
+ [[ '' == \l\i\n\u\x ]]
Percona Live 2017 - Birds of a Feather Discussion: Integration Between Orchestrator and ProxySQL
Notes on the BoF discussion tonight. (Apologies if I misquoted anyone).
Amsterdam
Rene - solution that fits most of the stuff
Problem Statement
No overview of entire cluster
Solution provided by someone:
Hooked with Consul with Orchestrator
No knowledge of whole architecture known to individual ProxySQL (Rene)
Jessica (Github)
Consul in production
Chubby - at Google (Sugu)
What is the Source of Truth?
Orchestrator
HA - runs with backend database
Service should be high available
Github
Simon Mudd - always running 2 clusters
We don't need the coupling between these systems to be 100% HA necessarily (Lee)
Potential Solutions
Are there any plans to recognize and/or administer proxysql?
Orchestrator should be able to receive a proxysql ip and port and then read all the servers that it connects to. Proxysql should then show up as a master.
I think later options are administering proxysql via the orchestrator interface.
Some time ago we had 2 configuration settings to optionally avoid database upgrades:
SkipOrchestratorDatabaseUpdate
SmartOrchestratorDatabaseUpdate
These were removed as the code to handle updates was improved and considered more reliable.
I think that a similar option would be good to reinstate, especially on large environments.
Reasoning is:
So probably in a small environment it's ok to let this happen automatically and I think it's fine for it to be the default behaviour. However, an option to prevent this would at least ensure that an unintended run by some orchestrator binary won't generate ALTER TABLE commands which potentially might lock up the main orchestrator node while it tries to write to the same table.
I suggest we add a configuration setting AvoidDatabaseUpdates bool, which by default has a value of false, and I intend to write a patch to support such a change.
I have been bitten by this sort of thing happening a number of times now and would prefer to stop everything and only on a single node allow the update, after which point the configuration setting would be disabled again.
gh-ost replicas are replicas that have:
binlog_format=ROW
log-slave-updates
binlog_row_image=FULL
Potentially we can also list replicas with STATEMENT binlog format, since gh-ost is able to --switch-to-rbr.
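A small sketch of the listing rule described above (invented function and parameter names; allowStatement corresponds to relying on --switch-to-rbr):

package main

import "fmt"

// isGhostReplica sketches the criteria above: a replica is usable by gh-ost
// when it has log-slave-updates, binlog_row_image=FULL and ROW binlog format,
// or STATEMENT format if we are willing to let gh-ost switch to RBR.
func isGhostReplica(binlogFormat, binlogRowImage string, logSlaveUpdates, allowStatement bool) bool {
	if !logSlaveUpdates || binlogRowImage != "FULL" {
		return false
	}
	if binlogFormat == "ROW" {
		return true
	}
	return allowStatement && binlogFormat == "STATEMENT"
}

func main() {
	// Listed only because falling back to --switch-to-rbr is acceptable.
	fmt.Println(isGhostReplica("STATEMENT", "FULL", true, true))
}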
Hi,
If I have a topology like this (just an example):
              -> rep3
rep1 --> rep2-|
              -> rep4
Rep2 is an intermediate master. If rep2 dies Orchestrator processes a DeadIntermediateMaster failover and reorganises the topology like (just an example):
rep1 --> rep4 --> rep3
So rep4 is going to be an intermediate master now. But based on the PostFailoverProcesses placeholders I can not decide who is the new intermediate master.
It has the following placeholders:
{failureType}, {failureDescription}, {failedHost}, {failureCluster}, {failureClusterAlias}, {failureClusterDomain}, {failedPort}, {successorHost}, {successorPort}, {successorAlias}, {countSlaves}, {slaveHosts}, {isDowntimed}, {isSuccessful}, {lostSlaves}
I am trying to call an external script when an intermediate master dies but the script should/has to know who is the new intermediate master after failover.
Is there any solution/ideas for this?
Thanks.
This is to bring up an issue, not a big problem, but just to make it visible. Other people may bump into this and find the current behaviour does not meet expectations. ("drag and drop is so easy")
If you have a deep replication topology then you may find you have to move things around in several steps. This is because the current orchestrator logic checks up to 2 levels deep.
Ideally it would be possible to move (where appropriate conditions apply) between any level in the tree, but it looks to me that this would require quite a bit of code refactoring. You basically would need to be aware of the "topology tree" and see if the slave to move is in that tree and there are no filters/barriers anywhere between the slave(s) current position and that of the intended new [intermediate] master.
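As a sketch of the kind of check that tree-awareness implies (invented types and names, using the instance names from the example below):

package main

import "fmt"

// node is a sketch of a topology-tree entry; Parent is nil at the master.
type node struct {
	Name                  string
	Parent                *node
	HasReplicationFilters bool
}

// pathIsClean walks up from the replica's current master towards target and
// reports whether target is an ancestor with no replication filters acting as
// a barrier anywhere on the way.
func pathIsClean(replica, target *node) bool {
	for n := replica.Parent; n != nil; n = n.Parent {
		if n == target {
			return true
		}
		if n.HasReplicationFilters {
			return false
		}
	}
	return false // target is not an ancestor of replica
}

func main() {
	master := &node{Name: "instance-390f"}
	intermediate := &node{Name: "instance-6368", Parent: master, HasReplicationFilters: true}
	leaf := &node{Name: "instance-a67d", Parent: intermediate}
	// With a filter on the intermediate, the relocation should be refused.
	fmt.Println(pathIsClean(leaf, master))
}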
Above you can see a real production setup. It might be desirable to move the bottom server instance-a67d under the primary master instance-390f [*]. I believe that current logic (v2.0.2) does not allow this.
[*] In this specific case there is a filter on instance-6368, so this relocation should not be allowed; but in a similar case where no filters existed the relocation should work.
This looks like a bug. I wanted to find out which of my two Orchestrator nodes was active (in my case it is node A), and the URI /web/status does reflect that on both nodes. Also, the Available Nodes list doesn't seem to be populating.
However, /api/status returns isActiveNode: false on both nodes, no matter which one is active.
The REST API returns a blank current active node, and always returns false in IsActiveNode for some reason.
Example return values:
Node A
{
"Code": "OK",
"Message": "Application node is healthy",
"Details": {
"Healthy": true,
"Hostname": "orc-01.lhrx.somecompany.com",
"Token": "8df743defc2007211b43375d5e2d4351eef4c49d9d3c678b9617530f71e3b356",
"IsActiveNode": false,
"ActiveNode": "",
"Error": null,
"AvailableNodes": null
}
}
Node B
{
"Code": "OK",
"Message": "Application node is healthy",
"Details": {
"Healthy": true,
"Hostname": "orc-02.amsx.somecompany.com",
"Token": "62ce620995be57a5a305c7d94cbfd553491c7e45e2aefbc651234773af55aa21",
"IsActiveNode": false,
"ActiveNode": "",
"Error": null,
"AvailableNodes": null
}
}
The orchestrator build procedure build.sh creates a directory /tmp/orchestrator-release under which files are located for "package building". The problem with this approach is that if you have two different developers building on the same server they will both write to the same directory. The default ownership of /tmp/XXXX is usually the user that created it, and usually other users can not write there. Consequently the second developer will fail to build orchestrator unless the directory is removed first.
A simple way to avoid this issue would be to build under /tmp/<username>-orchestrator-release, which would require minimal changes to any existing build scripts, though perhaps the build directory can be located somewhere more standard.
Hi,
I am testing orchestrator with 5.7.17, a master and two slaves. I have moved one of the slaves to change the topology to A-B-C and then executed orchestrator -c graceful-master-takeover -alias myclusteralias
The issues found are:
mysql.slave_master_info (in the cluster).
Thanks for this amazing tool!
Regards,
Eduardo
If you run with servers in an environment where some servers require TLS connections then it is currently hard to let orchestrator know which servers should be connected to using TLS and which should not. There's no per-server indication.
If you have a configuration to not use TLS and try to connect to a server requiring this you'll get:
$ mysql --ssl-mode=DISABLED -u myuser -h myhost -p
Enter password:
ERROR 3159 (HY000): Connections using insecure transport are prohibited while --require_secure_transport=ON.
Solved with:
$ mysql --ssl-mode=REQUIRED -u myuser -h myhost -p # works
See: https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_require_secure_transport
So it looks worthwhile to have a few options of providing support when not all servers are configured the same:
It is not 100% clear to me yet which is the best way forward but I do think it is likely that a mixed "TLS connections" / "non-TLS connections" environment may exist in many places so coming up with a way to resolve this would be good.
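One possible direction, sketched below (the per-host map and the go-sql-driver/mysql tls DSN parameter are my assumptions, not existing orchestrator configuration):

package main

import "fmt"

// topologyDSN sketches building a per-host DSN: hosts known to set
// require_secure_transport=ON get tls=true, everything else stays plain.
// (go-sql-driver/mysql understands the tls and timeout DSN parameters.)
func topologyDSN(user, password, host string, port int, requireTLS map[string]bool) string {
	tlsParam := "false"
	if requireTLS[host] {
		tlsParam = "true"
	}
	return fmt.Sprintf("%s:%s@tcp(%s:%d)/?tls=%s&timeout=1s", user, password, host, port, tlsParam)
}

func main() {
	requireTLS := map[string]bool{"secure-db.example.com": true}
	fmt.Println(topologyDSN("orc_user", "orc_password", "secure-db.example.com", 3306, requireTLS))
	fmt.Println(topologyDSN("orc_user", "orc_password", "legacy-db.example.com", 3306, requireTLS))
}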
Code that checks for slaves and co-masters should also check wsrep variables and recognize Galera nodes as co-masters (which can't be dragged or demoted).
According to the docs:
Not shown in this picture (for clarity purposes), but the orchestrator backend database and its replicas are themselves one of those topologies polled by orchestrator. It eats its own dogfood.
Can you elaborate on what the failure modes of Orchestrator are with regard to the backend server? For instance, if the backend mysql is unavailable can Orchestrator still do failovers based on some cached state? Can it fail over its own backend (hence dogfood)?
I'm a vitess user looking to use Orchestrator for automatic failover. I already have automation in place for creating mysql kubernetes pods for vitess, so am looking to see if I can re-use some of that in creating the orchestrator mysql backend servers.
I have been running orchestrator for some time with a custom patch which generates discovery metrics, information for each poll on the time it took to get the status of the MySQL server being checked. Information collected showed how long was spent doing "database calls" on the server being discovered/polled and also on the orchestrator backend database.
I noticed that when orchestrator was polling a large number of MySQL servers that the metrics could vary significantly depending on the location of the orchestrator server compared to the orchestrator backend database. This information has been used to identify and fix several issues and also to provide a bulk import mechanism all of which has been incorporated into orchestrator via pull requests.
However, the metric collection patches have not been provided as pull requests as they were rather ugly and I had not come up with a mechanism which seemed generic enough to be used by anyone.
This issue is to discuss my ideas on solving this properly.
This comprises two parts:
The first part has been done against outbrain/orchestrator code so needs to be adapted against github/orchestrator code. That should be relatively straightforward.
For the second part I'd like to generate two API endpoints:
Example aggregated values I use are:
The example above shows 2 different orchestrator systems monitoring some servers. As you can see a small spike shows a sudden unexpected change in metrics times (probably not important here). Metric times are different as the monitoring orchestrator servers are located in different datacentres.
I think that providing the information in the way described would make it easy for any user to collect the values and incorporate them into their own monitoring or graphing systems.
More specific details can be discussed, but this issue is to discuss this change, which I propose to provide as a pull request in the near future.
This is applicable when BufferInstanceWrites == true.
I recently added some counters to monitor the number of times InstancePollSeconds gets exceeded during discovery. The number seen should normally be quite low, but I've seen that on a busy orchestrator server, especially when talking to an orchestrator backend in a different datacentre, the number of times this happens can jump significantly.
Consequently better management and monitoring of this is needed.
Thoughts involve:
Being able to adjust InstanceFlushIntervalMilliseconds and InstanceWriteBufferSize.
Measuring the time taken for flushInstanceWriteBuffer to run. A single metric every minute is useless, so I need to collect metrics and then be able to provide aggregate data and percentile timings in a similar way to how the discovery timings are handled.
With these changes it should be easier to see where the bottleneck is and to be able to adjust the configuration "dynamically" to ensure the required performance is achieved.
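For illustration, aggregating the collected timings into percentiles could look roughly like this (a self-contained sketch, not the actual metrics code):

package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the collected durations,
// using the nearest-rank method on a sorted copy.
func percentile(durations []time.Duration, p float64) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(float64(len(sorted))*p/100.0 + 0.5)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Pretend these are timings of successive flushInstanceWriteBuffer runs.
	timings := []time.Duration{3 * time.Millisecond, 5 * time.Millisecond, 40 * time.Millisecond, 4 * time.Millisecond}
	fmt.Println("p50:", percentile(timings, 50), "p95:", percentile(timings, 95))
}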
I recently had an intermediate master failure and orchestrator promoted a slave which had a promotion rule of must_not. This was not expected.
instance.PromotionRule settings seem to be correctly set and used, and I was using the API calls to ensure that all servers had the expected PromotionRule I required. This works fine. I double checked the promoted server's rules and they showed must_not.
The failure was detected as DeadIntermediateMasterWithSingleSlaveFailingToConnect.
RecoverDeadIntermediateMaster() calls GetCandidateSiblingOfIntermediateMaster(), which as far as I can see makes no reference to PromotionRule settings.
PromotionRule appears to be only used on semi-sync slaves in certain circumstances.
So if PromotionRule is currently used for determining the best alternative intermediate master to promote, can you point me in the right direction? I guess the same goes for primary master election too (though the issue I saw was on a failed intermediate master).
If this functionality is missing and you want help in implementing it then please let me know. I really want to use this functionality, and given the CandidatePromotionRule type and description, and the fact there's an Instance.PromotionRule column, I assumed that this was working now.
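To make the request concrete, the selection behaviour I'm after is roughly this (a sketch only; real candidate selection in GetCandidateSiblingOfIntermediateMaster considers much more than this):

package main

import "fmt"

// sibling is a sketch of a candidate sibling with its configured promotion rule.
type sibling struct {
	Name          string
	PromotionRule string // e.g. "must", "prefer", "neutral", "prefer_not", "must_not"
}

// pickCandidateSibling sketches the requested behaviour: never pick a
// must_not server, and otherwise prefer the most favourable rule.
func pickCandidateSibling(siblings []sibling) (sibling, bool) {
	order := map[string]int{"must": 0, "prefer": 1, "neutral": 2, "prefer_not": 3}
	best, found := sibling{}, false
	for _, s := range siblings {
		if s.PromotionRule == "must_not" {
			continue // excluded regardless of anything else
		}
		if !found || order[s.PromotionRule] < order[best.PromotionRule] {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	candidates := []sibling{
		{"host-a:3306", "must_not"},
		{"host-b:3306", "neutral"},
	}
	fmt.Println(pickCandidateSibling(candidates))
}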
Related is the fact that following the exact details of an intermediate master failure (it being noticed as part of the discovery process and also the fact the slaves also notice this, followed by the process of relocation and final termination of the recovery process) is really hard. If you have few servers then there is less logging, but as the number of servers increases, finding the relevant lines in the log file is quite tricky. I wonder if there is a way that some sort of reference can be generated and passed down in all related logging to allow me to do something like a grep of that reference and see the full recovery history. This would mean that if a failure happens then you could easily provide a detailed audit of everything that happened related to the recovery process. While there is auditing information it seems not to be sufficiently detailed at the moment to provide a full trail. I think it would be good to have something which can run and which can give all the information needed to explain in detail the timing of when the problem happened, what was done once it was detected to resolve it, and whether the process completed successfully or not. I think that something like this would be useful, and more so for people who are not aware of all the code details of orchestrator.
Hi all,
I've just created a PR to update the Dockerfile to produce a smaller base image.
Sorry for not creating an issue beforehand, I've just read that's how it should be done.
Anyway, hope it's useful to you.
Thanks for the amazing tool that Orchestrator is!
The provided init scripts make orchestrator start and run as the root user. That's really not necessary, so it would be good to provide sample init scripts that start orchestrator as a (dedicated?) non-root user.
I would suggest:
running orchestrator as an orchestrator user or some specific non-root user
allowing the go binary to bind to low ports, or somehow configuring the user to have such privileges. That's likely to depend on the OS being used, but as not everyone will be binding to low ports this may not be such an issue.
Orchestrator is aware of the replication environment but not aware of the larger environment in which a server may be running.
It may be desirable to add certain properties to a server, ignored by orchestrator but stored by it so they can be seen by the user, especially on the http interface.
Ideas for such properties or labels may be:
If we allow orchestrator to store this information it must be able to also remove and display it.
These properties should be persistent across an orchestrator restart.
The aim of adding these properties is so that it's easier to see them in the GUI but also external tooling around the MySQL servers or orchestrator may find it helpful to query orchestrator to get these values.
A URL such as: https://orchestrator.mydomain.com/web/cluster/somecluster:3306?compact=true provides a good summary. If I'm in compact mode I don't see all the hosts but a bubble with the number of hosts replicating underneath a master or intermediate master.
If you mouse over that bubble then you get a list of hostnames, and mysql versions.
I'd like to also see the replication delay seen by orchestrator, as that would give me a quick overview of all the delays and save me opening up the compact list.
This is convenient if you have several slaves replicating from a master.
Work has been started to audit the recovery process for later analysis. Studying the logs on a busy orchestrator server can be really hard. It's also good to keep this information to later explain what happened and why.
After looking at v2.1.1 I've triggered a test failure of an intermediate master and notice the following. Original logging shows:
2017-04-10 13:54:30 searching for the best candidate sibling of dead intermediate master dead-intermediate-master:3306
2017-04-10 13:54:30 found replacement-intermediate-master:3306 as a replacement for dead-intermediate-master:3306 [any sibling]
2017-04-10 13:54:30 - RecoverDeadIntermediateMaster: will next attempt regrouping of replicas
2017-04-10 13:54:30 - RecoverDeadIntermediateMaster: will next attempt relocating to another DC server
2017-04-10 13:54:30 - RecoverDeadIntermediateMaster: will attempt a candidate intermediate master: replacement-intermediate-master:3306
2017-04-09 13:03:52 WARNING Discovery failed for host: dead-intermediate-master:3306 in 1.269s (Backend: 0.267s, Instance: 1.002s), error: ReadTopologyInstanceBufferable failed: dial tcp 1.2.3.4:3306: i/o timeout
Perhaps recording something like dead-intermediate-master had error: dial tcp 1.2.3.4:3306: i/o timeout would be useful?
The goal in my case here is to have a checklist of the steps taken, to be able to see if everything went ok, to know why certain actions were taken, and to see a high level overview of failures with enough information to understand their cause.
I am very happy to see the work that has been done so far. This helps a lot.
The orchestrator-agent repo doesn't have Issues turned on, so I'm putting this here. If there's another place to go with this, let me know.
When doing a seed through Orchestrator, a sub-script run by the agent on the seed node (like SendSeedDataCommand) can fail and return an error. That error ends up in the orchestrator-agent logs:
2017-04-05 15:16:20 ERROR exit status 1
However, Orchestrator never detects this and continues to try to start up the seed target:
Seed states
State start time State action Error message
2017-04-05 15:02:46 Starting MySQL on target: slave-2 Get http://slave-2:3002/api/mysql-start?token=84defa43fd46885f91cf4ede88dd79ea7cfcbfc2debf2e3bfe98205f9e3ac6e1: net/http: timeout awaiting response headers
2017-04-05 15:02:46 Unmounting logical volume: /dev/vg1/mysql-orchestrator-snapshot-1491339949
2017-04-05 15:02:46 Executing post-copy command on slave-2
2017-04-05 15:02:46 Copied 292.35 kB / 510.69 MB (0%)
2017-04-05 15:02:45 slave-1 will now send data to slave-2 in background
2017-04-05 15:02:43 Waiting some time for slave-2 to start listening for incoming data
2017-04-05 15:02:43 slave-2 will now receive data in background
2017-04-05 15:02:43 Aquiring target host datadir free space on slave-2
2017-04-05 15:02:43 Erasing MySQL data on slave-2
2017-04-05 15:02:43 MySQL data volume on source host slave-1 is 535498696 bytes
2017-04-05 15:02:43 Mounting logical volume: /dev/vg1/mysql-orchestrator-snapshot-1491339949
2017-04-05 15:02:43 Checking mount point on source slave-1
2017-04-05 15:02:43 Looking up available snapshots on source slave-1
2017-04-05 15:02:43 Checking MySQL status on target slave-2
2017-04-05 15:02:43 getting source agent info for slave-1
2017-04-05 15:02:43 getting target agent info for slave-2
Orchestrator does see this as a failure, but it marks it as such because the MySQL startup on the seed node fails, not because the transfer failed.
Ideally, the erase step would be skipped until the transfer succeeds, but that's kind of difficult to work around without doubling the required space on the target node. At the very least, though, Orchestrator should properly report that the transfer failed instead of continuing through to start the server up.
I noticed some configuration settings in my /etc/orchestrator.conf.json were no longer valid. Orchestrator reads in the file and does not indicate there's an issue of any sort.
While I'm happy with lazy evaluation and also potentially having invalid settings in the config file it would be most helpful I think to be able to identify such invalid settings. Two thoughts come to mind:
(A) A command such as:
$ orchestrator -c check_validate_config_file
which would indicate if the config contents are good or not. Invalid JSON settings would generate an error.
(B) A config setting, something like: "ValidateConfigSetting": true,
This is the setting I would prefer, but of course setting it now would do nothing until the option is actually implemented, so I probably need to use a combination of both (A) and (B).
Anyway this issue is simply to flag this potential problem. There are a large number of options that can be used in the config file, a typo is quite easy to make, and rapidly changing code may leave a setting which was expected to do something actually invalid, so it is silently ignored rather than generating an error.
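As an illustration of option (A), Go's encoding/json decoder can already flag unknown keys, so a validation pass could be as small as this (a sketch; the Config struct here is a stand-in, not orchestrator's real config struct):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config is a tiny stand-in for the real configuration struct; in practice
// the full struct with all supported settings would be used here.
type Config struct {
	ListenAddress       string
	InstancePollSeconds int
}

func validateConfigFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	decoder := json.NewDecoder(f)
	decoder.DisallowUnknownFields() // any setting not in Config becomes an error
	var c Config
	return decoder.Decode(&c)
}

func main() {
	if err := validateConfigFile("/etc/orchestrator.conf.json"); err != nil {
		fmt.Println("config problem:", err)
		os.Exit(1)
	}
	fmt.Println("config OK")
}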