
yugabyte-boshrelease

This is a BOSH release for YugabyteDB.

server-to-server tls

TLS for server-to-server ("node-to-node", as in, traffic between tserver and/or master nodes) is on and required by default, i.e. allow_insecure_connections: false by default. You can modify these properties using operator files.
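
For example, a minimal ops file to relax that default might look like the sketch below. The property name (allow_insecure_connections) comes from this README, but the exact property path and the instance group / job names (yb-master, yb-tserver) are assumptions; check the actual job specs and manifests/yugabyte.yml for the real ones. The same pattern applies to the client-to-server properties described further down.

# hypothetical ops file: allow insecure (non-TLS) server-to-server connections
- type: replace
  path: /instance_groups/name=yb-master/jobs/name=yb-master/properties/allow_insecure_connections?
  value: true
- type: replace
  path: /instance_groups/name=yb-tserver/jobs/name=yb-tserver/properties/allow_insecure_connections?
  value: true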

We use BOSH's credhub integration to generate individual certificates for both the master and tserver instance groups, leveraging wildcard BOSH DNS values for the certificate SANs so that the actual hostname DNS values are handled automatically. Since both are signed by the same CA (by default located in credhub under /services/tls_ca, the service-instance CA that nearly all other Cloud Foundry service offerings leverage for TLS), and each has the same common_name, they should be compatible with one another.
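
Roughly, the idea in the manifest's variables section is sketched below. The variable names, common_name value, and wildcard DNS patterns are illustrative, not copied from manifests/yugabyte.yml; the point is that both certs share the /services/tls_ca CA and a common_name, and use wildcard BOSH DNS names as SANs so individual instance hostnames never need to be enumerated.

# illustrative sketch only, not the release's actual variables block
variables:
- name: yb_master_server_tls
  type: certificate
  options:
    ca: /services/tls_ca
    common_name: yugabyte.service.internal   # same example CN on both certs
    alternative_names:
    - "*.yb-master.default.yugabyte.bosh"    # example wildcard BOSH DNS name
- name: yb_tserver_server_tls
  type: certificate
  options:
    ca: /services/tls_ca
    common_name: yugabyte.service.internal
    alternative_names:
    - "*.yb-tserver.default.yugabyte.bosh"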

It's a bit unclear to me how common_name and alternative_names should be configured. Is it completely arbitrary? Does the file name actually matter? Does it have to be related to the DNS hostname of each node instance? We'll all figure it out together 💖

For the moment we'll assume it expects the name to match the configured hostname of the individual host. We can assume this because of the following log line from /var/vcap/sys/log/yb-master/yb-master.INFO:

tail yb-master.INFO
...
I0305 00:19:30.295537     6 secure.cc:102] Certs directory: /var/vcap/jobs/yb-master/config/certs, name: q-m90323n3s0.q-g88658.bosh

client-to-server tls

TLS for client-to-server (as in, from a client application using the universe) is on, but not required by default, i.e. allow_insecure_connections: true by default for optional use of TLS from clients. You can modify these properties using operator files.

Note: YEDIS does not support client-to-server TLS.

regarding rpc_bind and broadcast_bind

You might see lines like this in current configurations:

--rpc_bind_addresses=<%= spec.address %>:<%= p("rpc_bind_addresses_port") %>
--server_broadcast_addresses=<%= spec.address %>:<%= p("rpc_bind_addresses_port") %>

Notice how --server_broadcast_addresses uses an address with rpc_bind_addresses_port as the port. This is because the difference between rpc_bind_addresses_port and something like a separate server_broadcast_addresses_port is too small at the moment to really matter, so for the time being they're collapsed into one and only rpc_bind_addresses_port is referenced. Is that correct? Honestly, not 100% sure. Actually, I'm 100% sure it isn't correct or ideal. But for the time being it works, and you know what, we'll get there.

why some gflags are BOSH properties and others are just... gflags

Certain flags (but not all) are defined as their own property with their own defaults, descriptions, opsfiles, etc. These properties are (somewhat arbitrarily) deemed important enough to stand out. My opinion is that flags important enough to make a difference to a consumer of this release should receive their own property, with reasonable defaults and a description, while gflags acts as a backup and a catch-all.

There are many flags which should have reasonable defaults, which either are specific to this BOSH release (and thus aren't defined in upstream YugabyteDB), or which we feel should differ from the upstream YugabyteDB defaults. But if we don't define those configuration flags as their own properties in the BOSH job spec, and instead rely on gflags: {x: y} to pass in everything, then there's no way (that I'm aware of?) for the maintainers of this BOSH release to set default gflags in such a way that consumers could selectively override individual flags. For example: if someone wanted to override one flag, like placement_cloud, then all the defaults we set for gflags in the job spec file would no longer apply. A consumer would have to re-define all of our defaults (if they so chose) in their gflags override in addition to the one flag they wanted to change.
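
To make that concrete, here's a hedged illustration; the flag values, the supposed spec defaults, and the ops file path are all made up for the example. Suppose the job spec defaulted gflags to {placement_cloud: bosh, placement_region: us-west-2}. A consumer who only wants to change placement_cloud still ends up replacing the entire hash:

# hypothetical consumer ops file
- type: replace
  path: /instance_groups/name=yb-tserver/jobs/name=yb-tserver/properties/gflags
  value:
    placement_cloud: gcp          # the one flag they actually wanted to change
    placement_region: us-west-2   # must be restated, or the spec default is silently lost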

rotating the YCQL admin "cassandra" user password

There is a default YCQL superadmin with the credentials cassandra/cassandra. The password for the cassandra user can be rotated in a two-step process. You'll need to configure the cassandra_password_old property, which will be used while attempting to set the new password to cassandra_password. Once the new password of cassandra_password is set and in-use, you can remove the opsfile for cassandra_password_old at your discretion.
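
As a sketch, the transition deploy could carry an ops file like the one below. The instance group / job names and property paths here are hypothetical; only the cassandra_password / cassandra_password_old property names come from this README.

# step 1 (hypothetical ops file): pin the old password so the release can still
# authenticate as cassandra while it applies the new cassandra_password
- type: replace
  path: /instance_groups/name=yb-tserver/jobs/name=yb-tserver/properties/cassandra_password_old?
  value: previous-cassandra-password
# step 2: once the new cassandra_password is set and in use, remove this ops
# file from the deploy at your discretion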

Now, with that said, keep in mind:

The default manifest in manifests/yugabyte.yml will automatically change the cassandra user password to an autogenerated password of ((ycql_cassandra_password)). The cassandra user is then used for other internal administrative tasks, like provisioning other users, etc. It also provides a default "superuser" of admin with a password of ((ycql_superuser_admin_password)). The intent is that this user be used by consuming applications instead of cassandra/cassandra. That's the current ideal, at least.

In order to change the password of a user through ycql.databases.superusers[*].password: some_password, just change the value of some_password in-place. The root cassandra user is used internally to ALTER those superusers, so you don't need to worry about doing fancy swapouts of those passwords. Just change it in the deployment manifest, and it'll rotate on the next deploy.
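
For example (a sketch; the nesting around ycql.databases.superusers is assumed from the property path above, and admin / ((ycql_superuser_admin_password)) are the defaults mentioned earlier):

# hypothetical manifest excerpt; to rotate, change the password value in-place and redeploy
properties:
  ycql:
    databases:
      superusers:
      - name: admin
        password: ((ycql_superuser_admin_password))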

cutting releases

Having a fully automated release process is a goal. But we want to make sure it's done well, and we'd like to do it with GitHub Actions if possible. Until then, here's the general workflow. We're assuming any bosh add-blobs and bosh upload-blobs changes have already been git commit'ed if blobs changed, and that we're now on the release process itself.

NOTE: before cutting a new release, make sure that the contents of src/yugabyte-additional/post_install.sh have proper values of ORIG_BREW_HOME and ORIG_LEN and such depending on the upstream version of yugabyte being cut.

cd yugabyte-boshrelease

# first of all, your workspace needs to be up-to-date and clean (no uncommitted
# changes), or else you'll inadvertently include something in this release
git pull origin main

# to pull all blobs from s3 to local directory, if necessary
bosh sync-blobs

git checkout -b release-x.y.z

# place the release tgz in your /tmp dir in order to calculate a shasum on it, and to upload to a github release
bosh create-release --final --version=x.y.z --tarball=/tmp/yugabyte-x.y.z.tgz

# this will be used to update the versions.yml
shasum -a 1 /tmp/yugabyte-x.y.z.tgz

# use that shasum value to update the manifests/versions.yml
yugabyte_boshrelease_sha1: 582c112d4621361a031e530885f5653868f1bbd0
yugabyte_boshrelease_version: x.y.z

# git commit all of this to the branch
git add -A
git commit -m "release-x.y.z"
git push origin release-x.y.z

# squash 'n merge it into main

now for making the release available as an actual github release:

# after squashing and merging into main...
git checkout main
git pull origin main

# notice the lack of 'v' prefix. not a fan of it.
git tag x.y.z
git push origin --tags

then go to the github releases page, click on the release for the newly created tag, and configure the release with a title, release notes, and an asset copy of the tarball from /tmp/yugabyte-x.y.z.tgz

voila, you're set.

contributing

Ideas, feedback, bug reports, etc. are all welcome, but by no means guaranteed to be implemented, responded to, or merged.


yugabyte-boshrelease's Issues

YSQL general

some reading material which may be generally beneficial:


tserver/e954712b-3feb-47cf-b917-c730eca00895:/var/vcap/jobs/yb-tserver# /var/vcap/packages/yugabyte/bin/ysqlsh -h 10.156.89.41 -p 5433
ysqlsh: FATAL:  Not found: Error loading table with oid 1260 in database with oid 1: The object does not exist: table_id: "000000010000300080000000000004ec"

https://www.postgresql.org/docs/11/config-setting.html#CONFIG-SETTING-CONFIGURATION-FILE

remove yugabyted and yb-ctl jobs

In favor of just yb-master and yb-tserver. It makes more sense to simplify and cut the other jobs out for the time being.

yugabyted and yb-ctl are for local clusters, and we could do some interesting stuff like bosh ssh options to forward localhost connections to a locally spun-up yugabyted cluster and such, but honestly, just get rid of them in favor of a single-master, single-tserver deployment option for goofing around.

consider adding yedis proxy binding flags, etc

Note: from an example tserver, it looks like these are already filled in:

--cql_proxy_bind_address=q-m97997n3s0.q-g96704.bosh:9042
--cql_proxy_webserver_port=12000
--enable_direct_local_tablet_server_call=true
--inbound_rpc_memory_limit=0
--pgsql_proxy_bind_address=
--redis_proxy_bind_address=q-m97997n3s0.q-g96704.bosh:6379

eval if services should bind to all network interfaces or also localhost

Currently we have every service bind to the private IP, but perhaps we should have them bind to 0.0.0.0, or offer more configurable options than just the private IP.

For example, if using the yedis-cli you can get on a tserver node and connect to that tserver's YEDIS API using the private IP of the host, but not via localhost. Is that a problem? Probably not, but it's worth making a little note about.

https://docs.yugabyte.com/latest/troubleshoot/cluster/connect-yedis/#root

consume/provide links could be updated to have named peers

prometheus and/or indicator protocol integration

evaluate and add in ulimits, process limits, sysctl reqs, etc.

tls, server-to-server, client-to-server

In order to do server-to-server TLS and make it easy-as-pie, the bosh variables generation needs to use links to consume one another instead of manual configuration of alternative_names
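
A rough sketch of what that could look like, leaning on BOSH's support for certificate variables consuming links (director and CredHub versions permitting); the link name, wildcard option, and common_name are assumptions:

# illustrative only: a certificate variable deriving its SANs from a link
# instead of manually configured alternative_names
variables:
- name: yb_tserver_server_tls
  type: certificate
  consumes:
    alternative_name:
      from: yb-tserver                 # assumed link name provided by the yb-tserver job
      properties: { wildcard: true }
  options:
    ca: /services/tls_ca
    common_name: yb-tserver.yugabyte.internal   # example value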

see:

package python for clis such as cqlsh

If we're going to use cqlsh on tservers directly, they'll need Python.

If not (as in, if we're going to run initial setup as a job/errand in a different instance group), then that job will need Python.

will figure it out a bit later

./cqlsh 
No appropriate python interpreter found.

cat cqlsh

# bash code here; finds a suitable python interpreter and execs this file.
# prefer unqualified "python" if suitable:
python -c 'import sys; sys.exit(not (0x020700b0 < sys.hexversion < 0x03000000))' 2>/dev/null \
    && exec python "`python -c "import os;print(os.path.dirname(os.path.realpath('$0')))"`/cqlsh.py" "$@"
for pyver in 2.7; do
    which python$pyver > /dev/null 2>&1 && exec python$pyver "`python$pyver -c "import os;print(os.path.dirname(os.path.realpath('$0')))"`/cqlsh.py" "$@"
done
echo "No appropriate python interpreter found." >&2
exit 1

which makes sense, since it calls cqlsh.py, which is the basis of the Cassandra CLI

https://pypi.org/project/cqlsh/

Originally posted by @aegershman in #56 (comment)

symlinking, bosh, and _you_

http://tiewei.github.io/bosh/BOSH-Terms-and-Working-Steps/

Packages are compiled on demand during the deployment. The director first checks to see if there is already a compiled version of the package for the stemcell version it is being deployed to; if a compiled version doesn't already exist, the director will instantiate a compile VM (using the same stemcell version the package is going to be deployed to), which will get the package source from the blobstore, compile it, and then package the resulting binaries and store them in the blobstore.

The packaging script is responsible for the compilation and is run on the compile VM. The script gets two environment variables set by the BOSH agent:

BOSH_INSTALL_TARGET: Tells where to install the files the package generates. It is set to /var/vcap/data/packages/<package name>/<version>.

BOSH_COMPILE_TARGET: Tells the directory containing the source (it is the current directory when the packaging script is invoked).

When the package is installed, a symlink is created at /var/vcap/packages/<package name> which points to the latest version of the package. This link should be used when referring to another package in the packaging script.

https://docs.yugabyte.com/latest/contribute/core-database/build-from-src/#ubuntu18

here's a bunch of symlinking happening in the release manifest: https://github.com/yugabyte/yugabyte-db/blob/master/yb_release_manifest.json

validate whether masters are confused about connecting to themselves

seeing these kinds of log lines on master servers:

I0212 20:52:53.386559    16 reactor.cc:450] Master_R001: Timing out connection Connection (0x000000000291e010) server 10.156.89.36:55155 => 10.156.89.36:7100 - it has been idle for 65.0004s (delta: 65.0004, current time: 996.106, last activity time: 931.106)

makes me wonder if we need to be more clever about the master connection string and have each master filter its own hostname out and replace it with localhost? just thoughts

does the yugabyte helm chart do it? https://github.com/yugabyte/charts/blob/master/stable/yugabyte/templates/_helpers.tpl#L57

override node/universe uuids to match bosh-managed uuids

Inbound connection requests coming from CF syslog-scheduler

Inbound yb_rpc calls from 10.156.86.21, which appears to be syslog_scheduler/59ac6012-0da8-4c84-9e7d-fe016f2e92fd from the cf deployment.

from tserver logs at http://10.156.89.37:9000/logs

W0211 18:23:02.524586    12 connection.cc:281] Connection (0x000000000222f8d0) server 10.156.86.21:35906 => 10.156.89.37:9100: Command sequence failure: Network error (yb/rpc/yb_rpc.cc:141): Invalid connection header: 1603010101010000FD03033052F1AB8F4DC20D88135C77B735F13E19090F441CAB71D7C905FC09079A3D0320D8F867C2AED349FD20E14760971EFFC749FD904C2204DB46BC9C40B3119102E80026C02FC030C02BC02CCCA8CCA9C013C009C014C00A009C009D002F0035C012000A1301130313020100008E00000013001100000E73797374656D2D6D657472696373000500050100000000000A000A0008001D001700180019000B00020100000D001A0018080404030807080508060401050106010503060302010203FF0100010000120000002B00050403040303003300260024001D00203FD19BC43DFD73E0DBEF13E59A04BC8B9618DF8EB2AB0B99CECC15F020B96276
W0211 18:23:02.524672    12 tcp_stream.cc:130] { local: 10.156.89.37:9100 remote: 10.156.86.21:35906 }: Shutting down with pending inbound data ({ capacity: 131072 pos: 0 size: 262 }, status = Network error (yb/rpc/yb_rpc.cc:141): Invalid connection header: 1603010101010000FD03033052F1AB8F4DC20D88135C77B735F13E19090F441CAB71D7C905FC09079A3D0320D8F867C2AED349FD20E14760971EFFC749FD904C2204DB46BC9C40B3119102E80026C02FC030C02BC02CCCA8CCA9C013C009C014C00A009C009D002F0035C012000A1301130313020100008E00000013001100000E73797374656D2D6D657472696373000500050100000000000A000A0008001D001700180019000B00020100000D001A0018080404030807080508060401050106010503060302010203FF0100010000120000002B00050403040303003300260024001D00203FD19BC43DFD73E0DBEF13E59A04BC8B9618DF8EB2AB0B99CECC15F020B96276)
W0211 18:23:02.524732    12 tcp_stream.cc:130] { local: 10.156.89.37:9100 remote: 10.156.86.21:35906 }: Shutting down with pending inbound data ({ capacity: 131072 pos: 0 size: 262 }, status = Service unavailable (yb/rpc/reactor.cc:91): Shutdown connection (system error 108))

Found logs from the scheduler. That's hilarious: it pings on :9100, and I believe this is what causes the tservers to puke:

<14>1 2020-02-11T18:31:02.866711Z 10.156.86.21 loggr-metric-scraper rs2 - [instance@47450 director="" deployment="cf-52b8aeeeda6f562e05f9" group="syslog_scheduler" az="us-west-2a" id="59ac6012-0da8-4c84-9e7d-fe016f2e92fd"] [id: syslog_scheduler, instance_id: , metric_url: https://10.156.89.37:9100/metrics]: Get https://10.156.89.37:9100/metrics: EOF

So... I think we could try changing the binding ports to communicate on something different? Or find some way to not fail on those requests?

consider switching from flagfiles to pure cli args

Because the flags library that YugabyteDB uses will FAIL validation if an --argument=like_this is passed directly to the yb-{master,tserver} binary via args, but will ALLOW unknown or invalid flags when resolving a flagfile.

not a huge deal

see #78 for an interesting rationale of this (--use-cassandra-auth to masters)

nodes reporting in have hosts of localhost

[screenshot: nodes reporting in with host "localhost"]

Doesn't appear to be affecting the cluster at this exact moment, but I'm curious why it's happening.

EDIT: spotted in the wild during an upgrade. Notice how it's using the bosh-dns hostname here. Interesting.

[screenshot: node reporting in with its bosh-dns hostname]
