
yugabyte-boshrelease

This is a BOSH release for YugabyteDB.

server-to-server tls

TLS for server-to-server ("node-to-node", as in, traffic between tserver and/or master nodes) is on and required by default, i.e. allow_insecure_connections: false by default. You can modify these properties using operator files.
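
For example, a minimal ops file to relax that default might look like the sketch below. The property name (allow_insecure_connections) comes from this README, but the exact property path and the instance group / job names (yb-master, yb-tserver) are assumptions; check the actual job specs and manifests/yugabyte.yml for the real ones. The same pattern applies to the client-to-server properties described further down.

# hypothetical ops file: allow insecure (non-TLS) server-to-server connections
- type: replace
  path: /instance_groups/name=yb-master/jobs/name=yb-master/properties/allow_insecure_connections?
  value: true
- type: replace
  path: /instance_groups/name=yb-tserver/jobs/name=yb-tserver/properties/allow_insecure_connections?
  value: true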

We use BOSH's credhub integration to generate individual certificates for both the master and tserver instance groups, leveraging wildcard BOSH DNS values for the certificate SANs so that the actual hostname DNS values are handled automatically. Since both are signed by the same CA (by default located in credhub under /services/tls_ca, the service-instance CA that nearly all other Cloud Foundry service offerings leverage for TLS), and each has the same common_name, they should be compatible with one another.
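
Roughly, the idea in the manifest's variables section is sketched below. The variable names, common_name value, and wildcard DNS patterns are illustrative, not copied from manifests/yugabyte.yml; the point is that both certs share the /services/tls_ca CA and a common_name, and use wildcard BOSH DNS names as SANs so individual instance hostnames never need to be enumerated.

# illustrative sketch only, not the release's actual variables block
variables:
- name: yb_master_server_tls
  type: certificate
  options:
    ca: /services/tls_ca
    common_name: yugabyte.service.internal   # same example CN on both certs
    alternative_names:
    - "*.yb-master.default.yugabyte.bosh"    # example wildcard BOSH DNS name
- name: yb_tserver_server_tls
  type: certificate
  options:
    ca: /services/tls_ca
    common_name: yugabyte.service.internal
    alternative_names:
    - "*.yb-tserver.default.yugabyte.bosh"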

It's a bit unclear to me how common_name and alternative_names should be configured. Is it completely arbitrary? Does the file name actually matter? Does it have to be related to the DNS hostname of each node instance? We'll all figure it out together 💖

For the moment we'll assume it expects the name to match the configured hostname of the individual host. We can assume this because of the following log line from /var/vcap/sys/log/yb-master/yb-master.INFO:

tail yb-master.INFO
...
I0305 00:19:30.295537     6 secure.cc:102] Certs directory: /var/vcap/jobs/yb-master/config/certs, name: q-m90323n3s0.q-g88658.bosh

client-to-server tls

TLS for client-to-server (as in, from a client application using the universe) is on, but not required by default, i.e. allow_insecure_connections: true by default for optional use of TLS from clients. You can modify these properties using operator files.

Note: YEDIS does not support client-to-server TLS.

regarding rpc_bind and broadcast_bind

You might see lines like this in current configurations:

--rpc_bind_addresses=<%= spec.address %>:<%= p("rpc_bind_addresses_port") %>
--server_broadcast_addresses=<%= spec.address %>:<%= p("rpc_bind_addresses_port") %>

Notice how --server_broadcast_addresses uses an address with rpc_bind_addresses_port as the port. This is because the difference between rpc_bind_addresses_port and something like a separate server_broadcast_addresses_port is too small at the moment to really matter, so for the time being they're collapsed into one and only rpc_bind_addresses_port is referenced. Is that correct? Honestly, not 100% sure. Actually, I'm 100% sure it isn't correct or ideal. But for the time being it works, and you know what, we'll get there.

why some gflags are BOSH properties and others are just... gflags

Certain flags (but not all) are defined as their own property with their own defaults, descriptions, opsfiles, etc. These properties are (somewhat arbitrarily) deemed important enough to stand out. My opinion is that flags important enough to make a difference to a consumer of this release should receive their own property, with reasonable defaults and a description, while gflags acts as a backup and a catch-all.

There are many flags which should have reasonable defaults, which either are specific to this BOSH release (and thus aren't defined in upstream YugabyteDB), or which we feel should differ from the upstream YugabyteDB defaults. But if we don't define those configuration flags as their own properties in the BOSH job spec, and instead rely on gflags: {x: y} to pass in everything, then there's no way (that I'm aware of?) for the maintainers of this BOSH release to set default gflags in such a way that consumers could selectively override individual flags. For example: if someone wanted to override one flag, like placement_cloud, then all the defaults we set for gflags in the job spec file would no longer apply. A consumer would have to re-define all of our defaults (if they so chose) in their gflags override in addition to the one flag they wanted to change.
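
To make that concrete, here's a hedged illustration; the flag values, the supposed spec defaults, and the ops file path are all made up for the example. Suppose the job spec defaulted gflags to {placement_cloud: bosh, placement_region: us-west-2}. A consumer who only wants to change placement_cloud still ends up replacing the entire hash:

# hypothetical consumer ops file
- type: replace
  path: /instance_groups/name=yb-tserver/jobs/name=yb-tserver/properties/gflags
  value:
    placement_cloud: gcp          # the one flag they actually wanted to change
    placement_region: us-west-2   # must be restated, or the spec default is silently lost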

rotating the YCQL admin "cassandra" user password

There is a default YCQL superadmin with the credentials cassandra/cassandra. The password for the cassandra user can be rotated in a two-step process. You'll need to configure the cassandra_password_old property, which will be used while attempting to set the new password to cassandra_password. Once the new password of cassandra_password is set and in-use, you can remove the opsfile for cassandra_password_old at your discretion.
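
As a sketch, the transition deploy could carry an ops file like the one below. The instance group / job names and property paths here are hypothetical; only the cassandra_password / cassandra_password_old property names come from this README.

# step 1 (hypothetical ops file): pin the old password so the release can still
# authenticate as cassandra while it applies the new cassandra_password
- type: replace
  path: /instance_groups/name=yb-tserver/jobs/name=yb-tserver/properties/cassandra_password_old?
  value: previous-cassandra-password
# step 2: once the new cassandra_password is set and in use, remove this ops
# file from the deploy at your discretion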

Now, with that said, keep in mind:

The default manifest in manifests/yugabyte.yml will automatically change the cassandra user password to an autogenerated password of ((ycql_cassandra_password)). The cassandra user is then used for other internal administrative tasks, like provisioning other users, etc. It also provides a default "superuser" of admin with a password of ((ycql_superuser_admin_password)). The intent is that this user be used by consuming applications instead of cassandra/cassandra. That's the current ideal, at least.

In order to change the password of a user through ycql.databases.superusers[*].password: some_password, just change the value of some_password in-place. The root cassandra user is used internally to ALTER those superusers, so you don't need to worry about doing fancy swapouts of those passwords. Just change it in the deployment manifest, and it'll rotate on the next deploy.
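
For example (a sketch; the nesting around ycql.databases.superusers is assumed from the property path above, and admin / ((ycql_superuser_admin_password)) are the defaults mentioned earlier):

# hypothetical manifest excerpt; to rotate, change the password value in-place and redeploy
properties:
  ycql:
    databases:
      superusers:
      - name: admin
        password: ((ycql_superuser_admin_password))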

cutting releases

Having a fully automated release process is a goal. But we want to make sure it's done well, and we'd like to do it with GitHub Actions if possible. Until then, here's the general workflow. We're assuming any bosh add-blobs and bosh upload-blobs changes have already been git commit'ed if blobs changed, and that we're now on the release process itself.

NOTE: before cutting a new release, make sure that the contents of src/yugabyte-additional/post_install.sh have proper values of ORIG_BREW_HOME and ORIG_LEN and such depending on the upstream version of yugabyte being cut.

cd yugabyte-boshrelease

# first of all, your workspace needs to be up-to-date and clean (no uncommitted
# changes), or else you'll inadvertently include something in this release
git pull origin main

# to pull all blobs from s3 to local directory, if necessary
bosh sync-blobs

git checkout -b release-x.y.z

# place the release tgz in your /tmp dir in order to calculate a shasum on it, and to upload to a github release
bosh create-release --final --version=x.y.z --tarball=/tmp/yugabyte-x.y.z.tgz

# this will be used to update the versions.yml
shasum -a 1 /tmp/yugabyte-x.y.z.tgz

# use that shasum value to update the manifests/versions.yml
yugabyte_boshrelease_sha1: 582c112d4621361a031e530885f5653868f1bbd0
yugabyte_boshrelease_version: x.y.z

# git commit all of this to the branch
git add -A
git commit -m "release-x.y.z"
git push origin release-x.y.z

# squash 'n merge it into main

now for making the release available as an actual github release:

# after squashing and merging into main...
git checkout main
git pull origin main

# notice the lack of 'v' prefix. not a fan of it.
git tag x.y.z
git push origin --tags

then go to the github releases page, click on the release for the newly created tag, and configure the release with a title, release notes, and an asset copy of the tarball from /tmp/yugabyte-x.y.z.tgz

voila, you're set.

contributing

Ideas, feedback, bug reports, etc. are all welcome, but by no means guaranteed to be implemented, responded to, or merged.


yugabyte-boshrelease's Issues

YSQL general

some reading material which may be generally beneficial:


tserver/e954712b-3feb-47cf-b917-c730eca00895:/var/vcap/jobs/yb-tserver# /var/vcap/packages/yugabyte/bin/ysqlsh -h 10.156.89.41 -p 5433
ysqlsh: FATAL:  Not found: Error loading table with oid 1260 in database with oid 1: The object does not exist: table_id: "000000010000300080000000000004ec"

https://www.postgresql.org/docs/11/config-setting.html#CONFIG-SETTING-CONFIGURATION-FILE

remove yugabyted and yb-ctl jobs

In favor of just yb-master and yb-tserver. It makes more sense to simplify and cut the other jobs out for the time being.

yugabyted and yb-ctl are for local clusters, and we could do some interesting stuff like bosh ssh options to forward localhost connections to a locally spun-up yugabyted cluster and such, but honestly, just get rid of them in favor of a single-master, single-tserver deployment option for goofing around.

consider adding yedis proxy binding flags, etc

Note: from an example tserver, it looks like these are already filled in:

--cql_proxy_bind_address=q-m97997n3s0.q-g96704.bosh:9042
--cql_proxy_webserver_port=12000
--enable_direct_local_tablet_server_call=true
--inbound_rpc_memory_limit=0
--pgsql_proxy_bind_address=
--redis_proxy_bind_address=q-m97997n3s0.q-g96704.bosh:6379

eval if services should bind to all network interfaces or also localhost

Currently we have every service bind to the private IP, but perhaps we should have them bind to 0.0.0.0, or offer more configurable options than just the private IP.

For example, if using the yedis-cli you can get on a tserver node and connect to that tserver's YEDIS API using the private IP of the host, but not via localhost. Is that a problem? Probably not, but it's worth making a little note about.

https://docs.yugabyte.com/latest/troubleshoot/cluster/connect-yedis/#root

consume/provide links could be updated to have named peers

prometheus and/or indicator protocol integration

evaluate and add in ulimits, process limits, sysctl reqs, etc.

tls, server-to-server, client-to-server

In order to do server-to-server TLS and make it easy-as-pie, the bosh variables generation needs to use links to consume one another instead of manual configuration of alternative_names
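
A rough sketch of what that could look like, leaning on BOSH's support for certificate variables consuming links (director and CredHub versions permitting); the link name, wildcard option, and common_name are assumptions:

# illustrative only: a certificate variable deriving its SANs from a link
# instead of manually configured alternative_names
variables:
- name: yb_tserver_server_tls
  type: certificate
  consumes:
    alternative_name:
      from: yb-tserver                 # assumed link name provided by the yb-tserver job
      properties: { wildcard: true }
  options:
    ca: /services/tls_ca
    common_name: yb-tserver.yugabyte.internal   # example value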

see:

package python for clis such as cqlsh

If we're going to use cqlsh on tservers directly, they'll need Python.

If not (as in, if we're going to run initial setup as a job/errand in a different instance group), then that job will need Python.

will figure it out a bit later

./cqlsh 
No appropriate python interpreter found.

cat cqlsh

# bash code here; finds a suitable python interpreter and execs this file.
# prefer unqualified "python" if suitable:
python -c 'import sys; sys.exit(not (0x020700b0 < sys.hexversion < 0x03000000))' 2>/dev/null \
    && exec python "`python -c "import os;print(os.path.dirname(os.path.realpath('$0')))"`/cqlsh.py" "$@"
for pyver in 2.7; do
    which python$pyver > /dev/null 2>&1 && exec python$pyver "`python$pyver -c "import os;print(os.path.dirname(os.path.realpath('$0')))"`/cqlsh.py" "$@"
done
echo "No appropriate python interpreter found." >&2
exit 1

which makes sense, since it calls cqlsh.py, which is the basis of the Cassandra CLI

https://pypi.org/project/cqlsh/

Originally posted by @aegershman in #56 (comment)

symlinking, bosh, and _you_

http://tiewei.github.io/bosh/BOSH-Terms-and-Working-Steps/

Packages are compiled on demand during the deployment. The director first checks to see if there is already a compiled version of the package for the stemcell version it is being deployed to; if a compiled version doesn't already exist, the director will instantiate a compile VM (using the same stemcell version the package is going to be deployed to), which will get the package source from the blobstore, compile it, and then package the resulting binaries and store them in the blobstore.

The packaging script is responsible for the compilation and is run on the compile VM. The script gets two environment variables set by the BOSH agent:

BOSH_INSTALL_TARGET: Tells where to install the files the package generates. It is set to /var/vcap/data/packages/<package name>/<version>.

BOSH_COMPILE_TARGET: Tells the directory containing the source (it is the current directory when the packaging script is invoked).

When the package is installed, a symlink is created at /var/vcap/packages/<package name> which points to the latest version of the package. This link should be used when referring to another package in the packaging script.

https://docs.yugabyte.com/latest/contribute/core-database/build-from-src/#ubuntu18

here's a bunch of symlinking happening in the release manifest: https://github.com/yugabyte/yugabyte-db/blob/master/yb_release_manifest.json

validate whether masters are confused about connecting to themselves

seeing these kinds of log lines on master servers:

I0212 20:52:53.386559    16 reactor.cc:450] Master_R001: Timing out connection Connection (0x000000000291e010) server 10.156.89.36:55155 => 10.156.89.36:7100 - it has been idle for 65.0004s (delta: 65.0004, current time: 996.106, last activity time: 931.106)

makes me wonder if we need to be more clever about the master connection string and have each master filter its own hostname out and replace it with localhost? just thoughts

does the yugabyte helm chart do it? https://github.com/yugabyte/charts/blob/master/stable/yugabyte/templates/_helpers.tpl#L57

override node/universe uuids to match bosh-managed uuids

Inbound connection requests coming from CF syslog-scheduler

Inbound yb_rpc calls from 10.156.86.21, which appears to be syslog_scheduler/59ac6012-0da8-4c84-9e7d-fe016f2e92fd from the cf deployment.

from tserver logs at http://10.156.89.37:9000/logs

W0211 18:23:02.524586    12 connection.cc:281] Connection (0x000000000222f8d0) server 10.156.86.21:35906 => 10.156.89.37:9100: Command sequence failure: Network error (yb/rpc/yb_rpc.cc:141): Invalid connection header: 1603010101010000FD03033052F1AB8F4DC20D88135C77B735F13E19090F441CAB71D7C905FC09079A3D0320D8F867C2AED349FD20E14760971EFFC749FD904C2204DB46BC9C40B3119102E80026C02FC030C02BC02CCCA8CCA9C013C009C014C00A009C009D002F0035C012000A1301130313020100008E00000013001100000E73797374656D2D6D657472696373000500050100000000000A000A0008001D001700180019000B00020100000D001A0018080404030807080508060401050106010503060302010203FF0100010000120000002B00050403040303003300260024001D00203FD19BC43DFD73E0DBEF13E59A04BC8B9618DF8EB2AB0B99CECC15F020B96276
W0211 18:23:02.524672    12 tcp_stream.cc:130] { local: 10.156.89.37:9100 remote: 10.156.86.21:35906 }: Shutting down with pending inbound data ({ capacity: 131072 pos: 0 size: 262 }, status = Network error (yb/rpc/yb_rpc.cc:141): Invalid connection header: 1603010101010000FD03033052F1AB8F4DC20D88135C77B735F13E19090F441CAB71D7C905FC09079A3D0320D8F867C2AED349FD20E14760971EFFC749FD904C2204DB46BC9C40B3119102E80026C02FC030C02BC02CCCA8CCA9C013C009C014C00A009C009D002F0035C012000A1301130313020100008E00000013001100000E73797374656D2D6D657472696373000500050100000000000A000A0008001D001700180019000B00020100000D001A0018080404030807080508060401050106010503060302010203FF0100010000120000002B00050403040303003300260024001D00203FD19BC43DFD73E0DBEF13E59A04BC8B9618DF8EB2AB0B99CECC15F020B96276)
W0211 18:23:02.524732    12 tcp_stream.cc:130] { local: 10.156.89.37:9100 remote: 10.156.86.21:35906 }: Shutting down with pending inbound data ({ capacity: 131072 pos: 0 size: 262 }, status = Service unavailable (yb/rpc/reactor.cc:91): Shutdown connection (system error 108))

Found logs from the scheduler. That's hilarious: it pings on :9100, and I believe this is what causes the tservers to puke:

<14>1 2020-02-11T18:31:02.866711Z 10.156.86.21 loggr-metric-scraper rs2 - [instance@47450 director="" deployment="cf-52b8aeeeda6f562e05f9" group="syslog_scheduler" az="us-west-2a" id="59ac6012-0da8-4c84-9e7d-fe016f2e92fd"] [id: syslog_scheduler, instance_id: , metric_url: https://10.156.89.37:9100/metrics]: Get https://10.156.89.37:9100/metrics: EOF

So... I think we could try changing the binding ports to communicate on something different? Or find some way to not fail on those requests?

consider switching from flagfiles to pure cli args

Because the flags library that YugabyteDB uses will FAIL validation if an --argument=like_this is passed directly to the yb-{master,tserver} binary via args, but will ALLOW unknown or invalid flags when resolving a flagfile.

not a huge deal

see #78 for an interesting rationale of this (--use-cassandra-auth to masters)

nodes reporting in have hosts of localhost

[screenshot: nodes reporting in with host "localhost"]

Doesn't appear to be affecting the cluster at this exact moment, but I'm curious why it's happening.

EDIT: spotted in the wild during an upgrade. Notice how it's using the bosh-dns hostname here. Interesting.

[screenshot: node reporting in with its bosh-dns hostname]
