OCP 4.9.9 on AWS Based on HA Cockroach DB
- Only CockroachDB
- CockroachDB, NodeHealthCheck, PoisonPill
- CockroachDB, MachineHealthCheck, PoisonPill
- Commands Used
Because this test involves stopping instances in AWS, the pod that sends the requests must not run on an instance that goes down. For this reason, we deploy the crdb-tester pod with tolerations and a nodeSelector that schedule it on a master node.
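The real manifest comes from charts/tester-pod, which is not reproduced here. As a rough sketch of what such a pod could look like, the snippet below writes a manifest that uses the standard `node-role.kubernetes.io/master` label and taint. The image tag and the `sleep` command are assumptions, not values taken from the chart.

```shell
# Sketch only: the file name, image tag, and idle command are assumptions.
cat > crdb-tester-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: crdb-tester
spec:
  # Pin the pod to master nodes, which are not stopped during the test.
  nodeSelector:
    node-role.kubernetes.io/master: ""
  # Master nodes carry a NoSchedule taint, so the pod has to tolerate it.
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  containers:
    - name: crdb-tester
      image: cockroachdb/cockroach:latest   # assumed image; pin a real version
      command: ["sleep", "infinity"]        # assumed; keeps the pod idle for kubectl exec
EOF
```

The file can be inspected or applied with `kubectl apply -f crdb-tester-pod.yaml` instead of the chart if that is more convenient.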
Scenario: When we deploy a StatefulSet of 3 replicas and stop one node, are we still able to read from and write to the database?
Scenario: When we deploy a StatefulSet of 3 replicas and stop two nodes, are we still able to read from and write to the database?
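CockroachDB replicates data with Raft, which needs a majority of replicas to be available. The arithmetic behind the two scenarios can be sketched as follows, assuming the chart keeps the default replication factor of 3 (the function names are illustrative only).

```shell
# Majority (quorum) needed for a Raft group with N replicas.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of replicas that can be lost while a majority still survives.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

replicas=3
echo "replicas=$replicas quorum=$(quorum "$replicas") tolerated_failures=$(tolerated_failures "$replicas")"
# prints: replicas=3 quorum=2 tolerated_failures=1
```

Under that assumption, losing one of three nodes leaves a majority in place, while losing two does not.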
Install CockroachDB
helm install cockroachdb -n cockroachdb charts/cockroachdb
Insert data into CockroachDB
kubectl exec -it pod/cockroachdb-0 -n cockroachdb -c cockroachdb -- cockroach sql --insecure --execute="CREATE TABLE roaches (name STRING, country STRING); INSERT INTO roaches VALUES ('American Cockroach', 'United States'), ('Brownbanded Cockroach', 'United States')"
Create a pod to communicate with the CockroachDB service
helm install crdb-tester charts/tester-pod
Stop 1 Instance in AWS
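The instance can be stopped from the AWS console. If a scripted approach is preferred, one possible sketch is shown below. It assumes the node's `spec.providerID` has the usual AWS form `aws:///<availability-zone>/<instance-id>` and that `kubectl` and the `aws` CLI are both configured; the function names are hypothetical.

```shell
# Extract the EC2 instance ID from a providerID such as
# aws:///us-east-1a/i-0123456789abcdef0 (everything after the last slash).
instance_id_from_provider_id() {
  printf '%s\n' "${1##*/}"
}

# Stop the EC2 instance that backs the given node.
stop_node_instance() {
  node="$1"
  provider_id=$(kubectl get node "$node" -o jsonpath='{.spec.providerID}') || return 1
  aws ec2 stop-instances --instance-ids "$(instance_id_from_provider_id "$provider_id")"
}

# Example (node name is a placeholder):
#   stop_node_instance <node-name>
```

Pick a node that hosts a cockroachdb pod; the NODE column of the pod watch command in Commands Used shows which nodes qualify.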
Write Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="INSERT INTO roaches VALUES ('A', 'Apple'), ('B', 'Banana')"
output:
INSERT 2
Time: 860ms
Read Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="SELECT * FROM roaches;"
output:
name | country
------------------------+----------------
American Cockroach | United States
Brownbanded Cockroach | United States
A | Apple
B | Banana
(4 rows)
Time: 7ms
Stop Another Instance in AWS
Write Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="INSERT INTO roaches VALUES ('C', 'Candy'), ('D', 'Donut')"
output
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
dial tcp 172.30.252.79:26257: connect: no route to host
Failed running "sql"
command terminated with exit code 1
Read Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="SELECT * FROM roaches;"
output
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
dial tcp 172.30.252.79:26257: connect: no route to host
Failed running "sql"
command terminated with exit code 1
Start all instances.
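After the instances are started again, the database may take a while to become reachable. A small retry helper, sketched below, can poll until a query succeeds; the helper name, the attempt count, and the delay are assumptions.

```shell
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are exhausted.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Only sleep when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll every 10 seconds, up to 30 times, until a trivial query succeeds.
#   retry 30 10 kubectl exec crdb-tester -- cockroach sql --insecure \
#     --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="SELECT 1;"
```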
Uninstall CockroachDB
helm uninstall cockroachdb -n cockroachdb
Make sure everything is gone
kubectl get pod,pvc -n cockroachdb
kubectl delete pods,pvc --all -n cockroachdb
Delete tester pod
helm uninstall crdb-tester
Scenario: When a node is lost, a PoisonPillRemediation is created and the node is marked as SchedulingDisabled.
Install CockroachDB
helm install cockroachdb -n cockroachdb charts/cockroachdb
Insert data into CockroachDB
kubectl exec -it pod/cockroachdb-0 -n cockroachdb -c cockroachdb -- cockroach sql --insecure --execute="CREATE TABLE roaches (name STRING, country STRING); INSERT INTO roaches VALUES ('American Cockroach', 'United States'), ('Brownbanded Cockroach', 'United States')"
Install NodeHealthCheck and PoisonPill
helm install nhc -n openshift-operators charts/nhc-operator
Install Node Health Check and Poison Pill Configuration
helm install node-health-check -n openshift-operators charts/node-health-check
Create a pod to communicate with the CockroachDB service
helm install crdb-tester charts/tester-pod
Replace autogenerated PoisonPillConfig
kubectl replace -f -<<EOF
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  apiCheckInterval: 15s
  apiServerTimeout: 5s
  isSoftwareRebootEnabled: true
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
  safeTimeToAssumeNodeRebootedSeconds: 10
  watchdogFilePath: /dev/watchdog
EOF
Stop Instances in AWS
Write Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="INSERT INTO roaches VALUES ('C', 'Candy'), ('D', 'Donut')"
output
INSERT 2
Time: 22ms
Read Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="SELECT * FROM roaches;"
output
name | country
------------------------+----------------
American Cockroach | United States
Brownbanded Cockroach | United States
C | Candy
D | Donut
(4 rows)
Time: 352ms
Start all instances.
Uninstall CockroachDB
helm uninstall cockroachdb -n cockroachdb
Make sure everything is gone
kubectl get pod,pvc -n cockroachdb
kubectl delete pod,pvc -n cockroachdb --all
Uninstall Node Health Check and Poison Pill Configuration
helm uninstall node-health-check -n openshift-operators
Uninstall NodeHealthCheck and PoisonPill
helm uninstall nhc -n openshift-operators
Delete tester pod
helm uninstall crdb-tester
Uninstall CSVs for NodeHealthCheck and PoisonPill
kubectl delete csv -n openshift-operators node-healthcheck-operator.v0.1.0
kubectl delete csv -n openshift-operators poison-pill.v0.2.0
kubectl delete ds poison-pill-ds -n openshift-operators
Scenario: A node is not able to reach the API server due to a transient failure. The poison pill remediation marked the node as SchedulingDisabled, but once the node was back it was eventually marked as Ready, and the pod was rerun on the same node.
Install CockroachDB
helm install cockroachdb -n cockroachdb charts/cockroachdb
Insert data into CockroachDB
kubectl exec -it pod/cockroachdb-0 -n cockroachdb -c cockroachdb -- cockroach sql --insecure --execute="CREATE TABLE roaches (name STRING, country STRING); INSERT INTO roaches VALUES ('American Cockroach', 'United States'), ('Brownbanded Cockroach', 'United States')"
Install MachineHealthCheck and PoisonPill
helm install mch -n openshift-operators charts/machine-health-check
Install PoisonPillRemediationTemplate
kubectl apply -f -<<EOF
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: poison-pill-default-template
spec:
  template:
    spec: {}
EOF
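The MachineHealthCheck itself is installed by charts/machine-health-check, which is not reproduced here. For orientation, a MachineHealthCheck that hands remediation to the template above might look like the sketch below. The resource name, selector label, timeouts, and `maxUnhealthy` value are assumptions rather than the chart's actual settings.

```shell
# Sketch only: values marked as assumed are not taken from the chart.
cat > machine-health-check.yaml <<'EOF'
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health-check            # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker   # assumed selector
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s                    # assumed value
    - type: Ready
      status: "Unknown"
      timeout: 300s                    # assumed value
  maxUnhealthy: "40%"                  # assumed value
  # Delegate remediation to the Poison Pill template created above.
  remediationTemplate:
    apiVersion: poison-pill.medik8s.io/v1alpha1
    kind: PoisonPillRemediationTemplate
    name: poison-pill-default-template
    namespace: openshift-machine-api
EOF
```

The `remediationTemplate` reference is the part that ties the health check to Poison Pill; without it, the Machine API falls back to deleting the unhealthy machine.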
Create a pod to communicate with the CockroachDB service
helm install crdb-tester charts/tester-pod
Replace autogenerated PoisonPillConfig
kubectl replace -f -<<EOF
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  apiCheckInterval: 15s
  apiServerTimeout: 5s
  isSoftwareRebootEnabled: true
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
  safeTimeToAssumeNodeRebootedSeconds: 10
  watchdogFilePath: /dev/watchdog
EOF
Stop Instance in AWS
Write Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="INSERT INTO roaches VALUES ('C', 'Candy'), ('D', 'Donut')"
output
INSERT 2
Time: 22ms
Read Test
kubectl exec -it crdb-tester -- cockroach sql --insecure --host=cockroachdb-public.cockroachdb.svc.cluster.local:26257 --execute="SELECT * FROM roaches;"
output
name | country
------------------------+----------------
American Cockroach | United States
Brownbanded Cockroach | United States
C | Candy
D | Donut
(4 rows)
Time: 2ms
Start all instances.
Uninstall CockroachDB
helm uninstall cockroachdb -n cockroachdb
Make sure everything is gone
kubectl get pod,pvc -n cockroachdb
kubectl delete pod,pvc -n cockroachdb --all
Uninstall MachineHealthCheck and PoisonPill
helm uninstall mch -n openshift-operators
Delete tester pod
helm uninstall crdb-tester
Uninstall CSV for PoisonPill
kubectl delete csv -n openshift-operators poison-pill.v0.2.0
kubectl delete ds poison-pill-ds -n openshift-operators
Uninstall PoisonPillRemediationTemplate
kubectl delete -f -<<EOF
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: poison-pill-default-template
spec:
  template:
    spec: {}
EOF
Remove PoisonPillConfig
kubectl delete -f -<<EOF
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  apiCheckInterval: 15s
  apiServerTimeout: 5s
  isSoftwareRebootEnabled: true
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
  safeTimeToAssumeNodeRebootedSeconds: 10
  watchdogFilePath: /dev/watchdog
EOF
Watch pod name, node of pod, and pod status in cockroachdb
kubectl get pod -o=custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase -n cockroachdb -w
Watch PoisonPillRemediation in all namespaces
kubectl get ppr -A -w
Watch nodes (nodes are cluster-scoped, so no namespace is needed)
kubectl get nodes -w
Quickly check the controller managers
kubectl logs deploy/poison-pill-controller-manager -n openshift-operators -c manager
kubectl logs deploy/node-healthcheck-operator-controller-manager -n openshift-operators -c manager