Overview
This issue outlines a set of improvements that should be made in the near term and the longer term to allow the core team to run QA tests more quickly and with less effort.
During the release of Tendermint v0.37.x, we executed the steps of the new release process outlined in RELEASES.md to ensure there was no clear regression in the quality of the software. The steps were quite manual, requiring the operator to run a series of scripts from their local machine to set up the instances, generate the configuration files, start the processes, run the load, and capture the results. Having run this process once, we have demonstrated that large-scale testnets on virtual machines are a reasonable way to test Tendermint, and we have learned a lot about how to orchestrate a large Tendermint network. We should improve this process to reduce the effort required to run the QA process and capture the results.
Near term improvements
This section suggests a set of changes that should be implemented within the next 1-1.5 quarters. These changes largely consist of migrating logic from scripts in the tendermint-testnet repository into the Tendermint e2e test runner, which performs a similar set of functions using docker instances on a local network. The logic, as implemented in the testnet repository, is written as a set of shell scripts and ansible playbooks that are not very portable, are not tolerant of transient failures in the network and the Digital Ocean API, and are difficult to add functionality to because of their already high complexity.
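One benefit of moving this logic into Go is that tolerance of transient failures becomes easy to express. The following is a minimal sketch only: `retry` and `errTransient` are hypothetical names, not part of the existing e2e runner, and the backoff policy is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary network or cloud API failure.
var errTransient = errors.New("transient failure")

// retry calls op up to attempts times, doubling the wait after each failure.
// It is a hypothetical helper, not part of the existing e2e runner.
func retry(attempts int, wait time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Do not sleep after the final attempt.
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated API call that fails twice before succeeding.
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A real implementation would likely distinguish retryable errors (timeouts, rate limits) from permanent ones rather than retrying everything.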
Runner generates the network configuration files
Currently, a bash script and ansible playbook create the set of Tendermint configuration files for the test network and copy them to the testnet machines. This logic can and should be moved to the e2e runner. The runner is already used for most of the testnet configuration generation, with the bash script only updating a few config values and the IP addresses so that they match those of the Digital Ocean infrastructure.
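As an illustration of the kind of IP substitution the bash script performs today, the sketch below builds a `persistent_peers` value in Tendermint's `id@host:port` peer address format. The `node` type, the node names, the IDs, and the addresses are all hypothetical; the port is Tendermint's default P2P port.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical pairing of a Tendermint node ID with the public
// IP address assigned to its machine.
type node struct {
	Name string
	ID   string
	IP   string
}

const p2pPort = 26656 // Tendermint's default P2P port

// persistentPeers builds the persistent_peers config value for the node
// named self: every other node as "id@ip:port", comma-separated.
func persistentPeers(nodes []node, self string) string {
	peers := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n.Name == self {
			continue // a node does not list itself as a peer
		}
		peers = append(peers, fmt.Sprintf("%s@%s:%d", n.ID, n.IP, p2pPort))
	}
	sort.Strings(peers) // stable output regardless of input order
	return strings.Join(peers, ",")
}

func main() {
	// Placeholder IDs and documentation-range IPs, for illustration only.
	nodes := []node{
		{Name: "validator00", ID: "id00", IP: "192.0.2.1"},
		{Name: "validator01", ID: "id01", IP: "192.0.2.2"},
		{Name: "validator02", ID: "id02", IP: "192.0.2.3"},
	}
	fmt.Println(persistentPeers(nodes, "validator00"))
}
```

With the IPs reported by the infrastructure available to the runner, a value like this could be written into each node's config at generation time instead of being patched in afterwards.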
Runner adds the load to the network
The e2e test runner currently generates load for the e2e tests. This logic could be extended to generate transactions for the release testnets.
The release testnets require a transaction data format that is more specific than what the nightly tests currently use. The data format can be ported over to be used by the e2e runner. Additional work will be needed to incorporate the tm-loadtest periodic load generation logic into the runner.
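A rough sketch of periodic load generation is shown below. It is not the tm-loadtest logic and does not reproduce the release testnet transaction format: `makeTx`, `generateLoad`, and the payload layout are hypothetical, and the `send` callback stands in for whatever client call the runner would use to broadcast a transaction.

```go
package main

import (
	"fmt"
	"time"
)

// makeTx builds a hypothetical payload tagged with an experiment name and a
// sequence number so that results can later be attributed to a given run.
func makeTx(experiment string, seq int) []byte {
	return []byte(fmt.Sprintf("%s/%06d", experiment, seq))
}

// generateLoad sends batch transactions per round for the given number of
// rounds, pausing interval between rounds. It returns how many were sent.
// A production version would probably use a ticker to avoid drift.
func generateLoad(rounds, batch int, interval time.Duration,
	experiment string, send func([]byte) error) (int, error) {
	sent := 0
	for r := 0; r < rounds; r++ {
		for i := 0; i < batch; i++ {
			if err := send(makeTx(experiment, sent)); err != nil {
				return sent, err
			}
			sent++
		}
		if r < rounds-1 {
			time.Sleep(interval)
		}
	}
	return sent, nil
}

func main() {
	// Print each transaction instead of broadcasting it.
	sent, err := generateLoad(2, 3, 100*time.Millisecond, "exp1",
		func(tx []byte) error {
			fmt.Println(string(tx))
			return nil
		})
	fmt.Println("sent:", sent, "err:", err)
}
```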
Runner starts and stops the processes
The nightly e2e runner currently starts and stops the Tendermint docker instances during the nightly tests. This logic can be adapted to start and stop the Tendermint process on remote nodes during a large testnet running on many machines.
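One possible shape for the remote case is sketched below: the runner builds an ssh invocation per node rather than a docker command. This assumes Tendermint runs as a systemd unit named `tendermint` and that the runner can log in as `root`; both are assumptions for illustration, and `remoteCommand` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os/exec"
)

// action is the operation to perform on the remote Tendermint process.
type action string

const (
	start action = "start"
	stop  action = "stop"
)

// remoteCommand returns the ssh invocation that would start or stop
// Tendermint on the machine at ip. The unit name and login user are
// assumptions; the command is only constructed here, not executed.
func remoteCommand(ip string, a action) *exec.Cmd {
	return exec.Command("ssh", "root@"+ip, "systemctl", string(a), "tendermint")
}

func main() {
	// Documentation-range addresses, for illustration only.
	for _, ip := range []string{"192.0.2.1", "192.0.2.2"} {
		fmt.Println(remoteCommand(ip, stop).Args)
	}
}
```

Keeping command construction separate from execution would let the same start and stop sequencing drive either docker instances or remote machines.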
Runner retrieves the data
Currently, retrieving the Tendermint blockstore and the prometheus data captured during the large scale testnet is a manual process performed with a pair of ansible playbooks 1 2. The data is collected by the network operator upon completion of the test. The data is then manually uploaded to Digital Ocean storage.
This procedure can be automated and combined into the runner process. Upon completion of the test, the runner can fetch the blockstore and the prometheus database and automatically upload them to Digital Ocean, either by placing them onto a mounted drive that is intended for reuse, or by uploading them directly to a Digital Ocean 'space'.
Long term improvements
Runner manages the infrastructure
In the long term, the runner should be improved to directly manage the infrastructure running the testnet. This means the runner, running on a single Digital Ocean instance, should be updated to be able to spawn and destroy all of the necessary droplets.
Managing a fleet of infrastructure is complex, and existing tools and practices, such as Terraform run from the command line, have many advantages. Terraform provides a declarative syntax, idempotent requests for resource creation, and built-in definitions for many Digital Ocean resource types.
A future version of the runner should be augmented to perform the role of resource creation and destruction without operator intervention. This would need to be carefully done so as to avoid any possible scenarios where the tool provisions too many resources or fails to destroy resources and leaves them running indefinitely. This is listed as a long term improvement because it is complex and will take more careful consideration.
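The two failure scenarios above, provisioning too many resources and leaving resources running, can be guarded against structurally. The sketch below is one way to do so under stated assumptions: `provider` is a hypothetical abstraction over the cloud API, `fakeProvider` is an in-memory stand-in, and the droplet naming is invented.

```go
package main

import (
	"errors"
	"fmt"
)

// provider is a hypothetical abstraction over the cloud API.
type provider interface {
	Create(name string) error
	Destroy(name string) error
}

var errTooMany = errors.New("requested droplet count exceeds limit")

// withTestnet creates count droplets, runs the test, and destroys every
// droplet it created, even when creation or the test itself fails.
func withTestnet(p provider, count, limit int, run func() error) (err error) {
	if count > limit {
		return errTooMany // refuse before creating anything
	}
	var created []string
	defer func() {
		for _, name := range created {
			if derr := p.Destroy(name); derr != nil {
				err = fmt.Errorf("destroy %s: %v (previous error: %v)", name, derr, err)
			}
		}
	}()
	for i := 0; i < count; i++ {
		name := fmt.Sprintf("testnet-node-%02d", i)
		if cerr := p.Create(name); cerr != nil {
			return cerr
		}
		created = append(created, name)
	}
	return run()
}

// fakeProvider tracks live droplets in memory and can simulate a failure.
type fakeProvider struct {
	live   map[string]bool
	failOn string
}

func (f *fakeProvider) Create(name string) error {
	if name == f.failOn {
		return errors.New("create failed: " + name)
	}
	f.live[name] = true
	return nil
}

func (f *fakeProvider) Destroy(name string) error {
	delete(f.live, name)
	return nil
}

func main() {
	p := &fakeProvider{live: map[string]bool{}, failOn: "testnet-node-02"}
	err := withTestnet(p, 4, 10, func() error { return nil })
	fmt.Println("err:", err, "droplets left running:", len(p.live))
}
```

This does not cover the runner process itself crashing mid-test, which is part of why the change needs more careful consideration.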
Runner triggered from a GitHub Action upon release
Once the runner is able to provision resources in Digital Ocean, run the entire suite automatically, and retrieve and upload the results, it should be enhanced to be started from a GitHub Action when a release is triggered.
Overall TODO
Original issue: tendermint/tendermint#9580