The fabric-chaos-testing from davidkel

Support K8s in a similar manner to Docker

No details provided in client for evaluate

{"component":"CLIENT","timestamp":"2021-11-01T18:21:08.487Z","txnId":"6ae19d4db7e679a20dc01875eae813ee395b86158da20edc1845067bea7f7431","stage":"Failed","message":"10 ABORTED: evaluate call to endorser returned an error response, see attached details for more info"}

Improve the names for logging options

Current logging options are
'logOnlyOnFailure', 'AllPoints', 'Failure&Success'

which isn't very consistent. I think we should rename them to

Failure
All
Failure&Success

Note that All is not the same as Failure&Success.
Failure&Success outputs all log points for a txn only if the txn fails; for a successful txn it outputs just the single success message. All outputs all log points for a successful transaction as well.
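
As a rough sketch of how the three proposed modes could differ, assuming every mode prints the full log trail for a failed txn (that is my reading of the description above). The names `LogMode`, `TxnOutcome` and `selectOutput` are made up for illustration and are not in the client today.

```typescript
// Hypothetical names; only the three mode values come from the proposal above.
type LogMode = "Failure" | "All" | "Failure&Success";

interface TxnOutcome {
  failed: boolean;
  logPoints: string[];    // every log point recorded for the txn
  successMessage: string; // the single summary line for a successful txn
}

// Returns the lines that would be written for one finished txn.
function selectOutput(mode: LogMode, txn: TxnOutcome): string[] {
  if (txn.failed) {
    // Assumption: all modes print the full trail when a txn fails.
    return txn.logPoints;
  }
  if (mode === "All") {
    return txn.logPoints;
  }
  if (mode === "Failure&Success") {
    return [txn.successMessage];
  }
  return []; // "Failure": stay silent for successful txns
}
```

This makes the difference between All and Failure&Success visible in one place: they only diverge for successful txns.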

[EPIC] Implement a chaos summary reporter from long running chaos testing

The report creator must be able to recognise problems which we know will happen under certain conditions, and either ignore them or report them as expected
Actions are going to cause problems, eg

kill leader orderer
take down 3 or more orderers
take down gateway peer

These will generate errors, but what we are looking for is client recovery without having to restart the client or the gateway peer

Client needs to report the additional error details held in an error instance, not just the message

An error of

{"component":"CLIENT","timestamp":"2021-10-28T14:06:24.664Z","txnId":"531544e8def41638475181253759e80748377f3d47181d12d9fa11fe869337a6","stage":"Failed","message":"10 ABORTED: failed to endorse transaction, see attached details for more info"}

says there are more details, but the details aren't captured. We should capture these
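
A minimal sketch of what capturing the details could look like. The `ErrorDetail` and `ErrorWithDetails` shapes are assumptions for illustration (I have not checked them against the SDK), and `toFailureLogEntry` is a hypothetical helper. The other fields mirror the log line above.

```typescript
// Assumed shape of an error that carries per-endorser details.
interface ErrorDetail {
  address: string;
  mspId: string;
  message: string;
}

interface ErrorWithDetails {
  message: string;
  details?: ErrorDetail[];
}

interface FailureLogEntry {
  component: string;
  timestamp: string;
  txnId: string;
  stage: string;
  message: string;
  details: ErrorDetail[];
}

// Hypothetical helper: builds the "Failed" log entry and keeps the details.
function toFailureLogEntry(txnId: string, err: ErrorWithDetails, now: Date): FailureLogEntry {
  return {
    component: "CLIENT",
    timestamp: now.toISOString(),
    txnId,
    stage: "Failed",
    message: err.message,
    details: err.details ?? [], // empty when the error has no details attached
  };
}
```

Serialising the result with `JSON.stringify` would keep the one-line JSON format the client already logs in.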

testing of gateway for the following error scenarios should be in scenario tests

chaincode execution timeout (say for remaining peers staying up)
chaincode container crash (scenario test may already exist for this)

Update env sample and docs to reflect config changes

When only logging failures maybe output a message every so often with stats

Useful to know that the client is still running

optional through configuration
ability to only report if it thinks the client has stalled (ie stopped sending any further txns)
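
A possible shape for this, as a sketch only. `StallMonitor` and the wording of the periodic stats line are hypothetical; the warning text is the one the client already prints. It assumes something calls `check` once per reporting interval with the running txn total.

```typescript
// Hypothetical helper: decides what, if anything, to print each interval.
class StallMonitor {
  private lastCount = 0;

  // onlyWhenStalled = true means normal progress produces no output.
  constructor(private readonly onlyWhenStalled: boolean) {}

  // totalTxns is the running total of txns completed so far.
  check(totalTxns: number): string | undefined {
    const stalled = totalTxns === this.lastCount;
    this.lastCount = totalTxns;
    if (stalled) {
      return "WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed";
    }
    return this.onlyWhenStalled ? undefined : `Transactions completed so far: ${totalTxns}`;
  }
}
```

Both behaviours asked for above come from the single `onlyWhenStalled` flag, which could be fed from configuration.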

Test submitAsync transactions

Add support to send config transactions

Apart from business transactions we should also be able to test the effect of config transactions happening at the same time to see the effect on the system

Unsure whether to make this part of the client or have a separately running app

Config transactions that would have an effect

adding an org to a channel
removing an org from a channel
adding an orderer to a channel
removing an orderer from a channel
changing an orderer eg the address
adding/changing/removing an anchor peer
policy changes ?

Not specific to config transactions but might also want to consider

joining a peer to the channel
unjoining a peer from the channel
deploying a chaincode update (maybe include adding/changing/removing SBEs)
changing a chaincode endorsement policy
changing a private collection definition
adding chaincode to a peer without chaincode

Test adding an organisation

This should provide the capability to add Org4 to the existing 3 Org network

Use fabric-test operator to stand up and manage a fabric network

This is a manual activity to determine what is required to make use of fabric-test operator, specifically

create a network and input spec
ensure that the client can work with this network
ensure that the chaos engine can work with this network
test the network to ensure it behaves as expected with the client and chaos engine (eg scenarios that are not expected to cause failures should work because the endorsement policy is correct)

Add support for chaos engine to terminate after a specific time period, scenarios run or loops of scenarios
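
One way the three limits could be checked together, as a sketch. `TerminationLimits`, `EngineProgress` and `shouldTerminate` are hypothetical names; any limit left undefined is treated as "no limit".

```typescript
// Hypothetical configuration: each limit is optional.
interface TerminationLimits {
  maxRunTimeMs?: number;
  maxScenarios?: number;
  maxLoops?: number;
}

interface EngineProgress {
  elapsedMs: number;
  scenariosRun: number;
  loopsCompleted: number;
}

// True as soon as any configured limit has been reached.
function shouldTerminate(limits: TerminationLimits, progress: EngineProgress): boolean {
  const timeUp = limits.maxRunTimeMs !== undefined && progress.elapsedMs >= limits.maxRunTimeMs;
  const scenariosDone = limits.maxScenarios !== undefined && progress.scenariosRun >= limits.maxScenarios;
  const loopsDone = limits.maxLoops !== undefined && progress.loopsCompleted >= limits.maxLoops;
  return timeUp || scenariosDone || loopsDone;
}
```

The engine would call this between scenarios, so a scenario in progress is never cut short.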

Add support for chaincode events

chaos engine should report basic scenarios run stats at termination

Test adding a new orderer

This would add orderer6 to the current network environment

Perform manual testing of gateway peer going down cleanly

Issues raised

hyperledger/fabric-gateway#248 (appears to be fixed)
hyperledger/fabric-gateway#262 (appears to be fixed)
hyperledger/fabric-gateway#261 (documentation issue)

blocked by
github.com/hyperledger/fabric#2985

When the client app starts it should output the configuration it is using

Investigate TC tool in docker to simulate network problems

https://github.com/lukaszlach/docker-tc

Perform manual testing on evaluate when non-gateway peers terminate badly

tested by

installing chaincode on peer0 of each org
gateway peer was peer1.org1
stopped peers peer1.org2 and peer1.org3 so that the chaos engine won't include these when attempting any stop or restart

Do we want a counter to detect client not recovered or gateway peer down

At the moment, if the client exits when the last set of stats recorded happens to show that every transaction failed, it exits with code 2 to say the client/network didn't recover. We mitigate this by ensuring the chaos engine has been shut down for a period of time to allow for successful transactions. Would a counter for the number of sequential failures be useful to output as stats?

We could do something similar for no transactions being submitted/evaluated but this time we could also have a threshold value as well. If at termination we have exceeded that threshold then we will exit with 3 to indicate that the gateway peer is likely to be down. This threshold value would have to match up with when the chaos engine terminated and how long the client was left running for after to ensure that if the final scenario in chaos kills the gateway peer permanently that this will be detected.
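
A sketch of the two counters and how they could map to exit codes. `RecoveryTracker` is a hypothetical name. Exit code 2 is the existing code and 3 is the one proposed above, but returning 2 whenever the run ends on a failure streak, and letting 3 take precedence, are my assumptions.

```typescript
// Hypothetical tracker for sequential failures and idle intervals.
class RecoveryTracker {
  private failuresInARow = 0;
  private idleIntervals = 0;

  // Call for every txn that completes, successfully or not.
  recordTransaction(succeeded: boolean): void {
    this.idleIntervals = 0;
    this.failuresInARow = succeeded ? 0 : this.failuresInARow + 1;
  }

  // Call when a stats interval passes with no txns submitted or evaluated.
  recordIdleInterval(): void {
    this.idleIntervals += 1;
  }

  get sequentialFailures(): number {
    return this.failuresInARow;
  }

  // 3: gateway peer likely down, 2: client/network didn't recover, 0: ok.
  exitCode(idleThreshold: number): number {
    if (this.idleIntervals > idleThreshold) {
      return 3;
    }
    return this.failuresInARow > 0 ? 2 : 0;
  }
}
```

The `idleThreshold` still has to be chosen to match how long the client is left running after the chaos engine terminates, as described above.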

Perform manual testing of gateway peer terminating badly

Perform manual testing on non-gateway peers going down cleanly in a submit scenario

The testing here needs to be split into 2 separate tasks

test scenarios where we expect the system to continue to work because there are enough alternative peers to satisfy the used endorsement policy. We should try multiple different endorsement policies
test scenarios where we expect the system to not be able to satisfy the endorsement policy. We need to see what kind of errors clients may receive in these scenarios

Endorsement policies to test

default (Majority)
explicit Majority: OR(AND('Org1MSP.member','Org2MSP.member'),AND('Org1MSP.member','Org3MSP.member'),AND('Org3MSP.member','Org2MSP.member'))
explicit All orgs
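
To help predict which scenarios belong in which of the two tasks, here is a small sketch that checks whether a policy can still be satisfied given the orgs that have an endorsing peer up. It models only AND/OR over org principals. The `Policy` type and the `satisfiable` function are hypothetical and are not a Fabric API.

```typescript
// Hypothetical, simplified policy model: an org principal, an AND or an OR.
type Policy = { org: string } | { and: Policy[] } | { or: Policy[] };

// True if the policy can be met using only the orgs listed in availableOrgs.
function satisfiable(policy: Policy, availableOrgs: string[]): boolean {
  if ("org" in policy) {
    return availableOrgs.indexOf(policy.org) !== -1;
  }
  if ("and" in policy) {
    return policy.and.every((p) => satisfiable(p, availableOrgs));
  }
  return policy.or.some((p) => satisfiable(p, availableOrgs));
}

const org = (name: string): Policy => ({ org: name });

// The explicit Majority policy listed above.
const explicitMajority: Policy = {
  or: [
    { and: [org("Org1MSP"), org("Org2MSP")] },
    { and: [org("Org1MSP"), org("Org3MSP")] },
    { and: [org("Org3MSP"), org("Org2MSP")] },
  ],
};

// The explicit All orgs policy.
const allOrgs: Policy = { and: [org("Org1MSP"), org("Org2MSP"), org("Org3MSP")] };
```

For example, with only Org1 and Org3 peers up, `satisfiable(explicitMajority, ["Org1MSP", "Org3MSP"])` is true while the same call with `allOrgs` is false, so the first belongs in the "continues to work" task and the second in the "cannot satisfy" task.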

Perform manual testing on evaluate when non-gateway peers go down cleanly

This is going to be trickier to test. If the chaos engine runs in evaluate mode only and the scenarios always leave the gateway peer up, then the gateway peer is always going to be chosen for the evaluate. The only slight variation is that it may try the other peer in the org first if the two have the same block height, but there is no guarantee of that. (Would need to confirm with Andy, but I suspect that the gateway peer will always be the first peer tried if the block heights are identical, to cut down on any unnecessary network traffic.)

Test changing an address of a node

This would allow us to test the following scenarios

change of a hostname for a peer
change of a hostname for an orderer
change of a port for a peer
change of a port for an orderer

Implement Java Client

add support for private data as this also has an influence

Perform manual testing on non-gateway peers terminate badly in a submit scenario

Test with a majority implicit EP only; I don't see a need to test others, although I will do so if time allows

Issues

Perform manual testing on ordering nodes terminating badly in a submit scenario

Note that if you kill the leader, that will cause problems to occur
For this test I turned off submit timeouts because of the time it can take to move on to another orderer if one goes down (up to 20 seconds has been seen).

Remove client timeout handling and use SDK timeout support instead

We should move our test client to use the official timeouts. It would also be good to test that the client doesn't remain hanging on exit if timeouts or problems are hit (this would mean changing the client to just exit the loop when ctrl-c is received rather than calling process.exit())

[EPIC] Define and deploy solution as an automated service

The current proposal for fabric test is 2 build runs

bring down and up a set of peers which should never cause an issue
complete chaos where everything goes down and up which will cause failures but client always recovers without being restarted

Implement Go Client

Client sort of detects if the gateway peer is down

If the gateway peer goes down and stays down then the client just loops with the message

{"component":"CLIENT","timestamp":"2021-12-24T11:25:05.608Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:10.608Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:15.610Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:20.610Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:25.616Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:30.616Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:35.617Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:40.621Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:45.626Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:50.626Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:55.628Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:00.628Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
request to terminate received, stopping......
{"component":"CLIENT","timestamp":"2021-12-24T11:26:05.629Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:10.630Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:15.631Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:16.976Z","stage":"FINAL-STATS","message":"Submit: good=200, bad=5. Evals: good=310, bad=10"}

I assume this is down to the waitForReady call we have. The reason it doesn't terminate straight away is probably that it is waiting for a timeout, most likely the waitForReady gRPC timeout.

Issue being raised to capture this behaviour so it's documented and also a consideration as to whether it could be reported as an exit code

Perform manual testing on ordering nodes going down cleanly in a submit scenario

need to test

enough orderers remain to achieve consensus
not enough orderers remain to achieve consensus
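
For working out which side of the line a scenario falls on, a small sketch assuming a Raft ordering service, where a majority of the orderers must be up. The function names are hypothetical, and the 5-orderer figure in the note below is inferred from the "add orderer6" issue rather than stated anywhere.

```typescript
// Raft needs a strict majority of the configured orderers to make progress.
function raftQuorum(totalOrderers: number): number {
  return Math.floor(totalOrderers / 2) + 1;
}

// True if the orderers left running can still reach consensus.
function canOrder(totalOrderers: number, stoppedOrderers: number): boolean {
  return totalOrderers - stoppedOrderers >= raftQuorum(totalOrderers);
}
```

With 5 orderers the quorum is 3, so stopping 2 should leave ordering working, while stopping 3 or more should not.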

Test what happens if chaincode is not installed on all the peers

Chaos engine to support load balanced gateway peers

linked to #8

chaos engine should report extended stats on termination

I was thinking along the lines of

number of each different scenarios run (useful for random mode)
number of times each orderer was stopped and started
number of times each non gateway peer was stopped and started
number of times the gateway peer was stopped and started

Add support for client to output final stats at termination and also optionally set an exit code if failures occurred

When the client terminates it should output final stats
if an env variable such as FailOnExit is set to true then the client should also terminate with a non-zero exit code if any failures were detected
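
A minimal sketch of both parts. `FinalStats`, `finalStatsMessage` and `exitCodeFor` are hypothetical names, and using 1 as the non-zero code is an assumption. The message layout follows the FINAL-STATS format, eg "Submit: good=200, bad=5. Evals: good=310, bad=10".

```typescript
interface FinalStats {
  submitGood: number;
  submitBad: number;
  evalGood: number;
  evalBad: number;
}

// Builds the message body for the FINAL-STATS log line.
function finalStatsMessage(s: FinalStats): string {
  return `Submit: good=${s.submitGood}, bad=${s.submitBad}. Evals: good=${s.evalGood}, bad=${s.evalBad}`;
}

// failOnExit would come from the FailOnExit env variable (assumed to be parsed elsewhere).
function exitCodeFor(s: FinalStats, failOnExit: boolean): number {
  const anyFailures = s.submitBad + s.evalBad > 0;
  return failOnExit && anyFailures ? 1 : 0;
}
```

In a Node client the result could be assigned to `process.exitCode` rather than passed to `process.exit()`, so the process still shuts down on its own.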

Test what happens if some peers are not endorsing peers

Apparently this could be done by not setting the CORE_PEER_GOSSIP_EXTERNALENDPOINT

means they should not be used to endorse for evaluate or submit, but will just commit

Implement automatic chaincode event listening recovery in client

In the client, if the gateway peer is restarted the chaincode event listener stops completely. We need to implement automatic recovery of the listener so that the chaos engine can continue to test. It needs to be good enough not to miss events, but at the same time must not flood the client with constant repeated checks
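
The two halves of that requirement could be sketched as below: a capped exponential backoff so reconnect attempts don't flood the client, and a tracker that remembers where listening got to so the last block can be replayed on reconnect without missing or double-counting events. `backoffDelayMs` and `ResumeTracker` are hypothetical names, and identifying events by block number plus transaction id is an assumption about what the listener receives.

```typescript
// Delay before reconnect attempt N (0-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical tracker of the last event position handled by the client.
class ResumeTracker {
  private lastBlock: number | undefined;
  private seenInLastBlock: string[] = [];

  // Block to restart listening from; undefined means no event has been seen yet.
  startBlock(): number | undefined {
    return this.lastBlock;
  }

  // Returns true if the event is new and should be processed.
  accept(blockNumber: number, transactionId: string): boolean {
    if (this.lastBlock !== undefined && blockNumber < this.lastBlock) {
      return false; // older than anything we have already handled
    }
    if (blockNumber === this.lastBlock) {
      if (this.seenInLastBlock.indexOf(transactionId) !== -1) {
        return false; // replayed event from the block we restarted on
      }
      this.seenInLastBlock.push(transactionId);
      return true;
    }
    this.lastBlock = blockNumber;
    this.seenInLastBlock = [transactionId];
    return true;
  }
}
```

Restarting from `startBlock()` rather than the block after it is deliberate: the listener may have died part way through a block, so the block is replayed and the duplicates are filtered out.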

Move from a less restrictive to more restrictive endorsement policy
Move from a more restrictive to a less restrictive endorsement policy
