fabric-chaos-testing's People

Contributors

davidkel, sapthasurendran


fabric-chaos-testing's Issues

No details provided in client for evaluate

{"component":"CLIENT","timestamp":"2021-11-01T18:21:08.487Z","txnId":"6ae19d4db7e679a20dc01875eae813ee395b86158da20edc1845067bea7f7431","stage":"Failed","message":"10 ABORTED: evaluate call to endorser returned an error response, see attached details for more info"}

Improve the names for logging options

The current logging options are 'logOnlyOnFailure', 'AllPoints' and 'Failure&Success', which isn't a very consistent set of names. I think we should rename them to

  • Failure
  • All
  • Failure&Success

Note that All is not the same as Failure&Success.
Failure&Success outputs all log points for a txn only if the txn fails; for a successful txn it outputs just the single success message. All outputs all log points for a successful transaction as well.
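
A minimal sketch of how the three proposed modes differ, assuming the log points for a txn are buffered until its outcome is known. The names `LogMode`, `LogPoint` and `selectOutput` are hypothetical and not taken from the client code.

```typescript
// Hypothetical names; this only illustrates the proposed semantics.
type LogMode = 'Failure' | 'All' | 'Failure&Success';

interface LogPoint {
  stage: string;
  message: string;
}

// Decide which buffered log points to emit once a transaction has finished.
// Assumes the last buffered point of a successful txn is its success message.
function selectOutput(mode: LogMode, points: LogPoint[], failed: boolean): LogPoint[] {
  if (mode === 'All') {
    return points; // every log point, for failed and successful txns alike
  }
  if (failed) {
    return points; // 'Failure' and 'Failure&Success' both emit everything on failure
  }
  if (mode === 'Failure&Success') {
    return points.slice(-1); // only the single success message
  }
  return []; // 'Failure': nothing at all for a successful txn
}
```

With three buffered points for a successful txn, `All` emits three lines, `Failure&Success` emits one and `Failure` emits none.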

[EPIC] Implement a chaos summary reporter from long running chaos testing

The report creator must be able to recognise problems that we know will happen under certain conditions, and either ignore them or report them as expected.
Some chaos actions are going to cause problems, e.g.

  • kill leader orderer
  • take down 3 or more orderers
  • take down gateway peer

These will generate errors, but what we are looking for is client recovery without having to restart the client or the gateway peer.
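
One possible approach, sketched under the assumption that the reporter has timestamps for both the chaos actions and the client failures: treat a failure as expected if it falls within a grace window after a disruptive action. The types, the `classifyFailures` function and the `graceMs` parameter are all hypothetical.

```typescript
// Hypothetical record shapes; timestamps are epoch milliseconds.
interface ChaosAction {
  name: string; // e.g. 'kill leader orderer'
  at: number;
}

interface FailureRecord {
  txnId: string;
  at: number;
}

interface Classified {
  expected: FailureRecord[];
  unexpected: FailureRecord[];
}

// A failure counts as expected if it occurred within graceMs after any action.
function classifyFailures(
  failures: FailureRecord[],
  actions: ChaosAction[],
  graceMs: number
): Classified {
  const result: Classified = { expected: [], unexpected: [] };
  for (const failure of failures) {
    const explained = actions.some(
      (action) => failure.at >= action.at && failure.at <= action.at + graceMs
    );
    (explained ? result.expected : result.unexpected).push(failure);
  }
  return result;
}
```

Anything left in `unexpected` after the grace window is a candidate for a real recovery problem.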

Client needs to report additional error in an error instance not just the message

An error of

{"component":"CLIENT","timestamp":"2021-10-28T14:06:24.664Z","txnId":"531544e8def41638475181253759e80748377f3d47181d12d9fa11fe869337a6","stage":"Failed","message":"10 ABORTED: failed to endorse transaction, see attached details for more info"}

says there are more details, but the details aren't captured. We should capture these
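
A rough sketch of capturing the details, assuming the thrown error carries a `details` array of per-peer entries with `address`, `mspId` and `message` fields. That shape, and the names `ErrorDetailLike` and `describeError`, are assumptions made for illustration.

```typescript
// Assumed shape of one entry in the error's details array.
interface ErrorDetailLike {
  address: string;
  mspId: string;
  message: string;
}

function isDetail(value: unknown): value is ErrorDetailLike {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const v = value as Record<string, unknown>;
  return (
    typeof v.address === 'string' &&
    typeof v.mspId === 'string' &&
    typeof v.message === 'string'
  );
}

// Extract the message plus any attached details so both can be logged.
function describeError(err: unknown): { message: string; details: ErrorDetailLike[] } {
  const message = err instanceof Error ? err.message : String(err);
  const raw =
    typeof err === 'object' && err !== null
      ? (err as { details?: unknown }).details
      : undefined;
  const details = Array.isArray(raw) ? raw.filter(isDetail) : [];
  return { message, details };
}
```

The returned object could be added to the "Failed" log entry in place of the bare message string.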

Add support to send config transactions

As well as business transactions, we should be able to send config transactions at the same time to see what effect they have on the system

Unsure whether to make this part of the client or have a separately running app

Config transactions that would have an effect

  • adding an org to a channel
  • removing an org from a channel
  • adding an orderer to a channel
  • removing an orderer from a channel
  • changing an orderer eg the address
  • adding/changing/removing an anchor peer
  • policy changes ?

Not specific to config transactions but might also want to consider

  • joining a peer to the channel
  • unjoining a peer from the channel
  • deploying a chaincode update (maybe include adding/changing/removing SBEs)
  • changing a chaincode endorsement policy
  • changing a private collection definition
  • adding chaincode to a peer without chaincode

Use fabric-test operator to stand up and manage a fabric network

This is a manual activity to determine what is required to make use of fabric-test operator, specifically

  • create a network and input spec
  • ensure that the client can work with this network
  • ensure that the chaos engine can work with this network
  • test the network to ensure it behaves as expected with the client and chaos engine (e.g. no-failure scenarios should work because the endorsement policy is correct)

Do we want a counter to detect client not recovered or gateway peer down

At the moment, if the client exits when the last set of stats recorded happens to show that every transaction failed, it exits with code 2 to say the client/network didn't recover. We mitigate this by ensuring the chaos engine has been shut down for a period of time before the client exits, to allow for successful transactions. Would a counter for the number of sequential failures be useful to output as stats?

We could do something similar for no transactions being submitted/evaluated, but this time with a threshold value as well. If we have exceeded that threshold at termination, we exit with code 3 to indicate that the gateway peer is likely to be down. The threshold would have to match up with when the chaos engine terminated and how long the client was left running afterwards, so that if the final chaos scenario kills the gateway peer permanently, this is detected.
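
A sketch of the two counters and how they might map onto exit codes. The class and method names are hypothetical, and it assumes the client records a good/bad count once per stats interval.

```typescript
// Hypothetical tracker, fed once per stats interval.
class RecoveryTracker {
  private consecutiveFailedIntervals = 0;
  private idleIntervals = 0;

  recordInterval(good: number, bad: number): void {
    if (good + bad === 0) {
      this.idleIntervals++; // nothing submitted or evaluated in this interval
      return;
    }
    this.idleIntervals = 0;
    if (good === 0) {
      this.consecutiveFailedIntervals++; // every txn in this interval failed
    } else {
      this.consecutiveFailedIntervals = 0;
    }
  }

  // Counters that could be output as part of the stats.
  stats(): { consecutiveFailedIntervals: number; idleIntervals: number } {
    return {
      consecutiveFailedIntervals: this.consecutiveFailedIntervals,
      idleIntervals: this.idleIntervals,
    };
  }

  // 3: gateway peer likely down, 2: client/network did not recover, 0: ok.
  exitCode(idleThreshold: number): number {
    if (this.idleIntervals > idleThreshold) {
      return 3;
    }
    if (this.consecutiveFailedIntervals > 0) {
      return 2;
    }
    return 0;
  }
}
```

As noted above, `idleThreshold` would need to be chosen to fit how long the client keeps running after the chaos engine has terminated.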

Perform manual testing on non-gateway peers going down cleanly in a submit scenario

The testing here needs to be split into 2 separate tasks

  1. test scenarios where we expect the system to continue to work because there are enough alternative peers to satisfy the used endorsement policy. We should try multiple different endorsement policies
  2. test scenarios where we expect the system to not be able to satisfy the endorsement policy. We need to see what kind of errors clients may receive in these scenarios

Endorsement policies to test

  • default (Majority)
  • explicit Majority: OR(AND('Org1MSP.member','Org2MSP.member'),AND('Org1MSP.member','Org3MSP.member'),AND('Org3MSP.member','Org2MSP.member'))
  • explicit All orgs
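
For reference, the explicit Majority policy above can be generated for any number of orgs by enumerating every combination of the smallest majority size. This is only a sketch; the function names are made up, and it also includes a helper to predict which of the two test tasks a scenario belongs to.

```typescript
// All combinations of `size` items, preserving input order.
function combinations<T>(items: T[], size: number): T[][] {
  if (size === 0) {
    return [[]];
  }
  if (items.length < size) {
    return [];
  }
  const head = items[0] as T;
  const rest = items.slice(1);
  const withHead = combinations(rest, size - 1).map((combo) => [head].concat(combo));
  return withHead.concat(combinations(rest, size));
}

// Smallest number of orgs that forms a strict majority.
function majoritySize(orgCount: number): number {
  return Math.floor(orgCount / 2) + 1;
}

// Build an explicit OR-of-ANDs policy string equivalent to Majority.
function explicitMajorityPolicy(mspIds: string[]): string {
  const clauses = combinations(mspIds, majoritySize(mspIds.length)).map(
    (group) => `AND(${group.map((id) => `'${id}.member'`).join(',')})`
  );
  return `OR(${clauses.join(',')})`;
}

// True if a Majority policy can still be satisfied with this many orgs up.
function majoritySatisfiable(totalOrgs: number, orgsWithPeerUp: number): boolean {
  return orgsWithPeerUp >= majoritySize(totalOrgs);
}

console.log(explicitMajorityPolicy(['Org1MSP', 'Org2MSP', 'Org3MSP']));
// OR(AND('Org1MSP.member','Org2MSP.member'),AND('Org1MSP.member','Org3MSP.member'),AND('Org2MSP.member','Org3MSP.member'))
```

The generated string lists the last pair as Org2MSP then Org3MSP, which is logically the same as the policy written above.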

Perform manual testing on evaluate when non-gateway peers go down cleanly

This is going to be trickier to test. If the chaos engine runs in evaluate mode only and the scenarios always leave the gateway peer up, then the gateway peer is always going to be chosen for the evaluate. The only slight variation is that it may try the other peer in the org first if the two have the same block height, but there is no guarantee of that. (Would need to confirm with Andy, but I suspect the gateway peer will always be the first peer tried when the block heights are identical, to cut down on unnecessary network traffic.)

Test changing an address of a node

This would allow us to test the following scenarios

  • change of a hostname for a peer
  • change of a hostname for an orderer
  • change of a port for a peer
  • change of a port for an orderer

Remove client timeout handling and use SDK timeout support instead

We should move our test client to use the official SDK timeouts. It would also be good to test that the client doesn't remain hanging on exit if timeouts or problems are hit. That would mean changing the client to just exit the loop when ctrl-c is received, rather than calling process.exit().

[EPIC] Define and deploy solution as an automated service

The current proposal for fabric test is 2 build runs

  • bring down and up a set of peers which should never cause an issue
  • complete chaos where everything goes down and up, which will cause failures, but the client always recovers without being restarted

Client sort of detects if the gateway peer is down

If the gateway peer goes down and stays down then the client just loops with the message

{"component":"CLIENT","timestamp":"2021-12-24T11:25:05.608Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:10.608Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:15.610Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:20.610Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:25.616Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:30.616Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:35.617Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:40.621Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:45.626Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:50.626Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:25:55.628Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:00.628Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
request to terminate received, stopping......
{"component":"CLIENT","timestamp":"2021-12-24T11:26:05.629Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:10.630Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:15.631Z","stage":"STATS","message":"WARNING: Client/Network may have stalled, no new transactions are being evaluated or endorsed"}
{"component":"CLIENT","timestamp":"2021-12-24T11:26:16.976Z","stage":"FINAL-STATS","message":"Submit: good=200, bad=5. Evals: good=310, bad=10"}

I assume this is down to the waitForReady call we have. The reason it doesn't terminate straight away is most likely that it is waiting for a timeout, probably the waitForReady gRPC timeout.

This issue is being raised to capture the behaviour so that it's documented, and to consider whether it could be reported as an exit code.

chaos engine should report extended stats on termination

I was thinking along the lines of

  • number of times each scenario was run (useful for random mode)
  • number of times each orderer was stopped and started
  • number of times each non-gateway peer was stopped and started
  • number of times the gateway peer was stopped and started
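
As a rough illustration of the counters listed above (the class name, method names and output shape are all hypothetical):

```typescript
type Counts = Record<string, number>;

// Hypothetical stats collector for the chaos engine.
class ChaosStats {
  private readonly scenarioRuns: Counts = {};
  private readonly nodeRestarts: Counts = {};

  // Call each time a scenario is run.
  recordScenario(name: string): void {
    this.scenarioRuns[name] = (this.scenarioRuns[name] || 0) + 1;
  }

  // Call each time a node (orderer, peer or gateway peer) is stopped and started.
  recordRestart(node: string): void {
    this.nodeRestarts[node] = (this.nodeRestarts[node] || 0) + 1;
  }

  // One JSON line that could be output on termination.
  report(): string {
    return JSON.stringify({
      component: 'CHAOS',
      stage: 'FINAL-STATS',
      scenarioRuns: this.scenarioRuns,
      nodeRestarts: this.nodeRestarts,
    });
  }
}
```

The `component` and `stage` values mirror the style of the client's log lines but are made up here.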

Implement automatic chaincode event listening recovery in client

In the client, if the gateway peer is restarted, the chaincode event listener stops completely. We need to implement automatic recovery of the listener so that the chaos engine can continue to test. The recovery needs to be good enough not to miss events, but at the same time it must not flood the client with constant repeated checks.
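
A sketch of the two pieces this would need, under some assumptions: a capped exponential backoff between reconnect attempts, and a checkpoint that remembers the last block seen so the listener can replay from it and drop duplicates. Names are hypothetical, and block numbers are plain numbers here for simplicity.

```typescript
// Delay before the n-th reconnect attempt (0-based), capped so that the
// client is not flooded with constant checks.
function reconnectDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Tracks progress so a restarted listener can replay from the last block
// without missing events or reporting the same one twice.
class EventCheckpoint {
  private block = -1;
  private seenInBlock: Record<string, boolean> = {};

  // Returns true if the event is new and should be processed.
  accept(blockNumber: number, txnId: string): boolean {
    if (blockNumber < this.block) {
      return false; // older than anything still being tracked
    }
    if (blockNumber > this.block) {
      this.block = blockNumber;
      this.seenInBlock = {};
    }
    if (this.seenInBlock[txnId]) {
      return false; // duplicate from replaying the current block
    }
    this.seenInBlock[txnId] = true;
    return true;
  }

  // Block to restart listening from, or undefined if nothing seen yet.
  resumeBlock(): number | undefined {
    return this.block < 0 ? undefined : this.block;
  }
}
```

On reconnect the listener would start again from `resumeBlock()` and pass every replayed event through `accept` before handing it to the client.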

Evaluate success message should include more info

{"component":"CLIENT","timestamp":"2021-11-01T17:50:32.106Z","txnId":"bdb747da265d8a34bc34b5b3257458883abbf1d2679bc1499e52daa21a67d4ac","stage":"Evaluated","message":""}

It should include the txn name, params etc., similar to submit

Test changing an endorsement policy

Add support to dynamically change the endorsement policy and test the following scenarios

  • Move from a less restrictive to more restrictive endorsement policy
  • Move from a more restrictive to a less restrictive endorsement policy
