
chogori-platform's People

Contributors

ahsank, ammuv, ccjeff, dlifw, ivan-avramov, jerryhfeng, jfunston, johnfangafw, mankan1, piggesthjy, raduvine, yazhifeng, zsstrike


chogori-platform's Issues

Fix silent conversion of commits to aborts in 3SI client

If a txn operation (read, write, or heartbeat) fails, the txn must be aborted. The 3SI client ensures that it is aborted, but if the user then tries to commit it, the client still returns a 200 OK. This could cause bugs in the user application. The solution is to return a 4xx error if the user tries to commit a failed transaction.
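A minimal sketch of the proposed behavior, under stated assumptions: `Status`, `TxnHandle`, `markFailed`, and the choice of 409 as the 4xx code are all hypothetical and only illustrate the idea; the real client uses its own status and transaction types.

```cpp
#include <string>

// Hypothetical status type (assumption, not the client's real Status class).
struct Status {
    int code;
    std::string message;
    bool is2xxOK() const { return code >= 200 && code < 300; }
};

// Hypothetical stand-in for a 3SI transaction handle.
class TxnHandle {
public:
    // Called by the client when a read, write, or heartbeat fails.
    void markFailed() { _failed = true; }

    // Ends the transaction. shouldCommit=false is an explicit abort.
    Status end(bool shouldCommit) {
        if (_failed && shouldCommit) {
            // The txn is aborted either way, but the caller is told so
            // instead of receiving a 200 OK. 409 is a placeholder 4xx code.
            return {409, "commit requested for a failed txn; txn was aborted"};
        }
        return {200, shouldCommit ? "committed" : "aborted"};
    }

private:
    bool _failed = false;
};
```

With this shape, an explicit abort of a failed txn still succeeds, and only the commit-after-failure path surfaces an error to the application.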

Handle end txn error cases in SKV Client

If there are ongoing read or write requests, we should not issue an end request; instead we should throw an exception to indicate a user bug. The same applies to issuing more requests after an end request.
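One way these checks could look is sketched below. `TxnStateGuard` and its method names are hypothetical, and `std::logic_error` is used only as a placeholder exception type; the sketch assumes the client can count in-flight operations per transaction.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical per-transaction guard that tracks in-flight operations.
class TxnStateGuard {
public:
    // Call before issuing a read or write.
    void beginOp() {
        if (_ended) {
            throw std::logic_error("operation issued after end request");
        }
        ++_inFlight;
    }

    // Call when a read or write has completed (successfully or not).
    void completeOp() {
        if (_inFlight > 0) {
            --_inFlight;
        }
    }

    // Call before issuing the end request.
    void end() {
        if (_ended) {
            throw std::logic_error("end request issued twice");
        }
        if (_inFlight > 0) {
            throw std::logic_error("end request issued with operations in flight");
        }
        _ended = true;
    }

    std::size_t inFlight() const { return _inFlight; }

private:
    std::size_t _inFlight = 0;
    bool _ended = false;
};
```

Throwing here signals a programming error in the application, as opposed to a transient failure that a retry could fix.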

Implement a test for these cases.

Document reasoning behind the design choice.

Fix SKV client for dynamic schema usage

SKV client works for static schemas such as with TPC-C but does not work for the dynamic schema case of reading and writing SKVRecords. Fix it and add tests for it.

Unit Test Cases for SKVRecord

Add more unit test cases for SKVRecord.

References

docs/SKV.md (includes overview of SKVRecord)
test/k23si/SKVRecordTest.cpp (existing unit test cases)
src/k2/dto/SKVRecord.h (SKVRecord interface)
src/k2/dto/FieldTypes.cpp (String conversions for creating keys)

Happy-path test cases

| Test case | Expected Result |
| --- | --- |
| Serialize a record with composite partition and range keys (e.g. partition key is string field + uint32_t field + string field) | Byte sequence of partitionKey string and rangeKey string are as expected with proper encoding and NULL byte separators |
| Serialize a record with a composite partition key and one key field NULL | Byte sequence of partitionKey is as expected with NULL field encoding |
| Serialize a record with a composite partition key and one key field (designated NullLast) is NULL | Byte sequence of partitionKey is as expected with NULL last field encoding |
| Serialize a record with one value field skipped (i.e. using "skipNext()") | Fields can be deserialized successfully using the deserializeNextOptional function and with the FOR_EACH_RECORD_FIELD macro |
| Deserialize fields out of order by name | Fields can be deserialized successfully |

Error test cases

| Test case | Expected Result |
| --- | --- |
| getPartitionKey() is called before all partition key fields are serialized | Exception is thrown |
| getRangeKey() is called before all range key fields are serialized | Exception is thrown |
| deserializeField(string name) on a name that is not in schema | Exception is thrown |
| seekField() with a field index out-of-bounds for the schema | Exception is thrown |
| Deserialize a field that has not been serialized for the document | Exception is thrown |

Distributed tracing for 3SI operations

Add trace operation support: trace a read or write (including PUSH) and accumulate a trace for the txn on the server. I imagine we start a txn with a trace flag, which makes the server accumulate an event log for the txn, including all PUSH operations performed against it. We can use this with inspect to validate expected results (e.g. trace two conflicting txns).

We should be able to trace across cluster components: CPO, persistence, etc. We should also consider integration with the logging and metrics systems.

Fix finalization error in TPC-C test

When running test_k23si_tpcc.sh the test shows a handful of finalization errors during the load phase, which has sync finalize turned on. These need to be investigated and fixed.

[0026:00:27:06.931.008]-nodepool-(0) [ERROR] [/build/src/k2/module/k23si/TxnManager.cpp:414 @operator()]Finalize request did not succeed for {pvid={id=0, rangeV=1, assignmentV=1}, colName=TPCC, mtr={txnid=9221627776308204174, timestamp={tsoId=1, endCount=1605305669540315000(18579:22:14:29.540.315), delta=4608}, priority=medium}, trh={schema=district pkey=, rkey=}, key={schema=orderline pkey=, rkey=
}, action=commit}, status=[408 Request Timeout]: partition deadline exceeded

Fix K23SI client exceptions

K23SI client should return an exceptional future wherever possible instead of throwing an exception directly.
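The difference between throwing and returning an exceptional future can be sketched as follows. The `Future` class here is a deliberately tiny, hypothetical stand-in so the example is self-contained; the real client is built on Seastar, where `seastar::make_exception_future` plays the role of `makeException`. The `readKey` function is also invented for illustration.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, already-resolved future: holds either a value or an exception.
template <typename T>
class Future {
public:
    static Future makeReady(T value) {
        Future f;
        f._value = std::move(value);
        return f;
    }
    static Future makeException(std::exception_ptr ex) {
        Future f;
        f._ex = ex;
        return f;
    }
    bool failed() const { return _ex != nullptr; }
    // Returns the value, or rethrows the stored exception.
    T get() const {
        if (_ex) {
            std::rethrow_exception(_ex);
        }
        return _value;
    }

private:
    T _value{};
    std::exception_ptr _ex;
};

// Hypothetical client call: invalid input is reported through the returned
// future rather than by throwing at the call site.
Future<std::string> readKey(const std::string& key) {
    if (key.empty()) {
        return Future<std::string>::makeException(
            std::make_exception_ptr(std::invalid_argument("empty key")));
    }
    return Future<std::string>::makeReady("value-for-" + key);
}
```

The benefit is that callers handle every failure in one place, in the future's continuation chain, instead of needing both a try/catch around the call and an error handler on the future.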

Implement SKV Partial Update

A partial update is when the user only wants to write a subset of the fields of a record. We want to optimize for this case with a new partialUpdate RPC and interface, which will reduce traffic over the network.
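As a rough illustration of what a partial update carries, the sketch below models a record as a flat list of string-encoded fields and an update as a list of field indexes plus new values. `Record`, `PartialUpdate`, and `applyPartialUpdate` are hypothetical names; the actual RPC and record encoding are not specified in this issue.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: one string-encoded value per schema field.
using Record = std::vector<std::string>;

// Hypothetical partial-update payload: only the listed fields are sent.
struct PartialUpdate {
    std::vector<uint32_t> fieldsToUpdate;  // schema field indexes
    std::vector<std::string> values;       // new values, in the same order
};

// Applies the update to the stored record. Returns false and leaves the
// record untouched if the payload is malformed.
bool applyPartialUpdate(Record& stored, const PartialUpdate& update) {
    if (update.fieldsToUpdate.size() != update.values.size()) {
        return false;
    }
    for (uint32_t idx : update.fieldsToUpdate) {
        if (idx >= stored.size()) {
            return false;
        }
    }
    for (std::size_t i = 0; i < update.fieldsToUpdate.size(); ++i) {
        stored[update.fieldsToUpdate[i]] = update.values[i];
    }
    return true;
}
```

The network saving comes from the payload size: a one-field update to a wide record sends one index and one value instead of every field.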

Make rejectIfExists writes idempotent

If a write with the rejectIfExists flag is retried, the client may see a failure even if the first try placed a write intent. Make this operation idempotent by adding an op sequence id to write requests. This can be reused for future read-modify-write operations too.
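The idea can be sketched as below, assuming the server keeps the op sequence id alongside each write intent. `Partition`, `WriteIntent`, and `writeRejectIfExists` are hypothetical names for illustration and do not reflect the actual server code.

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class WriteResult { Created, AlreadyExists };

// Hypothetical write intent that remembers which operation placed it.
struct WriteIntent {
    uint64_t txnId;
    uint64_t opSeqId;
    std::string value;
};

// Hypothetical partition holding at most one write intent per key.
class Partition {
public:
    WriteResult writeRejectIfExists(const std::string& key, uint64_t txnId,
                                    uint64_t opSeqId, const std::string& value) {
        auto it = _intents.find(key);
        if (it == _intents.end()) {
            _intents.emplace(key, WriteIntent{txnId, opSeqId, value});
            return WriteResult::Created;
        }
        // A retry of the very same operation sees its own intent and
        // reports the original outcome instead of a conflict.
        if (it->second.txnId == txnId && it->second.opSeqId == opSeqId) {
            return WriteResult::Created;
        }
        return WriteResult::AlreadyExists;
    }

private:
    std::map<std::string, WriteIntent> _intents;
};
```

A later operation from the same txn with a different sequence id is still rejected, which keeps the rejectIfExists semantics intact for genuinely new writes.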

Add a core-to-core communication method for the applet running in the same process

I implemented a core-to-core communication method for applets running in the same process, in order to speed up RPCs within the same machine. To test it, I merged txbench_client.cpp and txbench_server.cpp into a single file, txbench_combine.cpp, so that the server and the client can run in the same process.

However, when I tested it, it always hit a segmentation fault. While trying to fix this, I found that the problem comes from this function: https://github.com/futurewei-cloud/chogori-platform/blob/master/src/k2/transport/RPCDispatcher.cpp#L118-L151

To reproduce the segmentation fault, run txbench_combine with these args: ./txbench_combine -c 2 --tcp_endpoints 12345 12346 --tcp_remotes tcp+k2rpc://0.0.0.0:12345 --memory 10G --poll-mode --cpuset 9-10. If I change --tcp_remotes tcp+k2rpc://0.0.0.0:12345 to --tcp_remotes tcp+k2rpc://0.0.0.0:12346, the benchmark works well. The only difference is that the first set of args asks the client to communicate with the server on the same core, which triggers a loop of send/receive requests that is processed by the _handleNewMessage function.

Implement Stock Level TPC-C Transaction

See section 2.8 in the spec: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf

If the spec mentions any terminal display or file output, we do not want to do that but the data needed for it should be retrieved from SKV and stored in the transaction context. For example, in the New Order transaction in src/k2/cmd/tpcc/transactions.h the tax, discount, and total_amount variables are saved as member variables but are not used in the code.

Read cache overlapping interval merging

Right now the k23si read cache keeps overlapping intervals in memory. It might be better to split and merge overlapping intervals on insertion to save memory.
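A simplified sketch of merging on insertion follows. It assumes the cache maps inclusive key ranges to the latest read timestamp, and it only merges: overlapping ranges collapse into one range carrying the maximum timestamp. That is conservative (it may raise the timestamp for part of a range) and skips the splitting step, which would preserve exact timestamps at the cost of more entries. `ReadCache` and `CachedRange` are hypothetical names, not the actual k23si read cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

struct CachedRange {
    std::string end;     // inclusive end key
    uint64_t timestamp;  // latest read timestamp observed for the range
};

class ReadCache {
public:
    // Inserts [begin, end] and merges it with every stored range it overlaps.
    void insert(std::string begin, std::string end, uint64_t timestamp) {
        // First stored range that starts after `begin`.
        auto it = _ranges.upper_bound(begin);
        // The range before it may still reach into [begin, end].
        if (it != _ranges.begin()) {
            auto prev = std::prev(it);
            if (prev->second.end >= begin) {
                it = prev;
            }
        }
        // Absorb every overlapping range into the new one.
        while (it != _ranges.end() && it->first <= end) {
            begin = std::min(begin, it->first);
            end = std::max(end, it->second.end);
            timestamp = std::max(timestamp, it->second.timestamp);
            it = _ranges.erase(it);
        }
        _ranges.emplace(std::move(begin), CachedRange{std::move(end), timestamp});
    }

    // Returns the cached timestamp covering `key`, or 0 if none.
    uint64_t lookup(const std::string& key) const {
        auto it = _ranges.upper_bound(key);
        if (it == _ranges.begin()) {
            return 0;
        }
        --it;
        return it->second.end >= key ? it->second.timestamp : 0;
    }

    std::size_t size() const { return _ranges.size(); }

private:
    // Disjoint ranges keyed by their begin key.
    std::map<std::string, CachedRange> _ranges;
};
```

Because stored ranges stay disjoint, both insertion and lookup need only a single ordered-map search plus a scan over the ranges actually being merged.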

Use SKV Partial Update in TPC-C Benchmark

In the tpcc benchmark for SKV, change full writes into partial writes where possible. The changes need to be made in src/k2/cmd/tpcc/transactions.h. If you need a reference on what the transactions are supposed to do you can look at the specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf

You can use test/integration/test_k23si_tpcc.sh to test locally whether the changes work. For performance testing, I recently updated the scripts and README in the cluster directory. You will need to modify the configuration to work on 3 machines, with fewer cores and fewer TPC-C warehouses. You don't need to do performance testing for this task, but getting set up for performance testing will be very useful soon.

TSO heartbeat timeout issue

Logs from chogori-sql:

[0000:19:59:16.077.602]-nodepool-(1) [WARN] [/build/src/k2/module/k23si/TxnManager.cpp:77 @operator()]heartbeat expired on: {txnId={trh={schema=00000001000030008000000000000a30 pkey=^A00000001000030008000000000000a30^@^A^A^@^A^A��f^]^BGK���2#�X��^@^A, rkey=}, mtr={txnid=11254935246207970076, timestamp={tsoId=1, endCount=1611080472035753000(18646:18:21:12.035.753), delta=4608}, priority=medium}}, writeKeys=[0], rwExpiry={tsoId=1, endCount=1611080472035753000(18646:18:21:12.035.753), delta=4608}, hbExpiry=0000:19:55:21.882.900, syncfin=0}

[0000:19:59:32.829.201]-k2_pg-(0) [DEBUG] [/build/src/k2/connector/yb/pggate/k23si_seastar_app.cc:219 @operator()]Write...
[0000:19:59:32.829.222]-k2_pg-(139781049149184) [ERROR] [/build/src/k2/connector/yb/pggate/k2_adapter.cc:441 @operator()]K2 write failed due to hb not allowed for the txn state
[0000:19:59:32.829.234]-k2_pg-(139781049149184) [DEBUG] [/build/src/k2/connector/yb/pggate/k2_adapter.cc:443 @operator()]K2 write status: [405 Method Not Allowed]: hb not allowed for the txn state

Finally, the txn failed:

2021-01-19 18:27:05.272 UTC [129] FATAL: Invalid argument: hb not allowed for the txn state

Integration Test Cases for Schema Creation

Add more integration test cases for schema creation.

References

src/k2/dto/ControlPlaneOracle.h (Schema and schema create request definitions)
test/cpo/CPOTest.cpp (existing schema creation tests)

Happy path test cases

| Test case | Expected Result |
| --- | --- |
| Create a new version of an existing schema by renaming a (non-key) field | 2xx success code, schema can be retrieved through GetSchemasRequest from CPO |
| Create a new version of an existing schema by adding a new field | 2xx success code, schema can be retrieved through GetSchemasRequest from CPO |
| Create a schema which does not have any range key fields set | 2xx success code, schema can be retrieved through GetSchemasRequest from CPO |

Error test cases

| Test case | Expected Result |
| --- | --- |
| Create a schema with duplicate field names | 400 error code, schema does not exist in result set from GetSchemasRequest |
| Create a schema by setting partitionKeyFields manually by index, and an index is out of bounds of the fields | 400 error code, schema does not exist in result set from GetSchemasRequest |
| Create a schema where the field at index 0 is not a partition or range key field | 400 error code, schema does not exist in result set from GetSchemasRequest |
| Create a new version of an existing schema where a key field is renamed | 409 error code, schema does not exist in result set from GetSchemasRequest |
| Create a new version of an existing schema where the type of a key field changes | 409 error code, schema does not exist in result set from GetSchemasRequest |
| Create a new version of an existing schema where a key field is removed | 409 error code, schema does not exist in result set from GetSchemasRequest |
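The three 400-level cases above can be read as a standalone validation routine, sketched here for reference when writing the tests. `SchemaSketch` and `validateNewSchema` are hypothetical and much simpler than the definitions in src/k2/dto/ControlPlaneOracle.h; the 409 cases, which compare against a previous schema version, are not covered.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical, reduced schema: field names plus key field indexes.
struct SchemaSketch {
    std::vector<std::string> fieldNames;
    std::vector<uint32_t> partitionKeyFields;
    std::vector<uint32_t> rangeKeyFields;
};

// Returns 200 if the schema passes the basic checks, 400 otherwise.
int validateNewSchema(const SchemaSketch& schema) {
    // Duplicate field names are rejected.
    std::set<std::string> seen;
    for (const auto& name : schema.fieldNames) {
        if (!seen.insert(name).second) {
            return 400;
        }
    }

    bool firstFieldIsKey = false;
    // Checks one key list; returns false on an out-of-bounds index.
    auto checkKeys = [&](const std::vector<uint32_t>& keys) {
        for (uint32_t idx : keys) {
            if (idx >= schema.fieldNames.size()) {
                return false;
            }
            if (idx == 0) {
                firstFieldIsKey = true;
            }
        }
        return true;
    };
    if (!checkKeys(schema.partitionKeyFields) || !checkKeys(schema.rangeKeyFields)) {
        return 400;
    }

    // The field at index 0 must be a partition or range key field.
    return firstFieldIsKey ? 200 : 400;
}
```

An empty rangeKeyFields list passes these checks, which matches the happy-path case of a schema without range key fields.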

3SI Transaction test cases scenario 02

Implement scenario 02 test cases from https://github.com/futurewei-cloud/chogori-platform/blob/master/docs/RFC/K23SI_testing.md

You will need to implement the functionality to delay finalization. You can do this with an option in the txn END request similar to the existing syncFinalize option. The delay itself would go here: https://github.com/futurewei-cloud/chogori-platform/blob/master/src/k2/module/k23si/TxnManager.cpp#L344

You may also need to increase the heartbeat deadline config option for the tests.

Also consider other test cases that could be added to this scenario, especially requests against the keys that have the WI and aborted records.

Add plog support

References:
src/k2/dto/Persistence.h (Dtos for the communication between PlogServer and PlogClient)
src/k2/dto/PartitionGroup.h (Dtos for register/get plog partition groups)
src/k2/persistence/plog/PlogServer.h
src/k2/persistence/plog/PlogServer.cpp
src/k2/persistence/plog/PlogClient.h
src/k2/persistence/plog/PlogClient.cpp
src/k2/cmd/demo/plog_server.cpp (Start the PlogServer instance)
test/plog/*
test/integration/test_plog.sh (Integration unit test cases for plog service)

push against non-existing transaction

This happened a few times during laptop-deployed testing. It hits this case in Module.cpp:1043:
case dto::TxnRecordState::Deleted:
default:
K2ASSERT(log::skvsvr, false, "Invalid transaction state: {}", incumbent.state);
}

[0002:08:06:43.250.163]-nodepool-(k2::skv_server:0) [ERROR] [/build/src/k2/module/k23si/Module.cpp:1043 @handleTxnPush] Invalid transaction state: Deleted

The state Deleted is used as the in-memory state of a transaction while we're recording in Persistence that it was deleted:

  • after finalize
  • if retention window expires on a force-aborted txn

It is possible to have a race condition where a push is issued but by the time it is handled the incumbent has been finalized.

Support decimal64 and decimal128 as SKV types

decimal64 (providing 16 digits of precision) and decimal128 (providing 34 digits of precision) can be used to support the SQL decimal data type up to those levels of precision. They are included as a GCC standard library extension so implementation work will be reduced.

This task is to add decimal64 and decimal128 as SKV schema types, which can be used as data fields and as part of filter expressions but will not be supported as key fields. Also change SKV's TPC-C benchmark to use the new types where appropriate.

Implement Order Status TPC-C Transaction

See section 2.6 in the spec: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf

If the spec mentions any terminal display or file output, we do not want to do that but the data needed for it should be retrieved from SKV and stored in the transaction context. For example, in the New Order transaction in src/k2/cmd/tpcc/transactions.h the tax, discount, and total_amount variables are saved as member variables but are not used in the code.
