mongodb-labs / drivers-atlas-testing Goto Github PK

View Code? Open in Web Editor NEW

8.0 8.0 26.0 642 KB

Drivers Atlas Testing

License: Apache License 2.0

Python 79.74% Shell 12.45% Ruby 2.98% Go 2.50% PHP 2.32%

drivers-atlas-testing's People

Contributors

Stargazers

Watchers

drivers-atlas-testing's Issues

Provide compete Atlas api response in exceptions

I am currently getting the following error from astrolabe:

atlasclient.exceptions.AtlasApiError: 404: Not Found. Error Code: 'NOT_ATLAS_GROUP' (GET https://cloud.mongodb.com/api/atlas/v1.0/groups/byName/x)

If I inspect the response returned by the api, I see this:

(Pdb) response.content
b'{"detail":"Group 5e9d09c5e373b17421f92ca2 is not an Atlas group; use the Cloud Manager Public API at /api/public/v1.0 to access it.","error":404,"errorCode":"NOT_ATLAS_GROUP","parameters":["5e9d09c5e373b17421f92ca2"],"reason":"Not Found"}'

As a user of Astrolabe I would like Astrolabe to include all information that api returns into its exceptions so that I can effectively troubleshooting errors.

Add documentation for running astrolabe locally

Running astrolabe locally will be required when new drivers are adding integrations and also while troubleshooting issues that are uncovered. It will be extremely helpful to have a guide that outlines the steps one must take to do this.

Run tests in MongoDB Drivers Team organization

Currently, my personal Atlas account is used to run the tests. We should start using the team Atlas organization (which can use Atlas credits) instead.

Astrolabe attempts reading the sentinel file before it is written on Windows

It appears that due to a quirk in how signal handling works Cygwin, astrolabe sometimes attempts to read the sentinel file results.json before the workload executor has actually written it.

This is probably because of the fact that after the CTRL_BREAK_EVENT is sent to a workload executor that is wrapped in a shell script, the call to wait returns simply when the shell script itself has terminated but the background process started by it has not. This is corroborated by the log output which indicates that astrolabe thinks that the Python workload executor has terminated before the executor script even prints "Writing statistics to Sentinel file ...":

INFO:astrolabe.runner:Initializing cluster '638743e002'
INFO:astrolabe.runner:Waiting for a test cluster to become ready
INFO:astrolabe.runner:Test cluster '638743e002' is ready
INFO:astrolabe.runner:Running test 'retryWrites_toggleServerSideJS' on cluster '638743e002'
INFO:astrolabe.utils:Starting workload executor subprocess
INFO:astrolabe.utils:Started workload executor [PID: 3808]
INFO:astrolabe.runner:Pushing process arguments update
INFO:astrolabe.runner:Waiting for cluster maintenance to complete
INFO:astrolabe.runner:Cluster maintenance complete
INFO:astrolabe.utils:Stopping workload executor [PID: 3808]
INFO:astrolabe.utils:Stopped workload executor [PID: 3808]
INFO:astrolabe.utils:Reading sentinel file 'C:\\data\\mci\\8eb729fb72c80ab434d9cc0f2890cee6\\astrolabe-src\\results.json'
INFO:astrolabe.runner:FAILED: 'retryWrites_toggleServerSideJS'
INFO:astrolabe.runner:Workload Statistics: {'numErrors': -1, 'numFailures': -1, 'numSuccesses': -1}
Workload statistics: {'numErrors': 0, 'numFailures': 0, 'numSuccesses': 1845}
Writing statistics to sentinel file 'C:\\data\\mci\\8eb729fb72c80ab434d9cc0f2890cee6\\astrolabe-src\\results.json'
INFO:astrolabe.runner:Cluster '638743e002' marked for deletion.

Add 'dry-run' functionality to the test runner

Users should be able to set a flag to use astrolabe in dry-run mode to make it faster to debug issues when integrating new drivers. Integrating a new driver usually requires running the test suite multiple times in quick succession with changes being made to the evergreen configuration and/or the driver-provided scripts. Currently, this process is extremely cumbersome as some maintenance tasks can take a very long time so users must wait for a long time before being able to see the outcome of a given run.

In dry-run mode, astrolabe would simply run the driver workload for a pre-determined duration of time on an existing cluster. No maintenance would be performed. The cluster would not be torn down at the end of the run. An additional flag could be used to specify which cluster to use or one could be selected at random from what is available.

Refer to Atlas Projects as projects everywhere

Currently, we use a mix of 'group' and 'project' to refer to the same thing. "Project" is more user-friendly and self-explanatory and we should switch to using that entirely.

Ensure Atlas clusters are marked for deletion at the end of Evergreen build

Safeguards need to be put in place to ensure that test clusters are deleted at the end of each evergreen build. Currently, the clusters are not deleted if, for example, astrolabe encounters an error during a build.

Use Atlas test cluster to synchronize operations between astrolabe and workload executors

A configurable setting should be added to astrolabe that specifies a special namespace, e.g. sentinel_database.sentinel_collection in the test database, which will then be used by astrolabe and workload executors to synchronize their operations.

This might look something like this:

After astrolabe starts the workload executor, it writes the following record (with writeConcern: majority) to sentinel_database.sentinel_collection:

{ '_id': <run_id>, status: 'inProgress' }

Here, run_id is some identifier that is known to both astrolabe as well as the workload executor.

After each iteration of running all operations in the operations array (see https://mongodb-labs.github.io/drivers-atlas-testing/spec-test-format.html), the workload executor checks the sentinel_database.sentinel_collection collection (with readConcern: majority) for the record bearing _id: <run_id>. On seeing that the status is still inProgress, the workload executor continues onto the next iteration of running operations.
Once the maintenance has completed and astrolabe wants to tell the workload executor to quit, it updates the sentinel record (using writeConcern: majority) to:

{ '_id': <run_id>, status: 'done' }

On the next check, the workload executor sees that the status is now done, and it updates this record with execution statistics (using writeConcern: majority):

{ '_id': <run_id>, 'status': 'done', 'executionStats': {<field1>: <value1>, ...} }

After this, the workload executor exits.

Astrolabe waits on the $PID of the workload executor to exit. Once it has exited, it reads the execution statistics that are written by the workload executor.

Advantages of this approach

No more signal handling - this has been a thorn in implementation and we are only up to 2 languages at this point. Workload executors are already equipped to talk to the Atlas deployment so we know that the approach proposed herein will be painless to implement.
Workload executors can 'run anywhere' - since we no longer rely on platform-specific signals, we can coordinate between astrolabe and a workload executor no matter where they are running. This will be especially helpful in the context of running the workload executors inside containers where signals are not a viable option for process synchronization.
Enable support for more complex communication - we can support more complex interactions between the workload executor and astrolabe with this design
No more sentinel files - we no longer rely on files written by the workload executor to communicate execution stats.
Use what you build - this one is pretty obvious (DBs are used to store state and communicate state between processes that might get partitioned).

Edge cases

Workload executor is partitioned from the Atlas test cluster: this will make the W-E unable to read the sentinel document (could be caused, e.g. by a bug in the driver being tested or due to the Atlas test cluster going offline). This can be handled by using an appropriate timeout on the wait performed by astrolabe on the workload-executors $PID. If the workload executor does not stop running within the timeout, an error will be reported.
Astrolabe is partitioned from the Atlas test cluster: this is possible even in the current design. If astrolabe cannot write the sentinel document at the start of a run, we can mark the run a system failure. If astrolabe cannot update the record when it needs to signal the W-E to stop, OR it cannot read the execution stats, we can mark this as a test failure as the maintenance or workload possibly broke something.

CC: @mbroadst @vincentkam

Workload executor validation evergreen task

Need to add a patch-only task to run the workload executor validation in evergreen. This will probably be easier to use for driver engineers as they won't have to manually set up environment variables and download/install the respective drivers (instead we will piggyback off of install-driver.sh and the usual scaffolding that is used in production builds).

--project-name option referenced in docs does not exist

https://mongodb-labs.github.io/drivers-atlas-testing/installing-running-locally.html#running-atlas-planned-maintenance-tests says:

$ astrolabe spec-tests run-one <path/to/test-file.yaml> -e <path/to/workload-executor> --project-name --cluster-name-salt

When I run that I get https://gist.github.com/p-mongo/e4ee62d774c96d4efd9d1cb1efb418a3.

If I use --group-name instead of --project-name this step appears to succeed.

Validate test specification before provisioning cloud resources

We should add a validation stage to the spec runner setup that validates each maintenance plan that is loaded from a spec test file. This way, we can detect inconsistencies in the test specification before starting the test and conserve time and cloud resources.

E.g. currently, we check that the maintenance plan in a test is a valid one (by asserting that not both clusterConfiguration and processArgs are missing from maintenancePlan.final) deep in the spec runner.

Only inject certifi certificates on Windows if using TLS

Currently, we always end up passing tlsCAFile=certifi.where() on windows which ends up implicitly enabling TLS even if the server is running without it. The error looks like this:

======================================================================
ERROR: test_simple (astrolabe.validator.ValidateWorkloadExecutor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabe\validator.py", line 45, in setUp
    load_test_data(self.CONNECTION_STRING, DRIVER_WORKLOAD)
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabe\utils.py", line 140, in load_test_data
    coll.drop()
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\collection.py", line 1103, in drop
    dbo.drop_collection(self.__name, session=session)
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\database.py", line 914, in drop_collection
    with self.__client._socket_for_writes(session) as sock_info:
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\mongo_client.py", line 1266, in _socket_for_writes
    server = self._select_server(writable_server_selector, session)
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\mongo_client.py", line 1253, in _select_server
    server = topology.select_server(server_selector)
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\topology.py", line 235, in select_server
    address))
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\topology.py", line 193, in select_servers
    selector, server_timeout, address)
  File "c:\data\mci\8a509a3df3bdd3bf029acf5b0edc5187\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\topology.py", line 209, in _select_servers_loop
    self._error_message(selector))
pymongo.errors.ServerSelectionTimeoutError: SSL handshake failed: localhost:27017: [WinError 10054] An existing connection was forcibly closed by the remote host

Instead, we can use a try...except block to only inject the CA cert file on SSL failures.

Use cloud-dev Atlas for running Evergreen against drivers-atlas-testing

Stop using certifi to connect to Atlas

We should stop using certifi certificates (and also remove it as an astrolabe dependency) once the build ticket is done.

Prepend environment variables used by astrolabe with ASTROLABE_*

This will reduce ambiguity when such variables are used.

Tests on Evergreen failing due to premature patching of processArgs

While this may not show up while manually running the tests, when several builds are kicked off simultaneously on Evergreen, many of them fail with the following error:

==============================  ====================
Test name                       Atlas cluster name
==============================  ====================
retryWrites_toggleServerSideJS  df6cba4c3e
==============================  ====================
Traceback (most recent call last):
  File "astrolabevenv/bin/astrolabe", line 11, in <module>
    load_entry_point('astrolabe', 'console_scripts', 'astrolabe')()
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabevenv/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabe/cli.py", line 431, in run_single_test
    failed = runner.run()
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabe/spec_runner.py", line 324, in run
    case.initialize()
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/astrolabe/spec_runner.py", line 135, in initialize
    clusters[self.cluster_name].processArgs.patch(**process_args)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/atlasclient/client.py", line 56, in patch
    return self._client.request('PATCH', self._path, **params)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/atlasclient/client.py", line 214, in request
    return self.handle_response(method, response)
  File "/data/mci/964575159638987f662132cfcaab2bff/astrolabe-src/atlasclient/client.py", line 255, in handle_response
    raise AtlasApiError('404: Not Found.', **kwargs)
atlasclient.exceptions.AtlasApiError: 404: Not Found. Error Code: 'CLUSTER_NOT_FOUND' (PATCH https://cloud.mongodb.com/api/atlas/v1.0/groups/5e8e3954fd6ba4520d3f1bbe/clusters/df6cba4c3e/processArgs)

The error seems to indicate that we are prematurely accessing the cluster resource associated with that test cluster for the test that has failed. Adding a waiting period after the cluster is created, but before the processArgs are patched should resolve the issue.

Fix issues with PyMongo's integration on Windows

PyMongo's integration doesn't really work on Windows at the moment. Primarily, this is due to missing CA certificates on the Evergreen windows hosts which render the workload executor unable to communicate with the Atlas cluster. We can work around this issue by installing certifi if we are running on windows and using that CA certs from that package in our call to MongoClient.

Make validation a bit more forgiving

The validate-workload-executor checks the count of ops against the expected count. It should allow one more operation than expected, so that workload executors can write the results file in the signal handler without waiting for the main loop to actually exit.

Workaround Atlas limitation on maximum number of Clusters under a Project

Atlas limits each Project to 25 clusters. This means we are limited to 25 concurrent builds at any time in the drivers-atlas-testing evergreen project. In the absence of this restriction, we often end up with too many simultaneously executing jobs that end up failing with the CLUSTER_NOT_FOUND error code (see #47 for details).

We need to figure out a workaround for this limitation. CC: @mbroadst

Make astrolabe info <cmd> output more user-friendly

Users have complained about it being confusing to interpret these tables as it is not clear what the Internal Variable ID is. We can replace that with the name of the option on the command line or something else the user cares about.

Add Workload Executor Behavioral Description to Integration Guide

The specification/technical design included in the documentation for this project is not intended to be a living document and will soon become outdated considering that changes to the framework are likely as new languages write integrations and uncover shortcomings in the original design. In that light, the "Workload Executor" section in the Integration Guide should be expanded to include a complete behavioral description (a.k.a 'the contract') expected from workload executors. This section can then be updated as changes are made and will be the source of truth for new drivers to write integrations and for existing integrations to update their workload executors.

Document that workload executors MUST ignore the testData key in the workload executor spec

This is currently only documented in the YAML test format specification. We should document it in the workload execution spec (and add a comment about it to the workload executor pseudocode) because users are likelier to be reading that document when implementing their executors.

Add post test log retrieval functionality

After a test case has been run, astrolabe should:

wait until Atlas logs are current (i.e logs from the period of time during which tests were run are available via the API),
download the mongod/mongos logs in an appropriately named folder hierarchy that makes it easy to find logs corresponding to specific tests,
create a zip archive/tarball containing all the logs.

Investigate why Atlas Group retrieval fails for some groups

Workload executors should report JSON output using a file

Currently, we require workload executors to report statistics of the run back to the test orchestrator using JSON output that is sent to STDERR. This method of doing things breaks down however in the event that the workload produces any output that is sent to STDERR during the test run. Some languages don't provide any mechanism to redirect such output to STDOUT.

To accommodate all languages, we can instead require this JSON output to be written as ASCII text by the workload executor to a file with a fixed, pre-determined name in the current working directory where the workload executor is invoked. The orchestrator can parse this file for the desired information once the workload executor has been terminated. If the file is not found, the orchestrator can assume that something went wrong and therefore mark the run as a failure.

Make astrolabe resilient to faulty specification test files

Consider a situation wherein there exists a typo in one of the test scenario files - say an expected field is missing. As written currently, the test runner will notice the error when it attempts to access the missing field, probably in astrolabe.spec_runner.AtlasTestCase.run. Attempting to access this non-existent field would raise an AttributeError which would be propagated back to the user and terminate the entire test run.

astrolabe should be robustified to deal with this kind of erroneous test file content. A faulty spec test should only cause that particular test to fail, but should not impede the remainder of the test run from proceeding as usual.

Remove CA cert workarounds on Windows once Evergreen hosts get updated certs

We should be able to remove the workaround using certifi introduced in #26 once BUILD-10841 has been resolved.

Use non-dotted directories for workload executors

Currently workload executors are located in subdirectories of .evergreen. In Unix the leading dot makes a file hidden, which means:

Commands like ls do not show these files/directories by default
Shell completion by default does not show them

Also, the leading dot is awkward to type. I use Dvorak layout but even on qwerty "a" is on the home row and "." is one row down. On Dvorak . and e are adjacent (same finger) and are arranged vertically.

Is it possible to have the workload executors be under a path that does not contain dotted directories please?

Astrolabe does not stop when cluster creation fails

I am seeing this output when running a test:

INFO:astrolabe.runner:Initializing cluster '51a1f7e8d0'
DEBUG:atlasclient.client:Request (POST https://cloud.mongodb.com/api/atlas/v1.0/groups/5f5b0c2ccb08f43b11905b20/clusters {'auth': <requests.auth.HTTPDigestAuth object at 0x7f6f084e40d0>, 'params': {}, 'json': {'clusterType': 'REPLICASET', 'providerSettings': {'providerName': 'AWS', 'regionName': 'US_WEST_1', 'instanceSizeName': 'M10'}, 'replicationSpecs': [{'numShards': 1, 'regionsConfig': {'US_WEST_1': {'electableNodes': 1, 'priority': 1}, 'US_EAST_1': {'electableNodes': 1, 'priority': 1}, 'US_EAST_2': {'electableNodes': 1, 'priority': 2}}}], 'name': '51a1f7e8d0'}, 'timeout': 10.0})
DEBUG:atlasclient.client:Response (POST {'detail': 'The required attribute readOnlyNodes was not specified.', 'error': 400, 'errorCode': 'MISSING_ATTRIBUTE', 'parameters': ['readOnlyNodes'], 'reason': 'Bad Request'})
INFO:astrolabe.runner:Waiting for a test cluster to become ready
DEBUG:astrolabe.poller:Polling [<AtlasTestCase: retryReads_primaryTakeover>] [elapsed: 0.00 seconds]
DEBUG:atlasclient.client:Request (GET https://cloud.mongodb.com/api/atlas/v1.0/groups/5f5b0c2ccb08f43b11905b20/clusters/51a1f7e8d0 {'auth': <requests.auth.HTTPDigestAuth object at 0x7f6f084e40d0>, 'params': {}, 'json': {}, 'timeout': 10.0})
DEBUG:atlasclient.client:Response (GET {'detail': 'No cluster named 51a1f7e8d0 exists in group 5f5b0c2ccb08f43b11905b20.', 'error': 404, 'errorCode': 'CLUSTER_NOT_FOUND', 'parameters': ['51a1f7e8d0', '5f5b0c2ccb08f43b11905b20'], 'reason': 'Not Found'})

Even though cluster creation failed, astrolabe appears to continue and tries to read it and operate on it.

I expected cluster creation failure to abort execution.

Windows support

Currently, astrolabe cannot run spec tests on Windows because PyMongo cannot reliably connect to Atlas clusters on windows. It fails with the following traceback:

[2020/04/01 02:09:22.209] Traceback (most recent call last):
[2020/04/01 02:09:22.209]   File "C:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\Scripts\astrolabe-script.py", line 11, in <module>
[2020/04/01 02:09:22.209]     load_entry_point('astrolabe', 'console_scripts', 'astrolabe')()
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 829, in __call__
[2020/04/01 02:09:22.209]     return self.main(*args, **kwargs)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 782, in main
[2020/04/01 02:09:22.209]     rv = self.invoke(ctx)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 1259, in invoke
[2020/04/01 02:09:22.209]     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 1259, in invoke
[2020/04/01 02:09:22.209]     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 1066, in invoke
[2020/04/01 02:09:22.209]     return ctx.invoke(self.callback, **ctx.params)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 610, in invoke
[2020/04/01 02:09:22.209]     return callback(*args, **kwargs)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\click\decorators.py", line 21, in new_func
[2020/04/01 02:09:22.209]     return f(get_current_context(), *args, **kwargs)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabe\cli.py", line 428, in run_single_test
[2020/04/01 02:09:22.209]     failed = runner.run()
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabe\spec_runner.py", line 360, in run
[2020/04/01 02:09:22.209]     xunit_test = active_case.run(persist_cluster=self.persist_clusters)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabe\spec_runner.py", line 165, in run
[2020/04/01 02:09:22.209]     coll.drop()
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\collection.py", line 1103, in drop
[2020/04/01 02:09:22.209]     dbo.drop_collection(self.__name, session=session)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\database.py", line 914, in drop_collection
[2020/04/01 02:09:22.209]     with self.__client._socket_for_writes(session) as sock_info:
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\mongo_client.py", line 1266, in _socket_for_writes
[2020/04/01 02:09:22.209]     server = self._select_server(writable_server_selector, session)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\mongo_client.py", line 1253, in _select_server
[2020/04/01 02:09:22.209]     server = topology.select_server(server_selector)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\topology.py", line 235, in select_server
[2020/04/01 02:09:22.209]     address))
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\topology.py", line 193, in select_servers
[2020/04/01 02:09:22.209]     selector, server_timeout, address)
[2020/04/01 02:09:22.209]   File "c:\data\mci\fcc36544582db08774836579fbb262db\astrolabe-src\astrolabevenv\lib\site-packages\pymongo\topology.py", line 209, in _select_servers_loop
[2020/04/01 02:09:22.209]     self._error_message(selector))
[2020/04/01 02:09:22.209] pymongo.errors.ServerSelectionTimeoutError: 0412be4b71-shard-00-00.90gnc.mongodb.net:27017: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076),0412be4b71-shard-00-02.90gnc.mongodb.net:27017: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076),0412be4b71-shard-00-01.90gnc.mongodb.net:27017: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)

This is likely due to missing system CA certs on Windows hosts in Evergreen (see BUILD-10841).

Once the certs issue is resolved we might uncover other issues to running things reliably on Windows. These kinks need to be ironed out so that this framework can be consumed by driver languages that must use Windows as a testing platform.

Workaround hitting rate limits while polling Atlas API endpoints

Atlas API resources are rate-limited on a per-project basis. Since each and every evergreen build of this project uses the same Atlas project, it is possible to run into API rate limits when multiple builds are running simultaneously.

In the absence of a backoff/retry logic, hitting the rate limit results in the entire test run failing with a message like:

INFO:astrolabe.runner:Initializing cluster '420b243009'
INFO:astrolabe.runner:Waiting for a test cluster to become ready
Traceback (most recent call last):
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/util/connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 976, in _validate_conn
    conn.connect()
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connection.py", line 308, in connect
    conn = self._new_conn()
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connection.py", line 172, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fefa834e9e8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cloud.mongodb.com', port=443): Max retries exceeded with url: /api/atlas/v1.0/groups/5e8e3954fd6ba4520d3f1bbe/clusters/420b243009 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefa834e9e8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/atlasclient/client.py", line 210, in request
    response = requests.request(method, url, **request_kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cloud.mongodb.com', port=443): Max retries exceeded with url: /api/atlas/v1.0/groups/5e8e3954fd6ba4520d3f1bbe/clusters/420b243009 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefa834e9e8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "astrolabevenv/bin/astrolabe", line 33, in <module>
    sys.exit(load_entry_point('astrolabe', 'console_scripts', 'astrolabe')())
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabevenv/lib/python3.6/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabe/cli.py", line 441, in run_single_test
    failed = runner.run()
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabe/runner.py", line 323, in run
    args=("IDLE",), kwargs={})
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabe/poller.py", line 50, in poll
    return_value = self._check_ready(obj, attribute, args, kwargs)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabe/poller.py", line 67, in _check_ready
    return bool(getattr(obj, attribute)(*args, **kwargs))
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/astrolabe/runner.py", line 91, in is_cluster_state
    cluster_info = self.cluster_url.get().data
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/atlasclient/client.py", line 56, in get
    return self._client.request('GET', self._path, **params)
  File "/data/mci/14a1c0a1a91704cf6d127b1cc65cab0e/astrolabe-src/atlasclient/client.py", line 215, in request
    request_method=method
atlasclient.exceptions.AtlasClientError: HTTPSConnectionPool(host='cloud.mongodb.com', port=443): Max retries exceeded with url: /api/atlas/v1.0/groups/5e8e3954fd6ba4520d3f1bbe/clusters/420b243009 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefa834e9e8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)) (GET https://cloud.mongodb.com/api/atlas/v1.0/groups/5e8e3954fd6ba4520d3f1bbe/clusters/420b243009)
Command failed: error waiting on process '2d258a64-438f-4981-99f6-7403862b2caa': exit status 1

We should improve astrolabe to account for this failure mode and appropriately wait/backoff when such errors are encountered.

Add integration guide

At a minimum, documentation that delineates how a new driver may use astrolabe to implement atlas planned maintenance testing needs to added.

Mechanism for validating workload executors

Add a command to astrolabe that takes a WORKLOAD_EXECUTOR and a connection string and runs a series of workloads that are known to fail as a smoketest for checking whether the Workload Executor works properly and returns the correct JSON output.

This would be really helpful to have as it can be quite hard to figure out whether the workload executor is working correctly especially when there are no failures. Since the workload executor runs in a background process, on the off chance that it fails silently (due to an implementation bug usually), we could end up getting spurious 'passing'/'green' builds.

The validator would run a bunch of invalid workloads to stress-test the workload executor and to check that it raises the expected errors (e.g. run a non-existent command N times and and check that the workload executor returns numErrors=N). Additionally, we can also run queries that succeed and then have astrolabe check that we see the expected changes on the server (e.g. issue a write via the workload executor and then check using astrolabe that the expected document shows up on the server).

All astrolabe commands should not require Atlas API credentials

Commands that are informational (e.g. astrolabe info environment-variables) should be executable without needing to specify Atlas API credentials. This should be possible to do by checking command group names in where the client is created and attached to the click context.

Migrate to ubuntu1804-drivers-atlas-testing

This distro is limited to 25 instances so it helps us workaround #48.

We must also document that drivers need to use this distro to write their Linux-based test cases.

Support quick-finishing when workload-executor throws an error

When the workload executor is incorrectly implemented or error-ing due to any other reason, the test runner should cut short test execution, report the error raised by the executor via JUnit, and then move onto the next test.

Currently, the test runner doesn't check if the workload executor has errored and keeps waiting for cluster maintenance to complete even if the driver workload is not being run.

Let users provide the complete workload executor invocation instead of a standalone executable

It seems like many drivers will have to 'wrap' their workload executor scripts in a bash script (or similar) to adhere to the API currently required by astrolabe for workload executors. This design makes it impossible for the 'wrapper script' to set the correct exit status upon being terminated via a SIGINT or CTRL_BREAK_EVENT signal. Consider the following example (*nix):

wrapper.sh (analogous to the workload executor 'wrapper' shell script):

#!/bin/sh

trap "exit 0" INT                       # We set a handler for SIGINT because if we don't, this shell script will set a non-zero exit status when it receives `SIGINT` irrespective of the exit status of workload.py
python workload.py

workload.py (analogous to a driver's 'native' workload executor script):

import signal

def handler(signum, frame):
    exit(2)

signal.signal(signal.SIGINT, handler)                    # The script will exit with code 2 when it encounters SIGINT

while True:
    pass            # Usually, we'd do some driver operations here

Now we run the wrapper:

$ ./wrapper.sh
^C%                     # This is us manually pressing CTRL-C/Command-C on OSX to send SIGINT
$ echo $?
0                       # This is the exit status that will be seen by astrolabe

As the example shows, the exit code set by the native workload executor script is lost due to the wrapper script. AFAICT there is no workaround for this, the only alternative being that we stop relying on exit codes entirely.

This change would also make it easier for drivers to integrate astrolabe since this wrapper script might vary between different platforms or runtimes for some drivers. For example, for PyMongo, we should be able to specify the workload executor as follows:

$ astrolabe spec-tests run-one tests/retryReads-resizeCluster.yaml --workload-executor "path/to/pymongovenv/bin/python .evergreen/python/pymongo/workload-executor.py"

which is equivalent to:

$ WORKLOAD_EXECUTOR="path/to/pymongovenv/bin/python .evergreen/python/pymongo/workload-executor.py" astrolabe spec-tests run-one tests/retryReads-resizeCluster.yaml

This would make it much easier to account for a different invocation pattern. In the case of PyMongo, for example, on windows, we could simply do:

$ WORKLOAD_EXECUTOR="path/to/pymongovenv/Scripts/python.exe .evergreen/python/pymongo/workload-executor.py" astrolabe spec-tests run-one tests/retryReads-resizeCluster.yaml

Bash scripts wrapping native workload executors need not run them as background processes

Originally, the bash wrapper for native workload executors ran the native script/executable as a background process so that the script's exit code upon termination by astrolabe could be set to be the same as the exit code of the native workload executor. This was needed because:

bash scripts that wrapped a native executor always gave a non zero code if they handled a SIGINT without setting a trap
astrolabe relied on the exit code to ascertain test-run success/failure

Astrolabe now uses execution statistics to determine test success/failure so we no longer need to do this. This significantly simplifies the workload executor wrapper script.

As part of this ticket, we should:

Update the documentation/specification
Update the Python workload executor wrapper since this is referenced in the documentation as an example of how to wrap a native executor in a bash script

Stop using Travis CI for updating GitHub pages documentation

Using Travis is a maintenance burden as teams are using Evergreen for most other CI tasks. We should have a roadmap for migrating off of Travis CI.

Document the use of CTRL_BREAK_EVENT on Windows instead of SIGINT to interrupt workload executors

There are many limitations with using signal.CTRL_C_EVENT to interrupt a subprocess on Windows. Consider, for example, the following scripts:

pyscript.py (analogous to 'the framework', i.e. astrolabe):

import subprocess
import os
import signal
import sys
import time


cmd = subprocess.Popen([sys.executable, "bgproc.py"],
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
time.sleep(2)
os.kill(cmd.pid, signal.CTRL_C_EVENT)
stdout, stderr = cmd.communicate(timeout=10)

print("stdout: {}".format(stdout))
print("stderr: {}".format(stderr))
print("exit code: {}".format(cmd.returncode))

bgproc.py (analogous to a driver workload executor script):

import signal

print("hello world")

try:
    while True:
        pass
except KeyboardInterrupt:
    print("caught ctrl-c!")
    exit(0)

Running python.ext pyscript.py, we'd expect to see bgproc.py's execution interrupted by the CTRL_C_EVENT signal, which is 'handled' in the except KeyboardInterrupt block. However, we actually find that interruption of this script is not interrupted at all by the signal causing the call to communicate to timeout:

$ C:/python/Python37/python.exe pyscript.py
Traceback (most recent call last):
  File "pyscript.py", line 13, in <module>
    stdout, stderr = cmd.communicate(timeout=10)
  File "C:\python\Python37\lib\subprocess.py", line 964, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\python\Python37\lib\subprocess.py", line 1298, in _communicate
    raise TimeoutExpired(self.args, orig_timeout)
subprocess.TimeoutExpired: Command '['C:\\python\\Python37\\python.exe', 'bgproc.py']' timed out after 10 seconds

After observing this peculiar behavior, I investigated further and found that on Windows there are many deficiencies with the IPC APIs. The situation is further complicated by deficient/incorrect Python documentation (specifically, the correct usage of CTRL_C_EVENT, CTRL_BREAK_EVENT, CREATE_NEW_PROCESS_GROUP, os.kill on Windows). Some resources with pertinent information/discussions are:

In light of this, we need a new way to stop the Workload Executor on windows.

Support for workload executors that take a long time to start up

It seems some languages will end up with workload executor implementations that take a significant amount of time to start. We should modify the spec runner to support these kinds of executors.

The main work here involves ensuring that the workload starts running before the maintenance plan is applied. Without any safeguards, it is possible for maintenance to start, and even complete without the workload executor running a single operation.

Add alternative mechanism to signals for stopping workload executors

The current design for this project uses SIGINT (on *nix) and CTRL_BREAK_EVENT (on Windows) to coordinate the shutdown of the workload executor process after maintenance has been successfully run on the Atlas cluster.
Driver authors have to rely on standard APIs provided by their language in order to write a workload executor to conform to this spec. In practice, this has proven to be easier said than done. To reduce implementation complexity, we should consider providing an alternative mechanism to signals - something that is easier to implement and more platform-independent. An obvious solution would be to have astrolabe write a tombstone file to a pre-determined location when maintenance has completed, having workload executors periodically check for the existence of this file, and having them terminate when the file is eventually found.

Run drivers-atlas-testing tests on TravisCI

Figure out how to run planned maintenance tests on a pre-determined schedule

It seems like there is no way to tell Evergreen to run a job every N days/weeks. Instead, Evergreen is almost always triggered by commits to the tracked repository. We need to figure out a way to work around this and run these tests at a pre-determined cadence.

Move Spec Test Format into a separate file

Since the full technical design/specification document will not be kept up-to-date as this testing framework evolves, we should move (or duplicate) the test format section into a separate file that can be kept up to date as and when the format evolves. This file can then be 'included' in the appropriate place in the complete technical design document.

We should also add a README.md to the tests/ directory that points readers to the documentation that describes the test format.

Make it safe for workload executors to output large amounts of text

Currently, when running the workload executor, we PIPE the output from its stderr and stdout streams. To avoid blocking the child process when the pipe fills up, we need to continuously read the worker_subprocess output (via worker_subprocess.stdout.read(..)).

Thanks to @ShaneHarvey for the context:

See the note in subprocess.wait:

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

https://docs.python.org/3/library/subprocess.html#subprocess.Popen.wait

Just to be clear: there's no risk of deadlock in the current approach. The risk is that the child process can fill up the pipe with output and then it will block until we call worker_subprocess.communicate() down below.

A more elegant solution will be to redirect the worker_subprocess output to a file. Then we can open and read the file after the process exits.

Add documentation for workload executor validation

Documentation on how to use this feature locally and in evergreen (see #66) is required.

Investigate CLUSTER_NOT_FOUND errors on Evergreen

Some test runs on Evergreen fail with this message:

========================  =========================================================================================================================================================================
Configuration option      Value
========================  =========================================================================================================================================================================
Atlas organization name   MongoDB Drivers Team
Atlas group/project name  drivers-atlas-testing
Salt for cluster names    drivers_atlas_testing_tests_python_windows__driver~pymongo_3.10.x_platform~windows_64_runtime~python37_windows_563e7ce4e286e1555a69dfb315d1edb720812fbd_20_04_23_23_37_21
Polling frequency (Hz)    1.0
Polling timeout (s)       1200.0
========================  =========================================================================================================================================================================
INFO:astrolabe.spec_runner:Loading spec test from file 'C:\\data\\mci\\c7405177fd9fbede40ca08b68d5cc243\\astrolabe-src\\tests\\retryWrites-resizeCluster.yaml'
INFO:astrolabe.spec_runner:Verifying organization 'MongoDB Drivers Team'
INFO:astrolabe.spec_runner:Successfully verified organization 'MongoDB Drivers Team'
INFO:astrolabe.spec_runner:Verifying project 'drivers-atlas-testing'
INFO:astrolabe.spec_runner:Successfully verified project 'drivers-atlas-testing'
INFO:astrolabe.spec_runner:Verifying user 'atlasuser'
INFO:astrolabe.spec_runner:Successfully verified user 'atlasuser'
INFO:astrolabe.spec_runner:Enabling access from anywhere on project 'drivers-atlas-testing'
INFO:astrolabe.spec_runner:Successfully enabled access from anywhere on project 'drivers-atlas-testing'
INFO:astrolabe.spec_runner:Astrolabe Test Plan
=========================  ====================
Test name                  Atlas cluster name
=========================  ====================
retryWrites_resizeCluster  f490dab32b
=========================  ====================
INFO:astrolabe.spec_runner:Initializing cluster 'f490dab32b'
INFO:astrolabe.spec_runner:Waiting for a test cluster to become ready
Traceback (most recent call last):
  File "C:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\Scripts\astrolabe-script.py", line 11, in <module>
    load_entry_point('astrolabe', 'console_scripts', 'astrolabe')()
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabevenv\lib\site-packages\click\decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabe\cli.py", line 431, in run_single_test
    failed = runner.run()
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabe\spec_runner.py", line 337, in run
    args=("IDLE",), kwargs={})
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabe\poller.py", line 50, in poll
    return_value = self._check_ready(obj, attribute, args, kwargs)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabe\poller.py", line 67, in _check_ready
    return bool(getattr(obj, attribute)(*args, **kwargs))
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\astrolabe\spec_runner.py", line 92, in is_cluster_state
    cluster_info = self.cluster_url.get().data
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\atlasclient\client.py", line 51, in get
    return self._client.request('GET', self._path, **params)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\atlasclient\client.py", line 214, in request
    return self.handle_response(method, response)
  File "c:\data\mci\c7405177fd9fbede40ca08b68d5cc243\astrolabe-src\atlasclient\client.py", line 255, in handle_response
    raise AtlasApiError('404: Not Found.', **kwargs)
atlasclient.exceptions.AtlasApiError: 404: Not Found. Error Code: 'CLUSTER_NOT_FOUND' (GET https://cloud.mongodb.com/api/atlas/v1.0/groups/5e8e3954fd6ba4520d3f1bbe/clusters/f490dab32b)
Command failed: error waiting on process '98f82a23-cb22-4491-8704-757442fe6b23': exit status 1

These need to investigated and the root cause fixed.

Migrate to using IP Access List endpoints

Astrolabe currently uses IP Whitelist endpoints. These endpoints are deprecated and expected to stop working in ~6 months. We should modify astrolabe to use the IP Access list endpoints instead.

CC: @p-mongo

Stop relying upon workload executor exit-codes to determine success/failure

Due to the mechanism using which we terminate workload executors (specifically, issuing SIGINT/CTRL_BREAK_EVENT signals depending on the platform) and the nesting of natively implemented executors behind shell scripts (see, e.g., PyMongo's integration), it can be misleading and/or unreliable to use the exit code of the executors themselves to ascertain whether a workload succeeded or failed.

For example, on windows, when a script is terminated by the CTRL_BREAK_EVENT it sets the exit code to 3221225786. In some initial testing, it proved to be quite difficult to force the script to exit with a different code in-lieu of this system default (see failed evergreen patch builds: https://evergreen.mongodb.com/task/drivers_atlas_testing_tests_python_windows__driver~pymongo_master_platform~windows_64_runtime~python37_windows_retryReads_toggleServerSideJS_patch_57327b4c365d858296dcb3afe7218f5729fd5960_5e9f801b3627e0082715e32d_20_04_21_23_22_14, https://evergreen.mongodb.com/task/drivers_atlas_testing_tests_python_windows__driver~pymongo_master_platform~windows_64_runtime~python37_windows_retryReads_toggleServerSideJS_patch_57327b4c365d858296dcb3afe7218f5729fd5960_5e9f35ac57e85a4efa4745e6_20_04_21_18_04_53). Instead of trying to chase down the problem and figuring out how to set the exit code correctly, we should simply rely upon the workload statistics that the executor is required to output to determine success.

mongodb-labs / drivers-atlas-testing Goto Github PK

drivers-atlas-testing's People

Contributors

Stargazers

Watchers

Forkers

drivers-atlas-testing's Issues

Recommend Projects

Recommend Topics

Recommend Org