

workscripts's Issues

Add auth support for direct to shard connections

When making a direct shard connection with auth/TLS enabled, and providing a URI of this form for the cluster connection:

mongodb://HOST:PORT/?tls=true&tlsCertificateKeyFile=/etc/ssl/mongo-key.pem&tlsCAFile=/etc/ssl/mongo-ca.pem&authSource=$external&authMechanism=MONGODB-X509

The problem is that these options are not passed along when initiating the connection to each shard.

The errors on the shard's primary look like this:

2021-11-17T07:59:18.202+0000 I NETWORK  [conn313260] Error receiving request from client: SSLHandshakeFailed: The server is configured to only allow SSL connections. Ending connection from 10.69.12.111:40512 (connection id: 313260)
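A minimal stdlib-only sketch of the fix: reuse the query options (tls, certificates, auth mechanism) of the cluster connection string when building each direct shard URI. The helper name is hypothetical, not the script's actual API.

```python
from urllib.parse import urlsplit

def shard_uri_with_cluster_options(cluster_uri, shard_host):
    """Build a direct-connection URI to shard_host, forwarding the query
    options (tls=..., authMechanism=..., etc.) of the cluster URI."""
    query = urlsplit(cluster_uri).query
    uri = f"mongodb://{shard_host}/"
    return f"{uri}?{query}" if query else uri
```

With the URI from the report, the TLS and X.509 options would then reach the shard connection as well.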

Accept user/password as arguments of `make_direct_shard_connection`

The function currently opens a direct connection to a shard with a URI that does not include a username and password. While this works for test clusters with no authentication enabled, it causes errors in production clusters.

For example, this error is thrown when trying to call splitVector through a direct connection:

Traceback (most recent call last):
  File "defragment_sharded_collection.py", line 1078, in <module>
    loop.run_until_complete(main(args))
  File "/usr/lib64/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "defragment_sharded_collection.py", line 959, in main
    await asyncio.gather(*tasks)
  File "defragment_sharded_collection.py", line 938, in split_oversized_chunks
    await coll.split_chunk(c, target_chunk_size_kb, conn)
  File "defragment_sharded_collection.py", line 112, in split_chunk
    }, codec_options=self.cluster.client.codec_options)
  File "/usr/lib64/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pymongo/database.py", line 734, in command
    **kwargs,
  File "/usr/local/lib64/python3.7/site-packages/pymongo/database.py", line 615, in _command
    client=self.__client,
  File "/usr/local/lib64/python3.7/site-packages/pymongo/pool.py", line 764, in command
    exhaust_allowed=exhaust_allowed,
  File "/usr/local/lib64/python3.7/site-packages/pymongo/network.py", line 164, in command
    parse_write_concern_error=parse_write_concern_error,
  File "/usr/local/lib64/python3.7/site-packages/pymongo/helpers.py", line 180, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: command splitVector requires authentication, full error: {'operationTime': Timestamp(1658204283, 1), 'ok': 0.0, 'errmsg': 'command splitVector requires authentication', 'code': 13, 'codeName': 'Unauthorized', 'lastCommittedOpTime': Timestamp(1658204283, 1), '$clusterTime': {'clusterTime': Timestamp(1658204271, 1), 'signature': xx'_MANUALLY_OMITTED_xx'
}}

As a workaround, one user hardcoded the username and password and everything worked:

diff common.py common.py.bak
131c131
< uri = 'mongodb://mongoadmin:password@' + conn_parts[1]
---
> uri = 'mongodb://' + conn_parts[1]
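Instead of hardcoding, `make_direct_shard_connection` could take optional credentials and embed them in the URI. A sketch (the function name mirrors the issue title; the implementation is an assumption):

```python
from urllib.parse import quote_plus

def build_shard_uri(host, username=None, password=None):
    """Build a direct shard URI, optionally with credentials.

    Credentials are percent-encoded so characters such as '@', ':' or '/'
    in a password do not corrupt the URI, as the MongoDB URI format requires.
    """
    if username and password:
        return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}"
    return f"mongodb://{host}"
```

When no credentials are supplied, the current unauthenticated behavior is preserved for test clusters.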

The script breaks if the shard key uses the UUID type

In my tests, the script does not work correctly when the shard key contains a UUID:

2022-03-10 13:24:09,213 [INFO] Bounds: [{'location': 'US', 'xdr_device_id': UUID('00000100-95b4-4058-ba9e-c32e0fab201f')}, {'location': 'US', 'xdr_device_id': UUID('52df6ff6-cbea-4425-83f9-942e055022a4')}]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61739/61739 [00:00<00:00, 182577.57 chunk/s]
Traceback (most recent call last):
  File "defragment_sharded_collection.py", line 1089, in <module>
    loop.run_until_complete(main(args))
  File "/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "defragment_sharded_collection.py", line 621, in main
    await asyncio.gather(*tasks)
  File "defragment_sharded_collection.py", line 569, in merge_chunks_on_shard
    await coll.merge_chunks(consecutive_chunks.batch)
  File "defragment_sharded_collection.py", line 137, in merge_chunks
    }, codec_options=self.cluster.client.codec_options)
  File "/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymongo/database.py", line 761, in command
    codec_options, session=session, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymongo/database.py", line 652, in _command
    client=self.__client)
  File "/usr/local/lib/python3.6/site-packages/pymongo/pool.py", line 721, in command
    exhaust_allowed=exhaust_allowed)
  File "/usr/local/lib/python3.6/site-packages/pymongo/network.py", line 163, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/usr/local/lib/python3.6/site-packages/pymongo/helpers.py", line 167, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Failed to commit chunk merge :: caused by :: could not merge chunks, shard atlas-11qc0x-shard-0 does not contain a sequence of chunks that exactly fills the range [{ location: "US", xdr_device_id: BinData(3, 0000010095B44058BA9EC32E0FAB201F) }, { location: "US", xdr_device_id: BinData(3, 52DF6FF6CBEA442583F9942E055022A4) }), full error: {'ok': 0.0, 'errmsg': 'Failed to commit chunk merge :: caused by :: could not merge chunks, shard atlas-11qc0x-shard-0 does not contain a sequence of chunks that exactly fills the range [{ location: "US", xdr_device_id: BinData(3, 0000010095B44058BA9EC32E0FAB201F) }, { location: "US", xdr_device_id: BinData(3, 52DF6FF6CBEA442583F9942E055022A4) })', 'code': 20, 'codeName': 'IllegalOperation', 'operationTime': Timestamp(1646879043, 1), '$clusterTime': {'clusterTime': Timestamp(1646879048, 2), 'signature': {'hash': b'\xf3s#Y\x07\xc73E\x83\xe0\x19\xb4\xc5\xa2\xd1\xfdvK\x10=', 'keyId': 7072997257389277206}}}

In this particular case the range in question is owned by shard-1, but the merge command complains about shard-0. Everything in my research points to the UUID encoding.
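The `BinData(3, ...)` in the error hints at a legacy UUID representation. A stdlib-only sketch of the underlying mismatch: BSON binary values carry a subtype byte (3 for legacy UUIDs, 4 for the standard representation), so two values with identical payload bytes but different subtypes are different BSON values, and bounds encoded with the "wrong" representation will not match what the config server stores.

```python
import uuid

u = uuid.UUID('00000100-95b4-4058-ba9e-c32e0fab201f')

# Hand-rolled BSON binary payloads: subtype byte followed by the 16 UUID bytes.
legacy   = bytes([0x03]) + u.bytes   # BinData(3, ...), as in the error above
standard = bytes([0x04]) + u.bytes   # BinData(4, ...)

assert legacy[1:] == standard[1:]    # same raw bytes...
assert legacy != standard            # ...but distinct BSON values
```

In PyMongo this is what the `uuid_representation` of the `CodecOptions` controls; reusing the cluster client's codec options when querying chunk bounds would keep both sides consistent.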

Make data size calculation threshold in phase 1 configurable

During phase 1 we aggregate chunks to merge into a batch. When the estimated batch size reaches 90% of the target chunk size, we calculate the real size of the range.

The estimated batch size is derived from the average chunk size of the collection on a specific shard. If we underestimate the batch size, we check the real data size too late, and by then the real batch size may already exceed the configured maximum chunk size.

The goal of this ticket is to introduce a new CLI parameter --phase_1_calc_size_threshold (0 <= value <= 1, default 0.9) to let the user decide how frequently the real size of the batch should be checked.
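A sketch of how the parameter could be wired up with argparse, rejecting out-of-range values at parse time (the validator name is illustrative):

```python
import argparse

def ratio(value):
    # Reject values outside [0, 1] early, at argument-parsing time.
    v = float(value)
    if not 0 <= v <= 1:
        raise argparse.ArgumentTypeError(f'{value} is not in [0, 1]')
    return v

parser = argparse.ArgumentParser()
parser.add_argument('--phase_1_calc_size_threshold', type=ratio, default=0.9,
                    help='fraction of the target chunk size at which the real '
                         'data size of a batch is measured (default: 0.9)')

args = parser.parse_args(['--phase_1_calc_size_threshold', '0.75'])
```

A lower threshold trades more dataSize calls for a smaller risk of overshooting the maximum chunk size.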

Defrag script crash on 4.4 due to unexpected balancer setting format.

On a 4.4.10 cluster:

./defragment_sharded_collection.py "mongodb://user:password@localhost:27017" --ns "POCDB.POCCOLL" --phases phase1
2021-11-05 10:53:45,047 [INFO] Starting with parameters: 'mongodb://user:password@localhost:27017 --ns POCDB.POCCOLL --phases phase1'
Server is running at FCV 4.4
Traceback (most recent call last):
  File "./defragment_sharded_collection.py", line 1069, in <module>
    loop.run_until_complete(main(args))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "./defragment_sharded_collection.py", line 222, in main
    if not args.dryrun and (balancer_doc is None or balancer_doc['mode'] != 'off'):
KeyError: 'mode'

#----------------

Renatos-MBP-2(mongos-4.4.10)[mongos] config> db.settings.find()
{
  "_id": "balancer",
  "stopped": true
}
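A defensive sketch that accepts both observed formats of the balancer settings document; which server versions write which field is an assumption based on the 4.4.10 output above, where only the legacy `stopped` field is present.

```python
def balancer_is_off(balancer_doc):
    """Return True if the balancer is disabled.

    Accepts both {'mode': 'off'} and the legacy {'stopped': true} shape
    seen in config.settings on the 4.4.10 cluster above.
    """
    if balancer_doc is None:
        return False
    if 'mode' in balancer_doc:
        return balancer_doc['mode'] == 'off'
    return balancer_doc.get('stopped', False)
```

The crashing check `balancer_doc['mode'] != 'off'` would then become `not balancer_is_off(balancer_doc)`.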

Add option to prevent writes to config collections

The defragmentation script currently uses direct writes to config.chunks to store the chunk size estimations. Since Atlas clusters do not allow direct writes to config collections (such as config.chunks), the script cannot run there.

We should add a CLI parameter that can be used by Atlas customers to prevent writing to config collections.
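One possible shape for the change, sketched with a hypothetical `--no-config-writes` flag: route estimations through a small store that keeps them in process memory instead of persisting them to config.chunks.

```python
class EstimationStore:
    """Keeps chunk size estimations either in config.chunks (today's
    behavior) or in an in-process dict when config writes are forbidden."""

    def __init__(self, no_config_writes):
        self.no_config_writes = no_config_writes
        self._local = {}

    def save(self, chunk_id, size_kb):
        if self.no_config_writes:
            self._local[chunk_id] = size_kb  # Atlas-safe: no config writes
        else:
            ...  # update config.chunks as the script does today

    def load(self, chunk_id):
        return self._local.get(chunk_id)
```

The trade-off is that in-memory estimations do not survive a script restart, so phases could no longer be resumed separately in this mode.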

Codec option is breaking the updates on `config.chunks`

Tested on version 4.0 with FCV 3.6:

2021-12-13 16:45:50,415 [WARNING] Error Chunk 
[
   {
      'OWNER_ID':'<string_ID>',
      '_id':UUID('3840ff81-7bc2-31e9-a370-926ae1c334b8')
   },
   {
      'OWNER_ID':'<string_ID>',
      '_id':UUID('434948f4-10e1-8aa1-9763-f4b105ebd380')
   }
]
wasn't updated:
{
   'nModified':0,
   'n':0,
   'opTime':{
      'ts':Timestamp(1639413950, 16),
      't':1
   },
   'electionId':ObjectId('7fffffff0000000000000001'),
   'ok':1.0,
   'operationTime':Timestamp(1639413950, 6),
   '$clusterTime':{ 
      'clusterTime':Timestamp(1639413950, 16),
      'signature':{
         'hash':b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
         'keyId':0
      }
   },
   'updatedExisting':False
}
occurred while writing the chunk size

Defragmentation script stuck on overlapping range deleter task

Nothing prevents the script from moving a range back and forth between two shards. If this happens, the second migration gets stuck waiting for the range deleter of the first migration to complete; since the default range deleter task execution delay is 15 minutes, the script can stall for 15 minutes.
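A sketch of one way to prevent the ping-pong (the bookkeeping is hypothetical): remember the donor of each range's last migration and refuse to move the range straight back, so a migration never waits on the range deleter of its own predecessor.

```python
recently_donated = {}  # range_key -> donor shard of the last migration

def may_move(range_key, source, destination):
    """Allow a migration unless it would send the range back to the shard
    that just donated it (and whose range deleter may still be pending)."""
    if recently_donated.get(range_key) == destination:
        return False  # would move the range back onto its former owner
    recently_donated[range_key] = source
    return True
```

Entries could be aged out after the range deleter delay has safely elapsed.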

Chunks without size estimation let Phase 2 exit too early

The script stops processing the current shard when it encounters a large chunk, or a chunk without defrag_collection_est_size set (accounted as 0) for which the actual dataSize check returns a larger value.

Chunks with an estimated size of 0, or with no size estimate at all, should be first in line for processing.
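The proposed ordering can be sketched as a simple sort on the (possibly missing) estimation, so chunks accounted as size 0 are handled before a late dataSize check can end the shard's pass early:

```python
def processing_order(chunks):
    # Chunks with no defrag_collection_est_size are treated as size 0 and
    # therefore sort to the front of the processing queue.
    return sorted(chunks, key=lambda c: c.get('defrag_collection_est_size', 0))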

Phase 2 leaves chunk entries without `defrag_collection_est_size`

After running phase 1, all chunk entries are guaranteed to have defrag_collection_est_size. This is not true for phase 2: after running it, some chunks are missing defrag_collection_est_size, which makes it impossible to run phase 3 separately.

As an example:

db.getSiblingDB('config').chunks.find({ns: 'test.coll', defrag_collection_est_size: {$exists: false}}).sort({shard: 1, min: 1}).count()
72

[defragmentation script] Handle merges that fail due to LockBusy errors

When a merge fails with a LockBusy error, we currently just log the event. However, the script still assumes the merge succeeded and later fails when trying to move a chunk that no longer exists.

Example:

2021-12-13 16:45:10,389 [WARNING] Lock error occurred while trying to merge chunk range [{'_id': MinKey()}, {'_id': -9177112346776393672}].
                                        This indicates the presence of an older MongoDB version.
2021-12-13 16:45:10,390 [WARNING] Lock error occurred while trying to merge chunk range [{'_id': -3039780283654318710}, {'_id': -3033469910773206085}].
                                        This indicates the presence of an older MongoDB version.
2021-12-13 16:45:10,567 [WARNING] Lock error occurred while trying to merge chunk range [{'_id': -2941652387631697706}, {'_id': -2878518760475703397}].
                                        This indicates the presence of an older MongoDB version.
2021-12-13 16:45:16,808 [INFO] Collection size 2.4GiB. Avg chunk size Phase I 9.2MiB
2021-12-13 16:45:16,808 [INFO] Number chunks on shard defrag_hashed-rs0:      92  Data-Size:  841.4MiB
2021-12-13 16:45:16,808 [INFO] Number chunks on shard defrag_hashed-rs2:      92  Data-Size:  823.4MiB
2021-12-13 16:45:16,808 [INFO] Number chunks on shard defrag_hashed-rs1:      89  Data-Size:  835.6MiB
2021-12-13 16:45:16,808 [INFO] Phase II: Moving and merging small chunks
2021-12-13 16:45:16,808 [INFO] Number of small chunks: 8, Number of chunks with unkown size: 0
Phase II: iteration 1. Remainging chunks to process 8, total chunks 273                                                                                                              
Moving small chunks off shard defrag_hashed-rs0                                                                                                                                      
Moving small chunks off shard defrag_hashed-rs1                                                                                                                                      
Moving small chunks off shard defrag_hashed-rs2                                                                                                                                      
 50%|██████████████████████████████████████████████████████████████████████▌                                                                      | 4/8 [00:01<00:01,  2.11 chunks/s]
Traceback (most recent call last):
  File "workscripts/ctools/defragment_sharded_collection.py", line 1060, in <module>
    loop.run_until_complete(main(args))
  File "/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/python3-v3.Aah/lib/python3.7/asyncio/base_events.py", line 568, in run_until_complete
    return future.result()
  File "workscripts/ctools/defragment_sharded_collection.py", line 906, in main
    total_moved_data_kb = await phase_2()
  File "workscripts/ctools/defragment_sharded_collection.py", line 833, in phase_2
    moved_data_kb += await move_merge_chunks_by_size(shard_id, progress)
  File "workscripts/ctools/defragment_sharded_collection.py", line 776, in move_merge_chunks_by_size
    await coll.move_chunk(c, target_shard)
  File "workscripts/ctools/defragment_sharded_collection.py", line 128, in move_chunk
    }, codec_options=self.cluster.client.codec_options)
  File "/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/python3-v3.Aah/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/database.py", line 740, in command
    codec_options, session=session, **kwargs)
  File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/database.py", line 637, in _command
    client=self.__client)
  File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/pool.py", line 694, in command
    exhaust_allowed=exhaust_allowed)
  File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/network.py", line 161, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/helpers.py", line 160, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: no chunk found with the shard key bounds [{ _id: -3039780283654318710 }, { _id: -3033469910773206085 }), full error: {'ok': 0.0, 'errmsg': 'no chunk found with the shard key bounds [{ _id: -3039780283654318710 }, { _id: -3033469910773206085 })', 'operationTime': Timestamp(1639413918, 93), '$clusterTime': {'clusterTime': Timestamp(1639413918, 93), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}}
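A sketch of the fix under the assumption that LockBusy merges are transient: retry the merge with capped exponential backoff instead of logging and assuming success. `run_merge` is a hypothetical coroutine issuing the mergeChunks command; a stdlib stand-in replaces pymongo's exception so the sketch is self-contained.

```python
import asyncio
import random

LOCK_BUSY = 46  # MongoDB's LockBusy error code

class OperationFailure(Exception):
    """Stand-in for pymongo.errors.OperationFailure, kept stdlib-only."""
    def __init__(self, msg, code):
        super().__init__(msg)
        self.code = code

async def merge_with_retry(run_merge, attempts=5, base_delay=1.0):
    # Retry only on LockBusy; any other failure (or exhausting the
    # attempts) is re-raised so the script never treats it as a success.
    for attempt in range(attempts):
        try:
            return await run_merge()
        except OperationFailure as exc:
            if exc.code != LOCK_BUSY or attempt == attempts - 1:
                raise
            await asyncio.sleep(min(base_delay * 2 ** attempt, 30)
                                + random.random() * base_delay)
```

Alternatively, a failed merge could simply mark the batch as unmerged so the later move step does not look for a chunk that was never created.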

Collect phase 3 stats

For each shard, print:

  • How many splits have been performed (increment global counter in split_chunk)
  • How many chunks the shard owns
  • The average chunk size

For the collection, print:

  • Total number of chunks
  • Average chunk size
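The requested stats could be aggregated as follows; field names (`shard`, `defrag_collection_est_size`) follow the script's conventions, and the per-shard split counters are assumed to be collected by `split_chunk`.

```python
from collections import defaultdict

def phase_3_stats(chunks, splits_per_shard):
    """Aggregate per-shard and collection-wide numbers from chunk documents.

    Returns (per_shard dict, total chunk count, average chunk size in KB).
    """
    per_shard = defaultdict(lambda: {'chunks': 0, 'size_kb': 0})
    for c in chunks:
        s = per_shard[c['shard']]
        s['chunks'] += 1
        s['size_kb'] += c.get('defrag_collection_est_size', 0)
    for shard, s in per_shard.items():
        s['splits'] = splits_per_shard.get(shard, 0)
        s['avg_chunk_kb'] = s['size_kb'] / s['chunks']
    total_chunks = sum(s['chunks'] for s in per_shard.values())
    total_kb = sum(s['size_kb'] for s in per_shard.values())
    return per_shard, total_chunks, (total_kb / total_chunks if total_chunks else 0)
```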

The script should have a hard cap on the number of chunks it tries to merge at a time to avoid hitting 16 MB BSON size limit

With a large number of mostly empty chunks, the script can fail by exceeding the BSON size limit, because it attempts to merge too many chunks at once:

pymongo.errors.OperationFailure: Failed to commit chunk merge :: caused by :: BSONObj size: 20122654 (0x1330C1E) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { ns: "config.chunks", q: { query: { ... } }, full error: {'ok': 0.0, 'errmsg': 'Failed to commit chunk merge :: caused by :: BSONObj size: 20122654 (0x1330C1E) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { ns: "config.chunks", q: { query: { ... } }', 'code': 10334, 'codeName': 'BSONObjectTooLarge', 'operationTime': Timestamp(1646363925, 102), '$clusterTime': {'clusterTime': Timestamp(1646363925, 104), 'signature': {'hash': b'N*^\x19\xa3\x99\xa8\teG\x82\x1a2\xcf\x9a\xa7\xf9=\xf5\xce', 'keyId': 7010451572802978420}}}

It's probably best to introduce a hard cap; when the cap is exceeded, the merge should be split across multiple iterations.
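The cap can be a simple batching of consecutive chunks; the default of 1000 below is an illustrative guess, not a measured bound against the 16 MB limit.

```python
def merge_batches(chunks, max_batch=1000):
    """Yield consecutive runs of at most max_batch chunks so a single
    mergeChunks commit stays well under the 16 MB BSON document limit."""
    for i in range(0, len(chunks), max_batch):
        yield chunks[i:i + max_batch]
```

Each yielded run would then be merged in its own iteration, keeping every commit bounded regardless of how many empty chunks the range contains.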
