kaloianm / workscripts
Miscellaneous scripts to improve my MongoDB work productivity
License: The Unlicense
When making a direct shard connection while using Auth/TLS, and providing a URI of this form for the cluster connection:
mongodb://HOST:PORT/?tls=true&tlsCertificateKeyFile=/etc/ssl/mongo-key.pem&tlsCAFile=/etc/ssl/mongo-ca.pem&authSource=$external&authMechanism=MONGODB-X509
the problem is that these options are not passed along when initiating a connection to each shard.
The errors on the shard's primary look like this:
2021-11-17T07:59:18.202+0000 I NETWORK [conn313260] Error receiving request from client: SSLHandshakeFailed: The server is configured to only allow SSL connections. Ending connection from 10.69.12.111:40512 (connection id: 313260)
The function currently opens a direct connection to a shard with a URI that does not include the username and password. While this works for test clusters with no authentication enabled, it causes errors in production clusters.
For example, this error is thrown when trying to call splitVector through a direct connection:
Traceback (most recent call last):
File "defragment_sharded_collection.py", line 1078, in <module>
loop.run_until_complete(main(args))
File "/usr/lib64/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "defragment_sharded_collection.py", line 959, in main
await asyncio.gather(*tasks)
File "defragment_sharded_collection.py", line 938, in split_oversized_chunks
await coll.split_chunk(c, target_chunk_size_kb, conn)
File "defragment_sharded_collection.py", line 112, in split_chunk
}, codec_options=self.cluster.client.codec_options)
File "/usr/lib64/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib64/python3.7/site-packages/pymongo/database.py", line 734, in command
**kwargs,
File "/usr/local/lib64/python3.7/site-packages/pymongo/database.py", line 615, in _command
client=self.__client,
File "/usr/local/lib64/python3.7/site-packages/pymongo/pool.py", line 764, in command
exhaust_allowed=exhaust_allowed,
File "/usr/local/lib64/python3.7/site-packages/pymongo/network.py", line 164, in command
parse_write_concern_error=parse_write_concern_error,
File "/usr/local/lib64/python3.7/site-packages/pymongo/helpers.py", line 180, in _check_command_response
raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: command splitVector requires authentication, full error: {'operationTime': Timestamp(1658204283, 1), 'ok': 0.0, 'errmsg': 'command splitVector requires authentication', 'code': 13, 'codeName': 'Unauthorized', 'lastCommittedOpTime': Timestamp(1658204283, 1), '$clusterTime': {'clusterTime': Timestamp(1658204271, 1), 'signature': xx'_MANUALLY_OMITTED_xx'
}}
A user hardcoded the username and password and everything worked:
diff common.py common.py.bak
131c131
< uri = 'mongodb://mongoadmin:password@' + conn_parts[1]
---
> uri = 'mongodb://' + conn_parts[1]
In my tests the script is unable to work correctly when the shard key uses a UUID:
2022-03-10 13:24:09,213 [INFO] Bounds: [{'location': 'US', 'xdr_device_id': UUID('00000100-95b4-4058-ba9e-c32e0fab201f')}, {'location': 'US', 'xdr_device_id': UUID('52df6ff6-cbea-4425-83f9-942e055022a4')}]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61739/61739 [00:00<00:00, 182577.57 chunk/s]
Traceback (most recent call last):
File "defragment_sharded_collection.py", line 1089, in <module>
loop.run_until_complete(main(args))
File "/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "defragment_sharded_collection.py", line 621, in main
await asyncio.gather(*tasks)
File "defragment_sharded_collection.py", line 569, in merge_chunks_on_shard
await coll.merge_chunks(consecutive_chunks.batch)
File "defragment_sharded_collection.py", line 137, in merge_chunks
}, codec_options=self.cluster.client.codec_options)
File "/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.6/site-packages/pymongo/database.py", line 761, in command
codec_options, session=session, **kwargs)
File "/usr/local/lib/python3.6/site-packages/pymongo/database.py", line 652, in _command
client=self.__client)
File "/usr/local/lib/python3.6/site-packages/pymongo/pool.py", line 721, in command
exhaust_allowed=exhaust_allowed)
File "/usr/local/lib/python3.6/site-packages/pymongo/network.py", line 163, in command
parse_write_concern_error=parse_write_concern_error)
File "/usr/local/lib/python3.6/site-packages/pymongo/helpers.py", line 167, in _check_command_response
raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Failed to commit chunk merge :: caused by :: could not merge chunks, shard atlas-11qc0x-shard-0 does not contain a sequence of chunks that exactly fills the range [{ location: "US", xdr_device_id: BinData(3, 0000010095B44058BA9EC32E0FAB201F) }, { location: "US", xdr_device_id: BinData(3, 52DF6FF6CBEA442583F9942E055022A4) }), full error: {'ok': 0.0, 'errmsg': 'Failed to commit chunk merge :: caused by :: could not merge chunks, shard atlas-11qc0x-shard-0 does not contain a sequence of chunks that exactly fills the range [{ location: "US", xdr_device_id: BinData(3, 0000010095B44058BA9EC32E0FAB201F) }, { location: "US", xdr_device_id: BinData(3, 52DF6FF6CBEA442583F9942E055022A4) })', 'code': 20, 'codeName': 'IllegalOperation', 'operationTime': Timestamp(1646879043, 1), '$clusterTime': {'clusterTime': Timestamp(1646879048, 2), 'signature': {'hash': b'\xf3s#Y\x07\xc73E\x83\xe0\x19\xb4\xc5\xa2\xd1\xfdvK\x10=', 'keyId': 7072997257389277206}}}
In this particular case the range in question is owned by shard-1, but the merge command complains about shard-0. In my research, everything points to UUID encoding.
Currently, phase 2 of the defragmentation script doesn't take zones (tags) into account when moving chunks, so until we properly support zones we should abort phase 2.
During phase 1 we aggregate chunks to merge into a batch. When the estimated batch size reaches 90% of the target chunk size, we calculate the real size of the range.
The estimated batch size is derived from the average chunk size of the collection on a specific shard. So if we underestimate the size of a batch, we will check the real data size too late, and by that time the real batch size could be bigger than the configured maximum chunk size.
The goal of this ticket is to introduce a new CLI parameter --phase_1_calc_size_threshold (0 <= value <= 1, default 0.9) to allow the user to decide how frequently we should check the real size of the batch.
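A minimal sketch of how such a parameter could be wired up with argparse. The parameter name matches the ticket; the validator function and its name are assumptions, not the script's actual code:

```python
import argparse

def ratio_in_unit_interval(value):
    """Validate that the argument is a float in [0, 1] (hypothetical helper)."""
    ratio = float(value)
    if not 0 <= ratio <= 1:
        raise argparse.ArgumentTypeError(f'{value} is not in the range [0, 1]')
    return ratio

parser = argparse.ArgumentParser()
parser.add_argument(
    '--phase_1_calc_size_threshold',
    type=ratio_in_unit_interval,
    default=0.9,
    help='Fraction of the target chunk size at which the estimated batch size '
         'triggers a real dataSize check (0 <= value <= 1, default 0.9)')

# A lower threshold means the real batch size is verified earlier:
args = parser.parse_args(['--phase_1_calc_size_threshold', '0.5'])
```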
Phase 3 repeatedly calls the split_chunk function, which on every execution queries config.shards to fetch the shard's host.
Instead, we could cache this information and reuse it on all subsequent executions.
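One way the cache could look, sketched here with a stand-in for the config database (the function name and the fake objects are illustrative assumptions, not the script's API):

```python
from types import SimpleNamespace

# Cache shard hosts fetched from config.shards so repeated split_chunk
# invocations stop re-querying the config server.
_shard_host_cache = {}

def get_shard_host(config_db, shard_id):
    """Return the host string for shard_id, querying config.shards only once."""
    if shard_id not in _shard_host_cache:
        doc = config_db.shards.find_one({'_id': shard_id})
        _shard_host_cache[shard_id] = doc['host']
    return _shard_host_cache[shard_id]

# Demo against a fake config DB that counts how often it is queried:
calls = {'n': 0}
def _find_one(query):
    calls['n'] += 1
    return {'_id': query['_id'], 'host': query['_id'] + '/localhost:27018'}

fake_config = SimpleNamespace(shards=SimpleNamespace(find_one=_find_one))
host = get_shard_host(fake_config, 'shard-0')
host_again = get_shard_host(fake_config, 'shard-0')  # served from the cache
```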
Starting from MongoDB version 5.0, chunk entries in config.chunks contain only the collection UUID rather than the collection namespace. So until we add support for this, we should add a safety check to prevent running the script on those versions.
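The safety check could be as simple as comparing the cluster's FCV (which the script already appears to read, per the "Server is running at FCV 4.4" log line below) against 5.0 and aborting. This is a sketch; the function name and message are assumptions:

```python
def refuse_unsupported_fcv(fcv_str):
    """Abort when the cluster FCV is 5.0 or newer, where config.chunks
    entries are keyed by collection UUID instead of namespace.

    fcv_str is assumed to come from the admin getParameter command, e.g. '4.4'.
    """
    major, minor = (int(part) for part in fcv_str.split('.')[:2])
    if (major, minor) >= (5, 0):
        raise SystemExit(
            f'FCV {fcv_str}: config.chunks stores only collection UUIDs, '
            'which this script does not support yet')

refuse_unsupported_fcv('4.4')  # supported, returns normally
```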
On a 4.4.10 cluster:
./defragment_sharded_collection.py "mongodb://user:password@localhost:27017" --ns "POCDB.POCCOLL" --phases phase1
2021-11-05 10:53:45,047 [INFO] Starting with parameters: 'mongodb://user:password@localhost:27017 --ns POCDB.POCCOLL --phases phase1'
Server is running at FCV 4.4
Traceback (most recent call last):
File "./defragment_sharded_collection.py", line 1069, in <module>
loop.run_until_complete(main(args))
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "./defragment_sharded_collection.py", line 222, in main
if not args.dryrun and (balancer_doc is None or balancer_doc['mode'] != 'off'):
KeyError: 'mode'
#----------------
Renatos-MBP-2(mongos-4.4.10)[mongos] config> db.settings.find()
{
"_id": "balancer",
"stopped": true
}
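The KeyError happens because this cluster's balancer settings document has only a "stopped" field, while the script expects a "mode" field. A tolerant check could accept both shapes (a sketch; the helper name is an assumption):

```python
def balancer_is_off(balancer_doc):
    """Return True when the config.settings balancer document says the
    balancer is off. Some documents carry {'mode': 'off'}, while the
    cluster above carries {'stopped': true} instead."""
    if balancer_doc is None:
        return False
    if 'mode' in balancer_doc:
        return balancer_doc['mode'] == 'off'
    return bool(balancer_doc.get('stopped', False))

# The document from the failing 4.4.10 cluster no longer raises KeyError:
off = balancer_is_off({'_id': 'balancer', 'stopped': True})
```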
The defragmentation script currently uses direct writes to config.chunks to store the chunk size estimations. Since Atlas clusters do not allow direct writes to config collections (such as config.chunks), we can't really run the script there.
We should add a CLI parameter that can be used by Atlas customers to prevent writing to config collections.
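One shape the change could take: route the estimation either to config.chunks or to an in-memory map depending on the flag. Everything here (function name, parameter names, the flag's behaviour) is a hypothetical sketch:

```python
def store_chunk_size_estimation(chunk_id, size_kb, estimations, config_chunks=None):
    """Record a chunk size estimation.

    When config_chunks (the config.chunks collection) is provided, write the
    estimation there, which is the current behaviour. When it is None -- the
    path an assumed --no-config-writes flag would select on Atlas -- keep the
    estimation in the in-memory `estimations` map instead.
    """
    if config_chunks is not None:
        config_chunks.update_one(
            {'_id': chunk_id},
            {'$set': {'defrag_collection_est_size': size_kb}})
    else:
        estimations[chunk_id] = size_kb

# Demo of the in-memory path:
est = {}
store_chunk_size_estimation('chunk-1', 512, est)
```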
Tested on version 4.0 with FCV 3.6:
2021-12-13 16:45:50,415 [WARNING] Error Chunk
[
{
'OWNER_ID':'<string_ID>',
'_id':UUID('3840ff81-7bc2-31e9-a370-926ae1c334b8')
},
{
'OWNER_ID':'<string_ID>',
'_id':UUID('434948f4-10e1-8aa1-9763-f4b105ebd380')
}
]
wasn't updated:
{
'nModified':0,
'n':0,
'opTime':{
'ts':Timestamp(1639413950, 16),
't':1
},
'electionId':ObjectId('7fffffff0000000000000001'),
'ok':1.0,
'operationTime':Timestamp(1639413950, 6),
'$clusterTime':{
'clusterTime':Timestamp(1639413950, 16),
'signature':{
'hash':b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
'keyId':0
}
},
'updatedExisting':False
}
occurred while writing the chunk size
There is nothing that prevents the script from moving a range back and forth between two shards. If this happens, the second migration will get stuck waiting for the range deleter of the first migration to complete; since the default range deleter task execution delay is 15 minutes, the script will be stuck for 15 minutes.
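A cheap guard would be to remember where each range was last moved from and refuse to send it straight back. This is only a sketch of the idea; the names and the range representation (a hashable tuple of bounds) are assumptions:

```python
# Maps a chunk range (as a hashable tuple of bounds) to the shard it last left.
recent_moves = {}

def may_move(chunk_range, source_shard, dest_shard):
    """Refuse to move a range back to the shard it just left, which would
    block on that shard's pending range deleter task (15 min by default)."""
    if recent_moves.get(chunk_range) == dest_shard:
        return False
    recent_moves[chunk_range] = source_shard
    return True

first = may_move(('minKey', 0), 'rs0', 'rs1')   # allowed
second = may_move(('minKey', 0), 'rs1', 'rs0')  # ping-pong back: refused
```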
Currently the defragmentation script doesn't work for collections with a nested shard key, because we use frozenset(shardkey.items()) to compute a hash of the shard key. If the shard key is nested (not a flat dictionary), this doesn't work.
One possible solution would be to use the Python pickle library's representation to hash the shard keys.
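The failure and the proposed fix can be demonstrated in a few lines. Note one caveat with the pickle approach: the pickled bytes depend on the dictionary's key insertion order, so keys would need to be ordered canonically before hashing (the helper name below is an assumption):

```python
import pickle

def shard_key_hash(shard_key_doc):
    """Hash a shard key document, including nested ones, via its pickled
    representation. Caveat: equal documents must have the same key order."""
    return hash(pickle.dumps(shard_key_doc, protocol=4))

flat = {'location': 'US'}
nested = {'key': {'location': 'US', 'device': 42}}

# frozenset works for the flat key...
frozenset(flat.items())

# ...but raises TypeError for the nested one, because the inner dict
# is unhashable:
try:
    frozenset(nested.items())
    frozenset_worked = True
except TypeError:
    frozenset_worked = False

# The pickle-based hash handles both shapes:
h = shard_key_hash(nested)
```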
Kal,
Is there any good reason for not allowing phase 1 to be executed on a zoned collection? If there is none, then it should be allowed.
Dmitry
The script exits the shard currently being processed when it runs over a large chunk, or over a chunk that doesn't have defrag_collection_est_size set (accounted as 0) and for which the actual dataSize check returns a larger value.
Chunks estimated with a size of 0, or without any size estimate, will be first in line for processing.
After running phase 1 we have the guarantee that all chunk entries have defrag_collection_est_size set. This is not true for phase 2: after running it, some chunks are missing defrag_collection_est_size, which makes it impossible to run phase 3 separately.
As an example:
db.getSiblingDB('config').chunks.find({ns: 'test.coll', defrag_collection_est_size: {$exists: false}}).sort({shard: 1, min: 1}).count()
72
Improve phase 3 performance by sending one splitChunk command instead of invoking single splits.
In phase 2 we should process shards in descending order, starting from the one with the most chunks (probably also the one with the most small chunks).
When a merge fails due to a LockBusy error, we currently simply log the event. However, the script then assumes that the merge succeeded, and later fails when trying to move a non-existent chunk.
Example:
2021-12-13 16:45:10,389 [WARNING] Lock error occurred while trying to merge chunk range [{'_id': MinKey()}, {'_id': -9177112346776393672}].
This indicates the presence of an older MongoDB version.
2021-12-13 16:45:10,390 [WARNING] Lock error occurred while trying to merge chunk range [{'_id': -3039780283654318710}, {'_id': -3033469910773206085}].
This indicates the presence of an older MongoDB version.
2021-12-13 16:45:10,567 [WARNING] Lock error occurred while trying to merge chunk range [{'_id': -2941652387631697706}, {'_id': -2878518760475703397}].
This indicates the presence of an older MongoDB version.
2021-12-13 16:45:16,808 [INFO] Collection size 2.4GiB. Avg chunk size Phase I 9.2MiB
2021-12-13 16:45:16,808 [INFO] Number chunks on shard defrag_hashed-rs0: 92 Data-Size: 841.4MiB
2021-12-13 16:45:16,808 [INFO] Number chunks on shard defrag_hashed-rs2: 92 Data-Size: 823.4MiB
2021-12-13 16:45:16,808 [INFO] Number chunks on shard defrag_hashed-rs1: 89 Data-Size: 835.6MiB
2021-12-13 16:45:16,808 [INFO] Phase II: Moving and merging small chunks
2021-12-13 16:45:16,808 [INFO] Number of small chunks: 8, Number of chunks with unkown size: 0
Phase II: iteration 1. Remainging chunks to process 8, total chunks 273
Moving small chunks off shard defrag_hashed-rs0
Moving small chunks off shard defrag_hashed-rs1
Moving small chunks off shard defrag_hashed-rs2
50%|██████████████████████████████████████████████████████████████████████▌ | 4/8 [00:01<00:01, 2.11 chunks/s]
Traceback (most recent call last):
File "workscripts/ctools/defragment_sharded_collection.py", line 1060, in <module>
loop.run_until_complete(main(args))
File "/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/python3-v3.Aah/lib/python3.7/asyncio/base_events.py", line 568, in run_until_complete
return future.result()
File "workscripts/ctools/defragment_sharded_collection.py", line 906, in main
total_moved_data_kb = await phase_2()
File "workscripts/ctools/defragment_sharded_collection.py", line 833, in phase_2
moved_data_kb += await move_merge_chunks_by_size(shard_id, progress)
File "workscripts/ctools/defragment_sharded_collection.py", line 776, in move_merge_chunks_by_size
await coll.move_chunk(c, target_shard)
File "workscripts/ctools/defragment_sharded_collection.py", line 128, in move_chunk
}, codec_options=self.cluster.client.codec_options)
File "/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/python3-v3.Aah/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/database.py", line 740, in command
codec_options, session=session, **kwargs)
File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/database.py", line 637, in _command
client=self.__client)
File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/pool.py", line 694, in command
exhaust_allowed=exhaust_allowed)
File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/network.py", line 161, in command
parse_write_concern_error=parse_write_concern_error)
File "/home/ubuntu/test-defrag/.defrag-venv/lib/python3.7/site-packages/pymongo/helpers.py", line 160, in _check_command_response
raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: no chunk found with the shard key bounds [{ _id: -3039780283654318710 }, { _id: -3033469910773206085 }), full error: {'ok': 0.0, 'errmsg': 'no chunk found with the shard key bounds [{ _id: -3039780283654318710 }, { _id: -3033469910773206085 })', 'operationTime': Timestamp(1639413918, 93), '$clusterTime': {'clusterTime': Timestamp(1639413918, 93), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}}
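Rather than logging and carrying on, the merge could be retried with backoff while the lock is busy. The sketch below uses a stand-in exception class so it is self-contained; in the real script it would be pymongo's OperationFailure, whose `code` attribute carries the server error code (46 is MongoDB's LockBusy code). The coroutine names are assumptions:

```python
import asyncio

LOCK_BUSY = 46  # MongoDB server error code for LockBusy

class CommandError(Exception):
    """Stand-in for pymongo.errors.OperationFailure, carrying the error code."""
    def __init__(self, code):
        super().__init__(f'error code {code}')
        self.code = code

async def merge_with_retry(run_merge, max_attempts=5, delay_secs=0.01):
    """Retry the merge on LockBusy instead of assuming it succeeded;
    re-raise any other failure, or LockBusy after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await run_merge()
        except CommandError as ex:
            if ex.code != LOCK_BUSY or attempt == max_attempts:
                raise
            await asyncio.sleep(delay_secs * attempt)  # linear backoff

# Demo: a merge that hits LockBusy twice, then succeeds on the third try.
attempts = {'n': 0}
async def fake_merge():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise CommandError(LOCK_BUSY)
    return 'merged'

result = asyncio.run(merge_with_retry(fake_merge))
```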
For each shard, print: split_chunk
For the collection, print:
It has been observed that, with a large number of mostly empty chunks, the script can fail due to exceeding the BSON size limit. The issue arises from the script attempting to merge too many chunks at once:
pymongo.errors.OperationFailure: Failed to commit chunk merge :: caused by :: BSONObj size: 20122654 (0x1330C1E) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { ns: "config.chunks", q: { query: { ... } }, full error: {'ok': 0.0, 'errmsg': 'Failed to commit chunk merge :: caused by :: BSONObj size: 20122654 (0x1330C1E) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { ns: "config.chunks", q: { query: { ... } }', 'code': 10334, 'codeName': 'BSONObjectTooLarge', 'operationTime': Timestamp(1646363925, 102), '$clusterTime': {'clusterTime': Timestamp(1646363925, 104), 'signature': {'hash': b'N*^\x19\xa3\x99\xa8\teG\x82\x1a2\xcf\x9a\xa7\xf9=\xf5\xce', 'keyId': 7010451572802978420}}}
It's probably best to introduce a hard cap which, if exceeded, will require multiple merge iterations
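The hard cap could be implemented by slicing a long run of consecutive chunks into bounded batches and committing one merge per batch. The cap value and names below are illustrative assumptions; the real number would be tuned so the mergeChunks commit stays well under the 16 MB BSON limit:

```python
MAX_CHUNKS_PER_MERGE = 1000  # assumed cap, to be tuned against the 16 MB limit

def merge_batches(consecutive_chunks):
    """Yield bounded slices of a run of consecutive chunks, so each
    mergeChunks commit stays below the BSON size limit."""
    for i in range(0, len(consecutive_chunks), MAX_CHUNKS_PER_MERGE):
        batch = consecutive_chunks[i:i + MAX_CHUNKS_PER_MERGE]
        if len(batch) > 1:  # merging a single chunk would be a no-op
            yield batch

# 2500 consecutive chunks become three merge iterations: 1000 + 1000 + 500.
batches = list(merge_batches(list(range(2500))))
```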
Add command line arguments to introduce delays between merges during phase 1 and between splits during phase 3
A chunk could be moved to its right neighbour when it is larger in size but resides on an "overloaded" shard (compared to the destination one).