MongoDB GridFS <-> Amazon S3 copy script
This Perl script can be used to copy files stored in MongoDB GridFS to Amazon S3 and vice versa.
Usage
- Copy configuration template file
gridfs-to-s3.yml.dist
togridfs-to-s3.yml
. - Edit configuration file
gridfs-to-s3.yml
to set the MongoDB and GridFS connection settings and other properties. - To copy files from GridFS to S3, run:
./copy-gridfs-to-s3.pl gridfs-to-s3.yml
- To copy files from S3 to GridFS, run:
./copy-s3-to-gridfs.pl gridfs-to-s3.yml
Requirements
Implementation details
Sorting in GridFS
When copying from GridFS to S3, the script will sort files roughly by insertion date. The script uses object's ObjectId
for sorting. As the default ObjectId
contains an insertion timestamp (rounded to seconds), the objects are sorted in 1 second precision.
For example, if your GridFS deployment contains the following files:
> db.fs.files.find( { }, { filename: 1, uploadDate: 1 } )
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "1" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "2" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "3" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "4" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "5" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "6" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "7" }
{ "uploadDate" : ISODate("2013-06-11T08:00:00Z"), "filename" : "8" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "9" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "10" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "11" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "12" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "13" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "14" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "15" }
{ "uploadDate" : ISODate("2013-06-11T08:00:01Z"), "filename" : "16" }
...the script will copy files 1-8
in any order (because they were inserted during the same second), but will upload files 1-8
before uploading files 9-16
(because the latter were inserted in the next second).
Sorting in S3
When copying from S3 to GridFS, the script will sort files in lexicographical order.
For example, if your S3 bucket contains the following files:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...these files will be copied from S3 to GridFS in the following order:
1
10
11
12
13
14
15
16
2
3
4
5
6
7
8
9
This is because lexicographical order is the only one Amazon supports for sorting S3 contents (at the time of writing).
Unit tests
Execute:
GRIDFS_TO_S3_CONFIG=gridfs-to-s3.yml prove
to run unit tests provided with the script.
The test t/copy_gridfs_to_s3_to_gridfs.t
will create and afterwards remove:
- two temporary Mongo GridFS databases with the names such as
gridfs-to-s3_testing_source_33jdgSNyNWLykIXi
andgridfs-to-s3_testing_destination_uBnqHweeexioHcXn
- one temporary Amazon S3 bucket with the name such as
gridfs-to-s3.testing.PGW22CuIuGbMQKMpTFqZsgqa5Th66yXB
. - a lock file, "last filename copied to S3" file and "last filename copied to GridFS" file, all appended with a randomly generated string.
Known bugs
recv timed out
(probably fixed) After copying a bunch of files from GridFS (MongoDB v2.2.3) to S3, one of the workers crashes with:
2013-07-03 10:51:21,402 [1105]: Backed up file '272460570'
2013-07-03 10:51:21,402 [1326]: Copying '272462172'...
2013-07-03 10:51:21,403 [1105]: Backed up file '272459804'
2013-07-03 10:51:21,403 [1325]: Copying '272462482'...
2013-07-03 10:51:21,693 [1105]: Backed up file '272461435'
2013-07-03 10:51:21,693 [1316]: Copying '272462812'...
2013-07-03 10:51:21,693 [1105]: Backed up file '272459683'
2013-07-03 10:51:21,693 [1338]: Copying '272462804'...
2013-07-03 10:51:40,279 [1105]: Job error: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,279 [1331]: Copying '272459981'...
2013-07-03 10:51:40,293 [1334]: Copying '272462362'...
Job error: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,296 [1315]: Copying '272462821'...
2013-07-03 10:51:40,300 [1310]: Copying '272461460'...
2013-07-03 10:51:40,307 [1303]: Attempt to read from '272462430' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,307 [1330]: Attempt to read from '272461745' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,307 [1303]: Retrying (1)...
2013-07-03 10:51:40,307 [1330]: Retrying (1)...
2013-07-03 10:51:40,312 [1318]: Copying '272461481'...
2013-07-03 10:51:40,316 [1302]: Copying '272462467'...
2013-07-03 10:51:40,323 [1335]: Copying '272462325'...
2013-07-03 10:51:40,323 [1313]: Attempt to read from '272461173' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,323 [1313]: Retrying (1)...
2013-07-03 10:51:40,326 [1309]: Copying '272460363'...
2013-07-03 10:51:40,328 [1314]: Attempt to read from '272462662' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,328 [1314]: Retrying (1)...
2013-07-03 10:51:40,330 [1321]: Copying '272462510'...
2013-07-03 10:51:40,332 [1306]: Attempt to read from '272461681' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,332 [1306]: Retrying (1)...
2013-07-03 10:51:40,332 [1332]: Copying '272459389'...
2013-07-03 10:51:40,334 [1311]: Copying '272461918'...
2013-07-03 10:51:40,336 [1304]: Copying '272462877'...
2013-07-03 10:51:40,337 [1329]: Copying '272460180'...
2013-07-03 10:51:40,343 [1327]: Attempt to read from '272462757' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,343 [1327]: Retrying (1)...
2013-07-03 10:51:40,343 [1320]: Copying '272461466'...
2013-07-03 10:51:40,346 [1305]: Attempt to read from '272459417' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,346 [1305]: Retrying (1)...
2013-07-03 10:51:40,347 [1307]: Copying '272462700'...
2013-07-03 10:51:40,348 [1323]: Attempt to read from '272462724' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,348 [1323]: Retrying (1)...
2013-07-03 10:51:40,349 [1319]: Copying '272461418'...
2013-07-03 10:51:40,357 [1333]: Attempt to read from '272462437' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,357 [1333]: Retrying (1)...
2013-07-03 10:51:40,389 [1337]: Attempt to read from '272462655' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,389 [1337]: Retrying (1)...
2013-07-03 10:51:40,422 [1336]: Attempt to read from '272460972' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,423 [1336]: Retrying (1)...
2013-07-03 10:51:40,427 [1326]: Attempt to read from '272462172' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,427 [1326]: Retrying (1)...
2013-07-03 10:51:40,428 [1322]: Attempt to read from '272459402' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,429 [1322]: Retrying (1)...
2013-07-03 10:51:40,438 [1325]: Attempt to read from '272462482' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,438 [1325]: Retrying (1)...
boss read from app: sysread: Connection reset by peer at <...>/GridFSToS3.pm line 208
at <...>/GridFSToS3.pm line 208
at <...>/GridFSToS3.pm line 208
2013-07-03 10:51:40,691 [1338]: Attempt to read from '272462804' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,691 [1338]: Retrying (1)...
2013-07-03 10:51:40,697 [1316]: Attempt to read from '272462812' didn't succeed because: recv timed out (19000 ms) at local/lib/perl5/x86_64-linux-thread-multi/MongoDB/Cursor.pm line 161.
2013-07-03 10:51:40,697 [1316]: Retrying (1)...
Removing the lock file and restarting the copy script works fine in this case.
MongoDB::Cursor
crashes with recv timed out
, then skips a bunch of files
That's why we don't use MongoDB::Cursor
in Storage/Iterator/GridFS.pm
.