Comments (12)
Simple answer: it shouldn't be a problem.
I've used it on approx 30GB datasets without issue; there are no real limitations that I'm aware of.
Most filesystems support files of this size without issue. I'd recommend doing the backup on the local machine rather than across the network if possible.
If it does happen to stop part way through the backup, it may be worth us looking into limiting the _all_docs API endpoint with start and end points so the backup can run in batches. That way we could restart a single segment rather than needing to restart the entire export (i.e. splitting the backup into pieces for transfer). Not sure if that's even possible off the top of my head; but I digress :)
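For illustration only, a rough sketch of what that batching could look like, assuming a local CouchDB and the _all_docs limit/skip parameters; it prints the per-batch commands as a dry run rather than executing them:

```shell
#!/bin/sh
# Hypothetical batching sketch: split the _all_docs export into restartable chunks.
DB_URL="http://127.0.0.1:5984/my-db"  # assumed host/db
CHUNK=10000                           # docs per batch
TOTAL=30000                           # e.g. taken from doc_count in GET $DB_URL
SKIP=0
while [ "$SKIP" -lt "$TOTAL" ]; do
  # Dry run: print the command each batch would execute.
  echo "curl -s '$DB_URL/_all_docs?include_docs=true&limit=$CHUNK&skip=$SKIP' > chunk_$SKIP.json"
  SKIP=$((SKIP + CHUNK))
done
```

In practice, paging with startkey on the last document ID of the previous batch scales better than skip for very large databases, since skip still walks the index from the start.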
What's the total document count, and the deleted document count? Do you use attachments in your DB?
Please do let us know how you get on (and how long it takes out of interest!)
from couchdb-dump.
PS. Be aware that the raw export file requires processing after extraction; this means if your exported DB backup file is 200GB on disk, you'll need 400GB in total to cover the processing overhead. It's probably going to take a while too!
PPS. Export is CPU and disk IO hungry; try and run it on an unused/unloaded node for best results.
Thanks a lot for your replies. I'm going to investigate using this on said database and report back.
I think being able to chunk up the backup would be very helpful, especially considering the high CPU and IO impact. Chunking it up will also allow us to introduce a short sleep between each chunk, which can lessen the impact. I do plan on running this locally however, so the risks of backups being interrupted will be minimized.
Out of curiosity, is it possible to add compression somewhere? Being such a large database, using gzip would help greatly. Of course I can always gzip it after the fact, but it will add significant time to the backup, vs. being able to compress during the backup.
Any thoughts on this?
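For what it's worth, one way to get compression during the backup rather than after it is simply to pipe the dump straight into gzip. A minimal sketch; the curl line is illustrative (assumed local host and credentials) and commented out, while the pipe itself is demonstrated on sample data so it runs anywhere:

```shell
#!/bin/sh
# Real invocation would look something like (assumed URL/credentials):
#   curl -sS 'http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true' \
#     | gzip -c > dumpedDB.json.gz
# The same pipe, demonstrated on sample data:
printf '{"id":"doc1"}\n{"id":"doc2"}\n' | gzip -c > /tmp/dumpedDB.json.gz
gunzip -c /tmp/dumpedDB.json.gz | wc -l   # the stream round-trips intact
```

Because gzip reads from stdin, no uncompressed copy ever lands on disk, which also halves the space overhead mentioned above.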
Raised #32 and #33 to consider the possibilities on these two points :)
@hany I'm happy you find this tool useful. But we must also acknowledge @dalgibbard, who's actually doing a lot of work on this tool!
Issues #32 and #33 are great: the backup size is really an issue when dealing with huge DBs, so compression and CPU/IO load should be handled somehow.
@dalgibbard thinking ahead, in the future we might create services on top of this script: e.g. a GUI (maybe a couch app?) or other software using it (a RESTful web service?), so it could be convenient not to code everything into one script, but to create single scripts/segments doing unique jobs that can be piped together.
E.g.
- script to download docs (predefined chunk size, user can set custom chunk size)
- script for compression (works on single chunks)
- script to parallelize (launches N instances of script1 (to download), manages latencies, delays, CPU usage, and IO in an adaptive way)
and then launching script3, which pipes script1 | script2 N times...
...OK, just brainstorming about possible evolutions of the script...
Any thoughts?
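A toy illustration of that shape: if each hypothetical stage reads stdin and writes stdout, they compose with plain pipes. The download stage is stubbed out here with sample data; in the real case it would curl a chunk from _all_docs:

```shell
#!/bin/sh
# Stage 1 (stubbed): would fetch one chunk of documents from _all_docs.
fetch_chunk() {
  printf '{"id":"doc%s"}\n' 1 2 3
}
# Stage 2: compression, operating on a single chunk from stdin.
compress_chunk() {
  gzip -c
}
# A "script3"-style driver would launch N of these pipelines; one shown here:
fetch_chunk | compress_chunk > /tmp/chunk_0.json.gz
gunzip -c /tmp/chunk_0.json.gz | wc -l
```

The stdin/stdout convention is what makes the adaptive driver possible: it can throttle, delay, or parallelize stages without any of them knowing about each other.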
@danielebailo I noticed the credit to @dalgibbard in the script, so thank you as well!
Unfortunately, it appears that running this backup tool took way too long against our large data set, and the increased CPU and IO caused some performance issues with the server. The slowest part appears to be the sed operations that occur after the initial dump. The multiple passes over the large .json files seem to be adding a lot of overhead.
Is it necessary to run all those additional operations during the backup cycle, after the initial dump? Since you have the source .json files, could you not defer those operations to the restoration process instead? Backups need to occur on a regular basis, but restorations are less frequent. Granted, deferring those operations to restoration will certainly increase the time it takes to restore, which may be during a critical period. For large data sets, however, having timely backups may be more important than restoration times (at least the data was backed up).
Just my $0.02.
The sed statements drastically reduce the final output file on disk, as well as making it actually importable. (During restore it makes sense not to mangle the input, in case someone is trying to import non-genuine backup data; otherwise it may have an undesirable effect, for example.) Note that the number of threads used during the sed stage is configurable.
With regards to modularising the script: absolutely yes. It's a bit monolithic in the way that it's grown, but it does its job :-)
I'd have concerns about how reasonable it would be to continue using bash if we get to the stage of rewriting it, though; alternatives have much nicer means to edit/compress/sort the data on the fly etc., and probably with better code cleanliness too.
Being honest though, I don't see myself pursuing those options much. The current code does the job within the known limitations, and I'd much rather that the CouchDB devs provide/manage backup functionality internally... not that I see that happening though.
@dalgibbard for sure, and I don't disagree with you. My comments regarding the sed operations were not to say that they shouldn't be run, but rather that they could be split off so they can be run at another time. Even with threading, the sed operations take about 10x longer than the actual document dump, not to mention driving up load considerably due to the extra CPU cycles.
With such a big, busy DB, our options are quite limited, and our focus has been on just getting a timely backup done. I love the way this script works, but the heavy post-processing is making it unusable for us.
Some ideas:
bash couchdb-backup.sh -b -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password
Performs a base backup (essentially just running the curl command).
bash couchdb-backup.sh -p -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password
-p is for "prepare", where it runs stages 1, 2, 3, and 4.
bash couchdb-backup.sh -r -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password
Standard restore, however it requires the "prep" stage to run first.
At this point, we've resorted to using plain old tar with gzip compression. We've had to lower the compression level in order to allow the backup to finish in a reasonable time.
Just food for thought.
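For reference, the "lowered compression level" trick is just a matter of piping tar through gzip with an explicit level flag instead of the default. A sketch on throwaway data:

```shell
#!/bin/sh
# Sketch: trade compression ratio for speed via gzip's level flag.
mkdir -p /tmp/db_files && printf 'sample shard data\n' > /tmp/db_files/shard.couch
# gzip -1 is fastest/largest output; the default is -6, and -9 is smallest/slowest.
tar -cf - -C /tmp db_files | gzip -1 > /tmp/db_backup.tar.gz
tar -tzf /tmp/db_backup.tar.gz
```

On a busy node, -1 can cut CPU time substantially versus the default level, at the cost of a somewhat larger archive.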
It's definitely do-able, but what I'd suggest we do is:
- -b does the backup and parsing as usual
- -b -S does the raw backup and skips parsing
- -P does the standalone parsing only.
The main issue with the compression stuff is that we push it to disk from curl before running any other jobs on it. I wonder what the speed would be like if we just piped that output through the seds, followed by compression at the end? One assumes that the main bottleneck is the output from CouchDB, so this may keep up at the usual dump rate without the need for additional processing (and disk IO) afterwards. Hmm.
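Something like the following, assuming the script's real sed expressions slot in where the placeholder is. The curl line is illustrative and commented out; the pipe is exercised on sample rows so it runs anywhere:

```shell
#!/bin/sh
# Streaming idea: parse and compress in one pass, never landing the raw dump on disk.
# Real invocation would be roughly (assumed URL; the sed expression is a placeholder):
#   curl -sS "$DB_URL/_all_docs?include_docs=true" | sed -e 's/,$//' | gzip -c > dump.json.gz
# Exercised on sample rows; this placeholder sed just strips trailing commas:
printf '{"_id":"a"},\n{"_id":"b"}\n' \
  | sed -e 's/,$//' \
  | gzip -c > /tmp/stream_dump.json.gz
gunzip -c /tmp/stream_dump.json.gz
```

One caveat with streaming: sed expressions that need multiple passes over the whole file would have to be reworked into single-pass, line-oriented form for this to work.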
Note: not -P, as that's already used to pass non-standard port numbers :) but you get the gist
Amended the title to more accurately represent the issue now at hand.
Task:
- Add flags around the data parsing during import to enable it to be managed as a standalone operation.