
Comments (12)

dalgibbard commented on July 4, 2024

Simple answer: shouldn't be a problem.

I've used it on approx 30GB datasets without issue; there are no real limitations that I'm aware of.

Most filesystems support files of this size without issue. I'd recommend doing the backup on the local machine rather than across the network if possible.

If it does happen to stop part way through the backup, it may be worth us looking into limiting the _all_docs API endpoint with start and end points so the backup can be done in batches. That way we could restart a single segment rather than the entire export (i.e. splitting the backup into pieces for transfer). Not sure off the top of my head whether that's even possible; but I digress :)
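
For illustration only, batched paging of _all_docs could look roughly like the sketch below. This isn't something couchdb-dump does today; the host, credentials, database name and batch size are placeholders, and it assumes jq is available:

    # Rough sketch: page through _all_docs in batches so a failed segment can
    # be re-fetched on its own. Host/credentials/db/batch size are placeholders.
    DB="http://admin:password@127.0.0.1:5984/my-db"
    BATCH=10000
    SEQ=0
    LASTKEY=""
    while : ; do
        if [ -z "$LASTKEY" ]; then
            curl -sfG "$DB/_all_docs" \
                --data-urlencode "include_docs=true" \
                --data-urlencode "limit=$BATCH" > "batch_${SEQ}.json" || break
        else
            # startkey + skip=1 resumes just after the last doc of the previous batch
            curl -sfG "$DB/_all_docs" \
                --data-urlencode "include_docs=true" \
                --data-urlencode "limit=$BATCH" \
                --data-urlencode "skip=1" \
                --data-urlencode "startkey=\"$LASTKEY\"" > "batch_${SEQ}.json" || break
        fi
        LASTKEY=$(jq -r '.rows[-1].key' "batch_${SEQ}.json")   # last key of this page
        [ "$LASTKEY" = "null" ] && break                       # empty page: finished
        SEQ=$((SEQ+1))
    done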

What's the total document count, and the deleted document count? Do you use attachments in your DB?

Please do let us know how you get on (and, out of interest, how long it takes!)

from couchdb-dump.

dalgibbard commented on July 4, 2024

PS. Be aware that the raw export file requires processing after extraction - this means that if your exported DB backup file is 200GB on disk, you'll need 400GB in total to cover the processing overhead. It's probably going to take a while too!

from couchdb-dump.

dalgibbard commented on July 4, 2024

PPS. Export is CPU and disk IO hungry; try and run it on an unused/unloaded node for best results.

from couchdb-dump.

hany commented on July 4, 2024

Thanks a lot for your replies. I'm going to investigate using this on said database and report back.

I think being able to chunk up the backup would be very helpful, especially considering the high CPU and IO impact. Chunking it up would also allow us to introduce a short sleep between chunks, which can lessen the impact. I do plan on running this locally, however, so the risk of the backup being interrupted will be minimized.

Out of curiosity, is it possible to add compression somewhere? With such a large database, gzip would help greatly. Of course I can always gzip it after the fact, but that adds significant time compared to compressing during the backup itself.
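
For instance, something along these lines would compress on the fly so the uncompressed JSON never hits the disk (the host, credentials and database name are placeholders, and it bypasses the script's post-processing entirely):

    # Illustration only: stream the raw dump straight into gzip.
    # Lower the level (e.g. gzip -1) to trade compression ratio for speed.
    curl -sf "http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true" \
        | gzip -6 > dumpedDB.json.gz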

Any thoughts on this?

from couchdb-dump.

dalgibbard commented on July 4, 2024

Raised #32 and #33 to consider the possibilities on these two points :)

from couchdb-dump.

danielebailo commented on July 4, 2024

@hany I'm happy you find this tool useful, but we must also acknowledge @dalgibbard, who is actually doing a lot of the work on this tool!

Issues #32 and #33 are great: the backup size is really an issue when dealing with huge DBs, so compression and CPU / IO load should be handled somehow.

@dalgibbard looking ahead, in the future we might build services on top of this script: e.g. a GUI (maybe a couch app?) or other software that uses it (a RESTful web service?). So it could be convenient not to code everything into a single script, but to create separate scripts/segments, each doing one job, that can be piped together.
E.g.

  1. a script to download docs (predefined chunk size; the user can set a custom chunk size)
  2. a script for compression (works on single chunks)
  3. a script to parallelize (launches N instances of script 1 to download, and manages latencies, delays, CPU usage and IO adaptively)

and then launching Script3 which pipes script1 | script2 N times...
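
A minimal sketch of how those pieces might compose (all of the script names below are hypothetical placeholders, not existing tools):

    # Hypothetical "script 3": launch N workers, each downloading one chunk
    # (script 1) and compressing it on the fly (script 2 is just gzip here).
    N=4
    for i in $(seq 0 $((N-1))); do
        ./download-chunk.sh --chunk "$i" --chunk-size 10000 \
            | gzip > "chunk_${i}.json.gz" &
    done
    wait   # a smarter script 3 would also throttle/adapt CPU and IO usage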

...OK, just brainstorming about possible evolutions of the script...

Any thoughts?

from couchdb-dump.

hany commented on July 4, 2024

@danielebailo I noticed the credit to @dalgibbard in the script, so thank you as well!

Unfortunately, it appears that running this backup tool took way too long against our large data set, and the increased CPU and IO caused some performance issues with the server. The slowest part appears to be the sed operations that occur after the initial dump; the multiple passes over the large .json files seem to be adding a lot of overhead.

Is it necessary to run all those additional operations during the backup cycle, after the initial dump? Since you have the source .json files, could you not save those operations for the restoration process instead? Backups need to occur on a regular basis, but restorations are less frequent. Granted, deferring those operations to restoration will certainly increase the time a restore takes, and that may happen during a critical period. For large data sets, however, having timely backups may be more important than restoration times (at least the data was backed up).

Just my $0.02.

from couchdb-dump.

dalgibbard commented on July 4, 2024

The sed statements drastically reduce the size of the final output file on disk, as well as making it actually importable (during restore it makes sense not to mangle the input, in case someone is trying to import data that isn't a genuine backup; mangling it could have undesirable effects). Note that the number of threads used during the sed stage is configurable.

With regards to modularising the script: absolutely, yes. It's a bit monolithic in the way that it's grown, but it does its job :-)

I'd have concerns about how reasonable it would be to continue using bash if we get to the stage of rewriting it, though; alternatives have much nicer means of editing/compressing/sorting the data on the fly etc., and probably with better code cleanliness too.

Being honest though, I don't see myself pursuing those options much. The current code does the job within the known limitations, and I'd much rather that the CouchDB devs provide/manage backup functionality internally... not that I see that happening, though.

from couchdb-dump.

hany commented on July 4, 2024

@dalgibbard for sure, and I don't disagree with you. My comments regarding the sed operations were not to say that they shouldn't be run, but rather that they could be split off and run at another time. Even with threading, the sed operations take about 10x longer than the actual document dump, not to mention driving up load considerably due to the extra CPU cycles.

With such a big, busy DB, our options are quite limited, and our focus has been on just getting a timely backup done. I love the way this script works, but the heavy post-dump operations are making it unusable for us.

Some ideas:

bash couchdb-backup.sh -b -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

Performs a base backup (essentially just running the curl command).

bash couchdb-backup.sh -p -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

-p is for "prepare", where it runs stages 1, 2, 3, and 4.

bash couchdb-backup.sh -r -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

Standard restore; however, it requires the "prep" stage to have been run first.

At this point, we've resorted to using plain old tar with gzip compression. We've had to lower the compression level in order to allow the backup to finish in a reasonable time.

Just food for thought.

from couchdb-dump.

dalgibbard commented on July 4, 2024

It's definitely do-able, but what I'd suggest we do is:

  • -b does the backup and parsing as usual
  • -b -S does the raw backup and skips parsing
  • -P does the standalone parsing only

The main issue with the compression stuff is that we push the output from curl to disk before running any other jobs on it. I wonder what the speed would be like if we just piped that output through the seds, followed by compression at the end? One assumes that the main bottleneck is the output rate from CouchDB, so this might keep up with the usual dump rate without the need for additional processing (and disk IO) afterwards. Hmm.
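
Something shaped like the sketch below, purely as an illustration (the sed expression is a placeholder, not the real couchdb-dump parsing rules, and the URL/credentials are made up):

    # Sketch of the idea: stream the dump through the parsing stage and compress
    # at the end, so the raw output never lands on disk uncompressed.
    curl -sf "http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true" \
        | sed -e 's/PLACEHOLDER-PARSING-RULE//' \
        | gzip > dumpedDB.json.gz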

from couchdb-dump.

dalgibbard commented on July 4, 2024

Note: not -P, as that's already used to pass non-standard port numbers :) but you get the gist.

from couchdb-dump.

dalgibbard commented on July 4, 2024

Amended the title to more accurately represent the issue now at hand.

Task:

  • Add flags around the data parsing during import to enable it to be managed as a standalone operation.

from couchdb-dump.
