
HBackup: a backup utility for large datasets in HDFS and S3

Features

HBackup transfers large files between HDFS and S3 and keeps them up to date. Its main features:

  • Can use S3 multipart upload (tunable; see the snippet after this list), which:
      • Transfers multiple chunks of the same file in parallel
      • Bypasses the 5GB file limit to allow very large files
  • Overwrites files in the destination if their size or mtime differs from the source file
  • Uses a custom checksum to verify the integrity of backups in S3 (normal S3 checksums don't work for multipart uploads)
  • Provides a health check that can verify that all files in the destination are within some time delta of the corresponding source file
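
For example, the multipart behavior can be tuned with the hbackup.s3.partSize and hbackup.s3.multipartThreshold options documented under "All configuration values" below. A hypothetical properties snippet (values in bytes; the specific numbers are placeholders):

  # hypothetical tuning: 200MB parts, use multipart for files over 500MB
  hbackup.s3.partSize=209715200
  hbackup.s3.multipartThreshold=524288000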

Why would I choose this tool instead of Hadoop distcp?

  • Distcp requires a MapReduce cluster, and the Hadoop S3FileSystem doesn't support uploading files over 5GB because it doesn't use the multipart upload API. Depending on your use case, these factors might matter, but distcp works fine in many situations.

Usage

There are three "main" functions in HBackup:

  • The main backup tool: run com.urbanairship.hbackup.HBackup. This will do a backup according to the configuration. It will only back up files with an age greater than or equal to hbackup.mtimeAgeMillis.

  • The S3 in-place checksum verification tool: run com.urbanairship.hbackup.checksumverify.ChecksumVerify. This will compute checksums of files in S3 and compare them against their expected values, which are stored separately in S3.

  • The backup staleness/health check: run com.urbanairship.hbackup.StalenessCheck. This tool will compare the source and destination files and exit with a nonzero error code if any destination file is older than the corresponding source file by more than a given amount of time. This could be used in a Nagios health check.

Configuration

Configuration can be passed in two ways. The first is to pass the name of a properties file as a command line argument (e.g. java com.urbanairship.hbackup.HBackup myconfig.props). The second is to set JVM system properties (e.g. java -Dhbackup.from=hdfs://etc -Dhbackup.to=s3://etc com.urbanairship.hbackup.HBackup). Configuration values given as JVM system properties override those in properties files.
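
For instance, a minimal properties file (the namenode address, paths, bucket name, and credentials below are placeholders) might look like:

  # myconfig.props -- hypothetical values
  hbackup.from=hdfs://namenode:8020/data
  hbackup.to=s3://mybucket/backup
  hbackup.s3AccessKey=YOUR_ACCESS_KEY
  hbackup.s3Secret=YOUR_SECRET_KEY

Running java -Dhbackup.to=s3://otherbucket/backup com.urbanairship.hbackup.HBackup myconfig.props would then send the backup to s3://otherbucket/backup, since the JVM property overrides the value in the file.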

Important configuration values

hbackup.from: The source of backup data, either an HDFS URI or S3 URI of the form "hdfs://namenode:port/dir" or "s3://bucket/prefix".

hbackup.to: The destination for the backup data, either an HDFS URI or S3 URI.

hbackup.s3AccessKey and hbackup.s3Secret: The security credentials for accessing S3. If you need to use different credentials for the source, destination, or checksum storage, there are other options you can use (see below).
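
For example, if the source and destination buckets belong to different AWS accounts, a hypothetical properties snippet (keys are placeholders) could supply credentials per endpoint using the per-endpoint options documented below:

  # separate credentials for the source and destination buckets
  hbackup.from.s3AccessKey=SOURCE_ACCESS_KEY
  hbackup.from.s3Secret=SOURCE_SECRET_KEY
  hbackup.to.s3AccessKey=DEST_ACCESS_KEY
  hbackup.to.s3Secret=DEST_SECRET_KEY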

All configuration values

All options can be seen by running com.urbanairship.hbackup.HBackup with --usage as a command line argument:

Usage: CLASSPATH=... java -Dprop=val -Dprop=val com.urbanairship.hbackup.HBackup [resource] [resource]
The "resource"s are filenames or URLs pointing to properties files which may set config values.
You can set config values in the resource files or by setting JVM system properties with -Dprop=val.

The available config values are:
 hbackup.from                    URI of data source, e.g. hdfs:///home/bob, hdfs://reports-master-0:7050/home/bob, s3://mybucket/a/b
 hbackup.to                      URI of data sink
 hbackup.concurrentChunks        Number of file chunks to transfer at a time (default 5)
 hbackup.recursive               Recursively back up the entire source directory tree (default true)
 hbackup.from.s3AccessKey        When the source is an S3 bucket, use this to set its access key
 hbackup.from.s3Secret           When the source is an S3 bucket, use this to set its secret
 hbackup.to.s3AccessKey          When the destination is an S3 bucket, use this to set its access key
 hbackup.to.s3Secret             When the destination is an S3 bucket, use this to set its secret
 hbackup.s3.partSize             When writing to S3 using the multipart API, what size of parts should the file be split into? (default 104857600)
 hbackup.s3.multipartThreshold   When writing to S3, use the multipart API for files larger than this (default 104857600)
 hbackup.mtimecheck              If true, re-transfer files when the source and sink mtime or length differs. If false, ignore the mtime and only check the length. (default true)
 hbackup.includePathsRegex       If set, only files matching this regex will be sent. Filenames are relative to the backup directory.
 hbackup.checksumUri             Where file checksums should be stored
 hbackup.checksum.s3AccessKey    If the checksums are stored in a protected S3 bucket, specify the access key
 hbackup.checksum.s3Secret       If the checksums are stored in a protected S3 bucket, specify the secret
 hbackup.s3AccessKey             Use this for all S3 accesses, if all your S3 usage is done under the same account
 hbackup.s3Secret                Use this for all S3 accesses, if all your S3 usage is done under the same account
 hbackup.staleMillis             When checking backed-up files for staleness, a file this much older than the source is "stale"
 hbackup.mtimeAgeMillis          When checking source files for inclusion, only files at least this much older than the current system time will be backed up.

When specifying HDFS URIs, you can leave the host part blank
(hdfs://dir/file.txt instead of hdfs://host:port/dir/file.txt) if the
classpath contains a Hadoop configuration pointing to a default filesystem.

Examples:
  CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=hdfs:///from -Dhbackup.to=hdfs:///to com.urbanairship.hbackup.HBackup ./otherconfigs.properties
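
These options can be combined to restrict what gets transferred. As a hypothetical illustration, the following run (paths and bucket are placeholders) would back up only .gz files, transferring eight chunks at a time:

  CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=hdfs:///data -Dhbackup.to=s3://mybucket/backup -Dhbackup.includePathsRegex='.*\.gz$' -Dhbackup.concurrentChunks=8 com.urbanairship.hbackup.HBackup ./s3credentials.properties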

Example of doing a backup

Assuming you have a properties file named "./s3credentials.properties" that sets hbackup.s3AccessKey and hbackup.s3Secret:

CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=hdfs:///from -Dhbackup.to=s3://mybucket/backup com.urbanairship.hbackup.HBackup ./s3credentials.properties
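
To avoid copying files that may still be in the middle of being written, you could additionally set hbackup.mtimeAgeMillis. A hypothetical run that skips anything modified within the last hour (3600000 ms):

  CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=hdfs:///from -Dhbackup.to=s3://mybucket/backup -Dhbackup.mtimeAgeMillis=3600000 com.urbanairship.hbackup.HBackup ./s3credentials.properties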

Example of verifying checksums in S3

Assuming you have a properties file named "./s3credentials.properties" that sets hbackup.s3AccessKey and hbackup.s3Secret:

CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=s3://mybucket/files -Dhbackup.checksumUri=s3://mybucket/checksums com.urbanairship.hbackup.checksumverify.ChecksumVerify ./s3credentials.properties

Example of a staleness health check

Assuming you have a properties file named "./s3credentials.properties" that sets hbackup.s3AccessKey and hbackup.s3Secret:

CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=hdfs:///from -Dhbackup.to=s3://mybucket/backup -Dhbackup.staleMillis=86400000 com.urbanairship.hbackup.StalenessCheck ./s3credentials.properties
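
Because StalenessCheck reports through its exit code, it is easy to wrap in a monitoring check. A minimal Nagios-style wrapper sketch (the paths, bucket, and 24-hour threshold are hypothetical):

  #!/bin/sh
  # Exit 0 (OK) if every destination file is fresh, 2 (CRITICAL) otherwise.
  if CLASSPATH=hbackup-0.9-jar-with-dependencies.jar java -Dhbackup.from=hdfs:///from -Dhbackup.to=s3://mybucket/backup -Dhbackup.staleMillis=86400000 com.urbanairship.hbackup.StalenessCheck ./s3credentials.properties; then
      echo "OK: no stale backup files"
      exit 0
  else
      echo "CRITICAL: at least one backup file is stale"
      exit 2
  fi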
