
hio

Command line utilities to interact with Hadoop HDFS. Think: Hadoop I/O CLI.

"hio" is a shorthand for Hadoop I/O because less typing on the CLI is better! It is also a shout-out to the hadoopio library, which we use behind the scenes.


Table of Contents

  • Philosophy and how it works
  • Usage
  • Deployment
  • Development
  • TODO
  • Change log
  • Contributing to this project
  • Authors
  • License

Philosophy and how it works

We strive to mimic existing Unix CLI tools as much as possible, e.g. in the names and semantics of hio commands and their parameters. That said, in some cases we cannot fully achieve this goal for technical reasons or because of upstream limitations (e.g. in HDFS).

Hio is based on hadoopio, which means hio commands typically require only constant memory: even if you process 1 TB of Avro data, your machine should essentially never run out of memory.
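Because of this streaming design you can safely pipe even very large files through hio and standard Unix tools. A small sketch (the HDFS path below is a hypothetical placeholder):

# Count the records of a potentially huge Avro file; because hio streams the
# data record by record, memory usage stays constant regardless of file size.
$ hio acat /hdfs/path/to/huge-dataset.avro | wc -l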

Usage

Overview

Once installed, the main entry point is the hio executable:

$ hio
hio: Command line tools to interact with Hadoop HDFS and other supported file systems

Commands:
  acat       - concatenates and prints Avro files
  ahead      - displays the first records of an Avro file

Tip: You can tweak many Java/JVM related settings via hio CLI options. See hio -h for details.
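For example, if your build uses the standard sbt-native-packager launcher scripts (see the Packaging section below), JVM options can typically be passed straight through with a -J prefix. This is only a sketch under that assumption; run hio -h to see the options your installation actually supports:

# Give the JVM a larger heap (assumes -J pass-through as provided by the
# sbt-native-packager launcher scripts; verify with `hio -h`).
$ hio -J-Xmx2g acat /hdfs/path/to/tweets/tweets1.avro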

Examples

# Show the first two records of the HDFS files `tweets1.avro` and `tweets2.avro`.
$ hio ahead -n 2 /hdfs/path/to/tweets/tweets1.avro /hdfs/path/to/tweets/tweets2.avro
==> /hdfs/path/to/tweets/tweets1.avro <==
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp": 1366154481 }

==> /hdfs/path/to/tweets/tweets2.avro <==
{"username":"Zergling","tweet":"Cthulhu R'lyeh!","timestamp": 1366154399 }
{"username":"miguno","tweet":"4-Gate is the new 6-Pool.","timestamp": 1366150900 }

Tip: Use jq to pretty-print, color, or otherwise post-process the JSON output.
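For instance, jq's identity filter `.` pretty-prints each record:

# Pretty-print the JSON output of the first record.
$ hio ahead -n 1 /hdfs/path/to/tweets/tweets1.avro | jq '.'
{
  "username": "miguno",
  "tweet": "Rock: Nerf paper, scissors is fine.",
  "timestamp": 1366150681
}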

# Extract only the `username` field from the JSON output.
$ hio ahead -2 /hdfs/path/to/tweets/tweets1.avro | jq '.username'
"miguno"
"BlizzardCS"

# Don't like the enclosing quotes?  Use `--raw-output` aka `-r`.
$ hio ahead -2 /hdfs/path/to/tweets/tweets1.avro | jq --raw-output '.username'
miguno
BlizzardCS

# Let's extract the `username` and `timestamp` fields, and separate them with tabs.
$ hio ahead -2 /hdfs/path/to/tweets/tweets1.avro | jq --raw-output '"\(.username)\t\(.timestamp)"'
miguno  1366150681
BlizzardCS      1366154481
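Because the output is plain JSON-per-line text, you can keep chaining standard Unix tools. A sketch that counts records per user across both example files, assuming they contain exactly the records shown above (acat, like cat, adds no per-file headers):

# Count records per user across multiple files.
$ hio acat /hdfs/path/to/tweets/tweets1.avro /hdfs/path/to/tweets/tweets2.avro | \
    jq --raw-output '.username' | sort | uniq -c
      1 BlizzardCS
      1 Zergling
      2 miguno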

Deployment

Run-time requirements

  • Java 7, preferably Oracle JRE/JDK 1.7
  • RHEL/CentOS (required only because we package exclusively in RPM format at the moment)
  • A "compatible" HDFS cluster running Hadoop 2.x. See below for details.

A note on Hadoop versions: We bundle all required libraries (read: jar files) when packaging hio, which means that hio does not require any additional software packages or libraries. Be aware though that you will need to keep the libraries used by hio in sync with the Hadoop version of your cluster.

build.sbt lists the exact version of Hadoop that hio has been built against. At the moment, we are targeting Cloudera CDH 5.x, which is essentially Hadoop 2.5.x. Feel free to run hio against other Hadoop versions or Hadoop distributions such as HortonWorks HDP, and report back the results.

Installation

You can package hio as an RPM (see section below) and then install the package via yum, rpm, or deployment tools such as Puppet or Ansible.
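For example, once you have built the RPM (see the Packaging section below), installation on RHEL/CentOS boils down to:

# Install the locally built RPM; <VERSION> is a placeholder.
$ sudo yum localinstall target/rpm/RPMS/noarch/hio-<VERSION>.noarch.rpm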

Non-RHEL users: If you need different packaging formats (say, .deb for Debian or Ubuntu), please let us know!

Configuration

The most important configuration aspect is making hio aware of your Hadoop HDFS cluster. Fortunately, in most cases hio will "just work" out of the box because it follows Hadoop best practices.

If hio does not work automagically, you must tell it where to find your HDFS cluster. hio expects the Hadoop configuration files -- typically named core-site.xml and hdfs-site.xml -- in the directory specified by the standard HADOOP_CONF_DIR environment variable. If this variable is not set, hio falls back to /etc/hadoop/conf/ (see hio.conf, which the RPM installs to /usr/share/hio/conf/hio.conf).

You can apply standard shell practices if you need to override the HADOOP_CONF_DIR variable for some reason:

HADOOP_CONF_DIR=/my/custom/hadoop/conf hio ...
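If you do not have an existing Hadoop client configuration to point at, a minimal core-site.xml that names your HDFS namenode is usually enough for read access. A sketch with placeholder host and port:

<!-- /my/custom/hadoop/conf/core-site.xml (hostname and port are placeholders) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>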

Development

Build requirements

  • Java 7, preferably Oracle JDK 1.7

Building the code

$ ./sbt compile

Running the tests

Run the test suite:

$ ./sbt test

Packaging

For details see sbt-native-packager.

Create an RPM (preferred package format):

$ ./sbt rpm:packageBin

>>> Creates target/rpm/RPMS/noarch/hio-<VERSION>.noarch.rpm

Create a Tarball:

$ ./sbt universal:packageZipTarball

>>> Creates ./target/universal/hio-<VERSION>.tgz

Another helpful task is stage, which quickly generates e.g. the shell wrapper scripts without packaging them into an RPM, so they are easier to inspect while developing:

$ ./sbt stage

>>> Creates files under ./target/universal/stage/ (e.g. bin/, conf/, lib/)
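Because stage generates the same launcher scripts that end up in the packages, you can run hio straight from the staging directory without installing anything:

# Run the staged executable directly (path as created by `./sbt stage`).
$ ./target/universal/stage/bin/hio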

TODO

  • Support wildcards/globbing of Hadoop paths (cf. hadoopio's include patterns) so that we understand paths such as /foo/bar*.avro.
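Until then, one possible workaround is to let the standard Hadoop CLI expand the glob and pass the matching paths on to hio. A sketch, assuming the hadoop client tools are installed and the paths contain no whitespace:

# Expand the glob via the Hadoop CLI (keeping only regular files, whose
# permission string starts with `-`), then feed the matching paths to hio.
$ hio acat $(hdfs dfs -ls '/foo/bar*.avro' | awk '/^-/ {print $NF}')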

Change log

See CHANGELOG.

Contributing to this project

Code contributions, bug reports, feature requests, etc. are all welcome.

If you are new to GitHub please read Contributing to a project for how to send patches and pull requests to this project.

Authors

License

Copyright © 2015 VeriSign, Inc.

See LICENSE for licensing information.
