teraslice_hdfs_append's Introduction

Processor - teraslice_hdfs_append

To install, run the following from the root of your Teraslice instance:

npm install terascope/teraslice_hdfs_append

Description

Appends chunks of data to files in HDFS. To avoid concurrency issues, each file is given a unique name associated with a single worker.
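The exact naming scheme is not documented in this README. As a hedged sketch, a per-worker filename could combine a hostname and worker number, which would match the example path /elasticsearch/myData-2018.01.16/host1.86 seen in the issues section below:

```javascript
// Sketch only: derive a per-worker filename so that concurrent workers
// never append to the same HDFS file. The hostname + worker-number
// suffix is an assumption based on the example path in the issues below.
function workerFilename(directory, hostname, workerNumber) {
    return `${directory}/${hostname}.${workerNumber}`;
}
```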

Expected Inputs

Array of chunked data requests as generated by teraslice_file_chunker.
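The record shape emitted by teraslice_file_chunker is not documented here. Purely as a hypothetical illustration, each entry might pair a target path with a chunk of serialized data; the field names below are assumptions, not the real interface:

```javascript
// Hypothetical shape only: the field names 'filename' and 'data' are
// assumptions for illustration, not the documented chunker output.
const exampleInput = [
    { filename: '/testpath/test/2018.01.16/host1.86', data: '{"created":"..."}\n' },
    { filename: '/testpath/test/2018.01.16/host1.86', data: '{"created":"..."}\n' }
];
```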

Output

Currently undefined.

Parameters

Name         Description                                         Default   Required
connection   Name of the terafoundation HDFS connection to use   default   N

Job configuration example

A minimal job that generates test data, chunks it, and appends it to HDFS:

{
  "name": "Data Generator",
  "lifecycle": "persistent",
  "workers": 1,
  "operations": [
    {
      "_op": "elasticsearch_data_generator",
      "size": 5000
    },
    {
      "_op": "teraslice_file_chunker",
      "timeseries": "daily",
      "date_field": "created",
      "directory": "/testpath/test"
    },
    {
      "_op": "teraslice_hdfs_append"
    }
  ]
}

Notes

This module is designed to be used in coordination with teraslice_file_chunker or a similar pre-processor that allocates the data into reasonable sized chunks to send to HDFS.


teraslice_hdfs_append's Issues

Handle multiple consecutive occurrences of 'Error sending data to file'

It seems there are conditions under which teraslice can no longer write to a file. This has been seen under high cluster load, with many teraslice workers writing at a very high rate. All of the other workers append without issue while one worker fails on every slice with an error that looks like this:

Error sending data to file: /elasticsearch/myData-2018.01.16/host1.86

The file named in the error is the same every time, and while the job is running and the file remains "open", it fails an FSCK.

To fix this, I propose that teraslice should stop the worker after some default but configurable number of identical failures. Then it should restart a worker to replace the failed one. This will allow data to continue to be written.

It's possible that this will leave an unreadable file, but perhaps not since I've seen the file become readable after the job was stopped.
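The proposed guard could be sketched as a simple consecutive-failure counter. Nothing below exists in teraslice today; the class, method names, and the default threshold of 5 are all illustrative assumptions:

```javascript
// Illustrative sketch of the proposal above: count identical consecutive
// failures and signal when the worker should be stopped and replaced.
class FailureGuard {
    constructor(maxConsecutiveFailures = 5) {
        this.max = maxConsecutiveFailures;
        this.streak = 0;
    }
    // Returns true once the failure streak reaches the configured limit,
    // meaning the worker should be stopped and a replacement started.
    recordFailure() {
        this.streak += 1;
        return this.streak >= this.max;
    }
    // Any successful slice resets the streak.
    recordSuccess() {
        this.streak = 0;
    }
}
```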

Needs to support connection

Supporting the 'connection' parameter should be a standard feature of all modules that read or store data.
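For illustration, the parameter would simply be set in the operation's job config, selecting a named terafoundation HDFS connector. The connector name 'hdfs_cluster_a' below is a made-up placeholder:

```javascript
// Sketch: selecting a named terafoundation HDFS connector via the
// 'connection' parameter documented in the table above.
// 'hdfs_cluster_a' is a placeholder, not a real connector name.
const appendOp = {
    _op: 'teraslice_hdfs_append',
    connection: 'hdfs_cluster_a'
};
```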

Automate and be consistent

  • Add Travis-CI Automation.
  • Add tests.
  • Report Code Coverage to CodeCov, or a similar service.
  • Use ESLint and the latest changes from teraslice.
  • NPM Publish using the name teraslice-hdfs-append.

HDFS doesn't like double slashes in path names

If the user configures a path with a trailing slash, we tack the filename on with another slash, creating a double slash in the path. This causes the file creation to fail, and the HDFS library throws an error that includes the entire content of the slice, which then ends up in the log. We need to better validate the incoming config and strip any extra slashes.
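One minimal fix, sketched here under the assumption that the directory and filename are joined in one place, is to normalize the slashes before concatenating:

```javascript
// Sketch of the proposed validation: strip trailing slashes from the
// configured directory (and leading slashes from the filename) so the
// joined path never contains "//".
function joinHdfsPath(directory, filename) {
    const dir = directory.replace(/\/+$/, '');
    const file = filename.replace(/^\/+/, '');
    return `${dir}/${file}`;
}
```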
