embulk / embulk Goto Github PK

View Code? Open in Web Editor NEW

1.7K 107.0 199.0 7.81 MB

Embulk: Pluggable Bulk Data Loader.

Home Page: https://www.embulk.org/

License: Apache License 2.0

Ruby 7.59% Shell 0.23% Java 91.83% Batchfile 0.31% HTML 0.04%

embulk bulk-loader

embulk's Introduction

What's Embulk?

Embulk is a parallel bulk data loader that helps data transfer between various storages, databases, NoSQL and cloud services.

Embulk supports plugins to add functions. You can share the plugins to keep your custom scripts readable, maintainable, and reusable.

Embulk, an open-source plugin-based parallel bulk data loader at Slideshare

Document

Embulk documents: https://www.embulk.org/

Using plugins

You can use plugins to load data from/to various systems and file formats. Here is the list of publicly released plugins: list of plugins by category.

An example is embulk-output-command plugin. It executes an external command to output the records.

To install plugins, you can use embulk gem install <name> command:

embulk gem install embulk-output-command
embulk gem list

Embulk bundles some built-in plugins such as embulk-encoder-gzip or embulk-formatter-csv. You can use those plugins with following configuration file:

in:
  type: file
  path_prefix: "./try1/csv/sample_"
  ...
out:
  type: command
  command: "cat - > task.$INDEX.$SEQID.csv.gz"
  encoders:
    - {type: gzip}
  formatter:
    type: csv

Resuming a failed transaction

Embulk supports resuming failed transactions. To enable resuming, you need to start transaction with -r PATH option:

embulk run config.yml -r resume-state.yml

If the transaction fails, embulk stores state some states to the yaml file. You can retry the transaction using exactly same command:

embulk run config.yml -r resume-state.yml

If you give up on resuming the transaction, you can use embulk cleanup subcommand to delete intermediate data:

embulk cleanup config.yml -r resume-state.yml

Using plugin bundle

embulk mkbundle subcommand creates a isolated bundle of plugins. You can install plugins (gems) to the bundle directory instead of ~/.embulk directory. This makes it easy to manage versions of plugins. To use the bundle, add -b <bundle_dir> option to guess, preview, or run subcommand. embulk mkbundle also generates some example plugins to <bundle_dir>/embulk/*.rb directory.

See the generated <bundle_dir>/Gemfile file how to plugin bundles work.

embulk mkbundle ./embulk_bundle  # please edit ./embulk_bundle/Gemfile to add plugins. Detailed usage is written in the Gemfile
embulk guess -b ./embulk_bundle ...
embulk run   -b ./embulk_bundle ...

Use cases

Scheduled bulk data loading to Elasticsearch + Kibana 5 from CSV files

For further details, visit Embulk documentation.

Upgrading to the latest version

Following command updates embulk itself to the specific released version.

embulk selfupdate x.y.z

Embulk Development

Build

./gradlew cli  # creates pkg/embulk-VERSION.jar

You can see JaCoCo's test coverage report at ${project}/build/reports/tests/index.html You can see Findbug's report at ${project}/build/reports/findbug/main.html # FIXME coverage information is not included somehow

You can use classpath task to use bundle exec ./bin/embulk for development:

./gradlew -t classpath  # -x test: skip test
./bin/embulk

To deploy artifacts to your local maven repository at ~/.m2/repository/:

./gradlew install

To compile the source code of embulk-core project only:

./gradlew :embulk-core:compileJava

Task dependencies shows dependency tree of embulk-core project:

./gradlew :embulk-core:dependencies

Update JRuby

Modify jrubyVersion in build.gradle to update JRuby of Embulk.

Release

Prerequisite: Sonatype OSSRH

You need an account in Sonatype OSSRH, and configure it in your ~/.gradle/gradle.properties.

ossrhUsername=(your Sonatype OSSRH username)
ossrhPassword=(your Sonatype OSSRH password)

Prerequisite: PGP signatures

You need your PGP signatures to release artifacts into Maven Central, and configure Gradle to use your key to sign.

signing.keyId=(the last 8 symbols of your keyId)
signing.password=(the passphrase used to protect your private key)
signing.secretKeyRingFile=(the absolute path to the secret key ring file containing your private key)

Release

Modify version in build.gradle at a detached commit to bump Embulk version up.

git checkout --detach master
(Remove "-SNAPSHOT" in "version" in build.gradle.)
git add build.gradle
git commit -m "Release vX.Y.Z"
git tag -a vX.Y.Z
(Write the release note for vX.Y.Z in the tag annotation.)
./gradlew clean && ./gradlew release
git push -u origin vX.Y.Z

embulk's People

Contributors

Stargazers

Watchers

Forkers

xerial ceeflyer afthill thagikura yuichielectric fukkun hiroyuki-sato niku noel9109 yaggytter takei-yuya ykubota akirakw okonomi hito4t miniway nishidayuya komamitsu hata takumakanari mrorii nmorioka ssski yusuke muga sakama syohex bageljp randyh0329 yyamano siroken3 monkey-mas huydx uu59 sonots cosmo0920 yoyama mikoto2000 toyama0919 zeroro88 doubaokun hishidama rikima takkanm zoetrope junichiwatanuki rettinohusker51 stevecosupply joeharris76 mindis littleyang raceli kysnm downchuck baozhengxing robinmisfit androiddream flada-auxv jj76lee pe-suke tomykaira munepom jet2007 ysksuzuki valuecreation wakanasato rucky2013 nikolayvoronchikhin joker1007 nakanowai bizenn psyoblade kilnamkim cm-moriwaki mattehicks stvhanna tsuzuki-aiming ilyafinkelshteyn cosmok instcode yoshihara jmones zmyer threetreeslight pologood koheisakamoto shenghuahe morikawamika smdmts toru-takahashi tovin-xu ujun icaiyu josephcdonaldson leevydanomalik linearregression raghu999 tkawachi cazacugmihai takuti

embulk's Issues

Create JDBC input plugin

For transactional example

Add last_path to NextConfig returned by LocalInputPlugin#transaction(..)

The "last_path" is useful for retrying LocalFileInputPlugin to recover.

"java -jar embulk-0.1.0.jar bundle ./data" failed when coping ".bundle/config"

When I executed "java -jar embulk-0.1.0.jar bundle ./data" according to README.md, this error occurred.

$ java -jar embulk-0.1.0.jar bundle ./data
Initializing ./data...
Errno::ENOENT: No such file or directory - No such file or directory
               stat at org/jruby/RubyFile.java:848
   fu_each_src_dest at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:1525
  fu_each_src_dest0 at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:1541
   fu_each_src_dest at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:1523
                 cp at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:395
                run at file:/foo/bar/embulk/embulk-0.1.0.jar!/embulk/command/embulk_run.rb:122
               each at org/jruby/RubyArray.java:1613
                run at file:/foo/bar/embulk/embulk-0.1.0.jar!/embulk/command/embulk_run.rb:118
             (root) at classpath:embulk/command/embulk.rb:38

I looked over lib/embulk/command/embulk_run.rb.

            %w[.bundle/config embulk/input_example.rb embulk/output_example.rb examples/csv-stdout.yml examples/sample.csv.gz Gemfile Gemfile.lock].each do |file|  # TODO get file list from the jar
              url = resource_class.resource("/embulk/data/bundle/#{file}").to_s
              dst = File.join(path, file)
              FileUtils.mkdir_p File.dirname(dst)
              FileUtils.cp(url, dst)

But url seems empty and embulk-0.1.0.jar doesn't include .bundle/config.

Implement JRuby bridge for DecoderPlugin

"embulk gem install foobar --version 0.0.1" doesn't work

It shows embulk's version number...!

Gzip encoder plugin is missing

It throws an exception:

java.lang.RuntimeException: java.lang.AssertionError: OutputStreamFileOutput is not implemented yet

Make log level configurable by command line argument

I want to set log level to embulk command arguments.

Automate incrementing version number and releasing tasks

As this commit shows, creating a new release is complicated:
24cc193

increment version numbers at pom.xml and embulk-*/pom.xml
increment version number at README.md
increment version number at build.gradle
increment version number at lib/embulk/version.rb
git commit -am v0.2.1
git tag v0.2.1
./gradlew bintrayUpload
9 rake
go to bintray and upload embulk-0.2.1.jar manually
gem push pkg/embulk-0.2.1.gem
git push && git push --tags

I want to make them as like this:

run rake set_verson 0.2.1 (runs step 1,2,3,4)
git commit -am v0.2.1
git tag v0.2.1
./gradlew bintrayUpload (runs step 8, 9, 10)
gem push pkg/embulk-0.2.1.gem (if possible, bintrayUpload should also do this)
git push && git push --tags

Search and load guess plugins automatically

Unlike other plugins, users normally don't know name of guess plugins. So, they don't write name of the guess plugins to configuration files.
To make guess plugins pluggable, Embulk should find and load installed guess plugins automatically.

Guess implementation is hard. Add record guess helper library

It's a common work for input plugins to guess schema from schema-less records.
An idea here is to provide a library to make it easy.

Add filter process in preview(dry-run) mode.

Current preview command run input stage only.
I would like to check after filter mode result.

https://twitter.com/frsyuki/status/563553691475533826

''guess'' command does not work correctly if csv data encoded Shift_JIS.

https://gist.github.com/hiroyuki-sato/facc6c3f23fde7b5dc90

Could not find property 'bintray_user'

Let me know how to set bintray_user:

> ./gradlew build

FAILURE: Build failed with an exception.

* Where:
Build file '/Users/leo/work/git/embulk/build.gradle' line: 29

* What went wrong:
A problem occurred evaluating root project 'embulk'.
> Could not find property 'bintray_user' on com.jfrog.bintray.gradle.BintrayExtension_Decorated@108a46d6.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 2.691 secs

Add an option to CSV parser plugin to skip first some lines

Some CSV files have some garbage lines.
Those lines are informative for some cases but they're not a part of table.

For example,

created date,2015-03-05
created by,me
device,xyz
time,key1,key2,key3,key4
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
...

"new ruby-input" generator does not work properly.

java -jar ./embulk-0.4.1.jar new ruby-input sample

% java -jar ./embulk-0.4.1.jar new ruby-input sample
Creating embulk-input-sample/
  Creating embulk-input-sample/README.md
  Creating embulk-input-sample/LICENSE.txt
  Creating embulk-input-sample/.gitignore
  Creating embulk-input-sample/Rakefile
  Creating embulk-input-sample/Gemfile
  Creating embulk-input-sample/embulk-input-sample.gemspec
  Creating embulk-input-sample/lib/embulk/input/sample.rb

rake build (and failed.)

% cd embulk-input-sample/
% rake build 
rake aborted!
There was a SyntaxError while loading embulk-input-sample.gemspec: 
/home/embulk/embulk-input-sample/embulk-input-sample.gemspec:7: syntax error, unexpected tIDENTIFIER, expecting keyword_end
...h the output plugins by "embulk-output" keyword."
...                               ^
/home/arch/embulk-input-sample/embulk-input-sample.gemspec:7: syntax error, unexpected tSTRING_BEG, expecting keyword_do or '{' or '('
...tput plugins by "embulk-output" keyword."
...                               ^
/home/embulk/embulk-input-sample/Rakefile:1:in `<top (required)>'
(See full trace by running task with --trace)

Can't use double quote "embulk-output" in gemspec file.

  spec.description   = "Sample input plugin is an Embulk plugin that loads records from Sample so that any output plugins can receive the records. Search the output plugins by "embulk-output" keyword."

Use single quote better

  spec.description   = "Sample input plugin is an Embulk plugin that loads records from Sample so that any output plugins can receive the records. Search the output plugins by 'embulk-output' keyword."

And one more thing.

module Embulk
  module Input

    class SampleInputPlugin < InputPlugin
      Plugin.register_input(sample, self)

need single quote??

module Embulk
  module Input

    class SampleInputPlugin < InputPlugin
      Plugin.register_input('sample', self)

Plugin loader of Java plugins

Embulk can't load java-based plugins (jar files) dynamically. Users need to add the jar files to classpath. Inconvenient...

NextConfig is merged to the top-level but expected to in or out

As this comment says, shouldn't it be {"done"=>done}?
https://github.com/enukane/embulk-plugin-input-pcapng-files/pull/2/files#diff-8360174ae27f7aace992035f0fcec11eR38

Resuming failed transaction

A bulk load likely fails when we import large amount of data (say 1TB). When it fails, Embulk currently requires plugins to rollback entire transaction so that we can retry.
However, retrying the entire bulk load of the large data takes too long time. It's very good if we can resume the failed transaction from partially succeeded state.

Difficulties:

What's the API for plugins?
Is a input plugin responsible to skip tasks? Or, is a output plugin responsible to skip records?

Implement JRuby bridge for DecoderPlugin and EncoderPlugin

Implement JRuby bridge for FileInputPlugin

More Ruby plugsin!

some commands exit with status 1 even if it doesn't cause an error

embulk exits with status 1 when version(--version) or help(--help) command is executed.
It would be great if it exits with 0 in such cases.

Command-line (ssh) distributed executor

Idea:
I want to execute Embulk using this command line:

$ embulk run config.yml --execute-command "ssh %(find_host %{index}) embulk run-task %{task} %{index}"

With this command, embulk executes the given ssh command for each task. Then embulk run-task executes the given task on another host.
This is the simplest way to run embulk on distributed environment. We don't have to install Hadoop, YARN, Mesos, etc.

Some distributed execution environment such as mpich2 or Sun Grid Engine may work using this command-line.

Limit the size of quoted values on CsvParserPlugin

CsvParserPlugin has max_quoted_size_limit.
https://github.com/embulk/embulk/blob/master/embulk-standards/src/main/java/org/embulk/standards/CsvParserPlugin.java#L73

But it is not used in CsvTokenizer.
https://github.com/embulk/embulk/blob/master/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java#L30

Support array and map types

Some databases can store nested values.
Some file formats such as JSON or XM have nested values.
To transfer between those databases and file formats, embulk needs to support nested types.

Discussions are:

Array:
- Fixed type array or variable type array
- Fixed-type array: array(T) like array(long). [1, 2, 3]
- Variable-type array: array. [1, "str", 0.3]
Map:
- Fixed type or variable map
- Allow non-string keys or not
Struct:
- Fixed-length map with fixed-types for each field.
- For example, struct(int, string))

ActiveSupport::HashWithIndifferentAccess-ish instead of Hash for Ruby plugin

It's developer friendly if plugins can access 'task' definition both via Symbol and String as a key.

def self.transaction(config, &control)
  ...
  task[:foo] = 42
  yield(task, schema, threads)
  ...
  report
end

attr_reader :task
def run
  task[:foo] #=> nil because of ser-de, expecting 42
end

Writing broken records or texts to a file to store them or retry later

Data include a lot of broken records. A bulk import can skip them but we want to load them later as an exceptional case. To do it, we want to get those error records written to other files or databases.

Difficulty in terms of API design is that format of the records can be different depending on plugin types.

Encoder, Decoder, some Parser plugins
- These plugins read with buffer. They can't recognize "records". When they detect broken data, they skip the entire file
Line-based parser plugins
- Some parser plugins are based on lines (e.g. csv). They can skip a line and continue parsing from the next line.
Formatter and Output plugins
- Formatter plugins read records. They can skip a record whose schema is fixed by the previous plugins.
Filter plugins (#26)
- Filter plugins read records. They can skip a record whose schema is fixed by the previous plugins.

So, depending on plugins, error output needs to store 3 kinds of data:

a) file from a certain position
b) line
c) record with various schema

embulk-core JRuby bridge unit testing

Nothing!

Make the transaction fail if the error ratio is too large

I want to set the error ratio for rolling back the transaction.

Implement JRuby bridge for FileOutputPlugin

spi2's Embulk gets NameError by sample config.yml

@frsyuki @nahi Embulk on spi2 branch doesn't work and gets NameError: uninitialized constant Embulk::GuessCsv::TFGuess. Could you fix it or do you have any workarounds?

$ java -cp $(echo $(find */target -name "*.jar") | sed "s/ /:/g") org.embulk.cli.Embulk examples/config.yml
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/slf4j-log4j12-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/muga/works/workspace/embulk/embulk-core/target/dependency/slf4j-log4j12-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/muga/works/workspace/embulk/embulk-standards/target/dependency/slf4j-log4j12-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" org.jruby.exceptions.RaiseException: (NameError) uninitialized constant Embulk::GuessCsv::TFGuess
    at org.jruby.RubyModule.const_missing(org/jruby/RubyModule.java:2723)
    at RUBY.guess_type(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:159)
    at RUBY.guess_field_types(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:115)
    at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)
    at org.jruby.RubyEnumerable.each_with_index(org/jruby/RubyEnumerable.java:977)
    at RUBY.guess_field_types(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:115)
    at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)
    at RUBY.guess_field_types(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:114)
    at RUBY.guess_lines(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:29)
    at RUBY.guess(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_plugin.rb:105)
    at RUBY.guess(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_plugin.rb:44)
    at Embulk$$GuessPlugin$$JavaAdapter_1955955176.guess(Embulk$$GuessPlugin$$JavaAdapter_1955955176.gen:13)

Throws "Embulk::PluginLoadError: Unknown guess plugin 'gzip' " even if i have installed gzip gem.

Mac OS X 10.10.1
Oracle Java version "1.7.0_45"

Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar gem list

*** LOCAL GEMS ***

ffi (1.9.3 java)
jar-dependencies (0.1.2)
jruby-openssl (0.9.5 java)
json (1.8.0 java)
krypt (0.0.2)
krypt-core (0.0.2 universal-java)
krypt-provider-jdk (0.0.2)
rake (10.1.0)
rdoc (4.1.2)
Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar gem install gzip
Fetching: gzip-1.0.gem (100%)
Successfully installed gzip-1.0
1 gem installed
Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar guess embulk-example/example.yml -o config.yml
2015-02-17 07:09:27,493 [INFO]: main:org.embulk.standards.LocalFileInputPlugin: Listing local files at directory '/Users/takumi/Documents/embulk-example/csv' filtering filename by prefix 'sample_'
2015-02-17 07:09:27,502 [INFO]: main:org.embulk.standards.LocalFileInputPlugin: Loading files [/Users/takumi/Documents/embulk-example/csv/sample_01.csv.gz]
Embulk::PluginLoadError: Unknown guess plugin 'gzip'. embulk/guess/gzip.rb is not installed. Run 'embulk gem search -rd embulk-guess' command to find plugins.
          lookup at file:/Users/takumi/Documents/embulk.jar!/embulk/plugin_registry.rb:30
          lookup at file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:188
  new_java_guess at file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:178
  new_java_guess at /Users/takumi/Documents/embulk.jar!/META-INF/jruby.home/lib/ruby/1.9/forwardable.rb:201
             run at file:/Users/takumi/Documents/embulk.jar!/embulk/command/embulk_run.rb:275
          (root) at classpath:embulk/command/embulk.rb:39
org/embulk/plugin/PluginManager.java:42:in `newPlugin': org.embulk.config.ConfigException: GuessPlugin 'gzip' is not found
    from org/embulk/spi/ExecSession.java:106:in `newPlugin'
    from org/embulk/spi/Exec.java:54:in `newPlugin'
    from org/embulk/exec/GuessExecutor.java:267:in `run'
    from org/embulk/spi/FileInputRunner.java:129:in `run'
    from org/embulk/exec/GuessExecutor.java:118:in `run'
    from org/embulk/spi/FileInputRunner.java:101:in `run'
    from org/embulk/exec/GuessExecutor.java:251:in `transaction'
    from org/embulk/spi/FileInputRunner.java:95:in `run'
    from org/embulk/spi/util/Decoders.java:77:in `transaction'
    from org/embulk/spi/util/Decoders.java:33:in `transaction'
    from org/embulk/spi/FileInputRunner.java:92:in `run'
    from org/embulk/exec/GuessExecutor.java:170:in `transaction'
    from org/embulk/spi/FileInputRunner.java:60:in `transaction'
    from org/embulk/exec/GuessExecutor.java:114:in `runGuessInput'
    from org/embulk/exec/GuessExecutor.java:93:in `doGuess'
    from org/embulk/exec/GuessExecutor.java:34:in `access$000'
    from org/embulk/exec/GuessExecutor.java:73:in `run'
    from org/embulk/exec/GuessExecutor.java:70:in `run'
    from org/embulk/spi/Exec.java:21:in `doWith'
    from org/embulk/exec/GuessExecutor.java:70:in `guess'
    from org/embulk/command/Runner.java:168:in `guess'
    from org/embulk/command/Runner.java:70:in `main'
    from java/lang/reflect/Method.java:606:in `invoke'
    from file:/Users/takumi/Documents/embulk.jar!/embulk/command/embulk_run.rb:275:in `run'
    from classpath:embulk/command/embulk.rb:39:in `(root)'
    from classpath_3a_embulk/command/classpath:embulk/command/embulk.rb:39:in `(root)'
    from org/embulk/cli/Main.java:13:in `main'
Caused by:
file:/Users/takumi/Documents/embulk.jar!/embulk/plugin_registry.rb:30:in `lookup': org.jruby.exceptions.RaiseException: (PluginLoadError) Unknown guess plugin 'gzip'. embulk/guess/gzip.rb is not installed. Run 'embulk gem search -rd embulk-guess' command to find plugins.
    from file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:188:in `lookup'
    from file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:178:in `new_java_guess'
    from /Users/takumi/Documents/embulk.jar!/META-INF/jruby.home/lib/ruby/1.9/forwardable.rb:201:in `new_java_guess'
    from file:/Users/takumi/Documents/embulk.jar!/embulk/command/embulk_run.rb:275:in `run'
    from classpath:embulk/command/embulk.rb:39:in `(root)'
Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar gem list

*** LOCAL GEMS ***

ffi (1.9.3 java)
gzip (1.0)
jar-dependencies (0.1.2)
jruby-openssl (0.9.5 java)
json (1.8.0 java)
krypt (0.0.2)
krypt-core (0.0.2 universal-java)
krypt-provider-jdk (0.0.2)
rake (10.1.0)
rdoc (4.1.2)

Implement JRuby bridge for ParserPlugin

Embulk does not have a mascot or logo

This is an important issue!

Implement JRuby bridge for EncoderPlugin

Add InputPlugin#guess

In https://github.com/komamitsu/embulk-plugin-redis, I want to use "bin/embulk guess" command to guess a schema in Redis and generate schema information automatically as much as possible. Of cause it's not a complete schema because Redis is schema-less, though.

To achieve that, I think we need to add InputPlugin#guess method like this:

  class InputXxxxx < InputPlugin
    def self.guess(config)
      x = Xxxxx.connect(
              config.param('host', :string, :default => 'localhost'),
              config.param('port', :integer, :default => 8888))
      xs = x.get((0...x.size).select{|i| i % 10 == 0})
      // Checking the schemas of `xs'...
          :
      cols << Column.new(idx, schema_name, schema_type)
          :
      cols
    end

DataSource.merge needs a behavior option to merge arrays

When DataSource.merge finds an array, it appends new array to existent array. This is good when it merges guessed NextConfig to original ConfigSource.

However, it's not good to merge NextConfig of previous execution to the previous ConfigSource.

We need 2 merges:

merge (for next config): {"array": [1,2,3]}.merge({"array": [4,5,6]}) => {"array":[4,5,6]}
append (for guess): {"array": [1,2,3]}.merge({"array": [4,5,6]}) => {"array":[1,2,3,4,5,6]}

Implement JRuby bridge for FormatterPlugin

Generate JavaDoc

We need JavaDoc for plugin development in Java.

Write gradle script to generate javadoc. An example at Groovy is at: https://github.com/groovy/groovy-core/blob/master/gradle/docs.gradle#L103
Write gradle script to automatically publish the document to github pages
Setup domain (docs.embulk.org?)

Create HowTo page on Wiki

We can write following how-to documents:

How to write plugins?
- ruby input plugin
- ruby output plugin
- java input plugin
- java output plugin
- java file input plugin
- java file output plugin
- java parser plugin
- java formatter plugin
- java encoder plugin
- java decoder plugin
How to read and write local CSV files?
How to read and write files on S3?

Adding Executor plugin SPI

Embulk has only LocalExecutor but distributed execution is one of the main goals of Embulk.
To support various distributed execution environments, having pluggable & dynamically-loadable executor SPI is good.

Include processor index to log messages

With 3 processors, I got following log messages:

2015-02-10 17:07:42,753 [INFO]: embulk-executor-1:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: embulk-executor-0:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: embulk-executor-2:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,755 [INFO]: embulk-executor-2:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: > 0.00 seconds (loaded 10 rows in total)
2015-02-10 17:07:42,755 [INFO]: embulk-executor-0:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: > 0.00 seconds (loaded 10 rows in total)
2015-02-10 17:07:42,755 [INFO]: embulk-executor-1:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: > 0.00 seconds (loaded 10 rows in total)
2015-02-10 17:07:42,756 [INFO]: main:org.embulk.exec.LocalExecutor: {done:  3 / 3, running: 0}
2015-02-10 17:07:42,756 [INFO]: main:org.embulk.exec.LocalExecutor: {done:  3 / 3, running: 0}
2015-02-10 17:07:42,757 [INFO]: main:org.embulk.exec.LocalExecutor: {done:  3 / 3, running: 0}
2015-02-10 17:07:42,763 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: SQL: SET search_path TO "public"
2015-02-10 17:07:42,763 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: > 0.00 seconds
2015-02-10 17:07:42,763 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: SQL: DROP TABLE IF EXISTS "load01"
2015-02-10 17:07:42,765 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: > 0.00 seconds
2015-02-10 17:07:42,765 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: SQL: ALTER TABLE "load01_0000000054daab5e0b71b000_BULK_LOAD_TEMP" RENAME TO "load01"
2015-02-10 17:07:42,765 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: > 0.00 seconds

It's difficult to know which message is from which task processor thread. If the message has processor index, it will be:

2015-02-10 17:07:42,753 [INFO]: [1] embulk-executor-1:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: [2] embulk-executor-0:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: [3] embulk-executor-2:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)

Ruby Symbols are converted to String implicitly after passing them to `yield' as a parameter in InputPlugin.transaction

Ruby Symbols are converted to String implicitly after passing them to yield as a parameter in InputPlugin.transaction.

module Embulk
  class InputRedis < InputPlugin
    Plugin.register_input('redis', self)
    def self.transaction(config, &control)
      task = {
            :
        :hello => :world
      }
      commit_reports = yield(task, columns, 1)
            :


    def run
      require 'pp'
      pp @task    ==> {...., "hello"=>"world"}

I just looked over the source code. org.embulk.config.ModelManager uses com.fasterxml.jackson.databind.ObjectMapper#writeValueAsString and I guess it converts RubySymbol to just String internally and it's related to the issue.

Add number of error records and their line numbers to PreviewResult

We should add more information to PreviewResult. Because we check the result of PreviewExecutor and often determine whether we should execute run command or not. The current PreviewResult provides schema and list of ignored exceptions for users. I think that it's better to add more:

of records that PreviewExecutor read.
of error records
actual error raw records and those line number (optional)

I also think that the above information can be build by using CommitReport on https://github.com/embulk/embulk/blob/master/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L106

Develop a gradle plugin to make it easy for users to develop embulk plugins

Java developers probably don't want to write Ruby code and don't know Ruby's packaging system.
But Embulk uses RubyGems for packaging.

Idea here is that Embulk provides a gradle plugin to automate the packaging.

The plugin does:

provide a task to create pkg/embulk-plugin-xyz.gem package.
- the task builds a jar file and copies dependency jars to classpath/*
- then package them into pkg/embulk-plugin-xyz.gem
- I think it can assume that developers write embulk-plugin-xyz.gemspec file
provide a task to release the gem file to RubyGems
- I think it can assume that developers write ~/.gem/credentials file
manage versions of dependencies of basic packages to not cause version conflicts with embulk-core (like parent pom file)

Remove pom.xml

Once bulid.grade file gets stabled, we don't need pom.xml any more.
Delete pom.xml and update README.md.

Investigate: disable setContextClassLoader calls within JRuby

I got the following error during executing output-elasticsearch plugin. The root cause is that a context classloader cannot find the resoruce file. The problem might occur during other Java plugins are executed.

org/elasticsearch/env/Environment.java:213:in `resolveConfig': org.elasticsearch.env.FailedToResolveConfigException: Failed to resolve config path [names.txt], tried file path [names.txt], path file [.../config/names.txt], and classpath
    from org/elasticsearch/node/internal/InternalSettingsPreparer.java:119:in `prepareSettings'
    from org/elasticsearch/client/transport/TransportClient.java:157:in `<init>'
    from org/elasticsearch/client/transport/TransportClient.java:115:in `<init>'

Add guess API to input and output plugins

GuessPlugin completes "parser" and "decoders" section of a configuration file. But we also want to make "in" section guessable.
For example, redis is a schema-less database (not file format). It's InputPlugin (not FileInputPlugin or ParserPlugin). To make the behavior deterministic, we should write schema to configuration file. But it's hard work to write the all schema manually. Plugins should be able to help it.

Adding filter plugin API

We want to apply conversion or filtering to records when:

hash: applying a hash function to sensitive columns (e.g. account_id, email)
grep: skipping a record if it matches certain condition (e.g. skip access logs if they're internal callback requests)
convert column type: convert created_at:string column to timestamp
skip column: convert comment field to nil if it's too long
remove column: remove certain columns (e.g. remove "password" column)
join: joining data with other data stores. (e.g. join email, age, and other user information to user_id)

Guess plugin is not dynamic loadable

We can't install guess plugins using RubyGems because GuessExecutor uses fixed list of guess plugins (gzip, charset, newline, csv).

Separate S3 plugins to another repository and plugin gems

aws-sdk jar is too large to include the single-package jar.

9.9M classpath/aws-java-sdk-1.5.2.jar

The idea is to remove the dependency from embulk and create embulk-plugin-s3

embulk / embulk Goto Github PK

embulk's Introduction

What's Embulk?

Document

Using plugins

Resuming a failed transaction

Using plugin bundle

Use cases

Upgrading to the latest version

Embulk Development

Build

Update JRuby

Release

Prerequisite: Sonatype OSSRH

Prerequisite: PGP signatures

Release

embulk's People

Contributors

Stargazers

Watchers

Forkers

embulk's Issues

of records that PreviewExecutor read.

of error records

Recommend Projects

Recommend Topics

Recommend Org