Code Monkey home page Code Monkey logo

embulk's Introduction

What's Embulk?

Embulk is a parallel bulk data loader that helps data transfer between various storages, databases, NoSQL and cloud services.

Embulk supports plugins to add functions. You can share the plugins to keep your custom scripts readable, maintainable, and reusable.

Embulk Embulk, an open-source plugin-based parallel bulk data loader at Slideshare

Document

Embulk documents: https://www.embulk.org/

Using plugins

You can use plugins to load data from/to various systems and file formats. Here is the list of publicly released plugins: list of plugins by category.

An example is embulk-output-command plugin. It executes an external command to output the records.

To install plugins, you can use embulk gem install <name> command:

embulk gem install embulk-output-command
embulk gem list

Embulk bundles some built-in plugins such as embulk-encoder-gzip or embulk-formatter-csv. You can use those plugins with following configuration file:

in:
  type: file
  path_prefix: "./try1/csv/sample_"
  ...
out:
  type: command
  command: "cat - > task.$INDEX.$SEQID.csv.gz"
  encoders:
    - {type: gzip}
  formatter:
    type: csv

Resuming a failed transaction

Embulk supports resuming failed transactions. To enable resuming, you need to start transaction with -r PATH option:

embulk run config.yml -r resume-state.yml

If the transaction fails, embulk stores state some states to the yaml file. You can retry the transaction using exactly same command:

embulk run config.yml -r resume-state.yml

If you give up on resuming the transaction, you can use embulk cleanup subcommand to delete intermediate data:

embulk cleanup config.yml -r resume-state.yml

Using plugin bundle

embulk mkbundle subcommand creates a isolated bundle of plugins. You can install plugins (gems) to the bundle directory instead of ~/.embulk directory. This makes it easy to manage versions of plugins. To use the bundle, add -b <bundle_dir> option to guess, preview, or run subcommand. embulk mkbundle also generates some example plugins to <bundle_dir>/embulk/*.rb directory.

See the generated <bundle_dir>/Gemfile file how to plugin bundles work.

embulk mkbundle ./embulk_bundle  # please edit ./embulk_bundle/Gemfile to add plugins. Detailed usage is written in the Gemfile
embulk guess -b ./embulk_bundle ...
embulk run   -b ./embulk_bundle ...

Use cases

For further details, visit Embulk documentation.

Upgrading to the latest version

Following command updates embulk itself to the specific released version.

embulk selfupdate x.y.z

Embulk Development

Build

./gradlew cli  # creates pkg/embulk-VERSION.jar

You can see JaCoCo's test coverage report at ${project}/build/reports/tests/index.html You can see Findbug's report at ${project}/build/reports/findbug/main.html # FIXME coverage information is not included somehow

You can use classpath task to use bundle exec ./bin/embulk for development:

./gradlew -t classpath  # -x test: skip test
./bin/embulk

To deploy artifacts to your local maven repository at ~/.m2/repository/:

./gradlew install

To compile the source code of embulk-core project only:

./gradlew :embulk-core:compileJava

Task dependencies shows dependency tree of embulk-core project:

./gradlew :embulk-core:dependencies

Update JRuby

Modify jrubyVersion in build.gradle to update JRuby of Embulk.

Release

Prerequisite: Sonatype OSSRH

You need an account in Sonatype OSSRH, and configure it in your ~/.gradle/gradle.properties.

ossrhUsername=(your Sonatype OSSRH username)
ossrhPassword=(your Sonatype OSSRH password)

Prerequisite: PGP signatures

You need your PGP signatures to release artifacts into Maven Central, and configure Gradle to use your key to sign.

signing.keyId=(the last 8 symbols of your keyId)
signing.password=(the passphrase used to protect your private key)
signing.secretKeyRingFile=(the absolute path to the secret key ring file containing your private key)

Release

Modify version in build.gradle at a detached commit to bump Embulk version up.

git checkout --detach master
(Remove "-SNAPSHOT" in "version" in build.gradle.)
git add build.gradle
git commit -m "Release vX.Y.Z"
git tag -a vX.Y.Z
(Write the release note for vX.Y.Z in the tag annotation.)
./gradlew clean && ./gradlew release
git push -u origin vX.Y.Z

embulk's People

Contributors

civitaspo avatar cosmo0920 avatar dependabot[bot] avatar dmikurube avatar flada-auxv avatar frsyuki avatar hata avatar hiroyuki-sato avatar hishidama avatar hito4t avatar huydx avatar jca02266 avatar joker1007 avatar kamatama41 avatar kiyoto avatar ksss avatar mikoto2000 avatar muga avatar nishidayuya avatar sakama avatar shroman avatar smdmts avatar sonots avatar suzukaze avatar syohex avatar takuti avatar uu59 avatar yaggytter avatar ykubota avatar yyamano avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

embulk's Issues

"java -jar embulk-0.1.0.jar bundle ./data" failed when coping ".bundle/config"

When I executed "java -jar embulk-0.1.0.jar bundle ./data" according to README.md, this error occurred.

$ java -jar embulk-0.1.0.jar bundle ./data
Initializing ./data...
Errno::ENOENT: No such file or directory - No such file or directory
               stat at org/jruby/RubyFile.java:848
   fu_each_src_dest at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:1525
  fu_each_src_dest0 at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:1541
   fu_each_src_dest at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:1523
                 cp at /foo/bar/embulk/embulk-0.1.0.jar!/META-INF/jruby.home/lib/ruby/1.9/fileutils.rb:395
                run at file:/foo/bar/embulk/embulk-0.1.0.jar!/embulk/command/embulk_run.rb:122
               each at org/jruby/RubyArray.java:1613
                run at file:/foo/bar/embulk/embulk-0.1.0.jar!/embulk/command/embulk_run.rb:118
             (root) at classpath:embulk/command/embulk.rb:38

I looked over lib/embulk/command/embulk_run.rb.

            %w[.bundle/config embulk/input_example.rb embulk/output_example.rb examples/csv-stdout.yml examples/sample.csv.gz Gemfile Gemfile.lock].each do |file|  # TODO get file list from the jar
              url = resource_class.resource("/embulk/data/bundle/#{file}").to_s
              dst = File.join(path, file)
              FileUtils.mkdir_p File.dirname(dst)
              FileUtils.cp(url, dst)

But url seems empty and embulk-0.1.0.jar doesn't include .bundle/config.

Gzip encoder plugin is missing

It throws an exception:

java.lang.RuntimeException: java.lang.AssertionError: OutputStreamFileOutput is not implemented yet

Automate incrementing version number and releasing tasks

As this commit shows, creating a new release is complicated:
24cc193

  1. increment version numbers at pom.xml and embulk-*/pom.xml
  2. increment version number at README.md
  3. increment version number at build.gradle
  4. increment version number at lib/embulk/version.rb
  5. git commit -am v0.2.1
  6. git tag v0.2.1
  7. ./gradlew bintrayUpload
    9 rake
  8. go to bintray and upload embulk-0.2.1.jar manually
  9. gem push pkg/embulk-0.2.1.gem
  10. git push && git push --tags

I want to make them as like this:

  1. run rake set_verson 0.2.1 (runs step 1,2,3,4)
  2. git commit -am v0.2.1
  3. git tag v0.2.1
  4. ./gradlew bintrayUpload (runs step 8, 9, 10)
  5. gem push pkg/embulk-0.2.1.gem (if possible, bintrayUpload should also do this)
  6. git push && git push --tags

Search and load guess plugins automatically

Unlike other plugins, users normally don't know name of guess plugins. So, they don't write name of the guess plugins to configuration files.
To make guess plugins pluggable, Embulk should find and load installed guess plugins automatically.

Could not find property 'bintray_user'

Let me know how to set bintray_user:

> ./gradlew build

FAILURE: Build failed with an exception.

* Where:
Build file '/Users/leo/work/git/embulk/build.gradle' line: 29

* What went wrong:
A problem occurred evaluating root project 'embulk'.
> Could not find property 'bintray_user' on com.jfrog.bintray.gradle.BintrayExtension_Decorated@108a46d6.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 2.691 secs

Add an option to CSV parser plugin to skip first some lines

Some CSV files have some garbage lines.
Those lines are informative for some cases but they're not a part of table.

For example,

created date,2015-03-05
created by,me
device,xyz
time,key1,key2,key3,key4
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
2015-03-05 20:40:14 -0800,a,b,c,d
...

"new ruby-input" generator does not work properly.

"new ruby-input" generator does not work properly.

java -jar ./embulk-0.4.1.jar new ruby-input sample

% java -jar ./embulk-0.4.1.jar new ruby-input sample
Creating embulk-input-sample/
  Creating embulk-input-sample/README.md
  Creating embulk-input-sample/LICENSE.txt
  Creating embulk-input-sample/.gitignore
  Creating embulk-input-sample/Rakefile
  Creating embulk-input-sample/Gemfile
  Creating embulk-input-sample/embulk-input-sample.gemspec
  Creating embulk-input-sample/lib/embulk/input/sample.rb

rake build (and failed.)

% cd embulk-input-sample/
% rake build 
rake aborted!
There was a SyntaxError while loading embulk-input-sample.gemspec: 
/home/embulk/embulk-input-sample/embulk-input-sample.gemspec:7: syntax error, unexpected tIDENTIFIER, expecting keyword_end
...h the output plugins by "embulk-output" keyword."
...                               ^
/home/arch/embulk-input-sample/embulk-input-sample.gemspec:7: syntax error, unexpected tSTRING_BEG, expecting keyword_do or '{' or '('
...tput plugins by "embulk-output" keyword."
...                               ^
/home/embulk/embulk-input-sample/Rakefile:1:in `<top (required)>'
(See full trace by running task with --trace)

Can't use double quote "embulk-output" in gemspec file.

  spec.description   = "Sample input plugin is an Embulk plugin that loads records from Sample so that any output plugins can receive the records. Search the output plugins by "embulk-output" keyword."

Use single quote better

  spec.description   = "Sample input plugin is an Embulk plugin that loads records from Sample so that any output plugins can receive the records. Search the output plugins by 'embulk-output' keyword."

And one more thing.

module Embulk
  module Input

    class SampleInputPlugin < InputPlugin
      Plugin.register_input(sample, self)

need single quote??

module Embulk
  module Input

    class SampleInputPlugin < InputPlugin
      Plugin.register_input('sample', self)

Plugin loader of Java plugins

Embulk can't load java-based plugins (jar files) dynamically. Users need to add the jar files to classpath. Inconvenient...

Resuming failed transaction

A bulk load likely fails when we import large amount of data (say 1TB). When it fails, Embulk currently requires plugins to rollback entire transaction so that we can retry.
However, retrying the entire bulk load of the large data takes too long time. It's very good if we can resume the failed transaction from partially succeeded state.

Difficulties:

  • What's the API for plugins?
  • Is a input plugin responsible to skip tasks? Or, is a output plugin responsible to skip records?

Command-line (ssh) distributed executor

Idea:
I want to execute Embulk using this command line:

$ embulk run config.yml --execute-command "ssh %(find_host %{index}) embulk run-task %{task} %{index}"

With this command, embulk executes the given ssh command for each task. Then embulk run-task executes the given task on another host.
This is the simplest way to run embulk on distributed environment. We don't have to install Hadoop, YARN, Mesos, etc.

Some distributed execution environment such as mpich2 or Sun Grid Engine may work using this command-line.

See also: #29

Support array and map types

Some databases can store nested values.
Some file formats such as JSON or XM have nested values.
To transfer between those databases and file formats, embulk needs to support nested types.

Discussions are:

  • Array:
    • Fixed type array or variable type array
    • Fixed-type array: array(T) like array(long). [1, 2, 3]
    • Variable-type array: array. [1, "str", 0.3]
  • Map:
    • Fixed type or variable map
    • Allow non-string keys or not
  • Struct:
    • Fixed-length map with fixed-types for each field.
    • For example, struct(int, string))

Writing broken records or texts to a file to store them or retry later

Data include a lot of broken records. A bulk import can skip them but we want to load them later as an exceptional case. To do it, we want to get those error records written to other files or databases.

Difficulty in terms of API design is that format of the records can be different depending on plugin types.

  • Encoder, Decoder, some Parser plugins
    • These plugins read with buffer. They can't recognize "records". When they detect broken data, they skip the entire file
  • Line-based parser plugins
    • Some parser plugins are based on lines (e.g. csv). They can skip a line and continue parsing from the next line.
  • Formatter and Output plugins
    • Formatter plugins read records. They can skip a record whose schema is fixed by the previous plugins.
  • Filter plugins (#26)
    • Filter plugins read records. They can skip a record whose schema is fixed by the previous plugins.

So, depending on plugins, error output needs to store 3 kinds of data:

a) file from a certain position
b) line
c) record with various schema

spi2's Embulk gets NameError by sample config.yml

@frsyuki @nahi Embulk on spi2 branch doesn't work and gets NameError: uninitialized constant Embulk::GuessCsv::TFGuess. Could you fix it or do you have any workarounds?

$ java -cp $(echo $(find */target -name "*.jar") | sed "s/ /:/g") org.embulk.cli.Embulk examples/config.yml
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/slf4j-log4j12-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/muga/works/workspace/embulk/embulk-core/target/dependency/slf4j-log4j12-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/muga/works/workspace/embulk/embulk-standards/target/dependency/slf4j-log4j12-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" org.jruby.exceptions.RaiseException: (NameError) uninitialized constant Embulk::GuessCsv::TFGuess
    at org.jruby.RubyModule.const_missing(org/jruby/RubyModule.java:2723)
    at RUBY.guess_type(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:159)
    at RUBY.guess_field_types(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:115)
    at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)
    at org.jruby.RubyEnumerable.each_with_index(org/jruby/RubyEnumerable.java:977)
    at RUBY.guess_field_types(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:115)
    at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)
    at RUBY.guess_field_types(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:114)
    at RUBY.guess_lines(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_csv.rb:29)
    at RUBY.guess(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_plugin.rb:105)
    at RUBY.guess(file:/Users/muga/works/workspace/embulk/embulk-cli/target/dependency/embulk-core-0.1.0-SNAPSHOT.jar!/embulk/guess_plugin.rb:44)
    at Embulk$$GuessPlugin$$JavaAdapter_1955955176.guess(Embulk$$GuessPlugin$$JavaAdapter_1955955176.gen:13)

Throws "Embulk::PluginLoadError: Unknown guess plugin 'gzip' " even if i have installed gzip gem.

Mac OS X 10.10.1
Oracle Java version "1.7.0_45"

Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar gem list

*** LOCAL GEMS ***

ffi (1.9.3 java)
jar-dependencies (0.1.2)
jruby-openssl (0.9.5 java)
json (1.8.0 java)
krypt (0.0.2)
krypt-core (0.0.2 universal-java)
krypt-provider-jdk (0.0.2)
rake (10.1.0)
rdoc (4.1.2)
Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar gem install gzip
Fetching: gzip-1.0.gem (100%)
Successfully installed gzip-1.0
1 gem installed
Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar guess embulk-example/example.yml -o config.yml
2015-02-17 07:09:27,493 [INFO]: main:org.embulk.standards.LocalFileInputPlugin: Listing local files at directory '/Users/takumi/Documents/embulk-example/csv' filtering filename by prefix 'sample_'
2015-02-17 07:09:27,502 [INFO]: main:org.embulk.standards.LocalFileInputPlugin: Loading files [/Users/takumi/Documents/embulk-example/csv/sample_01.csv.gz]
Embulk::PluginLoadError: Unknown guess plugin 'gzip'. embulk/guess/gzip.rb is not installed. Run 'embulk gem search -rd embulk-guess' command to find plugins.
          lookup at file:/Users/takumi/Documents/embulk.jar!/embulk/plugin_registry.rb:30
          lookup at file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:188
  new_java_guess at file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:178
  new_java_guess at /Users/takumi/Documents/embulk.jar!/META-INF/jruby.home/lib/ruby/1.9/forwardable.rb:201
             run at file:/Users/takumi/Documents/embulk.jar!/embulk/command/embulk_run.rb:275
          (root) at classpath:embulk/command/embulk.rb:39
org/embulk/plugin/PluginManager.java:42:in `newPlugin': org.embulk.config.ConfigException: GuessPlugin 'gzip' is not found
    from org/embulk/spi/ExecSession.java:106:in `newPlugin'
    from org/embulk/spi/Exec.java:54:in `newPlugin'
    from org/embulk/exec/GuessExecutor.java:267:in `run'
    from org/embulk/spi/FileInputRunner.java:129:in `run'
    from org/embulk/exec/GuessExecutor.java:118:in `run'
    from org/embulk/spi/FileInputRunner.java:101:in `run'
    from org/embulk/exec/GuessExecutor.java:251:in `transaction'
    from org/embulk/spi/FileInputRunner.java:95:in `run'
    from org/embulk/spi/util/Decoders.java:77:in `transaction'
    from org/embulk/spi/util/Decoders.java:33:in `transaction'
    from org/embulk/spi/FileInputRunner.java:92:in `run'
    from org/embulk/exec/GuessExecutor.java:170:in `transaction'
    from org/embulk/spi/FileInputRunner.java:60:in `transaction'
    from org/embulk/exec/GuessExecutor.java:114:in `runGuessInput'
    from org/embulk/exec/GuessExecutor.java:93:in `doGuess'
    from org/embulk/exec/GuessExecutor.java:34:in `access$000'
    from org/embulk/exec/GuessExecutor.java:73:in `run'
    from org/embulk/exec/GuessExecutor.java:70:in `run'
    from org/embulk/spi/Exec.java:21:in `doWith'
    from org/embulk/exec/GuessExecutor.java:70:in `guess'
    from org/embulk/command/Runner.java:168:in `guess'
    from org/embulk/command/Runner.java:70:in `main'
    from java/lang/reflect/Method.java:606:in `invoke'
    from file:/Users/takumi/Documents/embulk.jar!/embulk/command/embulk_run.rb:275:in `run'
    from classpath:embulk/command/embulk.rb:39:in `(root)'
    from classpath_3a_embulk/command/classpath:embulk/command/embulk.rb:39:in `(root)'
    from org/embulk/cli/Main.java:13:in `main'
Caused by:
file:/Users/takumi/Documents/embulk.jar!/embulk/plugin_registry.rb:30:in `lookup': org.jruby.exceptions.RaiseException: (PluginLoadError) Unknown guess plugin 'gzip'. embulk/guess/gzip.rb is not installed. Run 'embulk gem search -rd embulk-guess' command to find plugins.
    from file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:188:in `lookup'
    from file:/Users/takumi/Documents/embulk.jar!/embulk/plugin.rb:178:in `new_java_guess'
    from /Users/takumi/Documents/embulk.jar!/META-INF/jruby.home/lib/ruby/1.9/forwardable.rb:201:in `new_java_guess'
    from file:/Users/takumi/Documents/embulk.jar!/embulk/command/embulk_run.rb:275:in `run'
    from classpath:embulk/command/embulk.rb:39:in `(root)'
Yoshida-no-MacBook-Air:Documents takumi$ java -jar embulk.jar gem list

*** LOCAL GEMS ***

ffi (1.9.3 java)
gzip (1.0)
jar-dependencies (0.1.2)
jruby-openssl (0.9.5 java)
json (1.8.0 java)
krypt (0.0.2)
krypt-core (0.0.2 universal-java)
krypt-provider-jdk (0.0.2)
rake (10.1.0)
rdoc (4.1.2)

Add InputPlugin#guess

In https://github.com/komamitsu/embulk-plugin-redis, I want to use "bin/embulk guess" command to guess a schema in Redis and generate schema information automatically as much as possible. Of cause it's not a complete schema because Redis is schema-less, though.

To achieve that, I think we need to add InputPlugin#guess method like this:

  class InputXxxxx < InputPlugin
    def self.guess(config)
      x = Xxxxx.connect(
              config.param('host', :string, :default => 'localhost'),
              config.param('port', :integer, :default => 8888))
      xs = x.get((0...x.size).select{|i| i % 10 == 0})
      // Checking the schemas of `xs'...
          :
      cols << Column.new(idx, schema_name, schema_type)
          :
      cols
    end

DataSource.merge needs a behavior option to merge arrays

When DataSource.merge finds an array, it appends new array to existent array. This is good when it merges guessed NextConfig to original ConfigSource.

However, it's not good to merge NextConfig of previous execution to the previous ConfigSource.

We need 2 merges:

  • merge (for next config): {"array": [1,2,3]}.merge({"array": [4,5,6]}) => {"array":[4,5,6]}
  • append (for guess): {"array": [1,2,3]}.merge({"array": [4,5,6]}) => {"array":[1,2,3,4,5,6]}

Create HowTo page on Wiki

We can write following how-to documents:

  • How to write plugins?
    • ruby input plugin
    • ruby output plugin
    • java input plugin
    • java output plugin
    • java file input plugin
    • java file output plugin
    • java parser plugin
    • java formatter plugin
    • java encoder plugin
    • java decoder plugin
  • How to read and write local CSV files?
  • How to read and write files on S3?

Adding Executor plugin SPI

Embulk has only LocalExecutor but distributed execution is one of the main goals of Embulk.
To support various distributed execution environments, having pluggable & dynamically-loadable executor SPI is good.

Include processor index to log messages

With 3 processors, I got following log messages:

2015-02-10 17:07:42,753 [INFO]: embulk-executor-1:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: embulk-executor-0:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: embulk-executor-2:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,755 [INFO]: embulk-executor-2:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: > 0.00 seconds (loaded 10 rows in total)
2015-02-10 17:07:42,755 [INFO]: embulk-executor-0:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: > 0.00 seconds (loaded 10 rows in total)
2015-02-10 17:07:42,755 [INFO]: embulk-executor-1:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: > 0.00 seconds (loaded 10 rows in total)
2015-02-10 17:07:42,756 [INFO]: main:org.embulk.exec.LocalExecutor: {done:  3 / 3, running: 0}
2015-02-10 17:07:42,756 [INFO]: main:org.embulk.exec.LocalExecutor: {done:  3 / 3, running: 0}
2015-02-10 17:07:42,757 [INFO]: main:org.embulk.exec.LocalExecutor: {done:  3 / 3, running: 0}
2015-02-10 17:07:42,763 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: SQL: SET search_path TO "public"
2015-02-10 17:07:42,763 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: > 0.00 seconds
2015-02-10 17:07:42,763 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: SQL: DROP TABLE IF EXISTS "load01"
2015-02-10 17:07:42,765 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: > 0.00 seconds
2015-02-10 17:07:42,765 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: SQL: ALTER TABLE "load01_0000000054daab5e0b71b000_BULK_LOAD_TEMP" RENAME TO "load01"
2015-02-10 17:07:42,765 [INFO]: main:org.embulk.output.jdbc.JdbcOutputConnection: > 0.00 seconds

It's difficult to know which message is from which task processor thread. If the message has processor index, it will be:

2015-02-10 17:07:42,753 [INFO]: [1] embulk-executor-1:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: [2] embulk-executor-0:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)
2015-02-10 17:07:42,753 [INFO]: [3] embulk-executor-2:org.embulk.output.postgresql.PostgreSQLCopyBatchInsert: Loading 10 rows (290 bytes)

Ruby Symbols are converted to String implicitly after passing them to `yield' as a parameter in InputPlugin.transaction

Ruby Symbols are converted to String implicitly after passing them to yield as a parameter in InputPlugin.transaction.

module Embulk
  class InputRedis < InputPlugin
    Plugin.register_input('redis', self)
    def self.transaction(config, &control)
      task = {
            :
        :hello => :world
      }
      commit_reports = yield(task, columns, 1)
            :


    def run
      require 'pp'
      pp @task    ==> {...., "hello"=>"world"}

I just looked over the source code. org.embulk.config.ModelManager uses com.fasterxml.jackson.databind.ObjectMapper#writeValueAsString and I guess it converts RubySymbol to just String internally and it's related to the issue.

Add number of error records and their line numbers to PreviewResult

We should add more information to PreviewResult. Because we check the result of PreviewExecutor and often determine whether we should execute run command or not. The current PreviewResult provides schema and list of ignored exceptions for users. I think that it's better to add more:

  • of records that PreviewExecutor read.

  • of error records

  • actual error raw records and those line number (optional)

I also think that the above information can be build by using CommitReport on https://github.com/embulk/embulk/blob/master/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L106

Develop a gradle plugin to make it easy for users to develop embulk plugins

Java developers probably don't want to write Ruby code and don't know Ruby's packaging system.
But Embulk uses RubyGems for packaging.

Idea here is that Embulk provides a gradle plugin to automate the packaging.

The plugin does:

  • provide a task to create pkg/embulk-plugin-xyz.gem package.
    • the task builds a jar file and copies dependency jars to classpath/*
    • then package them into pkg/embulk-plugin-xyz.gem
    • I think it can assume that developers write embulk-plugin-xyz.gemspec file
  • provide a task to release the gem file to RubyGems
    • I think it can assume that developers write ~/.gem/credentials file
  • manage versions of dependencies of basic packages to not cause version conflicts with embulk-core (like parent pom file)

See also: #28

Remove pom.xml

Once bulid.grade file gets stabled, we don't need pom.xml any more.
Delete pom.xml and update README.md.

Investigate: disable setContextClassLoader calls within JRuby

I got the following error during executing output-elasticsearch plugin. The root cause is that a context classloader cannot find the resoruce file. The problem might occur during other Java plugins are executed.

org/elasticsearch/env/Environment.java:213:in `resolveConfig': org.elasticsearch.env.FailedToResolveConfigException: Failed to resolve config path [names.txt], tried file path [names.txt], path file [.../config/names.txt], and classpath
    from org/elasticsearch/node/internal/InternalSettingsPreparer.java:119:in `prepareSettings'
    from org/elasticsearch/client/transport/TransportClient.java:157:in `<init>'
    from org/elasticsearch/client/transport/TransportClient.java:115:in `<init>'

Add guess API to input and output plugins

GuessPlugin completes "parser" and "decoders" section of a configuration file. But we also want to make "in" section guessable.
For example, redis is a schema-less database (not file format). It's InputPlugin (not FileInputPlugin or ParserPlugin). To make the behavior deterministic, we should write schema to configuration file. But it's hard work to write the all schema manually. Plugins should be able to help it.

Adding filter plugin API

We want to apply conversion or filtering to records when:

  • hash: applying a hash function to sensitive columns (e.g. account_id, email)
  • grep: skipping a record if it matches certain condition (e.g. skip access logs if they're internal callback requests)
  • convert column type: convert created_at:string column to timestamp
  • skip column: convert comment field to nil if it's too long
  • remove column: remove certain columns (e.g. remove "password" column)
  • join: joining data with other data stores. (e.g. join email, age, and other user information to user_id)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.