
# S5

"Because your data is probably small enough!" -- Author

S5 is a tool to directly export data from SQL-like databases to Amazon S3. Although the same could be achieved by gluing existing tools together, S5 is built to:

  • Be standalone, have no runtime dependencies and be easy to run and deploy.
  • Have predictable operational behavior.
  • Leave no trace, and be friendly to the health of your systems.

S5 borrows its name from its previous incarnation, SqlS3Streamer, which has five S's in total.

## Use cases

This tool exports small data stored in SQL-like databases to Amazon S3 in a compressed, field- and record-delimited format such as TSV or CSV. It is primarily made with Amazon Redshift in mind, which imports such formats natively.

This tool is only useful if your data is small enough that, given your database and network resources, it can be exported in a reasonable amount of time.

You can export a select number of columns from your tables and even apply simple transformations, as far as your SQL database allows.
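
For example, the query you pass in (via the `--sql.query` flag shown under Running below) can pick a subset of columns and apply a per-row transformation; the table and column names here are hypothetical:

```
--sql.query="select id, lower(email) as email from users"
```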

## How it works

S5 runs an arbitrary SQL query against a database, streams the result in chunks and uses Amazon S3's multipart upload API to store the data on S3.

The result set can be arbitrarily large, such as tens of gigabytes of data, and memory consumption will remain constant.
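
In Go terms, the flow looks roughly like the sketch below: rows are serialized into one end of a pipe while the AWS SDK's uploader streams the other end to S3 in parts, so only one chunk needs to be in memory at a time. This is an illustrative sketch, not S5's actual code; the MySQL driver, schema, and credentials are placeholders, and aws-sdk-go's `s3manager` is assumed for the multipart upload.

```go
package main

import (
	"database/sql"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	_ "github.com/go-sql-driver/mysql" // placeholder driver
)

func main() {
	db, err := sql.Open("mysql", "user:pass@/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("select id, name from foos") // arbitrary query
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	pr, pw := io.Pipe()
	go func() {
		// Serialize rows into one end of the pipe...
		defer pw.Close()
		var id int64
		var name string
		for rows.Next() {
			if err := rows.Scan(&id, &name); err != nil {
				pw.CloseWithError(err)
				return
			}
			fmt.Fprintf(pw, "%d\t%s\n", id, name)
		}
	}()

	// ...while the uploader streams the other end to S3,
	// using the multipart upload API under the hood.
	up := s3manager.NewUploader(session.Must(session.NewSession(
		aws.NewConfig().WithRegion("us-east-1"))))
	if _, err := up.Upload(&s3manager.UploadInput{
		Bucket: aws.String("mydbs"),
		Key:    aws.String("2015/07/27/foos.gz"),
		Body:   pr,
	}); err != nil {
		log.Fatal(err)
	}
}
```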

### Output Format

Currently S5 only knows how to serialize records into field/record-delimited formats such as TSV and CSV.
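
A sketch of how such serialization can be done with Go's standard library, where switching the field delimiter turns CSV into TSV (illustrative, not necessarily S5's serializer):

```go
package main

import (
	"encoding/csv"
	"io"
	"os"
)

// writeDelimited serializes records with the given field delimiter:
// '\t' yields TSV, ',' yields CSV. encoding/csv quotes fields that
// contain the delimiter or newlines, so records stay unambiguous.
func writeDelimited(out io.Writer, delim rune, records [][]string) error {
	w := csv.NewWriter(out)
	w.Comma = delim
	return w.WriteAll(records) // WriteAll flushes and reports any error
}

func main() {
	_ = writeDelimited(os.Stdout, '\t', [][]string{
		{"1", "foo"},
		{"2", "bar\tbaz"}, // embedded tab gets quoted automatically
	})
}
```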

### Compression

S5 can compress the exported data in such a way that the resulting file, stitched together by Amazon S3's multipart API, forms a valid gzip file.
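
This works because RFC 1952 allows a gzip file to consist of multiple concatenated members: if every uploaded part is a complete gzip member on its own, the object S3 stitches together is itself a valid .gz file. A minimal sketch of the idea (not necessarily S5's exact implementation):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"os"
)

// gzipMember compresses one upload part as a self-contained gzip member.
func gzipMember(chunk []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(chunk); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close finalizes this member
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Two members concatenated are still one valid gzip stream:
	// `go run . | gunzip` prints both lines.
	for _, part := range []string{"part one\n", "part two\n"} {
		b, err := gzipMember([]byte(part))
		if err != nil {
			panic(err)
		}
		os.Stdout.Write(b)
	}
}
```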

### Pipelining

S5 works by reading chunks of data and uploading them until all the data is consumed. To speed up the process, reading and uploading happen in stages: while one part is being uploaded, the next is being read, serialized and compressed.
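
A sketch of that staging with goroutines and a buffered channel; `readPart` and `uploadPart` are hypothetical stand-ins for S5's read-serialize-compress and upload steps, and cancellation is elided for brevity:

```go
package main

import "fmt"

// pipeline overlaps the stages: a goroutine prepares the next part
// (read + serialize + compress) while the current one is uploaded.
func pipeline(readPart func() ([]byte, bool), uploadPart func([]byte) error) error {
	parts := make(chan []byte, 1) // a buffer of one part lets stages overlap
	go func() {
		defer close(parts)
		for {
			p, more := readPart()
			if !more {
				return
			}
			parts <- p
		}
	}()
	for p := range parts { // upload while the next part is being prepared
		if err := uploadPart(p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	n := 0
	read := func() ([]byte, bool) { // fake reader: three parts, then done
		n++
		return []byte(fmt.Sprintf("part %d", n)), n <= 3
	}
	upload := func(p []byte) error { fmt.Println(string(p)); return nil }
	_ = pipeline(read, upload)
}
```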

## Running

You can run S5 by supplying arguments like the following:

```
s5  --sql.user=user                               \
    --sql.password=pass                           \
    --sql.database=mydb                           \
    --s3.region us-east-1                         \
    --s3.accesskey S3ACCESSKEY                    \
    --s3.secretkey S3SECRETKEY                    \
    --s3.bucket mydbs                             \
    --s3.path '2015/07/27/foos.gz'                \
    --sql.query="select * from foos"              \
    --compress                                    \
    --chunksizemb 100
```

For an explanation of all command line arguments, try:

```
s5 --help
```

When a multipart S3 upload fails, the uploaded parts are not automatically removed by S3. They won't even show up in any file hierarchy on S3, but you'll be charged for them. To clean those up, I wrote this little tool, which you can schedule to run at an appropriate time so it doesn't interfere with ongoing uploads:

```
s3multicleanup --s3.region us-east-1              \
               --s3.accesskey S3ACCESSKEY         \
               --s3.secretkey S3SECRETKEY         \
               --s3.bucket mydbs
```
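
For reference, such a cleanup boils down to listing the bucket's unfinished multipart uploads and aborting each one, which frees the stored parts. A rough sketch with aws-sdk-go (not s3multicleanup's actual source; pagination and error handling are trimmed):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	svc := s3.New(session.Must(session.NewSession(
		aws.NewConfig().WithRegion("us-east-1"))))
	bucket := aws.String("mydbs")

	// List unfinished multipart uploads in the bucket...
	out, err := svc.ListMultipartUploads(&s3.ListMultipartUploadsInput{
		Bucket: bucket,
	})
	if err != nil {
		log.Fatal(err)
	}
	// ...and abort each one, which deletes its lingering parts.
	for _, u := range out.Uploads {
		if _, err := svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
			Bucket:   bucket,
			Key:      u.Key,
			UploadId: u.UploadId,
		}); err != nil {
			log.Fatal(err)
		}
	}
}
```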

## Considerations

S5 will pass whatever query you provide straight to the SQL database. Operations such as JOIN, GROUP BY or DISTINCT, or even DROP TABLE, will be passed directly to your database!

Besides, I suggest avoiding any operation beyond sequential sweeps. There are myriad tools out there to carry out the T part of your _ETL_s.

Depending on the storage engine, your query may not result in a consistent snapshot of your data.

## Development

S5 is written in Go and can be built and tested with Go's standard toolchain:

```
go test ./...
go build ./...
```

## Future development

Pull requests are welcome.

### TODO

  • More comments for packages and some functions.
  • Add tests for the S3 package.
  • Packaging for Debian.
  • Pluggable output formats, for cases where CSV/TSV is not sufficient.
  • Configurable compression.

## Credits

  • This tool was originally made for Brainly.
