Scio

Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow inspired by Apache Spark and Scalding.

Scio 0.3.0 and later versions depend on Apache Beam (org.apache.beam), while earlier versions depend on the Google Cloud Dataflow SDK (com.google.cloud.dataflow). See this page for a list of breaking changes.

Features

  • Scala API close to that of Spark and Scalding core APIs
  • Unified batch and streaming programming model
  • Fully managed service*
  • Integration with Google Cloud products: Cloud Storage, BigQuery, Pub/Sub, Datastore, Bigtable
  • JDBC, TensorFlow TFRecords, Cassandra, Elasticsearch and Parquet I/O
  • Interactive mode with Scio REPL
  • Type-safe BigQuery
  • Integration with Algebird and Breeze
  • Pipeline orchestration with Scala Futures
  • Distributed cache

* provided by Google Cloud Dataflow

Quick Start

Download and install the Java Development Kit (JDK) version 8.

Install sbt.

Use our giter8 template to quickly create a new Scio job repository:

sbt new spotify/scio.g8

Switch to the new repo (named scio-job by default) and build it:

cd scio-job
sbt stage

Run the included word count example:

target/universal/stage/bin/scio-job --output=wc

List result files and inspect content:

ls -l wc
cat wc/part-00000-of-00004.txt
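For orientation, a word count job in Scio can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code generated by the template. It assumes a recent Scio release in which sc.run() is available; the object name WordCountSketch, the helpers tokenize and format, and the default input path kinglear.txt are hypothetical names chosen for illustration.

```scala
import com.spotify.scio._

// Hypothetical object name; the giter8 template may generate something different.
object WordCountSketch {
  // Split a line into words, dropping the empty tokens produced by leading delimiters.
  def tokenize(line: String): Seq[String] =
    line.split("[^a-zA-Z']+").filter(_.nonEmpty).toSeq

  // Render one (word, count) pair as an output line.
  def format(word: String, count: Long): String = s"$word: $count"

  def main(cmdlineArgs: Array[String]): Unit = {
    // Parse command line arguments into a ScioContext and the remaining job arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Assumption: "kinglear.txt" is a placeholder default, not a path from the template.
    val input = args.getOrElse("input", "kinglear.txt")
    val output = args("output") // required, e.g. --output=wc

    sc.textFile(input)                         // one element per line of text
      .flatMap(tokenize)                       // lines to words
      .countByValue                            // (word, count) pairs
      .map { case (w, c) => format(w, c) }     // pairs to output lines
      .saveAsTextFile(output)                  // sharded text files under the output path

    // Assumption: older Scio releases use a different call to run the pipeline.
    sc.run().waitUntilFinish()
  }
}
```

The chain of flatMap, countByValue and map mirrors what the same job would look like in Spark or Scalding, which is the API similarity listed under Features.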

Documentation

Getting Started is the best place to start with Scio. If you are new to Apache Beam and distributed data processing, check out the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts. If you have experience with other Scala data processing libraries, check out this comparison between Scio, Scalding and Spark. Finally, check out this document about the relationship between Scio, Beam and Dataflow.

Example Scio pipelines and tests can be found under scio-examples. Many of them are direct ports of Beam's Java examples. See this page for some of them with side-by-side explanations. Also see Big Data Rosetta Code for common data processing code snippets in Scio, Scalding and Spark.

Artifacts

Scio includes the following artifacts:

  • scio-core: core library
  • scio-test: test utilities, add to your project as a "test" dependency
  • scio-avro: add-on for Avro, can also be used standalone
  • scio-google-cloud-platform: add-on for Google Cloud IOs: BigQuery, Bigtable, Pub/Sub, Datastore, Spanner
  • scio-cassandra*: add-ons for Cassandra
  • scio-elasticsearch*: add-ons for Elasticsearch
  • scio-extra: extra utilities for working with collections, Breeze, etc., best effort support
  • scio-jdbc: add-on for JDBC IO
  • scio-neo4j: add-on for Neo4J IO
  • scio-parquet: add-on for Parquet
  • scio-tensorflow: add-on for TensorFlow TFRecords IO and prediction
  • scio-redis: add-on for Redis
  • scio-smb: add-on for Sort Merge Bucket operations
  • scio-repl: extension of the Scala REPL with Scio specific operations
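
As a rough illustration, the artifacts above might be declared in an sbt build like this. It is a sketch under assumptions: the version string is a placeholder to be replaced with an actual release from Maven Central, and which add-ons you list depends on your job.

```scala
// build.sbt (sketch)

// Placeholder: replace "x.y.z" with the Scio release you intend to use.
val scioVersion = "x.y.z"

libraryDependencies ++= Seq(
  // Core library
  "com.spotify" %% "scio-core" % scioVersion,
  // Optional add-on, shown here as an example; include only what the job needs
  "com.spotify" %% "scio-avro" % scioVersion,
  // Test utilities, scoped to the test configuration
  "com.spotify" %% "scio-test" % scioVersion % Test
)
```

Keeping all Scio modules on the same version avoids mixing incompatible releases of the add-ons and the core library.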

License

Copyright 2021 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
