Code Monkey home page Code Monkey logo

stream-processing's Introduction

Learn How To Develop And Test Stateful Streaming Data Pipelines

CI codecov

Shared modules:

  • stream-processing-shared - shared utilities for developing stateful streaming data pipelines
  • stream-processing-infrastructure - infrastructure layer with IOs for BigQuery, Pubsub and Cloud Storage
  • stream-processing-test - shared utilities for testing stateful streaming data pipelines

Use cases:

  • toll-application, toll-domain, toll-infrastructure - sample application for toll data processing, see blog post
  • word-count - fixed window example, see blog post
  • session-window - session window example, see blog post

stream-processing's People

Contributors

dependabot[bot] avatar github-actions[bot] avatar mkuthan avatar mkuthan-scala-steward[bot] avatar scala-steward avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

stream-processing's Issues

IO for GCS

  • Define IO for writing bounded and unbounded collections to GCS
  • Reuse IO in Dead Letter Queue (in a similar way the Diagnostic reuse BigQuery IO)

Unify diagnostic output for IOs

Unify diagnostic output for all IOs. Typical use-case: handling corrupted Pubsub messages if the message couldn't be decoded into raw case class. Currently handled explicitly in toll-application:

    SCollection
      .unionAll(
        Seq(
          boothEntriesRawDlq.map(x => IoDiagnostic(x.id, x.error)),
          boothExitsRawDlq.map(x => IoDiagnostic(x.id, x.error)),
          vehicleRegistrationsRawUpdatesDlq.map(x => IoDiagnostic(x.id, x.error))
        )
      )
      .keyBy(_.key)
      .writeDiagnosticToBigQuery(IoDiagnosticTableIoId, config.ioDiagnosticTable)

Re-think diagnostic output

Re-think diagnostic output:

  • how to corelate with specific domain case in type safe manner?
  • what kind of statistics it should provide: counter, histograms, etc?
  • how to model BigQuery table schema for convinient use?

Override IO transform for JobTest

  1. Scio CustomOutput doesn't work because it returns ClosedTap and we need dead letters.
  2. TransformOverride is a hack but how to easily test output?
  3. In general all methods in the infrastructure module should be implemented as single transformation, easy to stub in the job tests.

Enhance vehicle registrations use-case

  • add historical vehicle registrations (BigQuery)
  • add event time to realtime vehicle registration updates (Pubsub) - does it make any sense if vehicle registration is defined as side input and joined using processing time?
  • add invalid realtime vehicle registration event
  • add corrupted realtime vehicle registration event
  • consider side input in finite window (e.g. 1 day) instead of global window

Calculate expired vehicle registrations

Report vehicles with expired registration using side input as registration lookup table.

SELECT EntryStream.EntryTime, EntryStream.LicensePlate, EntryStream.TollId, Registration.RegistrationId
FROM EntryStream TIMESTAMP BY EntryTime
JOIN Registration
ON EntryStream.LicensePlate = Registration.LicensePlate
WHERE Registration.Expired = '1'

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.