
Snowplow Elasticsearch Loader


Introduction

The Snowplow Elasticsearch Loader consumes Snowplow enriched events or failed events from an Amazon Kinesis stream or NSQ topic, transforms them to JSON, and writes them to Elasticsearch. Events which cannot be transformed or which are rejected by Elasticsearch are written to a separate Kinesis stream.

Building

Assuming you already have SBT installed:

$ git clone https://github.com/snowplow/snowplow-elasticsearch-loader.git
$ sbt compile

Usage

The Snowplow Elasticsearch Loader has the following command-line interface:

snowplow-elasticsearch-loader 2.1.2

Usage: snowplow-elasticsearch-loader [options]

  --config <filename>

Running

Create your own config file:

$ cp config/config.kinesis.reference.hocon my.conf

Update the configuration to fit your needs.

Next, start the loader, making sure to specify your new config file:

$ java -jar snowplow-elasticsearch-loader-2.1.2.jar --config my.conf

Find out more

Technical Docs Setup Guide Roadmap Contributing

Copyright and license

Copyright 2014-2023 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributors

aldemirenes, alexanderdean, benfradet, benjben, chuwy, colmsnowplow, fblundun, fridiculous, istreeter, jbeemster, spenes, szareiangm, zayec77, zcei


snowplow-elasticsearch-loader's Issues

Store Kinesis checkpoints in Elasticsearch

from snowplow/snowplow#2456:

The idea here is:
When writing data to ES, we also store the Kinesis shard checkpoints alongside the data
These checkpoints will be backed up alongside the event data each night
In the case we need to do a restore, we will copy the checkpoints from ES back to DynamoDB before restarting the ES Sink
Doing this should mean we can recover our ES and restart drip feeding without data loss/duplication.
Open questions: how transactional is the ES backup - is there a risk of drift between data loaded and checkpoints stored during the S3 backup?

Note: this idea is borrowed from the Kafka guys, who suggest co-locating checkpoints alongside data in a storage target
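The co-location idea above could be sketched roughly as follows. Everything here is invented for illustration: the Checkpoint case class, the JSON field names, and the notion of a dedicated checkpoints index are hypothetical, not part of the loader.

```scala
// Hypothetical sketch: render a Kinesis shard checkpoint as a JSON
// document that would be upserted into a dedicated checkpoints index
// in the same bulk request as the event data, so nightly backups
// capture both together.
final case class Checkpoint(shardId: String, sequenceNumber: String)

object CheckpointDocs {
  def toJson(c: Checkpoint): String =
    s"""{"shard_id":"${c.shardId}","sequence_number":"${c.sequenceNumber}"}"""
}
```

On restore, documents of this shape would be read back out of ES and written into the DynamoDB lease table before the sink restarts.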

Investigate case where sink stops getting events from the stream

from snowplow/snowplow#1840:

Hey @fblundun @alexanderdean have noticed a weird case whereby the sink stops fetching any events from the input stream. No errors are thrown but some of the Kinesis Connectors Magic seems to be able to break without warning.
It feels like somehow the app stops being in sync with its shard and then it just stops pulling events into the buffer. A restart of the application immediately resolved the issue.

will investigate

Allow empty custom contexts

from snowplow/snowplow#2927:

We published javascript-tracker-core 0.4.0 two years ago. Until now it wasn't used in the mainstream tracker because it included one significant change: it allowed empty custom contexts, which would break compatibility with Scala Hadoop Shred < 0.9.14.
But it turns out Kinesis Elasticsearch Sink bears this chunk of code and prohibits empty contexts (and so does the Scala Analytics SDK).
This all means we should either:
Allow empty contexts here and in the Analytics SDK and wait two more years until these versions are obsolete.
Abandon the com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1 schema and use the old 1-0-0, which makes more sense to me.

Add better detection of failed storage target

from snowplow/snowplow#2918:

Currently the Kinesis Elasticsearch Sink is far too quick to decide that the data is at fault and will then simply pass it off to the bad path. We need two tiers of detection:
On entering the emitter, ascertain whether the storage target is available for bulk event indexing.
On any failure to bulk index, ascertain exactly what caused that failure.
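A rough sketch of those two tiers, under stated assumptions: the Cluster trait, the outcome types, and the error-string classification are all hypothetical stand-ins for the real Elasticsearch client and its failure modes.

```scala
// Hypothetical stand-in for a reachability check against the cluster.
trait Cluster { def isReachable: Boolean }

// Invented outcome types: distinguish "target is down" (retry, keep the
// data) from "record is genuinely invalid" (send to the bad path).
sealed trait IndexOutcome
case object TargetUnavailable extends IndexOutcome
case object BadRecord extends IndexOutcome
case object Indexed extends IndexOutcome

object Emitter {
  def emit(cluster: Cluster, bulkIndex: () => Either[String, Unit]): IndexOutcome =
    if (!cluster.isReachable) TargetUnavailable // tier 1: check before emitting
    else bulkIndex() match {                    // tier 2: classify the failure
      case Right(_)                                     => Indexed
      case Left(err) if err.contains("mapper_parsing")  => BadRecord // assumed error marker
      case Left(_)                                      => TargetUnavailable
    }
}
```

The key design point is that only the BadRecord outcome routes data to the bad path; everything else is treated as a transient target problem.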

Automatic rotation of index names

from snowplow/snowplow#1886:

Please add an option to the kinesis-elasticsearch-sink so every day/week/month/year can start with a new index, like a logrotate.
Config example:
location {
  index: "snowplow"
  type: "enriched"
  rotate: "day" (options: none/day/week/month/year)
}
The index name can be like this:
"snowplow_20150708" in case of rotate=day
"snowplow_2015_w13" in case of rotate=week
"snowplow_2015_m7" in case of rotate=month
"snowplow_2015" in case of rotate=year
"snowplow" in case of rotate=none (or not configured)
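The requested naming scheme could be sketched like this; rotatedIndex is a hypothetical helper, not part of the sink, and the week-numbering choice (ISO week-based year) is an assumption.

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields

// Sketch of the proposed rotation scheme: derive the index name from a
// base name, a rotation option, and the current date.
object IndexRotation {
  def rotatedIndex(base: String, rotate: String, date: LocalDate): String =
    rotate match {
      case "day"   => f"${base}_${date.getYear}%04d${date.getMonthValue}%02d${date.getDayOfMonth}%02d"
      case "week"  => s"${base}_${date.getYear}_w${date.get(WeekFields.ISO.weekOfWeekBasedYear)}"
      case "month" => s"${base}_${date.getYear}_m${date.getMonthValue}"
      case "year"  => s"${base}_${date.getYear}"
      case _       => base // "none" or unconfigured: no rotation
    }
}
```

For example, rotatedIndex("snowplow", "day", LocalDate.of(2015, 7, 8)) yields "snowplow_20150708", matching the scheme above.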

Migrate CHANGELOG from snowplow/snowplow

  • Use KES versions for the prior release titles, not Snowplow Rxx releases
  • Remove the unnecessary "Kinesis Elasticsearch Sink: " prefixes
  • Prefix the ticket numbers with snowplow/snowplow

Thus:

...

0.8.0 (2016-10-07)
------------------
Bump to 0.8.0 (snowplow/snowplow#2885)
Bump Scala Tracker to 0.3.0 (snowplow/snowplow#2899)
Allow parametrized timeouts for jest client (snowplow/snowplow#2897)
Does not take into account buffer configurations (snowplow/snowplow#2895)
Error messages are not helpful (snowplow/snowplow#2896)
Ensure field names do not contain any dots (snowplow/snowplow#2894)
Add support for Elasticsearch 2.x (snowplow/snowplow#2525)
Call Config.resolve() to resolve environment variables in hocon (snowplow/snowplow#2880)

...

Add ability to filter fields

from snowplow/snowplow#3195:

At the moment the index is comprised of many many fields that are not used as a side effect of mapping the atomic definition to Elasticsearch. To reduce the size of the index it would be nice to be able to control what fields we care about storing in the index.
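A minimal sketch of such an allowlist filter over a flattened event map; the filter helper and whatever config key would drive the keep set are both hypothetical.

```scala
// Hypothetical sketch: keep only the fields named in an allowlist
// before the event map is serialized and indexed.
object FieldFilter {
  def filter(event: Map[String, String], keep: Set[String]): Map[String, String] =
    event.view.filterKeys(keep).toMap
}
```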

Add support for writing raw JSONs

This would enable the Snowplow ES Loader to go beyond pure-Snowplow use cases (similar to how Kinesis S3 already does).

This has a few advantages:

  • Increases the utility of the Snowplow Elasticsearch Loader
  • Increases the number of possible contributors to this project
  • Allows us (Snowplow) to use this for getting customers' non-Snowplow JSON data into Elasticsearch, alongside their Snowplow events

Uncaught exception when first record in a buffer exceeds the buffer byte size limit

from snowplow/snowplow#3019:

The buffer splitting routine doesn't correctly handle the case where the first record exceeds the buffer byte size limit, and throws an IndexOutOfBoundsException:
java.lang.IndexOutOfBoundsException: 1
at com.snowplowanalytics.snowplow.storage.kinesis.elasticsearch.SnowplowElasticsearchEmitter.splitBufferRec$1(SnowplowElasticsearchEmitter.scala:187)
at com.snowplowanalytics.snowplow.storage.kinesis.elasticsearch.SnowplowElasticsearchEmitter.splitBuffer(SnowplowElasticsearchEmitter.scala:213)
at com.snowplowanalytics.snowplow.storage.kinesis.elasticsearch.SnowplowElasticsearchEmitter.sendToElasticsearch(SnowplowElasticsearchEmitter.scala:145)
at com.snowplowanalytics.snowplow.storage.kinesis.elasticsearch.SnowplowElasticsearchEmitter.emit(SnowplowElasticsearchEmitter.scala:128)
at com.snowplowanalytics.snowplow.storage.kinesis.elasticsearch.SnowplowElasticsearchEmitterSpec$$anonfun$1$$anonfun$apply$24.apply(SnowplowElasticsearchEmitterSpec.scala:118)
at com.snowplowanalytics.snowplow.storage.kinesis.elasticsearch.SnowplowElasticsearchEmitterSpec$$anonfun$1$$anonfun$apply$24.apply(SnowplowElasticsearchEmitterSpec.scala:101)
(This is very difficult to see because the Amazon KCL suppresses the exception at runtime!)

review fix in snowplow/snowplow#3020
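A hedged sketch of a split routine that diverts an oversized record instead of throwing; the names, types, and the decision to partition oversized records up front are illustrative, not the emitter's actual implementation.

```scala
// Sketch: split records into byte-bounded chunks for bulk indexing.
// An oversized record (even as the first element) is diverted rather
// than triggering an IndexOutOfBoundsException.
object BufferSplitter {
  def split(records: List[Array[Byte]], maxBytes: Int): (List[Array[Byte]], List[List[Array[Byte]]]) = {
    val (oversized, fitting) = records.partition(_.length > maxBytes)
    val chunks = fitting.foldLeft((List.empty[List[Array[Byte]]], 0)) {
      case ((Nil, _), r) =>
        (List(List(r)), r.length)                              // start first chunk
      case ((h :: t, size), r) if size + r.length <= maxBytes =>
        ((r :: h) :: t, size + r.length)                       // record fits current chunk
      case ((acc, _), r) =>
        (List(r) :: acc, r.length)                             // start a new chunk
    }._1.map(_.reverse).reverse
    (oversized, chunks)
  }
}
```

Oversized records would then be routed to the bad stream with an explanatory message instead of being bulk-indexed.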

Invalid app-name causes strange errors

from snowplow/snowplow#3174:

While attempting to set up the Elastic sink, I was getting errors when it attempted to create the DynamoDB table.
[main] DEBUG com.amazonaws.request - Received error response: com.amazonaws.AmazonServiceException: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
The Canonical String for this request should have been
'POST
/
[SNIP]
After much hand-wringing and hair-pulling, I noticed my configuration specified the app-name with a leading space.
The app-name should be validated before being passed to AWS.
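A sketch of validating the app-name up front. The character rule shown (modeled on DynamoDB table-name constraints, since the KCL uses the app-name as the lease table name) is an assumption, and the validator is not part of the loader.

```scala
// Hypothetical up-front validation of appName before it reaches the KCL.
// Assumed rule: 3-255 chars from [a-zA-Z0-9_.-], so a leading space fails
// loudly here instead of producing an opaque AWS signature error.
object AppNameValidator {
  private val Valid = "^[a-zA-Z0-9_.-]{3,255}$".r

  def validate(appName: String): Either[String, String] =
    appName match {
      case Valid() => Right(appName)
      case _       => Left(s"Invalid app-name '$appName': must match [a-zA-Z0-9_.-]{3,255} (no spaces)")
    }
}
```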
