
enrich's Introduction


Snowplow Enrich

Snowplow Enrich is a set of applications and libraries for processing raw Snowplow events into validated and enriched Snowplow events. It consists of the following modules:

  • Snowplow Common Enrich - a core library containing all validation and transformation logic. Published on Maven Central
  • Snowplow Stream Enrich - a set of applications working with Kinesis, Kafka and NSQ. Each asset is published as a Docker image on DockerHub
  • Snowplow Enrich PubSub - an application for a GCP pipeline that does not require a distributed computing framework. Published as a Docker image on DockerHub

Snowplow Enrich provides record-level enrichment only: feeding in 1 raw Snowplow event will yield exactly 1 record out, where a record may be an enriched Snowplow event or a reported bad record.

Find out more

Technical Docs | Setup Guide | Roadmap | Contributing (coming soon)

Copyright and license

Copyright (c) 2023-present Snowplow Analytics Ltd. All rights reserved.

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)

enrich's People

Contributors

aalekh, aldemirenes, alexanderdean, benfradet, benjben, bogaert, chuwy, dilyand, fblundun, istreeter, jbeemster, kazjote, khalidjaz, knservis, lmath, lukeindykiewicz, matus-tomlein, miike, misterpig, ninjabear, oguzhanunlu, peel, pondzix, ronnyml, rzats, spenes, stanch, szareiangm, voropaevp, yalisassoon


enrich's Issues

Common: Weather Enrichment failing with 'OpenWeatherMap ErrorResponse New city' and caching transient response

I've been able to reproduce this by making an OWM request as the weather enrichment would, between [approximately] midnight and 1AM UTC -- I placed the start timestamp less than one hour in the past. The 'New city' response is given, and no weather data is supplied. Since the start and end of day are used as parameters by the weather enrichment, this would leave requests during the first hour failing. Additionally, since the start and end parameters are used in the cache key, the error response is cached for all coordinates within the failing time period and [potentially] kept for the entire day (in the case of stream enrichment).
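
A minimal sketch of one possible fix, outside the actual Enrich code: cache only successful responses, so a transient 'New city' error is retried on the next lookup instead of being pinned for the day. The types and the fetch function are illustrative placeholders.

import scala.collection.concurrent.TrieMap

// Hedged sketch, not the real Enrich cache: store a response only on success,
// so a transient "New city" error is retried rather than cached all day.
object WeatherCacheSketch {
  type CacheKey = (BigDecimal, BigDecimal, Long, Long) // lat, lon, start-of-day, end-of-day
  private val cache = TrieMap.empty[CacheKey, String]  // String stands in for a weather payload

  def lookup(key: CacheKey)(fetch: CacheKey => Either[String, String]): Either[String, String] =
    cache.get(key) match {
      case Some(weather) => Right(weather)
      case None =>
        val result = fetch(key)
        result.foreach(weather => cache.put(key, weather)) // cache only successful responses
        result
    }
}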

Common: add support for the adtiming GA event type

The payload looks like:

v=1
_v=j76
a=596423924
t=adtiming
_s=2
dl=link
dr=link
ul=es-es
de=UTF-8
dt=text
sd=24-bit
sr=1280x720
vp=1263x609
je=0
plt=7177
pdt=89
dns=60
rrt=266
srt=403
tcp=285
dit=4287
clt=4288
_gst=2747
_gbt=3274
_cst=1985
_cbt=2642
_utma=
_utmz=
_utmht=
_u=
jid=
gjid=
cid=
uid=
tid=
_gid=
gtm=
cd1=t1
cd2=t2
z=number

Common: improve derived_tstamp/implement better approximation of actual time of event

Currently, the derived timestamp logic is as follows:

derived_tstamp = collector_tstamp - (dvce_sent_tstamp - dvce_created_tstamp)

As I understand it, the logic is that the collector timestamp is the first time we confidently establish a measurement of actual time, so we take that as the basis, and subtract lag between creating and sending the event.

This leads to unusual values in the case of events which are sent late due to connectivity issues.

A suggested improvement on this logic is as follows:

  • Assume a reasonable time between collector_tstamp and dvce_sent_tstamp (I think this would be within a second or a few seconds)
  • Calculate the difference between collector_tstamp and dvce_sent_tstamp; if it's above that number, this tells you how far off actual time of day the device's clock is.
  • Offset the dvce_created_tstamp by that delta to produce a more consistently accurate measurement of actual time of event.

I believe this is better because:

  • No matter what, the dvce_created_tstamp always tells you when the events happened in relation to each other.
  • Collector and dvce_sent timestamps should be very close to each other. The difference between these values is therefore presumed to be an accurate measure of the device clock's offset from actual time UTC.
  • Therefore the dvce_created_tstamp adjusted by this offset returns a good approximation of the actual time of event.

There's still scope for inaccuracies in the rare case that the device clock changes between events, but this approach avoids strange values resulting from lag between sending consecutive events.
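
A minimal sketch of the proposed calculation, assuming a configurable 'reasonable' collector/sent gap; this is not the current Enrich implementation and the names are illustrative.

import java.time.{Duration, Instant}

// Hedged sketch of the proposed derived_tstamp logic (not current behaviour).
def improvedDerivedTstamp(
  collectorTstamp: Instant,
  dvceSentTstamp: Instant,
  dvceCreatedTstamp: Instant,
  maxExpectedTransitMillis: Long = 1000L // assumed "reasonable" collector/sent gap
): Instant = {
  // Estimate how far the device clock is off from actual (collector) time
  val clockSkew = Duration.between(dvceSentTstamp, collectorTstamp)
  if (math.abs(clockSkew.toMillis) > maxExpectedTransitMillis)
    dvceCreatedTstamp.plus(clockSkew) // shift creation time by the estimated skew
  else
    collectorTstamp.minus(Duration.between(dvceCreatedTstamp, dvceSentTstamp)) // current formula
}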

Common: deprecate and remove CloudfrontLoader

AWS is changing its CloudFront log format.

We are contacting you because your AWS account contains at least one CloudFront web distribution with logging enabled. Starting December 12, 2019, Amazon CloudFront is adding the following seven new fields to access logs:

• c-port – The port number of the request from the viewer.
• time-to-first-byte – The number of seconds between receiving the request and writing the first byte of the response, as measured on the server.
• x-edge-detailed-result-type – When the result type is an error, this field contains the specific type of error.
• sc-content-type – The value of the HTTP Content-Type header of the response.
• sc-content-len – The value of the HTTP Content-Length header of the response.
• sc-range-start – When the response contains the HTTP Content-Range header, this field contains the range start value.
• sc-range-end – When the response contains the HTTP Content-Range header, this field contains the range end value.

Scala Common Enrich: mock history.openweathermap.org in the test to avoid transient errors

The test com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.WeatherEnrichmentSpec

depends on an external service (history.openweathermap.org). This can fail for various reasons. We should ensure that the test doesn't fail if the service is unavailable (or the query quota is exceeded). The way to do that is more than likely to mock that service.
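
One hedged way to do this without any extra test dependencies is to stand up a local stub server in the spec and point the enrichment at it; the path and canned JSON below are illustrative.

import java.net.InetSocketAddress
import com.sun.net.httpserver.HttpServer

// Hedged sketch: a local stub for history.openweathermap.org used only in tests.
object FakeOwmServer {
  def start(cannedJson: String): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(0), 0) // bind a random free port
    server.createContext("/data/2.5/history/city", exchange => {
      val bytes = cannedJson.getBytes("UTF-8")
      exchange.sendResponseHeaders(200, bytes.length.toLong)
      exchange.getResponseBody.write(bytes)
      exchange.close()
    })
    server.start()
    server
  }
}

// Illustrative usage in the spec:
//   val server = FakeOwmServer.start("""{"list": []}""")
//   val apiHost = s"localhost:${server.getAddress.getPort}"
//   ... run the weather enrichment against apiHost ...
//   server.stop(0)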

Common: iglu Webhook Adapter: Accept array POST requests

Currently, sending a JSON array to the webhook adapter causes the data to fail validation when sent like so:

curl http://acme.com/com.snowplowanalytics.iglu/v1 -H "Content-Type: application/json" -d '[{
    "schema":"iglu:com.acme/event/jsonschema/1-0-0",
    "data": {
      "key":"value"
    }
  }, {
    "schema":"iglu:com.acme/event/jsonschema/1-0-0",
    "data": {
      "key":"value"
    }
  }, {
    "schema":"iglu:com.acme/event/jsonschema/1-0-0",
    "data": {
      "key":"value"
    }
  }]'

Validation error message:

"message": "Iglu event failed: detected SelfDescribingJson but schema key is missing"

Sending a malformed array with no square brackets, however, is successful:

curl http://acme.com/com.snowplowanalytics.iglu/v1 -H "Content-Type: application/json" -d '{
    "schema":"iglu:com.acme/event/jsonschema/1-0-0",
    "data": {
      "key":"value"
    }
  }, {
    "schema":"iglu:com.acme/event/jsonschema/1-0-0",
    "data": {
      "key":"value"
    }
  }, {
    "schema":"iglu:com.acme/event/jsonschema/1-0-0",
    "data": {
      "key":"value"
    }
  }'

Iglu webhook adapter should be able to handle a legitimate JSON array of data.
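
A minimal sketch of how the adapter could accept both shapes, using circe (which may not be the adapter's actual JSON library); each returned element would then go through the existing self-describing JSON validation.

import io.circe.Json
import io.circe.parser.parse

// Hedged sketch: treat the body as either one self-describing JSON or an array
// of them, and return the individual elements for per-event validation.
def splitIgluPayload(body: String): Either[String, List[Json]] =
  parse(body)
    .left.map(failure => s"Iglu event failed: ${failure.message}")
    .map(json => json.asArray.map(_.toList).getOrElse(List(json)))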

Common: add preference for successful skip to IAB Bots and Spiders Enrichment

The IAB enrichment only accepts IPv4 as an input, which according to the comment in this issue appears to be down to the underlying Java library.

There are instances where IPv6 addresses are populated in the user_ipaddress field - commonly where a DNS service like Cloudflare is used, but I'm sure there are also use cases where IPv6 is preferred for other reasons.

In this scenario, we don't have control over the IP format, so enabling the IAB enrichment causes all such data to go to bad rows - the only solution appears to be to disable Cloudflare (which may be depended upon for other reasons), or disable the IAB enrichment.

In this issue here, a good solution is proposed for the API enrichment - if it were possible to configure the enrichment to skip such rows, we wouldn't risk failing data.

I'm not sure of the viability of this but I think it would be a great feature (for all enrichments potentially).
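
A minimal sketch of what the 'successful skip' could look like, with runIabLookup standing in for the real IAB client call: only parseable IPv4 addresses are checked, everything else is skipped rather than failed.

import java.net.{Inet4Address, InetAddress}
import scala.util.Try

// Hedged sketch of "skip instead of fail": run the IAB lookup only for IPv4,
// and return None (no context, no bad row) for anything else.
def iabCheckOrSkip[A](userIpAddress: String)(runIabLookup: Inet4Address => A): Option[A] =
  Try(InetAddress.getByName(userIpAddress)).toOption.collect {
    case v4: Inet4Address => runIabLookup(v4)
  }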

Common: add timeout to API Request Enrichment

Project: Scala Common Enrich

Version: 0.36.0

Expected behavior: Support timeout

Actual behavior: There is no code that would handle timeout

Steps to reproduce: I did a code search and it turned out that the timeout was not passed to scalaj.

I was reading the code for the API Request Enrichment to understand how it handles the timeout, in order to reuse the same code for the HTTP Remote Adapter. It turned out the settings are passed to com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.apirequest.HttpApi, but they are likely not passed on to the scalaj library in the perform function.

This is a quick change, so I will add it after my current PR. Before that, I have one question. What was the intention for this timeout setting? Was it connectionTimeout or readTimeout?

I could not find it in the documentation, but I would prefer to have both because they handle different cases.

CC: @Nafid @dmatth
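
For reference, scalaj-http does expose both timeouts on the request builder, so a hedged sketch of applying the configured value(s) in perform could look like this; the URL and the 1000/5000 ms values are illustrative only.

import scalaj.http.Http

// Hedged sketch: pass connect and read timeouts to scalaj-http when the
// API request is performed.
val response = Http("https://api.acme.com/enrichment-lookup")
  .timeout(connTimeoutMs = 1000, readTimeoutMs = 5000)
  .asString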

Common: enable YAUAA enrichment to reference the engine and ruleset by http/s3 URI

Thanks for integrating YAUAA as a new enrichment. However, there is already a newer version of the engine plus ruleset released. Could you enable the enrichment config to reference the engine JAR and the ruleset file by an HTTP or S3 URI, like the IP Lookups or the IAB enrichment? Both files would need to be automatically downloaded and added to the classpath. We made it happen locally in a pretty hacky way, but ideally it would be properly supported by Snowplow.

Thank you!
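
A minimal sketch of the requested behaviour for the HTTP(S) case (an S3 URI would need the AWS SDK instead); the paths are illustrative and this is not existing Enrich functionality.

import java.net.URI
import java.nio.file.{Files, Paths, StandardCopyOption}

// Hedged sketch: download the YAUAA ruleset referenced by an HTTP(S) URI to a
// local file before initialising the engine.
def fetchRuleset(uri: String, localPath: String): Unit = {
  val in = URI.create(uri).toURL.openStream()
  try Files.copy(in, Paths.get(localPath), StandardCopyOption.REPLACE_EXISTING)
  finally in.close()
}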

Common: get rid of dependencies not published on Maven Central

Project:
Stream Enrich

Version:
com.snowplowanalytics:snowplow-common-enrich_2.11:0.36.0

Expected behavior:
While importing the lib com.snowplowanalytics:snowplow-common-enrich_2.11:0.36.0, it should automatically resolve the additional deps. These should be available in a Maven repo (Bintray/Maven Central).

Actual behavior:
Can't resolve
com.snowplowanalytics:collector-payload-1:0.0.0
com.snowplowanalytics:referer-parser_2.11:0.3.0
com.snowplowanalytics:schema-sniffer-1:0.0.0
com.snowplowanalytics:snowplow-thrift-raw-event:0.1.0

Steps to reproduce:
Create an empty Gradle project and add dependency as com.snowplowanalytics:snowplow-common-enrich_2.11:0.36.0

Would it be possible to set up CI publishing of those additional Snowplow deps to Bintray? This would be really helpful with the new architecture.

Though rather than splitting all the base projects across different repos, I would suggest aggregating all the base (*common) projects in one repo and publishing reusable artifacts from there.

Then the Snowplow code would only contain the enricher code. I understand that this would make development lengthier, but it would have abstraction advantages.

Common: support IPv6 addresses in IAB enrichment

We're seeing the following bad row produced by the IAB enrichment.

"errors":[{"level":"error","message":"Unexpected error processing events: java.lang.IllegalArgumentException: Could not parse [Redacted IPv6]\n\tat org.apache.commons.net.util.SubnetUtils.toInteger(SubnetUtils.java:287)\n\tat org.apache.commons.net.util.SubnetUtils.access$400(SubnetUtils.java:27)\n\tat org.apache.commons.net.util.SubnetUtils$SubnetInfo.isInRange(SubnetUtils.java:125)\n\tat com.snowplowanalytics.iab.spidersandrobotsclient.lib.internal.IpRanges.belong(IpRanges.java:78)\n\tat com.snowplowanalytics.iab.spidersandrobotsclient.IabClient.checkAt(IabClient.java:80)\n\tat com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.IabEnrichment.performCheck(IabEnrichment.scala:198)\n\tat com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.IabEnrichment.getIab(IabEnrichment.scala:233)\n\tat com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.IabEnrichment.getIabContext(IabEnrichment.scala:218)\n\tat com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$.enrichEvent(EnrichmentManager.scala:286)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3.apply(EtlPipeline.scala:92)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3.apply(EtlPipeline.scala:91)\n\tat scalaz.NonEmptyList$class.map(NonEmptyList.scala:23)\n\tat scalaz.NonEmptyListFunctions$$anon$4.map(NonEmptyList.scala:207)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2.apply(EtlPipeline.scala:91)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2.apply(EtlPipeline.scala:88)\n\tat scalaz.Validation$class.map(Validation.scala:112)\n\tat scalaz.Success.map(Validation.scala:345)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1$$anonfun$apply$1.apply(EtlPipeline.scala:88)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1$$anonfun$apply$1.apply(EtlPipeline.scala:85)\n\tat scala.Option.map(Option.scala:146)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1.apply(EtlPipeline.scala:85)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$$anonfun$1.apply(EtlPipeline.scala:82)\n\tat scalaz.Validation$class.map(Validation.scala:112)\n\tat scalaz.Success.map(Validation.scala:345)\n\tat com.snowplowanalytics.snowplow.enrich.common.EtlPipeline$.processEvents(EtlPipeline.scala:82)\n\tat com.snowplowanalytics.snowplow.enrich.beam.Enrich$.com$snowplowanalytics$snowplow$enrich$beam$Enrich$$enrich(Enrich.scala:204)\n\tat com.snowplowanalytics.snowplow.enrich.beam.Enrich$$anonfun$enrichEvents$1$$anonfun$12.apply(Enrich.scala:139)\n\tat com.snowplowanalytics.snowplow.enrich.beam.Enrich$$anonfun$enrichEvents$1$$anonfun$12.apply(Enrich.scala:139)\n\tat com.snowplowanalytics.snowplow.enrich.beam.utils$.timeMs(utils.scala:123)\n\tat com.snowplowanalytics.snowplow.enrich.beam.Enrich$$anonfun$enrichEvents$1.apply(Enrich.scala:138)\n\tat com.snowplowanalytics.snowplow.enrich.beam.Enrich$$anonfun$enrichEvents$1.apply(Enrich.scala:135)\n\tat com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)\n\tat com.spotify.scio.util.Functions$$anon$7$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)\n\tat 
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)\n\tat com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)\n\tat com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)\n\tat com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)\n\tat com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)\n\tat com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)\n\tat com.spotify.scio.util.Functions$$anon$7$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)\n\tat com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)\n\tat com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)\n\tat com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)\n\tat com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)\n\tat org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)\n\tat org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:128)\n\tat org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)\n\tat com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)\n\tat com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)\n\tat com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)\n\tat com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:200)\n\tat com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)\n\tat com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)\n\tat com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)\n\tat com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)\n\tat com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat 
java.lang.Thread.run(Thread.java:745)\n"}],"failure_tstamp":"2019-01-21T13:00:24.222Z"}

Common: check PII event size

When stream enrich sends a "good" event to the enriched-good stream, it checks its size, and if it's beyond the preconfigured size limit it sends it to the bad stream.

With the addition of the PII stream, we have decided to only use the "smallEnough" events to generate PII, without checking the size of the PII event, assuming that they will always be smaller than the parent event (which seems reasonable), but that is not the only option.

Read here for more info.

Update platform field so that it can accept many more (any string) values

Currently I believe that the platform field is an enum that can only be one of the following values:

Platform                      p value
Web (including Mobile Web)    web
Mobile/Tablet                 mob
Desktop/Laptop/Netbook        pc
Server-Side App               srv
General App                   app
Connected TV                  tv
Games Console                 cnsl
Internet of Things            iot

This is too limited now that Snowplow is a general event data platform - e.g. for tracking emails, none of the above platforms fit the bill...

Can we update the validation so that any string value is accepted, giving users much more flexibility to set this field appropriately?
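
A minimal sketch contrasting the current enum-style check with the proposed 'any non-empty string' validation; the enum values come from the table above, while the function names and error messages are illustrative.

val knownPlatforms = Set("web", "mob", "pc", "srv", "app", "tv", "cnsl", "iot")

// Hedged sketch of the current-style check.
def validatePlatformCurrent(p: String): Either[String, String] =
  if (knownPlatforms.contains(p)) Right(p) else Left(s"[$p] is not a supported platform")

// Hedged sketch of the proposed relaxed check.
def validatePlatformProposed(p: String): Either[String, String] =
  if (p.nonEmpty) Right(p) else Left("platform must be a non-empty string")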

Common: support multiple instances of API request enrichment

The goal is to have multiple instances of one enrichment. This could be a first step toward a more complex enrichment process.

This issue focuses on the popular API request enrichment, and can be extended to the other ones once the design is streamlined with everybody's thoughts.

I have created this issue as per guidance from Ben in #2260

Common: deduplicate events during enrich step

I'm wondering if we ever tested the assumption that enrich never produces a bunch of duplicates from a single tracker's POST payload.

If enrich accepts a TrackerPayload that contains multiple events, then splits it into a List[Event], how confident are we that there are no duplicates in that list? In other words, how confident are we that trackers never send duplicates in the same payload?

Cross-batch deduplication in RDB Shredder (and Snowflake Transformer) allows events to pass the step if they have the same etl_tstamp, because it means that Shredder is probably just re-processing the event. And this deduplication is not 100% proof for Stream Enrich, because the transformers allow events with the same etl_tstamp to pass. I believe the reason is either that enrich can produce duplicates, or...

Another possible reason is that etl_tstamp is generated so frequently that we're getting collisions between different payloads, and those payloads also happen to contain duplicates - but I see that as very unlikely.
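
A minimal sketch of what an in-payload deduplication pass could look like; the Event shape is a placeholder, not the real enriched event class.

final case class Event(eventId: String, body: String)

// Hedged sketch: drop events that share an event_id within the same payload,
// preserving the original order.
def dedupeWithinPayload(events: List[Event]): List[Event] =
  events.foldLeft((Set.empty[String], List.empty[Event])) {
    case ((seen, acc), e) =>
      if (seen.contains(e.eventId)) (seen, acc)
      else (seen + e.eventId, e :: acc)
  }._2.reverse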

Common: encrypt original values in PII Enrichment

The motivation for this ticket is to help users of piinguin and piinguin relay to better secure access to the original data on piinguin without having to focus on securing access to piinguin within an organisation.

The way to achieve that is to have one (or more) public keys with which all the original values will be encrypted. The new configuration will look like this:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/3-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "name": "pii_enrichment_config",
    "emitEvent": true,
    "enabled": true,
    "parameters": {
      "pii": [
        {
          "pojo": {
            "field": "user_id",
            "encryptionKeyName": "other-key"
          }
        },
        {
          "pojo": {
            "field": "user_fingerprint"
            # No encryption
          }
        },
        {
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.mailchimp/subscribe/jsonschema/1-*-*",
            "jsonPath": "$.data.['email', 'ip_opt']",
            "encryptionKeyName": "email-key"
          }
        }
      ],
      "strategy": {
        "pseudonymize": {
          "hashFunction": "SHA-1",
          "salt": "pepper123"
        }
      },
      "encryption": [
        {
          "keyName": "email-key",
          "key": "some-rsa-publickey"
        },
        {
          "keyName": "other-key",
          "key": "some-rsa-publickey-2"
        }
      ]
    }
  }
}

The emitted event will also be changed (value is encrypted and base64 encoded, the actual implementation will need to be finalised):

{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:com.snowplowanalytics.snowplow/pii_transformation/jsonschema/2-0-0",
    "data": {
      "pii": {
        "pojo": [
          {
            "fieldName": "user_fingerprint",
            "originalValue": "its_you_again!",
            "modifiedValue": "27abac60dff12792c6088b8d00ce7f25c86b396b8c3740480cd18e21068ecff4"
          },
          {
            "fieldName": "user_ipaddress",
            "originalValue": "eZDx1Y1SMIcP0vIzkNsx3xMZ4twdyqqU5bqNPkLNYElDNcUhD/8NH0Xb8vYPLvy5NZmm5XuMzInQ7xRHr4kB9q4kvRwtCwUGSS4OSR/QlPQWMz6NzMAep7oQ10crpdxQcXH5LxvMTMROndxOnV5Aglepd4zuSMRj+q3u9uH6zZmiMjS/1xcxC4dRdD3NtrR9IpNjaqkx9BrQ2S1ClsVntU/UGLZEAle5H+Uy+qvXYczbQsmVVwYLdgv4S4Om0QPW+T48pu2VGXVwNnJUwdAFqL+snAFrOfyGa1oDcwoTGcbhR3YJO2Gv7NzvMyDtPaNLaYgrzDJcDV1qLt1W12h2Bg==",
            "modifiedValue": "dd9720903c89ae891ed5c74bb7a9f2f90f6487927ac99afe73b096ad0287f3f5",
            "encryptionKeyName": "other-key"
          },
          {
            "fieldName": "user_id",
            "originalValue": "eZDx1Y1SMIcP0vIzkNsx3xMZ4twdyqqU5bqNPkLNYElDNcUhD/8NH0Xb8vYPLvy5NZmm5XuMzInQ7xRHr4kB9q4kvRwtCwUGSS4OSR/QlPQWMz6NzMAep7oQ10crpdxQcXH5LxvMTMROndxOnV5Aglepd4zuSMRj+q3u9uH6zZmiMjS/1xcxC4dRdD3NtrR9IpNjaqkx9BrQ2S1ClsVntU/UGLZEAle5H+Uy+qvXYczbQsmVVwYLdgv4S4Om0QPW+T48pu2VGXVwNnJUwdAFqL+snAFrOfyGa1oDcwoTGcbhR3YJO2Gv7NzvMyDtPaNLaYgrzDJcDV1qLt1W12h2Bg==",
            "modifiedValue": "7d8a4beae5bc9d314600667d2f410918f9af265017a6ade99f60a9c8f3aac6e9",
            "encryptionKeyName": "other-key"
          }
        ],
        "json": [
          {
            "fieldName": "unstruct_event",
            "originalValue": "eZDx1Y1SMIcP0vIzkNsx3xMZ4twdyqqU5bqNPkLNYElDNcUhD/8NH0Xb8vYPLvy5NZmm5XuMzInQ7xRHr4kB9q4kvRwtCwUGSS4OSR/QlPQWMz6NzMAep7oQ10crpdxQcXH5LxvMTMROndxOnV5Aglepd4zuSMRj+q3u9uH6zZmiMjS/1xcxC4dRdD3NtrR9IpNjaqkx9BrQ2S1ClsVntU/UGLZEAle5H+Uy+qvXYczbQsmVVwYLdgv4S4Om0QPW+T48pu2VGXVwNnJUwdAFqL+snAFrOfyGa1oDcwoTGcbhR3YJO2Gv7NzvMyDtPaNLaYgrzDJcDV1qLt1W12h2Bg==",
            "modifiedValue": "269c433d0cc00395e3bc5fe7f06c5ad822096a38bec2d8a005367b52c0dfb428",
            "jsonPath": "$.ip",
            "schema": "iglu:com.mailgun/message_clicked/jsonschema/1-0-0",
            "encryptionKeyName": "email-key"
          },
          {
            "fieldName": "contexts",
            "originalValue": "eZDx1Y1SMIcP0vIzkNsx3xMZ4twdyqqU5bqNPkLNYElDNcUhD/8NH0Xb8vYPLvy5NZmm5XuMzInQ7xRHr4kB9q4kvRwtCwUGSS4OSR/QlPQWMz6NzMAep7oQ10crpdxQcXH5LxvMTMROndxOnV5Aglepd4zuSMRj+q3u9uH6zZmiMjS/1xcxC4dRdD3NtrR9IpNjaqkx9BrQ2S1ClsVntU/UGLZEAle5H+Uy+qvXYczbQsmVVwYLdgv4S4Om0QPW+T48pu2VGXVwNnJUwdAFqL+snAFrOfyGa1oDcwoTGcbhR3YJO2Gv7NzvMyDtPaNLaYgrzDJcDV1qLt1W12h2Bg==",
            "modifiedValue": "1c6660411341411d5431669699149283d10e070224be4339d52bbc4b007e78c5",
            "jsonPath": "$.data.emailAddress2",
            "schema": "iglu:com.acme/email_sent/jsonschema/1-1-0",
            "encryptionKeyName": "email-key"
          },
          {
            "fieldName": "contexts",
            "originalValue": "eZDx1Y1SMIcP0vIzkNsx3xMZ4twdyqqU5bqNPkLNYElDNcUhD/8NH0Xb8vYPLvy5NZmm5XuMzInQ7xRHr4kB9q4kvRwtCwUGSS4OSR/QlPQWMz6NzMAep7oQ10crpdxQcXH5LxvMTMROndxOnV5Aglepd4zuSMRj+q3u9uH6zZmiMjS/1xcxC4dRdD3NtrR9IpNjaqkx9BrQ2S1ClsVntU/UGLZEAle5H+Uy+qvXYczbQsmVVwYLdgv4S4Om0QPW+T48pu2VGXVwNnJUwdAFqL+snAFrOfyGa1oDcwoTGcbhR3YJO2Gv7NzvMyDtPaNLaYgrzDJcDV1qLt1W12h2Bg==",
            "modifiedValue": "72f323d5359eabefc69836369e4cabc6257c43ab6419b05dfb2211d0e44284c6",
            "jsonPath": "$.emailAddress",
            "schema": "iglu:com.acme/email_sent/jsonschema/1-0-0",
            "encryptionKeyName": "email-key"
          }
        ]
      },
      "strategy": {
        "pseudonymize": {
          "hashFunction": "SHA-256"
        }
      }
    }
  }
}

An incidental benefit coming out of this is that the values in the Kinesis PII stream are also encrypted.
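
A minimal sketch of what 'encrypt and base64-encode the original value' could look like with one of the configured RSA public keys; key loading, padding choice and any hybrid scheme for larger values are deliberately left open.

import java.security.PublicKey
import java.util.Base64
import javax.crypto.Cipher

// Hedged sketch: RSA-encrypt an original value with a configured public key
// and base64-encode the result, as the proposed emitted event implies.
def encryptOriginalValue(originalValue: String, publicKey: PublicKey): String = {
  val cipher = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding")
  cipher.init(Cipher.ENCRYPT_MODE, publicKey)
  Base64.getEncoder.encodeToString(cipher.doFinal(originalValue.getBytes("UTF-8")))
}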

Common: make IP lookups enrichment optional for some events

A GeoIP lookup might not be necessary for all of the events that we receive, e.g. page pings or webhook events coming from other systems. Reducing the amount of duplicated GeoIP data will help a lot with the efficiency and clarity of the data.

I am proposing that we add an optional list field to its settings schema to specify the event.event_name values for which we want the GeoIP lookup. Then we would only run the GeoIP lookup for those specific events. The other change would be an if in EnrichmentManager to limit the events for GeoIP lookup when the optional list is set in the configuration.

CC: @Nafid
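
A minimal sketch of the proposed gate in EnrichmentManager; lookupGeo and the type parameter are illustrative placeholders, not existing code.

// Hedged sketch: run the GeoIP lookup only when no allow-list is configured,
// or when the event name is on it.
def maybeGeoLookup[A](eventName: String, allowedEventNames: Option[Set[String]])(lookupGeo: => Option[A]): Option[A] =
  allowedEventNames match {
    case Some(names) if !names.contains(eventName) => None
    case _                                         => lookupGeo
  }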


Common: explore the possibility of fragmenting large Enriched Events

In stream enrich, if the event is too large it cannot be transmitted and it is dropped. Furthermore, we assume that PII events are small enough whenever their parent event is small enough (read more here and here).

One alternative would be to create an event fragmentation and reassembly protocol.

We will need to remember that most products operate on single events so that may not be practicable.
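
A purely exploratory sketch of what a fragmentation protocol could look like: split an oversized serialised event into numbered chunks that a downstream consumer could reassemble by (eventId, index, total). This is not an existing Snowplow protocol.

final case class Fragment(eventId: String, index: Int, total: Int, chunk: Array[Byte])

// Hedged sketch: fragment a serialised event into fixed-size chunks.
def fragment(eventId: String, payload: Array[Byte], maxChunkBytes: Int): List[Fragment] = {
  val chunks = payload.grouped(maxChunkBytes).toList
  chunks.zipWithIndex.map { case (bytes, i) => Fragment(eventId, i, chunks.size, bytes) }
}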

Stream: support private Kinesis endpoints

Project: Stream Enrich
Version: 0.21.0

Error: Received error response: com.amazonaws.services.kinesis.model.AmazonKinesisException: Credential should be scoped to a valid region, not ‘vpce’ (Service: AmazonKinesis; Status Code: 400; Error Code: InvalidSignatureException).

Debug logs show that the credential scope for the AWS4 Canonical Request comes through as 20190611/vpce/kinesis/aws4_request. This happens only for the enricher, not for the collector.

Steps to reproduce:
Provide customEndpoint as a private Kinesis endpoint:
customEndpoint = vpce-XXXXXXXXXXX-XXXXXXXXXXX.kinesis.eu-west-1.vpce.amazonaws.com

More info:
https://discourse.snowplowanalytics.com/t/snowplow-enrcihment-not-working-for-private-kinesis-endpoints/2923
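
A hedged sketch (AWS SDK for Java v1) of one way to make signing work with a VPC endpoint: set the signing region explicitly alongside the custom endpoint, so it is not inferred from the 'vpce' hostname.

import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder

// Hedged sketch: build a Kinesis client against a private (VPC) endpoint with
// an explicit signing region.
val kinesis = AmazonKinesisClientBuilder
  .standard()
  .withEndpointConfiguration(
    new EndpointConfiguration(
      "https://vpce-XXXXXXXXXXX-XXXXXXXXXXX.kinesis.eu-west-1.vpce.amazonaws.com",
      "eu-west-1" // explicit signing region
    )
  )
  .build()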
