
dbt-snowplow-media-player's Introduction




As of January 8, 2024, Snowplow is introducing the Snowplow Limited Use License Agreement, and we will be releasing new versions of our core behavioral data pipeline technology under this license.

Our mission to empower everyone to own their first-party customer behavioral data remains the same. We value all of our users and remain dedicated to helping our community use Snowplow in the optimal capacity that fits their business goals and needs.

We reflect on our Snowplow origins and provide more information about these changes in our blog post: https://eu1.hubs.ly/H06QJZw0


Overview

Snowplow is a developer-first engine for collecting behavioral data.

Thousands of organizations like Burberry, Strava, and Auto Trader rely on Snowplow to collect, manage, and operationalize real-time event data from their central data platform to uncover deeper customer journey insights, predict customer behaviors, deliver differentiated customer experiences, and detect fraudulent activities.


Why Snowplow?

  • 🏔️ “Glass-box” technical architecture capable of processing billions of events per day.
  • 🛠️ Over 20 SDKs to collect data from web, mobile, server-side, and other sources.
  • ✅ A unique approach based on schemas and validation ensures your data is as clean as possible.
  • 🪄 Over 15 enrichments to get the most out of your data.
  • 🏭 Stream data to your data warehouse/lakehouse or SaaS destinations of choice — Snowplow fits nicely within the Modern Data Stack.

➡ Where to start? ⬅️

  • Snowplow Community Edition: equips you with everything you need to start creating behavioral data in a high-fidelity, machine-readable way. Head over to the Quick Start Guide to set things up.
  • Snowplow Behavioral Data Platform: looking for an enterprise solution with a console, APIs, data governance, and workflow tooling? The Behavioral Data Platform is our managed service that runs in your AWS, Azure, or GCP cloud. Book a demo.

The documentation is a great place to learn more.

Would you rather dive into the code? Then you are already in the right place!


Snowplow technology 101

[Diagram: Snowplow architecture]

The repository structure follows the conceptual architecture of Snowplow, which consists of six loosely-coupled sub-systems connected by five standardized data protocols/formats.

To briefly explain these six sub-systems:

  • Trackers fire Snowplow events. Currently we have 15 trackers, covering web, mobile, desktop, server, and IoT.
  • Collector receives Snowplow events from trackers. Currently we have one official collector implementation with different sinks: Amazon Kinesis, Google PubSub, Amazon SQS, Apache Kafka, and NSQ.
  • Enrich cleans up the raw Snowplow events, enriches them, and puts them into storage. Currently we have several implementations, built for different environments (GCP, AWS, Apache Kafka), and one core library.
  • Storage is where the Snowplow events live. Currently we store the Snowplow events in a flat file structure on S3, and in the Redshift, Postgres, Snowflake, and BigQuery databases.
  • Data modeling is where event-level data is joined with other data sets and aggregated into smaller data sets, and business logic is applied. This produces a clean set of tables which make it easier to perform analysis on the data. We officially support data models for Redshift, Snowflake and BigQuery.
  • Analytics are performed on the Snowplow events or on the aggregate tables.

For more information on the current Snowplow architecture, please see the Technical architecture.


About this repository

This repository is an umbrella repository for all loosely-coupled Snowplow components and is updated on each component release.

Since June 2020, all components have been extracted into their dedicated repositories (more info here) and this repository serves as an entry point for Snowplow users and as a historical artifact.

Components that have been extracted to their own repository are still here as git submodules.

Trackers

A full list of supported trackers can be found on our documentation site. Popular trackers and use cases include:

  • Web: JavaScript, AMP, React Native, Flutter
  • Mobile: Android, iOS, React Native
  • Gaming: Unity, C++, Lua
  • TV: Roku, iOS, Android
  • Desktop & Server: Command line, .NET, Go, Java, Node.js, PHP, Python, Ruby, Scala, C++, Rust, Lua

Loaders

Iglu

Data modeling

Web

Mobile

Media

Retail

Testing

Parsing enriched event


Community

We want to make it super easy for Snowplow users and contributors to talk to us and connect with one another, to share ideas, solve problems and help make Snowplow awesome. Join the conversation:

  • Meetups. Don’t miss your chance to talk to us in person. We are often on the move with meetups in Amsterdam, Berlin, Boston, London, and more.
  • Discourse. Our forum for all Snowplow users: engineers setting up Snowplow, data modelers structuring the data, and data consumers building insights. You can find guides, recipes, questions and answers from Snowplow users and the Snowplow team. All questions and contributions are welcome!
  • GitHub. If you spot a bug, please raise an issue in the GitHub repository of the component in question. Likewise, if you have developed a cool new feature or an improvement, please open a pull request; we'll be glad to integrate it into the codebase! For brainstorming a potential new feature, Discourse is the best place to start.
  • Email. If you want to talk to Snowplow directly, email is the easiest way. Get in touch at [email protected].

Copyright and license

Snowplow is copyright 2012-2023 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

dbt-snowplow-media-player's People

Contributors

agnessnowplow, emielver, georgewoodhead, matus-tomlein, paulboocock, rlh1994


dbt-snowplow-media-player's Issues

Add automated testing for the media player model

Describe the feature

We should add automated tests to the repository so that they run on every PR, ensuring that we are not making breaking changes.

Who will this benefit?

This will benefit all developers of future versions of the media player package, as well as all reviewers, since checks will be in place to ensure that changes don't deviate from expected behaviour.

Handle exception when duration equals zero

Describe the feature

Currently the model does not expect the media duration to equal 0, which may happen due to incorrect tracking. This causes the model to fail with a division-by-zero error when none of the events coming from the same media_id have a duration other than 0. This should be handled in the model, together with a dbt test to alert in case it happens.
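
As a sketch of the kind of guard this could use (the CTE and column names here are illustrative, not the package's actual code), wrapping the divisor in nullif turns the failure into a null that a dbt test can then flag:

    select
        media_id,
        -- nullif returns null when duration is 0, so the division yields
        -- null instead of raising a division-by-zero error; rows with a
        -- null result can then be surfaced by a dbt test on this column
        play_time_sec / nullif(duration, 0) as play_ratio
    from media_interactions  -- illustrative source name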

playback_quality_field macro attempts to coalesce strings and ints when whatwg isn't enabled

Describe the bug

If you don't have whatwg media and video enabled and are using Snowflake or BigQuery, then the snowplow_media_player_base_events_this_run model will fail. In this scenario the playback_quality_field macro casts null as the video_width dtype, which in this case is integer (see this line in the macro), and this conflicts with the other fields in the coalesce, which are correctly strings.

This is the failed Snowflake compiled SQL snippet:

coalesce(
    a.contexts_com_snowplowanalytics_snowplow_media_player_2[0]:quality::varchar,
    cast(null as varchar),
    cast(null as integer),
    'N/A'
) as playback_quality
We should perhaps always cast this null as a string instead, but we need to check the reasoning why we used cast(null as {{ video_width.get('dtype', 'string') }}) vs cast(null as string), or make this line apply only to Databricks targets. We should also improve our test coverage by doing a run with this configuration.
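
A sketch of the always-cast-as-string option (the macro name is real, but the argument names and body shown here are an illustration, not the macro's actual code):

    {% macro playback_quality_field(quality_field, video_width) %}
        coalesce(
            {{ quality_field }},
            -- always fall back to a string-typed null so every branch of
            -- the coalesce shares one type, regardless of video_width's dtype
            cast(null as {{ dbt.type_string() }}),
            'N/A'
        )
    {% endmacro %}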

Rename session_stats for custom yml

The snowplow_media_player_custom.yml refers to the snowplow_media_player_session_stats model as snowplow_media_player_plays_by_session, which should be updated.

column mp.current_time does not exist issue while running in redshift

Describe the bug

mp_context as (

    select
        root_id,
        root_tstamp,
        duration,
        playback_rate,
        current_time,
        percent_progress,
        muted,
        is_live,
        loop,
        volume,
        row_number() over (partition by root_id order by root_tstamp) as dedupe_index

    from {{ var('snowplow__media_player_context') }}

    where root_tstamp between {{ lower_limit }} and {{ upper_limit }}

)

In this query, current_time is a reserved word in Redshift (the built-in CURRENT_TIME function, which returns a timetz), so the unquoted column reference does not resolve and the dbt run fails:

Completed with 1 error and 0 warnings:
06:52:02
06:52:02 Database Error in model snowplow_media_player_interactions_this_run (models/web/scratch/interactions_this_run/redshift_postgres/snowplow_media_player_interactions_this_run.sql)
06:52:02 column mp.current_time does not exist
06:52:02 compiled Code at target/run/snowplow_media_player/models/web/scratch/interactions_this_run/redshift_postgres/snowplow_media_player_interactions_this_run.sql
06:52:02

Steps to reproduce

I was able to fix this issue by quoting the column name as "current_time":
select
    root_id,
    root_tstamp,
    duration,
    playback_rate,
    "current_time",
    percent_progress,
    muted,
    is_live,
    loop,
    volume,
    row_number() over (partition by root_id order by root_tstamp) as dedupe_index

from {{ var('snowplow__media_player_context') }}

where root_tstamp between {{ lower_limit }} and {{ upper_limit }}

System information

Which database are you using dbt with?

  • redshift

Move macros to snowplow_utils

Describe the feature

The macros used in the package should be migrated to snowplow_utils, as they would be valuable across different packages.

Add support for picture-in-picture playback of videos across multiple pages/screens

Is your feature request related to a problem? Please describe.

A video played in picture-in-picture (PIP) mode may produce events with multiple different page view IDs. Currently we assume that a playback belongs to only a single page view ID.

Describe the solution you'd like

Allow for the play to cover multiple page view IDs and add a column in the media base table that lists all of the page view IDs.
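
A sketch of such a column (illustrative names; Snowflake syntax shown, other warehouses would use their own array aggregate):

    select
        play_id,
        -- collect every page view the playback spanned, e.g. when a
        -- picture-in-picture video follows the user across pages
        array_agg(distinct page_view_id) as page_view_ids
    from media_events  -- illustrative source name
    group by play_id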

Drop support for dbt versions below 1.3

Describe the feature

dbt-core v1.3.x is a decent shift away from previous versions of dbt in terms of functionality and macros, so we need to prepare our media-player package to be able to support this version and drop support for lower versions.

Dependency

We need to wait for dbt-utils v1.0 to be released, then update snowplow-utils to support that version of dbt-utils, and then build on top of that newer snowplow-utils release.

Change incremental logic for media_stats

Describe the bug

Currently the time window to wait for a page_view to be processed (snowplow__max_media_pv_window) for the media_stats table is capped at 10 hours by default to allow for late-arriving data. However, it is calculated from the current date rather than from the event's derived_tstamp, which means the filter does not work as intended: sessions may be processed while they are still ongoing. Although the current logic could still work if the value were negative (i.e. -10 instead of the default 10), that would be confusing for users, so it is best to turn the logic around and add the 10 hours to the event's timestamp.
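
A sketch of the reversed filter, using dbt's cross-database dateadd macro (the exact variable and column usage in the model may differ):

    -- only process a playback once at least snowplow__max_media_pv_window
    -- hours have passed since the event's derived_tstamp
    where {{ dbt.dateadd('hour', var('snowplow__max_media_pv_window', 10), 'derived_tstamp') }}
        < current_timestamp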

Remove deduplication for Databricks for get_string_agg macro

Describe the bug

In the case of Databricks/Spark, the get_string_agg macro currently uses collect_set, which removes potential duplicates when percent progresses are concatenated within the base model. For other warehouses the macro aggregates all percent progresses, so the same number is added twice if the player watched the same part of the video twice during the same page_view. The macro should use collect_list instead for the intended behaviour and the play_time_sec calculation.
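
A sketch of the Spark branch of such a macro (the dispatch structure and column argument are assumptions; collect_list and array_join are standard Spark SQL functions):

    {% if target.type in ('databricks', 'spark') %}
        -- collect_list keeps duplicates, matching string_agg behaviour on
        -- other warehouses; collect_set would silently deduplicate
        array_join(collect_list({{ column }}), ',')
    {% endif %}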

Fix duplicate media_id causing failures in the media_stats table

Describe the bug

This problem occurs in case of either:

  1. The same media label being used for two different media contents
  2. Or some events being tracked with a different media_type or media_player_type than other media events for the same content (the properties being set later in the tracking)

This causes the media_stats table to break because it has a unique key on the media_id while also grouping by the media_label, media_type and media_player_type (see here).

Steps to reproduce

Generate events which have different media_type tracked for the same media_label.

Expected results

We don't want to hide this problem as it signals an issue in the tracking. But we also don't want the model to break. Instead, it would be better if the media_stats table contained multiple rows for each of the tracked property combinations.

Actual results

dbt jobs fail in this case.

Potential solutions

A couple of solutions are possible:

  1. Add media_type and media_player_type to the surrogate key when generating media_id (here) – this would be a breaking change.
  2. Change the unique key for the media_stats table to instead be a combined version of the media_id, media_label, media_type, and media_player_type (or a surrogate key for them).
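
A sketch of option 2 using dbt_utils.generate_surrogate_key (the config and upstream ref are illustrative, not the package's actual code):

    {{ config(
        materialized='incremental',
        unique_key='media_stats_id'
    ) }}

    select
        {{ dbt_utils.generate_surrogate_key([
            'media_id', 'media_label', 'media_type', 'media_player_type'
        ]) }} as media_stats_id,
        media_id,
        media_label,
        media_type,
        media_player_type
    from {{ ref('snowplow_media_player_base') }}  -- illustrative upstream model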

Improve the GitHub workflow

Standardise the GitHub workflow with issue creation and PRs to be in line with the web and mobile packages

Fix counting impressions based using distinct play_id instead of page_view_id

Describe the bug

Impressions in the media base table are counted as the number of distinct page views. However, this is imprecise as there may be multiple videos per page, which then leads to the play rate being higher than 1 (the number of plays is more than the number of page views).

Expected results

A better solution would be to use the number of distinct play_id values as the number of impressions. This would take into account all the videos loaded on the page.
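
As a sketch (illustrative names; the model's actual aggregation is more involved):

    select
        media_id,
        -- each play_id identifies one loaded player instance, so this counts
        -- every video on a page rather than the page view itself
        count(distinct play_id) as impressions
    from {{ ref('snowplow_media_player_base') }}  -- illustrative upstream model
    group by media_id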

Optimize performance in Databricks for incremental models

Describe the feature

We need to optimize performance in Databricks. We can leverage the optimizeWrite and optimizeCompact table properties in Databricks to achieve this, and we should focus on all incremental (Snowplow or otherwise) tables that are being generated from the Snowplow models.
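
A sketch of setting this from a dbt model config, assuming dbt-databricks' tblproperties config and Databricks' delta.autoOptimize table properties (which is presumably what the issue refers to):

    {{ config(
        materialized='incremental',
        tblproperties={
            'delta.autoOptimize.optimizeWrite': 'true',
            'delta.autoOptimize.autoCompact': 'true'
        }
    ) }}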

Fields mismatch in snowplow_media_player_custom.yml and snowplow_media_player_session_stats.sql

Describe the bug

There is a mismatch between the following fields in models/custom/snowplow_media_player_custom.yml and models/custom/snowplow_media_player_session_stats.sql:

.yml                   .sql
play_time_secs         play_time_mins
play_time_muted_secs   play_time_muted_mins
avg_play_time_sec      avg_play_time_mins


Replace snowplow_web with a base that can be compatible with mobile events

Is your feature request related to a problem? Please describe.

Prepare to add support for mobile events.

Describe the solution you'd like

Remove the snowplow_web package as the base and use a custom base taken from the ecommerce package. The motivation for this is to be able to process mobile events as well.

Additional context

We won't yet support mobile events in this issue. We can only generate mobile events from the new media schemas, which are not yet supported by the package, so we will add mobile support later, when adding support for the new media schemas. This issue just prepares for that in advance.

Make Media Player choice optional

Describe the bug

Currently the model will only work as expected if both HTML5 and YouTube video tracking are enabled.

Steps to reproduce

Running the snowplow_media_player_interactions_this_run model fails when any of the following context tables do not exist in the database:

    from {{ ref("snowplow_web_base_events_this_run") }} as e

    inner join {{ source('atomic', 'com_snowplowanalytics_snowplow_media_player_event_1') }} as mpe
    on mpe.root_id = e.event_id and mpe.root_tstamp = e.collector_tstamp

    left join {{ source('atomic', 'com_snowplowanalytics_snowplow_media_player_1') }} as mp
    on mp.root_id = e.event_id and mp.root_tstamp = e.collector_tstamp

    left join {{ source('atomic', 'com_youtube_youtube_1') }} as y
    on  y.root_id = e.event_id and y.root_tstamp = e.collector_tstamp

    left join {{ source('atomic', 'org_whatwg_media_element_1') }} as me
    on me.root_id = e.event_id and me.root_tstamp = e.collector_tstamp

    left join {{ source('atomic', 'org_whatwg_video_element_1') }} as ve
    on ve.root_id = e.event_id and ve.root_tstamp = e.collector_tstamp
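
A sketch of guarding the optional joins behind variables (the variable names, e.g. snowplow__enable_youtube, follow the package's naming style but are shown here as assumptions):

    {% if var('snowplow__enable_youtube', false) %}
        left join {{ source('atomic', 'com_youtube_youtube_1') }} as y
            on y.root_id = e.event_id and y.root_tstamp = e.collector_tstamp
    {% endif %}

    {% if var('snowplow__enable_whatwg_media', false) %}
        left join {{ source('atomic', 'org_whatwg_media_element_1') }} as me
            on me.root_id = e.event_id and me.root_tstamp = e.collector_tstamp
    {% endif %}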

Incremental update for snowplow_media_player_media_stats

Describe the bug

Error creating incremental update to snowplow_media_player_media_stats

Steps to reproduce

Run:

dbt build --select snowplow_media_player_media_stats

Expected results

The command is run successfully.

Actual results

The command fails with:

Database Error in model snowplow_media_player_media_stats (models/web/snowplow_media_player_media_stats.sql)
  division by zero: 0 / 0
  compiled Code at target/run/snowplow_media_player/models/web/snowplow_media_player_media_stats.sql

System information

The contents of your packages.yml file:

packages:
  - package: snowplow/snowplow_web
    version: [">=0.12.0", "<0.13.0"]


The output of dbt --version:

1.5.0

The operating system you're using:

Cloud

Additional context

We are on the latest version. Adding the --full-refresh flag allows the build command to complete successfully.

