
dbt_census_utils's Introduction

Census Utils dbt Package (Docs)

🎁 What does this dbt package do?

  • Adds a number of macros that are useful when transforming data to be synced via reverse ETL (with Census or your own pipelines).
  • Provides documentation for the macros on the dbt docs site.

πŸ‘©β€πŸ’» How do I use the dbt package?

Step 1: Prerequisites

To use this dbt package, you must have the following:

  • A BigQuery, Snowflake, or Redshift data warehouse.

Step 2: Install the package

Include the following census_utils package version in your packages.yml file, then run 'dbt deps':

TIP: Check dbt Hub for the latest installation instructions or read the dbt docs for more information on installing packages.

packages:
  - package: sutrolabs/census_utils
    version: [">=0.1.0", "<1.0.0"]

Step 3: Run dbt seed

This package uses seeds for macros such as converting country codes to country names. Run 'dbt seed' after 'dbt deps' to materialize these seeds in your data warehouse.

(Optional) Step 4: Define internal user variables

The is_internal macro (source) identifies internal users based on one or more of: their email address domain, an existing list of email addresses, or an existing list of IP addresses. To use the is_internal macro, you'll need to specify at least one of these approaches. Add variables to your root dbt_project.yml file to reflect your company's domain(s) and the relations and columns where internal users are tracked. These relations can be a dbt seed or a dbt model; common ways of maintaining them include a seed file of internal users and IP addresses, a dbt model that identifies internal users directly from your application database, or a dbt model referencing a Google Sheet of internal users. For example, if your company used the domains sawtelleanalytics.com and sawtelleanalytics.co.uk, and you have a dbt seed called 'my_internal_users' with an email_address column for internal users' emails and an ip_address column for their IPs, you would add this to your vars:

vars:
  internal_domain: ('sawtelleanalytics.com', 'sawtelleanalytics.co.uk')
  internal_email_relation: 'my_internal_users'
  internal_email_column: 'email_address'
  internal_ip_relation: 'my_internal_users'
  internal_ip_column: 'ip_address'

(Optional) Step 5: Sync your dbt models to destinations with Census


Census lets you sync your dbt models to destinations such as Salesforce, Hubspot, Zendesk, Facebook, and Google. Learn how to sync your data.

List of macros:

parse_ga4_client_id (source)

This macro takes a Google Analytics 4 client ID and returns the unique ID part, the timestamp part, or the entire ID without its 'GA1.2'-style prefix.

Args:

  • client_id (required): The raw GA 4 client ID to parse.
  • extract_value (required): Specifies what to extract from the raw client ID. Should be either 'unique_id', 'timestamp', or 'client_id'.

Usage:

select 
    ga4_client_id,
    {{ census_utils.parse_ga4_client_id('ga4_client_id', 'unique_id') }} as unique_id,
    {{ census_utils.parse_ga4_client_id('ga4_client_id', 'timestamp') }} as timestamp,
    {{ census_utils.parse_ga4_client_id('ga4_client_id', 'client_id') }} as client_id
from ga4_client
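The parsing rules can be illustrated in Python (a hypothetical sketch of the logic only; the macro itself generates SQL). A GA4 `_ga` cookie value such as `GA1.2.1213337569.1677539575` carries a version prefix, a unique ID, and a Unix timestamp of the first visit:

```python
def parse_ga4_client_id(raw: str, extract_value: str) -> str:
    """Split a raw client ID like 'GA1.2.1213337569.1677539575':
    the last two dot-separated parts are the unique ID and timestamp."""
    parts = raw.split(".")
    unique_id, timestamp = parts[-2], parts[-1]
    if extract_value == "unique_id":
        return unique_id
    if extract_value == "timestamp":
        return timestamp
    if extract_value == "client_id":
        return f"{unique_id}.{timestamp}"
    raise ValueError("extract_value must be 'unique_id', 'timestamp', or 'client_id'")

raw = "GA1.2.1213337569.1677539575"
print(parse_ga4_client_id(raw, "unique_id"))   # 1213337569
print(parse_ga4_client_id(raw, "timestamp"))   # 1677539575
print(parse_ga4_client_id(raw, "client_id"))   # 1213337569.1677539575
```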

clean (source)

This macro cleans fields so that they will be accepted by APIs such as Facebook Ads or Google Ads.

Args:

  • field (required): The field to be cleaned.
  • destination (required): The destination the field will be sent to. Currently supports 'facebook' or 'google'.
  • type (required): The type of field, currently supports 'name' (for first or last name), 'email', 'city', 'country', or 'zip'. Non-US postal codes should not be formatted by this macro as requirements for those vary.

Usage:

select 
    {{ census_utils.clean('fn',destination='facebook',type='name') }} as cleaned_facebook_fn,
    {{ census_utils.clean('ln',destination='facebook',type='name') }} as cleaned_facebook_ln,
    {{ census_utils.clean('country',destination='facebook',type='country') }} as cleaned_facebook_country,
    {{ census_utils.clean('city',destination='facebook',type='city') }} as cleaned_facebook_city,
    case when {{ census_utils.clean('country_to_clean','facebook','country') }} = 'us' then {{ census_utils.clean('zip','facebook','zip') }} else zip end as cleaned_facebook_zip,
    {{ census_utils.clean('email_address','facebook','email') }} as cleaned_facebook_email
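For intuition, here is a rough Python illustration of the kind of normalization match-audience APIs expect for the 'name' type (lowercase, spaces and special characters removed); the macro's actual per-destination and per-type rules may differ:

```python
import re

def clean_for_match(value: str) -> str:
    """Lowercase the value and strip whitespace and punctuation,
    roughly what Facebook expects for name/city fields."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

print(clean_for_match("  Mary-Jane "))  # maryjane
print(clean_for_match("O'Brien"))       # obrien
```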

is_internal (source)

This macro reports whether a user is an internal user based on their email domain, email address, or IP address. Relies on at least one variable being set in dbt_project.yml.

Args:

  • email (optional): The email address of the user.
  • ip_address (optional): The IP address of the user.

Usage:

select 
    email_address
    , ip_address
    , {{ census_utils.is_internal(email='email_address',ip_address='ip_address') }} as is_internal_user
    , {{ census_utils.is_internal(email='email_address') }} as is_internal_email
    , {{ census_utils.is_internal(ip_address='ip_address') }} as is_internal_ip
from
    users
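The three checks combine with OR semantics, which can be sketched in Python (a hypothetical illustration only; the domain reuses the Step 4 example, and the real macro emits SQL against the configured relations):

```python
def is_internal(email=None, ip_address=None,
                internal_domains=("sawtelleanalytics.com",),
                internal_emails=frozenset(), internal_ips=frozenset()):
    """A user is internal if any configured check matches:
    email domain, known internal email, or known internal IP."""
    if email is not None:
        domain = email.lower().rsplit("@", 1)[-1]
        if domain in internal_domains or email.lower() in internal_emails:
            return True
    if ip_address is not None and ip_address in internal_ips:
        return True
    return False

print(is_internal(email="ada@sawtelleanalytics.com"))  # True
print(is_internal(email="ada@example.com"))            # False
```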

extract_email_domain (source)

This macro extracts the domain from an email address.

Args:

  • email (required): The email address of the user.

Usage:

select 
    email_address,
    {{ census_utils.extract_email_domain('email_address') }} as email_domain
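The extraction is equivalent to capturing everything after the '@' with a regex, shown here in Python (an illustration of the logic, not the macro's SQL):

```python
import re

def extract_email_domain(email: str):
    """Capture everything after the '@', mirroring a
    regexp_substr(lower(email), '@(.*)') pattern."""
    match = re.search(r"@(.*)", email.lower())
    return match.group(1) if match else None

print(extract_email_domain("Ada@SawtelleAnalytics.com"))  # sawtelleanalytics.com
```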

is_personal_email (source)

This macro determines whether an email address is personal, based on a list of common personal email domains.

Args:

  • email (required): The email address of the user.

Usage:

select 
    email_address,
    {{ census_utils.is_personal_email('email_address') }} as is_personal_email
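Conceptually this is a membership test of the email's domain against a list of free-provider domains, sketched in Python (the domains below are a hypothetical sample; the package ships a fuller list as a seed):

```python
# Small illustrative sample of common personal/free email domains.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

def is_personal_email(email: str) -> bool:
    """True when the email's domain appears in the personal-domain list."""
    return email.lower().rsplit("@", 1)[-1] in PERSONAL_DOMAINS

print(is_personal_email("pat@gmail.com"))               # True
print(is_personal_email("pat@sawtelleanalytics.com"))   # False
```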

get_country_code (source)

This macro converts a country name to an ISO 3166 country code.

Args:

  • country_name (required): The country name to be converted to a code.

Usage:

select 
    country_name,
    {{ census_utils.get_country_code('country_name') }} as country_code
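The behavior (lowercase for matching, uppercase code on a match, pass-through otherwise) can be sketched in Python; the mapping below is a tiny hypothetical sample of the seed that backs the macro:

```python
# Illustrative sample of the seed's name -> code mapping, including a
# renamed country (Burma/Myanmar) mapped via multiple rows.
COUNTRY_CODES = {"united states": "US", "france": "FR",
                 "myanmar": "MM", "burma": "MM"}

def get_country_code(value: str) -> str:
    """Match case-insensitively; return the uppercase ISO 3166 code on a
    match, otherwise pass the input through unchanged."""
    return COUNTRY_CODES.get(value.lower(), value)

print(get_country_code("France"))  # FR
print(get_country_code("DE"))      # DE (already a code: left as is)
```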

🎒 Does this package have dependencies?

This dbt package depends on the following dbt packages. Please be aware that these dependencies are installed by default within this package. For more information on the following packages, refer to the dbt Hub site.

IMPORTANT: If you have any of these dependent packages in your own packages.yml file, we recommend that you remove them from your root packages.yml to avoid package version conflicts.

packages:
    - package: dbt-labs/dbt_utils
      version: [">=0.9.0", "<2.0.0"]

🀝 How is this package maintained and can I contribute?

Package Maintenance

The Census team maintains only the latest version of this package. We highly recommend you stay current with the latest version and refer to the changelog and release notes for more information on changes across versions.

Contributions

We highly encourage and welcome contributions to this package. Check out this dbt Discourse article on the best workflow for contributing to a package!

🧭 How can I get help or make suggestions?

  • If you have questions or want to reach out for help, please refer to the GitHub Issue section to find the right avenue of support for you.
  • If you would like to provide feedback to the dbt package team at Census or would like to request a new dbt package, please join the Operational Analytics Slack.

dbt_census_utils's People

Contributors

amit-omaze, amittiwari1979, jeffsloan, stephen986


dbt_census_utils's Issues

Change the name of seed files to ensure no conflicts

I just tested the package in a dbt project and got a conflict between the project's "country_codes.csv" seed file and the one in dbt_census_utils. We should make sure all of our seed files are prefixed with "census_utils_" so that there are no conflicts.

Macro for parsing GA 4 client IDs

One of Donny at Census' suggestions was to write a macro for parsing GA 4 client IDs. These client IDs are slightly different from Universal Analytics ones because they all include a prefix of GA 1.1 and also end with the Unix timestamp of the user's first interaction. Capturing the last 10 digits in the Client ID tells you the timestamp of first user interaction, which analysts may want to use.

GA 4 client ID is accessible as 'user_pseudo_id' if the user is using the BigQuery export. GA 4 does not automatically have the client ID in its UI the way UA did. There are a few popular guides on the internet for how to pass it from Google Tag Manager to GA 4... this one recommends adding a '.' at the end so that GA 4 doesn't convert it to a number in scientific notation, whereas this guide just sends the cookie ID minus the first 6 characters. I think we can probably trust people to append a '.' at the end on their own if they used the first guide and want to make them match, but I'm noting it in case we get any feedback on that later.

Usage might look like this:

-- GA4 client ID is GA1.2.1213337569.1677539575. I got this by going to cloudflare.com and looking at my _ga cookie in developer tools.

select
    {{ parse_ga4_client_id('client_id', 'unique_id') }} as unique_id, -- returns 1213337569
    {{ parse_ga4_client_id('client_id', 'timestamp') }} as first_visit_ts, -- returns 1677539575
    {{ parse_ga4_client_id('client_id', 'client_id') }} as client_id -- returns 1213337569.1677539575

Acceptance Criteria:
-GA4 macro can parse out the first, unique identifier part of a GA 4 cookie.
-GA 4 macro can parse out the second, timestamp part of the GA 4 cookie.
-GA 4 macro can parse out both parts of the GA 4 cookie, just slicing off the prefix.

Macro for extracting e-mail domains

A very useful macro is one that easily extracts the domain from an e-mail address, so that you can note which people signing up for your product are coworkers. Here's a simple way to do it:

{% macro extract_email_domain(email_address) %}

{% if target.type == 'snowflake' %}
{# Snowflake's regexp_substr accepts regex parameters and a capture-group number: https://docs.snowflake.com/en/sql-reference/functions/regexp_substr #}
regexp_substr(lower({{ email_address }}), '@(.*)', 1, 1, 'e', 1)
{% elif target.type == 'bigquery' %}
{# BigQuery's regexp_substr takes 2 fewer parameters: https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr #}
regexp_substr(lower({{ email_address }}), '@(.*)', 1, 1)
{% endif %}

{# Still need SQL to extract the email domain in Redshift #}

{% endmacro %}

Note: this doesn't currently check for improperly formatted emails, like periods at the end of the domain or whitespaces.

Acceptance Criteria:
-Extract e-mail domain from e-mail addresses.
-Handle improperly formatted e-mails.
-Macro should work in Snowflake, BigQuery, and Redshift.

[Suggestion] Use a subquery to get country codes instead of generating a massive case statement

Describe the issue

https://github.com/sutrolabs/dbt_census_utils/blob/main/macros/get_country_code.sql

Having a 200-line case statement feels pretty icky! Instead, could you do a subquery along these lines?

-- current approach:
select 
    user_id, 
    {{ dbt_census_utils.get_country_code('country_name') }}
from {{ ref('whatever') }}

-- proposed:
select 
    user_id,
    (select country_code from {{ ref('country_codes') }} where country_name = lower(country_name))
from {{ ref('wherever') }}

Relevant error or model output

No response

Expected behavior

Subquery above

dbt Project configurations

N/A

Package versions

Have just eyeballed this, haven't used it in prod

What database are you using dbt with?

snowflake

dbt Version

1.5.x

Are you interested in contributing this feature?

  • Yes.
  • No.

Host dbt docs on GitHub Pages

dbt recommends hosting dbt docs for packages on GitHub Pages. Here's an example of Fivetran doing that.

We should do this later in the project when there's more documentation to actually host.

Acceptance Criteria:
-Generate dbt docs.
-Host them on GitHub Pages.

Name standardization macro for FB/Google/etc. audiences

Some Census destinations are very picky about names. For example, Facebook requires that location names be lowercase, with spaces and special characters removed.

We should create a macro that removes and replaces characters for different destinations. If different destinations need different types of formatting, we can use variables to clean them in different ways. ie:

{{ census_utils.clean_name(city_name, 'facebook') }}

Acceptance Criteria:
-Names are stripped of spaces and other characters that Facebook won't accept.
-Macro is written in a way that it can support other standards, such as Google.
-Macro works on Snowflake, Redshift, and BigQuery (testing environments for SF/RS to be provided later).

Macro for country name to country code

When sending geographic data to a destination, it's important that all of the country info be consistent and what the destination expects. If the destination requires country code, or if your DW has a mix of country names and country codes, it's important to convert all of the country names into codes. We can accomplish this with a seed file mapping names to codes (the same seed can also help with the reverse transformation).

Acceptance Criteria:
-The macro checks for country names, and if it finds a match, converts them to codes. Everything else is left as is (this way, if it's already a code it will stay that way).
-The macro lowercases country names so that capitalization doesn't affect whether it finds a match. The country code that is output should be uppercase though.
-If possible, the macro should check for any countries that have been renamed in the past 50 years, ie Burma/Myanmar. This would be handled by having multiple rows for them in the seed file.

Macro for identifying internal e-mails and internal IP addresses

When modeling customer behavior, it's important to distinguish real customers from employees testing the system. I've done this 3 different ways in the past:

  1. Checking if their email domain matches the company's e-mail domain.
  2. Checking if their email is in a list of internal e-mails in a seed file or gsheet that's being ingested into the data warehouse. This is useful if internal users are signing up with their Gmail accounts in order to have a bunch of different accounts.
  3. Checking if their IP address matches a list of internal IPs in a seed file or gsheet. This is useful for web data that doesn't have an e-mail associated with it.

We can implement a macro for this using three project variables:

  1. A variable that sets the company email domain (we'll default this to getcensus.com and then customers can overwrite it).
  2. A variable that sets the location of the table/view and column name for internal e-mails.
  3. A variable that sets the location of the table/view and column name for internal IP addresses.

Acceptance Criteria:
-Identify email addresses that use the company e-mail domain.
-Identify email addresses that are in a table/view of internal e-mails.
-Identify IP addresses that are in a table/view of internal IP addresses.
-Macro should work in Snowflake, BigQuery, and Redshift.

Create readme

Create a readme, inspired by Fivetran package's excellent installation instructions and dbt_utils' clear explanation of macros.

Acceptance Criteria:
-Readme explains how to install the package.
-Readme explains how to use each macro.

Macro for identifying work vs personal email addresses

There should be a macro to check if an e-mail address is from a personal account. We could set the list of personal e-mail domains at the top of the macro, but it would probably be better to make a seed file of them. I've attached a list we could use.

Macro would look something like this:
{% macro is_personal_email(email_address) %}
{# e.g. check the extracted domain against a seed built from the attached list #}
{{ extract_email_domain(email_address) }} in (select domain from {{ ref('free_email_provider_domains') }})
{% endmacro %}
free_email_provider_domains.txt

Acceptance Criteria:
-Macro returns true if an e-mail address is personal based on the email domain (we can use the extract email domain macro for this).
-Macro returns false if an e-mail address is not personal.
-Macro works on Snowflake, Redshift, and BigQuery.
