
dbt_fivetran_utils's Introduction

Fivetran Utilities for dbt

⚠️ Warning! ⚠️

You are viewing a deprecated branch, which is no longer maintained and not recommended for use. This branch remains the default branch to prevent breaking changes to any existing dbt projects that use this package.

To view the current release branch, please refer to the release list.

This package includes macros that are used in Fivetran's dbt packages.

Macros

_get_utils_namespaces (source)

This macro allows for namespacing macros throughout a dbt project. The macro currently consists of two namespaces:

  • dbt_utils
  • fivetran_utils

add_pass_through_columns (source)

This macro creates the proper name, datatype, and aliasing for user defined pass through column variables. It allows pass through variables to be more dynamic and lets users alias the custom fields they are bringing in. This macro is typically used within staging models of a Fivetran dbt source package to pass through user defined custom fields.

Usage:

{{ fivetran_utils.add_pass_through_columns(base_columns=columns, pass_through_var=var('hubspot__deal_pass_through_columns')) }}

Args:

  • base_columns (required): The name of the variable where the base columns are contained. This is typically columns.
  • pass_through_var (required): The variable which contains the user defined pass through fields.
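
For illustration, a hedged sketch of one possible shape for such a pass-through variable in dbt_project.yml, assuming the dict-style format with an optional alias (the field names below are hypothetical):

# in dbt_project.yml
vars:
  hubspot__deal_pass_through_columns:
    - name: "property_custom_score"
    - name: "property_region"
      alias: "deal_region"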

array_agg (source)

This macro allows for cross database field aggregation. The macro contains the database specific field aggregation function for BigQuery, Snowflake, Redshift, and Postgres. By default, a comma (,) is used as the delimiter in the aggregation.

Usage:

{{ fivetran_utils.array_agg(field_to_agg="teams") }}

Args:

  • field_to_agg (required): Field within the table you are wishing to aggregate.
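
For context, a minimal sketch of how this aggregate might sit inside a grouped select (the relation and grouping column below are illustrative, not from this package):

select
    project_id,
    {{ fivetran_utils.array_agg(field_to_agg="teams") }} as teams
from project_teams
group by 1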

ceiling (source)

This macro allows for cross database use of the ceiling function. The ceiling function returns the smallest integer greater than, or equal to, the specified numeric expression. The ceiling macro is compatible with BigQuery, Redshift, Postgres, and Snowflake.

Usage:

{{ fivetran_utils.ceiling(num="target/total_days") }}

Args:

  • num (required): The numeric field or expression to which you wish to apply the ceiling function.

collect_freshness (source)

This macro overrides dbt's default collect_freshness macro that is called when running dbt source snapshot-freshness. It allows you to incorporate model enabling/disabling variables into freshness tests, so that, if a source table does not exist, dbt will not run (and error on) a freshness test on the table. Any package that has a dependency on fivetran_utils will use this version of the macro. If no meta.is_enabled field is provided, the collect_freshness should run exactly like dbt's default version.

Usage:

# in the sources.yml
sources:
  - name: source_name
    freshness:
      warn_after: {count: 84, period: hour}
      error_after: {count: 168, period: hour}
    tables:
      - name: table_that_might_not_exist
        meta:
          is_enabled: "{{ var('package__using_this_table', true) }}"

Args (sorta):

  • meta.is_enabled (optional): The variable(s) you would like to reference to determine if dbt should include this table in freshness tests.

dummy_coalesce_value (source)

This macro creates a dummy coalesce value based on the data type of the field. See below for the respective data type and dummy values:

  • String = 'DUMMY_STRING'
  • Boolean = null
  • Int = 999999999
  • Float = 999999999.99
  • Timestamp = cast("2099-12-31" as timestamp)
  • Date = cast("2099-12-31" as date)

Usage:

{{ fivetran_utils.dummy_coalesce_value(column="user_rank") }}

Args:

  • column (required): Field to which you are applying the dummy coalesce.
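
As a hedged sketch of a typical use, the macro call can be wrapped in a coalesce so that null values are replaced with the type-appropriate dummy value (the field and relation below are illustrative):

select
    coalesce(user_rank, {{ fivetran_utils.dummy_coalesce_value(column="user_rank") }}) as user_rank
from users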

empty_variable_warning (source)

This macro checks a declared variable and returns an error message if the variable is empty before running the models within the dbt_project.yml file.

Usage:

on-run-start: '{{ fivetran_utils.empty_variable_warning(variable="ticket_field_history_columns", downstream_model="zendesk_ticket_field_history") }}'

Args:

  • variable (required): The variable you want to check if it is empty.
  • downstream_model (required): The downstream model that is affected if the variable is empty.

enabled_vars_one_true (source)

This macro references a set of specified boolean variables and returns true if any variable value is equal to true.

Usage:

{{ fivetran_utils.enabled_vars_one_true(vars=["using_department_table", "using_customer_table"]) }}

Args:

  • vars (required): Variable(s) you are referencing to return the declared variable value.

enabled_vars (source)

This macro references a set of specified boolean variables and returns false if any variable value is equal to false.

Usage:

{{ fivetran_utils.enabled_vars(vars=["using_department_table", "using_customer_table"]) }}

Args:

  • vars (required): Variable(s) you are referencing to return the declared variable value.

fill_pass_through_columns (source)

This macro is used to generate the correct sql for package staging models for user defined pass through columns.

Usage:

{{ fivetran_utils.fill_pass_through_columns(pass_through_variable='hubspot__contact_pass_through_columns') }}

Args:

  • pass_through_variable (required): Name of the variable which contains the respective pass through fields for the staging model.

fill_staging_columns (source)

This macro is used to generate the correct SQL for package staging models. It takes a list of columns that are expected/needed (staging_columns) and compares it with columns in the source (source_columns).

Usage:

select

    {{
        fivetran_utils.fill_staging_columns(
            source_columns=adapter.get_columns_in_relation(ref('stg_twitter_ads__account_history_tmp')),
            staging_columns=get_account_history_columns()
        )
    }}

from source

Args:

  • source_columns (required): Calls the get_columns_in_relation macro and requires a ref() or source() argument pointing to the staging model within the _tmp directory.
  • staging_columns (required): Created as a result of running the generate_columns_macro for the respective table.

first_value (source)

This macro returns the value_expression for the first row in the current window frame, with cross db functionality, and ignores null values. The default first_value calculation within the macro uses the first_value window function. The Redshift first_value calculation also uses the first_value function, with the inclusion of a frame clause: {{ partition_field }} rows unbounded preceding.

Usage:

{{ fivetran_utils.first_value(first_value_field="created_at", partition_field="id", order_by_field="created_at", order="asc") }}

Args:

  • first_value_field (required): The value expression which you want to determine the first value for.
  • partition_field (required): Name of the field you want to partition by to determine the first_value.
  • order_by_field (required): Name of the field you wish to sort on to determine the first_value.
  • order (optional): The sort order you want to apply to the window frame. The order argument is asc by default. If you wish to get the last_value, you may change the argument to desc.
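
As a rough, non-authoritative sketch, the call above might render to SQL along these lines on most warehouses (exact syntax varies by destination, and Redshift adds the frame clause noted above):

first_value(created_at ignore nulls) over (partition by id order by created_at asc)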

generate_columns_macro (source)

This macro is used to generate the macro used as an argument within the fill_staging_columns macro which will list all the expected columns within a respective table. The macro output will contain name and datatype; however, you may add an optional argument for alias if you wish to rename the column within the macro.

The macro should be run using dbt's run-operation functionality, as used below. It will print out the macro text, which can be copied and pasted into the relevant macro directory file within the package.

Usage:

dbt run-operation fivetran_utils.generate_columns_macro --args '{"table_name": "promoted_tweet_report", "schema_name": "twitter_ads", "database_name": "dbt-package-testing"}'

Output:

{% macro get_admin_columns() %}

{% set columns = [
    {"name": "email", "datatype": dbt_utils.type_string()},
    {"name": "id", "datatype": dbt_utils.type_string(), "alias": "admin_id"},
    {"name": "job_title", "datatype": dbt_utils.type_string()},
    {"name": "name", "datatype": dbt_utils.type_string()},
    {"name": "_fivetran_deleted", "datatype": "boolean"},
    {"name": "_fivetran_synced", "datatype": dbt_utils.type_timestamp()}
] %}

{{ return(columns) }}

{% endmacro %}

Args:

  • table_name (required): Name of the table you are running the macro for.
  • schema_name (required): Name of the schema which the table you are running the macro for resides in.
  • database_name (optional): Name of the database which the table you are running the macro for resides in. If empty, the macro will default this value to target.database.

get_columns_for_macro (source)

This macro returns all column names and datatypes for a specified table within a database and is used as part of the generate_columns_macro.

Usage:

{{ fivetran_utils.get_columns_for_macro(table_name="team", schema_name="my_teams", database_name="my_database") }}

Args:

  • table_name (required): Name of the table for which you want to return column names and datatypes.
  • schema_name (required): Name of the schema where the above mentioned table resides.
  • database_name (optional): Name of the database where the above mentioned schema and table reside. By default this will be your target.database.

json_extract (source)

This macro allows for cross database use of the json extract function. Json extract returns data from a json object at the path you provide as the argument. The json_extract macro is compatible with BigQuery, Redshift, Postgres, and Snowflake.

Usage:

{{ fivetran_utils.json_extract(string="value", string_path="in_business_hours") }}

Args:

  • string (required): Name of the field which contains the json object.
  • string_path (required): Name of the path in the json object which you want to extract the data from.

json_parse (source)

This macro allows for cross database use of the json extract function, specifically used to parse and extract a nested value from a json object. The data is returned by the path you provide as the list within the string_path argument. The json_parse macro is compatible with BigQuery, Redshift, Postgres, Snowflake and Databricks.

Usage:

{{ fivetran_utils.json_parse(string="receipt", string_path=["charges","data",0,"balance_transaction","exchange_rate"]) }}

Args:

  • string (required): Name of the field which contains the json object.
  • string_path (required): List of item(s) that derive the path in the json object which you want to extract the data from.

pivot_json_extract (source)

This macro builds off of the json_extract macro in order to extract a list of fields from a json object and pivot the fields out into columns. The pivot_json_extract macro is compatible with BigQuery, Redshift, Postgres, and Snowflake.

Usage:

{{ fivetran_utils.pivot_json_extract(string="json_value", list_of_properties=["field 1", "field 2"]) }}

Args:

  • string (required): Name of the field which contains the json object.
  • list_of_properties (required): List of the fields that you want to extract from the json object and pivot out into columns. Any spaces will be replaced by underscores in column names.
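
For context, a hedged sketch of the pivot in a select (the source field and relation are illustrative); per the note above, the spaces in the property names become underscores in the rendered column names:

select
    {{ fivetran_utils.pivot_json_extract(string="json_value", list_of_properties=["field 1", "field 2"]) }}
from json_source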

max_bool (source)

This macro allows for cross database use of obtaining the max boolean value of a field. This macro recognizes true = 1 and false = 0. The macro will aggregate the boolean_field and return the max boolean value. The max_bool macro is compatible with BigQuery, Redshift, Postgres, and Snowflake.

Usage:

{{ fivetran_utils.max_bool(boolean_field="is_breach") }}

Args:

  • boolean_field (required): Name of the field you are obtaining the max boolean record from.
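
For context, a minimal sketch of the macro inside a grouped select, returning true if any record in the group is true (the relation and grouping column are illustrative):

select
    ticket_id,
    {{ fivetran_utils.max_bool(boolean_field="is_breach") }} as has_breach
from sla_events
group by 1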

percentile (source)

This macro is used to return the designated percentile of a field with cross db functionality. The percentile function stems from percentile_cont across databases. For Snowflake and Redshift, this macro uses the window function as opposed to the aggregate function for percentile.

Usage:

{{ fivetran_utils.percentile(percentile_field='time_to_close', partition_field='id', percent='0.5') }}

Args:

  • percentile_field (required): Name of the field of which you are determining the desired percentile.
  • partition_field (required): Name of the field you want to partition by to determine the designated percentile.
  • percent (required): The percent necessary for percentile_cont to determine the percentile. If you want to find the median, you will input 0.5 for the percent.
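
As a rough sketch of the window-function form referenced above (not the exact rendered SQL; syntax differs by destination), the median call above resembles:

percentile_cont(0.5) within group (order by time_to_close) over (partition by id)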

remove_prefix_from_columns (source)

This macro removes desired prefixes from specified columns. Additionally, a for loop is utilized, which allows prefixes to be removed from multiple columns.

Usage:

{{ fivetran_utils.remove_prefix_from_columns(columns="names", prefix='', exclude=[]) }}

Args:

  • columns (required): The columns from which you wish to remove the prefix.
  • prefix (optional): The prefix the macro will search for and remove. By default the prefix = ''.
  • exclude (optional): The columns you wish to exclude from this macro. By default no columns are excluded.

snowflake_seed_data (source)

This macro is intended to be used when a source table column is a reserved keyword in Snowflake, and Circle CI is throwing a fit. It simply chooses which version of the data to seed (the Snowflake copy should capitalize and put three pairs of quotes around the problematic column).

Usage:

    # in integration_tests/dbt_project.yml
    vars:
        table_name: "{{ fivetran_utils.snowflake_seed_data(seed_name='user_data') }}"

Args:

  • seed_name (required): Name of the seed that has separate snowflake seed data.

seed_data_helper (source)

This macro is intended to be used when a source table column is a reserved keyword in a warehouse, and Circle CI is throwing a fit. It simply chooses which version of the data to seed. Also note the warehouses argument is a list and multiple warehouses may be added based on the number of warehouse specific seed data files you need for integration testing.

Usage:

    # in integration_tests/dbt_project.yml
    vars:
        table_name: "{{ fivetran_utils.seed_data_helper(seed_name='user_data', warehouses=['snowflake', 'postgres']) }}"

Args:

  • seed_name (required): Name of the seed that has separate warehouse-specific seed data.
  • warehouses (required): List of the warehouses for which you want CircleCi to use the helper seed data.

staging_models_automation (source)

This macro is intended to be used as a run-operation when generating Fivetran dbt source package staging models/macros. This macro receives user input to create all the necessary bash commands, appended with && so they may all be run at once. The output of this macro within the CLI can then be copied and pasted as a command to generate the staging models/macros.

Usage:

dbt run-operation staging_models_automation --args '{package: asana, source_schema: asana_source, source_database: database-source-name, tables: ["user","tag"]}'

CLI Output:

source dbt_modules/fivetran_utils/columns_setup.sh '../dbt_asana_source' stg_asana dbt-package-testing asana_2 user && 
source dbt_modules/fivetran_utils/columns_setup.sh '../dbt_asana_source' stg_asana dbt-package-testing asana_2 tag

Args:

  • package (required): Name of the package for which you are creating staging models/macros.
  • source_schema (required): Name of the source_schema from which the bash command will query.
  • source_database (required): Name of the source_database from which the bash command will query.
  • tables (required): List of the tables for which you want to create staging models/macros.

string_agg (source)

This macro allows for cross database field aggregation and delimiter customization. The macro contains the database specific field aggregation function for BigQuery, Snowflake, and Redshift.

Usage:

{{ fivetran_utils.string_agg(field_to_agg="issues_opened", delimiter='|') }}

Args:

  • field_to_agg (required): Field within the table you are wishing to aggregate.
  • delimiter (required): Character you want to be used as the delimiter between aggregates.
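
For context, a minimal sketch of the macro inside a grouped select with a pipe delimiter (the relation and grouping column are illustrative):

select
    repository_id,
    {{ fivetran_utils.string_agg(field_to_agg="issues_opened", delimiter='|') }} as issues_opened_list
from issues
group by 1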

timestamp_add (source)

This macro allows for cross database addition of a timestamp field and a specified datepart and interval for BigQuery, Redshift, Postgres, and Snowflake.

Usage:

{{ fivetran_utils.timestamp_add(datepart="day", interval="1", from_timestamp="last_ticket_timestamp") }}

Args:

  • datepart (required): The datepart you are adding to the timestamp field.
  • interval (required): The interval in relation to the datepart you are adding to the timestamp field.
  • from_timestamp (required): The timestamp field to which you are adding the datepart and interval.
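
For context, a hedged sketch of the macro used in a comparison, flagging records updated within one day of a reference timestamp (the fields and relation are illustrative):

select
    ticket_id,
    case
        when updated_at <= {{ fivetran_utils.timestamp_add(datepart="day", interval="1", from_timestamp="last_ticket_timestamp") }}
            then true
        else false
    end as is_updated_within_one_day
from tickets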

timestamp_diff (source)

This macro allows for cross database timestamp difference calculation for BigQuery, Redshift, Postgres, and Snowflake.

Usage:

{{ fivetran_utils.timestamp_diff(first_date="first_ticket_timestamp", second_date="last_ticket_timestamp", datepart="day") }}

Args:

  • first_date (required): The first timestamp field for the difference calculation.
  • second_date (required): The second timestamp field for the difference calculation.
  • datepart (required): The date part applied to the timestamp difference calculation.

union_relations (source)

This macro unions together an array of Relations, even when columns have differing orders in each Relation, and/or some columns are missing from some relations. Any columns exclusive to a subset of these relations will be filled with null where not present. A new column (_dbt_source_relation) is also added to indicate the source for each record.

Usage:

{{ dbt_utils.union_relations(
    relations=[ref('my_model'), source('my_source', 'my_table')],
    exclude=["_loaded_at"]
) }}

Args:

  • relations (required): An array of Relations.
  • aliases (optional): An override of the relation identifier. This argument should be populated with the overriding alias for each relation. If not populated, the relations themselves are used by default.
  • exclude (optional): A list of column names that should be excluded from the final query.
  • include (optional): A list of column names that should be included in the final query. Note the include and exclude parameters are mutually exclusive.
  • column_override (optional): A dictionary of explicit column type overrides, e.g. {"some_field": "varchar(100)"}.
  • source_column_name (optional): The name of the column that records the source of this row. By default this argument is set to none.

union_data (source)

This macro unions together tables of the same structure so that users can treat data from multiple connectors as the 'source' to a package. Depending on which macros are set, it will either look for schemas of the same name across multiple databases, or schemas with different names in the same database.

If the var with the name of the schema_variable argument is set, the macro will union the table_identifier tables from each respective schema within the target database (or source database if set by a variable). If the var with the name of the database_variable argument is set, the macro will union the table_identifier tables from the source schema in each respective database.

When using this functionality, every _tmp table should use this macro as described below.

Usage:

{{
    fivetran_utils.union_data(
        table_identifier='customer', 
        database_variable='shopify_database', 
        schema_variable='shopify_schema', 
        default_database=target.database,
        default_schema='shopify',
        default_variable='customer_source'
    )
}}

Args:

  • table_identifier: The name of the table that will be unioned.
  • database_variable: The name of the variable that users can populate to union data from multiple databases.
  • schema_variable: The name of the variable that users can populate to union data from multiple schemas.
  • default_database: The default database where source data should be found. This is used when unioning schemas.
  • default_schema: The default schema where source data should be found. This is used when unioning databases.
  • default_variable: The name of the variable that users should populate when they want to pass one specific relation to this model (mostly used for CI)

source_relation (source)

This macro creates a new column that signifies which database/schema a record came from when using the union_data macro above. It should be added to all non-tmp staging models when using the union_data macro.

Usage:

{{ fivetran_utils.source_relation() }}

Bash Scripts

columns_setup.sh (source)

This bash file can be used to set up or update packages to use the fill_staging_columns macro above. The bash script does the following three things:

  • Creates a .sql file in the macros directory for a source table and fills it with all the columns from the table.
    • Be sure your dbt_project.yml file does not contain any Warnings or Errors. If warnings or errors are present, the messages from the terminal will be printed above the macro within the .sql file in the macros directory.
  • Creates a ..._tmp.sql file in the models/tmp directory and fills it with a select * from {{ var('table_name') }} where table_name is the name of the source table.
  • Creates or updates a .sql file in the models directory and fills it with the filled-out version of the fill_staging_columns macro as shown above. You can then write whatever SQL you want around the macro to finish off the staging file.

The usage is as follows, assuming you are executing via a zsh terminal and in a dbt project directory that has already imported this repo as a dependency:

source dbt_modules/fivetran_utils/columns_setup.sh "path/to/directory" file_prefix database_name schema_name table_name

As an example, assuming we are in a dbt project in an adjacent folder to dbt_marketo_source:

source dbt_modules/fivetran_utils/columns_setup.sh "../dbt_marketo_source" stg_marketo "digital-arbor-400" marketo_v3 deleted_program_membership

In that example, it will:

  • Create a get_deleted_program_membership_columns.sql file in the macros directory, with the necessary macro within it.
  • Create a stg_marketo__deleted_program_membership_tmp.sql file in the models/tmp directory, with select * from {{ var('deleted_program_membership') }} in it.
  • Create or update a stg_marketo__deleted_program_membership.sql file in the models directory with the pre-filled out fill_staging_columns macro.

dbt_fivetran_utils's People

Contributors

alex-ilyichov, dylanbaker, fivetran-catfritz, fivetran-jamie, fivetran-joemarkiewicz, kristin-bagnall, lizdeika


dbt_fivetran_utils's Issues

[FEATURE] - Apply more rigorous testing to macros

Feature Description

A lot of effort has been made over the past year to include full coverage of testing for the fivetran_utils package to test against all Fivetran dbt packages which reference this package as a dependency. The full coverage feature was achieved within PR #118. This full coverage test essentially runs a dbt compile and dbt run against every single Fivetran dbt package. This is great, and ensures we are testing that the packages compile and run with the updates to this package; however, we have found that there are still notable gaps in this design. For example:

  • We should be able to ensure that when we update a macro it will result in the expected behavior across all destinations
  • When creating a new macro we should be able to ensure it works as expected and does not conflict with any other macros
  • We should be able to test basic assertions related to the behavior of the macros and how they should behave in the individual packages.

This feature request is to apply the above tests to the Fivetran Utils package in order to ensure more stable releases and more frequent updates as well.

[FEATURE] - Add package-agnostic docs definitions only once (here)

Just tested this locally -- if we create a models folder to this package and add a docs.md file with definitions, all other packages that depend on this one will be able to access those descriptions. We can also put the docs file in the macros folder according to this.

There are plenty of fields we use across the board:

  • source_relation
  • _fivetran_deleted
  • is_deleted (hubspot has good definitions for this)
  • _fivetran_id
  • _fivetran_synced

And maybe some common types of fields that we can generalize the definition to:

  • created_at
  • updated_at
  • index

[FEATURE] - Audit and mark macros for deprecation

We should review our macros and mark the outdated ones for deprecation. This will also require an audit of our packages to see which are using the old macros, or if they should be using a macro in place of package-specific logic or macros.

Create a seed data helper for identifier variables

There are some instances where we need to leverage multiple seed files to handle the nuances of the various platforms. For this, we have the seed_data_helper macro and the older snowflake_seed_data macro to account for these seed files.

However, this macro does not work when leveraged with the identifier. As such, we should create a similar macro that does the same operation for identifiers and not references. This will ensure the docs are consistent across package implementations.

BUG: v0.3.6 causing compilation error in dbt 1.0.4

Hello team,

Our team is using the following packages inside of our dbt Cloud instance:

  • package: fivetran/zendesk_source
    version: [">=0.6.0", "<0.7.0"]

  • package: fivetran/zendesk
    version: [">=0.8.0", "<0.9.0"]

With the release of v0.3.6 of the dbt_fivetran_utils, we're getting a compilation error for the new get_column_names_only macro. Manually reverting to the v0.3.5 package via our packages.yml resolves the error in the meantime.

Error message from our dbt runs:

15:28:01 Running with dbt=1.0.4
15:28:01 Partial parse save file not found. Starting full parse.
Compilation Error
In dispatch: No macro named 'get_column_names_only' found
Searched for: 'snowflake__get_column_names_only', 'default__get_column_names_only'

Timestamp_Diff Name Change for v0.1.x

During the dbt v0.20.0 upgrade, an update was applied to the timestamp_diff macro that fixed the namespace from datediff to timestamp_diff. This is causing data quality issues within some previous versions of our packages.

To address this, we will want to update the latest v0.1.x release of this package to use the correct namespace and cut a new release for v0.1.2 which will be used in our older packages.

File we need to update: timestamp_diff for v0.1.1

Task: update datediff in the namespaces to timestamp_diff

upgrade dbt_utils dependency

dbt_utils is currently on version 0.9.2 but fivetran_utils supports only version < 0.9.0:

packages:

  - package: dbt-labs/dbt_utils
    version: [">=0.8.0", "<0.9.0"]

We at re_data use fivetran_utils and already got a couple of requests to upgrade to a newer version of dbt_utils, but as for now, we cannot do it because of the dependency on fivetran_utils. Is there a timeline to upgrade the version of dbt_utils used? I imagine this is also causing problems for existing Fivetran package users too, so it shouldn't be a problem only for us :)

[FEATURE] Add Full Coverage of Dependent Packages in Integration Testing

It is quite alarming and surprising that this package does not take advantage of our standard integration testing procedures. This is a huge risk and needs to be addressed.

This feature is intended to have the team apply integration testing to the package. This should be performed following the completion of FR #76

make union_data more robust

see fivetran/dbt_shopify_source#57

When no sources in a union_schema or union_database array have a table (or if the first schema/db in the array doesn't have it?) our models fail, because we are not doing the cast(null as <data_type>) as field_in_table_we_dont_have thing. I think we should add a parameter to the union_data macro that is a dictionary/json very similar to what we have in the get_<table>_columns macros. then union_data could handle the null stuff
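
For context, a minimal sketch of the cast(null as <data_type>) pattern mentioned above, assuming the missing column's datatype is known from a get_<table>_columns-style dictionary (the field and relation names are illustrative):

select
    id,
    cast(null as {{ dbt_utils.type_string() }}) as field_in_table_we_dont_have
from second_union_schema.customer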

Union data macro not working as expected in integration tests

It has become apparent that the union data macro is not working as expected when leveraged via our integration tests. You can see in the latest Social Media PR that none of the tables are being found and each is being filled with an empty table. This should not be the case, as the union data macro should be able to pick up the identifiers.

We will need to explore more of this and figure out how to get this macro to work with our integration tests.

Duplication check for add_pass_through_columns macro

Currently add_pass_through_columns macro does not check for duplication in columns during append. This will cause errors if we are appending a column of the same name to the existing list of columns.

Current solution: manually remove all columns from macro that are not included in the current stg_* query.
Long term desired solution: leverage dbt release of python modules to test for duplication in get_<table>_columns.sql.
A couple of potential methods in order of preference:

  1. Add "current columns" list + "columns to append" list and find the set of the added columns therefore returning one set of unique columns/values
  2. Perform list-to-list string matching yielding one list of unique columns/values
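
A minimal Jinja sketch of the first approach above, assuming base_columns and columns_to_append are lists of column dictionaries keyed by name (both names are hypothetical):

{% set existing_names = base_columns | map(attribute='name') | list %}
{% for column in columns_to_append %}
    {# only append pass-through columns that are not already present #}
    {% if column.name not in existing_names %}
        {% do base_columns.append(column) %}
    {% endif %}
{% endfor %}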

[Deprecate] dummy_coalesce_value

This macro is used only in dbt_marketo. Since this macro will soon be updated within dbt_marketo, we should deprecate it here after it is moved.

Break out Pipeline.yml into separate components for easier maintenance

The pipeline.yml file within the buildkite folder is used to orchestrate the buildkite job to ensure the version of fivetran_utils works across all Fivetran dependent packages. That being said, it is easy to see at first glance how unruly this file is. It contains every single package execution order for the major supported warehouses. As such, maintenance for this can quickly get out of hand.

Therefore, we should explore breaking this pipeline.yml file into separate components or more reusable components. This way maintenance of this pipeline will not have to be as daunting as it currently is.

Add does_table_exist macro

The stripe and fivetran log packages use this macro; we should lift it out of being individually located within each package and make it part of dbt_fivetran_utils.

[FEATURE] - update_persist_pass_through_columns to include field aliasing

Update the titular macro with this as the default. This is to allow for the case where we wish to append anything to any pass through fields. We ran into this in salesforce where we were joining 1 table twice, and ran into problems with ambiguous columns. The below is Joe's fix for specifying which table a field is coming from.


{% macro custom_persist_pass_through_columns(pass_through_variable, identifier=none, transform='', append_string='') %}

{% if var(pass_through_variable, none) %}
    {% for field in var(pass_through_variable) %}
        , {{ transform ~ '(' ~ (identifier ~ '.' if identifier else '') ~ (field.alias if field.alias else field.name) ~ ')' }} as {{ field.alias if field.alias else field.name }}{{ append_string }}
    {% endfor %}
{% endif %}

{% endmacro %}

[FEATURE] Break out package contents into sub packages

Issue at Hand

The Fivetran Utils package has become larger than we originally intended. Additionally, this package is referenced and utilized across all of our packages. This can be problematic if a bug is introduced, as it will most likely cause an error for anyone using our packages. Therefore, it would be best to break this package out into separate components to mitigate the risk of introducing errors into our users' dbt projects.

Possible Break Out

  • Utility macros: Will remain in dbt_fivetran_utils
  • Automation macros and scripts: Will exist in their own project/package
  • Other non automation and utility macros: tbd, but will be separated.

[FEATURE] - Deprecate collect_freshness macro in v0.5.0 release

The collect_freshness macro is used to ensure that the source freshness test doesn't fail for cases when a Fivetran dbt package has disabled a source. Previously, if a source was disabled in a package, the freshness test would still fail. This macro ensured that it would not fail.

However, in the recent versions of dbt-core there is now a native feature (see example here) which allows sources to be enabled/disabled based on a tag. Therefore, this macro will no longer be necessary once we move to the v0.5.0 release of fivetran_utils and only support dbt-core versions of v1.8.0 or greater.

Once we move to v0.5.0 of Fivetran Utils and update all Fivetran dbt packages to support this version, we will fully remove the collect_freshness macro from the fivetran_utils dbt package.

[BUG] `collect_freshness` macro override throws warning in `dbt source snapshot-freshness` command

When running the dbt source snapshot-freshness command, we are getting the following warning:

[WARNING]: Deprecated functionality
The 'collect_freshness' macro signature has changed to return the full query
result, rather than just a table of values. See the v1.5 migration guide for
details on how to update your custom macro:
https://docs.getdbt.com/guides/migration/versions/upgrading-to-v1.5

I believe this is due to the fact that the fivetran_utils package overrides the collect_freshness dbt macro (see this dbt Slack thread as reference: https://getdbt.slack.com/archives/C03SAHKKG2Z/p1683127164993329), and that it is expecting this
{{ return(load_result('collect_freshness')) }}
instead of this
{{ return(load_result('collect_freshness').table) }}

This is not blocking us, but it is not very pleasant to get warnings in each dbt source snapshot-freshness command since it creates undesired clutter.

Am I right in assuming this is the cause of this warning? If so, are there any plans to update the macro in fivetran_utils?

I actually recreated the macro in our repo, got rid of the .table part, and no warnings were raised when I ran the dbt source snapshot-freshness command.

My current dbt versions:

  • dbt-core==1.6.5
  • dbt-snowflake==1.6.4

FEATURE - Update integration tests to support dbt version >= 1.7.0

As of dbt 1.7.0 there have been a few changes with how dbt projects install package dependencies. For example, there is now a package-lock.yml file and a dependencies.yml file that allow users to have more control over the packages being installed in their projects.

This is great for users, but not ideal for how we have setup the integration tests for this package. This package tests packages in the following way:

  1. Install all Fivetran dbt packages
  2. cd into the integration tests folder of the respective package that we are testing (i.e. dbt_packages/quickbooks/integration_tests/)
  3. Install the dependent packages within the integration tests
  4. Override the dbt_fivetran_utils version in the dbt_packages folder to be the branch which we are testing

This works for all versions of dbt <1.7.0. However, the introduction of the package-lock.yml file complicates this as the lock file ensures there is consistency with the packages being installed. Again, this is great for users, but bad for this use case. We need to update the integration tests for this package to ensure we can test all packages in the way outlined above. For the time being we are using dbt 1.6.x to test.

Update try_cast default dispatch to be consistent with macro name

The try_cast macro has an inconsistent dispatch format where the default dispatch is different from the name of the macro.

{% macro default__safe_cast(field, type) %}

This default dispatch should instead be named default__try_cast

This does not cause failures for any of our supported warehouses (because we offer individual dispatches for all other supported warehouses), but for future builds we will want to ensure the default value is properly used. This will ensure compilation succeeds and does not fail because the dispatch for the macro is not found.

Maybe? Move dbt-date and dbt-expectations dependencies here

Currently, some of our packages have dependencies on calogica's dbt-date and dbt-expectations packages. When there is a breaking change update to one of these packages, we need to update the version ranges across multiple packages' packages.yml files. For example, we have these 3 issues to increase the range of dbt-date:

and I expect more issues to come in from:

  • intercom
  • shopify

It would be more efficient if we only had to make this update in one place (ie in fivetran_utils). However, this would mean that every single package would have a dependency on these external packages, even if they don't need them. The main con of this would be that we would either need to

a. update the Does this package have dependencies? section of every single package's README to include dbt-date and dbt-expectations (and to uptick the package version ranges every time we upgrade), OR

b. rethink the Does this package have dependencies? section of the README. Perhaps this section would just link to the dependency matrix in the fivetran utils README.

So maybe it's not worth it to do this, but maybe it is...

fill_staging_columns not applying type

Hi there, I have a couple of questions around the method to generate staging models. There used to be a fair bit of automation to generate code for staging models for example: https://github.com/fivetran/dbt_fivetran_utils/blob/master/macros/staging_models_automation.sql

[FEATURE] - Support for BigQuery JSON and Redshift SUPER datatypes

Currently all our JSON macros in this package are built under the assumption that the JSON fields are strings formatted as JSON. However, it has come to our attention that there may be JSON datatypes in BigQuery and SUPER in Redshift that are not strings and could cause potential issues in the macros within this package.

Therefore, we should update this package to have the macros allow for the handling of these new datatypes.

Providing reserved keyword support for `fivetran_utils.add_passthrough_columns()`

Currently, we have trouble bringing through reserved keywords using the add_passthrough_columns macro.

This issue was surfaced from an attempt to bring through the group field in the entity source table in Netsuite2, as it treats group as if it should be used in a group by statement.

Error message:

Syntax error: Expected keyword BY but got keyword AS at [373:2]
  compiled Code at target/run/netsuite_source/models/netsuite2/stg_netsuite2__entities.sql

See this Height ticket.

Optimize Integration Tests to Only Run on Relevant Changes

Currently the integration tests pipeline is configured to run all packages when a single change is committed to the repo. That being said, we may be making a change that is not exactly relevant to all dependent Fivetran packages. As such, we are incurring a lot of costs to run the pipeline across all packages even if the code change impacts just one dependent package (or even none 😱).

Therefore, this feature proposal is requesting that the pipeline is optimized to only run for dependent packages that the change is relevant to. If a package is not impacted by the change, then we could see just a simple success and move on. However, if the dependent package is reliant on the changes then it will properly run and be tested.

Changes to dispatch in dbt v0.20

Hi team! I wanted to give you a heads up about a change coming in dbt v0.20. Let me know if you have questions, I'm more than happy to help.

Required code change

We've made a change to adapter.dispatch: Instead of passing a _get_namespaces() macro to the packages arg, you should pass the name of this package (as a string) to the macro_namespace arg, which is now the second positional argument.

In practice, this just looks like switching each macro like:

{% macro ceiling(num) -%}
{{ adapter.dispatch('ceiling', packages = fivetran_utils._get_utils_namespaces()) (num) }}
{%- endmacro %}

To:

{% macro ceiling(num) -%}

{{ adapter.dispatch('ceiling', 'fivetran_utils') (num) }}

{%- endmacro %} 

I hope this could be as simple as a Find + Replace of packages = fivetran_utils._get_utils_namespaces() with 'fivetran_utils'.

If you prefer more explicit syntax, you could specify keyword names for each argument. What's below is exactly the same as what's above:

{% macro ceiling(num) -%}

{{ adapter.dispatch(macro_name = 'ceiling', macro_namespace = 'fivetran_utils') (num) }}

{%- endmacro %} 

For the README

If a user wishes to override/shim this package, instead of defining a var named fivetran_utils_dispatch_list, they should now define a config in dbt_project.yml.

There's one crucial caveat: Your _get_utils_namespaces macro works just a little differently from everyone else's today. In order to maintain the current behavior of searching in dbt_utils and then fivetran_utils, all users of Fivetran packages will need to specify at least this much:

dispatch:
  - macro_namespace: fivetran_utils
    search_order: ['dbt_utils', 'fivetran_utils']

They could add other packages to that list, too, for instance spark_utils or <name_of_their_project> (if they wish to override/reimplement certain macros).

One thought: It doesn't look like the fivetran_utils package currently requires dbt_utils in its packages.yml today. It really should, if you want to maintain the behavior of dispatching to dbt_utils before dispatching to fivetran_utils. If the dbt_utils package isn't installed, that will raise an error.

Notes

This change is in dbt v0.19.2 as well. Both v0.19.2 and v0.20.0 have backwards compatibility for the old syntax, so there's no huge rush to upgrade. The old syntax will start raising a deprecation warning in v0.20. As soon as you do upgrade to the new syntax, you'll need to require dbt >= 0.19.2 (or just >=0.20.0, for simplicity, since you're already making compatibility changes in #69).

See also:

[FEATURE] - Union CTEs capability

Currently, we can't use the dbt_utils.union_relations macro to union CTEs. This FR is to add in a macro within our fivetran_utils package to union ctes. We can see an example of this macro in Ad Reporting V1.0.0. However, more complexity around datatypes, column orders etc may need to be enforced.

Update Union Data to properly work with identifiers when not unioning

Currently the union_data macro is equipped to work properly with identifiers when unioning schemas. You can see how this is handled in this block.

identifier=source(schema, table_identifier).identifier if var('has_defined_sources', false) else table_identifier

However, in the section of the macro where there is no unioning we do not have this logic

identifier=var(default_schema ~ '_' ~ table_identifier ~ '_' ~ 'identifier', table_identifier)) -%}

This feature request would be to update the macro so all sections may leverage the identifier variable properly.

[FEATURE] - Check for enabled variables

In Ad Reporting V1.0.0, we've added two macros to check for enabled variables:

  • get_enabled_packages
  • is_enabled

This FR is for a similar macro to be added into fivetran_utils that can serve a broader application than the Ad Reporting V1.0.0 versions. Essentially, the macro(s) should determine, based on certain variable configurations and values, whether a model should be enabled.

[FEATURE] - Add `persist_pass_through_columns` -capability for non-mapping passthrough variables

Currently, when a non-mapped passthrough/list-style variable is passed to the persist_pass_through_columns macro, no error occurs but blank fields are generated. This macro should be updated to handle either list-style or dict-style variables. The logic used in fill_pass_through_columns is an example of how this could be done.

This will allow this macro to be brought into older packages, that might not yet use this macro and therefore are expecting list-style, with less friction on the user's end as they will not have to change their variables.

`json_extract` should use double quotes on Snowflake

I'm using the Stripe package and have set project variables for fields I want to extract from metadata columns. It works on most values but isn't extracting anything when the key contains a full stop, e.g. plan.id.

I have fixed this locally by adding double quotes to the Snowflake macro:

{% macro snowflake__json_extract(string, string_path) %}

  json_extract_path_text(try_parse_json( {{string}} ), {{ "'\"" ~ string_path ~ "\"'" }} )

{% endmacro %}

Happy to make a PR but I'm not sure whether the problem applies to other databases.

Feature - Adjust Union Relations to use sources or references

Are you a current Fivetran customer?
Fivetran created PR

Describe the bug
Currently the union_data macro does not utilize sources when unioning the data. This can cause a number of issues, one being that it does not effectively test freshness (as stated in Issue #52); however, another issue is that within Fivetran Transformations for dbt Core this union feature will not effectively work, since Fivetran expects a corresponding source when orchestrating transformations. Therefore, we will want to ensure the union_data macro can be built off a source reference.

However, if we do this, all previous implementations of the union_data macro will break. We will want to find a way for the union_data macro to keep the current behavior but also leverage sources if a user defines them. This is something I could see us wanting to fold into the releases/v0.4.latest version of Fivetran Utils.

Additional Details
I was able to get a working version implemented within PR #82. However, it has not been fully evaluated and would need to be reworked in order to be scalable and implemented properly without breaking all existing implementations of the macro.

Bump up integration_tests/packages.yml versioning to ensure full package coverage

We have recently upped the latest version of our dbt_salesforce package to 1.0.0, which is outside the range of the current integration_tests/packages.yml of our latest dbt_fivetran_utils release. We will want to extend the range of all our packages.yml to account for this and bundle this in a future release.

Example bump:

  • package: fivetran/salesforce
    version: [">=0.1.0", "<1.0.0"]

to

  • package: fivetran/salesforce
    version: [">=0.1.0", "<2.0.0"]

BUG - Freshness tests fail when `union_schemas` or `union_databases` declared in Fivetran plugins

Are you a current Fivetran customer?
Pawel Ngei, Pattern Brands

Describe the bug
If you're using one of the Fivetran dbt plugins which supports multiple schemas/databases, like Shopify, Klaviyo, or (soon) Google Ads or Facebook Ads, the freshness tests defined in the _source package fail, as there is no {{ var ('klaviyo_schema', 'klaviyo') }} defined.

Steps to reproduce

  1. Set up a project with multiple schemas
  2. Use one of the _source plugins from Fivetran, like dbt_klaviyo_source.
  3. Provide schemas via union_schemas in your dbt_project.yml, with no single klaviyo_schema and no schema called just klaviyo. Let's make them just klaviyo_1 and klaviyo_2.
  4. Run dbt source snapshot-freshness or dbt source freshness.
  5. The Klaviyo plugin freshness test will fail

Expected behavior
The Klaviyo freshness plugin will test the freshness of multiple schemas about to be unioned.

Project variables configuration

vars:
  klaviyo_source:
    klaviyo_database: 'RAW'
    union_schemas: [
      'KLAVIYO_1',
      'KLAVIYO_2',
      'KLAVIYO_3',
      'KLAVIYO_4'
    ]

Package Version

packages:
  - package: fivetran/klaviyo_source
    version: 0.3.0

Warehouse

  • BigQuery
  • Redshift
  • Snowflake
  • Postgres
  • Databricks
  • Other (provide details below)

Additional context
I believe that this bug can be solved using this utils repository, will describe it in a comment below.

Please indicate the level of urgency
Will require this feature before we can call our setup "production-ready", so effectively just after Google Ads and Facebook Ads unioning PRs are merged.

Are you interested in contributing to this package?

  • Yes, I can do this and open a PR for your review.
  • Possibly, but I'm not quite sure how to do this. I'd be happy to do a live coding session with someone to get this fixed.
  • No, I'd prefer if someone else fixed this. I don't have the time and/or don't know what the root cause of the problem is.
