Comments (6)
Hey @pawelngei
I was able to think over this further and chat with @DylanBaker on the source freshness bug and how we could feasibly address the limitation in the Fivetran packages that take advantage of the union source feature.
I believe some of the proposed edits to the collect_freshness
macro and the expanding of the sources within the dbt packages is pushing dbt to a limit that I don't feel is intended to be addressed within our packages. For a larger override of the source and collect_freshness
features, I believe this is best to be brought up with the folks at dbt-labs within dbt-core. I want to ensure our packages are scalable into the future and I fear editing the collect_freshness
macro further will reach into more involved modifications that will pose more of a risk than addressing the identified issue by ourselves.
However, @DylanBaker has come up with a pretty neat workaround solution! Instead of attempting to test the freshness on the source (where this issue presents itself due to the union functionality of the package), we can create a custom test as a macro and use that test within the staging models. These tests can be called simply with dbt test
and will highlight where the freshness failed based on the interval you have assigned within the test. The additional upside of this solution is the test groups by the source_relation
. Therefore, if the test fails you may run the compiled code and see exactly which source failed. See below how I did this on my dbt_facebook_pages
datasets.
Implementing this solution would be pretty easy as well. We would add the test identified above to this package and we could then leverage the test in the stg.yml
throughout all Fivetran dbt packages that utilize the union feature.
Looking back over the original issue described, this solution will handle the concerns you raised. Albeit the solution is addressed in a bit different way, but still addresses the issue. I am curious what your thoughts are. Anything beyond this solution I will say it outside of the scope of the work our team will do to address the issue. However, I would be happy to comment on a FR within dbt-core if you decide to raise this to the dbt-labs team.
from dbt_fivetran_utils.
I see that you re-declare the dbt-core
macro collect_freshness
at https://github.com/fivetran/dbt_fivetran_utils/blob/master/macros/collect_freshness.sql . I believe the easiest way to fix it is to add the multiple-schema functionality to this macro.
Looking at the following lines:
select
{% if is_enabled %}
max({{ loaded_at_field }})
{% else %}
{{ current_timestamp() }} {% endif %} as max_loaded_at,
{{ current_timestamp() }} as snapshotted_at
{% if is_enabled %}
from {{ source }}
{% if filter %}
where {{ filter }}
{% endif %}
{% endif %}
I think we could modify this code, outputting a freshness tests for each source
declared in src_[plugin].yml
, to essentially iterate through all the schemas / databases available in union_schemas
and union_databases
variables.
Currently I lack the exact DBT / Jinja skills to write this well, but the algorithm would be similar to this:
- IF there is a
union_schemas
orunion_database
variable THEN - when generating Freshness test SQL iterate
- schema in union_schemas
- override the
{{ source }}
with the schema, even if it means just deleting everything after the last dot and concatenating - union tables by
min()
to check which one is broken - return result
The obvious disadvantage would be that it generates only a single freshness tests for multiple schemas / databases, and if one fails, the user isn't informed which one was it. Bypassing this might be possible by changing how the collect_freshness
is called in the first place, to call it once for each source, but currently I don't know if it's possible to implement it with just a macro.
The advantage: Once implemented here, it would affect all the Fivetran DBT packages and all the unioning in the future.
Some possible files to look into:
collect_freshness
definition indbt-core
: https://github.com/dbt-labs/dbt-core/blob/178f74b753aab0c40d03799b3287a13f35740248/core/dbt/include/global_project/macros/adapters/common.sql#L280-L296calculate_freshness
Python function indbt-core
: https://github.com/dbt-labs/dbt-core/blob/434f3d26789d4bed3f0057c7c7d898e9dbf6687f/core/dbt/adapters/base/impl.py#L1057-L1098freshness.py
package ofdbt-core
: https://github.com/dbt-labs/dbt-core/blob/434f3d26789d4bed3f0057c7c7d898e9dbf6687f/core/dbt/task/freshness.py
from dbt_fivetran_utils.
Hey @pawelngei thanks so much for opening this issue and being vigilant about the capability and limitations of the union_schema/database functionality within a few of our packages.
If you're using one of the Fivetran DBT plugins which supports multiple schemas / databases, like Shopify, Klaviyo, (soon) Google Ads or Facebook Ads, the freshness tests defined in _source package fail as there is no {{ var ('klaviyo_schema', 'klaviyo') }} defined.
As we discussed during office hours yesterday. It seems the resolution for this fix will be to add a[platform]_schema
variable to your rootdbt_project.yml
. However, this will only address thedbt source freshness
from not erroring and will not in fact run freshness on the other sources your are unioning.
I do see how you provide a suggestion on incorporating a for loop into the collect_freshness
macro to iterate through the sources to then run these tests across all of the sources within the union_schemas/databases
macro. However, we did point out that this would not be helpful if the source freshness failed on a single source as we would not be able to identify which source the failure occurred within. To fix this, we could add a simple {{ dbt_utils.log(message) }}
to highlight within the log what source failed.
While the above would be a dynamic fix, I think we are breaching into territory of pushing dbt to a limit it was not intended to reach (or at least is not intended for at the current point in time). I believe a more "dbt-esque" way (but more manual) would be to create a separate src.yml for each of the sources you are unioning. This way you can ensure the proper freshness tests are being applied to the relevant sources.
I tested this on locally and was able to see the freshness tests be applied to the appropriate sources once I created the new src.yml files in my root project for the relevant additional sources after the [platform]_schema
source. I would be really interested to hear @DylanBaker thoughts on this as well.
from dbt_fivetran_utils.
Hey @fivetran-joemarkiewicz !
I believe a more "dbt-esque" way (but more manual) would be to create a separate src.yml for each of the sources you are unioning. This way you can ensure the proper freshness tests are being applied to the relevant sources.
Sadly, this solution is very brittle. If you're using the plugins in your own project, you can declare one schema as an override, but if you would like to test the freshness of several unioned schemas this way, you need to essentially re-declare the whole source model - all the tables from the _source
plugin. You cannot override
one schema (from the _union
) plugin multiple times or "import" it's data shape / table list in any way.
My understanding is that a lot of your clients use those plugins not to worry about the data shape changes, so that they can always count on the end product. If they need to maintain a list of the tables locally anyway, what's the point of using your plugins instead of writing all the transformations on their side?
from dbt_fivetran_utils.
Hey @fivetran-joemarkiewicz !
For a larger override of the source and collect_freshness features, I believe this is best to be brought up with the folks at dbt-labs within dbt-core. I want to ensure our packages are scalable into the future and I fear editing the collect_freshness macro further will reach into more involved modifications that will pose more of a risk than addressing the identified issue by ourselves.
I agree with that, this change would be quite substantial. It would make sense to reach out to dbt-core
team about it, but we would need to formulate a really good proposal, as this gets more architectural. I'll think about it and let you know if I come up with anything.
However, @DylanBaker has come up with a pretty neat workaround solution! Instead of attempting to test the freshness on the source (where this issue presents itself due to the union functionality of the package), we can create a custom test as a macro and use that test within the staging models.
I really like the idea of the manual freshness tests! If I understand the DBT macros and namespaces correctly, that solution would even allow users to override these from their own project, right?
One issue: I'm afraid that this would not address the problem of the freshness tests failing if no klaviyo_schema
is defined. We would need to add that single-schema / single-db variable to every union config. If I remember correctly, during our last chat you assured us that this would not interfere with the plugin working otherwise, so it's a possible option, just a little dirty in my opinion.
Alternatively, we could remove all auto-generated freshness tests from the plugins code and replace them with the custom tests altogether.
It would be good to give the users an ability to turn the freshness tests on/off in the plugin config, in case someone doesn't need them.
from dbt_fivetran_utils.
Hi @pawelngei I wanted to post back here and let you know PR #89 actually provides a workaround to allow for the union data macro to properly connect to sources. There is a bit of additional work needed (manually creating source.yml files for the desired sources) and leveraging a has_defined_sources
variable to set to true
to take advantage of these source configs.
With the above merged. This bug has been addressed in the latest release of fivetran_utils and is available in the packages that have the union feature! Therefore, I will close this bug report. Thanks again for bringing this to our attention!
from dbt_fivetran_utils.
Related Issues (20)
- `json_extract` should use double quotes on Snowflake HOT 2
- [FEATURE] - Add `persist_pass_through_columns` -capability for non-mapping passthrough variables
- [FEATURE] - Update fill_pass_through_columns macro so pre-defining variables not required
- Update Union Data to properly work with identifiers when not unioning HOT 1
- Create a seed data helper for identifier variables HOT 1
- [FEATURE] - make drop_schemas_automation more flexible HOT 1
- [FEATURE] - explore/keep eyes on BQ new JSON functions HOT 1
- [FEATURE] - update_persist_pass_through_columns to include field aliasing
- Union data macro not working as expected in integration tests HOT 1
- [FEATURE] Add Full Coverage of Dependent Packages in Integration Testing HOT 1
- add string-to-date macro HOT 2
- Duplication check for add_pass_through_columns macro
- [FEATURE] - Union CTEs capability
- [FEATURE] - Check for enabled variables
- Update master readme to link to latest release version HOT 1
- Feature - Adjust Union Relations to use sources or references HOT 2
- upgrade dbt_utils dependency HOT 15
- Add does_table_exist macro HOT 1
- [FEATURE] - Add package-agnostic docs definitions only once (here) HOT 1
- make union_data more robust
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbt_fivetran_utils.