Code Monkey home page Code Monkey logo

Comments (6)

fivetran-joemarkiewicz avatar fivetran-joemarkiewicz commented on August 16, 2024 1

Hey @pawelngei 👋

I was able to think over this further and chat with @DylanBaker on the source freshness bug and how we could feasibly address the limitation in the Fivetran packages that take advantage of the union source feature.

I believe some of the proposed edits to the collect_freshness macro and the expanding of the sources within the dbt packages is pushing dbt to a limit that I don't feel is intended to be addressed within our packages. For a larger override of the source and collect_freshness features, I believe this is best to be brought up with the folks at dbt-labs within dbt-core. I want to ensure our packages are scalable into the future and I fear editing the collect_freshness macro further will reach into more involved modifications that will pose more of a risk than addressing the identified issue by ourselves.

However, @DylanBaker has come up with a pretty neat workaround solution! Instead of attempting to test the freshness on the source (where this issue presents itself due to the union functionality of the package), we can create a custom test as a macro and use that test within the staging models. These tests can be called simply with dbt test and will highlight where the freshness failed based on the interval you have assigned within the test. The additional upside of this solution is the test groups by the source_relation. Therefore, if the test fails you may run the compiled code and see exactly which source failed. See below how I did this on my dbt_facebook_pages datasets.
image

Implementing this solution would be pretty easy as well. We would add the test identified above to this package and we could then leverage the test in the stg.yml throughout all Fivetran dbt packages that utilize the union feature.

Looking back over the original issue described, this solution will handle the concerns you raised. Albeit the solution is addressed in a bit different way, but still addresses the issue. I am curious what your thoughts are. Anything beyond this solution I will say it outside of the scope of the work our team will do to address the issue. However, I would be happy to comment on a FR within dbt-core if you decide to raise this to the dbt-labs team.

from dbt_fivetran_utils.

pawelngei avatar pawelngei commented on August 16, 2024

I see that you re-declare the dbt-core macro collect_freshness at https://github.com/fivetran/dbt_fivetran_utils/blob/master/macros/collect_freshness.sql . I believe the easiest way to fix it is to add the multiple-schema functionality to this macro.

Looking at the following lines:

    select
      {% if is_enabled %}
      max({{ loaded_at_field }})
      {% else %} 
      {{ current_timestamp() }} {% endif %} as max_loaded_at,
      {{ current_timestamp() }} as snapshotted_at

    {% if is_enabled %}
    from {{ source }}
      {% if filter %}
      where {{ filter }}
      {% endif %}
    {% endif %}

I think we could modify this code, outputting a freshness tests for each source declared in src_[plugin].yml, to essentially iterate through all the schemas / databases available in union_schemas and union_databases variables.

Currently I lack the exact DBT / Jinja skills to write this well, but the algorithm would be similar to this:

  1. IF there is a union_schemas or union_database variable THEN
  2. when generating Freshness test SQL iterate
  3. schema in union_schemas
  4. override the {{ source }} with the schema, even if it means just deleting everything after the last dot and concatenating
  5. union tables by min() to check which one is broken
  6. return result

The obvious disadvantage would be that it generates only a single freshness tests for multiple schemas / databases, and if one fails, the user isn't informed which one was it. Bypassing this might be possible by changing how the collect_freshness is called in the first place, to call it once for each source, but currently I don't know if it's possible to implement it with just a macro.

The advantage: Once implemented here, it would affect all the Fivetran DBT packages and all the unioning in the future.

Some possible files to look into:

from dbt_fivetran_utils.

fivetran-joemarkiewicz avatar fivetran-joemarkiewicz commented on August 16, 2024

Hey @pawelngei thanks so much for opening this issue and being vigilant about the capability and limitations of the union_schema/database functionality within a few of our packages.

If you're using one of the Fivetran DBT plugins which supports multiple schemas / databases, like Shopify, Klaviyo, (soon) Google Ads or Facebook Ads, the freshness tests defined in _source package fail as there is no {{ var ('klaviyo_schema', 'klaviyo') }} defined.
As we discussed during office hours yesterday. It seems the resolution for this fix will be to add a [platform]_schema variable to your root dbt_project.yml. However, this will only address the dbt source freshness from not erroring and will not in fact run freshness on the other sources your are unioning.

I do see how you provide a suggestion on incorporating a for loop into the collect_freshness macro to iterate through the sources to then run these tests across all of the sources within the union_schemas/databases macro. However, we did point out that this would not be helpful if the source freshness failed on a single source as we would not be able to identify which source the failure occurred within. To fix this, we could add a simple {{ dbt_utils.log(message) }} to highlight within the log what source failed.

While the above would be a dynamic fix, I think we are breaching into territory of pushing dbt to a limit it was not intended to reach (or at least is not intended for at the current point in time). I believe a more "dbt-esque" way (but more manual) would be to create a separate src.yml for each of the sources you are unioning. This way you can ensure the proper freshness tests are being applied to the relevant sources.

I tested this on locally and was able to see the freshness tests be applied to the appropriate sources once I created the new src.yml files in my root project for the relevant additional sources after the [platform]_schema source. I would be really interested to hear @DylanBaker thoughts on this as well.

from dbt_fivetran_utils.

pawelngei avatar pawelngei commented on August 16, 2024

Hey @fivetran-joemarkiewicz !

I believe a more "dbt-esque" way (but more manual) would be to create a separate src.yml for each of the sources you are unioning. This way you can ensure the proper freshness tests are being applied to the relevant sources.

Sadly, this solution is very brittle. If you're using the plugins in your own project, you can declare one schema as an override, but if you would like to test the freshness of several unioned schemas this way, you need to essentially re-declare the whole source model - all the tables from the _source plugin. You cannot override one schema (from the _union) plugin multiple times or "import" it's data shape / table list in any way.

My understanding is that a lot of your clients use those plugins not to worry about the data shape changes, so that they can always count on the end product. If they need to maintain a list of the tables locally anyway, what's the point of using your plugins instead of writing all the transformations on their side?

from dbt_fivetran_utils.

pawelngei avatar pawelngei commented on August 16, 2024

Hey @fivetran-joemarkiewicz !

For a larger override of the source and collect_freshness features, I believe this is best to be brought up with the folks at dbt-labs within dbt-core. I want to ensure our packages are scalable into the future and I fear editing the collect_freshness macro further will reach into more involved modifications that will pose more of a risk than addressing the identified issue by ourselves.

I agree with that, this change would be quite substantial. It would make sense to reach out to dbt-core team about it, but we would need to formulate a really good proposal, as this gets more architectural. I'll think about it and let you know if I come up with anything.

However, @DylanBaker has come up with a pretty neat workaround solution! Instead of attempting to test the freshness on the source (where this issue presents itself due to the union functionality of the package), we can create a custom test as a macro and use that test within the staging models.

I really like the idea of the manual freshness tests! If I understand the DBT macros and namespaces correctly, that solution would even allow users to override these from their own project, right?

One issue: I'm afraid that this would not address the problem of the freshness tests failing if no klaviyo_schema is defined. We would need to add that single-schema / single-db variable to every union config. If I remember correctly, during our last chat you assured us that this would not interfere with the plugin working otherwise, so it's a possible option, just a little dirty in my opinion.

Alternatively, we could remove all auto-generated freshness tests from the plugins code and replace them with the custom tests altogether.

It would be good to give the users an ability to turn the freshness tests on/off in the plugin config, in case someone doesn't need them.

from dbt_fivetran_utils.

fivetran-joemarkiewicz avatar fivetran-joemarkiewicz commented on August 16, 2024

Hi @pawelngei I wanted to post back here and let you know PR #89 actually provides a workaround to allow for the union data macro to properly connect to sources. There is a bit of additional work needed (manually creating source.yml files for the desired sources) and leveraging a has_defined_sources variable to set to true to take advantage of these source configs.

With the above merged. This bug has been addressed in the latest release of fivetran_utils and is available in the packages that have the union feature! Therefore, I will close this bug report. Thanks again for bringing this to our attention!

from dbt_fivetran_utils.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.