Code Monkey home page Code Monkey logo

Comments (9)

stumelius avatar stumelius commented on August 30, 2024 1

@dico-teberfre Thanks for reporting this, and sorry for the late reply. The profiler is using the MEDIAN function in Redshift to calculate the median measure and there are currently 3 measures (distinct_proportion, distinct_count and is_unique) that use distinct. I see two options here:

  1. We find another way to calculate the median in Redshift to work around the MEDIAN<->DISTINCT incompatibility issue
  2. We refactor the existing distinct measures in some way, maybe calculating them separately and then joining with the others (see https://stackoverflow.com/questions/67994522/using-median-aggregate-in-same-query-as-avg)

I don't currently have access to a Redshift instance so I can't test this out. Would you be interested in contributing this fix? :)

from dbt-profiler.

odikia avatar odikia commented on August 30, 2024 1

Hi @stumelius ,

Let me dig into your suggestions a bit, and see if I have others. Will start now, in fact! Nothing more needed at the moment. Will fork, and start playing. Thank you!

from dbt-profiler.

stumelius avatar stumelius commented on August 30, 2024 1

@odikia Wow, this is really thorough - amazing! 💯 I'm okay with the subquery method as it requires less changes to the code base. Also, having a third "table_name"/"cte_name"/"relation_name" argument only in the measure_median macro is okay for me. It's not consistent with the other measures but I don't see much harm in that. If we ever need to add similar subquery logic to any of the other measures to work around limitations, we'll use that as an example.

Please go ahead and open a PR :)

from dbt-profiler.

odikia avatar odikia commented on August 30, 2024

Hi @stumelius ,

I came upon this issue just now. Primary DB is Redshift. Happy to contribute towards the fix! :)

from dbt-profiler.

stumelius avatar stumelius commented on August 30, 2024

@odikia Thank you for the offer ❤️ I'd like to take you up on that. I laid out two options to resolve this in the previous comments. What are your thoughts on those? Do you have other ideas? And is there anything you need from me to get started?

from dbt-profiler.

odikia avatar odikia commented on August 30, 2024

I worked through the suggestions, and found a subquery with median to be the most performant method. In testing locally on a dataset with 7m rows, I found the subquery to be 1.41x faster than the join method.

I think joining, as described in the stackexchange conversation mentioned above, would also require a lot of refactoring to get_profile and measures, so made as few changes as possible to get around the problem in redshift.

addition to measures.sql

...
{%- macro redshift__measure_median(column_name, data_type) -%}

{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
    select percentile_cont(0.5) within group (order by {{ adapter.quote(column_name) }}) from source_data
{%- else -%}
    cast(null as {{ dbt.type_numeric() }})
{%- endif -%}

{%- endmacro -%}
....

perhaps there is a better method than using source_data at the level of measures ? The challenge with the simplest method, is that I'm using a cte's name (source_data defined in get_profiles macro) at the level of measures , and this might not be ideal. I think an alternative would be to create a third table parameter for each median macro, and refactor the call of the measure_median macro in get_profile to have the correct cte (as of v0.8.1, source_data) , included as the third parameter. This might make it easier to change a table name in a single place and keep up with those changes, rather than digging around elsewhere. On the other hand, measure_median would no longer be consistent with the other measures if it were the only one to have a third parameter. I opted for the simple change in the above jinja block for the time being!

Let me know your thoughts, @stumelius , and I'd be happy to submit a PR

from dbt-profiler.

odikia avatar odikia commented on August 30, 2024

In further testing, I also needed to change the following repeating median call in get_profiles macro to include parentheses. This shouldn't have a negative impact on other adapter types:

...
{% if "median" not in exclude_measures -%}
    ({{ dbt_profiler.measure_median(column_name, data_type) }}) as median,
{%- endif %}

from dbt-profiler.

odikia avatar odikia commented on August 30, 2024

@stumelius , pr submitted! I submitted pr 88 which didn't pass circe, as I only passed 2 parameters into the non redshift macros. I fixed it in pr 89, and resubmitted. Passed circe (postgres check it seems?) this time around and is awaiting the approval at data-mie.

from dbt-profiler.

stumelius avatar stumelius commented on August 30, 2024

Thank you for your contribution! https://github.com/data-mie/dbt-profiler/releases/tag/0.8.2

from dbt-profiler.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.