When using print_profile in Redshift with Version 0.8.1 I get the following error:

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Thank you for your contribution! <a href="https://github.com/data-mie/dbt-profiler/rel

Error when profiling a table in Redshift about dbt-profiler HOT 9 CLOSED

dico-teberfre commented on August 30, 2024

Error when profiling a table in Redshift

from dbt-profiler.

Comments (9)

stumelius commented on August 30, 2024 1

@dico-teberfre Thanks for reporting this, and sorry for the late reply. The profiler is using the MEDIAN function in Redshift to calculate the median measure and there are currently 3 measures (distinct_proportion, distinct_count and is_unique) that use distinct. I see two options here:

We find another way to calculate the median in Redshift to work around the MEDIAN<->DISTINCT incompatibility issue
We refactor the existing distinct measures in some way, maybe calculating them separately and then joining with the others (see https://stackoverflow.com/questions/67994522/using-median-aggregate-in-same-query-as-avg)

I don't currently have access to a Redshift instance so I can't test this out. Would you be interested in contributing this fix? :)

from dbt-profiler.

odikia commented on August 30, 2024 1

Hi @stumelius ,

Let me dig into your suggestions a bit, and see if I have others. Will start now, in fact! Nothing more needed at the moment. Will fork, and start playing. Thank you!

from dbt-profiler.

stumelius commented on August 30, 2024 1

@odikia Wow, this is really thorough - amazing! 💯 I'm okay with the subquery method as it requires less changes to the code base. Also, having a third "table_name"/"cte_name"/"relation_name" argument only in the measure_median macro is okay for me. It's not consistent with the other measures but I don't see much harm in that. If we ever need to add similar subquery logic to any of the other measures to work around limitations, we'll use that as an example.

Please go ahead and open a PR :)

from dbt-profiler.

odikia commented on August 30, 2024

Hi @stumelius ,

I came upon this issue just now. Primary DB is Redshift. Happy to contribute towards the fix! :)

from dbt-profiler.

stumelius commented on August 30, 2024

@odikia Thank you for the offer ❤️ I'd like to take you up on that. I laid out two options to resolve this in the previous comments. What are your thoughts on those? Do you have other ideas? And is there anything you need from me to get started?

from dbt-profiler.

odikia commented on August 30, 2024

I worked through the suggestions, and found a subquery with median to be the most performant method. In testing locally on a dataset with 7m rows, I found the subquery to be 1.41x faster than the join method.

I think joining, as described in the stackexchange conversation mentioned above, would also require a lot of refactoring to get_profile and measures, so made as few changes as possible to get around the problem in redshift.

addition to measures.sql

...
{%- macro redshift__measure_median(column_name, data_type) -%}

{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
    select percentile_cont(0.5) within group (order by {{ adapter.quote(column_name) }}) from source_data
{%- else -%}
    cast(null as {{ dbt.type_numeric() }})
{%- endif -%}

{%- endmacro -%}
....

perhaps there is a better method than using source_data at the level of measures ? The challenge with the simplest method, is that I'm using a cte's name (source_data defined in get_profiles macro) at the level of measures , and this might not be ideal. I think an alternative would be to create a third table parameter for each median macro, and refactor the call of the measure_median macro in get_profile to have the correct cte (as of v0.8.1, source_data) , included as the third parameter. This might make it easier to change a table name in a single place and keep up with those changes, rather than digging around elsewhere. On the other hand, measure_median would no longer be consistent with the other measures if it were the only one to have a third parameter. I opted for the simple change in the above jinja block for the time being!

Let me know your thoughts, @stumelius , and I'd be happy to submit a PR

from dbt-profiler.

odikia commented on August 30, 2024

In further testing, I also needed to change the following repeating median call in get_profiles macro to include parentheses. This shouldn't have a negative impact on other adapter types:

...
{% if "median" not in exclude_measures -%}
    ({{ dbt_profiler.measure_median(column_name, data_type) }}) as median,
{%- endif %}

from dbt-profiler.

odikia commented on August 30, 2024

@stumelius , pr submitted! I submitted pr 88 which didn't pass circe, as I only passed 2 parameters into the non redshift macros. I fixed it in pr 89, and resubmitted. Passed circe (postgres check it seems?) this time around and is awaiting the approval at data-mie.

from dbt-profiler.

stumelius commented on August 30, 2024

Thank you for your contribution! https://github.com/data-mie/dbt-profiler/releases/tag/0.8.2

from dbt-profiler.

Error when profiling a table in Redshift about dbt-profiler HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent