data-mie / dbt-profiler
Macros for generating dbt model data profiles
License: Apache License 2.0
Hi there,
I'm trying to include a profile result using the print_profile_docs macro, but it seems I get different output depending on the adapter.
When I run against Postgres, I get the following result:
$ POSTGRES_HOST=localhost dbt run-operation print_profile_docs --args '{"relation_name": "test_data_default"}' -t postgres
07:26:38 Running with dbt=1.5.1
07:26:38 Found 8 models, 25 tests, 0 snapshots, 0 analyses, 806 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
{% docs dbt_profiler__test_data_default %}
| column_name | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | median | std_dev_population | std_dev_sample | ... |
| -------------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------ | ------ | ---------------------- | ---------------------- | --- |
| id | integer | 5 | 1.0 | 1.0 | 5 | True | 1 | 5 | 3.0000000000000000 | 3 | 1.41421356237309500000 | 1.58113883008418970000 | ... |
| numeric_not_nullable | integer | 5 | 1.0 | 0.6 | 3 | False | 1 | 3 | 1.8000000000000000 | 2 | 0.74833147735478827712 | 0.83666002653407554798 | ... |
| numeric_nullable | integer | 5 | 0.6 | 0.4 | 2 | False | 1 | 3 | 1.6666666666666667 | 1 | 0.94280904158206336587 | 1.15470053837925150000 | ... |
| string_not_nullable | text | 5 | 1.0 | 0.6 | 3 | False | | | | | | | ... |
| string_nullable | text | 5 | 0.6 | 0.4 | 2 | False | | | | | | | ... |
| date_nullable | date | 5 | 0.6 | 0.2 | 1 | False | 2022-01-01 | 2022-01-01 | | | | | ... |
| bool_nullable | boolean | 5 | 0.8 | 0.4 | 2 | False | | | 0.6000000000000000 | | | | ... |
{% enddocs %}
But when I run against BigQuery, I get the following result:
$ dbt run-operation print_profile_docs --args '{"relation_name": "customers"}' -t dev
07:29:00 Running with dbt=1.5.0
07:29:01 Found 3 models, 9 tests, 0 snapshots, 0 analyses, 407 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
| column_name | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ---------------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id | int64 | 100 | 1.00 | 1.00 | 100 | True | 1 | 100 | 50.5000000000000100 | 28.8660700477221250 | 29.0114919758820240 | 2023-06-21 07:29:06.504541+00 |
| first_name | string | 100 | 1.00 | 0.79 | 79 | False | | | | | | 2023-06-21 07:29:06.504541+00 |
| last_name | string | 100 | 1.00 | 0.19 | 19 | False | | | | | | 2023-06-21 07:29:06.504541+00 |
| first_order_date | date | 100 | 0.62 | 0.46 | 46 | False | 2018-01-01 | 2018-04-07 | | | | 2023-06-21 07:29:06.504541+00 |
| most_recent_order_date | date | 100 | 0.62 | 0.52 | 52 | False | 2018-01-09 | 2018-04-09 | | | | 2023-06-21 07:29:06.504541+00 |
| number_of_orders | int64 | 100 | 1.00 | 0.05 | 5 | False | 0 | 5 | 0.9900000000000002 | 0.9848350115628506 | 0.9897964284119083 | 2023-06-21 07:29:06.504541+00 |
The biggest difference is that the docs block isn't output for BigQuery. Is there anything I have to do? Thanks.
I was expecting to be able to run it similarly to codegen:
{{ codegen.generate_model_yaml('dim_customers') }}
which when hitting </> Compile, yields an output
but when using
{{ dbt_profiler.print_profile('dim_customers') }}
I get
Relation "dim_customers" does not exist or not authorized.
Slack 🧵 here
Running with dbt=0.21.0
* Deprecation Warning: The "packages" argument of adapter.dispatch() has been
deprecated. Use the "macro_namespace" argument instead.
Raised during dispatch for: type_string
For more information, see:
https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch
I can't seem to get the syntax correct for using the print_profile macro on the command line with a dbt source.
I've tried the following:
dbt run-operation get_profile -t dev --args '{"relation":"source(my_source)"}'
And several other variations. Any ideas how this would work?
Hi,
My org uses MySQL as the primary database engine. As a result, I've had to extract some of the default macros from dbt-profiler and modify them to work with MySQL.
I'd like to contribute the changes back to dbt-profiler if it would help others.
I have added the suggested docblock to my schema file.
e.g.:
tables:
  - name: admin_activity
    description: |
      Admin Activity logging [UNUSED]

      `dbt-profiler` results:
      {{ doc("dbt_profiler_results__test_admin_activity") }}
However, when I try to generate the doc.md files I get the following error:
Documentation for 'source.dbt_test_schema.test.admin_activity' depends on doc 'dbt_profiler_results__test_admin_activity' which was not found
I'm using the update-relation-profile.sh script, which works nicely.
It looks like there is no way to generate the markdown files if the docblock reference is in the YAML schema file.
Versions
dbt: 0.19.1
snowflake: 0.19.1
dbt_profiler: 0.1.3
Steps To Reproduce
dbt run-operation print_profile_docs --args '{"relation_name": "shipment__fct_shipment_stop"}'
Or this
dbt run-operation print_profile_docs --args '{"relation_name": "shipment__fct_shipment_stop", "schema": "shipment"}'
I also tried several other combinations of the above arguments and with each profile macro.
Results
Encountered an error while running operation: Compilation Error in macro default__information_schema (macros\cross_db_utils.sql)
'None' has no attribute 'information_schema'
in macro information_schema (macros\cross_db_utils.sql)
called by macro get_profile (macros\get_profile.sql)
called by macro print_profile_docs (macros\print_profile_docs.sql)
called by macro default__information_schema (macros\cross_db_utils.sql)
Hi there,
While working on another issue for dbt-profiler, I compared my results to the example output in the README and found that the output is a little dated: it aligns with release 0.6.4 rather than >= 0.7.0.
Line 50 in c1c7850
Hey there,
I seem to be having issues getting correct avg values returned in Redshift.
When we try to get the avg of a binary metric, Redshift returns either 1 or 0. Ideally, these metrics should be stored as boolean values; however, I think we'd still have an issue, since measure_avg casts to 1 or 0 for boolean values instead of 1.0 or 0.0.
Related Stack Overflow thread: Redshift Avg not returning decimals
Is there some way we can do something like this so it works in Redshift?
{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
avg({{ adapter.quote(column_name) }} :: float)
{%- elif dbt_profiler.is_logical_dtype(data_type) -%}
avg(case when {{ adapter.quote(column_name) }} then 1.0 else 0.0 end)
{%- else -%}
cast(null as {{ dbt.type_numeric() }})
{%- endif -%}
Maybe create a different version of this macro to use only with the dbt-redshift adapter? I'm not sure how to do that, but I'd be happy to contribute with some guidance!
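For context on why the ::float cast in the sketch above helps: Redshift's avg over an integer column performs integer arithmetic and truncates the result. A minimal illustration of the same truncation outside Redshift (shell integer arithmetic vs. awk floating point; this mimics the behavior, it is not Redshift itself):

```shell
# avg of the binary metric [1, 0, 1] with integer arithmetic: truncates to 0
echo $(( (1 + 0 + 1) / 3 ))

# the same avg in floating point (what casting to float restores): ~0.67
awk 'BEGIN { printf "%.2f\n", (1 + 0 + 1) / 3 }'
```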
Profiling a table whose columns are integer, date, string (in this order) raises the following error:
ERROR: UNION types text and numeric cannot be matched
LINE 60: avg("int_after_date_after_string") as avg,
Appropriately casting the null default value solves it; I'm in the process of opening a PR.
Like the above: are you planning to extend dbt-profiler for the dbt-sqlserver adapter?
dbt-profiler/update-relation-profile.sh
Line 14 in 5c7002e
Current:
PROFILE=$(echo "$RUN_OPERATION_OUTPUT" | sed -n '/{% docs/,$p')
Suggested replacement:
PROFILE=$(echo "$RUN_OPERATION_OUTPUT" | awk '/{%.docs.*%}$/{flag=1} flag; /{%.enddocs.%}/{flag=0}')
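To show what the replacement changes, here is a small simulation (with made-up log lines; the docs block shape matches the print_profile_docs output elsewhere in these issues). sed's open-ended range keeps everything after the {% docs %} marker, including trailing log lines, while the awk version stops at {% enddocs %}:

```shell
# Simulated run-operation output: a docs block surrounded by dbt log lines
RUN_OPERATION_OUTPUT='12:59:05 Running with dbt=1.1.1
{% docs dbt_profiler__example %}
| column_name | data_type |
{% enddocs %}
12:59:07 Done.'

# Current sed: prints from the docs marker to end of input (trailing log included)
echo "$RUN_OPERATION_OUTPUT" | sed -n '/{% docs/,$p'

# Suggested awk: prints only the lines from {% docs %} through {% enddocs %}
echo "$RUN_OPERATION_OUTPUT" | awk '/{% docs/{flag=1} flag; /{% enddocs/{flag=0}'
```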
When using print_profile in Redshift with Version 0.8.1 I get the following error:
Using LISTAGG/PERCENTILE_CONT/MEDIAN aggregate functions with other distinct aggregate function not supported
When stepping down the versions I get a successful run with version 0.6.4.
Best,
Frederik
Move the printing logic from the print_profile_docs macro
Hello,
I use your package for profiling and I find it great! Thanks for your contribution.
I use Snowflake for the compute part, and I noticed (and tested) a potential optimization.
What is currently being done: a "column_profiles" CTE performs a union all for each column, and each branch reads from the source table (DB.SCHEMA.TABLE). When analyzing the Snowflake query profile, it performs as many table scans as there are columns; for large volumes, a small warehouse is not enough and returns a memory error.
We can optimize by adding a first CTE that reads the source table in full, then referencing that CTE for each column. This forces Snowflake to read the table only once, mount it in memory, and then reuse it.
I have metrics: the amount of I/O is divided by roughly 3 to 4, as is the number of scanned partitions in my case.
PS: Result caching doesn't explain the difference, since I disabled the session cache:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
I also restarted the warehouses between each execution (using several warehouses on different time intervals). I can't fully explain it, but I think there is some shared caching due to the union all.
On execution times there is not a big difference, but I have tests on large tables where the current query does not succeed because of the number of columns (1000) and the size of the table.
I remain at your disposal if you need more information.
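The restructuring described above can be sketched like this (a hypothetical simplification of the generated profiling query, with made-up column names; the real macro emits many more measures per column):

```sql
-- Before: every per-column branch selects from DB.SCHEMA.TABLE directly,
-- so Snowflake performs one full table scan per profiled column.
-- After: a leading CTE materializes a single scan that all branches share.
with source_data as (
    select * from DB.SCHEMA.TABLE
),
column_profiles as (
    select 'col_a' as column_name, count(*) as row_count, count(distinct col_a) as distinct_count
    from source_data
    union all
    select 'col_b', count(*), count(distinct col_b)
    from source_data
)
select * from column_profiles
```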
Configuration option for including/excluding metrics (e.g., exclude_metrics=["std_dev_population"])
It appears to be an issue in the macro redshift__select_from_information_schema_columns in getting columns/data types from views.
Update to use pg_get_cols, e.g.:
select * from pg_get_cols('stg_final_view') cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int);
Currently the profiler selects all data from the source table for profiling. While this is fine for most use cases, there are cases in which a user may want to limit the amount of data being profiled. This could be done via a where clause that lets the user scope the profiling as they choose.
Hi,
I was looking at this project and I must say: it's awesome, and something that dbt docs is currently missing.
One thing that came to mind: why isn't there an option to add the top x column values and their distribution? Is there any reason not to include this in the docs?
Like in this example, showing the top 2 values:
| Column Name | Top 1 Value | Distribution            | Top 2 Value | Distribution            |
| ----------- | ----------- | ----------------------- | ----------- | ----------------------- |
| Column 1    | Value 1 A   | 0.50 ("number"/"total") | Value 1 B   | 0.20 ("number"/"total") |
| Column 2    | Value 2 A   | 0.50 ("number"/"total") | Value 2 B   | 0.30 ("number"/"total") |
| Column 3    | Value 3 A   | 0.10 ("number"/"total") | Value 3 B   | 0.05 ("number"/"total") |
| Column 4    | Value 4 A   | 0.10 ("number"/"total") | Value 4 B   | 0.05 ("number"/"total") |
Looking forward to your thoughts!
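A hedged sketch of what such a top-N measure could compute per column (hypothetical SQL, not part of the package; col and the table name are placeholders):

```sql
-- Top 2 values of a column with their share of total rows ("number"/"total")
select
    col as top_value,
    count(*) * 1.0 / sum(count(*)) over () as distribution
from DB.SCHEMA.TABLE
group by col
order by count(*) desc
limit 2
```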
Add at least the following tests:
Is it possible to include the profiler in all models and keep the output in another table, so the results provide historical data?
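One sketch of how this could work (hedged: it assumes get_profile can be embedded in a model body and that relation accepts a ref, extrapolating from the get_profile invocation quoted earlier, so treat the exact signature as an assumption): materialize the profile as its own incremental model so each run appends a snapshot.

```sql
-- models/profile_history__dim_customers.sql (hypothetical model name)
{{ config(materialized='incremental') }}
{{ dbt_profiler.get_profile(relation=ref('dim_customers')) }}
```

The profiled_at column visible in the outputs above would then distinguish one run's rows from the next.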
As the title indicates, the use of a hash generates a value overflow. Switching to MD5 runs fine.
Running dbt run-operation print_profile_docs prints the normal dbt logging information, e.g. timestamps and the dbt version.
Is there a way to remove this, or how do I otherwise print the profile to a file?
Example of output for me
12:58:59 Running with dbt=1.1.1
12:59:06 {% docs dbt_profiler__time_to_merge %}
| column_name | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------------------- | ---------------------- | ------------------ | ------------------ | ----------------- | ----------------------------- |
| merged_at | timestamp | 351 | 1 | 1,000000000 | 351 | True | 2022-07-12 07:48:30+00 | 2022-09-13 17:09:22+00 | | | | 2022-09-16 12:59:05.495281+00 |
| time_to_merge | int64 | 351 | 1 | 0,894586895 | 314 | False | 4 | 5523425 | 218 528,3219373218 | 595 672,7009238278 | 596 523,054962641 | 2022-09-16 12:59:05.495281+00 |
12:59:06 {% enddocs %}
I expected that I could use this operation like dbt run-operation print_profile_docs <args> > some_file.md, however then I get the timestamps and other output that break the docs reference. If I want to run this in a CI workflow, copy-pasting is not an option!
Thanks for your help
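One workaround for the CI case (a sketch; the timestamp format matches the example output above, so adjust the patterns to your log format): capture the output, keep only the docs block, and strip the leading timestamps before writing the file. Simulated here with a hard-coded string instead of a real dbt invocation:

```shell
# Simulated `dbt run-operation print_profile_docs` output: dbt prefixes the
# lines it logs with HH:MM:SS timestamps
dbt_output='12:58:59 Running with dbt=1.1.1
12:59:06 {% docs dbt_profiler__time_to_merge %}
| column_name | data_type |
12:59:06 {% enddocs %}'

# Keep only the docs block, then drop the leading timestamps
echo "$dbt_output" \
  | sed -n '/{% docs/,/{% enddocs/p' \
  | sed 's/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9] //'
```

In real use, replace the hard-coded string with the captured run-operation output and append > some_file.md to the pipeline.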