data-mie / dbt-profiler
Macros for generating dbt model data profiles
License: Apache License 2.0
Hi there,
I'm trying to include a profile result using the print_profile_docs macro, but it seems I get different output depending on the adapter.
When I run against Postgres, I get the following result:
$ POSTGRES_HOST=localhost dbt run-operation print_profile_docs --args '{"relation_name": "test_data_default"}' -t postgres
07:26:38 Running with dbt=1.5.1
07:26:38 Found 8 models, 25 tests, 0 snapshots, 0 analyses, 806 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
{% docs dbt_profiler__test_data_default %}
| column_name | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | median | std_dev_population | std_dev_sample | ... |
| -------------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------ | ------ | ---------------------- | ---------------------- | --- |
| id | integer | 5 | 1.0 | 1.0 | 5 | True | 1 | 5 | 3.0000000000000000 | 3 | 1.41421356237309500000 | 1.58113883008418970000 | ... |
| numeric_not_nullable | integer | 5 | 1.0 | 0.6 | 3 | False | 1 | 3 | 1.8000000000000000 | 2 | 0.74833147735478827712 | 0.83666002653407554798 | ... |
| numeric_nullable | integer | 5 | 0.6 | 0.4 | 2 | False | 1 | 3 | 1.6666666666666667 | 1 | 0.94280904158206336587 | 1.15470053837925150000 | ... |
| string_not_nullable | text | 5 | 1.0 | 0.6 | 3 | False | | | | | | | ... |
| string_nullable | text | 5 | 0.6 | 0.4 | 2 | False | | | | | | | ... |
| date_nullable | date | 5 | 0.6 | 0.2 | 1 | False | 2022-01-01 | 2022-01-01 | | | | | ... |
| bool_nullable | boolean | 5 | 0.8 | 0.4 | 2 | False | | | 0.6000000000000000 | | | | ... |
{% enddocs %}
But when I run against BigQuery, I get the following result:
$ dbt run-operation print_profile_docs --args '{"relation_name": "customers"}' -t dev
07:29:00 Running with dbt=1.5.0
07:29:01 Found 3 models, 9 tests, 0 snapshots, 0 analyses, 407 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
| column_name | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ---------------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id | int64 | 100 | 1.00 | 1.00 | 100 | True | 1 | 100 | 50.5000000000000100 | 28.8660700477221250 | 29.0114919758820240 | 2023-06-21 07:29:06.504541+00 |
| first_name | string | 100 | 1.00 | 0.79 | 79 | False | | | | | | 2023-06-21 07:29:06.504541+00 |
| last_name | string | 100 | 1.00 | 0.19 | 19 | False | | | | | | 2023-06-21 07:29:06.504541+00 |
| first_order_date | date | 100 | 0.62 | 0.46 | 46 | False | 2018-01-01 | 2018-04-07 | | | | 2023-06-21 07:29:06.504541+00 |
| most_recent_order_date | date | 100 | 0.62 | 0.52 | 52 | False | 2018-01-09 | 2018-04-09 | | | | 2023-06-21 07:29:06.504541+00 |
| number_of_orders | int64 | 100 | 1.00 | 0.05 | 5 | False | 0 | 5 | 0.9900000000000002 | 0.9848350115628506 | 0.9897964284119083 | 2023-06-21 07:29:06.504541+00 |
The biggest difference is that the docs block isn't output for BigQuery. Is there anything I have to do? Thanks.
I was expecting to be able to run it similarly to codegen:
{{ codegen.generate_model_yaml('dim_customers') }}
which when hitting </> Compile, yields an output
but when using
{{ dbt_profiler.print_profile('dim_customers') }}
I get
Relation "dim_customers" does not exist or not authorized.
Slack 🧵 here
Running with dbt=0.21.0
* Deprecation Warning: The "packages" argument of adapter.dispatch() has been
deprecated. Use the "macro_namespace" argument instead.
Raised during dispatch for: type_string
For more information, see:
https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch
I can't seem to get the syntax correct for using the print_profile macro on the command line with a dbt source.
I've tried the following:
dbt run-operation get_profile -t dev --args '{"relation":"source(my_source)"}'
And several other variations. Any ideas how this would work?
Hi,
My org uses MySQL as the primary database engine. As a result, I've had to extract some of the default macros from dbt-profiler and modify them to work with MySQL.
I'd like to contribute the changes back to dbt-profiler if it would help others.
I have added the suggested docblock to my schema file.
e.g.:
tables:
  - name: admin_activity
    description: |
      Admin Activity logging [UNUSED]

      `dbt-profiler` results:
      {{ doc("dbt_profiler_results__test_admin_activity") }}
However, when I try to generate the doc.md files I get the following error:
Documentation for 'source.dbt_test_schema.test.admin_activity' depends on doc 'dbt_profiler_results__test_admin_activity' which was not found
I'm using the update-relation-profile.sh script, which works nicely.
It looks like there is no way to generate the markdown files if the docblock reference is in the YAML schema file.
Versions
dbt: 0.19.1
snowflake: 0.19.1
dbt_profiler: 0.1.3
Steps To Reproduce
dbt run-operation print_profile_docs --args '{"relation_name": "shipment__fct_shipment_stop"}'
Or this
dbt run-operation print_profile_docs --args '{"relation_name": "shipment__fct_shipment_stop", "schema": "shipment"}'
I also tried several other combinations of the above arguments and with each profile macro.
Results
Encountered an error while running operation: Compilation Error in macro default__information_schema (macros\cross_db_utils.sql)
'None' has no attribute 'information_schema'
in macro information_schema (macros\cross_db_utils.sql)
called by macro get_profile (macros\get_profile.sql)
called by macro print_profile_docs (macros\print_profile_docs.sql)
called by macro default__information_schema (macros\cross_db_utils.sql)
Hi there,
While working on another issue for dbt-profiler, I compared my results to the example output in the README and found that the output is a little dated: it aligns with release 0.6.4 rather than >= 0.7.0.
Line 50 in c1c7850
Hey there,
I seem to be having issues getting correct avg values returned in Redshift.
When we try to get the avg of a binary metric, Redshift returns either 1 or 0. Ideally, these metrics should be stored as boolean values; however, I think we'd still have an issue, since measure_avg casts to 1 or 0 for boolean values instead of 1.0 or 0.0.
Related Stack Overflow thread: Redshift Avg not returning decimals
Is there some way we can do something like this so it works in Redshift?
{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
avg({{ adapter.quote(column_name) }} :: float)
{%- elif dbt_profiler.is_logical_dtype(data_type) -%}
avg(case when {{ adapter.quote(column_name) }} then 1.0 else 0.0 end)
{%- else -%}
cast(null as {{ dbt.type_numeric() }})
{%- endif -%}
Maybe create a different version of this macro to use only with the dbt-redshift adapter? I'm not sure how to do that, but I'd be happy to contribute with some guidance!
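For context on why the ::float cast in the sketch above helps: Redshift's avg over an integer column performs integer arithmetic and truncates the result. A minimal illustration of the same truncation outside Redshift (shell integer arithmetic vs. awk floating point; this mimics the behavior, it is not Redshift itself):

```shell
# avg of the binary metric [1, 0, 1] with integer arithmetic: truncates to 0
echo $(( (1 + 0 + 1) / 3 ))

# the same avg in floating point (what casting to float restores): ~0.67
awk 'BEGIN { printf "%.2f\n", (1 + 0 + 1) / 3 }'
```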
Profiling a table whose columns are integer, date, string (in this order) raises the following error:
ERROR: UNION types text and numeric cannot be matched
LINE 60: avg("int_after_date_after_string") as avg,
Appropriately casting the null default value solves it; I'm in the process of opening a PR.
Like the above: are you planning to extend dbt-profiler for the dbt-sqlserver adapter?
dbt-profiler/update-relation-profile.sh
Line 14 in 5c7002e
Current:
PROFILE=$(echo "$RUN_OPERATION_OUTPUT" | sed -n '/{% docs/,$p')
Suggested replacement:
PROFILE=$(echo "$RUN_OPERATION_OUTPUT" | awk '/{%.docs.*%}$/{flag=1} flag; /{%.enddocs.%}/{flag=0}')
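To show what the replacement changes, here is a small simulation (with made-up log lines; the docs block shape matches the print_profile_docs output elsewhere in these issues). sed's open-ended range keeps everything after the {% docs %} marker, including trailing log lines, while the awk version stops at {% enddocs %}:

```shell
# Simulated run-operation output: a docs block surrounded by dbt log lines
RUN_OPERATION_OUTPUT='12:59:05 Running with dbt=1.1.1
{% docs dbt_profiler__example %}
| column_name | data_type |
{% enddocs %}
12:59:07 Done.'

# Current sed: prints from the docs marker to end of input (trailing log included)
echo "$RUN_OPERATION_OUTPUT" | sed -n '/{% docs/,$p'

# Suggested awk: prints only the lines from {% docs %} through {% enddocs %}
echo "$RUN_OPERATION_OUTPUT" | awk '/{% docs/{flag=1} flag; /{% enddocs/{flag=0}'
```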
When using print_profile in Redshift with Version 0.8.1 I get the following error:
Using LISTAGG/PERCENTILE_CONT/MEDIAN aggregate functions with other distinct aggregate function not supported
When stepping down the versions I get a successful run with version 0.6.4.
Best,
Frederik
Move the printing logic from the print_profile_docs macro
Hello,
I use your package for profiling and I find it great! Thanks for your contribution.
I use Snowflake for the compute part, and I noticed (and tested) a potential optimization.
What is currently being done: a "column_profiles" CTE performs a union all for each column, and each branch reads from the source table (DB.SCHEMA.TABLE). When analyzing the Snowflake query profile, it performs as many table scans as there are columns; for large volumes, a small warehouse is not enough and returns a memory error.
We can optimize by adding a first CTE that reads the source table in full, then referencing that CTE for each column. This forces Snowflake to read the table only once, mount it in memory, and then reuse it.
I have metrics: the amount of I/O is divided by roughly 3 to 4, as is the number of scanned partitions in my case.
PS: Result caching doesn't explain the difference, since I disabled the session cache:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
I also restarted the warehouses between each execution (using several warehouses on different time intervals). I can't fully explain it, but I think there is some shared caching due to the union all.
On execution times there is not a big difference, but I have tests on large tables where the current query does not succeed because of the number of columns (1000) and the size of the table.
I remain at your disposal if you need more information.
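The restructuring described above can be sketched like this (a hypothetical simplification of the generated profiling query, with made-up column names; the real macro emits many more measures per column):

```sql
-- Before: every per-column branch selects from DB.SCHEMA.TABLE directly,
-- so Snowflake performs one full table scan per profiled column.
-- After: a leading CTE materializes a single scan that all branches share.
with source_data as (
    select * from DB.SCHEMA.TABLE
),
column_profiles as (
    select 'col_a' as column_name, count(*) as row_count, count(distinct col_a) as distinct_count
    from source_data
    union all
    select 'col_b', count(*), count(distinct col_b)
    from source_data
)
select * from column_profiles
```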
Configuration option for including/excluding metrics (e.g., exclude_metrics=["std_dev_population"])
It appears to be an issue in the macro redshift__select_from_information_schema_columns in getting columns/data types from views.
Update to use pg_get_cols, e.g.:
select * from pg_get_cols('stg_final_view') cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int);
Currently the profiler selects all data from the source table for profiling. While this is fine for most use cases, there are cases in which a user may want to limit the amount of data being profiled. This could be done via a where clause that lets the user scope the profiling as they choose.
Hi,
I was looking at this project and I must say: it's awesome, and something that dbt docs is currently missing.
One thing that came to mind: why isn't there an option to add the top x column values and their distribution? Is there any reason not to include this in the docs?
Like in this example, showing the top 2 values:
| Column Name | Top 1 Value | Distribution            | Top 2 Value | Distribution            |
| ----------- | ----------- | ----------------------- | ----------- | ----------------------- |
| Column 1    | Value 1 A   | 0.50 ("number"/"total") | Value 1 B   | 0.20 ("number"/"total") |
| Column 2    | Value 2 A   | 0.50 ("number"/"total") | Value 2 B   | 0.30 ("number"/"total") |
| Column 3    | Value 3 A   | 0.10 ("number"/"total") | Value 3 B   | 0.05 ("number"/"total") |
| Column 4    | Value 4 A   | 0.10 ("number"/"total") | Value 4 B   | 0.05 ("number"/"total") |
Looking forward to your thoughts!
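A hedged sketch of what such a top-N measure could compute per column (hypothetical SQL, not part of the package; col and the table name are placeholders):

```sql
-- Top 2 values of a column with their share of total rows ("number"/"total")
select
    col as top_value,
    count(*) * 1.0 / sum(count(*)) over () as distribution
from DB.SCHEMA.TABLE
group by col
order by count(*) desc
limit 2
```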
Add at least the following tests:
Is it possible to include the profiler in all models and keep the output in another table, so the results provide historical data?
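One sketch of how this could work (hedged: it assumes get_profile can be embedded in a model body and that relation accepts a ref, extrapolating from the get_profile invocation quoted earlier, so treat the exact signature as an assumption): materialize the profile as its own incremental model so each run appends a snapshot.

```sql
-- models/profile_history__dim_customers.sql (hypothetical model name)
{{ config(materialized='incremental') }}
{{ dbt_profiler.get_profile(relation=ref('dim_customers')) }}
```

The profiled_at column visible in the outputs above would then distinguish one run's rows from the next.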
As the title indicates, the use of a hash generates a value overflow. Switching to MD5 runs fine.
Running dbt run-operation print_profile_docs prints the normal dbt logging information, e.g. timestamps and the dbt version.
Is there a way to remove this, or how do I otherwise print the profile to a file?
Example of output for me
12:58:59 Running with dbt=1.1.1
12:59:06 {% docs dbt_profiler__time_to_merge %}
| column_name | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------------------- | ---------------------- | ------------------ | ------------------ | ----------------- | ----------------------------- |
| merged_at | timestamp | 351 | 1 | 1,000000000 | 351 | True | 2022-07-12 07:48:30+00 | 2022-09-13 17:09:22+00 | | | | 2022-09-16 12:59:05.495281+00 |
| time_to_merge | int64 | 351 | 1 | 0,894586895 | 314 | False | 4 | 5523425 | 218 528,3219373218 | 595 672,7009238278 | 596 523,054962641 | 2022-09-16 12:59:05.495281+00 |
12:59:06 {% enddocs %}
I expected that I could use this operation like dbt run-operation print_profile_docs <args> > some_file.md, however then I get the timestamps and other output that break the docs reference. If I want to run this in a CI workflow, copy-pasting is not an option!
Thanks for your help
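One workaround for the CI case (a sketch; the timestamp format matches the example output above, so adjust the patterns to your log format): capture the output, keep only the docs block, and strip the leading timestamps before writing the file. Simulated here with a hard-coded string instead of a real dbt invocation:

```shell
# Simulated `dbt run-operation print_profile_docs` output: dbt prefixes the
# lines it logs with HH:MM:SS timestamps
dbt_output='12:58:59 Running with dbt=1.1.1
12:59:06 {% docs dbt_profiler__time_to_merge %}
| column_name | data_type |
12:59:06 {% enddocs %}'

# Keep only the docs block, then drop the leading timestamps
echo "$dbt_output" \
  | sed -n '/{% docs/,/{% enddocs/p' \
  | sed 's/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9] //'
```

In real use, replace the hard-coded string with the captured run-operation output and append > some_file.md to the pipeline.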