
dbt-profiler's People

Contributors

datamie-simo, dbeatty10, dweaver33, ernestoongaro, jomach, mdutoo, ngocanhtt0805, odikia, owenprough-sift, stevenconnorg, stumelius, toohsk, trymzet, xtutran


dbt-profiler's Issues

docblock is not output when executing `print_profile_docs` with BigQuery

Hi there,

I'm trying to include profiling results using the print_profile_docs macro, but I get different output depending on the adapter.

When I run against Postgres, I get the following result:

$ POSTGRES_HOST=localhost dbt run-operation print_profile_docs --args '{"relation_name": "test_data_default"}' -t postgres
07:26:38  Running with dbt=1.5.1
07:26:38  Found 8 models, 25 tests, 0 snapshots, 0 analyses, 806 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
{% docs dbt_profiler__test_data_default  %}
| column_name          | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |                avg | median |     std_dev_population |         std_dev_sample | ... |
| -------------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------ | ------ | ---------------------- | ---------------------- | --- |
| id                   | integer   |         5 |                 1.0 |                 1.0 |              5 |      True | 1          | 5          | 3.0000000000000000 |      3 | 1.41421356237309500000 | 1.58113883008418970000 | ... |
| numeric_not_nullable | integer   |         5 |                 1.0 |                 0.6 |              3 |     False | 1          | 3          | 1.8000000000000000 |      2 | 0.74833147735478827712 | 0.83666002653407554798 | ... |
| numeric_nullable     | integer   |         5 |                 0.6 |                 0.4 |              2 |     False | 1          | 3          | 1.6666666666666667 |      1 | 0.94280904158206336587 | 1.15470053837925150000 | ... |
| string_not_nullable  | text      |         5 |                 1.0 |                 0.6 |              3 |     False |            |            |                    |        |                        |                        | ... |
| string_nullable      | text      |         5 |                 0.6 |                 0.4 |              2 |     False |            |            |                    |        |                        |                        | ... |
| date_nullable        | date      |         5 |                 0.6 |                 0.2 |              1 |     False | 2022-01-01 | 2022-01-01 |                    |        |                        |                        | ... |
| bool_nullable        | boolean   |         5 |                 0.8 |                 0.4 |              2 |     False |            |            | 0.6000000000000000 |        |                        |                        | ... |
{% enddocs %}

But when I run against BigQuery, I get the following result:

$ dbt run-operation print_profile_docs --args '{"relation_name": "customers"}' -t dev
07:29:00  Running with dbt=1.5.0
07:29:01  Found 3 models, 9 tests, 0 snapshots, 0 analyses, 407 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
| column_name            | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |                 avg |  std_dev_population |      std_dev_sample | profiled_at                   |
| ---------------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id            | int64     |       100 |                1.00 |                1.00 |            100 |      True | 1          | 100        | 50.5000000000000100 | 28.8660700477221250 | 29.0114919758820240 | 2023-06-21 07:29:06.504541+00 |
| first_name             | string    |       100 |                1.00 |                0.79 |             79 |     False |            |            |                     |                     |                     | 2023-06-21 07:29:06.504541+00 |
| last_name              | string    |       100 |                1.00 |                0.19 |             19 |     False |            |            |                     |                     |                     | 2023-06-21 07:29:06.504541+00 |
| first_order_date       | date      |       100 |                0.62 |                0.46 |             46 |     False | 2018-01-01 | 2018-04-07 |                     |                     |                     | 2023-06-21 07:29:06.504541+00 |
| most_recent_order_date | date      |       100 |                0.62 |                0.52 |             52 |     False | 2018-01-09 | 2018-04-09 |                     |                     |                     | 2023-06-21 07:29:06.504541+00 |
| number_of_orders       | int64     |       100 |                1.00 |                0.05 |              5 |     False | 0          | 5          |  0.9900000000000002 |  0.9848350115628506 |  0.9897964284119083 | 2023-06-21 07:29:06.504541+00 |

The biggest difference is that the `{% docs %}` / `{% enddocs %}` docblock markers aren't output. Is there anything I have to do? Thanks.

Can not run dbt-profiler interactively in dbt-cloud

I expected to be able to run it the same way as codegen:
{{ codegen.generate_model_yaml('dim_customers') }}
which, when hitting </> Compile, yields an output.
But when using
{{ dbt_profiler.print_profile('dim_customers') }}
I get:

Relation "dim_customers" does not exist or not authorized.

Slack 🧵 here

Feature request: Handling structs

Currently the macro get_profile cannot handle struct columns. I tried using the struct itself, as well as struct_col.*. It either shows the error message below or omits the column_profiles CTE from the compiled code.

[screenshot of the error message]

Using print_profile on dbt sources

I can't seem to get the syntax right for using the print_profile macro on the command line with a dbt source.

I've tried the following:

dbt run-operation get_profile -t dev --args '{"relation":"source(my_source)"}' 

And several other variations. Any ideas how this would work?
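For what it's worth, a hedged guess at a working invocation (I haven't verified these argument names against the package): the profiling macros appear to take a plain relation_name string plus optional schema/database overrides rather than a source() expression, so pointing them at a source table might look like:

dbt run-operation print_profile --args '{"relation_name": "my_source_table", "schema": "raw_schema", "database": "raw_db"}'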

I have modified some of the macros to work with MySql

Hi,

My org uses MySQL as our primary database engine. As a result, I've had to extract some of the default macros from dbt-profiler and modify them to work with MySQL.

I'd like to contribute the changes back to dbt-profiler if it would help others.

Catch-22 with docblock in schema file

I have added the suggested docblock to my schema file.

e.g.:

    tables:
      - name: admin_activity
        description: |
          Admin Activity logging [UNUSED]
          
          `dbt-profiler` results:

          {{ doc("dbt_profiler_results__test_admin_activity") }}

However, when I try to generate the doc.md files, I get the following error:

Documentation for 'source.dbt_test_schema.test.admin_activity' depends on doc 'dbt_profiler_results__test_admin_activity' which was not found

I'm using the update-relation-profile.sh script, which works nicely.

It looks like there is no way to generate the markdown files while the docblock reference is in the YAML schema file.
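One possible workaround (a sketch, not an official recommendation): commit a placeholder docblock with the expected name so the doc() reference always resolves, and let the profiling script overwrite it afterwards. Assuming the docblock lives in a .md file under models/:

{% docs dbt_profiler_results__test_admin_activity %}
Placeholder; overwritten by update-relation-profile.sh.
{% enddocs %}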

Compilation Error in macro default__information_schema (macros\cross_db_utils.sql)

Versions
dbt: 0.19.1
snowflake: 0.19.1
dbt_profiler: 0.1.3

Steps To Reproduce

dbt run-operation print_profile_docs --args '{"relation_name": "shipment__fct_shipment_stop"}'

Or this

dbt run-operation print_profile_docs --args '{"relation_name": "shipment__fct_shipment_stop", "schema": "shipment"}'

I also tried several other combinations of the above arguments, and each of the profiling macros.

Results

Encountered an error while running operation: Compilation Error in macro default__information_schema (macros\cross_db_utils.sql)
'None' has no attribute 'information_schema'

in macro information_schema (macros\cross_db_utils.sql)
called by macro get_profile (macros\get_profile.sql)
called by macro print_profile_docs (macros\print_profile_docs.sql)
called by macro default__information_schema (macros\cross_db_utils.sql)

Old version of example output, without median column

Hi there,

In working on another issue for dbt-profiler, I compared my results to the example output in the README and found that the output is a little dated: it aligns with release 0.6.4 rather than >= 0.7.0.

| column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |

Redshift `avg` returning `1` or `0`

Hey there,

I seem to be having issues getting correct avg values returned in Redshift.

When we try to get the avg of a binary metric, Redshift returns either 1 or 0. Ideally these metrics should be stored as boolean values; however, I think we'd still have an issue, since measure_avg casts boolean values to 1 or 0 instead of 1.0 or 0.0.

Related stackoverflow thread: Redshift Avg not returning decimals

Is there some way we can do something like this so it works in Redshift?


{#- numeric columns: cast to float so Redshift's integer avg() doesn't truncate -#}
{%- if dbt_profiler.is_numeric_dtype(data_type) and not dbt_profiler.is_struct_dtype(data_type) -%}
    avg({{ adapter.quote(column_name) }} :: float)
{#- boolean columns: averaging 1.0/0.0 yields the true-proportion as a decimal -#}
{%- elif dbt_profiler.is_logical_dtype(data_type) -%}
    avg(case when {{ adapter.quote(column_name) }} then 1.0 else 0.0 end)
{#- everything else: a typed null so all UNION branches agree -#}
{%- else -%}
    cast(null as {{ dbt.type_numeric() }})
{%- endif -%}

Maybe create a different version of this macro to use only with the dbt-redshift adapter? I'm not sure how to do that, but I'd be happy to contribute with some guidance!
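For reference, adapter dispatch is dbt's standard mechanism for this kind of per-adapter override. A minimal sketch, assuming dbt-profiler resolves measure_avg through adapter.dispatch (the usual pattern for cross-database package macros; the file name below is hypothetical): define a redshift__-prefixed variant in your own project and tell dbt to search your project first.

{# macros/profiler_overrides.sql in your project (hypothetical file name) #}
{% macro redshift__measure_avg(column_name, data_type) %}
{%- if dbt_profiler.is_numeric_dtype(data_type) -%}
    avg({{ adapter.quote(column_name) }}::float)
{%- elif dbt_profiler.is_logical_dtype(data_type) -%}
    avg(case when {{ adapter.quote(column_name) }} then 1.0 else 0.0 end)
{%- else -%}
    cast(null as {{ dbt.type_numeric() }})
{%- endif -%}
{% endmacro %}

Then, in dbt_project.yml, make dbt search your project before the package when dispatching:

dispatch:
  - macro_namespace: dbt_profiler
    search_order: ['my_project', 'dbt_profiler']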

Error when profiling integer after date after string columns

Profiling a table whose columns are integer, date, string (in that order) raises the following error:
ERROR: UNION types text and numeric cannot be matched
LINE 60: avg("int_after_date_after_string") as avg,

Appropriately casting the null default value solves it; I'm in the process of opening a PR.
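For context, a sketch of the shape of the fix (the exact lines in the macro may differ): the fallback branch emits a bare null whose type Postgres infers as text, so casting it explicitly makes every arm of the UNION agree.

{#- before: untyped null, inferred as text and clashing with numeric branches -#}
null as avg

{#- after: explicitly typed null unions cleanly with the numeric measures -#}
cast(null as {{ dbt.type_numeric() }}) as avg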

Error when profiling a table in Redshift

When using print_profile in Redshift with version 0.8.1, I get the following error:

Using LISTAGG/PERCENTILE_CONT/MEDIAN aggregate functions with other distinct aggregate function not supported

Stepping down through the versions, I get a successful run with version 0.6.4.

Best,
Frederik

CTE Query Optimization

Hello,

I use your package for profiling and I find it great! Thanks for your contribution.

I use Snowflake for the compute part, and I noticed (and tested) a potential optimization.

What is currently being done: a "column_profiles" CTE performs a union all over the columns, and every branch reads from the source table (DB.SCHEMA.TABLE) directly.

When analyzing the Snowflake query profile, the query performs as many table scans as there are columns. For large volumes, a small warehouse is not enough and returns a memory error.

We can optimize a little by adding a first CTE that reads the source table in full, then referencing that CTE for each column. This forces Snowflake to read the table only once, keep it in memory, and reuse it.

I have metrics: the amount of I/O is divided by 3-4, and so is the number of scanned partitions. In my case:

Before
[screenshot: Snowflake query profile before the change]

After
[screenshot: Snowflake query profile after the change]

PS: Result caching doesn't explain the difference, since I disabled the session cache:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
I also restarted the warehouses between executions (using several warehouses over different time intervals). I can't fully explain the remainder, but I suspect a shared cache effect from the union all.

On execution times there is not a big difference, but I have tests on large tables where the original query does not succeed because of the number of columns (1000) and the size of the table.

I remain at your disposal for more information.

Snippet code
[screenshot of the rewritten query]
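A minimal sketch of the proposed structure (table and column names are hypothetical):

with source_data as (
    -- read the source table once so Snowflake scans it a single time
    select * from my_db.my_schema.my_table
),
column_profiles as (
    select 'col_a' as column_name, count(distinct col_a) as distinct_count from source_data
    union all
    select 'col_b' as column_name, count(distinct col_b) as distinct_count from source_data
)
select * from column_profiles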

Include/exclude metrics

Configuration option for including/excluding metrics (e.g., exclude_metrics=["std_dev_population"])
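A hedged sketch of what such an option might look like on the command line (the exclude_metrics argument is the requested feature, not a confirmed part of the package API):

dbt run-operation print_profile --args '{"relation_name": "customers", "exclude_metrics": ["std_dev_population", "std_dev_sample"]}'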

Allow user to specify a where clause for the source table

Currently the profiler selects all data from the source table. While this is OK for most use cases, there are cases in which a user may want to limit the amount of data being profiled. This could be done via a where clause that lets the user filter the rows as they choose.
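A hedged sketch of what such an option might look like (the where_clause argument is the requested feature, not a confirmed part of the package API):

dbt run-operation print_profile --args '{"relation_name": "events", "where_clause": "created_at >= current_date - 30"}'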

Question: add TOP x column values and distribution.

Hi,

I was looking at this project and I must say: it's awesome and something that dbt docs is currently missing.

One thing that came to mind: why is there no option to add the TOP x column values and their distribution? Is there a reason not to include this in the docs?

Like in this example, showing the TOP 2:

| Column Name | Top 1 Value | Distribution            | Top 2 Value | Distribution            |
| ----------- | ----------- | ----------------------- | ----------- | ----------------------- |
| Column 1    | Value 1 A   | 0.50 ("number"/"total") | Value 1 B   | 0.20 ("number"/"total") |
| Column 2    | Value 2 A   | 0.50 ("number"/"total") | Value 2 B   | 0.30 ("number"/"total") |
| Column 3    | Value 3 A   | 0.10 ("number"/"total") | Value 3 B   | 0.05 ("number"/"total") |
| Column 4    | Value 4 A   | 0.10 ("number"/"total") | Value 4 B   | 0.05 ("number"/"total") |
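As a sketch, computing this per column could look something like the following (hypothetical table and column names, not the package's implementation):

select
    my_column as value,
    -- share of total rows: per-group count divided by the window total
    count(*) * 1.0 / sum(count(*)) over () as distribution
from my_table
group by my_column
order by count(*) desc
limit 2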

Looking forward to your thoughts!

Run profiler for all models

Is it possible to run the profiler on all models and keep the output in another table, so that the results accumulate as historical data?
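A hedged sketch of how this might work for a single model today, assuming the package's get_profile macro returns the profiling SQL for a relation, so the profile can be materialized like any other model (the file name is hypothetical):

-- models/profile__customers.sql
{{ config(materialized='incremental') }}
select * from ({{ dbt_profiler.get_profile(relation=ref('customers')) }})

With the default append behavior of incremental models, each run would add a new set of rows stamped with profiled_at, building up the history.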

Question: running dbt run-operation print_profile_docs prints log info

Running dbt run-operation print_profile_docs prints the normal dbt logging information, e.g. timestamps and the dbt version. Is there a way to suppress this, or how can I otherwise print the docs to a file?

Example output for me:

12:58:59  Running with dbt=1.1.1
12:59:06  {% docs dbt_profiler__time_to_merge  %}
| column_name   | data_type | row_count | not_null_proportion | distinct_proportion | distinct_count | is_unique | min                    | max                    |                avg | std_dev_population |    std_dev_sample | profiled_at                   |
| ------------- | --------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------------------- | ---------------------- | ------------------ | ------------------ | ----------------- | ----------------------------- |
| merged_at     | timestamp |       351 |                   1 |         1,000000000 |            351 |      True | 2022-07-12 07:48:30+00 | 2022-09-13 17:09:22+00 |                    |                    |                   | 2022-09-16 12:59:05.495281+00 |
| time_to_merge | int64     |       351 |                   1 |         0,894586895 |            314 |     False | 4                      | 5523425                | 218 528,3219373218 | 595 672,7009238278 | 596 523,054962641 | 2022-09-16 12:59:05.495281+00 |
12:59:06  {% enddocs %}

I expected that I could use this operation like dbt run-operation print_profile_docs <args> > some_file.md, but then the timestamps and other output break the docs reference. If I want to run this in a CI workflow, copy-pasting is not an option!
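A hedged workaround sketch, assuming the log lines look like the output above (each prefixed with an HH:MM:SS timestamp): strip the timestamps and the startup line before writing the file.

dbt run-operation print_profile_docs --args '{"relation_name": "time_to_merge"}' \
  | sed -E 's/^[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]+//' \
  | grep -v '^Running with dbt' > some_file.md

Newer dbt versions also have a global --quiet flag, though I haven't checked how it interacts with this macro's logging.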

Thanks for your help
