dbt-meshify's Introduction

dbt-meshify

EXPERIMENTAL

maintained with ❤️ by dbt practitioners for dbt practitioners

Click here for full package documentation

Overview

dbt-meshify is a CLI tool that automates the creation of model governance and cross-project lineage features introduced in dbt-core v1.5 and v1.6. This package will leverage your dbt project metadata to create and/or edit the files in your project to properly configure the models in your project with these features.

These features include:

  1. Groups - group your models into logical sets.
  2. Contracts - add model contracts to your models to ensure consistent data shape.
  3. Access - control the access level of models within groups
  4. Versions - create and increment versions of particular models.
  5. Project dependencies - split a monolithic dbt project into component projects, or connect multiple pre-existing dbt projects using cross-project ref.

Installation

To install dbt-meshify, run:

pip install dbt-meshify

To upgrade dbt-meshify, run:

pip install --upgrade dbt-meshify

dbt-meshify's People

Contributors

b-per, dave-connors-3, davidbloss, graciegoheen, jtcohen6, nicholasyager

dbt-meshify's Issues

remove public model workaround

1.5.2 will include the new access selection syntax, and will be released next week -- once it's out, we should update the dependencies and remove the workaround logic for selecting public models introduced in #47

update docs to reflect multiselect behavior of select

Since `select` now allows multiple arguments, we cannot have `--select` before the `project_name` argument.
(dbt-meshify-py3.11) > $ poetry run dbt-meshify split --select "+orders" revenue
Usage: dbt-meshify split [OPTIONS] PROJECT_NAME
Try 'dbt-meshify split --help' for help.

Error: Missing argument 'PROJECT_NAME'.

Instead, we need to order arguments/options specifically:

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split revenue --select "+orders"

I don't think this is a blocker per se. At the very least, documentation should be refined in a follow-up.

Originally posted by @nicholasyager in #63 (comment)

enhance logging UX

Describe the feature

From review of #73, we have some room for improvement for logging UX

  1. more specific error types
  2. UX for setting log levels

let's improve it!

Who will this benefit?

users, log enthusiasts

Are you interested in contributing this feature?

sure

Create base test suite

the code in /tests/merge_source_metadata/test_merge_source_metadata.py doesn't run!

we should design some pytest infrastructure to test all the features of the package

duplicate versions defined after using `--prerelease` flag for version operation

Describe the bug

When using the add-version operation with the --prerelease flag, which skips the latest_version increment but still creates a file, the latest version and the defined versions are intentionally out of sync. Subsequent invocations of the add-version command after having used the --prerelease flag result in duplicated version numbers in yml.

Steps to reproduce

  1. invoke dbt-meshify operation add-version -s my_model

yml result:

models:
  - name: my_model
    latest_version: 1
    versions:
      - v: 1

  2. invoke dbt-meshify operation add-version -s my_model --prerelease

yml result:

models:
  - name: my_model
    latest_version: 1
    versions:
      - v: 1
      - v: 2

no error

  3. invoke dbt-meshify operation add-version -s my_model

yml result:

models: 
  - name: my_model
    latest_version: 2
    versions:
      - v: 1
      - v: 2
      - v: 2

duplicate version entry

Expected results

The add version command should

  1. (no prerelease flag) increment latest_version and create next version + file based on defined yml versions
  2. (prerelease flag) not increment latest_version and still create next version + file based on defined yml versions

no parse errors from dbt
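
The expected behaviour above can be sketched as a small pure function. This is only an illustration, not dbt-meshify's actual implementation: `plan_add_version` is a hypothetical helper, and the dictionary shape is assumed from the yml shown in the reproduction steps. The key idea is that the next version number comes from the versions already defined, so an earlier prerelease cannot produce a duplicate.

```python
from typing import Any, Dict


def plan_add_version(model: Dict[str, Any], prerelease: bool = False) -> Dict[str, Any]:
    """Return a copy of a model yml entry with the next version appended.

    Hypothetical helper: the new version is derived from the versions that
    are already defined rather than from latest_version.
    """
    existing = list(model.get("versions", []))
    defined = [entry["v"] for entry in existing]
    next_version = max(defined, default=0) + 1

    updated = dict(model)
    updated["versions"] = existing + [{"v": next_version}]
    # Only a regular (non-prerelease) invocation increments latest_version.
    if not prerelease:
        updated["latest_version"] = model.get("latest_version", 0) + 1
    return updated
```

Replaying the three reproduction steps through this function gives versions 1, 2 and 3 with latest_version 2, so no version is defined twice.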

Actual results

dupe versions and a parse error to boot

Screenshots and log output

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt debug:

<output goes here>

The output of dbt --version:

<output goes here>

Additional context

Are you interested in contributing the fix?

better exception handling for dbtRunner

I think something like this would work:
    def get_subproject_resources(self, subproject_selector: str) -> List[str]:
        ls_results = self.dbt_operation(["--log-level", "none", "ls", "-s", subproject_selector])

        if not ls_results.success:
            raise ls_results.exception

        return ls_results.result

Given that this would be common code, there's a good argument for having exception handling performed within the dbt_operation method.
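
A rough sketch of what centralising that check could look like. `RunnerResult` is a stand-in that only mirrors the three attributes used in the snippet above (`success`, `exception`, `result`), `DbtOperationError` is a hypothetical error type, and `invoke` is passed in as a callable so the example runs without dbt installed. None of this is the project's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class RunnerResult:
    """Stand-in for the result object used above (success/exception/result)."""

    success: bool
    result: Any = None
    exception: Optional[BaseException] = None


class DbtOperationError(RuntimeError):
    """Hypothetical error for a failed invocation that carries no exception."""


def dbt_operation(invoke: Callable[[List[str]], RunnerResult], args: List[str]) -> Any:
    """Run a dbt command and raise on failure, so callers only handle results."""
    outcome = invoke(args)
    if outcome.success:
        return outcome.result
    if outcome.exception is not None:
        raise outcome.exception
    raise DbtOperationError(f"dbt {' '.join(args)} failed")


def get_subproject_resources(
    invoke: Callable[[List[str]], RunnerResult], subproject_selector: str
) -> List[str]:
    """Same call as above, with the success check handled by dbt_operation."""
    return dbt_operation(invoke, ["--log-level", "none", "ls", "-s", subproject_selector])
```

With the check in one place, every caller gets the same failure behaviour and the per-method `if not ls_results.success` blocks disappear.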

Originally posted by @nicholasyager in #5 (comment)

Running the command when no `dbt_project.yml` is available crashes

Describe the bug

Runtime Error
  No dbt_project.yml found at expected path /mypath/dbt-meshify/dbt_project.yml
  Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml
  
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mypath/cli.py", line 74, in wrapper_decorator
    return func(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/mypath/main.py", line 237, in group
    ctx.forward(create_group)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 781, in forward
    return __self.invoke(__cmd, *args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mypath/cli.py", line 74, in wrapper_decorator
    return func(*args, **kwargs)
  File "/mypath/main.py", line 182, in create_group
    project = DbtProject.from_directory(path)
  File "/mypath/dbt_projects.py", line 172, in from_directory
    manifest=dbt.parse(directory),
  File "/mypath/dbt.py", line 28, in parse
    return self.invoke(directory, ["--quiet", "parse"])
  File "/mypath/dbt.py", line 24, in invoke
    raise result.exception
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/cli/requires.py", line 86, in wrapper
    result, success = func(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/cli/requires.py", line 71, in wrapper
    return func(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/cli/requires.py", line 139, in wrapper
    profile = load_profile(flags.PROJECT_DIR, flags.VARS, flags.PROFILE, flags.TARGET, threads)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/config/runtime.py", line 65, in load_profile
    raw_project = load_raw_project(project_root)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/config/project.py", line 170, in load_raw_project
    raise DbtProjectError(MISSING_DBT_PROJECT_ERROR.format(path=project_yaml_filepath))
dbt.exceptions.DbtProjectError: Runtime Error
  No dbt_project.yml found at expected path /mypath/dbt-meshify/dbt_project.yml
  Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml

Steps to reproduce

Run poetry run dbt-meshify group finance --owner-name "Monopoly Man" -s +tag:finance where there is no dbt project

Expected results

An error message is raised but doesn't crash the program
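
One possible shape for this, sketched with the standard library only: check for the project file before handing off to dbt, and turn the failure into a short message plus a non-zero exit code. `MissingProjectError`, `require_dbt_project` and `main` are hypothetical names for illustration, not part of dbt-meshify.

```python
from pathlib import Path


class MissingProjectError(Exception):
    """Hypothetical error type for a directory that is not a dbt project."""


def require_dbt_project(directory: Path) -> Path:
    """Return the path to dbt_project.yml, or raise a short, readable error."""
    project_file = Path(directory) / "dbt_project.yml"
    if not project_file.is_file():
        raise MissingProjectError(
            f"No dbt_project.yml found at {project_file}. "
            "Run this command from the root of a dbt project."
        )
    return project_file


def main(directory: str) -> int:
    """CLI-style entry point: print the message and return an exit code."""
    try:
        require_dbt_project(Path(directory))
    except MissingProjectError as error:
        print(f"Error: {error}")
        return 1
    return 0
```

The user sees one line of explanation instead of the full traceback, and the process still exits with a failure status.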

account for `target` when `connect`ing via the source hack

When users declare sources in a downstream project to refer to models in an upstream project, most often the sources are hard coded to the production environment outputs of the upstream project.

For example if fct_orders in project_a is built into the analytics schema when running in production, project_b might have a source that looks like this:

sources:
  - name: src_proj_a
    database: project_a_prod_db
    schema: analytics
    tables:
      - name: fct_orders

The current method of interacting with LocalDbtProjects will simply compile each project with its default target. If both local projects default to dev, project_a.fct_orders may have metadata that is different than the hardcoded source metadata, causing us to miss the connection.

let's think about how to resolve this!

YML dumping formatting

Right now, the yml dumping logic does not follow our YML style guide.

sample dictionary:

{
    "my_items": [
        {
            "name": "item1",
            "price": 100
        },
        {
            "name": "item2",
            "price": 200
        },
        {
            "name": "item3",
            "price": 300
        }
    ]
}

current yml output:

my_items:
- name: item1
  price: 100
- name: item2
  price: 200
- name: item3
  price: 300

The output should be indented for arrays to more accurately represent the hierarchy:

my_items:
  - name: item1
    price: 100
  - name: item2
    price: 200
  - name: item3
    price: 300

May require some investigation into pyyaml options (I tried indent and width to no avail) or a new package altogether

create strategy for cloning/reusing macros

how do we handle macros that are in a project that is being split up so that all moved files still work?

Options include but are not limited to:

  1. copy all macros from parent project to child project
  2. copy all macros that are used by child project only
  3. create an additional, common project that can be installed as a package to all children packages
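
Option 2 could be sketched as a walk over macro dependencies. The dictionary shape below (nodes and macros that each carry `depends_on.macros`, and macros that carry `package_name`) is assumed from dbt's manifest.json. `macros_needed` is a hypothetical helper, not existing dbt-meshify code.

```python
from typing import Any, Dict, Iterable, List, Set


def macros_needed(manifest: Dict[str, Any], node_ids: Iterable[str], project: str) -> Set[str]:
    """Collect macros from `project` that the given nodes use, directly or indirectly."""
    macros = manifest.get("macros", {})

    # Start from the macros each selected node depends on directly.
    pending: List[str] = []
    for node_id in node_ids:
        pending.extend(manifest["nodes"][node_id].get("depends_on", {}).get("macros", []))

    visited: Set[str] = set()
    needed: Set[str] = set()
    while pending:
        macro_id = pending.pop()
        if macro_id in visited or macro_id not in macros:
            continue
        visited.add(macro_id)
        macro = macros[macro_id]
        # Only macros defined in the parent project need to be copied.
        if macro.get("package_name") == project:
            needed.add(macro_id)
        # Macros can call other macros, so keep following the chain.
        pending.extend(macro.get("depends_on", {}).get("macros", []))
    return needed
```

Macros from installed packages are traversed but not returned, on the assumption that the child project would install those packages itself.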

Identify candidates for `groups` and `private` access

https://docs.getdbt.com/docs/collaborate/govern/model-access

As the maintainer of a large & complex project, I'd be interested in a tool that can help me identify / recommend:

  • potential groups for a set of currently ungrouped models
  • ungrouped models that are good candidates for existing groups
  • protected models that could be switched to access: private

Proposed criteria

Many dbt developers follow our recommended practices around project structure, and so they're already in the habit of grouping together related models in subdirectories. I expect that this will be the way many teams start using groups, by adding a group config in dbt_project.yml for all models in that subdirectory, and defining an owner for those models. This could be a naive starting point for us too.

At the cleverer end of the spectrum, we could cook up a DAG simplification / edge contraction algorithm, where all removed vertices should be private models in the same group.

Given:

model_a --> [one or more models] --> model_b

Where model_b is the only child of those models, they should be switched to private models within the same group as model_b.

I'm sure there are even cleverer approaches we could take here!
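
A naive version of the contraction idea, as a sketch: starting from model_b, walk upstream and absorb every model whose children are all either model_b or already absorbed. `private_candidates` and the `children` mapping are hypothetical. A real implementation would presumably build the graph from the manifest and exclude models that are referenced by other projects.

```python
from typing import Dict, List, Set


def private_candidates(children: Dict[str, Set[str]], target: str) -> Set[str]:
    """Find upstream models whose every child leads only to `target`.

    `children` maps each model name to the set of its direct children.
    """
    # Invert the child map so we can walk upstream from the target.
    parents: Dict[str, Set[str]] = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)

    absorbed: Set[str] = set()
    frontier: List[str] = [target]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, set()):
            if parent in absorbed:
                continue
            # Qualifies only if nothing outside the contracted set uses it.
            if children[parent] <= absorbed | {target}:
                absorbed.add(parent)
                frontier.append(parent)
    return absorbed
```

For a chain `model_a -> int_1 -> int_2 -> model_b` where model_a also feeds another model, this returns `int_1` and `int_2` but not `model_a`.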

handle `on_schema_change` for incrementals with contracts

Incremental models with contracts must have the on_schema_change config set to append_new_columns. This will need to be a conditional piece of logic in the add-contract command.

Options

  1. add it to the model yml entry blindly
  2. be more creative and find where the config is set from and override appropriately.

1 feels good short term, 2 feels right long term!
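
Option 1 might look roughly like the following. `with_contract` is a hypothetical helper operating on a model's yml entry as a plain dictionary, and it assumes the caller already knows how the model is materialized.

```python
from typing import Any, Dict


def with_contract(model_entry: Dict[str, Any], materialized: str) -> Dict[str, Any]:
    """Return a copy of a model yml entry with its contract enforced.

    For incremental models, on_schema_change is also set to
    append_new_columns, overwriting whatever the yml entry had.
    """
    config = dict(model_entry.get("config", {}))
    config["contract"] = {"enforced": True}
    if materialized == "incremental":
        config["on_schema_change"] = "append_new_columns"
    return {**model_entry, "config": config}
```

This is the "blind" variant: it does not look at where on_schema_change was originally configured, so a conflicting value set in dbt_project.yml or in the model's SQL config block would still need to be reconciled.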

investigate methods to ensure models exist

right now, for model contracts, we rely on the catalog.json artifact for column information. Using the artifact assumes that the models exist in the environment we're running the package in, and can lead to some issues if the models do not yet exist.

This is a challenge to manage when

  • Models are renamed with the version operation (fix: do any contract operations first)
  • Users have not run a whole dbt run in dev before attempting to use this package.

Should we think about some sort of dry-run mode to stub in metadata only models before all ops (select * from ... where false)? Is it more reliable to leverage the get_columns_in_query macro, perhaps as a run operation?

add structured logs

Describe the feature

Right now, you have no idea what's going on when you run a command! One should have an idea what is going on when one runs a command

Describe alternatives you've considered

do nothing

Who will this benefit?

users!

Are you interested in contributing this feature?

maybe?

rollback on failed command

Describe the bug

if a meshify command fails, it should roll back anything it did

Steps to reproduce

run

> dbt group group_name -s orders
(without owner)

see that the parse fails, and some files changed despite the failure

Expected results

Nothing should change in the event of a failed command
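
One blunt way to get this behaviour is to snapshot the project before the command runs and restore it on any exception. The sketch below uses only the standard library. `rollback_on_failure` is a hypothetical name, and copying the whole directory is an assumption made for simplicity: tracking only the files a command touches would be cheaper for large projects.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def rollback_on_failure(project_dir: Path) -> Iterator[None]:
    """Snapshot project_dir and restore it if the wrapped block raises."""
    project_dir = Path(project_dir)
    with tempfile.TemporaryDirectory() as scratch:
        backup = Path(scratch) / "backup"
        shutil.copytree(project_dir, backup)
        try:
            yield
        except BaseException:
            # Discard partial edits, put the snapshot back, then re-raise.
            shutil.rmtree(project_dir)
            shutil.copytree(backup, project_dir)
            raise
```

A command body wrapped in `with rollback_on_failure(path):` keeps its edits on success and leaves the project untouched if it raises.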

Actual results

Screenshots and log output

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt debug:

<output goes here>

The output of dbt --version:

<output goes here>

Additional context

Are you interested in contributing the fix?

preserve quotes in YML

Describe the bug

Our YAML provider seems to be dropping quotes from string properties:

image

Steps to reproduce

  1. have some strings in a yml file
  2. run an operation that interacts with that yml file
  3. check the formatting of the string object

Expected results

No YML should be edited that isn't directly related to the feature set being invoked by the package

Actual results

Errant YAML edits!

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: 🦆 )

The output of dbt debug:

❯ dbt debug
15:15:21  Running with dbt=1.5.1
15:15:21  dbt version: 1.5.1
15:15:21  python version: 3.10.10
15:15:21  python path: /Users/daveconnors/dev/sandbox/dbt-meshify-demo/.venv/bin/python3
15:15:21  os info: macOS-13.2.1-arm64-arm-64bit
15:15:21  Using profiles.yml file at /Users/daveconnors/dev/sandbox/dbt-meshify-demo/profiles.yml
15:15:21  Using dbt_project.yml file at /Users/daveconnors/dev/sandbox/dbt-meshify-demo/dbt_project.yml
15:15:21  Configuration:
15:15:21    profiles.yml file [OK found and valid]
15:15:21    dbt_project.yml file [OK found and valid]
15:15:21  Required dependencies:
15:15:21   - git [OK found]

15:15:21  Connection:
15:15:21    database: jaffle_shop
15:15:21    schema: analytics
15:15:21    path: ./reports/jaffle_shop.duckdb
15:15:21    Connection test: [OK connection ok]

15:15:21  All checks passed!

Are you interested in contributing the fix?

yes!

should be fairly easy!

add interactive mode a la `dbt init`

Describe the feature

In addition to structured logs, there was a desire from the DX team for a more guided experience for each command, like the dbt init command currently offers!

Describe alternatives you've considered

  1. black box
  2. better structured logs only

Who will this benefit?

users who are new to the mesh experience

Are you interested in contributing this feature?

selector syntax does not support set operators

Describe the bug

When I attempt to specify a union selector, I get an error:
Screen Shot 2023-06-26 at 11 14 19 AM

We should support set operators or a way to capture "and" and "or" for --select and --exclude. Again, this maybe points to us changing the UX design.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.5.2 - Up to date!

Plugins:
  - snowflake: 1.5.2 - Up to date!

Additional context

Are you interested in contributing the fix?

Model contract removes field descriptions

Describe the bug

When I create a new group, the contracts on my public model(s) overwrite any column descriptions I have in my YML file.

Screen Shot 2023-06-26 at 11 09 56 AM
dbt-meshify group raw_vault --owner name Logan -s 2__raw_vault 

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.5.2 - Up to date!

Plugins:
  - snowflake: 1.5.2 - Up to date!

Additional context

Are you interested in contributing the fix?

What should happen when you add-contract to a private model?

Describe the bug

Right now when I execute dbt-meshify operation add-contract -s 2__raw_vault (which included models that are previously marked as access:private), dbt-meshify does not update those models' access to public.

Should it?

Should there be a separate operation for updating access?

Thoughts?

Bug: Poetry dependencies are out-of-date for using the dbt-core CLI Library

Current Behavior

Within the main branch of this project, code within the dbt_project.py file is referencing a dbtRunner class which does not exist with dbt-core 1.4.5, but rather in dbt-core 1.5.0rc1.

Expected Behavior

The dbt-meshify project declares dbt-core^=1.5.0rc1 as a dependency instead of dbt-core^=1.4.5.

Error with multiple words for owner name

Describe the bug

Following the example from the docs site:

# create a group of all models tagged with "finance"
# leaf nodes and nodes with cross-group dependencies will be `public`
# public nodes will also have contracts added to them
dbt-meshify group finance --owner name Monopoly Man -s +tag:finance

I attempted to create a new group with a 2-word owner name and got an error:
Screen Shot 2023-06-26 at 11 04 29 AM

This might just be a docs issue (and we need double quotes around the owner name), but overall I think this points to us maybe re-thinking the UX!
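
To illustrate the quoting hypothesis, Python's `shlex` module (which follows POSIX shell splitting rules) shows how the two forms are tokenized before the CLI ever sees them. Whether quoting alone resolves the error is the assumption under discussion here, not something this snippet proves.

```python
import shlex

unquoted = shlex.split(
    "dbt-meshify group finance --owner name Monopoly Man -s +tag:finance"
)
quoted = shlex.split(
    'dbt-meshify group finance --owner name "Monopoly Man" -s +tag:finance'
)

# Without quotes, "Man" arrives as a separate, unexpected argument.
print(unquoted[3:7])  # ['--owner', 'name', 'Monopoly', 'Man']
# With quotes, the owner name is passed as a single value.
print(quoted[3:6])  # ['--owner', 'name', 'Monopoly Man']
```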

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.5.2 - Up to date!

Plugins:
  - snowflake: 1.5.2 - Up to date!

Additional context

Are you interested in contributing the fix?

Add support for yml selectors

#34 adds the option for using yml selectors for the selection/exclusion of resources, but it's currently not doing anything! we should update the dbt runner class to support those flags

`--owner name dave` seems odd

Describe the feature

Adding name/email to groups has a bit of an odd syntax today: --owner name dave to assign the name dave. Is this following some CLI convention?

My initial feel is that --owner-name dave or --owner "name:dave" would seem more appropriate, but I might be wrong 😄

create strategy for editing files

we may need to

  1. edit ref() to two arguments
  2. edit source yml to remove unnecessary sources after connection
  3. edit packages.yml (maybe)

This may turn into many issues, but need a standard way to alter contents of dbt files in a reliable way

Update to `pyproject.toml` dependencies

The current file has this section

[tool.poetry.dependencies]
python = "^3.9"
dbt-core = "^1.5.0rc1"
click = "^8.1.3"
dbt-postgres = { version = "^1.5.0rc1", optional = true }
dbt-duckdb = "^1.5.0"
black = "^23.3.0"
mypy = "^1.3.0"
isort = "^5.12.0"
pre-commit = "^3.3.1"
ruamel-yaml = "^0.17.31"
mike = "^1.1.2"

Shouldn't some of them move to [tool.poetry.group.dev.dependencies]:

[tool.poetry.group.dev.dependencies]
...
mypy = "^1.3.0"
isort = "^5.12.0"
pre-commit = "^3.3.1"
mike = "^1.1.2"

Feature - Get suggestions on how to split the project into groups

Describe the feature

dbt-meshify could do some introspection on the project to try to identify relevant groups

Spitballing ideas would be:

  • a model with many children is likely to be a public one
  • models from the same set of sources might be in the same group
  • (did someone say graph theory?)

Who will this benefit?

People wanting to move to a dbt-mesh paradigm but not having an idea of what groups their models should be split into.

Are you interested in contributing to this feature?

Why not

read `catalog.json` if it already exists rather than a new `dbt docs generate` on each command

Describe the feature

Right now, each invocation loads a catalog using a docs generate command. For large projects, this stinks and is very slow. we should load the catalog.json file from the target directory if it exists, and only generate a new one if it's not already there.
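
The load-if-present behaviour could be sketched as below. `load_catalog` is a hypothetical helper, and `generate` stands for whatever callable ends up wrapping `dbt docs generate`. It is passed in so the example stays self-contained.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def load_catalog(target_dir: Path, generate: Callable[[], None]) -> Dict[str, Any]:
    """Load target/catalog.json, calling `generate` only when it is missing."""
    catalog_path = Path(target_dir) / "catalog.json"
    if not catalog_path.is_file():
        # Hypothetical hook: expected to write catalog.json into target_dir.
        generate()
    return json.loads(catalog_path.read_text())
```

The trade-off is staleness: an existing catalog.json may predate recent model changes, so a flag to force regeneration would probably be worth having alongside this.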

Describe alternatives you've considered

Leaving it alone!

Who will this benefit?

Users with large dbt projects

Are you interested in contributing this feature?

Yes

add-contract command should update on_schema_change config for incremental models

Describe the bug

When I run dbt-meshify operation add-contract -s 2__raw_vault (which includes incremental models), I get the error:

Invalid value for on_schema_change: ignore. Models materialized as incremental with contracts enabled must set on_schema_change to 'append_new_columns'

we should automatically update this config for incremental models when we use the add-contract command.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt debug:

<output goes here>

The output of dbt --version:

<output goes here>

Additional context

Are you interested in contributing the fix?

`group` classification is too inclusive for model access

Grace did the following:

I just did the following workflow:

  1. Ok, I know I want to create a new group for all of my models in my 1__stage, 2__raw_vault, and 3__business_vault folders. I execute: dbt-meshify group vault --owner-name "Trainer Logan" -s "1__stage 2__raw_vault 3__business_vault"
  2. Because I've only built out a few models that build on those "vault" models, some are marked as public and some are marked as private. But actually, I want all of my models in 2__raw_vault and 3__business_vault folders to be public so that anyone can access and use them. So I executed dbt-meshify operation add-contract -s "2__raw_vault 3__business_vault"
  3. But that just added contracts, it didn't update my access. So I had to go in and manually change the access config for all of the models in those folders.

I'm okay with not automatically updating access when you add a contract, but I do think we then need an operation to update access manually.

Originally posted by @graciegoheen in #85 (comment)

The reason step 2 did not classify her resources appropriately based on her selection syntax is that we're doing the classify_resource_access step on all resource types in the selected group. Instead, we should limit the interface analysis to model nodes only, as they are the ones that should have access settings at all!

once the split command PR is merged, we'll have yml operations for all resource types, so we can add the group config to all relevant resource types, and limit the access type step to just models
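
The split described here could be sketched as follows. `split_for_group` is a hypothetical helper, and the `resource_type` key is assumed from the shape of nodes in dbt's manifest.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_for_group(
    nodes: Dict[str, Dict[str, Any]], selected: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (resources to add to the group, models to classify for access)."""
    selected = list(selected)
    # Every selected resource gets the group config...
    group_members = selected
    # ...but only models go through the access classification step.
    access_candidates = [
        uid for uid in selected if nodes.get(uid, {}).get("resource_type") == "model"
    ]
    return group_members, access_candidates
```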
