dbt-labs / dbt-meshify

A dbt-core python package that automates the management and creation of dbt groups, contracts, access, and versions.

Home Page: https://dbt-labs.github.io/dbt-meshify/

License: Apache License 2.0

Python 100.00%

data dbt dbt-cloud dbt-core

dbt-meshify's Introduction

dbt-meshify

EXPERIMENTAL

maintained with ❤️ by dbt practitioners for dbt practitioners

Click here for full package documentation

Overview

dbt-meshify is a CLI tool that automates the creation of model governance and cross-project lineage features introduced in dbt-core v1.5 and v1.6. This package will leverage your dbt project metadata to create and/or edit the files in your project to properly configure the models in your project with these features.

These features include:

Groups - group your models into logical sets.
Contracts - add model contracts to your models to ensure consistent data shape.
Access - control the access level of models within groups
Versions - create and increment versions of particular models.
Project dependencies - split a monolithic dbt project into component projects, or connect multiple pre-existing dbt projects using cross-project ref.

Installation

To install dbt-meshify, run:

pip install dbt-meshify

To upgrade dbt-meshify, run:

pip install --upgrade dbt-meshify

dbt-meshify's People

Contributors

Stargazers

Watchers

Forkers

nicholasyager rmmrzv donnyzhao harshaadwait

dbt-meshify's Issues

remove public model workaround

1.5.2 will include the new access selection syntax, and will be released next week -- once it's out, we should update the dependencies and remove the workaround logic for selecting public models introduced in #47

update docs to reflect multiselect behavior of select

Since `select` now allows multiple arguments, we cannot have `--select` before the `project_name` argument.

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split --select "+orders" revenue
Usage: dbt-meshify split [OPTIONS] PROJECT_NAME
Try 'dbt-meshify split --help' for help.

Error: Missing argument 'PROJECT_NAME'.

Instead, we need to order arguments/options specifically

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split revenue --select "+orders"

I don't think this is a blocker per se. At the very least, documentation should be refined in a follow-up.
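The ordering problem can be reproduced outside the tool. The sketch below uses the standard library's argparse rather than the CLI framework dbt-meshify is built on, so it is only an analogy: a multi-value option placed before a positional argument consumes the positional too. The names `RaisingParser`, `build_parser`, `try_parse`, and `split-demo` are made up for this illustration.

```python
import argparse


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of exiting, so errors can be inspected."""

    def error(self, message):
        raise ValueError(message)


def build_parser():
    parser = RaisingParser(prog="split-demo")
    parser.add_argument("project_name")
    # nargs="+" makes --select greedy: it takes every following non-option token.
    parser.add_argument("--select", nargs="+", default=[])
    return parser


def try_parse(argv):
    """Return the parsed namespace, or an error string if parsing fails."""
    try:
        return build_parser().parse_args(argv)
    except ValueError as exc:
        return f"error: {exc}"


# Option first: "revenue" is swallowed by --select, so project_name is missing.
option_first = try_parse(["--select", "+orders", "revenue"])

# Positional first: both values land where they were intended.
name_first = try_parse(["revenue", "--select", "+orders"])
```

Putting the project name first, as in the working command above, avoids the ambiguity regardless of how many selectors follow.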

Originally posted by @nicholasyager in #63 (comment)

automatically create dbt Cloud resources

enhance logging UX

Describe the feature

From review of #73, we have some room for improvement for logging UX

more specific error types
UX for setting log levels

let's improve it!

Who will this benefit?

users, log enthusiasts

Are you interested in contributing this feature?

sure

Create base test suite

the code in /tests/merge_source_metadata/test_merge_source_metadata.py doesn't run!

we should design some pytest infrastructure to test all the features of the package

duplicate versions defined after using `--prerelease` flag for version operation

Describe the bug

When using the add-version operation with the --prerelease flag, which skips the latest_version: increment and creates a file, the latest version and the defined versions are intentionally out of sync. Subsequent invocations of the add-version command after having used the --prerelease flag result in duplicated version numbers in yml.

Steps to reproduce

1. invoke `dbt-meshify operation add-version -s my_model`

yml result:

models:

  - name: my_model
    latest_version: 1
    versions:
      - v: 1


2. invoke `dbt-meshify operation add-version -s my_model --prerelease`

yml result:

models:

  - name: my_model
    latest_version: 1
    versions:
      - v: 1
      - v: 2

no error

3. invoke `dbt-meshify operation add-version -s my_model`

yml result:

models: 
  - name: my_model
    latest_version: 2
    versions:
      - v: 1
      - v: 2
      - v: 2

duplicate version entry

Expected results

The add version command should

(no prerelease flag) increment latest_version and create next version + file based on defined yml versions
(prerelease flag) not increment latest_version and still create next version + file based on defined yml versions

no parse errors from dbt
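One literal reading of the expected behavior, as a minimal sketch: the next version number is derived from the versions already defined in yml, and `latest_version` is bumped by one only when the prerelease flag is absent. The function `add_version` and the plain-dict model shape are assumptions for illustration, not the package's implementation.

```python
def add_version(model, prerelease=False):
    """Return a copy of a model yml entry with one more version defined."""
    versions = list(model.get("versions", []))
    # Derive the next version from what is defined, not from latest_version,
    # so an earlier prerelease cannot cause a duplicate entry.
    next_version = max((entry["v"] for entry in versions), default=0) + 1
    updated = dict(model)
    updated["versions"] = versions + [{"v": next_version}]
    if not prerelease:
        updated["latest_version"] = model.get("latest_version", 0) + 1
    return updated


# Replay the three steps from the reproduction above.
model = add_version({"name": "my_model"})
model = add_version(model, prerelease=True)
model = add_version(model)
defined = [entry["v"] for entry in model["versions"]]
```

Under this reading, the third call defines v3 and moves `latest_version` to 2, so no version number appears twice.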

Actual results

dupe versions and a parse error to boot

create class for interacting with dbtRunner

Add open source license

better exception handling for dbtRunner

I think something like this would work:

    def get_subproject_resources(self, subproject_selector: str) -> List[str]:
        ls_results = self.dbt_operation(["--log-level", "none", "ls", "-s", subproject_selector])

        if not ls_results.success:
            raise ls_results.exception

        return ls_results.result

Given that this would be common code, there's a good argument for having exception handling performed within the dbt_operation method.
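A minimal sketch of what centralizing that check could look like. It does not import dbt: `RunnerResult` and `FakeRunner` are stand-ins that only mirror the `success`, `result`, and `exception` fields read in the snippet above, and `DbtOperationError` is a hypothetical error type.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Protocol


@dataclass
class RunnerResult:
    """Stand-in exposing the three fields the snippet above relies on."""

    success: bool
    result: Any = None
    exception: Optional[BaseException] = None


class Runner(Protocol):
    def invoke(self, args: List[str]) -> RunnerResult: ...


class DbtOperationError(RuntimeError):
    """Hypothetical error raised when a dbt invocation does not succeed."""


def dbt_operation(runner: Runner, args: List[str]) -> Any:
    """Invoke dbt once and handle failure in one place for every caller."""
    outcome = runner.invoke(args)
    if not outcome.success:
        # Chain the original exception so the root cause stays visible.
        raise DbtOperationError(f"dbt {' '.join(args)} failed") from outcome.exception
    return outcome.result


class FakeRunner:
    """Test double that returns a canned result instead of running dbt."""

    def __init__(self, outcome: RunnerResult):
        self._outcome = outcome

    def invoke(self, args: List[str]) -> RunnerResult:
        return self._outcome
```

With this shape, a caller such as `get_subproject_resources` would shrink to a single `return dbt_operation(...)` line.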

Originally posted by @nicholasyager in #5 (comment)

Running the command when no `dbt_project.yml` is available crashes

Describe the bug

Runtime Error
  No dbt_project.yml found at expected path /mypath/dbt-meshify/dbt_project.yml
  Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml
  
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mypath/cli.py", line 74, in wrapper_decorator
    return func(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/mypath/main.py", line 237, in group
    ctx.forward(create_group)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 781, in forward
    return __self.invoke(__cmd, *args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mypath/cli.py", line 74, in wrapper_decorator
    return func(*args, **kwargs)
  File "/mypath/main.py", line 182, in create_group
    project = DbtProject.from_directory(path)
  File "/mypath/dbt_projects.py", line 172, in from_directory
    manifest=dbt.parse(directory),
  File "/mypath/dbt.py", line 28, in parse
    return self.invoke(directory, ["--quiet", "parse"])
  File "/mypath/dbt.py", line 24, in invoke
    raise result.exception
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/cli/requires.py", line 86, in wrapper
    result, success = func(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/cli/requires.py", line 71, in wrapper
    return func(*args, **kwargs)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/cli/requires.py", line 139, in wrapper
    profile = load_profile(flags.PROJECT_DIR, flags.VARS, flags.PROFILE, flags.TARGET, threads)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/config/runtime.py", line 65, in load_profile
    raw_project = load_raw_project(project_root)
  File "/mypath/dbt-meshify/.venv/lib/python3.9/site-packages/dbt/config/project.py", line 170, in load_raw_project
    raise DbtProjectError(MISSING_DBT_PROJECT_ERROR.format(path=project_yaml_filepath))
dbt.exceptions.DbtProjectError: Runtime Error
  No dbt_project.yml found at expected path /mypath/dbt-meshify/dbt_project.yml
  Verify that each entry within packages.yml (and their transitive dependencies) contains a file named dbt_project.yml

Steps to reproduce

Run poetry run dbt-meshify group finance --owner-name "Monopoly Man" -s +tag:finance where there is no dbt project

Expected results

An error message is raised but doesn't crash the program
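One way to get a readable message instead of a traceback is to check for the file before handing the directory to dbt. This is a sketch under that assumption; `MissingProjectError`, `require_dbt_project`, and `main` are hypothetical names, not functions from the package.

```python
from pathlib import Path


class MissingProjectError(Exception):
    """Hypothetical error for a directory that is not a dbt project."""


def require_dbt_project(directory):
    """Return the path to dbt_project.yml, or raise a readable error."""
    project_file = Path(directory) / "dbt_project.yml"
    if not project_file.is_file():
        raise MissingProjectError(
            f"No dbt_project.yml found in {Path(directory)}. "
            "Run this command from the root of a dbt project."
        )
    return project_file


def main(directory):
    """Return an exit code instead of letting the exception escape."""
    try:
        require_dbt_project(directory)
    except MissingProjectError as exc:
        print(f"Error: {exc}")
        return 1
    return 0
```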

add an issue template

account for `target` when `connect` ing via the source hack

When users declare sources in a downstream project to refer to models in an upstream project, most often the sources are hard coded to the production environment outputs of the upstream project.

For example if fct_orders in project_a is built into the analytics schema when running in production, project_b might have a source that looks like this:

sources:
  - name: src_proj_a
    database: project_a_prod_db
    schema: analytics
    tables:
      - name: fct_orders

The current method of interacting with LocalDbtProjects will simply compile each project with its default target. If both local projects default to dev, project_a.fct_orders may have metadata that is different than the hardcoded source metadata, causing us to miss the connection.

let's think about how to resolve this!
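To make the mismatch concrete, here is a small sketch that compares a source table with a model by database, schema, and relation name. The dict shapes, the `alias` key, and the dev database and schema names are assumptions for illustration; only the production values come from the example above.

```python
def relation_key(database, schema, identifier):
    """Normalize a relation to a case-insensitive comparison key."""
    return (database.lower(), schema.lower(), identifier.lower())


def source_points_at_model(source, model):
    """True when the source and the model resolve to the same relation."""
    return relation_key(source["database"], source["schema"], source["name"]) == relation_key(
        model["database"], model["schema"], model["alias"]
    )


source = {"database": "project_a_prod_db", "schema": "analytics", "name": "fct_orders"}

# project_a compiled against its production target.
model_prod = {"database": "project_a_prod_db", "schema": "analytics", "alias": "fct_orders"}

# project_a compiled against a dev target (hypothetical names).
model_dev = {"database": "project_a_dev_db", "schema": "dbt_dev", "alias": "fct_orders"}

matches_prod = source_points_at_model(source, model_prod)
matches_dev = source_points_at_model(source, model_dev)
```

The same model matches or fails to match depending only on which target the upstream project was compiled with, which is the gap described above.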

unified user experience for CLI

The user interface for the CLI should be updated to support the MVP Goals

TODO

YML dumping formatting

Right now, the yml dumping logic does not follow our YML style guide.

sample dictionary:

{
    "my_items": [
        {
            "name": "item1",
            "price": 100
        },
        {
            "name": "item2",
            "price": 200
        },
        {
            "name": "item3",
            "price": 300
        }
    ]
}

current yml output:

my_items:
- name: item1
  price: 100
- name: item2
  price: 200
- name: item3
  price: 300

The output should be indented for arrays to more accurately represent the hierarchy:

my_items:
  - name: item1
    price: 100
  - name: item2
    price: 200
  - name: item3
    price: 300

May require some investigation into pyyaml options (i tried indent and width to no avail) or a new package altogether
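For reference, a commonly used PyYAML workaround is a dumper subclass that never emits "indentless" sequences. The sketch below assumes PyYAML is installed; `IndentedDumper` is a made-up class name, and whether this fits the rest of the package's yml handling has not been checked.

```python
import yaml


class IndentedDumper(yaml.SafeDumper):
    """SafeDumper that indents list items under their parent key."""

    def increase_indent(self, flow=False, indentless=False):
        # Ignore the emitter's request for an indentless sequence.
        return super().increase_indent(flow, False)


data = {
    "my_items": [
        {"name": "item1", "price": 100},
        {"name": "item2", "price": 200},
        {"name": "item3", "price": 300},
    ]
}

rendered = yaml.dump(
    data,
    Dumper=IndentedDumper,
    default_flow_style=False,
    sort_keys=False,
)
print(rendered)
```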

simplify file manager logic

biggest code smell at the moment is in file_manager.py -- @nicholasyager may have some thoughts on how to approach it in a cleaner way!

clear mypy issues :)

thank you

add docs site

make a mkdocs site!

create strategy for cloning/reusing macros

how do we handle macros that are in a project that is being split up so that all moved files still work?

Options include but are not limited to:

copy all macros from parent project to child project
copy all macros that are used by child project only
create an additional, common project that can be installed as a package to all children packages
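The second option could be sketched as a walk over macro dependencies. This assumes manifest-style dicts in which nodes and macros each carry a `depends_on.macros` list of unique ids shaped like `macro.<project>.<name>`; the function name and all ids in the example are hypothetical.

```python
def macros_needed(nodes, macros, selected_ids, project_name):
    """Return ids of the parent project's macros reachable from the selected nodes."""
    seen = set()
    stack = []
    for unique_id in selected_ids:
        stack.extend(nodes[unique_id]["depends_on"]["macros"])
    while stack:
        macro_id = stack.pop()
        if macro_id in seen:
            continue
        seen.add(macro_id)
        # Macros can call other macros, so follow their dependencies too.
        stack.extend(macros.get(macro_id, {}).get("depends_on", {}).get("macros", []))
    # Only macros defined in the parent project need to be copied.
    prefix = f"macro.{project_name}."
    return sorted(macro_id for macro_id in seen if macro_id.startswith(prefix))


nodes = {
    "model.parent.orders": {"depends_on": {"macros": ["macro.parent.cents_to_dollars"]}},
    "model.parent.customers": {"depends_on": {"macros": ["macro.parent.clean_name"]}},
}
macros = {
    "macro.parent.cents_to_dollars": {
        "depends_on": {"macros": ["macro.parent.round_half_up", "macro.dbt.run_query"]}
    },
    "macro.parent.round_half_up": {"depends_on": {"macros": []}},
    "macro.parent.clean_name": {"depends_on": {"macros": []}},
}

to_copy = macros_needed(nodes, macros, ["model.parent.orders"], "parent")
```

Here only the two macros that `orders` depends on, directly or indirectly, would be copied; `clean_name` stays behind with the parent project.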

Identify candidates for `groups` and `private` access

https://docs.getdbt.com/docs/collaborate/govern/model-access

As the maintainer of a large & complex project, I'd be interested in a tool that can help me identify / recommend:

potential groups for a set of currently ungrouped models
ungrouped models that are good candidates for existing groups
protected models that could be switched to access: private

Proposed criteria

Many dbt developers follow our recommended practices around project structure, and so they're already in the habit of grouping together related models in subdirectories. I expect that this will be the way many teams start using groups, by adding a group config in dbt_project.yml for all models in that subdirectory, and defining an owner for those models. This could be a naive starting point for us too.

At the cleverer end of the spectrum, we could cook up a DAG simplification / edge contraction algorithm, where all removed vertices should be private models in the same group.

Given:

model_a --> [one or more models] --> model_b

Where the only children of those models is model_b, they should be switched to private models within the same group as model_b.
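A rough sketch of that rule, assuming the DAG is available as a mapping from each model to its direct children. Starting from the target model, it walks upstream and absorbs any model whose children have all been absorbed already. The function name and the example graph are hypothetical, and an upstream model like model_a is itself absorbed if all of its children are inside the chain.

```python
def private_candidates(children, target):
    """Return models whose every downstream path stays inside the chain into `target`."""
    # Invert the child map so we can walk upstream from the target.
    parents = {}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(node)

    absorbed = {target}
    queue = [target]
    candidates = []
    while queue:
        node = queue.pop()
        for parent in parents.get(node, []):
            if parent in absorbed:
                continue
            # A parent qualifies only if all of its children are already absorbed.
            if set(children[parent]) <= absorbed:
                absorbed.add(parent)
                candidates.append(parent)
                queue.append(parent)
    return sorted(candidates)


children = {
    "model_a": ["int_1", "other_model"],
    "int_1": ["int_2"],
    "int_2": ["model_b"],
    "other_model": [],
    "model_b": [],
}

candidates = private_candidates(children, "model_b")
```

In this graph `int_1` and `int_2` are flagged as private candidates for model_b's group, while `model_a` is not because it also feeds `other_model`.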

I'm sure there are even cleverer approaches we could take here!

create method for writing files from one project to another directory

Once subprojects are a bit more stable, we'll need a method/class to write new files with the original (edited?) content into a new location

Ideally, this operation will be idempotent, and non-destructive (if at all possible)
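A minimal sketch of what idempotent and non-destructive could mean for a single file: re-running the copy is a no-op when the destination already matches, and an existing file with different content is never overwritten. `copy_file` and its return values are assumptions for illustration.

```python
from pathlib import Path


def copy_file(source, destination):
    """Copy source to destination without ever clobbering different content."""
    source, destination = Path(source), Path(destination)
    content = source.read_bytes()
    if destination.exists():
        if destination.read_bytes() == content:
            # Same content already in place: running again changes nothing.
            return "unchanged"
        raise FileExistsError(f"{destination} already exists with different content")
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(content)
    return "written"
```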

create optional `--dry-run` flag

there should be a way for the user to audit the actions that the package will be taking before making changes to their project.

handle `on_schema_change` for incrementals with contracts

Incremental models with contracts must have the on_schema_change config set to append_new_columns. This will need to be a conditional piece of logic in the add-contract command.

Options

1. add it to the model yml entry blindly
2. be more creative and find where the config is set from and override appropriately.

1 feels good short term, 2 feels right long term!
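The first option can be sketched in a few lines, assuming the model's yml entry is a plain dict and the materialization is already known. The function name is made up; the config value comes from the requirement stated above.

```python
def with_on_schema_change(model_entry, materialized):
    """Return a model yml entry with on_schema_change set for incremental models."""
    if materialized != "incremental":
        return model_entry
    updated = dict(model_entry)
    # "Blindly": overwrite whatever was set in the yml entry before.
    updated["config"] = {
        **model_entry.get("config", {}),
        "on_schema_change": "append_new_columns",
    }
    return updated


entry = {"name": "fct_events", "config": {"contract": {"enforced": True}}}
incremental_entry = with_on_schema_change(entry, "incremental")
table_entry = with_on_schema_change(entry, "table")
```

This only touches the yml entry. A value set in dbt_project.yml or in the model's config block would still need the second option.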

investigate methods to ensure models exist

right now, for model contracts, we rely on the catalog.json artifact for column information. Using the artifact assumes that the models exist in the environment we're running the package in, and can lead to some issues if the models do not yet exist.

This is a challenge to manage when

Models are renamed with the version operation (fix: do any contract operations first)
Users have not run a whole dbt run in dev before attempting to use this package.

Should we think about some sort of dry-run mode to stub in metadata only models before all ops (select * from ... where false)? Is it more reliable to leverage the get_columns_in_query macro, perhaps as a run operation?

add structured logs

Describe the feature

Right now, you have no idea what's going on when you run a command! One should have an idea what is going on when one runs a command

Describe alternatives you've considered

do nothing

Who will this benefit?

users!

Are you interested in contributing this feature?

maybe?

rollback on failed command

Describe the bug

if a meshify command fails, it should roll back anything it did

Steps to reproduce

run

> dbt group group_name -s orders
(without owner)

see that the parse fails, and some files changed despite the failure

Expected results

nothing changes in the event of a failed command
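One possible shape for this is a context manager that snapshots the files a command may touch and restores them if anything raises. This is a sketch only; `rollback_on_failure` is a hypothetical helper, and it assumes the set of files is known up front.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def rollback_on_failure(paths):
    """Restore the given files to their prior state if the wrapped block raises."""
    # None marks a file that did not exist before the command ran.
    snapshot = {
        Path(p): (Path(p).read_bytes() if Path(p).exists() else None) for p in paths
    }
    try:
        yield
    except Exception:
        for path, content in snapshot.items():
            if content is None:
                path.unlink(missing_ok=True)
            else:
                path.write_bytes(content)
        raise
```

The exception is re-raised after the restore, so the caller still sees why the command failed.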

preserve quotes in YML

Describe the bug

Our YAML provider seems to be dropping quotes from string properties:

Steps to reproduce

have some strings in a yml file
run an operation that interacts with that yml file
check the formatting of the string object

Expected results

No YML should be edited that isn't directly related to the feature set being invoked by the package

Actual results

Errant YAML edits!

System information

Which database are you using dbt with?

The output of dbt debug:

❯ dbt debug
15:15:21  Running with dbt=1.5.1
15:15:21  dbt version: 1.5.1
15:15:21  python version: 3.10.10
15:15:21  python path: /Users/daveconnors/dev/sandbox/dbt-meshify-demo/.venv/bin/python3
15:15:21  os info: macOS-13.2.1-arm64-arm-64bit
15:15:21  Using profiles.yml file at /Users/daveconnors/dev/sandbox/dbt-meshify-demo/profiles.yml
15:15:21  Using dbt_project.yml file at /Users/daveconnors/dev/sandbox/dbt-meshify-demo/dbt_project.yml
15:15:21  Configuration:
15:15:21    profiles.yml file [OK found and valid]
15:15:21    dbt_project.yml file [OK found and valid]
15:15:21  Required dependencies:
15:15:21   - git [OK found]

15:15:21  Connection:
15:15:21    database: jaffle_shop
15:15:21    schema: analytics
15:15:21    path: ./reports/jaffle_shop.duckdb
15:15:21    Connection test: [OK connection ok]

15:15:21  All checks passed!

Are you interested in contributing the fix?

yes!

should be fairly easy!

update file writing class to use pathlib

should make some of the classes I made in the file writing lib a lot simpler

add interactive mode a la `dbt init`

Describe the feature

In addition to structured logs, there was a desire from the DX team for a more guided experience for each command, like the dbt init command currently offers!

Describe alternatives you've considered

black box
better structured logs only

Who will this benefit?

users who are new to the mesh experience

Are you interested in contributing this feature?

selector syntax does not support set operators

Describe the bug

When I attempt to specify a union selector, I get an error:

We should support set operators or a way to capture "and" and "or" for --select and --exclude. Again, this maybe points to us changing the UX design.

System information

Which database are you using dbt with?

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.5.2 - Up to date!

Plugins:
  - snowflake: 1.5.2 - Up to date!

Additional context

Are you interested in contributing the fix?

[stretch] have the package automatically create new repositories

Model contract removes field descriptions

Describe the bug

When I create a new group, the contracts on my public model(s) overwrite any column descriptions I have in my YML file.

dbt-meshify group raw_vault --owner name Logan -s 2__raw_vault

System information

Which database are you using dbt with?

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.5.2 - Up to date!

Plugins:
  - snowflake: 1.5.2 - Up to date!

Additional context

Are you interested in contributing the fix?

What should happen when you add-contract to a private model?

Describe the bug

Right now when I execute dbt-meshify operation add-contract -s 2__raw_vault (which included models that are previously marked as access:private), dbt-meshify does not update those models' access to public.

Should it?

Should there be a separate operation for updating access?

Thoughts?

publish package to pypi

alright!

Bug: Poetry dependencies are out-of-date for using the dbt-core CLI Library

Current Behavior

Within the main branch of this project, code within the dbt_project.py file is referencing a dbtRunner class which does not exist with dbt-core 1.4.5, but rather in dbt-core 1.5.0rc1.

Expected Behavior
The dbt-meshify project declares dbt-core^=1.5.0rc1 as a dependency instead of dbt-core^=1.4.5.

Remove the `public_only` arg from the contract command

dbt-labs/dbt-core#7739 was just merged, which adds the ability to select models based on their access. Once released, we should use this in favor of the hidden public_only argument in the add_contract command and just select these models with selection syntax.

When a model has no yml, and is marked as private -> extra yml configs

Describe the bug

When I execute a create-group command, and dbt-meshify creates new yaml for a private model - it creates extra "config" and "columns" blank fields:

dbt-meshify operation create-group raw_vault

Error with multiple words for owner name

Describe the bug

Following the example from the docs site:

# create a group of all models tagged with "finance"
# leaf nodes and nodes with cross-group dependencies will be `public`
# public nodes will also have contracts added to them
dbt-meshify group finance --owner name Monopoly Man -s +tag:finance

I attempted to create a new group with a 2-word owner name and got an error:

This might just be a docs issue (and we need double quotes around the owner name), but overall I think this points to us maybe re-thinking the UX!
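The quoting theory is easy to check with the standard library's `shlex`, which splits a command line the way a POSIX shell would. This only shows what tokens the program receives; it does not confirm that the quoted form is what the CLI expects.

```python
import shlex

# Without quotes, the shell hands "Monopoly" and "Man" over as two arguments.
unquoted = shlex.split("dbt-meshify group finance --owner name Monopoly Man -s +tag:finance")

# With double quotes, the owner name arrives as a single argument.
quoted = shlex.split('dbt-meshify group finance --owner name "Monopoly Man" -s +tag:finance')

print(unquoted)
print(quoted)
```

In the unquoted form the stray `Man` token is left over after `--owner` has taken its values, which is consistent with the command failing.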

System information

Which database are you using dbt with?

The output of dbt --version:

Core:
  - installed: 1.5.2
  - latest:    1.5.2 - Up to date!

Plugins:
  - snowflake: 1.5.2 - Up to date!

Additional context

Are you interested in contributing the fix?

allow `grouper.py` to add access and groups separately

per convo, we'll need a way to get the access logic without adding a new group

create method for generating a boilerplate dbt Project to clone

should we use cookiecutter ?
should we shell to the dbtRunner and call dbt init? (i don't think so)
do we use the starter_project that ships in core

Default create-group with nothing selected, makes a group with all of your models

Describe the bug

When I run dbt-meshify operation create-group raw_vault, dbt-meshify creates a new group with all of my project's models in it. I would expect that when I don't specify selector syntax, dbt-meshify would create a new empty group.

Thoughts?

Add support for yml selectors

#34 adds the option for using yml selectors for the selection/exclusion of resources, but it's currently not doing anything! we should update the dbt runner class to support those flags

`--owner name dave` seems odd

Describe the feature

Adding name/email to groups has a bit of an odd syntax today: --owner name dave to assign the name dave. Is this following some CLI convention?

My initial feel is that --owner-name dave or --owner "name:dave" would seem more appropriate, but I might be wrong 😄

create strategy for editing files

we may need to

edit ref() to two arguments
edit source yml to remove unnecessary sources after connection
edit packages.yml (maybe)

This may turn into many issues, but need a standard way to alter contents of dbt files in a reliable way

Update to `pyproject.toml` dependencies

The current file has this section

[tool.poetry.dependencies]
python = "^3.9"
dbt-core = "^1.5.0rc1"
click = "^8.1.3"
dbt-postgres = { version = "^1.5.0rc1", optional = true }
dbt-duckdb = "^1.5.0"
black = "^23.3.0"
mypy = "^1.3.0"
isort = "^5.12.0"
pre-commit = "^3.3.1"
ruamel-yaml = "^0.17.31"
mike = "^1.1.2"

Shouldn't some of them move to [tool.poetry.group.dev.dependencies]:

[tool.poetry.group.dev.dependencies]
...
mypy = "^1.3.0"
isort = "^5.12.0"
pre-commit = "^3.3.1"
mike = "^1.1.2"

refactor dbtProject Class such that subprojects are instances of the same class

Feature - Get suggestions on how to split the project into groups

Describe the feature

dbt-meshify could do some introspection on the project to try to identify relevant groups

Spitballing ideas would be:

a model with many children is likely to be a public one
models from the same set of sources might be in the same group
(did someone say graph theory?)

Who will this benefit?

People wanting to move to a dbt-mesh paradigm but not having an idea of what groups their models should be split into.

Are you interested in contributing to this feature?

Why not

read `catalog.json` if it already exists rather than a new `dbt docs generate` on each command

Describe the feature

Right now, each invocation loads a catalog using a docs generate command. For large projects, this stinks and is very slow. we should load the catalog.json file from the target directory if it exists, and only generate a new one if it's not already there.
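A minimal sketch of the proposed behavior, assuming the catalog lives at `catalog.json` inside the target directory. The `generate` callback is a hypothetical stand-in for whatever runs docs generate and is expected to write that file.

```python
import json
from pathlib import Path


def load_catalog(target_dir, generate):
    """Load catalog.json, generating it first only when it is missing."""
    catalog_path = Path(target_dir) / "catalog.json"
    if not catalog_path.is_file():
        # Only pay the cost of generation when there is nothing to reuse.
        generate()
    return json.loads(catalog_path.read_text())
```

A reused catalog can be stale if models changed since it was written, so a flag to force regeneration may be worth considering alongside this.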

Describe alternatives you've considered

Leaving it alone!

Who will this benefit?

Users with large dbt projects

Are you interested in contributing this feature?

Yes

add-contract command should update on_schema_change config for incremental models

Describe the bug

When I run dbt-meshify operation add-contract -s 2__raw_vault (which includes incremental models), I get the error:

Invalid value for on_schema_change: ignore. Models materialized as incremental with contracts enabled must set on_schema_change to 'append_new_columns'

we should automatically update this config for incremental models when we use the add-contract command.

`group` classification is too inclusive for model access

Grace did the following:

I just did the following workflow:

1. Ok, I know I want to create a new group for all of my models in my 1__stage, 2__raw_vault, and 3__business_vault folders. I execute: dbt-meshify group vault --owner-name "Trainer Logan" -s "1__stage 2__raw_vault 3__business_vault"
2. Because I've only built out a few models that build on those "vault" models, some are marked as public and some are marked as private. But actually, I want all of my models in 2__raw_vault and 3__business_vault folders to be public so that anyone can access and use them. So I executed dbt-meshify operation add-contract -s "2__raw_vault 3__business_vault"
3. But that just added contracts, it didn't update my access. So I had to go in and manually change the access config for all of the models in those folders.

I'm okay with not automatically updating access when you add a contract, but I do think we then need an operation to update access manually.

Originally posted by @graciegoheen in #85 (comment)

The reason number 2 did not classify her resources appropriately based on her selection syntax is that we're running the classify_resource_access step on all resource types in the selected group. Instead, we should limit the interface analysis to model nodes only, as they are the only ones that should have access settings at all!

once the split command PR is merged, we'll have yml operations for all resource types, so we can add the group config to all relevant resource types, and limit the access type step to just models
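The split described here could look roughly like the sketch below: the group is applied to every selected resource, while access classification only ever sees models. It assumes each selected resource is a dict with a `resource_type` key, as in a dbt manifest; the function name and the example ids are hypothetical.

```python
def models_only(resources):
    """Keep only model nodes, the only resources that take an access setting."""
    return {
        unique_id: resource
        for unique_id, resource in resources.items()
        if resource.get("resource_type") == "model"
    }


selected = {
    "model.jaffle.orders": {"resource_type": "model"},
    "seed.jaffle.raw_orders": {"resource_type": "seed"},
    "test.jaffle.not_null_orders_id": {"resource_type": "test"},
}

# The group config would go on everything in `selected`;
# only this subset would be passed on to access classification.
access_candidates = models_only(selected)
```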

standardized method/class to detect dependencies between projects

An early version is LocalDbtProject.overlapping_sources(), which tells you the sources in the downstream project that point to models in the upstream project, but this is not really the way we want to do this. It should be more generic.
