Comments (4)
We had a productive discussion of this issue offline, resulting in some changes to the scope. Codifying that discussion here for posterity:
It feels like this issue is experiencing scope creep, so we're going to cut the scope and put the extraneous features on the backburner. Specifically, there are a few distinct goals that we were trying to accomplish with the proposed design above:
- Speeding up the model iteration cycle by making the variables easier to update (the original goal as stated in the issue above)
- Consolidating our data documentation by moving variable definitions into dbt
- Exposing a public interface for translating between variable names and accessing their rich descriptions (e.g. in Tableau reports)
The proposed solution (in brief: variables stored in dbt metadata, which gets compiled to a CSV, then updated to add a version column, then uploaded to S3, then crawled by Glue, then exposed in Athena, then downloaded by the `ccao` package as necessary to support `ccao::vars_dict`) feels overengineered due to our desire to solve all of these problems at once. Instead, an 80/20 solution might improve on each goal incrementally:
- Speed up the model iteration cycle by stripping the model variables out of the `ccao` variable dictionary and removing the requirement that we keep it up to date with our changes
- Consolidate our data documentation by moving column notes from the `ccao` variable dictionary to the dbt DAG
- Expose a public interface for column names/descriptions by designing a dedicated solution (and punting on this work in the immediate term)
The steps of this work will now be:
- Update the `ccao` variable dictionary to remove unused rows and columns and focus exclusively on supporting renaming/recoding variables
- Move rich column descriptions from the `notes` column of the `ccao` variable dictionary into the dbt DAG
- Open up an issue to work with Nicole to determine our Tableau needs for the data catalog, and scope a dedicated solution
Here's a brief proposal for how we might do this!
Defining the variables
Variable definitions will be stored as metadata on our dbt model definitions, with a `meta` entry containing `var_*` attributes for each column in the `columns` key. (Example to follow shortly.) We'll need to make a few schema changes in order to support mapping the rows in `vars_dict.csv` to a dbt `meta` object, and vice versa:
- The `notes` column of the vars dict will be moved into the `description` attribute for the column (outside of `meta`) and used to document the column.
- Rows will be aggregated such that there is only one variable object per column in the dbt model; in particular, categorical variables that span multiple rows in `vars_dict.csv` (one for each variable code) will be refactored such that the `var_code`, `var_value`, and `var_value_short` attributes are combined into a single `var_codes` object, where each `var_code` value is its own key with a `var_value` and `var_value_short` attribute.
Below is an example of the proposed definition of the `default.vw_card_res_char.char_air` feature. Here's what the original rows look like in `vars_dict.csv`:
| var_from_source | var_from_table | var_from_ctas | var_from_view | var_name_hie | var_name_iasworld | var_name_athena | var_name_model | var_name_publish | var_name_pretty | var_type | var_data_type | var_model_type | var_is_published | var_is_predictor | var_code | var_value | var_value_short | var_notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iasWorld | iasworld.dweldat | | default.vw_card_res_char | qu_air | user7 | char_air | char_air | central_air | Central Air Conditioning | char | categorical | res | TRUE | TRUE | 1 | Central A/C | YES | Whether the unit has central air conditioning. |
| iasWorld | iasworld.dweldat | | default.vw_card_res_char | qu_air | user7 | char_air | char_air | central_air | Central Air Conditioning | char | categorical | res | TRUE | TRUE | 2 | No Central A/C | NO | Whether the unit has central air conditioning. |
Here's what that would look like converted to a dbt `meta` object in `dbt/models/default/schema.yml`:
```diff
diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
           error_if: ">266758"
   - name: default.vw_card_res_char
     description: '{{ doc("vw_card_res_char") }}'
+    columns:
+      - name: char_air
+        description: Whether the unit has central air conditioning.
+        meta:
+          var_from_source: iasWorld
+          var_from_table: iasworld.dweldat
+          var_from_ctas: null
+          var_from_view: default.vw_card_res_char
+          var_name_hie: qu_air
+          var_name_iasworld: user7
+          var_name_athena: char_air
+          var_name_model: char_air
+          var_name_publish: central_air
+          var_name_pretty: Central Air Conditioning
+          var_type: char
+          var_data_type: categorical
+          var_model_type: res
+          var_is_published: true
+          var_is_predictor: true
+          var_codes:
+            1:
+              var_value: Central A/C
+              var_value_short: "YES"
+            2:
+              var_value: No Central A/C
+              var_value_short: "NO"
     tests:
       # Unique by card and year
       - dbt_utils.unique_combination_of_columns:
```
It will probably be most efficient to write a script to do this for the 500+ rows in `vars_dict.csv`.
Compiling the variables into the vars dict
Whenever dbt compiles the DAG, it produces a new `manifest.json` file with the full DAG definition; per the dbt docs, any resource metadata stored in a `meta` field will get compiled into the manifest as well. I confirmed this locally by running:
```shell
$ jq '.nodes."model.athena.default.vw_card_res_char".columns.char_air.meta' target/manifest.json
```
And the output:
```json
{
  "var_from_source": "iasWorld",
  "var_from_table": "iasworld.dweldat",
  "var_from_ctas": null,
  "var_from_view": "default.vw_card_res_char",
  "var_name_hie": "qu_air",
  "var_name_iasworld": "user7",
  "var_name_athena": "char_air",
  "var_name_model": "char_air",
  "var_name_publish": "central_air",
  "var_name_pretty": "Central Air Conditioning",
  "var_type": "char",
  "var_data_type": "categorical",
  "var_model_type": "res",
  "var_is_published": true,
  "var_is_predictor": true,
  "var_codes": {
    "1": {
      "var_value": "Central A/C",
      "var_value_short": "YES"
    },
    "2": {
      "var_value": "No Central A/C",
      "var_value_short": "NO"
    }
  }
}
```
We can write a simple script that reads a `manifest.json` file, extracts these metadata objects, and compiles them into output rows matching the format of `vars_dict.csv`. This will essentially be the reverse of the operation described in Defining the variables above, so it might make sense to write one conversion script that can perform both operations.
Deployment
Once we have metadata compiled to the dbt manifest and a script to convert it into the vars dict, we can write a GitHub workflow that runs on each commit to the main branch and pushes the dict to S3 remote storage in the `ccao-data-public-us-east-1` bucket under the `vars_dict` prefix.
We would like to version the vars dict according to the git commit that generated it. As such, the workflow will push two copies of the dict to S3: one using the short commit hash (e.g. `ccao-data-public-us-east-1/vars_dict/3798bfa.csv`) and one using the reserved `latest` key (e.g. `ccao-data-public-us-east-1/vars_dict/latest.csv`). This way, data consumers can pin to a specific version of the dict if they so desire.
To avoid unnecessary versions, the workflow will only push a new version if the CSV produced by the conversion script differs from `latest.csv` in S3.
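The push decision itself is simple enough to sketch as a pure function (key names follow the scheme above; the actual S3 download/upload calls are omitted, and the function name is hypothetical):

```python
def keys_to_push(new_csv, latest_csv, short_hash, prefix="vars_dict"):
    """Return the S3 keys the workflow should upload: the versioned copy
    plus latest.csv, or nothing if the dict is unchanged."""
    if latest_csv is not None and new_csv == latest_csv:
        return []  # identical to the current latest.csv; skip this version
    return [f"{prefix}/{short_hash}.csv", f"{prefix}/latest.csv"]
```

The workflow would fetch `latest.csv`, pass its bytes in (or `None` if it doesn't exist yet), and upload the generated CSV to each returned key.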
Incorporating the remote CSV into the `ccao` package
We will update the `ccao` package so that it pulls `vars_dict` from the remote S3 bucket. (This is the step that gives me the least confidence, since I haven't yet written an R package!)
We will define a new `get_vars_dict` function that accepts an optional `version` argument. Data consumers will be able to use this function to retrieve a specific version of the vars dict according to the short hash of the git commit that generated it. If no `version` is supplied, `get_vars_dict` should return the latest version.
When the package is first imported, it should call `get_vars_dict()` to pull the latest vars dict from S3 and save it to the `vars_dict` variable. This variable will then be available in memory for use by package internals or package consumers. We may also want to consider passing a `version` into this initial call so that each published version of the `ccao` package refers to a specific version of the vars dict by default, but I'm unsure on this point.
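The version-resolution behavior could be sketched like this (in Python purely for illustration; the real implementation would be R, and the public S3 URL scheme is an assumption based on the bucket and prefix above):

```python
# Hypothetical public endpoint for the bucket named above.
BUCKET_URL = "https://ccao-data-public-us-east-1.s3.amazonaws.com"

def vars_dict_url(version=None):
    """Resolve an optional short commit hash to the URL of the matching
    vars dict CSV; fall back to the reserved `latest` key."""
    return f"{BUCKET_URL}/vars_dict/{version or 'latest'}.csv"
```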
Curious what @dfsnow and @wrridgeway think about my proposal above!
@jeancochrane See my notes on each section:
Defining the variables
- Re: notes. Is there a way to store separate `description` (a high-level description of the field) and `notes` (caveats, long description, etc.) columns?
- The aggregation of rows also makes sense. However, since a lot of the stuff in `vars_dict` will already be represented by dbt, I think we can drop a few of the columns. See diff and comments within it:
```diff
diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
           error_if: ">266758"
   - name: default.vw_card_res_char
     description: '{{ doc("vw_card_res_char") }}'
+    columns:
+      - name: char_air
+        description: Whether the unit has central air conditioning.
+        meta:
+          var_from_source: iasWorld # Perhaps we can move this into dbt's loader field?
-          var_from_table: iasworld.dweldat # All of these should be able to be imputed from dbt
-          var_from_ctas: null
-          var_from_view: default.vw_card_res_char
+          var_name_hie: qu_air
+          var_name_iasworld: user7
+          var_name_athena: char_air
+          var_name_model: char_air
+          var_name_publish: central_air
+          var_name_pretty: Central Air Conditioning
+          var_type: char # This maps to the prefix used when renaming
+          var_data_type: categorical # This maps to the R data type, so must stay
-          var_model_type: res
-          var_is_published: true # No longer really used
-          var_is_predictor: true # Built into the model pipeline now
+          var_codes:
+            1:
+              var_value: Central A/C
+              var_value_short: "YES"
+            2:
+              var_value: No Central A/C
+              var_value_short: "NO"
     tests:
       # Unique by card and year
       - dbt_utils.unique_combination_of_columns:
```
Compiling the variables into the vars dict
Flattening these vars might be tricky (see my question about duplicates below), but I think it's totally doable with jq or python. Agree that we can probably fit this into one script.
Deployment
Deploying to S3 with the `latest` + hash makes sense. However, it would also be extremely useful to have this output accessible as an Athena table. IMO, I think the way to achieve this is to write the output to `ccao-data-public-us-east-1/vars_dict/`, but include the hash/version as a column in the CSV, in addition to using it in the naming schema. This makes the version accessible to Athena.
If we then define `ccao-data-public-us-east-1/vars_dict/` as an Athena table with a crawler, we can access all versions of the CSV in a single table and simply use `WHERE version = 'latest'`. This will be extremely handy for grabbing high-level variable descriptions. I would write the table to `ccao.vars_dict`.
Questions
- How should we represent duplicate data? Something like `char_air` is stored in a raw form in `dweldat` and a cleaned up form in `vw_res_card_char`. In the former, the numeric codes are stored; in the latter, the string representations are stored.
- Should we just dupe the definitions across both the view and table? If so, how do we represent that when parsing back to CSV?
- Can we collapse the duplicate definitions into a single row in `vars_dict`, i.e. how they are currently represented?
- One thing to consider: Athena can read and parse arbitrary JSON. We might be able to define the construction of the `vars_dict` table using SQL alone, meaning it could simply be a dbt build step. Not sure how well this plays with state, though (would updating a documentation column alter the table state and force a rebuild?)