model-res-avm's People

Contributors

asanhueza1, damonamajor, dfsnow, jeancochrane, njardine, wagnerlmichael, wrridgeway

model-res-avm's Issues

Rewrite model pipelines for speed, simplicity

Last year's model was slow during the post-modeling stages (assess, evaluate). These performance issues are likely due to heavy use of complicated dplyr calls. We should refactor each pipeline stage to improve runtimes. The easiest option is to simply drop in dtplyr so the existing dplyr code executes via data.table; the more prudent option is to rewrite the slow stages directly in polars or data.table.
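
As a rough illustration of the dtplyr option (not code from the existing pipeline; the data frame and column names below are placeholders), the same dplyr syntax can be kept while execution happens via data.table:

# Hypothetical sketch: run an existing dplyr aggregation through data.table via dtplyr
library(dplyr)
library(dtplyr)

# assessment_data and its columns (township_code, estimate, sale_price) are placeholders
ratios_by_town <- assessment_data %>%
  lazy_dt() %>%
  mutate(ratio = estimate / sale_price) %>%
  group_by(township_code) %>%
  summarize(median_ratio = median(ratio, na.rm = TRUE)) %>%
  as_tibble()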

[Infra updates] Generate and publish a Quarto doc with performance results on each model run

This year, we're going to consolidate all diagnostic reporting into a single Quarto document that gets created for each model run. The document will knit at the end of each run, once the model is finished training, has created performance statistics, etc. The output PDF (or HTML) will be uploaded to S3 along with the other model artifacts, then linked in SNS completion notifications.

This Quarto doc will be the primary way we evaluate individual model performance. Cross-model comparison will still be done via Tableau.

model_qc.qmd might be a good starting place for defining the Quarto doc itself.

Tasks

  • Update the 05-finalize.R step of the model so that it generates a Quarto doc containing performance results, uploads it to S3, and updates the SNS notification body to include a link to the doc in S3.
  • Consolidate existing Quarto docs (in reports/) into a single document
  • Add simple residuals scatterplots by township
  • Moran's I / spatial autocorrelation stats
  • IAAO performance stats by area
  • Lorenz curves by area
  • Check for highly correlated variables and always inspect residual plots (calibration plots via the probably package's cal_plot_regression(); see the sketch after this list)
  • Explore using dtreeviz - https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_lightgbm_visualisations.ipynb
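
A minimal sketch of the calibration-plot bullet above, assuming a placeholder data frame of test-set predictions with observed sale_price and predicted .pred columns:

# Hypothetical sketch: regression calibration plot via the probably package
library(probably)

# test_predictions is a placeholder data frame of holdout predictions
cal_plot_regression(
  test_predictions,
  truth = sale_price,
  estimate = .pred
)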

Improve townhome fuzzy grouping, conditionally add 211s to townhome groups

In 2022, we improved the townhome valuation methodology by implementing "fuzzy grouping". Basically, townhome units with similar, but not perfectly identical, features should receive similar values.

Valuations pointed out that 211s are often mixed into townhome (210) complexes, such that you may have an alternating 211 - 210 - 211 - 210 pattern. Due to the different methodologies between 210s and 211s, neighboring units may receive very different values.

We should expand the fuzzy grouping methodology to conditionally include 211s that are mixed in with townhome (210) classes. We can use the building touching indicator (ccao-data/data-architecture#7) to help identify such 211s.
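
A toy sketch of what the conditional inclusion could look like; the data frame, columns (class, touching_210, char_yrblt, char_bldg_sf), and rounding choices are all placeholders, not the actual ingest logic:

# Hypothetical sketch: include 211s in townhome complex groups only when they
# touch a 210 per the building touching indicator
library(dplyr)

complex_candidates <- chars_data %>%  # chars_data is a placeholder
  filter(class == "210" | (class == "211" & touching_210)) %>%
  group_by(
    township_code,
    decade_built = round(char_yrblt, -1),
    bldg_sf_bin = round(char_bldg_sf, -2)
  ) %>%
  mutate(complex_id = cur_group_id()) %>%
  ungroup()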

[Infra updates] Enable EC2 backend for model Batch jobs

Now that we've established a better understanding of the basic IAM permissions and networking required to run Batch jobs with a Fargate backend (#26), we should take another crack at enabling the EC2 backend. The EC2 backend would allow us to select instance types with GPUs, and it's also cheaper than Fargate.

Investigate replacing Lightsnip with bonsai

In 2022, both the residential and condo models used a custom shim R package to bridge the gap between LightGBM and Tidymodels. This package is called Lightsnip. It has a variety of extra features and improvements that were not included in the previous shim package (treesnip).

Since last year's model, Tidymodels has released an officially supported parsnip shim for LightGBM, called bonsai. We should investigate whether or not bonsai can reasonably replace Lightsnip.

Some reasons to replace include:

  • Official support from Tidymodels maintainers, rather than a one-off package
  • Better integration with parsnip and its features

Some reasons not to replace include:

  • Lightsnip's extra features and fixes (noted above) may not yet have bonsai equivalents

[Infra updates] Add a workflow to run the model on PRs and `workflow_dispatch`

Building on #22, we need a GitHub Actions workflow that can run the model. The workflow should:

  • Depend on the workflow that pushes the Docker image to the GitHub Container Registry in #22
  • Run on:
    • Every commit to every pull request
    • The workflow_dispatch event
  • Deploy to an environment called staging that requires manual approval
  • Define a job to run the model

There are two ways we could define a job to run the model. Try option 1 first, and fall back to option 2 if CML doesn't work as advertised.

Option 1: Use CML self-hosted runners

  • Define a job, launch-runner, to start an AWS spot EC2 instance using cml runner
    • Set sensible defaults for the instance options, but allow them to be overridden via workflow inputs
  • Define a job, run-model, to run the model on the EC2 instance created by CML
    • Set the runs-on key for the job to point at the runner
      • This will cause steps defined in the job to run on the remote runner
    • Run the model using dvc pull and dvc repro

Option 2: Write custom code to run model jobs on AWS Batch

  • Run Terraform to make sure an AWS Batch job queue and job definition exist for the PR
    • The job definition should define the code that will be used to run the model itself, e.g. dvc pull and dvc repro
  • Use the AWS CLI to submit a job to the Batch queue
  • Use the AWS CLI to poll the job status until it has a terminal status (SUCCEEDED or FAILED)
    • Once the job has at least a RUNNING status, use the logStreamName from the job's container details to print a link to its logs (see the sketch after this list)
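
A hedged sketch of the submit-and-poll loop from Option 2, using the paws R package in place of the raw AWS CLI; the job, queue, and definition names are placeholders:

# Hypothetical sketch: submit a Batch job and poll until it reaches a terminal state
library(paws)

batch <- paws::batch()

job <- batch$submit_job(
  jobName = "model-res-avm-pr-run",       # placeholder
  jobQueue = "model-res-avm-job-queue",   # placeholder
  jobDefinition = "model-res-avm-job-def" # placeholder
)

repeat {
  status <- batch$describe_jobs(jobs = list(job$jobId))$jobs[[1]]$status
  message("Batch job status: ", status)
  if (status %in% c("SUCCEEDED", "FAILED")) break
  Sys.sleep(30)
}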

Depends on #22.

[Infra updates] Create a manually dispatched GitHub Actions workflow to delete model runs

Not every run created by the modeling pipeline(s) actually needs to be kept. Many runs are testing some sort of CI or pipeline infrastructure change and aren't serious candidates for model selection. These runs pollute the Athena model.* tables and incur unnecessary S3 storage costs. We should delete them when possible.

Manually deleting all the artifacts of a model run from the relevant S3 buckets is kind of a pain, so in the past we used a helper function (https://github.com/ccao-data/model-res-avm/blob/master/R/helpers.R#L35-L64) to delete unneeded runs.

This year, to make things easier, we should create a dedicated GitHub Actions workflow to delete erroneous model runs. The workflow should meet the following requirements:

  • Manual dispatch only with a deployment environment check
  • Takes a model run_id as an input
  • Tightly scoped IAM permissions to only allow recent (>= 2024) runs to be deleted
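
A rough sketch of the kind of deletion helper (akin to the one in R/helpers.R mentioned above) that the workflow could wrap; the bucket name and key prefix are placeholders:

# Hypothetical sketch: delete all S3 artifacts for a given model run_id
library(paws)

delete_model_run <- function(run_id, bucket = "ccao-model-artifacts-placeholder") {
  s3 <- paws::s3()
  objects <- s3$list_objects_v2(Bucket = bucket, Prefix = paste0("run_id=", run_id))
  for (obj in objects$Contents) {
    s3$delete_object(Bucket = bucket, Key = obj$Key)
  }
  invisible(length(objects$Contents))
}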

Separate methodology + potential flags for 212s

Class 212 buildings are mixed residential and commercial space, prototypically a small multi-family apartment building with ground-floor commercial. These buildings are incredibly tough to value and have historically been used to game the assessment system (adding a residential unit to your grocery store and calling it a 212 in order to get the lower 10% residential assessment rate rather than the 25% commercial rate). That ends this year.

We need to come up with a new, potentially separate valuation method for these properties. That could be additional 212-specific features or an entirely separate model. Either way, this issue should be used as a parent issue to track work on 212s.

This year (2023), we added flags for abnormally large 212s and recommended them for additional desk review. However, there are not many 212s in the south triad. For the city triad, we should take an extra look at the valuation methodology of 212s and perhaps consider a combined income-based valuation approach.

Add map of townhome complex IDs

The ingest stage of the pipeline creates townhome "complexes" by fuzzy grouping nearby properties using their characteristics. This ensures that nearly identical units in the same complex receive the same value. We should map the townhomes by "complex ID" to ensure that the fuzzy grouping is working accurately and is spatially constrained.

Test separate single-family and multi-family models

Prior CCAO residential models have used a single model that encompasses both single- and multi-family properties. We may achieve better accuracy for multi-family properties if we instead train a separate multi-family model.

Add baseline linear model (`glmnet`)

Previously, the CCAO ran a linear model (using glmnet) as a baseline to compare against our tree-based models. We should consider re-adding the linear model.
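
A minimal sketch of what the baseline spec might look like in parsnip (the penalty and mixture values are arbitrary placeholders):

# Hypothetical sketch: baseline penalized linear model via parsnip + glmnet
library(tidymodels)

baseline_spec <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet") %>%
  set_mode("regression")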

Add new filters for ingesting sales

default.vw_pin_sale will soon be unfiltered by default and we need to use certain conditions to make sure we don't ingest unwanted sales:

AND NOT sale.sale_filter_is_outlier
AND NOT sale.sale_filter_deed_type
AND NOT sale.sale_filter_less_than_10k
AND NOT sale.sale_filter_same_sale_within_365
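
A minimal sketch of how the ingest query might apply these conditions, assuming an existing DBI/Athena connection object (conn) and a placeholder column list:

# Hypothetical sketch: pull only non-filtered sales from default.vw_pin_sale
library(DBI)

sales <- dbGetQuery(conn, "
  SELECT sale.*
  FROM default.vw_pin_sale sale
  WHERE NOT sale.sale_filter_is_outlier
    AND NOT sale.sale_filter_deed_type
    AND NOT sale.sale_filter_less_than_10k
    AND NOT sale.sale_filter_same_sale_within_365
")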

Draft design doc for 2024 modeling infrastructure

We would like to make some updates to our modeling infrastructure ahead of the 2024 modeling season in order to speed up experimentation and integrate with GitHub Actions. Draft a design doc describing these changes so that we can align on them before opening up a milestone and beginning work.

Improve feature engineering

This issue is a catch-all for improved feature engineering efforts in the residential model. There are a variety of things we can try this year using recipes and various recipes extensions. We can add sub-issues to this issue as we brainstorm.

For reference, see @dfsnow's posit::conf notes and Kuhn's feature engineering book in our mini-library.

Test xgboost modeling engine

The Data Department recently performed some model benchmarking (ccao-data/report-model-benchmark) comparing the run times of XGBoost and LightGBM. We found that the current iteration of XGBoost runs much faster than LightGBM on most machines, while achieving similar performance.

We should test replacing LightGBM as the primary modeling engine in both models.
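
A minimal sketch of what the engine swap looks like at the parsnip level (tuning and hyperparameters omitted); the lightgbm engine assumes bonsai or Lightsnip is loaded:

# Hypothetical sketch: identical boost_tree() spec, two candidate engines
library(tidymodels)

lgbm_spec <- boost_tree(trees = 1000) %>%
  set_engine("lightgbm") %>%  # engine registered by bonsai or Lightsnip
  set_mode("regression")

xgb_spec <- boost_tree(trees = 1000) %>%
  set_engine("xgboost") %>%
  set_mode("regression")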

LightGBM

Pros

  • Native categorical support (easier feature engineering + clean SHAP values)
  • Better-maintained R package
  • Already have bindings for advanced features (via Lightsnip)
  • Slightly better performance for our data

Cons

  • Slightly slower for general training (as of XGBoost 2.0.0)
  • Massively slower for calculating SHAP values (a full order of magnitude)
  • Backend code seems much buggier
  • GPU support is lacking (+ hard to build for the R package)
  • Approximately 50,000 hyperparameters

XGBoost

Pros

  • Well-maintained codebase, will definitely exist in perpetuity
  • Excellent GPU and multi-core training support. Calculates SHAPs very quickly
  • More widely used than LightGBM

Cons

  • No native categorical support in the R package, even though the underlying XGBoost C++ supports it. Unlikely to change by the time we need to ship the 2024 model
  • R package support seems lacking

Separate Quarto doc into file per topic

Currently, the diagnostic output of the model is contained in a single file reports/performance.qmd. This works but leads to an extremely large file that takes a long time to render. We should break the main document into a document per topic, i.e. input data QC, model summary, assessment summary, etc. Each document should be able to be rendered independently. The goal is to enable faster iteration and separation of concerns.

We should be able to do this using the basic Quarto project setup, but it might take some finagling.

Might need assistance from @jeancochrane.

Add more model diagnostic plots

Now that we're moving to Quarto-first model diagnostics/reporting, we need to include a ton more diagnostic plots in the Quarto doc. To get started, I would research best practices around model diagnostics/explainability and look at what others have done re: diagnostic plots. Kaggle would be a good place to start; check the top N house price regressions for useful diagnostic plots.

You can also steal directly from the Tableau work we've already done.

Kick off this issue by making a checklist of possible plots as a PR comment.

@wagnerlmichael owns this. @Damonamajor can help. @wrridgeway can oversee.

Update res model README

The residential model README figures and copy need to be updated for the 2024 model.

  • Update changelog for 2024

Update to LightGBM 4.2.0

LightGBM has not had a major release in well over 6 months. However, there is a roadmap issue for release 4.0.0. Assuming that this release comes out before or during the 2023 modeling cycle, we should investigate whether or not updating to 4.0.0 is worthwhile.

Reasons to update:

  • Could have significant performance improvements, both in terms of training time and predictive results
  • Better GPU support (CUDA) seems to be on the roadmap. This would likely lead to much faster training times
  • Updates to the R package to add new features and advanced abilities

Reasons not to update:

  • Relative instability of the APIs. Could break the pipeline if there are lots of major changes
  • A lot seems to have changed in 4.0.0. Such a major release is bound to have bugs while still at x.0.0.
  • Performance improvements may not be worth the additional lift of updating

Edit: LightGBM 4.2.0 was released on CRAN on 12/8, so we'll go with that.

Revisit using a stacked model with the `stacks` package

Previously, the CCAO attempted to create a stacked/ensemble model using tidymodels functions. However, tidymodels' support for this method was at the time quite new, and it didn't work very well. We should revisit using an ensemble model utilizing the relatively new stacks package.
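
A hedged sketch of the stacks workflow, assuming tuned candidate results (lgbm_res and glmnet_res are placeholders) already exist with control settings that save predictions and workflows:

# Hypothetical sketch: ensemble candidate models with the stacks package
library(tidymodels)
library(stacks)

model_stack <- stacks() %>%
  add_candidates(lgbm_res) %>%    # placeholder tuning results
  add_candidates(glmnet_res) %>%  # placeholder tuning results
  blend_predictions() %>%
  fit_members()

# predict(model_stack, new_data) then works like any other fitted model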

Add Median APE as a performance metric

Private AVMs (Zillow, Redfin, etc.) tend to use Median Absolute Percent Error (MeAPE) as their main performance statistic. We should add this stat to the performance output created by 03-evaluate.
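
A minimal sketch of the statistic itself, assuming plain numeric vectors of observed sale prices and predictions (not the actual evaluate-stage code):

# Hypothetical sketch: median absolute percent error (MeAPE), expressed in percent
meape <- function(truth, estimate) {
  stopifnot(length(truth) == length(estimate))
  median(abs((estimate - truth) / truth), na.rm = TRUE) * 100
}

# Toy example with made-up values
meape(truth = c(100000, 250000, 400000), estimate = c(110000, 245000, 360000))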

Replace Quarto setup with `R/setup.R` call

The Quarto doc at reports/performance.qmd calls library() and many of the same setup functions called in R/setup.R. We should simplify this setup stage by replacing it with the same setup head now used in the pipeline, i.e.:

# Load libraries, helpers, and recipes from files
purrr::walk(list.files("R/", "\\.R$", full.names = TRUE), source)

# Load additional dev R libraries (see README#managing-r-dependencies)
suppressPackageStartupMessages({
  <all libraries not contained in the Depends: field of DESCRIPTION>
})

Add tables and plots from 2021 reports

The content from the 2021 model report was actually incredibly useful, particularly the topline stats. We should pull the following from those reports:

  • Model Breakdown by Township
  • Ratio Distribution by Township by Sale Price Decile
  • Overall Ratio Distribution (Model Pred. / Sale Price)
  • Spatial Distribution of Outliers
  • Feature Importance
  • Final Hyperparameters

Note

Basically all of the tables and graphs above now have their inputs pre-generated by the model pipeline. This means you need to remove the code that does calculation/aggregation and replace it with the appropriate data from the output/ directory.

Incorporate 2024 rollover changes to iasWorld

Per Mirella, the 2024 rollover will result in some minor architectural changes to how condos, land, and cards are prorated. We'll need to update this codebase to reflect any changes to the backend values.

Old issue text:

Valuations informed us that some PINs (very rarely) have a separate proration rate per card. We should adjust our code to use this rate where available, as well as institute a Desk Review check for different PIN <> card proration rates.

[Infra updates] Build and push a Docker image for the model

For the first step in the new model deployment pipeline described in https://github.com/ccao-data/model-res-avm/pull/21/files, we need to containerize our model code by building a Docker image for the model and pushing it to the GitHub container registry. There are a couple steps required for this:

  1. Define a Dockerfile that encapsulates the code and the environment necessary to run the model
  2. Define a new GitHub Actions workflow that authenticates with the container registry, checks if the image has changed, and pushes it to the container registry if it has
    1. It might be useful to use Docker's official composite actions for these operations

We'll also want to enable layer caching for the build; see this Docker guide for instructions. I don't think we'll need a cache key, since I think Docker layer caching should take care of that for us automatically, but if we do it should be enough to use dvc.lock.

Some of this work has already been sketched out in code available on GitLab: https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/master/.gitlab-ci.yml?ref_type=heads

Test racing tuning method from the `finetune` package

We may be able to cross-validate significantly faster using racing-based tune methods, which are available in the finetune R package.

We should also test the simulated annealing tune method.

Note: Be careful using racing with time-based CV. Check for seasonal variation in performance stats.
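
A minimal sketch of the racing call, assuming a placeholder workflow (model_wflow), time-based resamples (cv_folds), and a hyperparameter grid (param_grid):

# Hypothetical sketch: ANOVA-based racing instead of a full grid search
library(tidymodels)
library(finetune)

race_results <- tune_race_anova(
  model_wflow,              # placeholder workflow
  resamples = cv_folds,     # placeholder (time-based) resamples
  grid = param_grid,        # placeholder hyperparameter grid
  metrics = metric_set(rmse),
  control = control_race(verbose_elim = TRUE)
)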

Create Docker images of all old model tags

To improve model reproducibility, we should create Docker container images containing all the necessary dependencies to run all currently tagged versions of each model. This would involve building a Docker image of the specific R version used + building all the R dependencies captured in renv.lock.

These images should then be saved to each model's respective registry and tagged using the same git tag as the code, i.e. 2022-assessment-year.

Comp finder based QC

Once #41 is complete, we should use the comparables output to add additional QC checks to the main Quarto diagnostic document. For example, we could flag properties whose set of comparables is not within a set range or in cases where the comps are very physically distant from the property of interest.
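
An illustrative sketch of one such check, assuming a placeholder comps data frame with one row per subject PIN <> comp pair and a precomputed distance column; the threshold is arbitrary:

# Hypothetical sketch: flag PINs whose comparables are unusually far away
library(dplyr)

comp_distance_flags <- comps %>%  # comps is a placeholder data frame
  group_by(pin) %>%
  summarize(mean_comp_dist_km = mean(comp_dist_km, na.rm = TRUE)) %>%
  mutate(flag_distant_comps = mean_comp_dist_km > 5)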

Model SNS topic notifications are not being sent

I'm subscribed to the ccao-model-pipeline topic but it doesn't seem to be sending me emails when the model pipeline finishes running. I haven't confirmed, but my guess at the root cause is that we don't have AWS_SNS_ARN_MODEL_STATUS set in the container environment:

if (!is.na(Sys.getenv("AWS_SNS_ARN_MODEL_STATUS", unset = NA))) {

We should set this variable in the container and test to make sure that the notifications work again.

Cleanup Quarto code

The current Quarto code is pretty messy: lots of relative local paths, likely unnecessary S3/Athena calls, and generally inefficient code. Give it a quick pass to simplify it and remove any unnecessary overhead. Take this as an opportunity to trim unneeded visualizations and package dependencies as well.

Update pipeline with convenience functions

The tidymodels ecosystem has added convenience functions for things we previously had to program ourselves. We should replace our custom versions with the official tidymodels version, where appropriate.

  • Replace fit -> predict with last_fit()
  • Replace Lightsnip serialization with the bundle package (see the sketch after this list)
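
A rough sketch of both swaps in the list above; the workflow, split, and file names are placeholders:

# Hypothetical sketch: last_fit() for fit -> predict, bundle for model serialization
library(tidymodels)
library(bundle)

# last_fit() fits on the training portion of a split and predicts on the test portion
final_res <- last_fit(model_wflow, split = train_test_split)

# bundle()/unbundle() stand in for custom Lightsnip serialization logic
fitted_wflow <- extract_workflow(final_res)
saveRDS(bundle(fitted_wflow), "model.rds")
restored_wflow <- unbundle(readRDS("model.rds"))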

Test a Chicago-only model

Prior CCAO residential and condo models have used sales from all available triads for training the model. It's possible that sales from other triads are materially (but unobservably) different from city triad sales. Additionally, there is a variety of data available only for the City of Chicago, such as zoning info, specific permits, etc.

We should test running a City-only model using expanded data and only City sales, then compare performance to City-only ratios from other county-wide models.
