model-res-avm's People

Contributors

asanhueza1, damonamajor, dfsnow, jeancochrane, njardine, wagnerlmichael, wrridgeway

model-res-avm's Issues

Rewrite model pipelines for speed, simplicity

Last year's model was slow during the post-modeling stages (assess, evaluate). These performance issues are likely due to heavy use of complicated dplyr calls. We should refactor each pipeline stage to improve runtimes. The easiest option is to simply drop in dtplyr so the existing dplyr code executes via data.table; the more prudent option is to rewrite the slow stages directly in polars or data.table.
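
As a rough illustration of the dtplyr option (not code from the existing pipeline; the data frame and column names below are placeholders), the same dplyr syntax can be kept while execution happens via data.table:

# Hypothetical sketch: run an existing dplyr aggregation through data.table via dtplyr
library(dplyr)
library(dtplyr)

# assessment_data and its columns (township_code, estimate, sale_price) are placeholders
ratios_by_town <- assessment_data %>%
  lazy_dt() %>%
  mutate(ratio = estimate / sale_price) %>%
  group_by(township_code) %>%
  summarize(median_ratio = median(ratio, na.rm = TRUE)) %>%
  as_tibble()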

[Infra updates] Generate and publish a Quarto doc with performance results on each model run

This year, we're going to consolidate all diagnostic reporting into a single Quarto document that gets created for each model run. The document will knit at the end of each run, once the model is finished training, has created performance statistics, etc. The output PDF (or HTML) will be uploaded to S3 along with the other model artifacts, then linked in SNS completion notifications.

This Quarto doc will be the primary way we evaluate individual model performance. Cross-model comparison will still be done via Tableau.

model_qc.qmd might be a good starting place for defining the Quarto doc itself.

Tasks

  • Update the 05-finalize.R step of the model so that it generates a Quarto doc containing performance results, uploads it to S3, and updates the SNS notification body to include a link to the doc in S3.
  • Consolidate existing Quarto docs (in reports/) into a single document
  • Add simple residuals scatterplots by township
  • Moran's I / spatial autocorrelation stats
  • IAAO performance stats by area
  • Lorenz curves by area
  • Check for highly correlated variables and always inspect residual plots (calibration plots via the probably package's cal_plot_regression(); see the sketch after this list)
  • Explore using dtreeviz - https://github.com/parrt/dtreeviz/blob/master/notebooks/dtreeviz_lightgbm_visualisations.ipynb
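
A minimal sketch of the calibration-plot bullet above, assuming a placeholder data frame of test-set predictions with observed sale_price and predicted .pred columns:

# Hypothetical sketch: regression calibration plot via the probably package
library(probably)

# test_predictions is a placeholder data frame of holdout predictions
cal_plot_regression(
  test_predictions,
  truth = sale_price,
  estimate = .pred
)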

Improve townhome fuzzy grouping, conditionally add 211s to townhome groups

In 2022, we improved the townhome valuation methodology by implementing "fuzzy grouping". Basically, townhome units with similar, but not perfectly identical, features should receive similar values.

Valuations pointed out that 211s are often mixed into townhome (210) complexes, such that you may have an alternating 211 - 210 - 211 - 210 pattern. Due to the different methodologies between 210s and 211s, neighboring units may receive very different values.

We should expand the fuzzy grouping methodology to conditionally include 211s that are mixed in with townhome (210) classes. We can use the building touching indicator (ccao-data/data-architecture#7) to help identify such 211s.
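
A toy sketch of what the conditional inclusion could look like; the data frame, columns (class, touching_210, char_yrblt, char_bldg_sf), and rounding choices are all placeholders, not the actual ingest logic:

# Hypothetical sketch: include 211s in townhome complex groups only when they
# touch a 210 per the building touching indicator
library(dplyr)

complex_candidates <- chars_data %>%  # chars_data is a placeholder
  filter(class == "210" | (class == "211" & touching_210)) %>%
  group_by(
    township_code,
    decade_built = round(char_yrblt, -1),
    bldg_sf_bin = round(char_bldg_sf, -2)
  ) %>%
  mutate(complex_id = cur_group_id()) %>%
  ungroup()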

[Infra updates] Enable EC2 backend for model Batch jobs

Now that we've established a better understanding of the basic IAM permissions and networking required to run Batch jobs with a Fargate backend (#26), we should take another crack at enabling the EC2 backend. The EC2 backend would allow us to select instance types with GPUs, and it's also cheaper than Fargate.

Investigate replacing Lightsnip with bonsai

In 2022, both the residential and condo models used a custom shim R package to bridge the gap between LightGBM and Tidymodels. This package is called Lightsnip. It has a variety of extra features and improvements that were not included in the previous shim package (treesnip).

Since last year's model, Tidymodels has released an officially supported parsnip shim for LightGBM, called bonsai. We should investigate whether or not bonsai can reasonably replace Lightsnip.

Some reasons to replace include:

  • Official support from Tidymodels maintainers, rather than a one-off package
  • Better integration with parsnip and its features

Some reasons not to replace include:

  • Lightsnip's extra features and fixes (noted above) may not yet have bonsai equivalents

[Infra updates] Add a workflow to run the model on PRs and `workflow_dispatch`

Building on #22, we need a GitHub Actions workflow that can run the model. The workflow should:

  • Depend on the workflow that pushes the Docker image to the GitHub Container Registry in #22
  • Run on:
    • Every commit to every pull request
    • The workflow_dispatch event
  • Deploy to an environment called staging that requires manual approval
  • Define a job to run the model

There are two ways we could define a job to run the model. Try option 1 first, and fall back to option 2 if CML doesn't work as advertised.

Option 1: Use CML self-hosted runners

  • Define a job, launch-runner, to start an AWS spot EC2 instance using cml runner
    • Set sensible defaults for the instance options, but allow them to be overridden via workflow inputs
  • Define a job, run-model, to run the model on the EC2 instance created by CML
    • Set the runs-on key for the job to point at the runner
      • This will cause steps defined in the job to run on the remote runner
    • Run the model using dvc pull and dvc repro

Option 2: Write custom code to run model jobs on AWS Batch

  • Run Terraform to make sure an AWS Batch job queue and job definition exist for the PR
    • The job definition should define the code that will be used to run the model itself, e.g. dvc pull and dvc repro
  • Use the AWS CLI to submit a job to the Batch queue
  • Use the AWS CLI to poll the job status until it has a terminal status (SUCCEEDED or FAILED)
    • Once the job has at least a RUNNING status, use the logStreamName from the job's container details to print a link to its logs (see the sketch after this list)
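
A hedged sketch of the submit-and-poll loop from Option 2, using the paws R package in place of the raw AWS CLI; the job, queue, and definition names are placeholders:

# Hypothetical sketch: submit a Batch job and poll until it reaches a terminal state
library(paws)

batch <- paws::batch()

job <- batch$submit_job(
  jobName = "model-res-avm-pr-run",       # placeholder
  jobQueue = "model-res-avm-job-queue",   # placeholder
  jobDefinition = "model-res-avm-job-def" # placeholder
)

repeat {
  status <- batch$describe_jobs(jobs = list(job$jobId))$jobs[[1]]$status
  message("Batch job status: ", status)
  if (status %in% c("SUCCEEDED", "FAILED")) break
  Sys.sleep(30)
}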

Depends on #22.

[Infra updates] Create a manually dispatched GitHub Actions workflow to delete model runs

Not every run created by the modeling pipeline(s) actually needs to be kept. Many runs are testing some sort of CI or pipeline infrastructure change and aren't serious candidates for model selection. These runs pollute the Athena model.* tables and incur unnecessary S3 storage costs. We should delete them when possible.

Manually deleting all the artifacts of a model run from the relevant S3 buckets is kind of a pain, so in the past we used a helper function (https://github.com/ccao-data/model-res-avm/blob/master/R/helpers.R#L35-L64) to delete unneeded runs.

This year, to make things easier, we should create a dedicated GitHub Actions workflow to delete erroneous model runs. The workflow should meet the following requirements:

  • Manual dispatch only with a deployment environment check
  • Takes a model run_id as an input
  • Tightly scoped IAM permissions to only allow recent (>= 2024) runs to be deleted
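
A rough sketch of the kind of deletion helper (akin to the one in R/helpers.R mentioned above) that the workflow could wrap; the bucket name and key prefix are placeholders:

# Hypothetical sketch: delete all S3 artifacts for a given model run_id
library(paws)

delete_model_run <- function(run_id, bucket = "ccao-model-artifacts-placeholder") {
  s3 <- paws::s3()
  objects <- s3$list_objects_v2(Bucket = bucket, Prefix = paste0("run_id=", run_id))
  for (obj in objects$Contents) {
    s3$delete_object(Bucket = bucket, Key = obj$Key)
  }
  invisible(length(objects$Contents))
}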

Separate methodology + potential flags for 212s

Class 212 buildings are mixed residential and commercial space, prototypically a small multi-family apartment building with ground-floor commercial. These buildings are incredibly tough to value and have historically been used to game the assessment system (adding a residential unit to your grocery store and calling it a 212 in order to get the lower 10% residential assessment rate rather than the 25% commercial rate). That ends this year.

We need to come up with a new, potentially separate valuation method for these properties. That could be additional 212-specific features or an entirely separate model. Either way, this issue should be used as a parent issue to track work on 212s.

This year (2023), we added flags for abnormally large 212s and recommended them for additional desk review. However, there are not many 212s in the south triad. For the city triad, we should take an extra look at the valuation methodology of 212s and perhaps consider a combined income-based valuation approach.

Add map of townhome complex IDs

The ingest stage of the pipeline creates townhome "complexes" by fuzzy grouping nearby properties using their characteristics. This ensures that nearly identical units in the same complex receive the same value. We should map the townhomes by "complex ID" to ensure that the fuzzy grouping is working accurately and is spatially constrained.

Test separate single-family and multi-family models

Prior CCAO residential models have used a single model that encompasses both single- and multi-family properties. We may achieve better accuracy for multi-family properties if we instead train a separate multi-family model.

Add baseline linear model (`glmnet`)

Previously, the CCAO ran a linear model (using glmnet) as a baseline to compare against our tree-based models. We should consider re-adding the linear model.
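
A minimal sketch of what the baseline spec might look like in parsnip (the penalty and mixture values are arbitrary placeholders):

# Hypothetical sketch: baseline penalized linear model via parsnip + glmnet
library(tidymodels)

baseline_spec <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet") %>%
  set_mode("regression")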

Add new filters for ingesting sales

default.vw_pin_sale will soon be unfiltered by default and we need to use certain conditions to make sure we don't ingest unwanted sales:

AND NOT sale.sale_filter_is_outlier
AND NOT sale.sale_filter_deed_type
AND NOT sale.sale_filter_less_than_10k
AND NOT sale.sale_filter_same_sale_within_365
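
A minimal sketch of how the ingest query might apply these conditions, assuming an existing DBI/Athena connection object (conn) and a placeholder column list:

# Hypothetical sketch: pull only non-filtered sales from default.vw_pin_sale
library(DBI)

sales <- dbGetQuery(conn, "
  SELECT sale.*
  FROM default.vw_pin_sale sale
  WHERE NOT sale.sale_filter_is_outlier
    AND NOT sale.sale_filter_deed_type
    AND NOT sale.sale_filter_less_than_10k
    AND NOT sale.sale_filter_same_sale_within_365
")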

Draft design doc for 2024 modeling infrastructure

We would like to make some updates to our modeling infrastructure ahead of the 2024 modeling season in order to speed up experimentation and integrate with GitHub Actions. Draft a design doc describing these changes so that we can align on them before opening up a milestone and beginning work.

Improve feature engineering

This issue is a catch-all for improved feature engineering efforts in the residential model. There are a variety of things we can try this year using recipes and various recipes extensions. We can add sub-issues to this issue as we brainstorm.

For reference, see @dfsnow's posit::conf notes and Kuhn's feature engineering book in our mini-library.

Test xgboost modeling engine

The Data Department recently performed some model benchmarking (ccao-data/report-model-benchmark) comparing the run times of XGBoost and LightGBM. We found that the current iteration of XGBoost runs much faster than LightGBM on most machines, while achieving similar performance.

We should test replacing LightGBM as the primary modeling engine in both models.
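
A minimal sketch of what the engine swap looks like at the parsnip level (tuning and hyperparameters omitted); the lightgbm engine assumes bonsai or Lightsnip is loaded:

# Hypothetical sketch: identical boost_tree() spec, two candidate engines
library(tidymodels)

lgbm_spec <- boost_tree(trees = 1000) %>%
  set_engine("lightgbm") %>%  # engine registered by bonsai or Lightsnip
  set_mode("regression")

xgb_spec <- boost_tree(trees = 1000) %>%
  set_engine("xgboost") %>%
  set_mode("regression")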

LightGBM

Pros

  • Native categorical support (easier feature engineering + clean SHAP values)
  • Better-maintained R package
  • Already have bindings for advanced features (via Lightsnip)
  • Slightly better performance for our data

Cons

  • Slightly slower for general training (as of XGBoost 2.0.0)
  • Massively slower for calculating SHAP values (a full order of magnitude)
  • Backend code seems much buggier
  • GPU support is lacking (+ hard to build for the R package)
  • Approximately 50,000 hyperparameters

XGBoost

Pros

  • Well-maintained codebase, will definitely exist in perpetuity
  • Excellent GPU and multi-core training support. Calculates SHAPs very quickly
  • More widely used than LightGBM

Cons

  • No native categorical support in the R package, even though the underlying XGBoost C++ supports it. Unlikely to change by the time we need to ship the 2024 model
  • R package support seems lacking

Separate Quarto doc into file per topic

Currently, the diagnostic output of the model is contained in a single file reports/performance.qmd. This works but leads to an extremely large file that takes a long time to render. We should break the main document into a document per topic, i.e. input data QC, model summary, assessment summary, etc. Each document should be able to be rendered independently. The goal is to enable faster iteration and separation of concerns.

We should be able to do this using the basic Quarto project setup, but it might take some finagling.

Might need assistance from @jeancochrane.

Add more model diagnostic plots

Now that we're moving to Quarto-first model diagnostics/reporting, we need to include a ton more diagnostic plots in the Quarto doc. To get started, I would research best practices around model diagnostics/explainability and look at what others have done re: diagnostic plots. Kaggle would be a good place to start; check the top N house price regressions for useful diagnostic plots.

You can also steal directly from the Tableau work we've already done.

Kick off this issue by making a checklist of possible plots as a PR comment.

@wagnerlmichael owns this. @Damonamajor can help. @wrridgeway can oversee.

Update res model README

The residential model README figures and copy need to be updated for the 2024 model.

  • Update changelog for 2024

Update to LightGBM 4.2.0

LightGBM has not had a major release in well over 6 months. However, there is a roadmap issue for release 4.0.0. Assuming that this release comes out before or during the 2023 modeling cycle, we should investigate whether or not updating to 4.0.0 is worthwhile.

Reasons to update:

  • Could have significant performance improvements, both in terms of training time and predictive results
  • Better GPU support (CUDA) seems to be on the roadmap. This would likely lead to much faster training times
  • Updates to the R package to add new features and advanced abilities

Reasons not to update:

  • Relative instability of the APIs. Could break the pipeline if there are lots of major changes
  • A lot seems to have changed in 4.0.0. Such a major release is bound to have bugs while still at x.0.0.
  • Performance improvements may not be worth the additional lift of updating

Edit: LightGBM 4.2.0 was released on CRAN on 12/8, so we'll go with that.

Revisit using a stacked model with the `stacks` package

Previously, the CCAO attempted to create a stacked/ensemble model using tidymodels functions. However, tidymodels' support for this method was at the time quite new, and it didn't work very well. We should revisit using an ensemble model utilizing the relatively new stacks package.
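
A hedged sketch of the stacks workflow, assuming tuned candidate results (lgbm_res and glmnet_res are placeholders) already exist with control settings that save predictions and workflows:

# Hypothetical sketch: ensemble candidate models with the stacks package
library(tidymodels)
library(stacks)

model_stack <- stacks() %>%
  add_candidates(lgbm_res) %>%    # placeholder tuning results
  add_candidates(glmnet_res) %>%  # placeholder tuning results
  blend_predictions() %>%
  fit_members()

# predict(model_stack, new_data) then works like any other fitted model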

Add Median APE as a performance metric

Private AVMs (Zillow, Redfin, etc.) tend to use Median Absolute Percent Error (MeAPE) as their main performance statistic. We should add this stat to the performance output created by 03-evaluate.
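
A minimal sketch of the statistic itself, assuming plain numeric vectors of observed sale prices and predictions (not the actual evaluate-stage code):

# Hypothetical sketch: median absolute percent error (MeAPE), expressed in percent
meape <- function(truth, estimate) {
  stopifnot(length(truth) == length(estimate))
  median(abs((estimate - truth) / truth), na.rm = TRUE) * 100
}

# Toy example with made-up values
meape(truth = c(100000, 250000, 400000), estimate = c(110000, 245000, 360000))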

Replace Quarto setup with `R/setup.R` call

The Quarto doc at reports/performance.qmd calls library() and many of the same setup functions called in R/setup.R. We should simplify this setup stage by replacing it with the same setup head now used in the pipeline, i.e.:

# Load libraries, helpers, and recipes from files
purrr::walk(list.files("R/", "\\.R$", full.names = TRUE), source)

# Load additional dev R libraries (see README#managing-r-dependencies)
suppressPackageStartupMessages({
  <all libraries not contained in the Depends: field of DESCRIPTION>
})

Add tables and plots from 2021 reports

The content from the 2021 model report was actually incredibly useful, particularly the topline stats. We should pull the following from those reports:

  • Model Breakdown by Township
  • Ratio Distribution by Township by Sale Price Decile
  • Overall Ratio Distribution (Model Pred. / Sale Price)
  • Spatial Distribution of Outliers
  • Feature Importance
  • Final Hyperparameters

Note

Basically all of the tables and graphs above now have their inputs pre-generated by the model pipeline. This means you need to remove the code that does calculation/aggregation and replace it with the appropriate data from the output/ directory.

Incorporate 2024 rollover changes to iasWorld

Per Mirella, the 2024 rollover will result in some minor architectural changes to how condos, land, and cards are prorated. We'll need to update this codebase to reflect any changes to the backend values.

Old issue text:

Valuations informed us that some PINs (very rarely) have a separate proration rate per card. We should adjust our code to use this rate where available, as well as institute a Desk Review check for different PIN <> card proration rates.

[Infra updates] Build and push a Docker image for the model

For the first step in the new model deployment pipeline described in https://github.com/ccao-data/model-res-avm/pull/21/files, we need to containerize our model code by building a Docker image for the model and pushing it to the GitHub container registry. There are a couple steps required for this:

  1. Define a Dockerfile that encapsulates the code and the environment necessary to run the model
  2. Define a new GitHub Actions workflow that authenticates with the container registry, checks if the image has changed, and pushes it to the container registry if it has
    1. It might be useful to use Docker's official composite actions for these operations

We'll also want to enable layer caching for the build; see this Docker guide for instructions. I don't think we'll need a cache key, since I think Docker layer caching should take care of that for us automatically, but if we do it should be enough to use dvc.lock.

Some of this work has already been sketched out in code available on GitLab: https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/master/.gitlab-ci.yml?ref_type=heads

Test racing tuning method from the `finetune` package

We may be able to cross-validate significantly faster using racing-based tune methods, which are available in the finetune R package.

We should also test the simulated annealing tune method.

Note: Be careful using racing with time-based CV. Check for seasonal variation in performance stats.
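
A minimal sketch of the racing call, assuming a placeholder workflow (model_wflow), time-based resamples (cv_folds), and a hyperparameter grid (param_grid):

# Hypothetical sketch: ANOVA-based racing instead of a full grid search
library(tidymodels)
library(finetune)

race_results <- tune_race_anova(
  model_wflow,              # placeholder workflow
  resamples = cv_folds,     # placeholder (time-based) resamples
  grid = param_grid,        # placeholder hyperparameter grid
  metrics = metric_set(rmse),
  control = control_race(verbose_elim = TRUE)
)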

Create Docker images of all old model tags

To improve model reproducibility, we should create Docker container images containing all the necessary dependencies to run all currently tagged versions of each model. This would involve building a Docker image of the specific R version used + building all the R dependencies captured in renv.lock.

These images should then be saved to each model's respective registry and tagged using the same git tag as the code, i.e. 2022-assessment-year.

Comp finder based QC

Once #41 is complete, we should use the comparables output to add additional QC checks to the main Quarto diagnostic document. For example, we could flag properties whose set of comparables is not within a set range or in cases where the comps are very physically distant from the property of interest.
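
An illustrative sketch of one such check, assuming a placeholder comps data frame with one row per subject PIN <> comp pair and a precomputed distance column; the threshold is arbitrary:

# Hypothetical sketch: flag PINs whose comparables are unusually far away
library(dplyr)

comp_distance_flags <- comps %>%  # comps is a placeholder data frame
  group_by(pin) %>%
  summarize(mean_comp_dist_km = mean(comp_dist_km, na.rm = TRUE)) %>%
  mutate(flag_distant_comps = mean_comp_dist_km > 5)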

Model SNS topic notifications are not being sent

I'm subscribed to the ccao-model-pipeline topic but it doesn't seem to be sending me emails when the model pipeline finishes running. I haven't confirmed, but my guess at the root cause is that we don't have AWS_SNS_ARN_MODEL_STATUS set in the container environment:

if (!is.na(Sys.getenv("AWS_SNS_ARN_MODEL_STATUS", unset = NA))) {

We should set this variable in the container and test to make sure that the notifications work again.

Cleanup Quarto code

The current Quarto code is pretty messy: lots of relative local paths, likely unnecessary S3/Athena calls, and generally inefficient code. Give it a quick pass to simplify it and remove any unnecessary overhead. Take this as an opportunity to trim unneeded visualizations and package dependencies as well.

Update pipeline with convenience functions

The tidymodels ecosystem has added convenience functions for things we previously had to program ourselves. We should replace our custom versions with the official tidymodels version, where appropriate.

  • Replace fit -> predict with last_fit()
  • Replace Lightsnip serialization with the bundle package (see the sketch after this list)
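
A rough sketch of both swaps in the list above; the workflow, split, and file names are placeholders:

# Hypothetical sketch: last_fit() for fit -> predict, bundle for model serialization
library(tidymodels)
library(bundle)

# last_fit() fits on the training portion of a split and predicts on the test portion
final_res <- last_fit(model_wflow, split = train_test_split)

# bundle()/unbundle() stand in for custom Lightsnip serialization logic
fitted_wflow <- extract_workflow(final_res)
saveRDS(bundle(fitted_wflow), "model.rds")
restored_wflow <- unbundle(readRDS("model.rds"))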

Test a Chicago-only model

Prior CCAO residential and condo models have used sales from all available triads for training the model. It's possible that sales from other triads are materially (but unobservably) different from city triad sales. Additionally, there is a variety of data available only for the City of Chicago, such as zoning info, specific permits, etc.

We should test running a City-only model using expanded data and only City sales, then compare performance to City-only ratios from other county-wide models.
