ccao-data / model-res-avm
Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)
License: GNU Affero General Public License v3.0
Currently, the CCAO uses LightGBM's native categorical handling to deal with categorical features. However, there may be more efficient or better-performing ways to handle categoricals using various embeddings. See https://embed.tidymodels.org/
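For example, a likelihood (effect) encoding via the embed package might look like the minimal sketch below; the sale_price outcome and train data frame are placeholder names, not the model's actual objects:

# Minimal sketch: likelihood-encode categorical predictors using {embed}
library(recipes)
library(embed)

rec <- recipe(sale_price ~ ., data = train) %>%
  step_lencode_glm(all_nominal_predictors(), outcome = dplyr::vars(sale_price))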
Last year's model had some performance issues (slowness) during the post-modeling stages (assess, evaluate). I believe these issues are likely due to heavy use of complicated dplyr calls. We should refactor each pipeline stage to improve runtimes. The easiest option is to simply drop in dtplyr and use data.table to speed things up, but the more prudent option is to rewrite in polars or data.table.
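As a rough illustration of the drop-in option, dtplyr mostly just requires wrapping the input in lazy_dt(); the data frame and column names here are hypothetical:

# Minimal sketch: translate existing dplyr verbs to data.table via {dtplyr}
library(dplyr)
library(dtplyr)

assessments %>%
  lazy_dt() %>%                                   # wrap in a lazy data.table
  group_by(meta_township_code) %>%
  summarize(median_fmv = median(pred_pin_final_fmv)) %>%
  as_tibble()                                     # force computation, return a tibble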
It seems that this is a typo, and the res model should use the loc_tax_municipality_name feature like the condo model does:

Line 203 in e956384

Discussion here: ccao-data/model-condo-avm#4 (comment). As part of this issue, we should also remove the hardcoded workaround in model-condo-avm that's discussed here.
This year, we're going to consolidate all diagnostic reporting into a single Quarto document that gets created for each model run. The document will knit at the end of each run, once the model is finished training, has created performance statistics, etc. The output PDF (or HTML) will be uploaded to S3 along with the other model artifacts, then linked in SNS completion notifications.
This Quarto doc will be the primary way we evaluate individual model performance. Cross-model comparison will still be done via Tableau.
model_qc.qmd might be a good starting place for defining the Quarto doc itself.
- Update the 05-finalize.R step of the model so that it generates a Quarto doc containing performance results, uploads it to S3, and updates the SNS notification body to include a link to the doc in S3
- Consolidate the existing reports (reports/) into a single document
- Add calibration plots, likely via cal_plot_regression() from the probably package (see the sketch after this list)
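For the calibration bullet, a minimal sketch using the probably package might look like the following; the preds data frame and its columns are assumptions:

# Minimal sketch: calibration plot of predictions vs. observed sale prices
library(probably)

cal_plot_regression(preds, truth = sale_price, estimate = .pred)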
In 2022, we improved the townhome valuation methodology by implementing "fuzzy grouping". Basically, townhome units with similar, but not perfectly identical, features should receive similar values.
Valuations pointed out that 211s are often mixed into townhome (210) complexes, such that you may have an alternating 211 - 210 - 211 - 210 pattern. Due to the different methodologies between 210s and 211s, neighboring units may receive very different values.
We should expand the fuzzy grouping methodology to cover 211s that are mixed in with townhome classes. We can use the building touching indicator (ccao-data/data-architecture#7) to help identify such 211s.
Now that we've established a better understanding of the basic IAM permissions and networking required to run Batch jobs with a Fargate backend (#26), we should take another crack at enabling the EC2 backend. The EC2 backend would allow us to select instance types with GPUs, and it's also cheaper than Fargate.
In 2022, both the residential and condo models used a custom shim R package to bridge the gap between LightGBM and Tidymodels. This package is called Lightsnip. It has a variety of extra features and improvements that were not included in the previous shim package (treesnip).
Since last year's model, Tidymodels has released an officially supported parsnip shim for LightGBM, called bonsai. We should investigate whether or not bonsai can reasonably replace Lightsnip.
Some reasons to replace include:
Some reasons not to replace include:
We established an experimental sale comp finding methodology in ccao-data/lightsnip@5b00f48, which uses the tree structure of LightGBM/XGBoost models to find similar properties from the training data. We should integrate this method into the 04-interpret stage and output it to Athena in a new model.comp table.
Note that we should probably resolve ccao-data/lightsnip#7 first.
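A minimal sketch of the underlying idea, assuming a trained LightGBM booster bst and feature matrices train_mat/target_mat (all hypothetical names); predict(type = "leaf") requires lightgbm >= 4.0:

# Minimal sketch: find comps by counting shared leaf nodes across trees
library(lightgbm)

train_leaves <- predict(bst, train_mat, type = "leaf")    # n_train x n_trees
target_leaves <- predict(bst, target_mat, type = "leaf")  # n_target x n_trees

# Similarity = share of trees in which two observations land in the same leaf
leaf_similarity <- function(target_row, train_leaves) {
  rowMeans(sweep(train_leaves, 2, target_row, FUN = "=="))
}

# Top 5 training comps for the first target property
comps <- order(leaf_similarity(target_leaves[1, ], train_leaves), decreasing = TRUE)[1:5]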
In order to make the models more reproducible, we should also export the compiler settings used for compiling LightGBM/XGBoost from source. This should be included in the export sent to model.metadata.
Building on #22, we need a GitHub Actions workflow that can run the model. The workflow should:
- Trigger on a workflow_dispatch event
- Run in a staging environment that requires manual approval

There are two ways we could define a job to run the model. Try option 1 first, and fall back to option 2 if CML doesn't work as advertised.

Option 1 uses CML and defines two jobs:

- launch-runner, to start an AWS spot EC2 instance using cml runner
- run-model, to run the model on the EC2 instance created by CML (set the runs-on key for the job to point at the runner) by calling dvc pull and dvc repro

Option 2 runs dvc pull and dvc repro in an AWS Batch job and polls the job until it reaches a terminal status (SUCCEEDED or FAILED). Once the job has RUNNING status, use the jobStreamName parameter to print a link to its logs.

Depends on #22.
Not every run created by the modeling pipeline(s) actually needs to be kept. Many runs are testing some sort of CI or pipeline infrastructure change and aren't serious candidates for model selection. These runs pollute the Athena model.* tables and cost S3 storage. We should delete them when possible.
Manually deleting all the artifacts of a model run from the relevant S3 buckets is kind of a pain, so in the past we used a helper function (https://github.com/ccao-data/model-res-avm/blob/master/R/helpers.R#L35-L64) to delete unneeded runs.
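A minimal sketch of what such a helper does, assuming the aws.s3 package; the bucket name and run-ID-keyed prefix here are hypothetical:

# Minimal sketch: delete all S3 artifacts for a given run ID
library(aws.s3)
library(purrr)

delete_run <- function(run_id, bucket = "example-model-bucket") {
  objs <- get_bucket(bucket, prefix = paste0("run/", run_id))
  walk(objs, ~ delete_object(.x[["Key"]], bucket = bucket))
}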
This year, to make things easier, we should create a dedicated GitHub Actions workflow to delete erroneous model runs. The workflow should meet the following requirements:
- Accept a run_id as an input

Class 212 buildings are mixed residential and commercial space, prototypically a multi-family small apartment building with ground-floor commercial. These buildings are incredibly tough to value and have historically been used to game the assessment system (adding a residential unit to your grocery store and calling it a 212 in order to get the lower 10% residential assessment rather than the 25% commercial assessment). That ends this year.
We need to come up with a new, potentially separate valuation method for these properties. That could be additional 212-specific features or an entirely separate model. Either way, this issue should be used as a parent issue to track work on 212s.
This year (2023), we added flags for abnormally large 212s and recommended them for additional desk review. However, there are not many 212s in the south triad. For the city triad, we should take an extra look at the valuation methodology of 212s and perhaps consider a combined income-based valuation approach.
All the sales validation code was factored out into its own repository: https://github.com/ccao-data/model-sales-val
Therefore, we should remove all traces of the sales val code from the 00-ingest stage of each respective model.
The ingest stage of the pipeline creates townhome "complexes" by fuzzy grouping nearby properties using their characteristics. This ensures that nearly identical units in the same complex receive the same value. We should map the townhomes by "complex ID" to ensure that the fuzzy grouping is working accurately and is spatially constrained.
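A minimal sketch of such a map, assuming an sf data frame townhomes with a complex_id column (both hypothetical names):

# Minimal sketch: map townhome units colored by fuzzy-grouped complex ID
library(sf)
library(ggplot2)

ggplot(townhomes) +
  geom_sf(aes(color = factor(complex_id)), size = 0.5) +
  theme_void() +
  theme(legend.position = "none")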
Prior CCAO residential models have used a single model that encompasses both single- and multi-family properties. We may achieve better accuracy for multi-family properties if we instead split them out into a separate model.
The draft SHAP reporting we did in report-shap-values could have some pretty useful diagnostic code. Let's try to pull code out of it where possible and add it to the main performance report. Start by making a PR comment with a list of possible plots you could make using the SHAP code.
We recently added Gini-based vertical equity statistics to the AssessR package: ccao-data/assessr@ac9df08. We should add MKI as one of the performance stats calculated during the 03-evaluate stage.
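A minimal sketch, assuming assessr is installed from the linked commit and that ratios_df is a hypothetical data frame of assessed values and sale prices:

# Minimal sketch: compute the Modified Kakwani Index (MKI) with {assessr}
library(assessr)

mki(ratios_df$assessed_value, ratios_df$sale_price)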
The initial argument of tune_bayes() can take a result from tune_grid(). We should use this to "seed" tune_bayes() with a manual or space-filling grid.
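A minimal sketch of the seeding pattern, assuming a workflow wflow and resamples folds already exist (both hypothetical names):

# Minimal sketch: seed Bayesian tuning with an initial grid search
library(tune)

grid_res <- tune_grid(wflow, resamples = folds, grid = 20)
bayes_res <- tune_bayes(wflow, resamples = folds, initial = grid_res, iter = 25)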
We have some existing QC reports that should be consolidated into the new Quarto report under reports/performance.qmd. This new report is generated for every run and saved to S3.
Previously, the CCAO ran a linear model (using glmnet) as a sort of baseline to compare to our tree-based models. We should consider re-adding the linear model.
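A minimal sketch of such a baseline spec via parsnip; the penalty and mixture values are placeholders, not tuned choices:

# Minimal sketch: elastic net baseline model spec
library(parsnip)

lm_spec <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet") %>%
  set_mode("regression")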
default.vw_pin_sale will soon be unfiltered by default and we need to use certain conditions to make sure we don't ingest unwanted sales:
AND NOT sale.sale_filter_is_outlier
AND NOT sale.sale_filter_deed_type
AND NOT sale.sale_filter_less_than_10k
AND NOT sale.sale_filter_same_sale_within_365
We would like to make some updates to our modeling infrastructure ahead of the 2024 modeling season in order to speed up experimentation and integrate with GitHub Actions. Draft a design doc describing these changes so that we can align on them before opening up a milestone and beginning work.
This issue is a catch-all for improved feature engineering efforts in the residential model. There are a variety of things we can try this year using recipes and various recipes extensions. We can add sub-issues to this issue as we brainstorm.
For reference, see @dfsnow's posit::conf notes and Kuhn's feature engineering book in our mini-library.
The Data Department recently performed some model benchmarking (ccao-data/report-model-benchmark) comparing the run times of XGBoost and LightGBM. We found that the current iteration of XGBoost runs much faster than LightGBM on most machines, while achieving similar performance.
We should test replacing LightGBM as the primary modeling engine in both models.
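A minimal sketch of the swap within the existing tidymodels setup; the hyperparameter values are placeholders:

# Minimal sketch: XGBoost spec to test against the current LightGBM engine
library(parsnip)

xgb_spec <- boost_tree(trees = 1000, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")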
To improve reproducibility and build times, we should limit the packages used in the pipeline to only packages that are absolutely critical.
Per recommendation from folks at posit::conf, we should test out the cubist engine from the rules R package to see how it performs on our data.
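A minimal sketch of the spec; the committee and neighbor counts are placeholders:

# Minimal sketch: Cubist rule-based regression via {rules}
library(parsnip)
library(rules)

cubist_spec <- cubist_rules(committees = 25, neighbors = 5) %>%
  set_engine("Cubist")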
Currently, the diagnostic output of the model is contained in a single file, reports/performance.qmd. This works but leads to an extremely large file that takes a long time to render. We should break the main document into a document per topic, i.e. input data QC, model summary, assessment summary, etc. Each document should be able to be rendered independently. The goal is to enable faster iteration and separation of concerns.
We should be able to do this using the basic Quarto project setup, but it might take some finagling.
Might need assistance from @jeancochrane.
Now that we're moving to Quarto-first model diagnostics/reporting, we need to include a ton more diagnostic plots in the Quarto doc. To get started, I would research best practices around model diagnostics/explainability and look at what others have done re: diagnostic plots. Kaggle would be a good place to start for this; go check the top N house price regressions for useful diagnostic plots.
You can also steal directly from the Tableau work we've already done.
Kick off this issue by making a checklist of possible plots as a PR comment.
@wagnerlmichael owns this. @Damonamajor can help. @wrridgeway can oversee.
The residential model README figures and copy need to be updated for the 2024 model.
LightGBM has not had a major release in well over 6 months. However, there is a roadmap issue for release 4.0.0. Assuming that this release comes out before or during the 2023 modeling cycle, we should investigate whether or not updating to 4.0.0 is worthwhile.
Reasons to update:
Reasons not to update:
Edit: LightGBM 4.2.0 released on CRAN on 12/8, so we'll go with that.
Update the README with instructions on running the model via GitHub Actions.
Make sure these changes are duplicated in the condo model README as well.
Previously, the CCAO attempted to create a stacked/ensemble model using tidymodels functions. However, tidymodels' support for this method was at the time quite new, and it didn't work very well. We should revisit using an ensemble model utilizing the relatively new stacks package.
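A minimal sketch of the stacks workflow, assuming lgbm_res and glmnet_res are hypothetical tuning results saved with control_stack_grid():

# Minimal sketch: ensemble candidate models with {stacks}
library(stacks)
library(dplyr)  # for the %>% pipe

model_stack <- stacks() %>%
  add_candidates(lgbm_res) %>%
  add_candidates(glmnet_res) %>%
  blend_predictions() %>%   # fit a regularized meta-learner over candidates
  fit_members()             # refit the retained members on the full training set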
Private AVMs (Zillow, Redfin, etc.) tend to use Median Absolute Percent Error (MeAPE) as their main performance statistic. We should add this stat to the performance output created by 03-evaluate.
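The stat itself is simple; a minimal sketch, assuming vectors of predicted and actual sale prices:

# Minimal sketch: Median Absolute Percent Error (MeAPE), in percent
meape <- median(abs(predicted - actual) / actual) * 100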
The Quarto doc at reports/performance.qmd calls library() and many of the same setup functions called in R/setup.R. We should simplify this setup stage by replacing it with the same setup head now used in the pipeline, i.e.:
# Load libraries, helpers, and recipes from files
purrr::walk(list.files("R/", "\\.R$", full.names = TRUE), source)
# Load additional dev R libraries (see README#managing-r-dependencies)
suppressPackageStartupMessages({
<all libraries not contained in the Depends: field of DESCRIPTION>
})
The content from the 2021 model report was actually incredibly useful, particularly the topline stats. We should pull the following from those reports:
Note: Basically all of the tables and graphs above now have their inputs pre-generated by the model pipeline. This means you need to remove the code that does calculation/aggregation and replace it with the appropriate data from the output/ directory.
Per Mirella, the 2024 rollover will result in some minor architectural changes to how condos, land, and cards are prorated. We'll need to update this codebase to reflect any changes to the backend values.
Old issue text:
Valuations informed us that some PINs (very rarely) have a separate proration rate per card. We should adjust our code to use this rate where available, as well as institute a Desk Review check for different PIN <> card proration rates.
For the first step in the new model deployment pipeline described in https://github.com/ccao-data/model-res-avm/pull/21/files, we need to containerize our model code by building a Docker image for the model and pushing it to the GitHub container registry. There are a couple steps required for this:
We'll also want to enable layer caching for the build; see this Docker guide for instructions. I don't think we'll need a cache key, since Docker layer caching should take care of that for us automatically, but if we do, it should be enough to use dvc.lock.
Some of this work has already been sketched out in code available on GitLab: https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/master/.gitlab-ci.yml?ref_type=heads
We may be able to cross-validate significantly faster using racing-based tune methods, which are available in the finetune R package.
We should also test the simulated annealing tune method.
Note: Be careful using racing with time-based CV. Check for seasonal variation in performance stats.
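A minimal sketch of both methods, assuming a workflow wflow and resamples folds (hypothetical names):

# Minimal sketch: racing and simulated annealing tuning via {finetune}
library(finetune)

race_res <- tune_race_anova(wflow, resamples = folds, grid = 20)
sa_res <- tune_sim_anneal(wflow, resamples = folds, iter = 25)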
To improve model reproducibility, we should create Docker container images containing all the necessary dependencies to run all currently tagged versions of each model. This would involve building a Docker image of the specific R version used, plus building all the R dependencies captured in renv.lock.

These images should then be saved to each model's respective registry and tagged using the same git tag as the code, i.e. 2022-assessment-year.
Once #41 is complete, we should use the comparables output to add additional QC checks to the main Quarto diagnostic document. For example, we could flag properties whose set of comparables is not within a set range or in cases where the comps are very physically distant from the property of interest.
I'm subscribed to the ccao-model-pipeline topic, but it doesn't seem to be sending me emails when the model pipeline finishes running. I haven't confirmed, but my guess at the root cause is that we don't have AWS_SNS_ARN_MODEL_STATUS set in the container environment:

model-res-avm/pipeline/05-finalize.R, Line 423 in c5dda8e
We should set this variable in the container and test to make sure that the notifications work again.
The current Quarto code is pretty messy: lots of relative local paths, likely unnecessary S3/Athena calls, and just generally inefficient. Give it a quick pass to simplify it and remove any unnecessary overhead. Take this as an opportunity to trim unneeded visualizations and package dependencies as well.
The tidymodels ecosystem has added convenience functions for things we previously had to program ourselves. We should replace our custom versions with the official tidymodels version, where appropriate.
- last_fit()
- The bundle package

Prior CCAO residential and condo models have used sales from all available triads for training the model. It's possible that sales from other triads are materially (but unobservably) different from city triad sales. Additionally, there is a variety of data available only for the City of Chicago, such as zoning info, specific permits, etc.
We should test running a City-only model using expanded data and only City sales, then compare performance to City-only ratios from other county-wide models.
As a proxy for owner-occupied / vacancy.