taxdata's Introduction

About taxdata Repository

This repository prepares data used in the Tax-Calculator repository.

The data files produced here, all in CSV format, comprise two different sets of input files for Tax-Calculator:

  • A set based on a recent IRS-SOI Public Use File (PUF)

  • A set based on recent Census Current Population Survey (CPS) data

Because the PUF data are restricted in their use, the IRS-SOI-supplied PUF file and the puf.csv data file produced here are not part of the taxdata repository or the Tax-Calculator repository.

Each of these two sets of data files contains several types of files:

  1. a sample data file containing variables for each tax filing unit;

  2. a factors file containing annual variable extrapolation factors;

  3. a weights file containing annual weights for each filing unit;

  4. a ratios file containing annual adjustment ratios for some variables (currently only the PUF data set includes a ratios file);

  5. a benefits file containing extrapolated benefits for each filing unit (currently only the CPS data set includes a benefits file).

Note that the factors file is the same in both sets of data files because the variable extrapolation factors are independent of the sample data being used. But the weights, ratios, and benefits files do depend on the data file, so they are different in the two sets of data files.

Installation

Currently, the only way to install taxdata is to clone the git repo locally.

git clone https://github.com/PSLmodels/taxdata.git

Next, navigate to the directory and install the taxdata-dev conda environment:

cd taxdata
conda env create -f environment.yml

After installing the conda environment, install the pre-commit hooks so that they run automatically:

pre-commit install

To run the scripts that produce puf.csv and cps.csv.gz, activate the taxdata-dev conda environment and follow the workflow laid out below.

Julia must also be installed to solve for the PUF and CPS weights. You can download Julia from the Julia website or install it with Homebrew. After installing Julia, you will also need to install three packages: JuMP, Cbc, and NPZ.
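For example, with Julia 1.x on your path, all three packages can be added from the command line using Julia's built-in Pkg API (a sketch; adjust if your Julia version differs):

julia -e 'using Pkg; Pkg.add(["JuMP", "Cbc", "NPZ"])'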

Data-Preparation Documentation and Workflow

The best documentation of the data-preparation workflow is the taxdata Makefile. The Makefile shows the input files and the Python script that generates each made file. The files made in early stages of the workflow serve as input files in later stages, which means there is a cascading effect of changes in the scripts and/or input files. The Makefile automates this complex workflow in an economical way because it executes scripts to make new versions of made files only when necessary. Start exploring the Makefile by running the make help command. If you want more background on the make utility and makefiles, search for Internet links with the keywords makefile and automate.

Note that the stage2 linear program that generates the weights file for the PUF is very long-running, taking five or more hours depending on your computer's CPU speed. We are considering options for speeding up this stage2 work, but for the time being you can execute make puf-files and make cps-files in separate terminal windows to have the two stage2 linear programs run in parallel. (If you try this parallel execution approach, be sure to wait for the make puf-files job to begin stage2 work before executing the make cps-files command in the other terminal window. This is necessary because the CPS stage1 work depends on output from PUF stage1.) If you are generating the taxdata made files in an overnight run, then simply execute the make all command.
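For example, the two-terminal approach sketched above looks like this (the wait between the two commands is manual):

# terminal window 1
make puf-files

# terminal window 2, once the puf-files job has begun its stage2 work
make cps-files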

You can copy the made files to your local Tax-Calculator directory tree using the csvcopy.sh bash script. Use the dryrun option to see which files would be copied (because they are newer than the corresponding files in the Tax-Calculator directory tree) without actually doing the file copies. At the terminal command-prompt in the top-level taxdata directory, execute ./csvcopy.sh to get help.
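For example, assuming the dryrun option is passed as a plain command-line argument (run ./csvcopy.sh with no arguments to confirm the exact usage):

./csvcopy.sh dryrun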

Example

To create cps.csv.gz, run

conda activate taxdata-dev
make cps-files

Contributing to taxdata Repository

Before creating a GitHub pull request, on your development branch in the top-level directory of the taxdata repository, run make cstest to make sure your proposed code is consistent with the repository's coding style and then run make pytest to ensure that all the tests pass.
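That is, on your development branch in the top-level directory:

make cstest
make pytest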

For information on contributing to TaxData, see the contributor guide.

Disclaimer

taxdata is under continuous development. As such, results will change as the underlying data and logic improve.

Contributors

A full list of contributors on GitHub can be found here. John O'Hare of Quantria Strategies has also made significant contributions to the development of taxdata.

Citing TaxData

Please cite the source of your analysis as "TaxData Release #.#.#, author's calculations". If you wish to provide a link, the preferred URL is https://github.com/PSLmodels/taxdata. Additionally, we strongly recommend that you describe the input data used, and provide a link to the materials required to replicate your analysis or, at least, note that those materials are available upon request.

Release Notes and Change Log

Information on the changes included in new releases can be found in the notes for each release.


taxdata's Issues

AIPD factor missing in action

I've merged the current master branch --- which includes PR #63 --- into the PR #64 branch.
Then I did this with the current PR #64 code:

cd taxdata/stage1
python stage1.py
python factors_finalprep.py

The Stage_I_factors.csv output is the same as always, but when I compare the factors_finalprep.py output file (now called growfactors.csv) with the growfactors.csv in taxcalc PR #1178 (which I made from the old StageIFactors.csv file in taxcalc), there is only one set of differences: the AIPD factor is not in the growfactors.csv file generated by the factors_finalprep.py script.

@andersonfrailey, Where in taxdata is the AIPD factor added in?

Progress on fixing spouse ages in SAS code

@andersonfrailey said on March 29, 2017:

Thanks for the work on this, @martinholmer. Julia, our intern, discovered a bug in the original SAS code that caused the spouse to be misidentified when creating some of the CPS tax units, which could explain some of the major disparities in age we have been seeing.

She found a fix for it and I will work on evaluating the results before the next PUF release.

@andersonfrailey, Can you provide us with a progress report on the work you and Julia are doing?

Logic for new taxdata/puf_finalprep/puf_factors_processing.py

@andersonfrailey, Below is my version of a script --- puf_factors_processing.py --- that converts blowup factors from the way they are expressed inside the taxdata repo to the way they are expressed in the Tax-Calculator repo. I think the only thing you will have to change is the input_filename because the input StageIFactors.csv file will not be in the taxdata/puf_finalprep directory but the taxdata/puf_stage1 directory. Look at some of the taxcalc/tests code to see how the full path is constructed. You should be able to produce in taxdata the same taxcalc/puf_factors.csv file as in Tax-Calculator pull request 1178.

"""
puf_factors_processing.py does final preparation of Stage1 blowup factors by
transforming Stage1Factors.csv into puf_factors.csv
"""
# CODING-STYLE CHECKS:
# pep8 --ignore=E402 puf_factors_processing.py
# pylint --disable=locally-disabled puf_factors_processing.py
# (when importing numpy, add "--extension-pkg-whitelist=numpy" pylint option)

import pandas as pd

# pylint: disable=invalid-name
first_data_year = 2009
input_filename = 'StageIFactors.csv'
output_filename = 'puf_factors.csv'

# read in blowup factors used internally in taxdata repository
data = pd.read_csv(input_filename, index_col='YEAR')

# convert some aggregate factors into per-capita factors
elderly_pop = data['APOPSNR']
data['ASOCSEC'] = data['ASOCSEC'] / elderly_pop
pop = data['APOPN']
data['AWAGE'] = data['AWAGE'] / pop
data['ATXPY'] = data['ATXPY'] / pop
data['ASCHCI'] = data['ASCHCI'] / pop
data['ASCHCL'] = data['ASCHCL'] / pop
data['ASCHF'] = data['ASCHF'] / pop
data['AINTS'] = data['AINTS'] / pop
data['ADIVS'] = data['ADIVS'] / pop
data['ASCHEI'] = data['ASCHEI'] / pop
data['ASCHEL'] = data['ASCHEL'] / pop
data['ACGNS'] = data['ACGNS'] / pop
data['ABOOK'] = data['ABOOK'] / pop
data['AGDPN'] = data['AGDPN'] / pop  # TODO: remove line after Growth redesign

# convert factors into "one plus annual proportion change" format
data = 1.0 + data.pct_change()

# specify first row values because pct_change() leaves first year undefined
# (these values have been transferred from Tax-Calculator records.py)
for var in list(data):
    data.loc[first_data_year, var] = 1.0
data.loc[first_data_year, 'ACGNS'] = 1.1781
data.loc[first_data_year, 'ADIVS'] = 1.0606
data.loc[first_data_year, 'AINTS'] = 1.0357
data.loc[first_data_year, 'ASCHCI'] = 1.0041
data.loc[first_data_year, 'ASCHCL'] = 1.1629
data.loc[first_data_year, 'ASCHEI'] = 1.1089
data.loc[first_data_year, 'ASCHEL'] = 1.2953
data.loc[first_data_year, 'AUCOMP'] = 1.0034
data.loc[first_data_year, 'AWAGE'] = 1.0053

# round converted factors to six decimal digits of accuracy
data = data.round(6)

# delete from data variables not used by Tax-Calculator (TC)
TC_USED_VARS = set(['ABOOK',
                    'ACGNS',
                    'ACPIM',
                    'ACPIU',
                    'ADIVS',
                    'AGDPN',  # TODO: remove this line after Growth redesign
                    'AINTS',
                    'AIPD',
                    'APOPN',  # TODO: remove this line after Growth redesign
                    'ASCHCI',
                    'ASCHCL',
                    'ASCHEI',
                    'ASCHEL',
                    'ASCHF',
                    'ASOCSEC',
                    'ATXPY',
                    'AUCOMP',
                    'AWAGE'])
ALL_VARS = set(list(data))
TC_UNUSED_VARS = ALL_VARS - TC_USED_VARS
data = data.drop(TC_UNUSED_VARS, axis=1)

# write out blowup factors used in Tax-Calculator repository
data.to_csv(output_filename, index_label='YEAR')

Alternatives to CyLp

Before PR #100 was merged there was some discussion of finding an alternative to CyLp to use in stage 2 of our extrapolation process. @hdoupe and I have talked about this offline as well. I'm opening this issue so we have a place to discuss other open source options we may have.

I did a quick search after lunch, and if there is no opposition to moving away from Python, there is a CLP interface written in Julia, another open source language. I haven't looked too deeply yet, and judging from the repository it is a relatively immature project, but a pull request was opened just 13 days ago, so there is more activity going on than in CyLp.

I don't have much experience with Julia, but I'm working through the getting started manual; the syntax is relatively easy to pick up, and the language was built specifically for numerical computing, so it's very fast.

I will keep looking for alternatives in my spare time, and if others find any I would appreciate them posting them here for discussion. I'm sure there are more options than the one I've posted, possibly in R or another computational language.

@martinholmer @Amy-Xu

conda package inconsistency

In the taxdata repository, I tried to execute the final_prep/puf-cps-processing.py script on my computer.
Here is the error message I got:

iMac2:final prep mrh$ python puf-cps-processing.py cps-puf-2016-04-27.csv
Traceback (most recent call last):
  File "puf-cps-processing.py", line 211, in <module>
    sys.exit(main())
  File "puf-cps-processing.py", line 47, in main
    data = create_new_recid(data)
  File "puf-cps-processing.py", line 73, in create_new_recid
    sorted_dta = data.sort_values(by='recid')
  File "/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 2150, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'sort_values'

Turns out I had pandas 0.16 installed, and the DataFrame.sort_values() method was introduced in version 0.17.
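A minimal guard against this (a sketch, not part of the original script; distutils was the era-appropriate way to compare version strings):

from distutils.version import LooseVersion

import pandas as pd

# DataFrame.sort_values() first appeared in pandas 0.17.0
if LooseVersion(pd.__version__) < LooseVersion('0.17.0'):
    raise RuntimeError('this script requires pandas >= 0.17.0')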

iMac2:final prep mrh$ conda list | grep pandas
pandas                    0.16.2               np19py27_0    defaults

So, I upgraded pandas as follows:

iMac2:final prep mrh$ conda update pandas
Fetching package metadata: ....
Solving package specifications: .........

Package plan for installation in environment /Users/mrh/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    numpy-1.10.4               |           py27_0         3.0 MB
    numexpr-2.4.4              |      np110py27_0         103 KB
    scipy-0.16.0               |      np110py27_1        11.8 MB
    pandas-0.18.0              |      np110py27_0         5.8 MB
    scikit-learn-0.16.1        |      np110py27_0         3.3 MB
    ------------------------------------------------------------
                                           Total:        24.1 MB

The following packages will be UPDATED:

    numexpr:      2.4.3-np19py27_0  --> 2.4.4-np110py27_0 
    pandas:       0.16.2-np19py27_0 --> 0.18.0-np110py27_0
    scikit-learn: 0.16.1-np19py27_0 --> 0.16.1-np110py27_0
    scipy:        0.15.1-np19py27_0 --> 0.16.0-np110py27_1

The following packages will be DOWNGRADED:

    numpy:        1.11.0-py27_0     --> 1.10.4-py27_0     

Proceed ([y]/n)? y

Fetching packages ...
numpy-1.10.4-p 100% |################################| Time: 0:00:00   4.06 MB/s
numexpr-2.4.4- 100% |################################| Time: 0:00:00   1.60 MB/s
scipy-0.16.0-n 100% |################################| Time: 0:00:01   8.97 MB/s
pandas-0.18.0- 100% |################################| Time: 0:00:01   4.28 MB/s
scikit-learn-0 100% |################################| Time: 0:00:00   5.17 MB/s
Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Unlinking packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%

Then I tried again to execute the puf-cps-processing.py script and here is what I got:

iMac2:final prep mrh$ python puf-cps-processing.py cps-puf-2016-04-27.csv
/Users/mrh/anaconda/lib/python2.7/site-packages/pandas/computation/__init__.py:19: UserWarning: The installed version of numexpr 2.4.4 is not supported in pandas and will be not be used

  UserWarning)

So, I have two questions:

(1) Why does the Anaconda distribution have conflicting packages? I asked for the newest pandas and conda gave me (without asking) a version of numexpr that does not work with the version of pandas it supplied. Isn't this a conda bug?

(2) How do I fix this problem? I get this message every time I run the Tax-Calculator, which is very annoying and maybe dangerous.

@talumbau @MattHJensen @Amy-Xu

Imputations for mortgage interest deduction reforms

We should impute the necessary information onto the CPS-PUF file in order to simulate these reforms with Tax-Calculator:

  1. Eliminate the MID
  2. Reduce the limit on acquisition debt
  3. Restrict to X percent the value of the MID
  4. Limit the MID to primary residences
  5. Replace the MID with an X percent tax credit on the first $25,000 of mortgage interest from primary residences only

This NTJ paper by Adam J. Cole, Geoffrey Gee, and Nicholas Turner describes the methodology and results of modeling exactly these proposals with the Treasury model.

This Urban Institute paper may also be helpful.

After the data work is under way, a new issue should be opened in the Tax-Calculator repository describing exactly which changes need to be made to Tax-Calculator in order to accommodate these reforms.

Development of 2010 puf.csv file

My understanding is that the 2010 puf.csv file is currently being (or will soon be) developed.
I have two questions about the new 2010 puf.csv file:

(1) Will it still be possible to run the 2009 puf.csv file through the Tax-Calculator after the 2010 puf.csv file becomes the standard input?

I'm asking this first question because it would appear that when moving from 2008 to 2009, we lost the ability to use the 2008 puf.csv file as Records input. Maybe backward compatibility is not thought to be a desirable goal, or maybe there was not that much thought about this issue when preparing for the 2008-to-2009 transition. I don't know the history and don't have a strong view on this matter. But I hope that in this transition whatever happens is planned.

(2) Will the 2010 puf.csv file contain more sensible ages? The age variables in the 2009 puf.csv file are pretty dodgy as described in taxdata issue #18.

Can someone see that John O'Hare gets a copy of this issue?
His name does not appear to be on the list of GitHub users of the taxdata repository.

@MattHJensen @Amy-Xu @feenberg

CPS File Progress Report

This issue is just an overview of the progress we've made in preparing the CPS-based file for use in Tax-Calculator.

John gave me the files needed to create the CPS file along with an associated weights file that covers the years 2015-2027. The SAS scripts create tax-units from the CPS in the same manner used to create the CPS tax-units that are then merged with the 2009 IRS-PUF file to create the final PUF currently used. After that, the following variables are adjusted for top-coding:

  • Wages and salaries
  • Taxable interest income
  • Dividends
  • Alimony
  • Business income/loss
  • Pensions
  • Rents
  • Farm income/loss

Then the following are imputed:

  • Capital gains
  • Taxable IRA distributions
  • Adjusted IRA Contributions
  • KEOGH/SEP plan contributions
  • Self-employed health insurance deduction
  • Student loan interest deduction
  • Charitable contributions
  • Miscellaneous deductions
  • Child care expenses
  • Medical expenses deduction
  • Home mortgage interest expense
  • Real estate taxes
  • Domestic production activity deduction

Finally, the following are targeted at a state level:

  • Wages
  • Interest income
  • Dividends
  • Business income/loss
  • Capital gains
  • Taxable IRA distributions
  • Pensions
  • Unemployment
  • KEOGH
  • Self-employment health insurance
  • IRA contribution
  • Student loan interest
  • Domestic production activity deduction
  • Schedule E income

There are some variables that are currently missing from the file as well:

Variable Description
p23250 Sch D: Net long-term capital gains/losses
p25470 Sch E: Royalty depletion and/or rental depreciation
e09800 Unreported payroll taxes from Form 4137 or 8919
e02000 Sch E rental, royalty, S-corp, etc, income/loss
e62900 Alternative Minimum Tax foreign tax credit from Form 6251
p08000 Other tax credits (but not including Sch R credit)
e58990 Investment income elected amount from Form 4952
e00700 Taxable refunds of state and local income taxes
e03290 Health savings account deduction from Form 8889
e07240 Retirement savings contributions credit from Form 8880
agi_bin Historical AGI category used in data extrapolation
e19200 Sch A: Interest paid
e27200 Sch E: Farm rent net income or loss
e01200 Other net gain/loss from Form 4797
e03500 Alimony paid
n1821 Number of people over 18 and under 21 years old in the filing unit
e07260 Residential energy credit from Form 5695
blind_spouse 1 if spouse is blind; otherwise 0
p22250 Sch D: Net short-term capital gains/losses
e03220 Educator expenses
e07400 General business credit from Form 3800
f2441 number of child/dependent-care qualifying persons
nu18 Number of people under 18 years old in the filing unit
f6251 1 if Form 6251 (AMT) attached to return; otherwise 0
blind_head 1 if taxpayer is blind; otherwise 0
e03230 Tuition and fees from Form 8917
e03400 Penalty on early withdrawal of savings
e07300 Foreign tax credit from Form 1116
e11200 Excess payroll (FICA/RRTA) tax withheld
e24518 Sch D: 28% Rate Gain or Loss
EIC number of EIC qualifying children (range: 0 to 3)
e09700 Recapture of Investment Credit
p87521 Total tentative AmOppCredit amount for all students
e09900 Penalty tax on qualified retirement plans
e24515 Sch D: Un-Recaptured Section 1250 Gain
e01500 Pensions and annuities
e87530 Adjusted qualified lifetime learning expenses for all students
e26270 Sch E: Combined partnership and S-corporation net income/loss
e07600 Prior year minimum tax credit from Form 8801
cmbtp Estimate of income on (AMT) Form 6251 but not in AGI
n21 Number of people 21 years old or older in the filing unit
e18400 Sch A: State and local income/sales taxes
e20500 Sch A: Net casualty or theft loss
MIDR 1 if separately filing spouse itemizes; otherwise 0

nu18, n1821, and n21 can be found, and I'm editing the SAS files to do so. We're waiting for John to get us imputations for state and local taxes as well. I'm also digging more into the CPS to see if there are any other variables that can be found.

@Amy-Xu and I are analyzing the final files to make sure the results after using them in tax-calc make sense.

I will use this issue to post updates as more progress is made.

@martinholmer @MattHJensen @codykallen

Where to find 5x8 file?

Hi everyone, I'm a complete newbie to this taxdata repo and was hoping to build an instance of puf.csv for the taxcalc library to try to reproduce the basic tutorial. I got the taxdata repo loaded up, but when running taxdata/puf_data/StatMatch/Matching/runmatch.py I got a missing-file error for asec2014_pubuse_tax_fix_5x8.dat. After some time on Google I found the following URL, https://thedataweb.rm.census.gov/pub/cps/march/asec2014_pubuse_tax_fix_5x8.zip, which is now resulting in a 404. It looks like some of the underlying data is coming from http://ceprdata.org/cps-uniform-data-extracts/march-cps-supplement/march-cps-data/ but those files don't appear to be compatible with runmatch.py. Any suggestions on where to find this data?

"Array subscript out of range" error for CPS-RETS13V5.sas

I'm working through the State Database files to rebuild the cps.csv from scratch using the ASEC microdata. Note that I'm using the University Edition of SAS.

The first three SAS scripts -- cpsmar20**.sas -- from NBER work fine.

For the fourth one -- CPS-RETS13V5.sas -- I get the following error:

ERROR: Array subscript out of range at line 2203 column 33

In the log, line 2203 is the %SEARCH2 call

2201 IF( NUNITS GT 1 )THEN
2202 DO;
2203 %SEARCH2
2204 END;

But in the raw code, it's the TABULATE procedure

2203 PROC TABULATE DATA=EXTRACT.CPSRETS&CPSYEAR FORMAT=COMMA12. ;

It's the only error I get, but the summary table generated has a total of 4 returns in it.

Updating underlying projection data to generate new weights file

I just finished updating the CBO_baseline and SOI_estimates files found in the stage 1 directory to reflect the latest data from the CBO and IRS. These files are used in Stage_I.py to create the targets and factors used to blow up and re-weight each record in stage 2 of the extrapolation process. As I test the file, I will be producing the following for review:

  • The difference in baseline TaxCalc calculations using this new file and what is currently in use.
  • Updated versions of correlation.csv, reform_results.txt, and variable_stats_summary.csv in the comparison directory of TaxCalc.
  • A comparison of these new results and CBO projections.

These will all be presented for discussion in a TaxCalc PR with the new weights file. If anyone would like a prerelease of the file I'll be happy to share it with you.

If there is any other information you would like me to include in the TaxCalc PR, please let me know.

I will also be opening a TaxData PR to update these files as well as the stage 1 and 2 extrapolation files after I finish reviewing the new weights file.

It should be noted that I will also be creating a new PUF in the near future that is based on the 2010 PUF and 2015 CPS rather than on the 2009 PUF and 2014 CPS as is currently the case. These updates are being done separately so that the source of any potential changes in output is easily found.

@MattHJensen @Amy-Xu @martinholmer @feenberg @GoFroggyRun @codykallen @jdebacker

Next version of puf.csv file

@MattHJensen wants to split up the activities mentioned in issue #95 so that a new version of the SAS-generated puf.csv file is available no later than this Thursday. So, the things that need to get done in the next day or so include the following:

  • remove ad hoc age-difference logic from puf_data/finalprep.py in anticipation of the new python-generated cps-matched-puf.csv file that contains more sensible ages for taxpayer and spouse (@martinholmer in PR #yy)

  • add nu18 variable (@hdoupe in PR #93)

  • replace net casualty losses with gross casualty losses (@martinholmer in PR #94)

In addition, PR #93 and PR #94 will require changes in Tax-Calculator, which I will be happy to handle. In fact, the Tax-Calculator code changes to handle taxdata #94 have already been prepared in taxcalc pull request 1426.

@andersonfrailey, What version of the raw SAS-generated cps-matched-puf.csv file are we currently using? On my computer I have this cps-matched-puf.csv file:

puf_data$ ls -l cps-m*
-rw-r--r--@ 1 mrh  staff  178345362 Mar 21 13:33 cps-matched-puf.csv

puf_data$ awk -F, 'NR>1{n++}END{print n}' cps-matched-puf.csv 
219806

It has eight fewer records than the current version of the puf.csv file.
Is that the file that the stages and finalprep work will begin with this week?
I want to make sure I have the same raw starting file as everybody else.

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe

This Repo: SAS->Python Accomplished!

GH just updated this repo's language tag from SAS to Python:

[screenshot: the repository's language tag, now showing Python]

This was the result of a ton of work by a lot of people over the last couple of years.

Notably, @Amy-Xu led the first translation of stage 1 and stage 2 to Python, and @andersonfrailey led the effort to translate the CPS matching scripts with major contributions from @XueliangWang and @hdoupe.

Congratulations and thanks to everyone who worked on this!

(I'll close this issue tomorrow, as it's really just an announcement/celebration).

add under5 variable for HC CTC expansion

@codykallen recently implemented a quick version of the Clinton child tax credit expansion for Alex Brill's blog post and op ed on the policy.

I recently asked him how he did it, and here's what he said:
(@codykallen, thanks for permission to reproduce this here)

The code is available at https://github.com/codykallen/Tax-Calculator/blob/ctc_expansion/taxcalc/functions.py

I implemented this using the personal credit, and randomly assigning whether a child is under 5. The necessary changes to functions.py are as follows:

In the AGI function:

-    personal_credit = II_credit[MARS - 1]
-    if II_credit_prt > 0. and c00100 > II_credit_ps[MARS - 1]:
-        credit_phaseout = II_credit_prt * (c00100 - II_credit_ps[MARS - 1])
-        personal_credit = max(0., personal_credit - credit_phaseout)
+    under5 = 0
+    if n24 > 0:
+        for i in range(n24):
+            if random.random() < 0.273:
+                under5 += 1
+    personal_credit = II_credit[MARS - 1] * under5

Don't forget to include n24 in the inputs.

In the ChildTaxCredit function:

-    prectc = CTC_c * n24
+    prectc = CTC_c * n24 + personal_credit

Don't forget to include personal_credit in the inputs.

In the AdditionalCTC function:

-    c82890 = ACTC_rt * c82885
+    if personal_credit > 0:
+        c82890 = 0.45 * c82885
+    else:
+        c82890 = ACTC_rt * c82885

Don't forget to include personal_credit in the inputs.

In the IITAX function:

-    _refund = c59660 + c11070 + c10960 + personal_credit
+    _refund = c59660 + c11070 + c10960

You do not need to keep personal_credit in the inputs; that won't affect the ability to function (note, though, that it will cause a test to fail).

Since I didn't have data on which children are under 5, I just applied a random probability for each child being under 5, based on government estimates of child age distributions. This should let you simulate the effect of the CTC expansion for families with children under 5. Under the proposal, children under 5 get a $2000 CTC instead of $1000, the ACTC refundability threshold is lowered to zero, and families with children under 5 get a 45% refundability rate instead of 15%.

@andersonfrailey, could you add an "under5" variable to the puf?

I hope to implement the CTC expansion in TC w/o hijacking the personal credit.

Preparing a new puf.csv file

There are several enhancements nearing completion that need to be finished before we can issue a new version of the puf.csv file. Coordinating these separate enhancements --- so that there is just one new puf.csv file incorporating all the enhancements --- will minimize the disruption in projects that use the puf.csv file. Here are the separate enhancements:

  • prepare the raw cps-matched-puf.csv file using Python scripts rather than SAS scripts (@andersonfrailey in issue #92 and PR #xx)

  • remove ad hoc age-difference logic from puf_data/finalprep.py now that the new cps-matched-puf.csv file has more sensible ages for taxpayer and spouse (@martinholmer in PR #yy)

  • add nu18 variable (@hdoupe in PR #93)

  • replace net casualty losses with gross casualty losses (@martinholmer in PR #94)

In addition, PR #93 and PR #94 will require changes in Tax-Calculator, which I will be happy to handle.

Is there anything else? Does this coordination seem worth the benefit in less disruption?

@MattHJensen @Amy-Xu @andersonfrailey @hdoupe

Improve dissemination / deployment related to Public Use File (puf.csv)

Issue - Better track and distribute puf.csv

We have found OG-USA regression testing difficult because the Tax-Calculator, OG-USA and B-Tax packages that require a puf.csv public use file do not maintain metadata about which version of that puf.csv is okay.

We have discussed the idea of coming up with a taxpuf private Anaconda package to help ensure installations of Tax-Calculator, B-Tax, OG-USA get the right puf.csv. Each of the packages may require a version of taxpuf just like requiring a specific version of numpy.

There are a few ideas about how to do this with minimal interruption of existing code in Tax-Calculator, B-Tax and OG-USA.

Idea 1 - write_puf

At the top of our modules in B-Tax, Tax-Calculator, and OG-USA, just put the following:

from taxpuf import write_puf
write_puf()
# the rest of the import statements

write_puf() would write puf.csv to the top level of the repo where it is currently expected, but would not write it if the md5sum indicated no change was needed.
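A minimal sketch of how write_puf() might be implemented under this idea; the taxpuf package does not exist yet, and the bundled-data path below is a hypothetical stand-in:

import hashlib
import os
import shutil

# hypothetical location where the taxpuf package would bundle its puf.csv
PACKAGED_PUF = os.path.join(os.path.dirname(__file__), 'data', 'puf.csv')

def _md5(path):
    """Return the md5 hex digest of the file at path."""
    with open(path, 'rb') as fin:
        return hashlib.md5(fin.read()).hexdigest()

def write_puf(dest='puf.csv'):
    """Copy the packaged puf.csv to dest unless an identical copy exists."""
    if os.path.exists(dest) and _md5(dest) == _md5(PACKAGED_PUF):
        return  # md5sum indicates no change is needed
    shutil.copyfile(PACKAGED_PUF, dest)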

Idea 2 - from taxpuf import PUF

With from taxpuf import PUF, PUF is the string contents of the CSV file puf.csv. This would require some modifications in the related repositories where currently a CSV path is expected. I don't think we want from taxpuf import PUF to import a path to a CSV because the path may need to be processed differently, depending on whether it is a path within an egg.

Auto-build taxpuf package

In either case of Idea 1 or 2 above, I have some helper scripts started for automatically building a new taxpuf package whenever we need to distribute a new puf.csv. It is a one-line package-creation script, used like:

python build_puf_package.py puf.csv "testing it out" 0.0.4

The above would take puf.csv from the current directory and make it the 0.0.4 version of taxpuf package, adding the metadata "testing it out".

Thoughts on Idea 1 vs 2?

Need to smooth out age distribution

My recent suggestion to Amy about how to create the new age_head and age_spouse variables in the puf.csv file turns out to have been a bad suggestion. My suggestion produces too lumpy an age distribution, and that is a significant problem. The reason we know that the lumpy age distribution is an important problem is that Sean has not been able to closely replicate the JCT ten-year revenue estimate for the President's EITC reform proposal (see the discussion of Tax-Calculator pull request #687). That proposal does two things: (a) raises the EITC amount for those with no EITC-eligible children, and (b) broadens the range of eligible ages from [25,64] to [22,66]. It seems as if we are OK on reform provision (a), but underestimate the increase in EITC payments caused by reform provision (b). Why? Because my age-imputation suggestion has almost nobody in the [22,24] and [65,66] age ranges. I'm sorry that I did not anticipate this in my original suggestion, but we need to fix the current lumpy age distribution if we are going to get anywhere close to the JCT estimates for this EITC reform.

Suggestions about how best to fix this problem are welcome.

Below is some information that shows the logic now being used to impute the age variables and the resulting lumpy age distributions.

First the logic:

def age_consistency(data):
    """Impute age_head and age_spouse for filing units with a nonzero
    agerange code, then recode any remaining zero ages to one."""
    # map each nonzero agerange code to a representative decade age;
    # dependent filers (dsi == 1) are mapped one decade lower
    data['age_head'] = np.where(data['agerange'] == 0,
                                data['age_head'],
                                (data['agerange'] + 1 - data['dsi']) * 10)
    data['age_spouse'] = np.where(data['agerange'] == 0,
                                  data['age_spouse'],
                                  (data['agerange'] + 1 - data['dsi']) * 10)
    # recode zero ages to one
    data['age_head'] = np.where(data['age_head'] == 0,
                                1, data['age_head'])
    data['age_spouse'] = np.where(data['age_spouse'] == 0,
                                  1, data['age_spouse'])
    return data

Next, the weighted percent of MARS!=2 filing units by age:

$ awk -F, 'NR>1&&$11!=2{n[$1]+=$102;w+=$102}END{for(a in n)print a,100*n[a]/w}' puf.csv | sort -g
1 0.00732623
2 0.00138556
3 1.02869e-05
4 0.00247824
6 0.00403875
7 0.00236135
9 0.00299131
10 1.71464
12 0.00109087
13 0.00647901
15 0.0498515
16 0.0555537
17 0.047421
18 0.0887731
19 0.0683616
20 21.201
21 0.08258
22 0.108786
23 0.121159
24 0.113252
25 0.131894
26 0.124089
27 0.0984781
28 0.095542
29 0.115429
30 18.1627
31 0.0771069
32 0.0907871
33 0.0757789
34 0.0556818
35 0.058926
36 0.0752401
37 0.0618501
38 0.0812238
39 0.0886577
40 14.5644
41 0.0750913
42 0.0809567
43 0.113003
44 0.109459
45 0.0931774
46 0.0899742
47 0.0926781
48 0.132744
49 0.113896
50 13.9241
51 0.132727
52 0.126706
53 0.149162
54 0.128235
55 0.152665
56 0.198669
57 0.178046
58 0.183642
59 0.175279
60 9.93342
61 0.174545
62 0.189469
63 0.177964
64 0.232844
65 0.212767
66 0.253918
67 0.285979
68 0.181379
69 0.155409
70 10.279
71 0.231158
72 0.190056
73 0.184346
74 0.186603
75 0.195947
76 0.208272
77 0.203494
78 0.193818
79 0.147369
80 1.04536
85 1.24946

And finally, the weighted percent of MARS==2 filing units by head age:

$ awk -F, 'NR>1&&$11==2{n[$1]+=$102;w+=$102}END{for(a in n)print a,100*n[a]/w}' puf.csv | sort -g
15 0.000887838
16 0.00316868
17 0.00617273
18 0.00836973
19 0.014477
20 2.41965
21 0.0112784
22 0.0330074
23 0.0368318
24 0.0381903
25 0.0722793
26 0.0758865
27 0.0417912
28 0.0511723
29 0.132077
30 11.9244
31 0.119456
32 0.144295
33 0.151419
34 0.138137
35 0.113667
36 0.137673
37 0.101989
38 0.148469
39 0.112593
40 18.6935
41 0.175016
42 0.15626
43 0.163024
44 0.182924
45 0.16911
46 0.179188
47 0.212036
48 0.173383
49 0.194097
50 20.7905
51 0.194294
52 0.189732
53 0.160748
54 0.184649
55 0.173933
56 0.193347
57 0.180913
58 0.215426
59 0.254128
60 17.8391
61 0.20988
62 0.253684
63 0.269371
64 0.273034
65 0.272802
66 0.264854
67 0.24753
68 0.246847
69 0.242218
70 17.5743
71 0.302369
72 0.222986
73 0.304332
74 0.216955
75 0.286676
76 0.239517
77 0.189752
78 0.19819
79 0.17896
80 0.805275
85 0.487893

@Amy-Xu @MattHJensen

Versioning TaxData

Spinning off of conversations I've had with @MattHJensen and @hdoupe, we should consider some kind of versioning system for TaxData. There are a couple of reasons this would be helpful:

  1. Different users may have different versions of the PUF. As we move to the 2011 PUF, there may be users who have the 2009 PUF and do not want to purchase the 2011 PUF. Similarly, I've been talking to a set of users who have the 2012 PUF and helping them set up TaxData to accommodate that. I believe we should make it easy for them to work with TaxData no matter what year of the PUF they have and this would help.
  2. We have been adding a lot of variables to the CPS at the requests of our users. It would be helpful if there was a better way to track the history of the file than having to dig through old PRs and issues.

On point one, my idea for the versioning system would be similar to what we use for Tax-Calculator. We can track major (new PUF or CPS files used in the match), minor (adjusting the weights or growth factors), and patch (adding variables) changes. For example, if we were to version the current PUF, it could be 1.0.0. Then if we were to update to the latest CBO projections, we'd have version 1.1.0. When we update to the 2011 PUF, we will then go to 2.0.0. We can specify in Tax-Calculator and Policy Brain which version we're using.

One of the problems I foresee is supporting multiple versions. If we were to update to the latest CBO projections, we'd want to do that for all versions, not just the latest. @MattHJensen mentioned there may be a relatively efficient way to manage this that involves how we checkout and merge branches. I'm not familiar enough with versioning to talk with any authority on this, though, and would appreciate others' ideas.

With regard to point two, major changes involve changing the years of the CPS files, minor changes would be reweighting, and patches would be adding variables.

There is an issue with how we handle having both the CPS and PUF files in the same repo, though. In my opinion, it would be very difficult and confusing to manage the versions of two files in the same repo, so it might be worth splitting TaxData into two repos. There would be some overlap with the stage one process, but other than that the processes for creating the two are isolated. Or, we could decide that it's not worth versioning the CPS because it's included in the taxcalc package and users likely won't be creating it on their own anyway.

I'm interested in gaining the perspective of others on this. Any opinions regarding the merit of the versioning idea or how to best go about it would be appreciated.

cc @martinholmer @Amy-Xu @GoFroggyRun

Python Translation Initial Results

I just finished running the python-produced PUF file through tax-calc. Here is a notebook containing the results. Is there anything else people would like added to the notebook?

Most of the results are almost exactly the same. There is a slight dip in the number of returns and tax units with zero/negative income and combined tax liability. The latter is connected to the former because if you look at the number of units with zero/negative tax liability as a percent of the total returns they are similar. The drop in the total number of tax units is also almost entirely non-filers, the sum of which is cut roughly in half.

The top decile also sees a noticeable increase in tax liability. I haven't pinned down the root cause of this yet. If you look at the distribution of tax liability it's actually almost exactly the same across the board.

There's one change I need to make to the scripts that adds a variable containing the number of dependents under 18 to address Tax-Calc issue #1409. After that, I'll open a PR to merge everything in when I'm finished making changes to puf_data/finalprep.py and puf_stage2/stage2.py.

@MattHJensen @martinholmer @Amy-Xu @hdoupe

Why extreme differences in the ages of a couple?

I don't understand why there are some couples (that is, filing units in puf.csv with MARS==2) with very large age differences. Most (about 90 percent of the sample weight of all couples) have the same age as shown by the following results:

$ awk -F, 'NR>1&&$11==2{n[$2-$1]+=$102;w+=$102}END{for(ad in n)print ad,100*n[ad]/w}' puf.csv | awk '$1==0'
0 90.2593

And even more (over 97 percent of couples) have a spouse-minus-head age difference in the [-9,+9] range:

$ awk -F, 'NR>1&&$11==2{n[$2-$1]+=$102;w+=$102}END{for(ad in n)print ad,100*n[ad]/w}' puf.csv | sort -g | awk '$1>=-9&&$1<=9{p+=$2}END{print p}'
97.5775

But I don't understand why the logic generates a few extremely large age differences as seen in these results:

$ awk -F, 'NR>1&&$11==2{n[$2-$1]+=$102;w+=$102}END{for(ad in n)print ad,100*n[ad]/w}' puf.csv | sort -g > ad
$ head ad
-84 0.0372232
-79 0.0712607
-78 0.00594574
-77 0.0163171
-76 0.0220896
-75 0.0283592
-74 0.0201275
-73 0.00479531
-72 0.0440797
-71 0.0132917
$ tail ad
27 0.0125647
28 0.000265776
29 0.000887838
30 0.00696035
33 0.000555454
35 0.000655087
36 3.40456e-05
38 0.000389501
42 0.0031806
45 0.0077123
$ 

I'm aware of the concept of a "trophy wife", but filing units where the age of the spouse is more than seventy years less than the age of the head seem a little extreme.

Does anybody understand why we are generating these few extreme age differences?

@Amy-Xu @MattHJensen

Components of interest deduction

In functions.py in Tax-Calculator, I believe the only interest deduction we use is e19200, defined as Sch. A: total interest deduction.

Realistically, there are different components, and it could be useful to differentiate between them. These include mortgage interest on homes, interest on various types of business debt, and investment interest paid (only deductible from investment income).

Is it possible to separate these different inputs to the total interest deduction amount we use now? If so, this could be useful in two ways:

  1. We could separately estimate the effective marginal subsidies on different types of debt.
  2. Potential reforms may eliminate some deductions but not others. For example, the Ryan-Brady plan from 2016 would eliminate the net interest deduction for pass-through entities (which could include interest on business debt) but it would not eliminate the mortgage interest deduction. It would be useful to be able to separate these different sources of interest income if possible.

I recently mentioned this to @andersonfrailey. I'd love to know if this request is possible to implement in taxdata.

@MattHJensen @martinholmer @Amy-Xu

Extrapolate welfare data

The third task outlined in this issue is to develop an extrapolation routine for welfare data in the CPS tax unit dataset. An initial thought is to assume that, for each program, participation and benefits grow at X and Y percent each year, respectively, where X and Y are derived from historical data. (If official projection targets are available, then we could use those targets directly.) We could then use the same logit regression for imputation to meet the targets for participation growth, and then apply a uniform ratio to everyone in order to blow up total benefits.
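A minimal sketch of the uniform-ratio step just described, assuming hypothetical column names ('benefit' for the program amount, 's006' for the record weight) and a hypothetical growth rate; the logit participation step is omitted:

import pandas as pd

def blow_up_benefits(df, growth_rate, target_year, base_year=2014):
    """Scale every record's benefit by one uniform ratio so that the
    weighted total grows at growth_rate per year from base_year."""
    ratio = (1.0 + growth_rate) ** (target_year - base_year)
    df = df.copy()
    df['benefit'] *= ratio  # same ratio applied to everyone
    return df

# weighted total in the target year:
# total = (df['benefit'] * df['s006']).sum()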

Many details need to be considered, but for now the trickiest part is whether to do this extrapolation at the tax-unit level or at the original program benefit unit (individual/household) level in the raw CPS. The individual or household level is natural since all projection and historical data are available at those levels; however, this will create enormous difficulty afterwards because the raw CPS needs to go through the tax-unit creation process. The weights of records do not stay the same over time, and thus extrapolation based on 2014 raw CPS weights cannot guarantee hitting the targets in later years. On the other hand, extrapolating the data at the tax-unit level would make later steps easier, but there aren't any targets or historical welfare data at the tax-unit level.

Other things to consider:

  • Should we consider population growth the same way as tax-data extrapolation?
  • Non-filer targets? Since many benefit programs would require non-filer numbers to be accurate.

Any thoughts? @MattHJensen @martinholmer @andersonfrailey @hdoupe

Incorrect income percentiles

The distribution of income quintiles in the PUF appears incorrect.

Using the Tax-Calculator, I found the following values for median tax unit income. The numbers in parentheses are the nominal median household incomes from the Census Bureau.
2013: $27,250 ($52,250)
2014: $28,022 ($53,657)
2015: $28,877 ($55,775)

The numbers calculated with TC are more consistent with median individual incomes, but AGI includes income from the primary and secondary earner. I think the only major discrepancy should come from married couples filing separately.

The Tax Policy Center also has some percentile distributions. Their numbers are in parentheses next to those estimated using TC (for 2013).
20th: $5894 ($21,000)
40th: $18,537 ($41,035)
60th: $38,659 ($67,200)
80th: $77,975 ($110,232)

If our percentile distributions are too far off, we can't do a distributional analysis of a tax plan.

Link to my workbook

cc @martinholmer @MattHJensen @Amy-Xu @andersonfrailey

Create private conda package for each PUF file

It has become tough to remember which version of the PUF goes with which version(s) of OG-USA, Tax-Calc, and B-Tax, and it has also limited automation.

Fix:

  • Make a private conda package named taxpuf (originally proposed as ospc-puf), with a version of that package for each puf.csv.gz we have used.
  • Keep taxpuf in this organization:
    https://anaconda.org/opensourcepolicycenter/
  • Make a new version of taxpuf each time a new puf.csv is made.
  • In the OG-USA, Tax-Calc, B-Tax repos, be sure that the conda.recipes/meta.yaml names specific versions of the taxpuf package to install.

Tax elasticities on giving

Hey everyone. I am writing on behalf of the research team at the Indiana University Lilly Family School of Philanthropy with a request. We are developing a white paper on the effects of certain proposed policy changes to both charitable giving and overall tax revenue. Prior literature on this has generally assumed a single elasticity for the price of giving, but we have the ability, through use of the Philanthropy Panel Study (PPS), part of the Panel Study of Income Dynamics (PSID), to calculate more specific elasticities for income brackets and itemizing status. We hope to apply these different elasticities to the Tax Calculator, so that when we run our policy changes (see the open issue here: PSLmodels/Tax-Calculator#1236) we have a more accurate and refined estimate of the effect on charitable giving and revenue.

In particular, we are hoping to apply elasticities of charitable giving with respect to the marginal tax rate on charitable giving, varying by both itemizing status and across the income categories. Our current specifications use imputed AGI ranges of <$20k, $20k-$50k, $50k-$100k, and $100k+. These can be easily adjusted though (with the exception that super-high incomes are not reliably measured in the PSID). We talked to @MattHJensen about gaining access to the puf.csv file for the purpose of imputing charitable deductions for non-filers based on this work with the PSID.

We would greatly appreciate any help in regards to this request.

Check tax-unit benefit data (CPS)

During the development of the extrapolation routine for benefit data, Martin spotted this SSI imputation error from the tabulation of the number of participants per tax unit. It would be great if we could put together a checking routine, or a testing script, for this dataset. Due to the lack of official tax-unit benefit statistics, I think we could start by brainstorming a checking list and then automate the checking process. I imagine this is a job parallel to the development of the extrapolation routine, and hopefully we can get a draft version done before making UBI analysis available to the public on TB.

To this point, I think

  • program aggregates (benefit & participation)
  • tax unit participation cap

have already proven to be useful and therefore should be included in the list; see the sketch below. I would like to hear more suggestions/comments/discussion on this issue.
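As a starting point, a minimal sketch of the program-aggregate check, assuming hypothetical column names ('ssi_ben', 's006') and a hypothetical administrative target; none of these names are confirmed taxdata columns:

import pandas as pd

def check_aggregate(df, benefit_col, weight_col, target, tol=0.05):
    """Flag a benefit program whose weighted total strays from target."""
    total = (df[benefit_col] * df[weight_col]).sum()
    ok = abs(total / target - 1.0) <= tol
    return ok, total

# example usage with made-up numbers:
# ok, total = check_aggregate(cps, 'ssi_ben', 's006', target=55e9)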

@martinholmer @MattHJensen @hdoupe @andersonfrailey

Adding state identification to the PUF

One of the most frequent requests that I have heard from users, especially during the TCJA debate, has been for the capability to analyze the impact of federal tax reform by state.

Don Boyd (@donboyd5) and I have been discussing an approach to do this, and he has given me permission to move our conversation onto GitHub. I will reproduce the conversation to date in the next comment.

Another advantage of the approach that we discuss is that it would provide a PUF-based dataset that could support state-level calculators.

child/dependent care expenses on puf.csv vs cps.csv

I see from our tax-calculator documentation that we have child/dependent care expenses for qualifying persons from form 2441 (e32800) on the taxdata_cps file. Is that from the CPS directly or is it imputed? Do we have values for filing units that don’t claim the credit (because they don’t have income tax liabilities or for other reasons)?

A potential tax-calculator user is interested in analyzing the CDCTC under current law and expansion reforms, and I am wondering if they can safely do that with the cps file. My understanding is that he could not analyze expansions with the puf.csv file.

@andersonfrailey

facilitate user feedback and understanding of the database by publishing summary stats

The extrapolation routine is somewhat difficult for outsiders to dig into right now. The methodology is daunting and you can't interact with the extrapolation routines without the PUF.

I think we could make it much easier for people to understand the data behind TaxBrain if we published a table with summary statistics of every variable for every year.

Here are the columns that come to mind.

year
e-code
English_description
weighted_sum
weighted_mean
count_pos
sum_pos   
count_neg
sum_neg
sum_agi_dec1
sum_agi_dec2
sum_agi_dec3
sum_agi_dec4
sum_agi_dec5
sum_agi_dec6
sum_agi_dec7
sum_agi_dec8
sum_agi_dec9
sum_agi_dec10
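A minimal pandas sketch of how these columns might be computed for one variable in one year; the weight column name ('s006') and the input DataFrame are assumptions, not confirmed taxdata names, and the AGI-decile columns would follow the same pattern after binning on AGI:

import pandas as pd

def summary_row(df, var, weight='s006'):
    """Compute the proposed weighted summary columns for one e-code."""
    v, w = df[var], df[weight]
    pos, neg = v > 0, v < 0
    return {
        'e-code': var,
        'weighted_sum': (v * w).sum(),
        'weighted_mean': (v * w).sum() / w.sum(),
        'count_pos': w[pos].sum(),
        'sum_pos': (v[pos] * w[pos]).sum(),
        'count_neg': w[neg].sum(),
        'sum_neg': (v[neg] * w[neg]).sum(),
    }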

Do others think that this would be useful? If so, are these the right columns? If not, what would be more helpful to allow users to understand the data?

cc @feenberg, @martinholmer, @Amy-Xu, @johnfohare, @aviard

"FileNotFoundError: File b'cps-matched-puf.csv' does not exist"

I cloned the Tax Calculator repo and created the taxcalc-dev conda environment as described here. I activated that conda environment and ran all the tests, validation and otherwise (they all succeeded after a little tweaking). Then, still in the conda environment, I tried to generate a sample data file from a clone of this repo, taxdata, by running ./csvmake puf 1. I received the following stack trace and messages:

(taxcalc-dev) jeff@jbb-lenovo:~/javeriana/taxdata$ ./csvmake puf 1
Wed Dec  6 16:03:12 -05 2017 : puf_data START
Traceback (most recent call last):
  File "finalprep.py", line 555, in <module>
    sys.exit(main())
  File "finalprep.py", line 15, in main
    data = pandas.read_csv('cps-matched-puf.csv')
  File "/home/jeff/installs/miniconda3/envs/taxcalc-dev/lib/python3.5/site-packages/pandas/io/parsers.py", line 705, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jeff/installs/miniconda3/envs/taxcalc-dev/lib/python3.5/site-packages/pandas/io/parsers.py", line 445, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/jeff/installs/miniconda3/envs/taxcalc-dev/lib/python3.5/site-packages/pandas/io/parsers.py", line 814, in __init__
    self._make_engine(self.engine)
  File "/home/jeff/installs/miniconda3/envs/taxcalc-dev/lib/python3.5/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/jeff/installs/miniconda3/envs/taxcalc-dev/lib/python3.5/site-packages/pandas/io/parsers.py", line 1684, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 391, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 710, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'cps-matched-puf.csv' does not exist
ERROR: executing /finalprep.py script
NO 'git diff' OUTPUT ==> NO CHANGES IN FILES
(taxcalc-dev) jeff@jbb-lenovo:~/javeriana/taxdata$

I find no file named anything like that, in either the Tax-Calculator or taxdata repos:

(taxcalc-dev) jeff@jbb-lenovo:~/javeriana$ ls
Tax-Calculator  taxdata
(taxcalc-dev) jeff@jbb-lenovo:~/javeriana$ find . -name "*matched*.csv"
(taxcalc-dev) jeff@jbb-lenovo:~/javeriana$ 

Missing variables in CPS file

To allow people outside the OSPC to use B-Tax (beyond the limited capabilities of the webapp), they would need to be able to use the CPS instead. It is simple to modify B-Tax to call Records.cps_constructor instead of Records(), but the CPS is missing several critical variables for B-Tax.

B-Tax requires marginal tax rates on 10 different variables from Tax-Calculator, but the CPS file is missing 4 of them:
e02000: Total Sch E income or loss
e26270: Income or loss from a partnership or S corporation
p22250: Short-term capital gain or loss
p23250: Long-term capital gain or loss

The missing variables e02000 and e26270 would cause incorrect calculations for METRs, METTRs, and the cost of capital. The missing variables p22250 and p23250 cause errors that prevent B-Tax from running (unless test_run = True, in which case the preset hardcoded values are used).

Would it be reasonable for someone to impute these variables for the CPS file?

@MattHJensen @andersonfrailey @Amy-Xu @martinholmer

Imputations for Expanded Income

Our distributional tables currently use AGI as the tab variable. Instead, they should use an "expanded income" measure that more accurately reflects economic well-being.

Our first step should be to get close to JCT's measure of expanded income. We should be able to get the data from the sources listed in parentheses.

Expanded income = adjusted gross income (puf), plus:

  • tax-exempt interest (puf)
  • workers' compensation (cps)
  • nontaxable social security benefits (puf: gross-taxable)
  • excluded income of U.S. citizens living abroad (?)
  • value of Medicare benefits in excess of premiums paid (?)
  • minimum tax preferences (puf or OTA paper)
  • employer contributions for health plans (aligned MEPS + CPS)
  • employer contributions to life insurance (?)
  • employer share of payroll taxes (puf)
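A minimal sketch of the aggregation, purely to illustrate the arithmetic; every column name below is a hypothetical placeholder, not a confirmed puf.csv or cps.csv variable:

def expanded_income(row):
    """Sum AGI and the non-AGI components listed above for one record."""
    return (row['agi']
            + row['tax_exempt_interest']
            + row['workers_compensation']
            + row['nontaxable_socsec']          # gross minus taxable
            + row['employer_health_contrib']
            + row['employer_share_payroll_tax'])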

The main outstanding items are

  • Aggregate all of the items from the puf
  • Impute minimum tax preferences from OTA paper
  • Impute employer contributions for health plans
  • Find and exploit sources for the ?s

Useful docs:

Note that Tax Foundation uses AGI as a tab variable.

This is moved from PSLmodels/Tax-Calculator#222

Obtain or create code for generating CPS weights file

In #108, @andersonfrailey said:

This is because the weights file we have for the CPS only goes to 2026. John gave us the code he used to create the file, but not the code to create the weights file so I can't generate a new one that goes to 2027 on my own.

I will also email John to see if he has a new weights file available that goes to 2027.

Given that we need to be able to generate the weights file ourselves, shouldn't we be asking John for the code to create the weights file or guidance on how to rewrite that rather than a new weights file?

I am putting this in its own issue to reflect its high priority and intrinsic importance. This seems exceptionally important to me since it is blocking several other enhancements at this point, and we also can't truly vouch for the CPS file (or even say that it is open source) if we can't build it from scratch.

Please let me know if I am missing anything or can do anything to assist in obtaining this file.

@Amy-Xu @hdoupe @martinholmer @andersonfrailey @GoFroggyRun

Updating the PUF

TaxCalc currently uses a PUF file made from a statistical match of the 2014 CPS and 2009 Public Use Tax File from the IRS. I am currently working on updating the PUF with new variables, new weights, and the 2015 CPS/2010 PUF in the following steps:

  1. Add dependent ages from the CPS to the PUF for use in TaxCalc PR #976 and remove p87482 (PR #37 from @martinholmer). This should cause no change in the TaxCalc output. However, I'm having problems creating an exact replica of our current PUF, which is resulting in slightly different outputs. This will be further addressed in a forthcoming TaxData PR.
  2. Add targets for the number of filers with AGI above $5 million to the SOI estimates used during Stage 1 of the data extrapolation to get a more accurate distribution of income.
  3. Update the CBO forecasts used in Stage 1 of the data extrapolation for more accurate growth rates.

(Steps 2 and 3 will result in more accurate weights for each of the records.)

  4. Use the 2010 PUF and 2015 CPS datasets to create the match.

The reason for doing this step-by-step rather than as one big project is to document how each change affects TaxCalc's outputs.

I'd love any feedback on this plan or other opinions on what the next steps should be.

@martinholmer @Amy-Xu @talumbau @GoFroggyRun @MattHJensen
