
persp-model-econ_w19's Issues

LSQUnivariateSpline requires strictly increasing x

In PS7 Q2, you might encounter this error when creating the linear/cubic spline object:

ValueError: x must be strictly increasing
LSQUnivariateSpline has enforced that the input array be strictly increasing since 2017 (GitHub issue). Here are three ways I have thought of to get around it:

  1. Use an older SciPy version from before the check was implemented. It seems this is what some of last year's students did (GitHub issue). I don't think this is an efficient approach.

  2. Implement the spline function from scratch. I did the linear spline and am working on the cubic spline. However, the question requires using LSQUnivariateSpline.

  3. Group the data by age, averaging Coolness (a runnable sketch follows below the table):
    df2 = df.groupby('Age').mean()
    df2['Age'] = df2.index
    The new dataframe will look like this:

Age  | Cool
-----|----------
11.0 | 10.110237
12.0 | 9.365623
13.0 | 10.015882
14.0 | 11.747109
15.0 | 15.434739

This solves the problem.
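
Here is a minimal, self-contained sketch of option 3 feeding the grouped data into LSQUnivariateSpline (the toy data, interior knot, and spline degree are illustrative, not from the problem set):

import pandas as pd
from scipy.interpolate import LSQUnivariateSpline

# Toy data with repeated ages, mimicking the PS7 structure
df = pd.DataFrame({'Age':  [11, 11, 12, 12, 13, 13, 14, 14, 15, 15],
                   'Cool': [10.1, 10.2, 9.3, 9.4, 10.0, 10.1,
                            11.7, 11.8, 15.4, 15.5]})

df2 = df.groupby('Age').mean()
df2['Age'] = df2.index  # Age is now unique and strictly increasing

# With strictly increasing x, the constructor no longer raises ValueError
spline = LSQUnivariateSpline(df2['Age'].values, df2['Cool'].values,
                             t=[13.0], k=1)  # one interior knot, linear fit
print(spline(13.5))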

Here is the link to my interactive plot:

https://plot.ly/~lliu95877/4/coolness-index-against-age/

PS #4, question 1, section a

In that question, we've been asked to set the normed=True option. But when I checked the function documentation online, it says: "normed : bool, optional. Deprecated; use the density keyword argument instead." Also, the MLE notebook uses the density option. So do we still go ahead and use normed?
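
For reference, a minimal sketch of the density replacement (the data here are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Toy data; the problem set supplies the actual series
data = np.random.default_rng(0).normal(size=500)
plt.hist(data, bins=30, density=True)  # replaces the deprecated normed=True
plt.show()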

Boosting for Hitters Data

@rickecon
I tried using boosting to reduce the MSE. It works well for the Hitters data; a self-contained sketch of the setup these snippets assume appears after the list.

For trees with max_depth=3 and min_samples_leaf=4:

  1. Decision Tree
    mean_squared_error(y_test, hit_tree2.predict(X_test))
    MSE = 146493

  2. Random Forest
    rfr = RandomForestRegressor(n_estimators=100, max_depth=3,
                                min_samples_leaf=4, random_state=25)
    rfr.fit(X_train, y_train)
    mean_squared_error(y_test, rfr.predict(X_test))
    MSE = 110005

  3. Gradient Boosting
    clf = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                    min_samples_leaf=4, random_state=25)
    clf.fit(X_train, y_train)
    mean_squared_error(y_test, clf.predict(X_test))
    MSE = 102931

  4. AdaBoost
    ada = AdaBoostRegressor(hit_tree2, n_estimators=100, random_state=25)
    ada.fit(X_train, y_train)
    mean_squared_error(y_test, ada.predict(X_test))
    MSE = 107001
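
For reference, a minimal, self-contained sketch of the setup these snippets assume (the CSV path, feature encoding, and train/test split are illustrative; hit_tree2 is the fitted tree that AdaBoost reuses above):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical path; the course supplies the actual Hitters data
hitters = pd.read_csv('Hitters.csv').dropna()
X = pd.get_dummies(hitters.drop(columns='Salary'), drop_first=True)
y = hitters['Salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=25)

hit_tree2 = DecisionTreeRegressor(max_depth=3, min_samples_leaf=4,
                                  random_state=25)
hit_tree2.fit(X_train, y_train)
print(mean_squared_error(y_test, hit_tree2.predict(X_test)))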

PS #6 Q1 Part (f)

In this part, we've been asked to predict the mpg values from the model that was built. The given model year for prediction is 1999. In the auto data that's been given, the variable 'year' takes values between 70 and 82. So shouldn't we be using 99 and not 1999 for prediction?

(Possible) Minor typo in PS 8

Hi,

I think this is a typo, but I thought I would confirm anyway. So in Q2, you have:

Create a binary variable mpg_high that equals 1 if mpg_high ≥ median(mpg_high) and equals 0 if mpg_high < median(mpg_high).

Shouldn't we have median(mpg) instead of median(mpg_high)? Thanks!
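
If median(mpg) is indeed what's intended, here is a minimal sketch of the construction (toy data; the auto_df and mpg names mirror the problem set's):

import pandas as pd

# Toy data; the problem set supplies the actual Auto dataset
auto_df = pd.DataFrame({'mpg': [18.0, 15.0, 26.0, 30.5, 22.0]})
# 1 if mpg is at or above median(mpg), else 0
auto_df['mpg_high'] = (auto_df['mpg'] >= auto_df['mpg'].median()).astype(int)
print(auto_df)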

Potential typo on PS3

In exercise 5.16, the question says that epsilon's mean equals mu, but the distribution is N(0, sigma). I don't know if anyone else has the same question, but I think the center of the distribution should also be mu. Hopefully this issue is not raised too late.

Possible minor typos in PS9

@rickecon
Hi Rick,

In (b) logistic regression, we should tune the parameters penalty and C instead of max_depth, min_samples_split, and min_samples_leaf.

In (c) random forest, n_estimators is the number of trees in the forest. I think it makes more sense to use sp_randint(10, 200) instead of [10, 200]: with [10, 200], we only ever alternate between those two values. However, sampling the whole range adds much more computational cost and is not guaranteed to find the best score because of the randomness (see the sketch after this list).

In (d) SVC, shrinking needs to be a boolean, so the grid could be either [True, False] or [1, 0] (this was corrected in class).
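
For reference, a minimal sketch of point (c) with RandomizedSearchCV (the dataset and search settings are illustrative, not from the problem set):

from scipy.stats import randint as sp_randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
# sp_randint(10, 200) samples any integer in [10, 200); the list
# [10, 200] would only ever try those two values
param_dist = {'n_estimators': sp_randint(10, 200)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=25),
                            param_dist, n_iter=10, cv=3, random_state=25)
search.fit(X, y)
print(search.best_params_)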

PS #4, question 2, data

In the data given for Q2, the values for the number of children are all fractional (float type). Do we use these values as given, or do we truncate the decimal part for accuracy (you can't have a fractional number of children in real life)?

PS #6 Q1 Part (b)

In the PS, for plotting the data, we've been asked to use this package:

from pandas.plotting import scatter_matrix 
scatter_matrix(df_quant, alpha=0.3, figsize=(6, 6), diagonal='kde')

But the plot quality isn't that great. The seaborn package does a much better job at producing the same plot:

import seaborn as sns
sns.pairplot(auto_df, dropna=True)

Can I go ahead with the seaborn version? Thanks!

Some issue regarding PS4

Hi all,

I have completed grading PS4. There is a common mistake I want to bring to your attention:

When calculating the estimated variance-covariance matrix of the estimates, results.hess_inv is not an ndarray but a scipy.optimize.LbfgsInvHessProduct object, which interprets '*' as matrix multiplication instead of pointwise multiplication, generating an incorrect VCV result. Here we should use results.hess_inv.todense() * OffDiagNeg, i.e., convert it to a dense matrix first and then do the pointwise multiplication. @rickecon
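
A minimal sketch of the fix, assuming results comes from scipy.optimize.minimize with method='L-BFGS-B', and that OffDiagNeg is the sign matrix from the problem set (the objective and sign matrix here are toys):

import numpy as np
import scipy.optimize as opt

# Toy negative log-likelihood; the PS supplies the real one
neg_log_lik = lambda params: np.sum((params - np.array([1.0, 2.0]))**2)
results = opt.minimize(neg_log_lik, np.array([0.0, 0.0]), method='L-BFGS-B')

OffDiagNeg = np.array([[1.0, -1.0],
                       [-1.0, 1.0]])  # illustrative sign matrix
# .todense() gives an ndarray, so '*' is pointwise as intended
vcv = results.hess_inv.todense() * OffDiagNeg
print(vcv)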

Please don't hesitate to send me an email or come to my office hours if you have any questions regarding your PS4 grade.

Best,
Winston

Typo in Confusion Matrix

In the notebook LogitKNN, we have the following confusion matrix:
[[180, 32],
 [ 48, 96]]
where 180 is True Positives, 32 is False Positives, 48 is False Negatives, and 96 is True Negatives.

So the matrix is telling us that 180 and 96 are the numbers of correct predictions.
The model predicts 32 passengers who actually died as survived and 48 passengers who actually survived as died.
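
For reference, a minimal sketch of producing such a matrix (toy labels; in scikit-learn's convention, rows are true classes and columns are predicted classes):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1])  # toy labels
y_pred = np.array([0, 1, 0, 1, 0])
print(confusion_matrix(y_true, y_pred))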

Questions on PS6 2(e)

In the question, if we set K=2, then observations 4 and 6 will be in the neighborhood, but they are red and green, respectively. How should we handle situations like this?

Question about problem set 2

Problem 4 in ACME: Numerical Differentiation says the data is stored in the file plane.npy. Where can we get this file?

Composite Simpson’s Rule

Here is a detailed discussion with Dr. Evans @rickecon about playing around with different expressions of the composite Simpson's rule.

  1. In the ACME Numerical Integration notebook, the equation is:

    $$\int_a^b g(x)\,dx \approx \frac{b-a}{3(N+1)}\left[g(x_0) + 4\sum_{i=1}^{N} g(x_{2i-1}) + 2\sum_{i=1}^{N-1} g(x_{2i}) + g(x_{2N})\right]$$

  2. In the Jupyter notebook, the equation is:

    $$\int_a^b g(x)\,dx \approx \frac{b-a}{6N}\left[g(x_0) + 4\sum_{i=1}^{N} g(x_{2i-1}) + 2\sum_{i=1}^{N-1} g(x_{2i}) + g(x_{2N})\right]$$

  3. In the Numerical Analysis textbook (Burden and Faires, 9th ed.), the equation is:

    $$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 2\sum_{j=1}^{n/2-1} f(x_{2j}) + 4\sum_{j=1}^{n/2} f(x_{2j-1}) + f(x_n)\right] - \frac{(b-a)h^4}{180}f^{(4)}(\mu), \quad h = \frac{b-a}{n}$$

There are several differences:

a. The denominators are 3(N+1), 6N, and 3n, respectively.

b. To get an even number of subintervals, 1 and 2 use 2N while 3 uses n. So it is inconsistent for 1 and 2 to say "N intervals" while using nodes x_0, x_1, ..., x_{2N}; it should be 2N intervals in 1 and 2.

c. Minor things: 1 and 2 use g(x) while 3 uses f(x). 3 also has an error term, which can be ignored for now.

Example:

Suppose we use four subintervals (n = 2N = 4) to approximate the integral.

In 1 and 2, 2N = 4, so N = 2 and the term in square brackets is g(x_0) + 2g(x_2) + 4(g(x_1) + g(x_3)) + g(x_4).

In 3, n = 4, and the term in square brackets is f(x_0) + 2f(x_2) + 4(f(x_1) + f(x_3)) + f(x_4).

So they are the same!

My code implementation using Eq. 2 (with 2N):

N = 1000000
a, b = -10, 10
g = lambda x: 0.1*x**4 - 1.5*x**3 + 0.53*x**2 + 2*x + 1
intersim = [a + i*(b - a)/(2*N) for i in range(2*N + 1)]
appsim = (g(intersim[0]) + g(intersim[-1])
          + 4*sum(g(i) for i in intersim[1:-1:2])   # odd-index nodes
          + 2*sum(g(i) for i in intersim[2:-2:2])   # even interior nodes
          ) * (b - a) / (6*N)
appsim

The result will be 4373.333333333138
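
As a sanity check, the exact value can be recovered with scipy.integrate.quad (a minimal sketch; the integrand is the one above):

from scipy.integrate import quad

g = lambda x: 0.1*x**4 - 1.5*x**3 + 0.53*x**2 + 2*x + 1
print(quad(g, -10, 10)[0])  # ~4373.3333, matching the Simpson estimate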

Conclusion: Following the equation in the Jupyter notebook will yield the correct answer. It's the "N intervals" typo that confused me in class; changing it to 2N makes it agree with the textbook formula. Hope this helps.

Reference:

Numerical Analysis, 9th edition, by R. Burden and J. Faires

Numerical Integration by ACME at BYU

Numerical Integration notebook by Dr. Evans
