
persp-model-econ_w19's Issues

LSQUnivariateSpline requires strictly increasing x

In PS7 Q2, you might encounter this error when creating the linear/cubic spline object:

ValueError: x must be strictly increasing
LSQUnivariateSpline has enforced that the input array be strictly increasing since 2017 (GitHub issue). Here are three ways I have thought of to get around it:

  1. Use an older SciPy version from before the check was implemented. It seems this is what some of last year's students did (GitHub issue). I don't think this is an efficient approach.

  2. Implement the spline function from scratch. I did the linear spline and am working on the cubic spline. However, the question requires using LSQUnivariateSpline.

  3. Group the data by age, averaging Coolness (a runnable sketch follows below the table):
    df2 = df.groupby('Age').mean()
    df2['Age'] = df2.index
    The new dataframe will look like this:

Age  | Cool
-----|----------
11.0 | 10.110237
12.0 | 9.365623
13.0 | 10.015882
14.0 | 11.747109
15.0 | 15.434739

This solves the problem.
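
Here is a minimal, self-contained sketch of option 3 feeding the grouped data into LSQUnivariateSpline (the toy data, interior knot, and spline degree are illustrative, not from the problem set):

import pandas as pd
from scipy.interpolate import LSQUnivariateSpline

# Toy data with repeated ages, mimicking the PS7 structure
df = pd.DataFrame({'Age':  [11, 11, 12, 12, 13, 13, 14, 14, 15, 15],
                   'Cool': [10.1, 10.2, 9.3, 9.4, 10.0, 10.1,
                            11.7, 11.8, 15.4, 15.5]})

df2 = df.groupby('Age').mean()
df2['Age'] = df2.index  # Age is now unique and strictly increasing

# With strictly increasing x, the constructor no longer raises ValueError
spline = LSQUnivariateSpline(df2['Age'].values, df2['Cool'].values,
                             t=[13.0], k=1)  # one interior knot, linear fit
print(spline(13.5))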

Here is the link to my interactive plot:

https://plot.ly/~lliu95877/4/coolness-index-against-age/

PS #4, question 1, section a

In that question, we've been asked to set the normed=True option. But when I checked the function documentation online, it says: "normed : bool, optional. Deprecated; use the density keyword argument instead." Also, the MLE notebook uses the density option. So do we still go ahead and use normed?
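
For reference, a minimal sketch of the density replacement (the data here are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Toy data; the problem set supplies the actual series
data = np.random.default_rng(0).normal(size=500)
plt.hist(data, bins=30, density=True)  # replaces the deprecated normed=True
plt.show()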

Boosting for Hitters Data

@rickecon
I tried using boosting to reduce the MSE. It works well for the Hitters data; a self-contained sketch of the setup these snippets assume appears after the list.

For trees with max_depth=3 and min_samples_leaf=4:

  1. Decision Tree
    mean_squared_error(y_test, hit_tree2.predict(X_test))
    MSE = 146493

  2. Random Forest
    rfr = RandomForestRegressor(n_estimators=100, max_depth=3,
                                min_samples_leaf=4, random_state=25)
    rfr.fit(X_train, y_train)
    mean_squared_error(y_test, rfr.predict(X_test))
    MSE = 110005

  3. Gradient Boosting
    clf = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                    min_samples_leaf=4, random_state=25)
    clf.fit(X_train, y_train)
    mean_squared_error(y_test, clf.predict(X_test))
    MSE = 102931

  4. AdaBoost
    ada = AdaBoostRegressor(hit_tree2, n_estimators=100, random_state=25)
    ada.fit(X_train, y_train)
    mean_squared_error(y_test, ada.predict(X_test))
    MSE = 107001
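
For reference, a minimal, self-contained sketch of the setup these snippets assume (the CSV path, feature encoding, and train/test split are illustrative; hit_tree2 is the fitted tree that AdaBoost reuses above):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical path; the course supplies the actual Hitters data
hitters = pd.read_csv('Hitters.csv').dropna()
X = pd.get_dummies(hitters.drop(columns='Salary'), drop_first=True)
y = hitters['Salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=25)

hit_tree2 = DecisionTreeRegressor(max_depth=3, min_samples_leaf=4,
                                  random_state=25)
hit_tree2.fit(X_train, y_train)
print(mean_squared_error(y_test, hit_tree2.predict(X_test)))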

PS #6 Q1 Part (f)

In this part, we've been asked to predict the mpg values from the model that was built. The given model year for prediction is 1999. In the auto data that's been given, the variable 'year' takes values between 70 and 82. So shouldn't we be using 99 and not 1999 for prediction?

(Possible) Minor typo in PS 8

Hi,

I think this is a typo, but I thought I would confirm anyway. So in Q2, you have:

Create a binary variable mpg_high that equals 1 if mpg_high ≥ median(mpg_high) and equals 0 if mpg_high < median(mpg_high).

Shouldn't we have median(mpg) instead of median(mpg_high)? Thanks!
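
If median(mpg) is indeed what's intended, here is a minimal sketch of the construction (toy data; the auto_df and mpg names mirror the problem set's):

import pandas as pd

# Toy data; the problem set supplies the actual Auto dataset
auto_df = pd.DataFrame({'mpg': [18.0, 15.0, 26.0, 30.5, 22.0]})
# 1 if mpg is at or above median(mpg), else 0
auto_df['mpg_high'] = (auto_df['mpg'] >= auto_df['mpg'].median()).astype(int)
print(auto_df)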

Potential typo on PS3

In exercise 5.16, the question says that epsilon's mean equals mu, but the distribution is N(0, sigma). I don't know if anyone else has the same question, but I think the center of the distribution should also be mu. Hopefully this issue is not raised too late.

Possible minor typos in PS9

@rickecon
Hi Rick,

In (b) logistic regression, we should tune the parameters penalty and C instead of max_depth, min_samples_split, and min_samples_leaf.

In (c) random forest, n_estimators is the number of trees in the forest. I think it makes more sense to use sp_randint(10, 200) instead of [10, 200]: with [10, 200], we only ever alternate between those two values. However, sampling the whole range adds much more computational cost and is not guaranteed to find the best score because of the randomness (see the sketch after this list).

In (d) SVC, shrinking needs to be a boolean, so the grid could be either [True, False] or [1, 0] (this was corrected in class).
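
For reference, a minimal sketch of point (c) with RandomizedSearchCV (the dataset and search settings are illustrative, not from the problem set):

from scipy.stats import randint as sp_randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
# sp_randint(10, 200) samples any integer in [10, 200); the list
# [10, 200] would only ever try those two values
param_dist = {'n_estimators': sp_randint(10, 200)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=25),
                            param_dist, n_iter=10, cv=3, random_state=25)
search.fit(X, y)
print(search.best_params_)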

PS #4, question 2, data

In the data given for Q2, the values for the number of children are all fractional (float type). Do we use these values as given, or do we truncate the decimal part for accuracy (you can't have a fractional number of children in real life)?

PS #6 Q1 Part (b)

In the PS, for plotting the data, we've been asked to use this package:

from pandas.plotting import scatter_matrix 
scatter_matrix(df_quant, alpha=0.3, figsize=(6, 6), diagonal='kde')

But the plot quality isn't that great. The seaborn package does a much better job at producing the same plot:

import seaborn as sns
sns.pairplot(auto_df, dropna=True)

Can I go ahead with the seaborn version? Thanks!

Some issue regarding PS4

Hi all,

I have completed grading PS4. There is a common mistake I want to bring to your attention:

When calculating the estimated variance-covariance matrix of the estimates, results.hess_inv is not an ndarray but a scipy.optimize.LbfgsInvHessProduct object, which interprets '*' as matrix multiplication instead of pointwise multiplication, generating an incorrect VCV result. Here we should use results.hess_inv.todense() * OffDiagNeg, i.e., convert it to a dense matrix first and then do the pointwise multiplication. @rickecon
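
A minimal sketch of the fix, assuming results comes from scipy.optimize.minimize with method='L-BFGS-B', and that OffDiagNeg is the sign matrix from the problem set (the objective and sign matrix here are toys):

import numpy as np
import scipy.optimize as opt

# Toy negative log-likelihood; the PS supplies the real one
neg_log_lik = lambda params: np.sum((params - np.array([1.0, 2.0]))**2)
results = opt.minimize(neg_log_lik, np.array([0.0, 0.0]), method='L-BFGS-B')

OffDiagNeg = np.array([[1.0, -1.0],
                       [-1.0, 1.0]])  # illustrative sign matrix
# .todense() gives an ndarray, so '*' is pointwise as intended
vcv = results.hess_inv.todense() * OffDiagNeg
print(vcv)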

Please don't hesitate to send me an email or come to my office hours if you have any questions regarding your PS4 grade.

Best,
Winston

Typo in Confusion Matrix

In the notebook LogitKNN, we have the following confusion matrix:
[[180, 32],
 [ 48, 96]]
where 180 is True Positives, 32 is False Positives, 48 is False Negatives, and 96 is True Negatives.

So the matrix is telling us that 180 and 96 are the numbers of correct predictions.
The model predicts 32 passengers who actually died as survived and 48 passengers who actually survived as died.
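
For reference, a minimal sketch of producing such a matrix (toy labels; in scikit-learn's convention, rows are true classes and columns are predicted classes):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1])  # toy labels
y_pred = np.array([0, 1, 0, 1, 0])
print(confusion_matrix(y_true, y_pred))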

Questions on PS6 2(e)

In the question, if we set K=2, then observations 4 and 6 will be in the neighborhood, but they are red and green, respectively. How should we handle situations like this?

Question about problem set 2

Problem 4 in ACME: Numerical Differentiation says the data is stored in the file plane.npy. Where can we get this file?

Composite Simpson’s Rule

Here is a detailed discussion with Dr. Evans @rickecon about playing around with different expressions of the composite Simpson's rule.

  1. In the ACME Numerical Integration notebook, the equation is:

    $$\int_a^b g(x)\,dx \approx \frac{b-a}{3(N+1)}\left[g(x_0) + 4\sum_{i=1}^{N} g(x_{2i-1}) + 2\sum_{i=1}^{N-1} g(x_{2i}) + g(x_{2N})\right]$$

  2. In the Jupyter notebook, the equation is:

    $$\int_a^b g(x)\,dx \approx \frac{b-a}{6N}\left[g(x_0) + 4\sum_{i=1}^{N} g(x_{2i-1}) + 2\sum_{i=1}^{N-1} g(x_{2i}) + g(x_{2N})\right]$$

  3. In the Numerical Analysis textbook (Burden and Faires, 9th ed.), the equation is:

    $$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 2\sum_{j=1}^{n/2-1} f(x_{2j}) + 4\sum_{j=1}^{n/2} f(x_{2j-1}) + f(x_n)\right] - \frac{(b-a)h^4}{180}f^{(4)}(\mu), \quad h = \frac{b-a}{n}$$

There are several differences:

a. The denominators are 3(N+1), 6N, and 3n, respectively.

b. To get an even number of subintervals, 1 and 2 use 2N while 3 uses n. So it is inconsistent for 1 and 2 to say "N intervals" while using nodes x_0, x_1, ..., x_{2N}; it should be 2N intervals in 1 and 2.

c. Minor things: 1 and 2 use g(x) while 3 uses f(x). 3 also has an error term, which can be ignored for now.

Example:

Suppose we use four subintervals (n = 2N = 4) to approximate the integral.

In 1 and 2, 2N = 4, so N = 2 and the term in square brackets is g(x_0) + 2g(x_2) + 4(g(x_1) + g(x_3)) + g(x_4).

In 3, n = 4, and the term in square brackets is f(x_0) + 2f(x_2) + 4(f(x_1) + f(x_3)) + f(x_4).

So they are the same!

My code implementation using Eq. 2 (with 2N):

N = 1000000
a, b = -10, 10
g = lambda x: 0.1*x**4 - 1.5*x**3 + 0.53*x**2 + 2*x + 1
intersim = [a + i*(b - a)/(2*N) for i in range(2*N + 1)]
appsim = (g(intersim[0]) + g(intersim[-1])
          + 4*sum(g(i) for i in intersim[1:-1:2])   # odd-index nodes
          + 2*sum(g(i) for i in intersim[2:-2:2])   # even interior nodes
          ) * (b - a) / (6*N)
appsim

The result will be 4373.333333333138
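
As a sanity check, the exact value can be recovered with scipy.integrate.quad (a minimal sketch; the integrand is the one above):

from scipy.integrate import quad

g = lambda x: 0.1*x**4 - 1.5*x**3 + 0.53*x**2 + 2*x + 1
print(quad(g, -10, 10)[0])  # ~4373.3333, matching the Simpson estimate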

Conclusion: Following the equation in the Jupyter notebook will yield the correct answer. It's the "N intervals" typo that confused me in class; changing it to 2N makes it agree with the textbook formula. Hope this helps.

Reference:

Numerical Analysis, 9th edition, by R. Burden and J. Faires

Numerical Integration by ACME at BYU

Numerical Integration notebook by Dr. Evans
