pml-book's Introduction

"Probabilistic machine learning": a book series by Kevin Murphy

 

Book 0: "Machine Learning: A Probabilistic Perspective" (2012)

See this link

Book 1: "Probabilistic Machine Learning: An Introduction" (2022)

See this link

Book 2: "Probabilistic Machine Learning: Advanced Topics" (2023)

See this link

pml-book's People

Contributors

colcarroll, mjsml, murphyk, patel-zeel, ringger

pml-book's Issues

Figure 1.7 mismatch between images, captions, and section text

Edition: Dec 31, 2020
Print pages: 9-10
PDF pages: 39-40

§1.2.2.2 says:

In Fig. 1.7(b), we see that using D = 2 results in a much better fit.

However, the Fig. 1.7(b) image is a polynomial of degree 14, or at least that's what the title above it says. Assuming the figure images are arranged as intended, I think the section text meant to refer to 1.7(a), which is of degree 2, and is not currently referenced by the text.

Also, the caption under Fig. 1.7 reads, in part:

(a-b) Polynomial of degrees 1 and 14

The polynomial in 1.7(a) is of degree 2 (and is labeled as such above the graph). I assume the caption should instead read:

(a-b) Polynomial of degrees 2 and 14

Another reason why Bayes is relevant in the “big data era” and "deep learning era"

In section 2.4.6, page 40

There is another reason why Bayes is still relevant.

I encountered this in my professional career: if the amount of data is huge (and the features aren't very predictive), Bayes can beat other forms of ML because it's much quicker to calculate the probabilities during evaluation. (This surprised me!)

This is also true for Bayes vs. deep learning, which can be painfully slow even with GPUs.
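(A minimal sketch of why evaluation is so cheap, assuming "Bayes" above means a naive Bayes classifier: prediction is just a lookup-and-sum of precomputed log-probabilities.)

  # Minimal sketch, assuming "Bayes" above means a naive Bayes classifier:
  # prediction is a lookup-and-sum of precomputed log-probabilities,
  # which is why evaluation is so cheap compared to a large neural net.
  import numpy as np

  rng = np.random.default_rng(0)
  n_classes, n_features = 3, 1000

  log_prior = np.log(np.full(n_classes, 1.0 / n_classes))                    # log p(c)
  log_lik = np.log(rng.dirichlet(np.ones(2), size=(n_classes, n_features)))  # log p(x_j = v | c), binary features

  x = rng.integers(0, 2, size=n_features)                                    # one test example
  scores = log_prior + log_lik[:, np.arange(n_features), x].sum(axis=1)      # log p(c) + sum_j log p(x_j | c)
  print(scores.argmax())                                                     # MAP class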

Should attribute Kant quote to his actual work (maybe)

On page 7, last paragraph, there is a quote from Kant. It would be better to also mention which work it came from. It seems that various sources (articles on Poker) claim that this quote is from:

Critique of Pure Reason

But it doesn't appear to be here: https://www.gutenberg.org/files/4280/4280-h/4280-h.htm

So, I don't know where it really comes from or if it is indeed something that Kant said. I've searched these works (from their online PDFs) and couldn't find it:

  • Critique of Pure Reason
  • Critique of Practical Reason
  • Metaphysical Foundations of Natural Science
  • Critique of Judgement

From December 31, 2020 edition.

Numerical application - mistakes

p. 113, Fig. 5.5: mistakes in the matrix values.
b = [-14, -6] (in your formalism you have + b^T \theta)
k(A) = 30.234 for A = [20, 5; 5, 2], and k(A) = 1.8541 for A = [20, 5; 5, 16]
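A quick numerical check of those values (assuming A denotes the 2x2 symmetric matrices from the figure, written row-wise):

  # Quick check of the condition numbers quoted above, assuming A denotes the
  # 2x2 symmetric matrices [[20, 5], [5, 2]] and [[20, 5], [5, 16]] from Fig 5.5.
  import numpy as np

  A1 = np.array([[20.0, 5.0], [5.0, 2.0]])
  A2 = np.array([[20.0, 5.0], [5.0, 16.0]])
  print(np.linalg.cond(A1))   # ~30.234
  print(np.linalg.cond(A2))   # ~1.854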

broken cross-ref in 17.5.7

"A second problem is that the magnitude of the fc’s scores are not calibrated with each other (see ??), so it is hard to compare them."

Minor typo (Release December 31, 2020)

Hi, there is a minor typo on page 753 (line 4) in the release as of December 31, 2020:

A basis $B$ is a set of lienarly...
Should be, of course, linearly :)

Thank you for uploading the book! It's incredible. How often are you going to reupload the corrected version of the PDF?

Release 2021-01-06 Typos

  • Chapter 2, Fig. 2.4, likelihod -> likelihood
  • Chapter 3, Eq. (3.10), you could comment that the name of this function is softplus
  • Chapter 3, the text right after Eq. (3.54), f_{\sigma} \in R^* -> f_{\sigma} \in R^+ (or R_+, in notation chapter, you use R^+, but Z_+)
  • Chapter 3, Page 56, this makes a good default choice in many cases -> This
  • Chapter 3, Fig. 3.8a. in the legend, Student(dof 1) -> Student (dof 1)
  • Chapter 3, Eq. (3.89) and (3.90), replace x by y
  • Chapter 3, the paragraph right above Eq. (3.92), the duplicate of 'may'
  • Chapter 3, page 67, last paragraph, p(y|z) = N(y|Wz, \Sigma_v) -> p(y|z) = N(y|Wz, \Sigma_y)
  • Chapter 3, page 71, the paragraph right after Eq. (3.122), c^m -> c^
  • Chapter 3, page 76, Eq. (3.132), p(y_4| y_3, 2 , y_1) -> p(y_4|y_3, y_2, y_1)
  • Chapter 5, Fig. 5.1, Caption: a some saddle points -> some saddle points
  • Chapter 5, the paragraph right above Eq. (5.41): the current iterate -> the current iteration
  • Chapter 6, page 169, the paragraph right before Sec. 6.3.9: similarlt -> similarly
  • Chapter 7, the paragraph right above Eq. (7.4): regionm -> region
  • Chapter 7, the sentence right above Eq. (7.7): z = f(x) = y^2 -> z = f(y) = y^2
  • Chapter 7, Eq. (7.12) lacking '=' after Bin(y|N, \theta)
  • Chapter 7, "summarizing the posterior" section: everythiing -> everything
  • Chapter 7, Eq. (7.43), Beta(\theta| 30, 20) -> Beta(\theta | 50, 20)
  • Chapter 8, Eq. (8.31), lack open '(' in E[h-a)^2|x]
  • Chapter 9, Eq. (9.1), denominator: y = c -> y = c'
  • Appendix: B.4.2.2: the sentence right above Eq. (B.59) sith -> with
  • Appendix: C.5.2, Eq. (C.166), lacking '=' between V V
  • Appendix: C.6, the sentence right below Eq. (C.192), q_1 and q_2 span the space of { a_1, a_2} -> q_1 and q_2 span the space of a_2
  • Appendix: C.7.2, the first sentence, 'indepoendent' -> independent
  • Appendix: D.5.5, Eq. (D.71), p(x_1=j)p(x_2 = j-k) -> p(x_1=k)p(x_2 = j-k)
  • Appendix: E.2.3.2, Eq. (E.21), \partial l -> \partial l^2, Eq. (E.23), 1/(2v^2) -> -1/(2v^2)

Local minima problem isn't mentioned in Exploration-exploitation tradeoff

Page 14, section 1.4.1

The problem of local maxima isn't mentioned.

E.g., in chess, a piece sacrifice could be the right first step to a winning combination, but if the reward function is weighted too heavily toward material advantage, the sacrifice path might never get explored. On the other hand, if the reward function is weighted too lightly toward material advantage, the RL agent would waste too much time exploring blunders.

log(p=p/1 - p) is a bit unclear

P47, section 3.1
In the last sentence between equations 3.16 and 3.17

log(p=p/1 - p)

is a bit unclear.

It's obviously p/(1-p), but it did give me pause. (Easy fix in LaTeX.)
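For reference, I believe the intended expression is the log-odds (logit), which in LaTeX is simply:

  \log\left(\frac{p}{1-p}\right)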

Wish: table of notations and symbols

Your book does seem to use standard notation and symbols, so I'm not having trouble.

However, utter newbies will have a harder time without a table of the notation and symbols used in your book.

Bayesian stuff

A few points about the Bayes material in sections 2.2 and 2.2.1, pp. 21-23:

  1. I think you should mention the frequentist approach to probability. One aspect is that Bayes will hazard an estimate for an event that has no prior examples, where a frequentist wouldn't dare. This would be something like election results.

  2. Sometimes, the terms in the Bayes formulas are estimates, so there can be a lot of uncertainty in the answer. E.g. how do you get an accurate value for the P(H=1)? This can be hard.

  3. It would be nice to show the numeric values substituted in for the equations 2.6 and 2.9.

  4. The thing that trips people up about Bayes is that differences in the prior probability can radically change the result. It would be good to recalculate the COVID example in 2.2.1 for different values of p(H=1). For instance, early in the COVID pandemic, p(H=1) might have been 0.01 or 0.001; then the probability that a positive test is a false positive becomes very large. This unintuitive result catches a lot of people by surprise. [added] I do see you show this in Exercise 2.1.

I calculate for p(H=1)=0.01 that the probability of infection drops to 26%. If p(H=1)=0.001 (the disease is very rare), then probability of infection drops to 3%.

This is true even though the test is highly accurate.
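Here is a sketch of that calculation, assuming the test characteristics from the book's example (sensitivity 0.875 and specificity 0.975; those values reproduce the 26% and 3% figures above):

  # Sketch of the prior-sensitivity calculation above, assuming the test
  # characteristics from the book's COVID example (sensitivity 0.875,
  # specificity 0.975); these reproduce the 26% and 3% figures.
  def posterior_infected(prior, sensitivity=0.875, specificity=0.975):
      p_pos_h1 = sensitivity          # p(Y=1 | H=1)
      p_pos_h0 = 1.0 - specificity    # p(Y=1 | H=0), the false-positive rate
      evidence = p_pos_h1 * prior + p_pos_h0 * (1.0 - prior)   # p(Y=1)
      return p_pos_h1 * prior / evidence                       # p(H=1 | Y=1)

  for prior in [0.1, 0.01, 0.001]:
      print(prior, round(posterior_infected(prior), 3))   # 0.795, 0.261, 0.034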

Add other problem with labeling is inaccurate labeling

page 11, section 1.3, 3rd paragraph, the text says:

"need to collect large labeled datasets for training, which can often be time consuming and expensive".

From my experience, a third, underappreciated factor is the accuracy of the labels. This is especially true if managers cut corners by rushing the labeling and/or using cheap, inexperienced labor. Labeling inaccuracy is often not easily estimated. It is too easy to be blindsided by label inaccuracy and end up with a poor model without realizing it until it's too late.

This problem can be magnified if some classes are vastly underrepresented.

Release 2020-12-31 Typo

print page 392 pdf 422 "convolutional neural networks (CNN), which are designed to work with variable-sized and images;"

print 393 pdf 423 "However, suppose we replacing replace the Heaviside function"

print 400 pdf 430 "it can be used to model financial data, we as well as the global temperature of the earth"

Typo KL Divergence (Release 31-12-2020)

Page 161
Equation 6.42

The definition for the forwards KL divergence in equation 6.42 on page 161 shows the reverse KL divergence (which is shown in equation 6.43).
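For reference, with p the true distribution and q the approximation, the standard definitions (not quoting the book's notation) are:

  D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}   \quad \text{(forwards KL)}

  D_{KL}(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)}   \quad \text{(reverse KL)}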

Thanks for releasing the updated book :)

Entropy is also related to information

Section 6.1, page 153 (183 of the PDF), Draft 2020-01-03

says:

The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack
of predictability, associated with a random variable drawn from a given distribution, as we discuss
below.

For myself, I am always amused by the irony that this lack of predictability (entropy) is also related to information. That is, something that is completely predictable has no information.

Since the chapter is entitled "Information Theory", I think how entropy is related to information deserves an explanation.

In fact, even though the chapter is about Information Theory, it doesn't say much about what information is. This will puzzle the newbie, who will go by the popular definition of "information" rather than the very precise one you intend.

(perhaps it's better described in your second volume?)
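A tiny illustration of the connection I mean (entropy as expected self-information, so fully predictable means zero bits):

  # Tiny illustration: entropy is the expected self-information -log2 p(x),
  # so a fully predictable (deterministic) distribution carries 0 bits and
  # a uniform distribution over 4 symbols carries 2 bits.
  import numpy as np

  def entropy_bits(p):
      p = np.asarray(p, dtype=float)
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0.0 -> no surprise, no information
  print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 -> maximally unpredictable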

In 1.3, (unsupervised) doesn't discuss the problem of guessing how many clusters there are

Section 1.3, pp. 11-13, does not discuss the problem of guessing how many clusters there are.

"then we might want to split the top right into (at least) two subclusters." (p 12, section 1.3.1).

This is (obviously) a hard problem. Maybe it's mentioned somewhere else in the book (so there should be a reference to that), but I haven't finished reading this very fine book yet!

12/30/2020 version.

Typo in Appendix

Book Name: Probabilistic Machine Learning: An Introduction
Book date stamp: 2020-12-28
pdf page number: 749
print page number: 719

In section A.2.1 first sentence, last word should be $\mathcal{X}$ instead of $\mathcal{Y}$.

"""
A.2.1 Functions
A function f: X->Y ... for each x \in Y.
"""

Add "feature engineering" to "feature preprocessing"

Page 9, section 1.2.2.2

"This is a simple example of feature preprocessing". I think it'w worthwhile mention that it's also called "feature engineering", which isn't mention anywhere in the book.

Also add the phrase "feature engineering" to the index.

Some typos in Chapter 1 of Book 1

  • Page 8: the line right above Eq. (1.12): If we have "multiple inputs" => "multiple features"
  • Page 9: Eq. (1.15): should not include w^T
  • Page 11: paragraph 2: "That is, we just get observed outputs D = { y_n: n = 1 : N } without any corresponding inputs x_n." => "That is, we just get observed inputs D = { x_n: n = 1 : N } without any corresponding outputs y_n."

Constrained optimization section

Not sure what version, but downloaded and printed around 1st Feb. Caveat: I didn't know much about this topic, so some suggestions are more to do with my understanding than anything being wrong. Some are really pedantic as well. Change or ignore as you see fit.

Section 5.5 boolean --> Boolean
Section 5.5.1.1 Here, you are minimizing \theta_1^2+\theta_2^2-1 but in the figure just \theta_1^2+\theta_2^2. I know the solution is the same, but it's nonetheless inconsistent.
Section 5.5.3 The description of how to convert to standard form could use some extra work. It's not made explicit where \mathbf{A} comes from (presumably an aggregation of equations 5.106 and 5.107, but it wouldn't do any harm to say that). Plus you do not explain how to make $\boldsymbol{\theta} \geq \mathbf{0}$ or why this is important.
Section 5.5.3.1 In the worse case --> In the worst-case scenario
Section 5.5.3.1 "There are various..." This sentence is a bit of a non-sequitur and should probably be connected to the previous sentence to identify that you are saying that there do exist methods that are more efficient than the Simplex method.
Section 5.5.4 "From the geometry of the problem..." Well, this is true, but what about when we are in 100 dimensional space and the geometry is not obvoius?
Section 5.5.6 It took me a while to figure out that NLL meant negative log likelihood having just dropped into the book here.
Section 5.6 It wasn't clear to me why we scale by \eta
Equation 5.120 The function f() is not defined. Possibly you mean \mathcal{L}? The $z$ that you are argmin-ing over is also not defined.
Figure 5.15 I might be mistaken, but I think you are maximizing the function in this figure vs. minimizing in Figure 5.14, which is a bit confusing.
Equation 5.131 I did not understand why there is a factor of a 1/2 on the RHS
Equation 5.132 I think the first case should be \theta-\lambda. Something is wrong anyway, as this is not symmetric (see the soft-thresholding sketch after this list).
Section 5.6.3 I thought the discussion of the straight-through estimator seemed a little out of place and broke the flow. Consider moving to end of section, shortening or dropping completely.
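Regarding the Eq. (5.132) item above: assuming that equation is the soft-thresholding operator (the proximal operator of \lambda|\theta|), here is a sketch of the symmetric form I would expect:

  # Sketch of the (symmetric) soft-thresholding operator, assuming Eq. (5.132)
  # is the proximal operator of lambda * |theta|; this is why I expect the
  # first case to be theta - lambda.
  import numpy as np

  def soft_threshold(theta, lam):
      # theta - lam  if theta >=  lam
      # 0            if |theta| <= lam
      # theta + lam  if theta <= -lam
      return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

  print(soft_threshold(np.array([-3.0, 0.5, 3.0]), 1.0))   # [-2.  0.  2.]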

Realistically, I probably can't read the whole document with this level of detail, but if you know there are sections that are not well read (perhaps near the end of the book or new parts that you have added) then send me a message and I'll try to find time to focus on these.

Add ref to KL Divergence

p84, 114 of PDF, 2020-01-03 Draft, last sentence of the page:

minimizes the KL divergence

should have a reference to 8.1.6.1, p 241 (271 of PDF)

Also add the ref to KL divergence on p84 to the index

XGBoost may support Categorical variables directly some day

on page 575, it says: "XGBoost assumes the user has preprocessed them into one-hot vectors"

This may change by the time you go to print. Categorical variables are being experimented with:

https://github.com/dmlc/xgboost/releases/tag/v1.3.0

Experimental support for direct splits with categorical features
Currently, XGBoost requires users to one-hot-encode categorical variables. This has adverse performance implications, as the creation of many dummy variables results into higher memory consumption and may require fitting deeper trees to achieve equivalent model accuracy.
The 1.3.0 release of XGBoost contains an experimental support for direct handling of categorical variables in test nodes. Each test node will have the condition of form feature_value \in match_set, where the match_set on the right hand side contains one or more matching categories. The matching categories in match_set represent the condition for traversing to the right child node. Currently, XGBoost will only generate categorical splits with only a single matching category ("one-vs-rest split"). In a future release, we plan to remove this restriction and produce splits with multiple matching categories in match_set.
The categorical split requires the use of JSON model serialization. The legacy binary serialization method cannot be used to save (persist) models with categorical splits.
Note. This feature is currently highly experimental. Use it at your own risk. See the detailed list of limitations at #5949.

In addition, the user doesn't have to explicitly use one-hot encoding for XGBoost (at least with the H2O.ai version). It gets converted behind the scenes:

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html

categorical_encoding: Specify one of the following encoding schemes for handling categorical features:

auto or AUTO: Allow the algorithm to decide. In XGBoost, the algorithm will automatically perform one_hot_internal encoding. (default)

one_hot_internal or OneHotInternal: On the fly N+1 new cols for categorical features with N levels

one_hot_explicit or OneHotExplicit: N+1 new columns for categorical features with N levels

binary or Binary: No more than 32 columns per categorical feature

label_encoder or LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.)

sort_by_response or SortByResponse: Reorders the levels by the mean response (for example, the level with lowest response -> 0, the level with second-lowest response -> 1, etc.). This is useful, for example, when you have more levels than nbins_cats, and where the top level splits now have a chance at separating the data with a split.

enum_limited or EnumLimited: Automatically reduce categorical levels to the most prevalent ones during training and only keep the T (10) most frequent levels, and then internally do one hot encoding in the case of XGBoost.

I haven't gotten through your book yet, so I'm not sure if you mention one of the problems with one-hot encoding: it causes an explosion of features, which can effectively dilute the data, because the information that the one-hot encoded values are mutually exclusive is not "seen" by many (all?) algorithms.
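A small illustration of that explosion (hypothetical column names and data):

  # Small illustration of the feature explosion from one-hot encoding
  # (hypothetical column names and data). Note that the dummies within one
  # group sum to 1 -- a mutual-exclusivity constraint most algorithms ignore.
  import pandas as pd

  df = pd.DataFrame({
      "zip_code": ["94103", "10001", "60601", "94103"],     # high-cardinality categorical
      "browser":  ["chrome", "firefox", "safari", "chrome"],
      "clicks":   [3, 1, 4, 2],
  })
  encoded = pd.get_dummies(df, columns=["zip_code", "browser"])
  print(df.shape, "->", encoded.shape)   # (4, 3) -> (4, 7); real data can blow up far more
  print(list(encoded.columns))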

Confusing statement about posterior probability (release 2021-12-31)

Hello,

(Thankyou for the great resource)

I would like to point out the statement about the posterior distribution on page 22 of the print version (just before the COVID example). Since Bayes and the posterior are very important concepts for the whole book, this might be confusing for some people (like me, who am not so clever).

It says

Multiplying the prior by the likelihood for each value of H, and then normalizing so the result
sums to one, gives the posterior distribution p(H = h|Y = y); this represents our new belief
state about the possible values of H

I may be completely wrong, but I am not sure how this is true. We multiply the likelihood by the prior for each value of H to get the marginal likelihood (the denominator, p(Y=y)) instead, and then divide by this marginal likelihood to get the posterior p(H = h|Y = y). We don't "multiply by each value of H in the numerator (likelihood * prior) and then normalize".

Multiplying the prior by the "likelihood for each value of H", and then normalizing so the result
sums to one gives us

I mean, for some specific h, to get the posterior for this h, i.e. p(H = h|Y = y), we do not need to multiply over all values of H before normalising, am I right?
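For what it is worth, here is how I read the book's sentence (a sketch with made-up numbers): prior times likelihood is computed for every value of H, and normalizing that whole vector is the same thing as dividing each entry by the marginal likelihood.

  # Sketch of how I read the book's sentence (made-up numbers): compute
  # prior * likelihood for EVERY value of H, then normalize the whole vector.
  # Dividing by the sum is exactly dividing by the marginal likelihood p(Y=y).
  import numpy as np

  prior      = np.array([0.9, 0.1])      # p(H=h) for h = 0, 1
  likelihood = np.array([0.025, 0.875])  # p(Y=1 | H=h) for h = 0, 1

  unnorm = prior * likelihood            # one entry per value of H
  posterior = unnorm / unnorm.sum()      # unnorm.sum() is p(Y=1)
  print(posterior)                       # p(H=h | Y=1) -> [0.205 0.795]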

Release 2021-01-03 Typos

Print Page 589, PDF Page 619: "A natural approach to transfer learning meta-learning"

Print Page 590, PDF Page 620: "N-way K-shot classification, in which the system is expected to learn to classify K classes using just N training examples of each class."
Based on figure 19.7 and the next example, I think it should instead read "learn to classify N classes using just K training examples of each class".

Give a nod to Gauss in 1.2.2.1

It can be argued that Gauss was the first person to do machine learning, since he developed least squares to predict the orbit of the asteroid Ceres:

https://en.wikipedia.org/wiki/Least_squares#The_method

I think it would be nice to add a historical footnote about this achievement.

It may be useful to mention that LS has the advantage of a closed-form solution (because the quadratic objective is differentiable), whereas L1 does not, which is why LS has been favored for so long despite having trouble with outliers.
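For reference, the closed-form solution alluded to above is the usual normal-equations result (standard notation, not necessarily the book's):

  \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2
                   = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}

whereas the L1 objective \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_1 has no comparable closed form.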

Pls clarify that the aligned DNA sequences are horizontal and other issues

Fig 6.2a p154, pdf p184 Draft 2020-01-03

People who don't know much about DNA may not realize that the bases are sequential horizontally and that the columns are meant to be compared for similarity. To clarify, I suggest changing (1st paragraph, 2nd sentence)

(e.g., from different species)

to

(e.g., each row is a sequence from a different species)

In addition, it would help if you had the position numbers at the bottom of the figure (but I'm not sure how to squeeze in 10, 11, ...)

a t a g c c g g t a c g g c a
t t a g c t g c a a c c g c a
t c a g c c a c t a g a g c a
a t a a c c g c g a c c g c a
t t a g c c g c t a a g g t a
t a a g c c t c g t a c g t a
t t a g c c g t t a c g g c c
a t a t c c g g t a c a g t a
a t a g c a g g t a c c g a a
a c a t c c g t g a c g g a a
1 2 3 4 5 6 7 8 9

You might even want to color the letters in 6.2a to match the colors in 6.2b

I confess I was confused by the text for Fig 6.2b:

The overall vertical axis represents the information content of that location measured in bits (i.e., using log base 2). Deterministic distributions (with an entropy of 0) have height 2, and uniform distributions (with an entropy of 2) have height 0

I thought that an entropy of 0 meant no information (completely deterministic), but you have the height (bits) being 2 (e.g. position 3 is always A). On the other hand, I do understand that a highly conserved position (i.e. one that doesn't change, so that the function of the motif is preserved through evolution) provides "information" about the importance of that position in the sequence. E.g. positions 3, 5, and 13 are critical (no variation), and positions 10 and 15 are highly important (mostly the same base). I do see that this is explained a bit more on the next page (155), first paragraph, where you mention that the height of the bar is 2 - H_t. Maybe the confusion arises because information content isn't really defined in this chapter (as far as I can see).
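Here is a sketch of the height calculation as I understand it (per-column 2 - H_t, using the alignment shown above):

  # Sketch of the per-column bar height 2 - H_t, as I understand Fig 6.2b,
  # using the 10-sequence alignment shown above.
  import numpy as np

  seqs = [
      "atagccggtacggca", "ttagctgcaaccgca", "tcagccactagagca",
      "ataaccgcgaccgca", "ttagccgctaaggta", "taagcctcgtacgta",
      "ttagccgttacggcc", "atatccggtacagta", "atagcaggtaccgaa",
      "acatccgtgacggaa",
  ]
  for t, col in enumerate(zip(*seqs), start=1):
      p = np.array([col.count(b) for b in "acgt"], dtype=float)
      p /= p.sum()
      H = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy of column t, in bits
      print(t, round(2 - H, 2))                   # conserved columns ~2, uniform columns ~0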

Finally, I note that fig 6.2b is not color blind friendly. Prof. Jerome Friedman redid the figures in later printings of Elements of Statistical Learning to account for that.
