
thinkbayes2's Introduction

Think Bayes 2

by Allen B. Downey

The HTML version of this book is here.

Think Bayes is an introduction to Bayesian statistics using computational methods.

Print and electronic versions of this book are available from Bookshop.org, Amazon, and O'Reilly Media.

For each chapter, there is a Jupyter notebook, below, where you can read the text, run the examples, and work on the exercises.

You can read the free version of the book by following the links on the left.

If you are looking for solutions to the exercises, follow the links on the left.

Think Bayes is a Free Book. It is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), which means that you are free to copy and modify it, as long as you attribute the work and don’t use it for commercial purposes.

Other Free Books by Allen Downey are available from Green Tea Press.

Run the notebooks

Download the notebooks as a Zip file

Or use these links to run the notebooks on Colab:

thinkbayes2's People

Contributors

allendowney, crkrenn, danhphan, dmvianna, hanbyul-kim, jdblischak, nickdgardner


thinkbayes2's Issues

Chapter 10 Solutions - Definition / Implementation mismatch for the logistic

In the final exercise for Chapter 10, there is a slight mismatch between the definition of the 3-parameter logistic function and its implementation. In the definition, we see that the parameter a is multiplying the other term in the exponent of the exponential.

[screenshot of the original definition]

However, in the implementation in the next cell, we see that the variable a is dividing the rest of the terms (in the assignment to x).

[screenshot of the original code]

The Wikipedia link also uses the convention of having the parameter a multiply the terms. I believe that the easiest solution would be to change the value of a to 1/100 and change it to multiplication in the definition of x.
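To make the mismatch concrete, here is a minimal sketch of the two conventions (the parameter names and the value a = 1/100 follow the issue text; this is not the notebook's exact code):

```python
import numpy as np

def logistic3(t, a, b, c):
    """Three-parameter logistic; `a` multiplies the shifted input,
    matching the written definition."""
    x = a * (t - b)
    return c / (1 + np.exp(-x))

def logistic3_impl(t, a, b, c):
    """Variant matching the implementation, where `a` divides."""
    x = (t - b) / a
    return c / (1 + np.exp(-x))

t = np.linspace(-300, 300, 7)
# with a = 1/100 in the first form and a = 100 in the second,
# the two conventions produce identical curves
out1 = logistic3(t, 1 / 100, 0.0, 1.0)
out2 = logistic3_impl(t, 100.0, 0.0, 1.0)
```

So either convention works, as long as the value of a is adjusted to match.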

Key result in chapter 10 sensitive to jittering

This issue pertains to Chapter 10 and its source code in variability.py, which estimates distributions for the mean and standard deviation of male and female heights, then uses the distributions to compute distributions for the coefficient of variation for males and females. A key result seems to be that the coefficient of variation for females is higher than that of males. However, if you remove the jittering that gets applied to the original heights, this result seems to be reversed.

variability.py line 462 applies "jittering" to the list of heights.
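For context, I understand the jittering to be something like the following sketch (hypothetical data and seed; not the exact code in variability.py):

```python
import numpy as np

rng = np.random.default_rng(17)  # seed chosen arbitrarily

# hypothetical heights recorded to the nearest centimeter
heights = np.array([170.0, 172.0, 168.0, 181.0, 175.0])

# jittering spreads each rounded value uniformly over its 1-cm bin,
# turning the discretized data back into a continuous-looking sample
jittered = heights + rng.uniform(-0.5, 0.5, size=len(heights))
```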

I also modified line 266 to print the label for the posterior mean being printed.

If you run the script with jittering, you see that the coefficient of variation for females is greater than that of males, which matches the book's result.

$ python variability.py
...
female CV posterior mean 0.04379422911488041
male CV posterior mean 0.04151490569938492
...
female bigger 1.0000000000000628
male bigger 0

The resulting plot also matches the book's.


Now if you comment out line 462 (the jittering) and re-run the script, you see that the mean coefficient of variation is non-negligibly higher for males.

$ python variability.py
...
male CV posterior mean 0.042135070189436574
female CV posterior mean 0.039877437544664336
...
female bigger 0
male bigger 1.0000000000000615

The resulting plot reflects this result.

My instinct is to trust the second result, as it uses the data in its raw form. Still, it would be nice to understand how this simple jittering can cause such a drastic difference in the coefficient of variation.

I'll post back if I can think of any solution or explanation to this problem.

Lincoln Index - Three parameter model

I am receiving the following error when I attempt to compute the likelihood in the three parameter model:
for N, p0, p1 in joint3_pmf.index:

ValueError: not enough values to unpack (expected 3, got 2)

I think the issue is that when I attempt to use make_joint with p1 and the joint2_pmf, I do not get the triplet, but rather only the pair, meaning p1 is not being attached, and thus the MultiIndex has only 2 levels instead of three.

Do you have any suggestions about how best to fix this?

Thank you
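As a point of comparison, a three-level joint index can be built directly with pandas; if the joint Pmf's index were constructed this way, the unpacking would work (the parameter grids below are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical parameter grids for N, p0, p1
Ns = np.arange(32, 35)
p0s = np.linspace(0.2, 0.4, 3)
p1s = np.linspace(0.2, 0.4, 3)

# build a three-level MultiIndex and a uniform joint distribution over it
index = pd.MultiIndex.from_product([Ns, p0s, p1s], names=['N', 'p0', 'p1'])
joint3 = pd.Series(1.0, index=index)
joint3 /= joint3.sum()

# each index entry now unpacks into a triplet
for N, p0, p1 in joint3.index:
    pass
```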

Why are we subtracting 1 from each bowl (Chapter 2 Exercise: cookie problem)

In the likelihood function, the data (i.e. the cookie flavor) is being subtracted from both bowl 1 and bowl 2 to obtain the new bowl/pmf data. I thought that after the first draw, there would be two scenarios.
1). The cookie was drawn from bowl 1, in which case, the data would only be subtracted from bowl 1 values.
2). The cookie was drawn from bowl 2, in which case, the data would only be subtracted from bowl 2 values.

These instances would then result in two different PMFs, right?
1).
Hist({'vanilla': 29, 'chocolate': 10})
Hist({'vanilla': 20, 'chocolate': 20})

2).
Hist({'vanilla': 30, 'chocolate': 10})
Hist({'vanilla': 19, 'chocolate': 20})

Links in the book should be collected together online

The book uses shortened URLs, which in aggregate waste readers' time. I did not see, online or in the book, a link to a bibliography, so I would suggest using this repo's wiki for that purpose, and I ask others to help contribute.

This would also help identify references in the book that are out of date or less explicit than intended.

M&M problem in Chapter 2: Bayes' theorem

After thinking through and solving the problem successfully, I looked at the solution.
A subtle (and maybe even trivial) change in defining the hypotheses and data could be:

Hypotheses A: one M&M from 94, another from 96
Hypotheses B: one from 96, another from 94
Given: one is yellow and one is green

Here, the formulation of the hypotheses is based on the information in the problem (one M&M from each bag; it could be of any color).
The data (the "given") is yellow from one bag and green from the other, which changes the likelihood based on the colors.

If we state the hypothesis as "green from 1994 and yellow from 1996", it seems like the hypothesis itself is defined based on the data :). It took me a while to figure that out.
Is my understanding correct here?
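With the hypotheses stated this way, the update works out as below (the mix fractions are the ones given in the book: yellow is 20% in 1994 and 14% in 1996; green is 10% in 1994 and 20% in 1996):

```python
from fractions import Fraction

# P(data | A): yellow from the 1994 bag and green from the 1996 bag
like_A = Fraction(20, 100) * Fraction(20, 100)
# P(data | B): yellow from the 1996 bag and green from the 1994 bag
like_B = Fraction(14, 100) * Fraction(10, 100)

# equal priors, so the posterior is just the normalized likelihood
post_A = like_A / (like_A + like_B)
print(post_A)  # 20/27
```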

Two 'utils.py' files in /code directory

Hello
I am going through the notebook for Chapter 1, and there seems to be an issue with importing write_table:
from utils import write_table
write_table(table, 'table01-01')

produces an ImportError. From what I can tell, it's because /code has two files named utils.py: one in /code, the other in /code/soln/. The function write_table is defined in the latter, but the import never checks there, because Python finds the utils.py in /code first, the same directory as all the notebooks. Could you please take a look?
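The behavior is ordinary Python import resolution: the first matching entry on sys.path wins. A self-contained demonstration with throwaway module names (not the repo's actual files):

```python
import os
import sys
import tempfile

# create two directories that both define a module named utils_demo
tmp = tempfile.mkdtemp()
outer = os.path.join(tmp, 'outer')   # stands in for /code
inner = os.path.join(tmp, 'inner')   # stands in for /code/soln
os.makedirs(outer)
os.makedirs(inner)

with open(os.path.join(outer, 'utils_demo.py'), 'w') as f:
    f.write("WHO = 'outer'\n")
with open(os.path.join(inner, 'utils_demo.py'), 'w') as f:
    f.write("WHO = 'inner'\ndef write_table(*args): pass\n")

sys.path.insert(0, outer)
import utils_demo
print(utils_demo.WHO)  # 'outer' -- the inner copy is shadowed

# workaround: put the directory with the wanted copy first and re-import
del sys.modules['utils_demo']
sys.path.insert(0, inner)
import utils_demo
print(utils_demo.WHO)  # 'inner'
```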

Chapter 5 Prison Sentences

@AllenDowney, thanks for your book!

I believe that the answer provided for the average of the time remaining for prison sentences is incorrect.

I think the correct answer is to take the average of the distribution of remaining prison sentence times. For a distribution of prison sentences among existing prisoners that is proportional to the sentence, i.e. $P(y) \propto y$, the distribution of time remaining, $x$, is
$P(x) = \sum_{y=x}^{3} P(y)/y$. In other words, the fraction of prisoners with one year (or less) to serve of their sentence is $P(x=1) = P(y=1) + P(y=2)/2 + P(y=3)/3$; between 1-2 years left is $P(x=2) = P(y=2)/2 + P(y=3)/3$; and between 2-3 years left is $P(y=3)/3$.

Thus the distribution for time remaining is 1/2, 1/3, and 1/6, respectively (which is the reverse of the distribution of sentences). The mean of this distribution, $\sum_x P(x)\,x$, comes out to 5/3 or 1.67.

What is the reasoning for taking 1/2 of the mean of the sentence distribution, which comes out to 7/6 or 1.17?
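The calculation above can be checked with exact fractions:

```python
from fractions import Fraction

# length-biased distribution of sentences among current prisoners
# (proportional to the sentence length y)
p_y = {1: Fraction(1, 6), 2: Fraction(2, 6), 3: Fraction(3, 6)}

# a prisoner with sentence y is equally likely to be at any point in it,
# so each remaining-time bucket 1..y receives P(y)/y
p_x = {x: sum(p / y for y, p in p_y.items() if y >= x) for x in (1, 2, 3)}
print(p_x)   # {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}

mean = sum(x * p for x, p in p_x.items())
print(mean)  # Fraction(5, 3)
```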

Book and code out of sync in Chapter 7

For example, in Chapter 7, the prior (Gaussian) distribution has a mean of 2.8 (p. 65). However, on p. 66, the Python code and subsequent sentence uses a mean of 2.7. In the end, Figure 7-1 matches if one uses the prior Gaussian mean of 2.8.

TypeError: '<' not supported between instances of 'Pmf' and 'Pmf'

When running redline.py's RunMixProcess() the following error is shown:

CI z (3.1, 12.483333333333333)
Writing redline0.pdf
Mean z 7.775313741407206
Mean zb 8.896772000693552
Mean y 4.448386000346783
Writing redline2.pdf
Writing redline3.pdf
20 0.8964230763420535 2
14 3.9218230371662175 7
20 2.918634295466948 9
25 1.0969132651473539 2
2 2.9606902100440924 5
25 11.794483884166665
Average arrival rate 2.119634928117617
Mean posterior lambda 0.03674032163607093
Writing redline1.pdf


TypeError Traceback (most recent call last)
in
----> 1 RunMixProcess(OBSERVED_GAP_TIMES)

in RunMixProcess(gap_times, lam, num_passengers, plot)
34
35 if plot:
---> 36 wme.MakePlot()
37
38 return wme

in MakePlot(self, root)
27
28 # plot the MetaPmf
---> 29 for pmf, prob in sorted(self.metapmf.Items()):
30 cdf = pmf.MakeCdf().Scale(1.0/60)
31 width = 2/math.log(-math.log(prob))

TypeError: '<' not supported between instances of 'Pmf' and 'Pmf'

Using Python version 3.7.3
Anaconda Version :

"fn": "anaconda-2019.03-py37_0.tar.bz2",
"installed_by": "Anaconda3-2019.03-Windows-x86_64.exe"

OS Windows 10

Thanks.
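For what it's worth, a possible workaround (my assumption, not the author's fix): in Python 3, tuple comparison starts with the first element, so sorting (pmf, prob) pairs tries to compare the Pmf objects themselves, which no longer have a default ordering. Sorting on the probability alone avoids that:

```python
# stand-ins for Pmf objects, which define no ordering
class FakePmf:
    pass

items = [(FakePmf(), 0.5), (FakePmf(), 0.2), (FakePmf(), 0.9)]

# sorted(items) would raise TypeError: '<' not supported ...
# sorting on the probability (the second element) works fine
ordered = sorted(items, key=lambda pair: pair[1])
print([prob for _, prob in ordered])  # [0.2, 0.5, 0.9]
```

Applied to the traceback above, that would mean sorting self.metapmf.Items() with a key function rather than directly.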

The correctness of EvalExponentialPdf method in thinkbayes2.py

Dear professor Downey,

In chapter 7.6, you would like to compute the time between goals and the code in EvalExponentialPdf method is:

return lam*math.exp(-lam*x)

but I think the probability between two events is:

return math.exp(-lam*x)

The first term lam variable should be dropped.

Here comes my justification:
Assume T is the time interval between two goals. Pr(T > x) is the probability that no goal occurs in a time interval of length x, so we take k=0 in the Poisson distribution and get math.exp(-lam*x).

I know that EvalExponentialPdf is only called by MakeExponentialPmf, and the lam factor cancels out in Normalize, so the result stays the same.
But I still want to confirm the correctness of the method with you.
Thanks in advance!

Is the solution for cookie problem without replacement (cookie3.py) incomplete?

Hi!

First, let me thank you for creating all this, it is really a nice resource for developers like me to learn how to solve problems with bayesian methods.
Now to my question: I was surprised by the output of the code in cookie3.py. Am I right to think the code in that file is an incomplete solution to the "cookie problem without replacement"?
If it were a complete solution, I would have expected the code to track a growing set of hypotheses as cookies go away, to consider the scenarios in which the cookie removals could be evening out between the bowls, which the current code is not doing.

Say for example, that you eat 21 vanilla cookies ( no replacement ;-) )

for x in range(0, 21):
    suite.Update('vanilla')

Then cookie3.py will print out:

Hist({'vanilla': 9, 'chocolate': 10}) 1.0
Hist({'vanilla': 0, 'chocolate': 20}) 0.0

This concludes that there is no chance (0.0) that the 21st cookie came from bowl 2, when it could have been the case that 10 of those cookies came from bowl 1 and 11 from bowl 2, making the probability of getting a 21st vanilla cookie from bowl 2 nonzero.
Or maybe I missed some simplifying assumption in the text of the book?

Thanks

Exercise 1.2 in Chapter 1: formulation

Hi Allen,

I'm trying to solve Exercise 1.2 in Chapter 1.
I have trouble understanding what you mean by "You ask if either child is a girl and they say yes." (I'm not a native speaker, so this could be the reason I have a hard time understanding that sentence).
I have come up with a way of disambiguating this sentence for me as:
"They have Child 1 and Child 2. You ask if Child 1 is a girl and they say yes." or "They have Child 1 and Child 2. You ask if Child 2 is a girl and they say yes."
Is this the correct interpretation?

Thanks for your help.
Regards,
Florian

TypeError: Series.__init__() got an unexpected keyword argument 'normalize'

Hi, Professor. I am facing this problem when I am practicing probability mass functions with gss.hdf5. I am following exactly what you teach in the slides. May I know why this is happening? Thank you.

import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf
gss = pd.read_hdf('gss.hdf5','gss')
educ =gss['educ']
pmf_educ = Pmf(educ, normalize=False)

Traceback (most recent call last):
File "c:\Users\User\Desktop\Study\Data_Camp\courses\Exploratory Data Analysis in Python\Probability_mass_functions.py", line 30, in <module>
pmf_educ = Pmf(educ, normalize=False)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\empiricaldist\empiricaldist.py", line 45, in __init__
super().__init__(*args, **kwargs)
TypeError: Series.__init__() got an unexpected keyword argument 'normalize'

Chapter 2 - nomenclature adjustment

It's not a code issue...

Since we are applying diachronic Bayes, I think it is more readable to keep the letter H instead of B on page 21 (Bayes Tables):

I call the result unnorm because these values are the "unnormalized posteriors". Each of them is the product of a prior and a likelihood:

$$P(H_i)~P(D|H_i)$$

which is the numerator of Bayes's Theorem.
If we add them up, we have

$$P(H_1)~P(D|H_1) + P(H_2)~P(D|H_2)$$

which is the denominator of Bayes's Theorem, $P(D)$.

'Compile' book from code

This is not exactly an issue per se, just a question of mine.

Is it possible to create this book's PDF version from source? I'm asking because I tried (by running make all inside the book folder) but got too many errors, so I was wondering if there are some extra steps needed, or if I'm not supposed to be able to compile the book 😅

Chapter 20 Counting Cells Measurement Error Specification

I love your work and own a hard copy of Think Bayes 2! I know you got the cell counting example from another fantastic content creator. I have been pondering inferred vs. known error in my models and I think that the pymc cell counting model can be improved for posterity. Mainly, I think that there are a bunch of parameters that are inferred in pymc3 that should be fixed. For example:

with pm.Model() as model:
    yeast_conc = pm.Normal("yeast conc", 
                           mu=2 * billion, sd=0.4 * billion)

    shaker1_vol = pm.Normal("shaker1 vol", 
                               mu=9.0, sd=0.05)
    shaker2_vol = pm.Normal("shaker2 vol", 
                               mu=9.0, sd=0.05)
    shaker3_vol = pm.Normal("shaker3 vol", 
                               mu=9.0, sd=0.05)

should be:

import theano
shaker_vol_mu = theano.shared(9.0)
shaker_vol_sd = theano.shared(0.05)

with pm.Model() as model:
    yeast_conc = pm.Normal("yeast conc", 
                           mu=2 * billion, sd=0.4 * billion)

    shaker1_vol = pm.Normal("shaker1 vol", 
                               mu=shaker_vol_mu, sd=shaker_vol_sd)
    shaker2_vol = pm.Normal("shaker2 vol", 
                               mu=shaker_vol_mu, sd=shaker_vol_sd)
    shaker3_vol = pm.Normal("shaker3 vol", 
                               mu=shaker_vol_mu, sd=shaker_vol_sd)

I'm interpreting the mean and sd of yeast conc as prior values to be inferred, and the mean and sd of the shaker volumes as fixed values based on information from outside the model inference (e.g., the values listed on the package of a pipette or something). I think this is also consistent with the ABC portion of the analysis presented just after. Other lines of code that would be added include the following:

shaker_transfer_mu = theano.shared(1.0)
shaker_transfer_sd = theano.shared(0.01)

chamber_vol_mu = theano.shared(0.0001)
chamber_vol_sd = theano.shared(0.0001/20)

and the associated pm.Normal() calls would be updated accordingly.

My motivation for writing is the amount of time it took for me to figure out how and when to specify fixed vs. inferred parameter values in my models. Thanks!

Chapter 3 exercise on socks - question and answers not aligned

in the exercise on sock drawers, a single question is currently asked:
"What is the probability that the socks are white"

Oddly two answer spaces are given and neither of the given answers seem to match the question.

I would cautiously propose that there is a 30% chance that the socks are white because a white pair occupies 1/8 of the "sock pair space" and matching pairs in total occupy 2/8 + 3/18

With a common denominator those fractions become 9/72 and 18/72 + 12/72,
which implies there is a 9 in 30 chance, or 3 in 10, of a matching pair being white. This bears out my code's answer.

hypos = ['WW', 'XX', 'XY']
probs = [1/8, (1/8)+(3/18), 1-(1/4)-(3/18)]
prior = Pmf(probs, hypos)
likelihood = [1, 1, 0]
posterior = prior * likelihood
posterior.normalize()
posterior['WW']

result: 0.30
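The same arithmetic with exact fractions, as a check of the numbers above:

```python
from fractions import Fraction

p_white_pair = Fraction(1, 8)                        # 9/72
p_matching_total = Fraction(2, 8) + Fraction(3, 18)  # 18/72 + 12/72 = 30/72

# posterior probability that a matching pair is white
posterior_white = p_white_pair / p_matching_total
print(posterior_white)  # 3/10
```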

How tall is A?

Greetings!

I have a few questions regarding ThinkBayes2/solutions/height_soln.ipynb

Why do we update the suite with 0? Why not 'A'?

'suite.Update(0)'

Also, the question asks:

Suppose I choose two residents of the U.S. at random. A is taller than B. How tall is A?

We know A is taller than B, yet in the posterior marginal distributions we see that the mean height of B is greater than A's.

How do we interpret the results? Should we say A is 164 cm tall, since the question asks how tall A is?

Thanks!

Typo

Hello, and thanks for your interesting book! I am trying to use the Python 3 version of the code and I found the following typo:

Shouldn't line 516 in "thinkplot.py" be:
xs = np.delete(xs, 0)
instead of:
xs = xp.delete(xs, 0)
?

If I am mistaken, what should the correct version be?

Oliver Problem - Chapter 6 - General Question

Brief Background

Working through this excellent book with a colleague and he noticed something interesting on the Oliver's Blood problem.

Original solution, assuming 60% of local pop have type O and 1% have type AB:

like1 = 0.01
like2 = 2 * 0.6 * 0.01

likelihood_ratio = like1 / like2
post_odds = 1 * like1 / like2
prob(post_odds) # 0.454545

The interesting point mentioned by my colleague was this:

  • we could try the problem with a range of different values of % of local population having type AB and still get the same final probability:
import numpy as np

def prob(o):
    return o / (o+1)
  
  
o = 0.6 # we fix type o proportion in local pop

# iterate over various ab vals
for ab in np.linspace(0.02, 0.40, 20):
  like1 = ab
  like2 = 2 * o * ab

  assert round(prob(like1 / like2),4) == 0.4545

Clearly the ab terms cancel one another out in the likelihood ratio.

Curiosity

My interpretation of this is that the information on percent of local population with type AB does not matter for this problem. Is there additional insight or context you could add to this? My colleague & I were both a bit surprised, but the math is clear enough.

Another thought:
Perhaps we have to assume the possibility that the proportion of O would shift as AB changes, which would indeed change the final probabilities (even though there are more blood types than O and AB)

  • As AB becomes more likely, and O becomes less likely, then the evidence is increased that Oliver is one of the people that left blood (in this problem).

Again, no bug found - just curious about any additional commentary on the above. We are very much enjoying the book and exercises.

Empiricaldist module

It seems a module is missing: "empiricaldist".
Where can I find it?
Thanks

Chapter 1 download data code isn't working

The original code seems to be outdated: it uses a different repository name (BiteSizeBayes) and a CSV file, whereas the current data is in a gss_bayes folder, which I don't know how to read or use (normally I use CSV files). Can anyone give guidance on how to read in this data (e.g. docs), or a pointer to where to find the original GSS data? Thanks in advance.

Here's the original code for context:

# Load the data file

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.csv')

discussion on ex 9.1

Hello!

DISCLAIMER: this is not an "issue", rather a discussion about ex 9.1 (as there is no "correct" solution)
[of course feel free to close the issue if it's inappropriate, I didn't find any other place where to discuss this]

Here is my worked out example and solution.

I think the interesting point in the book's example is that the assumption that "the opponent is like a rotating turret, equally likely to shoot in any direction", i.e. that theta has a uniform distribution, implies that x is not uniform by construction (x ~ tan(theta)). I guess the key point is that P(theta) = uniform implies P(x) ~ 1/StrafingSpeed.

In my solution, to go beyond that assumption, I simply introduced a non uniform distribution of theta, which I made dependent on alpha and beta (long story short: if the shooter is close to a wall, he/she is more likely to shoot away from the wall). In order to implement this in the solution, I made: P(x) ~ P(theta)/StrafingSpeed.

What did other people do? What is the form of P(x) in your more realistic case?

Question(s) about exercises in chapter 4

Hello!
I have some problems with the exercises from Chapter 4: I found a solution for ex. 4.1 (here it is), but it doesn't exactly match the official solution, and I can't understand the solution proposed for ex. 4.2.

ex. 4.1
Given this statement:

suppose there is a probability y that an actual heads is reported as tails, or actual tails reported as heads.

I assumed the following:

  • P(data=T|coin=H) = y, which implies P(data=H|coin=H) = 1-y
  • P(data=H|coin=T) = y, which implies P(data=T|coin=T) = 1-y

Given these probabilities I computed the new likelihood, but my result is swapped with respect to the proposed solution, i.e. I get this (the full solution is in the above link):
if data=='H':
    return (1-y)*x + y*(1-x)
else:
    return y*x + (1-y)*(1-x)

Did I misinterpret the assignment? Or what?

ex. 4.2
Here I tried to follow the same reasoning to compute the likelihood, but I had to give up since I was not getting anywhere. I looked at the proposed solution and I don't really understand it.

Given the text in the solution:

Each article has a quality Q, which is the probability of eliciting an upvote from a completely reliable redditor.
Each user has a reliability R, which is the probability of giving an upvote to an item with Q=1.

I would write it as follows in terms of probability:
Q = P(vote=UP|R=1)
R = P(vote=UP|Q=1)

Now of course the likelihood is (as stated in the solution):

The probability that a redditor with reliability R gives an upvote to an item with quality Q

but I can't understand the solution itself, i.e. I can't obtain it when writing it down in terms of probability:

P(vote=UP|R,Q) = ... ?

What am I missing? Can anyone please help?

PS: since I already wrote a lot, adding these few words won't hurt: I love this book, great work Allen!

Chapter 4 - Proper Prior Probability Initialisation

Hi developer,

The original prior probability prior = Pmf(1, hypos) in Ch. 4 is initialised with 1.

It would be better to initialise it as prior2 = Pmf(Fraction(1, len(hypos)), hypos), a uniform distribution whose probabilities sum to 1.

The results and the final plot are the same, as np.allclose(posterior, posterior2) confirms.

Also, for better understanding, the loop inside the function deserves a more detailed explanation.

For example:
The reason we can use a loop to multiply likelihoods is that each coin-flipping experiment is independent of the others. Here $\theta$ represents the probability of the coin landing heads, and $P(D_x|\theta)$ represents the $x$-th flip, where $D_x$ can be 'H' or 'T'. The loop computes $P(D_1|\theta) \times P(D_2|\theta) \times \dots \times P(D_n|\theta) = P(D|\theta)$, since the flips are independent of each other.
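The point that the two initialisations give the same posterior can be checked directly; normalization makes any constant prior factor irrelevant (the dataset below is hypothetical: 140 heads and 110 tails, a grid of 101 hypotheses):

```python
import numpy as np

hypos = np.linspace(0, 1, 101)
likelihood = hypos**140 * (1 - hypos)**110  # hypothetical coin-flip data

prior1 = np.ones(len(hypos))               # like Pmf(1, hypos)
prior2 = np.ones(len(hypos)) / len(hypos)  # uniform, sums to 1

post1 = prior1 * likelihood
post1 /= post1.sum()
post2 = prior2 * likelihood
post2 /= post2.sum()

print(np.allclose(post1, post2))  # True
```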

Cheers,
Yifan

Example 6-8 number of Outperforming Portfolios

First of all thank you for writing this book. It's been an informative read and the examples are fun and challenging.

That being said, I believe there is a typo in Example 6-8:
312 outperforming portfolios would mean there are 430.5 honest members of Congress. It is quite funny to say that there is a half-honest politician, but 313 outperforming portfolios does result in whole numbers of honest and dishonest members who add up to 538. So it makes more sense if the number of outperforming portfolios is 313.

Chapter 2 Bayes' Theorem

I believe the function update() should return the table, not "prob_data".

def update(table):
    """Compute the posterior probabilities."""
    table['unnorm'] = table['prior'] * table['likelihood']
    prob_data = table['unnorm'].sum()
    table['posterior'] = table['unnorm'] / prob_data
    return prob_data
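For reference, running the quoted function on the cookie-problem numbers from the chapter shows the in-place behavior: the caller still sees the posterior column in the table, while the return value is P(D), the normalizing constant:

```python
import pandas as pd

def update(table):
    """Compute the posterior probabilities; returns P(D)."""
    table['unnorm'] = table['prior'] * table['likelihood']
    prob_data = table['unnorm'].sum()
    table['posterior'] = table['unnorm'] / prob_data
    return prob_data

table = pd.DataFrame(index=['Bowl 1', 'Bowl 2'])
table['prior'] = [1/2, 1/2]
table['likelihood'] = [3/4, 1/2]  # probability of drawing vanilla

prob_data = update(table)
print(prob_data)                    # 0.625
print(table['posterior'].tolist())  # [0.6, 0.4]
```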

Errors with the Euro examples...

All the examples so far have worked as expected until chapter 4. In the example Euro.py I'm getting the following error message:

ValueError: Normalize: total probability is zero.

The error is generated because after processing 'H' * 140, the suite is almost entirely zeros (hypos 0 through 99 are 0.0, and hypo 100 is 7.504617540403191e-09), and then when I try to process 'T' * 110, the suite becomes all zeros. Thus, the total is zero.

I was able to fix this issue by changing this code

dataset = 'H' * 140 + 'T' * 110

to

dataset = 'H' * 140 + 'T' * 110
dataset = ''.join(random.sample(dataset, len(dataset)))

but I didn't get the same summary values as you did.

I'm also seeing this same error with Euro2.py.

Environment:
I have Anaconda 5.0 and VS2017 v15.4.1 installed. I'm running the examples from within VS, which runs the code in Anaconda 5.0.0 (global default).

Also, I'm getting the same error when I run it inside a Jupyter notebook:

dataset = 'H' * 140 + 'T' * 110
suite = Euro(range(0, 101))

for data in dataset:
    suite.Update(data)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a0973cb6b830> in <module>()
      3 
      4 for data in dataset:
----> 5     suite.Update(data)

C:\Dev\Repos\ThinkBayes2\code\thinkbayes2.py in Update(self, data)
   1410             like = self.Likelihood(data, hypo)
   1411             self.Mult(hypo, like)
-> 1412         return self.Normalize()
   1413 
   1414     def LogUpdate(self, data):

C:\Dev\Repos\ThinkBayes2\code\thinkbayes2.py in Normalize(self, fraction)
    532         total = self.Total()
    533         if total == 0:
--> 534             raise ValueError('Normalize: total probability is zero.')
    535 
    536         factor = fraction / total

ValueError: Normalize: total probability is zero.

Hope this helps.

Installation > Lots of ResolvePackageNotFound errors

With Conda 4.5.11, following the guidelines:
conda env create -f environment.yml

gives me more than 180 packages not found:
ResolvePackageNotFound:

  • bitarray==0.8.1=py36h14c3975_1
  • docutils==0.14=py36hb0f60f5_0
  • cairo==1.14.12=h7636065_2
  • urllib3==1.22=py36hbe7ace6_0 ...

Importing the yml file in Anaconda Navigator gives me the same issue.
The result is that no environment is created.

Python packages up to 3.7 are installed.

Small typo in solutions to exercise 3 in ch2

Hi Allen,

Thanks for this fantastic resource, it's helping me a lot!

I noticed a small typo in the solutions to the 3rd exercise in chapter 2

The solution currently reads:

If the car is behind Door 1, Monty would have opened Door 2
If the car is behind Door 2, Monty would always Door 3
If the car is behind Door 3, Monty would have opened Door 3

whereas it should read:

If the car is behind Door 1, Monty would have opened Door 2
If the car is behind Door 2, Monty would always open Door 3
If the car is behind Door 3, Monty would have opened Door 2

If you agree with the suggestion I will be happy to open a PR fixing it.

Have a nice day

Chapter 8 - unexpected overlapping curves

Working with the notebook of Chapter 8, I needed to replace the line

france = prior.copy()

with

from copy import copy
france = copy(prior)

or

france = prior.copy(deep=True)

for the copy method of the Pmf class to work properly.

Otherwise, the curves will overlap.

P.S.: Initially I thought copy was returning a reference, but the object ids are different. In my case:

id(prior): 139718265739216
id(france): 139718228051664

Chapter 1: caseid definition in the gss_bayes dataset

It's not an issue, but a comment.

In Chapter 1 - Fraction of Bankers, page 3:

caseid: Respondent id (which is the index of the table).

It would be clearer to define as:

caseid: Respondent id in each year of survey.

Because different respondents may share the same id across years, using it as the index of the table may lead one to think it's a primary key.

Chapter 6 Goblin Exercise

Hi,

I'm working through the book atm, and I think I spotted a slight inaccuracy in the phrasing of the Goblin HP problem in Chapter 6 (Odds & Addends):

The phrasing of the question is: "Suppose you are fighting a goblin and you have already inflicted 3 points of damage. What is your probability of defeating the goblin with your next successful attack?"

The solution subtracts a 1d6+3 distribution from a 2d6 distribution and looks at the likelihood of the remainder being < 0, which returns 0.5. However, given the phrasing "you have already inflicted 3 points of damage" and "what is your probability of defeating the goblin with your *next* successful attack?", am I right in assuming that the initial hit point distribution is necessarily greater than 3 (since the goblin is apparently still standing)?

Assuming I'm correct, would the hp distribution be more accurately described by

hp = make_die(6).add_dist(make_die(6))
hypos = hp.qs
impossible = hypos <= 3
hp.ps[impossible] = 0
hp.normalize()

Alternatively, phrased as likelihood over the whole array of hypos:

#prior hp
hp = make_die(6).add_dist(make_die(6))
hypos = hp.qs
#likelihood of the goblin standing after 3 dmg:
likelihood = [hypo > 3 for hypo in hypos]
#updating prior given data:
hp *= likelihood
hp.normalize()

This led me to the following solution:

damage = make_die(6).add_dist(3)
remainder = hp.sub_dist(damage)
remainder.le_dist(0)

which returns 0.45454545 (as opposed to 0.5 in the original phrasing of the solution)

Please let me know if I'm barking up the wrong tree here! It's possible I overthought things, but figured I'd post this here.

kidney.py not executing

Hello,

When executing kidney.py, I get many errors like the following when plots are saved:


TypeError Traceback (most recent call last)
in
18 rho = 0.0
19 sequences = calc.MakeSequences(100, rho, fit)
---> 20 PlotSequences(sequences)
21
22 calc.PlotBuckets()

in PlotSequences(sequences)
22 yticks=MakeTicks([0.2, 0.5, 1, 2, 5, 10, 20]),
23 ylabel='diameter (cm, log scale)',
---> 24 yscale='log')

~\Desktop\Python\BAYES\thinkplot\thinkplot.py in Save(root, formats, **options)
766 save_options[option] = options.pop(option)
767
--> 768 Config(**options)
769
770 if formats is None:

~\Desktop\Python\BAYES\thinkplot\thinkplot.py in Config(**options)
645 for name in names:
646 if name in options:
--> 647 getattr(plt, name)(options[name])
648
649 global LEGEND

~\Anaconda3\lib\site-packages\matplotlib\pyplot.py in yticks(ticks, labels, **kwargs)
1624 labels = ax.get_yticklabels()
1625 elif labels is None:
-> 1626 locs = ax.set_yticks(ticks)
1627 labels = ax.get_yticklabels()
1628 else:

~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in set_yticks(self, ticks, minor)
3724 Default is False.
3725 """
-> 3726 ret = self.yaxis.set_ticks(ticks, minor=minor)
3727 return ret
3728

~\Anaconda3\lib\site-packages\matplotlib\axis.py in set_ticks(self, ticks, minor)
1706 xleft, xright = self.get_view_interval()
1707 if xright > xleft:
-> 1708 self.set_view_interval(min(ticks), max(ticks))
1709 else:
1710 self.set_view_interval(max(ticks), min(ticks))

TypeError: '<' not supported between instances of 'str' and 'float'

What is considered to be the "data" in the M&M solution? (Ch 2: Bayes's Theorem)

The problem hint indicates the trick is defining the hypotheses and data carefully:

Hint: The trick to this question is to define the hypotheses and the data carefully.

The solution clearly states the "hypotheses":

# Hypotheses:
# A: yellow from 94, green from 96
# B: yellow from 96, green from 94

What is the "data" in this solution?


My guess is that the data is the color mixes in the bags (e.g. the 1994 bag has 30% brown, etc.; the 1996 bag has 24% blue, etc.).

With this data definition, the first likelihood would be represented by "the probability of the color mix given that yellow is from 94 and green is from 96". It's not clear to me why that probability would be 0.2 * 0.2. Maybe I need to think about it more, or maybe I've guessed wrong about the data definition.
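
To make the alternative reading concrete: if instead the data is the pair of observed draws (one yellow from one bag, one green from the other) and the color mixes are part of the model, then under hypothesis A the likelihood 0.2 * 0.2 would be P(yellow | 94) * P(green | 96). A minimal Bayes-table sketch, using the mixes from the book:

```python
# Color mixes from the book (model, not data, under this reading).
mix94 = {'yellow': 0.20, 'green': 0.10}
mix96 = {'yellow': 0.14, 'green': 0.20}

prior = {'A': 0.5, 'B': 0.5}  # A: yellow from 94; B: yellow from 96
likelihood = {
    'A': mix94['yellow'] * mix96['green'],  # 0.2 * 0.2
    'B': mix96['yellow'] * mix94['green'],  # 0.14 * 0.1
}

# Standard Bayes-table update: multiply, then normalize.
unnorm = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}
print(posterior['A'])  # ≈ 0.7407, i.e. 20/27
```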

include penalties in SAT problem

Hello!

After going through the two models of the SAT problem in Chapter 12 (in version 1.0.9 of the book), I was wondering whether it is possible to go a step further and include the penalties in the problem.

Here is what I did (full version is here, just look for "Scenario 3").

To me, the main points of the more advanced model proposed in Section 12.5 are:

  1. For a given efficacy and difficulty pair, the probability of giving the correct answer (ProbCorrect) is computed.
  2. ProbCorrect is used to compute the binary PMF for a single question with sat.BinaryPmf(...). Assuming the raw score is the same as the number of correct answers, the binary PMF has only the values 0 and 1.
  3. The PMFs for all the questions are summed together.

Starting from the above, I simply introduced a different function to compute the "binary" PMF:

def BinaryPmfWithPenalty(p, penalty=-0.25):
    """Score PMF for one question: 1 if correct, penalty if wrong."""
    pmf = thinkbayes2.Pmf()
    pmf.Set(1, p)
    pmf.Set(penalty, 1 - p)
    return pmf

I use this new function:

  • to recalibrate the difficulties
  • in PmfCorrect(...) to compute the distribution of correct answers given the efficacy

An additional modification was needed because the raw score is no longer the number of correct answers: the outcome of Exam.Reverse(...) is a number of correct answers, while the outcome of PmfCorrect(...) is now a raw score that includes the penalties (e.g. for an exam with 53 correct answers and 1 wrong answer, the raw score is 52.75). This mismatch is handled inside the Likelihood(...) method.

Did anyone try to include the penalties too? If yes, what did you do?
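
To make steps 2 and 3 concrete outside of thinkbayes2, here is a plain-Python sketch of the penalty PMF and the convolution, for two hypothetical identical questions with p = 0.5:

```python
from collections import defaultdict

def binary_pmf_with_penalty(p, penalty=-0.25):
    """Score PMF for one question: 1 if correct, `penalty` if wrong."""
    return {1: p, penalty: 1 - p}

def add_pmfs(pmf1, pmf2):
    """Distribution of the sum of two independent scores (convolution)."""
    result = defaultdict(float)
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            result[q1 + q2] += p1 * p2
    return dict(result)

one = binary_pmf_with_penalty(0.5)
two = add_pmfs(one, one)
print(two)  # {2: 0.25, 0.75: 0.5, -0.5: 0.25}
```

Note that the raw scores are no longer integers, which is exactly why the outputs of Exam.Reverse(...) and PmfCorrect(...) stop being directly comparable.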

Chapter 8 missing cells in zip file

Thank you for this great resource!

The Colab notebook for Chapter 8 has cells that create the France and Croatia goal data; the notebooks in the downloaded zip file do not.

Two small ideas from the first chapter

First of all, great work!

I was reading the first chapter, where you introduce the basic concepts of probability, and I really liked what you did with the Pandas example and the conditional function.

This might be too much overhead for new readers, but I thought it would be nice to overload the __or__ operator (|) to compute the conditional probability of two variables. This, however, requires a new class, since the Pandas Series already uses the __or__ operator. Here is a minimal example:

class Prob:
    def __init__(self, series):
        self.series = series
    
    def __or__(self, other):
        if isinstance(other, Prob):
            return Prob(self.series[other.series])
        else:
            raise TypeError("unsupported operand type(s) for | : '{}' and '{}'".format(self.__class__, type(other)))
        
    def __and__(self, other):
        if isinstance(other, Prob):
            return Prob(self.series & other.series)
        else:
            raise TypeError("unsupported operand type(s) for & : '{}' and '{}'".format(self.__class__, type(other)))
        
    def mean(self):
        # So that it works with the prob function
        return self.series.mean()

This is how it looks in practice:

liberal = Prob(gss['polviews'] <= 3)
old = Prob(gss['age'] >= 65)

prob(liberal | old)
prob(liberal & old)

And you can now write Bayes's theorem almost without change:

prob_liberal_given_old = prob(old | liberal) * prob(liberal) / prob(old)
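
Since gss is not defined in this snippet, here is a self-contained version with synthetic stand-in columns (the values are made up) that checks the Bayes formula against direct conditioning:

```python
import pandas as pd

class Prob:
    """Minimal copy of the class above, for a self-contained check."""
    def __init__(self, series):
        self.series = series
    def __or__(self, other):
        # P(self | other): restrict self to the rows where other holds.
        return Prob(self.series[other.series])
    def __and__(self, other):
        return Prob(self.series & other.series)
    def mean(self):
        return self.series.mean()

def prob(x):
    """Probability that a boolean quantity is True."""
    return x.mean()

# Synthetic stand-ins for gss['polviews'] and gss['age'].
polviews = pd.Series([1, 3, 5, 7, 2, 6, 4, 3])
age      = pd.Series([70, 40, 66, 30, 68, 72, 50, 80])

liberal = Prob(polviews <= 3)
old = Prob(age >= 65)

lhs = prob(liberal | old)                              # direct conditional
rhs = prob(old | liberal) * prob(liberal) / prob(old)  # Bayes's theorem
print(abs(lhs - rhs) < 1e-12)  # True
```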

Anyway, I understand that this might be too much complexity for something that is probably not even relevant in later chapters. My second suggestion is simply to rename the second parameter of the conditional function to given:

def conditional(A, given):
    """Conditional probability of A given "B".
    
    A: Boolean series
    given: Boolean series
    
    returns: probability
    """
    return prob(A[given])

This might make the examples in the text easier to read, and make it clear exactly what is being conditioned on what:

conditional(liberal, given=female)

Linda problem should use conditional probability when it comes to sex.

One thing I feel the book misses several times in Chapter 1 is that Linda's sex is given in the problem description (I picked one exercise as a sample):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?

Linda is a banker.

Linda is a banker and considers herself a liberal Democrat

According to the sample solution to this exercise, we should answer "Linda is a banker" with P(female & banker). However, given that it is clear from the description (from the use of "she") that Linda is female, this fact should be treated as given, and the solution should be P(banker | female).

Of course, this doesn't change the answer (being a female banker is still more probable than being a liberal, Democrat, female banker), but I feel that at least the answer to this exercise should be changed, or it should be noted that multiple (in my humble opinion, valid) approaches are possible here.
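
A toy calculation (the counts are made up, purely illustrative) shows both readings side by side and confirms that the ordering of the two options is the same either way:

```python
# Hypothetical counts for a population of 1000 people.
n = 1000
female = 500
female_banker = 20
female_liberal_dem_banker = 5

# Reading 1: joint probabilities, as in the sample solution.
p_joint_banker = female_banker / n                       # P(female & banker)
p_joint_ldb = female_liberal_dem_banker / n              # P(female & lib. Dem. & banker)

# Reading 2: condition on "she", i.e. on Linda being female.
p_cond_banker = female_banker / female                   # P(banker | female)
p_cond_ldb = female_liberal_dem_banker / female          # P(lib. Dem. & banker | female)

# The numbers differ between readings, but option 1 wins under both.
assert p_joint_banker > p_joint_ldb
assert p_cond_banker > p_cond_ldb
```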

Misleading solution for chap06 exercise "honest members"

I believe it is a lucky coincidence that table[312] or so gives a good likelihood. Instead, one should search table[] for the index whose posterior has a max_prob() closest to 312.
I think this becomes clear if one plots the distributions for a few indices, say [0, 180, 360, 530].
It also helps to keep in mind that the values 50% and 90% are somewhat arbitrary.

There is no test_env.py script

Hi,
I'm currently working from your latest source, and the installation instructions for a local install mention this:

\item To test your environment and make sure it has everything we need, run the following command:

\begin{verbatim}
python test_env.py
\end{verbatim}

Is this still planned? I'm not a beginner, but I think this might be important for beginners, to give them confidence that they have installed the Python stack correctly. That's why I'm reporting it here.

Thank you for this work of yours, Allen!

Regards,
Florian

Chapter 4 - Typo in Plot Legend

Hey developer,

In Chapter 4, in the first exercise, the given example should illustrate the posterior distribution of the batting probability rather than the prior distribution:
[image]

posterior = prior.copy() should be added before the update, and posterior.plot(label='posterior') should be used instead of prior.plot(label='prior'), since we are plotting the posterior distribution.

Cheers,
Yifan

I have problem in Cookie solution

In the Update function, I thought that in the next round pmf[hypo] should stay fixed at 0.5, because the prior won't change no matter how many cookies have been taken out.

But in your solution, I see pmf[hypo] changing every time. Am I wrong about Bayes's theorem in this situation?
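
For context, a minimal version of the update loop looks like this (Bowl 1: 30 vanilla / 10 chocolate, Bowl 2: 20 / 20, drawing vanilla cookies with replacement as in the chapter). The key point is that each update starts from the current pmf: yesterday's posterior is today's prior, so pmf[hypo] is supposed to change with every draw.

```python
# Cookie problem: two bowls, drawing vanilla with replacement.
pmf = {'Bowl 1': 0.5, 'Bowl 2': 0.5}
like_vanilla = {'Bowl 1': 0.75, 'Bowl 2': 0.5}

for draw in range(2):
    # Multiply the *current* pmf by the likelihood, then normalize.
    for hypo in pmf:
        pmf[hypo] *= like_vanilla[hypo]
    total = sum(pmf.values())
    for hypo in pmf:
        pmf[hypo] /= total
    print(pmf['Bowl 1'])  # prints ≈0.6 after the first draw, ≈0.692 after the second
```

Resetting pmf[hypo] to 0.5 each round would throw away the information from earlier draws; the 0.5 prior applies only before any data is seen.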
