
post--visual-exploration-gaussian-processes's Introduction

Development

To install all the dependencies:

npm install

The project supports live reloading with:

npm run serve

To deploy the project to the docs folder:

npm run build
npm run deploy

Important

There are some things that need to be considered when working on the document:

  • The .bib file needs to be free of JabRef @Comment tags, otherwise the bibliography will not work.

post--visual-exploration-gaussian-processes's People

Contributors

grae-drake, grtlr, jabalazs, nbro, reiinakano, rkehlbeck, st--


post--visual-exploration-gaussian-processes's Issues

CLT description seems off

The following description of the CLT seems a little off:

<d-footnote>
One of the implications of this theorem is that a collection of independent, identically distributed random variables with finite variance are together distributed normally.
A good introduction to the central limit theorem is given by <a
href="https://www.khanacademy.org/math/ap-statistics/sampling-distribution-ap/sampling-distribution-mean/v/central-limit-theorem"
target="_blank">this video</a> from <a href="https://www.khanacademy.org" target="_blank">Khan Academy</a>.
</d-footnote>.

Instead of saying "are together distributed normally", I think it would be better to say "have a mean that is distributed normally".
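For reference, the standard statement is that the scaled sample mean of i.i.d. variables X_1, \dots, X_n with mean \mu and finite variance \sigma^2 converges in distribution to a normal:

\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)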

Minor typo in text

There is a minor typo in the text:

We will first explore the mathematical foundation that Gaussian procsses are built on — we invite you to follow along using the interactive figures and hands-on examples.

It should be:

We will first explore the mathematical foundation that Gaussian processes are built on — we invite you to follow along using the interactive figures and hands-on examples.

Review #2

The following peer review was solicited as part of the Distill review process.

The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to review this article.


Labeling figures would have been helpful. Axis labels (even dummy values) on some plots would have been helpful because it's not clear what they refer to on the first pass, such as the first figures in the "Gaussian Process" and "Kernel" sections. It's also not intuitive how to play with the figures.

For diagrams with a shaded portion (first and last figures in article), it's not clear how much "confidence" (standard deviations) they show. Are they showing the mean plus/minus one standard deviation? Two standard deviations?

The definition for regression is not precise (what does "as close as possible" mean?). On the definition of Gaussian Processes, I would say that GPs give a confidence for the predicted value for a given input. GPs can be viewed as a distribution over functions, so it’s not accurate to use the term “predicted function” in the first definition.

For the definition of GPs, I would avoid the use of the term “kernel” and prefer the term “covariance function”. Or include a caveat to avoid confusion between GP kernels and kernel methods (e.g. SVMs).


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 2/5
Outstanding Communication Score
Article Structure 3/5
Writing Style 3/5
Diagram & Interface Style 2/5
Impact of diagrams / interfaces / tools for thought? 2/5
Readability 2/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 2/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 2/5
How easy would it be to replicate (or falsify) the results? 3/5
Does the article cite relevant work? 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 2/5

Marginalization and Conditioning figure caption - minor typo

Thank you for a nicely illustrated article. A minor typo, or what I think is a minor typo:
"On the left you can see the result of marginalizing this distribution for Y, akin to integrating along the Y axis."
In the above sentence, "akin to integrating along the Y axis" should be changed to "akin to integrating along the X axis".

Article Structure / Explanation Sequence

First of all, thanks a lot for putting in the time and the work to publish this on Distill. I've been looking for an intuitive article on GPs for a while and was all the more delighted to find this.

However IMO the article structure is a bit backwards.

We will first explore the mathematical foundation that Gaussian procsses are built on — we invite you to follow along using the interactive figures and hands-on examples. They help to explain the impact of individual components, and show the flexibility of Gaussian processes. After following this article we hope that you will have a visual intuition on how Gaussian processes work and how you can configure them for different types of data.

This is a classical bottom-up approach, but I think it's not a good fit for the article. One can safely assume that someone looking to apply GPs has at least some background in statistics and knows about multivariate Gaussian distributions.

For example, I myself have an interest in using GPs for estimation of continuous state variables. I think my background would be suitable to understand GPs and how to use them with the right approach, yet I struggle to follow the explanations and grasp the basic ideas, even after reading half the article.

I would strongly prefer a top-down approach, where the goals (i.e. what problems GPs solve) are presented first, alongside how other competing methods might fail to solve these problems. Secondly, the key ideas behind GPs should be presented in a summary style, with as little math as possible, explaining, in plain English, what tricks and ideas GPs apply to solve the problems outlined before. Think startup pitch, not math proof :)

With that framework set, the ideas behind GPs can be explored in ever more detail, while keeping a top-down approach.

we are interested in predicting the function values at concrete points, which we call test points X. So how do we derive this functional view from the multivariate normal distributions that we have considered so far? Stochastic processes, such as Gaussian processes, are essentially a set of random variables. In addition, each of these random variables has a corresponding index i. We will use this index to refer to the i-th dimension of our n-dimensional multivariate distributions. Now, the goal of Gaussian processes is to learn this underlying distribution from training data.

This is basically where you lost me. What does "learning a distribution" mean? I don't think the goal is to determine the parameters of a Gaussian distribution from some sample data, is it? That's what it sounds like, though. Again, if the article were top-down I'd probably have the right context to interpret your explanations (here and in the rest of the article) and would know where you're going. Not having that, I'm just kind of lost.

Confusing notation for training and test data as well as for target variable

I have a minor comment about the mathematical notation in the post.

Throughout the post you have used X to mean test points and Y to mean training points. E.g., at the start of the section Posterior Distribution you introduce:

First, we form the joint distribution P_{X,Y} between the test points X and the training points Y.

This was a bit confusing to me at first, as in the general ML literature y is usually the target variable. Here, however, Y is the training data, and I couldn't understand how to go from the observations of the independent variables of the test data to the target (unobserved) variable for the test data. I find the notation in one of the notebooks easier to read, where the distribution is written over the vector P_{f_a, f_b}; this makes it easier to see how we can use it for prediction at unknown values of the observed independent variables x.

The code in this blog post also follows the prediction view, with y referring to the target variable.

I have tried to understand GPs before, and every time it has been hard because of the confusing notation and the generalization to infinite-dimensional Gaussians, which are used by conditioning on the observed data. I think the distill.pub article can really help in breaking down the notation issue in explaining GPs.

Finally, this image has been the most descriptive explanation of GPs for me. (Source)
[image]

Calculation of Cov(X,X')

Hello!
Thank you for the very interesting article on Distill! I am pretty new to the topic, so maybe my question is naive, but I would like to know why I need a pre-selected kernel to calculate Cov(X,X'). Can't I calculate the covariance matrix from the data itself?
I am also a bit uncertain why we need X and Y for test and training data, and why they have different dimensions. In a machine learning setting, X and Y come from the same data pool; the test data set is just separated out according to some sampling strategy.
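To make my question concrete, here is how I understand the covariance matrix is typically built from a kernel (a sketch under my own assumptions, not the article's source code): since we only observe one function value per input location, there are no repeated samples to estimate an empirical covariance from, so the kernel encodes a prior assumption about how function values at different inputs co-vary.

// Squared-exponential (RBF) kernel; lengthScale and variance are
// illustrative hyperparameter names, not the article's.
function rbf(x1, x2, lengthScale = 1, variance = 1) {
  const d = x1 - x2;
  return variance * Math.exp(-(d * d) / (2 * lengthScale * lengthScale));
}

// Build the covariance matrix by evaluating the kernel on every pair of inputs.
function covMatrix(xs, kernel) {
  return xs.map(xi => xs.map(xj => kernel(xi, xj)));
}

const xs = [-2, -1, 0, 1, 2];     // input locations (training or test points)
const Sigma = covMatrix(xs, rbf); // Sigma[i][j] = k(xs[i], xs[j])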

'independent of' not 'independent from'

Your article is absolutely awesome, thank you so much for providing it!
As a very minor issue I think it should read 'independent of' instead of 'independent from' in the following sentence:

"Note that the new mean only depends on the conditioned variable, while the covariance matrix is independent from this variable. "

See independent-independently-of-from

Some minor revisions to your awesome article

Thank you for your nice article. I have gained a better understanding of Gaussian processes from it. I've listed here some minor errors in your article. Please do let me know if my proposed changes are incorrect.

Format:

  • original: (!) machne
  • proposed changes: spelling: "machine"

Section: Multivariate Gaussian distributions

Error:

  • The multivariate Gaussian distribution is defined by a mean vector \mu and a covariance matrix \Sigma. You can see an (!) interative example of such distributions in the figure below.
  • Spelling: "interactive"

Error:

  • The diagonal of \Sigma consists of the variance \sigma_i^2 of the i-th random variable (!). And the off-diagonal elements (!) \sigma_{ij} describe the correlation between the i-th and j-th random variable.
  • Instead of a "period" and "And", change to "comma" and "and"
  • Change "\sigma_{ij}" to "\sigma_{i}\sigma_{j}"

Error:

  • The (!) standard deviations for each random variable are on the diagonal of the covariance matrix, while the other values show the covariance between them.
  • Change "standard deviation" to "variance"

Subsection: Marginalization and Conditioning

Error:

  • In particular, given a normal probability distribution P(X,Y) over vectors of random variables X (!) , and Y, we can determine their marginalized probability distributions in the following way:
  • Remove the comma

Section: Gaussian Processes

Error:

  • Each dimension (!) is assigned an index (!) x_i with i \in \{1,2\}.
  • Change to: "Each dimension x_i is assigned an index i \in \{1,2\}."

Subsection: Kernels

Error:

  • For each kernel, the covariance matrix has been created from N=25 (!) linear spaced values ranging from [-5,5]. Each entry in the matrix shows the covariance between points in the range of [0,1].
  • Change "linear spaced" to "linearly-spaced"

Subsection: Posterior Distribution

Error:

  • Adding training points (_) changes the number of dimensions of the multivariate (!) Guassian distribution.
  • Spelling: "Gaussian"

Error:

  • In many cases this can lead to fitted functions that are (!) unnecessary complex.
  • Spelling: "unnecessarily"

Error:

  • Through (!) marginalization of each random variable, we can extract the respective mean function value \mu'_i and standard deviation \sigma'_i = \Sigma'_{ii} for (!) i-th test point.
  • Change to "the marginalization"
  • Change to "the i-th test point"

Error:

  • Without any activated training data (!) , this figure shows the prior distribution of a Gaussian process with a RBF kernel.
  • Remove extra space before comma

Subsection: Combining different kernels

Error:

  • To show the impact of a kernel combination and how it might retain qualitative features of the individual kernels, take a look at the (!) figure below.
  • The link points to the wrong figure

Error:

  • In the (!) figure below, the observed training data has an ascending trend with a periodic deviation.
  • The link points to the wrong figure. The two links were swapped.

Dimension of kernels

I am a bit confused with the dimension of the kernel function.

k: R^n x R^n -> R
Sigma = COV(X, X') = k(t, t')

What is n here? Are we having a 1-dimensional regression problem or n-dimensional? The covariance matrix should be nxn, while the kernel will give just a scalar real value?
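For reference, my current reading is that the kernel takes two individual inputs (each in \mathbb{R}^d, with d = 1 in the article's regression examples) and returns a scalar, and the n \times n covariance matrix comes from evaluating it on all pairs of the n points:

k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \qquad \Sigma_{ij} = k(x_i, x_j), \qquad \Sigma \in \mathbb{R}^{n \times n}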

Sorry for adding it as an issue. Can't find any way to comment.

And THANKS a lot for this great article, really helpful.

Figure layout optimizations still necessary

Hi authors, one step before publishing that is often overlooked is responsive design for viewing on mobile. We won't ask you to recreate your figures in a responsive manner, but we do at the least require that figures can be scrolled if they don't fit on a display. When I enabled this on your article I noticed that many figures were actually overflowing their containers. I can fix this to my judgement, but I'd prefer if you went in and assigned container sizes for figures that you consider correct.

Here's an example of the kind of behavior I mean:
[screenshot]

When setting overflow-x: scroll—which is needed to allow scrolling on mobile devices—diagrams like these would still overflow their container and have to scroll even on desktop. Please let me know if you need additional help setting correct container sizes. :-)

Bivariate Conditioning Figure Minor Issues

Spotted a couple minor issues with the figure in the "Marginalization and Conditioning" section.
(1) The title "Conditioning (X=1.2)" doesn't change when X changes
(2) The conditional variance \sigma_{Y\mid X} is always the same as the marginal variance \sigma_Y

[screenshot]

Thanks!

Apparent typo after second footnote

The published document has a space after the second footnote and before the following period:

[screenshot]

There is no space there in the source but the period is on a new line. I believe that may cause the document to be rendered with that space.

credible instead of confidence

To the best of my knowledge, "credible interval" would be more suitable; confidence intervals are used in frequentist statistics.

FYI: renamed repo

The new format is Distill-specific: post--visual-exploration-gaussian-processes. You may need to update your git remotes. Thanks for understanding! :-)

A typo in variance calculation

I'm referring to the marginalization and conditioning figure (3rd figure in the article) here.

As per my understanding, $\sigma_{Y|X}^2=\sigma_Y^2-\frac{\sigma_{XY}^2}{\sigma_X^2}$. While changing the bivariate distribution of $X$ and $Y$ using the handles, $\sigma_{Y|X}$ (conditioning) and $\sigma_Y$ (marginalization) show exactly the same values. To be precise, $\sigma_{Y|X}$ should be noticeably less than $\sigma_Y$ when the covariance between $X$ and $Y$ is strong enough. The plot shows the correct variance, but the text is incorrect.
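For reference, the standard bivariate conditioning formulas that the figure should reflect are:

\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X), \qquad \sigma_{Y|X}^2 = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}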

Review #1

The following peer review was solicited as part of the Distill review process.

The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to review this article.


This article was a delight to read, and hits many of the right notes in what I think makes a good Distill article:

  • The writing is crisp. I enjoyed the brevity of notation and mathematical minimalism --- the essential points are communicated without burdening the reader with unnecessary formalism.
  • The figures are of an exceptional quality and include many design innovations. E.g., the use of toggle-able “pre-set” data points is very elegant and trumps conventional widgets that allow the user to put in data points wherever they wish. The diagrams are also technically very well executed; the animations and interactions are fluid and work with minimal stutter. The presentation is overall outstanding.

What is unclear to me, however, is whether this article breaks new ground in exposition. A lot of space, for example, is devoted to exposition of basic properties of multivariate Gaussian distributions. Many of the figures correspond to visualizations that I already have in my head (though this is probably a good thing!), and the depth of the material on GPs themselves is not beyond what is covered in introductory courses. What I’m saying is that a reader already familiar with GPs is unlikely to take away anything from this. A new reader, however, who has not developed this visual “muscle”, would likely take away a lot. I will thus leave it to the editors to decide if Distill is an appropriate venue for such an article.

Personally, however, I think it is a solid piece of writing and thus I recommend it for publication.

Specific comments

[minor] based on prior knowledge -> incorporating prior knowledge

[minor] The statement [Gaussian processes assigns a probability to each of those functions] needs a citation.

[typo] symetric -> symmetric

[nit] I am not a fan of the notation P(X,Y) used to denote the joint distribution of X and Y. This appears to be interchangeable with p_{X,Y}, used elsewhere in the article to refer to the joint density. These two notations refer to essentially the same thing.

[nit] Though it is clear from context what the article means, the article needs to be careful about distinguishing between probabilities and probability densities.
e.g. in statements like “we are interested in the probability of X=x”

[minor] In the section “kernels”, L2-norm should be L2-distance

[minor] Periodic kernels are not invariant to all translations, but only to translations equal to the period of the function.

[major] Kernels can be combined by addition, multiplication and concatenation (see Chapter 5.4 of MacKay, Information Theory, Inference, and Learning Algorithms). The omission of addition as a way of combining kernels is surprising.

[minor] Furthermore, it would be nice to see an explanation of what it means when two kernels are combined. The resulting kernel inherits the qualitative properties of its constituents, but how, and why?
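To illustrate, combining kernels is just pointwise arithmetic on the kernel functions (a sketch, not the article's implementation; any sum or product of valid kernels is again a valid kernel):

// Combine kernels pointwise; sums and products of valid kernels are valid kernels.
const addKernels = (k1, k2) => (x1, x2) => k1(x1, x2) + k2(x1, x2);
const mulKernels = (k1, k2) => (x1, x2) => k1(x1, x2) * k2(x1, x2);

// E.g. an ascending trend with a periodic deviation (kernel names illustrative):
// const trendPlusSeason = addKernels(linearKernel, periodicKernel);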

[major] There seems to be a problem in the figure right above “Conclusion”, when the linear kernel is used alone. The linear kernel has rank 2, and thus the distribution after conditioning is not well defined if there are more than 2 points observed. The widget, however, does not seem to complain when more than two points are added. What is going on? I suspect that the regression has been done with a noise model, i.e. with a small multiple of the identity added to fix the singularity. This is completely valid, but it should be discussed explicitly.

[major] To speak more broadly on this point, it is worth mentioning that GP regression is typically done in conjunction with a noise model. This allows the posterior some flexibility such that it does not need to pass through every observation (indeed, this constraint makes GPs nearly unusable). This addition requires very little extension to the GP framework and should be discussed explicitly.
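Concretely, with observation-noise variance \sigma_n^2 the standard GP regression posterior (as in Rasmussen & Williams) only changes the training-data block of the covariance:

\mu' = K_*^\top (K + \sigma_n^2 I)^{-1} y, \qquad \Sigma' = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*

where K is the train-train covariance, K_* the train-test covariance, and K_{**} the test-test covariance. With \sigma_n^2 > 0 the matrix K + \sigma_n^2 I is always invertible, which also resolves the rank-2 issue with the linear kernel noted above.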

[minor] I do think this is what the authors seem to be getting at in the conclusion, at the line "If we need to deal with real-world data, we will often find measurements that are afflicted with uncertainty and errors." The cited article, however, seems to be more sophisticated than what is needed to address the basic problem.

A final thought

I think the article could benefit from a hero figure that is a true “GP sandbox”. It would be nice to be able to combine kernels, change the parameters of the kernels, and add/remove data points in real time, all in one place.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 4/5
Writing Style 5/5
Diagram & Interface Style 5/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 4/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 3/5
Does the article cite relevant work? 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5

Inconsistency between terms: "standard deviation" and "variance"

Under the "Multivariate Gaussian distributions" section it is correctly stated that,

The diagonal of Σ consists of the variance σ_i^2 of the i-th random variable.

But under the "Kernels" and "Prior Distribution" sections σ itself is incorrectly called the variance.

Then under the "Posterior Distribution" section, the wording is switched to "standard deviation" and it is incorrectly stated that,

we can extract the respective ... standard deviation σ'_i = Σ'_ii for the i-th test point

I would suggest dropping standard deviation (σ) and just use the variance (σ^2) in each case. An explanation can be added that the standard deviation is the square root of the variance.
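In symbols, the consistent convention would read:

\Sigma'_{ii} = (\sigma'_i)^2, \qquad \sigma'_i = \sqrt{\Sigma'_{ii}}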

Numerical instabilities when sampling from Gaussian

We are running into numerical problems when sampling from the Gaussian distribution. The main culprit here is the Cholesky decomposition performed by ml-matrix.

Currently, our solution is to skip the Cholesky decomposition when sampling. This results in samples being scaled incorrectly. It is not really visible to the reader since the samples still revolve around the correct mean, but I think we should find a way to solve this issue.

One way would be to add a small epsilon to the diagonal elements of the covariance matrix. However, this does not guarantee that the decomposition will work. Maybe we can find a numerically more stable Cholesky decomposition on npm that we can use?
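As a sketch of the epsilon idea in plain JavaScript (working on plain arrays, so nothing about ml-matrix's API is assumed): retry the decomposition with a growing diagonal jitter until it succeeds.

// Standard Cholesky on a 2D array; throws if A is not positive definite.
function cholesky(A) {
  const n = A.length;
  const L = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j <= i; j++) {
      let s = A[i][j];
      for (let k = 0; k < j; k++) s -= L[i][k] * L[j][k];
      if (i === j) {
        if (s <= 0) throw new Error('matrix is not positive definite');
        L[i][i] = Math.sqrt(s);
      } else {
        L[i][j] = s / L[j][j];
      }
    }
  }
  return L;
}

// Add jitter to the diagonal and retry with a larger epsilon on failure.
function jitteredCholesky(A, eps = 1e-10, maxTries = 10) {
  for (let t = 0; t < maxTries; t++, eps *= 10) {
    const Aj = A.map((row, i) => row.map((v, j) => (i === j ? v + eps : v)));
    try { return cholesky(Aj); } catch (_) { /* increase eps and retry */ }
  }
  throw new Error('covariance matrix could not be stabilized');
}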

Error in formula for covariance

There is a small error in the formula for the covariance in section Multivariate Gaussian distribution.
It should be either
\Sigma = E[(X - \mu)(X - \mu)^T]
or, using the elements,
\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)].
Note the absence of the transpose in the latter equation.

Conform with distill.pub build process

In b8d8a15 I have committed all of our source code and build system.

We now need to make some smaller changes to our build process. Distill.pub requires a public/ folder that contains the bundled version of our article. We have a similar folder already (docs/), which we need for hosting the article on Github Pages (we used this link to share the article as a pre-print).

It should be sufficient to do this as the last step of the process when we transfer the ownership to distill.pub as well.

Build instructions in README do not work

I can't build using the README instructions, at least on macOS Catalina:

$ npm run build
npm run deploy

> gau@… build /Users/taliesinb/git/distillpub/post--visual-exploration-gaussian-processes
> webpack

sh: webpack: command not found
npm ERR! code ELIFECYCLE
npm ERR! syscall spawn
npm ERR! file sh
npm ERR! errno ENOENT
npm ERR! gau@… build: `webpack`
npm ERR! spawn ENOENT
npm ERR!
npm ERR! Failed at the gau@… build script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm WARN Local package.json exists, but node_modules missing, did you mean to install?

npm ERR! A complete log of this run can be found in:
npm ERR!     /Users/taliesinb/.npm/_logs/2020-01-24T18_34_54_003Z-debug.log
npm ERR! missing script: deploy

npm ERR! A complete log of this run can be found in:
npm ERR!     /Users/taliesinb/.npm/_logs/2020-01-24T18_34_54_274Z-debug.log

The solution seems to be npm install webpack. It's quite scary, though, because on Catalina it turns out that the webpack dependency fsevents (presumably for live-reloading) doesn't have binaries available, and attempts to build fsevents from source fail with reams of horrible C++ errors (it seems like something in Node changed recently). But fsevents is optional.

There is another problem, however, which seems more serious:

$ npm run deploy

npm ERR! missing script: deploy

Here are the available scripts:

➜  post--visual-exploration-gaussian-processes git:(master) ✗ npm rum
Lifecycle scripts included in gau:
  test
    echo "Error: no test specified" && exit 1

available via `npm run-script`:
  build
    webpack
  serve
    webpack-serve --config ./webpack.config.js
  lint-fix
    eslint --config .eslintrc.json --fix --ext .js,.html src

Reposition the word only

I think in the following sentence the word 'only' occurs at the wrong position, which inadvertently changes the meaning of the statement:

"Note that the new mean only depends on the conditioned variable, while the covariance matrix is independent from this variable. "

My impression is it should rather be as follows (placing the word 'only' before the words 'the new mean' rather than behind them):
"Note that only the new mean depends on the conditioned variable, while the covariance matrix is independent from this variable. "

Final figure not showing all covariance matrices

Hi there

Thanks for a great blog post! After reading >5 different blogs/resources about GPs, this might be the first time that the multivariate Gaussian <-> points-on-a-plot correspondence has made sense to me.

The last plot (combining kernels) is a bit buggy for me:
It shows the correct covariance matrix only for combinations when RBF is selected (see Screenshot).
With combinations like "only linear" or "periodic + linear", the covariance matrix is blank.

I quickly dug around in the code but couldn't find anything suspect, except that "this.covMat" was set to different values when the "blank" options (linear/periodic on their own) were selected; the values in the covariance matrix were just really small.

Also, weirdly, the this.refs.covMat.get().name remains "RBF" no matter what.

Relevant lines:
Covariance matrix calculation and disappearance seems to happen here: https://github.com/distillpub/post--visual-exploration-gaussian-processes/blob/master/src/components/KernelCombinations.html#L321

Thanks again!

[screenshots]

System:
Safari 15.4 (17613.1.17.1.13), Chrome Version 100.0.4896.75 (Official Build) (arm64)
MacOS 12.3.1
MacBook Pro 14inch 2021

Review #3

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Austin Huang for taking the time to review this article.


On Outstanding Communication:

For a topic in such great need of accessible explanations, I felt that the article’s organization and writing could have been much improved.

More consideration needs to be put into how topics are sequenced in the writing.

For example, under “Marginalization and Conditioning”, the high level intuition came after a detailed technical exposition such that, if the reader was able to follow the technical exposition, they wouldn’t need the figures. On the other hand, if the reader did not have the intuition of what marginalization and conditioning illustrated in the figure already in their minds, they would have struggled to follow the last 3-4 paragraphs.

Other issues follow this pattern of being sequenced poorly, starting with the details and ending with the big picture. The section on multivariate Gaussian had two paragraphs discussing positive semidefiniteness, diagonal/off-diagonal elements of the covariance, only to be followed by actually defining the covariance matrix, ending with a visual of what a multivariate Gaussian looks like.

The section on the posterior distribution (arguably the crux of the model) does not read well. It is difficult to follow and is in need of a rewrite.

See section-by-section comments at the end of the review for a detailed discussion of specific communication issues.

On Scientific Correctness & Integrity:

Given that the article does not make scientific claims and primarily focuses on communicating the basic mathematical structure of a particular area of machine learning, most of these don’t directly apply.

This could be seen as one weakness of the article - that it doesn’t take enough of a stance (a good contrast is the t-SNE article, another communication of ML methodology which does take positions on its use rather than focusing strictly on the mathematical underpinnings). This article doesn’t comment on or help the reader reason about correct or incorrect application of GPs; it only describes the mathematical machinery behind them.

On the topic of limitations, it would be nice if the article wrote about pitfalls / limitations of Gaussian Process models.

Detailed Section Comments:

Introduction

"Even if you have spent some time reading about machine learning, chances are that you have never heard of Gaussian processes."

This assumption will often be incorrect (in some circles, Gaussian processes are well known), but more importantly, it misses an opportunity to provide a stronger motivational hook to the article beyond "it's something you haven't heard of / rehearsing the basics is good".

"Their most obvious area of application are regression problems, for example in robotics, but they can also be extended to classification and clustering tasks [2, 3] . Just a quick recap: the goal of regression is to find a function that describes a given set of data points as closely as possible. "

The phrasing here feels more convoluted than it needs to be. Why bring in the overloaded regression terminology, followed by an awkward definitional statement, after its initial use? One could just use the definition in the first sentence and avoid the "just a quick recap" sentence altogether.

Multivariate Gaussian Distributions

"In particular, we are interested in the multivariate case of this distribution, where each random variable is distributed normally and their joint distribution is also Gaussian."

This section misses an opportunity to introduce a figure that illuminates the gap between the intuition of a Gaussian most people have with the way they're used in GPs.

One of the big stumbling blocks with GPs is a visual one - when students are introduced to Gaussians, they are used to looking at the distribution with each dimension mapped to a visual axis (like the two figures in this section). In GPs, the dimensions of a multivariate Gaussian are visually represented as vertical axes (in effect, sharing a single axis).

This section shows the former, the next section shows the latter, but we’re missing a visual that bridges and links the relationship between the two visual representations. There is an opportunity to provide an illustration that bridges the reader's familiar representation of a multivariate Gaussian with one dimension per axis (limited to 1D to 3D examples) with the visual representation common to GPs (one dimension per horizontal slice), by showing the correspondence side-by-side in a single interactive visual.

"In general, the multivariate Gaussian distribution is defined by a mean vector \muμ and a covariance matrix \SigmaΣ",

What purpose does the "In general" clause serve?

"symetric" misspelled - should be "symmetric"

“The standard deviations for each random variable are on the diagonal of the covariance matrix … ”

I believe this is an error - the diagonal of the covariance matrix corresponds to the variance for each random variable, not the standard deviation.

The placement of the controls for the bivariate normal figures are counter-intuitive.

The placement bottom handle pointing downwards implies that:

  1. the effect of the control is on the vertical proportions of the Gaussian (since the control goes up and down) and
  2. the effect is somehow opposite the control which points upwards.

Neither of these intuitions turns out to be true of the control's effect on the distribution.

To add more confusion, the most positive correlation is obtained when the slider is at its most negative position and the most negative correlation is obtained when the slider is at its most positive position.

"Gaussian distributions are widely used to model the real world: either as a surrogate when the original distributions are unknown, or in the context of the central limit theorem."

This sentence does not make much sense, unpacking the two clauses:

"Gaussian distributions are widely used to model the real world ... when the original distributions are unknown."

This notion of “assume Gaussian by default” reflects a common misuse of Gaussian distributions and is blamed for many modeling failures. The phrasing here implies that the use of Gaussians in conditions where the original distributions are unknown is an acceptable common practice, which is poor modeling practice.

"Gaussian distributions are widely used to model the real world ... in the context of the central limit theorem."

I can guess what this means (that Gaussian distributions are justified if the conditions of the central limit theorem are held) but the phrasing reads very awkwardly.

"Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. This means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable."

Use of “This” is ambiguous (does it refer to being closed, to conditioning, to marginalization, etc.?). One could replace "This" with "Being closed under conditioning and marginalization" to reduce ambiguity.

"Marginalization and conditioning both work on subsets of the original distribution and we will use the following notation"

This section needs more elaboration as there are two key stumbling blocks for those coming from other backgrounds.

  1. The way in which X and Y are used in this context as subsets of a multivariate model as opposed to predictor/dependent variables is going to be confusing for a large fraction of people and it's worth being precise and having a visual aid to cross that gap.

  2. X and Y represent sets of random variables. This is stated in the text, but would be much better if somehow illustrated in a way such that X and Y are shown to capture multiple dimensions, perhaps representing the first equation under "Marginalization and Conditioning" in a more visual way. The breakdown of the \mu and \Sigma into subsets would be better illustrated with a figure.

“The interpretation of this equation is straight forward”

I don’t think it’s helpful to qualify the interpretation as "straightforward" or “not straightforward”, just explain the interpretation without the qualification.

"Now that we have worked through the necessary equations, we will think about how we can understand the two operations visually."

I think it would be beneficial to show the visual first and then present marginalization and conditioning equations as elaborations on the visual.

Gaussian Processes

"First, we will move from the continuous view to the discrete representation of a function: "

This transition sentence is confusing, considering that continuous views of functions haven't been discussed. Up to now the main topic has just been the multivariate Gaussian distribution, not continuous functions.

Kernels

There's a bug in the figure here: the right-margin annotation doesn't get updated until after the slider is released. It should get updated as soon as a slider is touched.

"The stationary nature of the RBF kernel can be observed in the banding around the diagonal of its covariance matrix"

This relationship could be made more intuitive visually. Specifically, what exactly is meant by banding around the diagonal and why does it confer stationarity?

Prior Distribution

The "click to start" functionality is awkward, why does the user need to start/pause at all? Why not just have the visual continually running by default?

Posterior Distribution

This section probably needs the most work writing-wise.

"The intuition behind this step is that the training points constrain the set of functions to those that pass through the training points."

This can be a harmful intuition since noise parameters can enable the solution set of functions to also not pass through the training points.

"Analogous to the prior distribution, we could obtain a prediction by sampling from this distribution. But, since sampling involves randomness, we would have no guarantee that the result is a good fit to the data. "

I think I understand the idea that the desire is to obtain a closed form representation, but this is a very awkward choice of phrase. Lots of times random processes are used to find fits to data (e.g. MCMC).

“In contrast to the prior distribution, we set the mean to \mu = 0, so in this context it is not really of importance.”

Very confusing what concept "it" is referring to here.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 3/5
Writing Style 2/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 4/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 5/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 2/5
How easy would it be to replicate (or falsify) the results? 3/5
Does the article cite relevant work? 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5

Minor typo

In the Kernels section:

In order to get a better intution for the role of the kernel

Which should be:

In order to get a better intuition for the role of the kernel

Should we fix a publication date?

Now that we're getting close to having everything ready for the release: would you like a specific publication date? We could do as early as next Monday on the Distill side!

Confusing wording in section on posterior distribution

Hi,

You state that "In contrast to the prior distribution, we set the mean to \mu=0.".

Would this be clearer if it stated "In contrast to the prior distribution where we set the mean to \mu=0, the resulting distribution will most likely have a non-zero mean \mu’ \neq 0"?

Or is there something I'm missing and not understanding correctly?

Explain Marginalization More Intuitively

For someone like me who does not yet possess the visual muscle, this article is incredibly valuable. For example, the easy-to-visualize explanation of conditioning as taking a slice of the distribution is great.

When it comes to marginalization, there is a little disconnect between the process of integrating with respect to one of the variables and the visual representation in the figure. I think it's important for the reader to visualize how the marginalization of a bivariate distribution produces a univariate distribution.

Suggested explanation:

Marginalization can be seen as first performing conditioning for each possible value of the variable being marginalized out. Each conditioning operation produces a distribution with that variable removed. In the second step, all of these (infinitely many) distributions are squashed together by weighting them by their probabilities and summing them up, i.e. integrating them with respect to the marginalized variable.

The slicing-and-adding operation can then be visualized to give a crisp picture of the process.
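In symbols, this slice-weight-integrate picture is exactly the law of total probability for densities:

p_X(x) = \int p_{X|Y}(x \mid y)\, p_Y(y)\, dy = \int p_{X,Y}(x, y)\, dy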

Usability tweaks

Jochen & Co-authors, this is looking great!

Here's a small list of suggestions after which I'd suggest you could transfer the repository to the Github org distillpub where I'll ensure technical compatibility.

(E.g. We'll need to link against the most current version of the Distill template served by distill.pub. At the moment you may see math type out of place—this is a regression in newer versions of Chrome, because the article is currently linking against an outdated build of the Distill template. In order to fix such bugs in the future without your input, we require linking against the hosted version of the template. We also have our own webcomponents loader already built into the Distill template, and we need to create the meta tags programmatically to allow DOI uploading and other indexing. We'll also need to remove your Google Analytics snippet, unfortunately. EDIT: I'll check about that with the other editors, actually. We haven't historically allowed this, but I wonder whether we have consciously thought about it. No guarantees, but I'll inquire. :-))

Suggestions

Make the draggable area of handles at least 44px big

At the moment it can be extremely hard to hit the handles on the Gaussian distribution diagrams. I'd suggest doing this with an invisible circle of radius 44px that intercepts mouse events the same way the current handles do. Please also set cursor: grab; on those elements to show they can be moved.
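A minimal sketch of what I mean, assuming the handles are drawn with D3 (the selector and drag-handler names here are hypothetical):

// Invisible, oversized hit target layered on top of the visible handle.
d3.select('#handle-group')
  .append('circle')
  .attr('r', 44)                   // enlarged hit radius
  .style('fill', 'transparent')    // invisible, but still receives pointer events
  .style('cursor', 'grab')         // signal that the element can be moved
  .call(d3.drag().on('drag', onHandleDrag)); // reuse the existing drag handler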

Maybe introduce Gaussian distribution parameters one-by-one

You already have a nice interactive diagram for a full Gaussian distribution. You could consider having the diagram be visible every time you introduce a parameter, with only that parameter changeable. So e.g. when you talk about "The diagonal of \Sigma consists of the variance \sigma_i^2 of the i-th random variable." it would be great to have the diagram with only the parameters on the diagonal adjustable.

This is just a suggestion to make the very introduction of Gaussian distributions more approachable. In our experience it is often valuable to focus on making the earliest sections of an explanation as accessible to readers as possible.

Add links to set state of figure #Posterior to its figcaption

In the (beautifully done!) figure #Posterior's figcaption you describe different states of the figure. It can help discoverability a lot to supply links on the relevant parts of the figcaption that, on click, call JS to set the figure to the described state. Take a look at the example below for this behavior.

[screenshot]
Clicking on the underlined text sets the position of the square in the above diagram to the described feature.

I'd recommend this approach for each of the italicized scenarios:

Without any activated training data, this figure shows the prior distribution of a Gaussian process with a RBF kernel. When hovering over the covariance matrix, the opacity of the gradient shows the influence of a function value on its neighbours. The distribution changes when we observe training data. Individual points can be activated by clicking on them. The Gaussian process is then constrained to make functions, that intersect these data points, more probable. The best explanation of the training data is given by the updated mean function.

(Please also remove at least the left margin of this figure so it aligns with the body text.)
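Here's a sketch of how those figcaption links might be wired up (the element selectors and the setState call are hypothetical):

// Wire each italicized scenario in the figcaption to a figure state.
document.querySelectorAll('figure#Posterior figcaption a[data-state]')
  .forEach(link => {
    link.addEventListener('click', event => {
      event.preventDefault();
      posteriorFigure.setState(link.dataset.state); // hypothetical figure API
    });
  });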
