andycasey / annieslasso
The Cannon 2: Compressed sensing edition
License: MIT License
(so that they can be loaded from elsewhere without having the huge data files)
Most of what he needs will be satisfied by #36, but he also wants something that shows the non-zero parameters as a function of Lambda. Also something that shows 2-d projections of the 17-d space for the training set, and for a much larger set of stars (if we can!)
There are two reasons I've encountered so far that make me think we should not:
The CannonModel class uses pickle to save and load objects. Python 3 has a newer pickling protocol for saving Python objects, which is incompatible with previous versions. For that reason, any model saved in Python 3 using the highest available protocol (which is the current default in our code base, for many reasons) will not be loadable in Python 2.x: someone would have to load it in Python 3, then save it again using a lower protocol. The newer protocol has faster read/write times and much smaller file sizes. @davidwhogg, @mkness: comments welcome!
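A minimal illustration of the compatibility trade-off described above (the model contents here are just a stand-in dict, not the actual CannonModel state):

```python
import pickle

# Stand-in for a trained model's state.
model_state = {"coefficients": [1.0, 2.0], "scatter": 0.01}

# Saved with the highest protocol: Python-3-only, but smaller and faster.
py3_only = pickle.dumps(model_state, protocol=pickle.HIGHEST_PROTOCOL)

# Saved with protocol 2: the highest protocol Python 2.x can read.
py2_compatible = pickle.dumps(model_state, protocol=2)

# Both round-trip correctly in Python 3.
assert pickle.loads(py2_compatible) == model_state
assert pickle.loads(py3_only) == model_state
```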
...Hogg supply details here
In a very high-dimensional label space it may not even be enough to say 'high chi-squared value', if the model is sufficiently flexible.
Either we have to implement the @mkness hack or else do something better, but we need something here.
The idea is to fit the labels and a few more parameters. Hogg has a trivial implementation idea.
This is assigned to @mkness; ask @andycasey for instructions.
For example, can we craft the right question to inform us whether it is worthwhile including Teff^4, or say log(Teff), etc.? Regularization can help, but these questions need to be thought out properly so we don't get misinformed by cross-terms.
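One reason cross-terms can mislead here: over a realistic Teff range, candidate parameterizations like Teff, Teff^4, and log(Teff) are strongly correlated with each other, so regularization can shuffle weight between them almost freely. A toy illustration (the Teff range is an assumption):

```python
import numpy as np

# Assumed plausible stellar Teff range [K].
teff = np.linspace(3500.0, 7500.0, 100)

# Candidate vectorizer terms built from the same underlying label.
candidates = np.vstack([teff, teff**4, np.log10(teff)])

# Correlation matrix of the candidate terms: all entries are close to 1,
# i.e., the terms are highly degenerate with one another.
corr = np.corrcoef(candidates)
print(corr.round(3))
```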
If the vectorizer is scaled, do we have good reason to specify boundary conditions on the theta parameters when simultaneously optimizing them with the scatter term? (None of this will matter if analytic derivatives are included, but it is worth pondering.)
Here is an example: 2M06123730+4036001
...but there are many more. What is going on? How can ASPCAP give us "good" labels for a star in which almost no pixel is good in any sub-visit: literally not one good measurement at any wavelength?
Here's the test I want:
1. Loosen pgtol by a factor of 10 and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten pgtol.
2. Reset pgtol to normal and loosen factr by a factor of 10 and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten factr.
This PR can be closed when both of these tests pass.
Bonus points: you can also loosen these parameters; we really want them as loose as possible such that the above tests pass.
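The tolerance-sensitivity test above can be sketched with scipy's L-BFGS-B wrapper, which exposes factr and pgtol directly. The toy quadratic objective is a stand-in, not the actual Cannon training objective:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

TARGET = np.array([1.0, -2.0, 0.5])

def objective(theta):
    # Stand-in convex objective; the real test would use the Cannon
    # per-pixel training objective.
    return np.sum((theta - TARGET)**2)

def gradient(theta):
    return 2.0 * (theta - TARGET)

x0 = np.zeros(3)

# Baseline fit with the default tolerances.
theta_ref, _, _ = fmin_l_bfgs_b(objective, x0, fprime=gradient,
                                factr=1e7, pgtol=1e-5)

# Loosen pgtol by a factor of 10 and re-fit.
theta_loose, _, _ = fmin_l_bfgs_b(objective, x0, fprime=gradient,
                                  factr=1e7, pgtol=1e-4)

# The test passes if no parameter moves by more than 1e-7 (absolute).
passes = np.all(np.abs(theta_ref - theta_loose) < 1e-7)
print(passes)
```

The same pattern, with pgtol reset and factr loosened by a factor of 10, gives the second half of the test.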
@mkness, what exactly do we do to set these? Unfortunately this is urgent! We are working with spectra from apStar files.
We will fit continuum with sums of sines and cosines, not polynomials, for very important reasons!
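A minimal sketch of continuum fitting with sums of sines and cosines (a low-order Fourier series). The design-matrix construction, length scale, and order are illustrative assumptions, not the repository's actual settings:

```python
import numpy as np

def continuum_design_matrix(wavelengths, L, order):
    # Columns: 1, then cos(k*x) and sin(k*x) for k = 1..order,
    # where x = 2*pi*wavelength / L.
    scaled = 2.0 * np.pi * wavelengths / L
    columns = [np.ones_like(scaled)]
    for k in range(1, order + 1):
        columns.append(np.cos(k * scaled))
        columns.append(np.sin(k * scaled))
    return np.vstack(columns).T

# Toy APOGEE-like wavelength range [Angstroms] and a smooth "continuum".
wavelengths = np.linspace(15100.0, 16900.0, 500)
flux = 1.0 + 0.01 * np.sin(2.0 * np.pi * wavelengths / 1800.0)

A = continuum_design_matrix(wavelengths, L=1800.0, order=3)
coeffs, *_ = np.linalg.lstsq(A, flux, rcond=None)
continuum = A @ coeffs
```

Because the sine/cosine basis is periodic and globally smooth, it cannot chase narrow absorption lines the way a high-order polynomial can, which is part of the motivation.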
This should have
@mkness and @davidwhogg can help write the second two items.
Should this be on vectorizer terms or label names? If it were on label names (e.g., [Al/H]), then any cross-terms with [Al/H] would also be zero. Which conceptual filter were you talking about, @davidwhogg: label names, terms, or both?
This is so I don't forget.
We need to produce not just labels but also the best-fit prediction for every star. This increases disk space but is a very valuable data product.
Consider using ridge regression for regularization at the test step:

\argmin_{\ell} \left[ \chi^2(\ell) + \lambda \, (v(\ell) \cdot w)^2 \right]
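A runnable toy version of that objective, with the symbols as above; the identity vectorizer, uniform weights, and quadratic chi-squared are stand-in assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def regularized_test_objective(ell, chi2, v, w, lam):
    # chi^2(ell) + lambda * (v(ell) . w)^2
    return chi2(ell) + lam * np.dot(v(ell), w)**2

# Toy setup: 2 labels, identity vectorizer, uniform weights.
chi2 = lambda ell: float(np.sum((ell - np.array([1.0, 2.0]))**2))
v = lambda ell: ell
w = np.ones(2)

unregularized = minimize(
    lambda e: regularized_test_objective(e, chi2, v, w, 0.0),
    np.zeros(2), method="Nelder-Mead").x
shrunk = minimize(
    lambda e: regularized_test_objective(e, chi2, v, w, 1.0),
    np.zeros(2), method="Nelder-Mead").x

# The ridge penalty shrinks v(ell) . w toward zero.
assert abs(np.dot(v(shrunk), w)) < abs(np.dot(v(unregularized), w))
```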
If a CannonModel (or sub-class) is instantiated with a multiprocessing pool, there are situations where the code will hang when trying to fit spectra. This situation is irrecoverable; the program will never finish. However, it will work perfectly fine in serial.
This situation arises because of Accelerate on OS X, and it affects all Python 2 and 3 versions. Specifically, np.dot does not work in parallel under certain conditions (e.g., here).
In Python 2 the workaround is to compile numpy against a different BLAS (see numpy/numpy#4776).
In Python 3 you can also resolve this situation by just using multiprocessing in 'spawn' mode. The way to do this (before importing The Cannon) is:
import multiprocessing as mp
mp.set_start_method('spawn')
Since this problem is difficult to diagnose, I've opened this issue in case someone comes searching for these symptoms. However, this issue can't be fixed by the numpy group or by us, so I'm immediately closing it. (It also encourages me to drop Python 2.7 support and enforce the 'spawn' method for Python 3.)
For presenting results.
@andycasey write a couple of paragraphs for our introduction about the things you have learned from Gaia-ESO that suggest the value of TheCannon.
I don't like the @mkness list of continuum pixels. This is outside the scope of paper 1, so this is an enhancement.
The following image suggests that the optimization is not converging during the training phase for regularization models with high Lambda values.
The problem appears to be unrelated to ftol or xtol. The above figure was made using the default ftol and xtol. If the tolerance is decreased (to 3e-5), the same behaviour remains for a single line:
I think this bug is fixed now: Here's the same line in a 17-label model using default optimization tolerances:
And here are 28 pixels in a 3-label model:
I also updated the code to allow the initial theta to be provided, and that initial theta is passed on between successive steps of Lambda. I've done a bunch of tests (in a 3-label model and a 17-label model) since fixing the bug, and BFGS works well until we hit high Lambda, at which point BFGS returns a warning. I've implemented the best of both: BFGS runs first, and if BFGS exits with a warning then Powell's method is run instead.
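The BFGS-with-Powell-fallback strategy can be sketched as follows; the function names and the toy objective are illustrative assumptions, not the repository's actual code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_pixel(objective, initial_theta):
    # Try BFGS first: fast when the objective is well-behaved (low Lambda).
    result = minimize(objective, initial_theta, method="BFGS")
    if not result.success:
        # BFGS exited with a warning (e.g., precision loss at high Lambda);
        # fall back to the derivative-free Powell method, warm-started
        # from wherever BFGS stopped.
        result = minimize(objective, result.x, method="Powell")
    return result.x

# Toy convex objective standing in for the per-pixel training objective.
theta_opt = fit_pixel(lambda t: np.sum((t - 1.0)**2), np.zeros(3))
```

Warm-starting Powell from the BFGS iterate (rather than from the original initialization) keeps whatever progress BFGS made before it hit numerical trouble.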
@andycasey, how did you switch things so that it went from your repo being a fork of mine to mine being a fork of yours? Was that a "setting" change in both repos? Or what? It seems like it must be a GitHub change (not a git change). Anyway, both I and @dfm want to know.
It would be ideal if we could make use of GitHub Pages for our landing page at TheCannon.io. That way all the page content would be stored in the gh-pages branch of this repository.
Currently TheCannon.io domain just points to the Read the Docs page.
Only one custom domain can be used with GitHub Pages per user or organization. I already have a GitHub Pages site set up for my personal website, which means the GH Pages URL redirects from https://andycasey.github.io/AnniesLasso/index.html to astrowizici.st/AnniesLasso/index.html.
That means we need to create an organization (or username) and then specify TheCannon.io as our custom domain.
So, what organization name should we use, @davidwhogg? Once selected, we will have to move this repository to that new organization.
Right now we are not sure how to edit the inverse variance, given the pixel masks.
We should offset using a label offset to a fiducial label value (as we do).
But we should scale the K + K * (K + 1) / 2 vectorizer-output terms independently.
Ask @davidwhogg for details.
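A quick check of that term count: a quadratic vectorizer with K labels has K linear terms plus K * (K + 1) / 2 quadratic terms (squares and cross-terms), which is what combinations-with-replacement counts:

```python
from itertools import combinations_with_replacement

K = 17  # e.g., the 17-label model

linear_terms = K
# Quadratic terms: unordered pairs of labels with repetition
# (squares like Teff^2 and cross-terms like Teff*logg).
quadratic_terms = len(list(combinations_with_replacement(range(K), 2)))

assert quadratic_terms == K * (K + 1) // 2
total_terms = linear_terms + quadratic_terms
print(total_terms)  # 17 + 153 = 170
```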
We have everything we need to do this. This is our major stumbling block for paper 0.
Sanders informs me that there are stars with the same APOGEE_ID but different ASPCAP_ID values. We stacked and analysed everything based on ASPCAP_ID. That means we can identify duplicates of APOGEE_ID values and also use them to estimate internal precision (and covariances, as Sanders is doing for distance determination).
This is trivial to do and would tell us about the (scale of the) uncertainty in the labels that results from having a finite sample size.
This could be its own project if one wanted to look into the uncertainties of the training set labels, etc.
Ideally, RA, Dec, and our 17 labels for all APOGEE stars, for some setting of f and Lambda.
This is high priority!
Right now we do a single optimization at test time. This can't be right, since the optimization is non-convex. Also, Anna Ho finds that for LAMOST there are wrong answers caused by local minima. We should define a set of K initializations and start from all K, and choose the best answer (best in a likelihood sense). This, unfortunately, is high priority...
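A sketch of that multi-start test step: run the non-convex optimization from K initializations and keep the best result in a likelihood (lowest chi-squared) sense. The function names and the toy objective are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fit_labels(chi2, initializations):
    # Optimize from every initialization and keep the best answer
    # (lowest chi-squared, i.e., highest likelihood).
    results = [minimize(chi2, x0, method="Nelder-Mead")
               for x0 in initializations]
    best = min(results, key=lambda r: r.fun)
    return best.x

# Toy non-convex objective with a local minimum near x = -0.9 and the
# global minimum near x = +1.1.
chi2 = lambda x: float((x[0]**2 - 1.0)**2 + 0.5 * (x[0] - 2.0)**2)

# A single start at -1.0 would get trapped; three starts recover the
# global minimum.
labels = fit_labels(chi2, [np.array([-1.0]), np.array([0.5]),
                           np.array([2.0])])
```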
This is a nomenclature question for @davidwhogg: when trained, a CannonModel currently has attributes model.coefficients and model.scatter. I'm refactoring the code to calculate the inverse variances and use them in the right way, and in doing so I actually solve for s^2. Should model.coefficients and model.scatter be renamed, and if so, to what? model.theta and model.pixel_variance? Something else?
...not the spectra normalized by ASPCAP.
We need to have the code duplicated within SDSS for it to become an official part of DR14. Timescale for completion: mid-2016.
Also, it would be cool to make this very easy to maintain and update.
Re-normalize continuum for all stars (training and test), according to the Ness-ish method.
If you log in to Gandi, you should see that you are the admin for thecannon.io. Do your magic.
This is critical.
DUDE. If we are going to pick one overall Lambda, appropriate for all pixels, and use the same Lambda everywhere, we should use the Lambda that minimizes the SUM of the one-d plots we are making for all the pixels (that we can afford to test).
We need to consistently use unambiguously different names for the components of theta, the stellar parameters, the stellar abundances, and labels. And be consistent.
For @andycasey; see http://gandi.net/.
That unit test should be able to test L1, chi-squared, and the whole objective function.
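A hedged sketch of such a unit test, checking the L1 term, the chi-squared term, and the whole objective (chi-squared + Lambda * L1) on a tiny hand-computable case. The function names are illustrative assumptions:

```python
import numpy as np

def l1_norm(theta):
    return np.sum(np.abs(theta))

def chi_squared(theta, design_matrix, flux, ivar):
    residual = flux - design_matrix @ theta
    return np.sum(residual**2 * ivar)

def objective(theta, design_matrix, flux, ivar, Lambda):
    return chi_squared(theta, design_matrix, flux, ivar) + Lambda * l1_norm(theta)

# Hand-computable case: identity design matrix, two pixels.
theta = np.array([1.0, -2.0])
A = np.eye(2)
flux = np.array([2.0, -2.0])
ivar = np.array([1.0, 4.0])

assert l1_norm(theta) == 3.0                      # |1| + |-2|
assert chi_squared(theta, A, flux, ivar) == 1.0   # (2-1)^2*1 + 0^2*4
assert objective(theta, A, flux, ivar, Lambda=10.0) == 31.0  # 1 + 10*3
```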