
annieslasso's Introduction

The Cannon


Authors

  • Andy Casey (Cambridge) (Monash)
  • David W. Hogg (NYU) (MPIA) (SCDA)
  • Melissa K. Ness (MPIA)
  • Hans-Walter Rix (MPIA)
  • Anna Y. Q. Ho (Caltech)
  • Gerry Gilmore (Cambridge)

Installation

pip install https://github.com/andycasey/AnniesLasso/archive/master.zip

Getting Started

Let us assume that you have rest-frame continuum-normalized spectra for a set of stars for which the stellar parameters and chemical abundances (which we will collectively call labels) are known with high fidelity. The labels for those stars (and the locations of the spectrum fluxes and inverse variances) are assumed to be stored in a table. In this example all stars are assumed to be sampled on the same wavelength (dispersion) scale.

Here we will create and train a 3-label (effective temperature, surface gravity, metallicity) quadratic (e.g., Teff^2) model:

import numpy as np
from astropy.table import Table

import AnniesLasso as tc

# Load the table containing the training set labels, and the spectra.
training_set = Table.read("training_set_labels.fits")

# Here we will assume that the flux and inverse variance arrays are stored in
# different ASCII files. The end goal is just to produce flux and inverse
# variance arrays of shape (N_stars, N_pixels).
normalized_flux = np.array([np.loadtxt(star["flux_filename"]) for star in training_set])
normalized_ivar = np.array([np.loadtxt(star["ivar_filename"]) for star in training_set])

# Providing the dispersion to the model is optional, but handy later on.
dispersion = np.loadtxt("common_wavelengths.txt")

# Create a vectorizer that defines our model form.
vectorizer = tc.vectorizer.PolynomialVectorizer(("TEFF", "LOGG", "FEH"), 2)

# Create the model that will run in parallel using all available cores.
model = tc.CannonModel(training_set, normalized_flux, normalized_ivar,
                       vectorizer=vectorizer, dispersion=dispersion, threads=-1)

# Train the model!
model.train()
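
To see what a 3-label quadratic model form implies, here is a small sketch that enumerates the polynomial terms by hand. This is only an illustration of the model form; it is not how PolynomialVectorizer represents terms internally.

from itertools import combinations_with_replacement

labels = ("TEFF", "LOGG", "FEH")

# All polynomial terms up to second order, plus the constant term.
terms = ["1"]
for order in (1, 2):
    for combo in combinations_with_replacement(labels, order):
        terms.append(" * ".join(combo))

print(terms)
# ['1', 'TEFF', 'LOGG', 'FEH', 'TEFF * TEFF', 'TEFF * LOGG', 'TEFF * FEH',
#  'LOGG * LOGG', 'LOGG * FEH', 'FEH * FEH']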

You can follow this example further in the complete Getting Started tutorial.

License

Copyright 2017 the authors. The code in this repository is released under the open-source MIT License. See the file LICENSE for more details.

annieslasso's People

Contributors

  • andycasey
  • astrowizicist
  • davidwhogg
  • mkness


annieslasso's Issues

Make a comparison of duplicate observations

Sanders informs me that there are stars with the same APOGEE_ID but different ASPCAP_ID values. We stacked and analysed everything based on ASPCAP_ID. That means we can identify duplicates of APOGEE_ID values and also use them to estimate internal precision (and covariances, as Sanders is doing for distance determination).
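
A minimal sketch of how such a comparison could start, assuming a per-ASPCAP_ID results table with an APOGEE_ID column and per-star label columns (the filename and column names are assumptions):

import numpy as np
from astropy.table import Table

results = Table.read("cannon_results.fits")  # hypothetical results table

apogee_ids, counts = np.unique(results["APOGEE_ID"], return_counts=True)
duplicated = apogee_ids[counts > 1]

# Internal precision estimate: scatter among duplicates for each label.
for label in ("TEFF", "LOGG", "FEH"):
    scatter = [np.std(results[label][results["APOGEE_ID"] == apogee_id])
               for apogee_id in duplicated]
    print(label, np.median(scatter))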

Using GitHub Pages for the landing page

It would be ideal if we could make use of GitHub pages for our landing page at TheCannon.io. That way all the page content would be stored in the gh-pages branch of this repository.

Currently the TheCannon.io domain just points to the Read the Docs page.

Only one custom domain can be used with GH pages per user or organization. I already have a GitHub Pages site set up for my personal website, which means the GH page URL redirects from https://andycasey.github.io/AnniesLasso/index.html to astrowizici.st/AnniesLasso/index.html.

That means we need to create an organization or username, then specify TheCannon.io as our custom domain.

So, what organization name should we use, @davidwhogg? Once selected, we will have to move this repository to that new organization.

Check that l_bfgs_b output doesn't depend on convergence parameters

Here's the test I want:

  • Choose a handful of typical pixels.
  • Fit with the current defaults.
  • Loosen pgtol by a factor of 10 and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten pgtol.
  • Return pgtol to normal, loosen factr by a factor of 10, and re-fit. Do the optimal parameters change by more than 1e-7 (absolute)? If so, we need to tighten factr.

This PR can be closed when both of these tests pass.

Bonus points: You can also loosen these parameters; we really want them as loose as possible such that the above tests pass.
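
A minimal sketch of the proposed test on a toy objective, using scipy's fmin_l_bfgs_b; the real per-pixel objective from the code base would be substituted for rosen here:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b, rosen, rosen_der

x0 = np.zeros(5)

# Current defaults (pgtol=1e-5, factr=1e7) versus each one loosened by 10x.
x_default, f_default, info = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der)
x_loose_pgtol, _, _ = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der, pgtol=1e-4)
x_loose_factr, _, _ = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der, factr=1e8)

# If either of these exceeds 1e-7, the corresponding tolerance must be tightened.
print(np.max(np.abs(x_loose_pgtol - x_default)))
print(np.max(np.abs(x_loose_factr - x_default)))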

give Hogg the plots he needs for the APOGEE meeting

Most of what he needs will be satisfied by #36, but he also wants something that shows the non-zero parameters as a function of Lambda, and something that shows 2-d projections of the 17-d space for the training set, and for a much larger set of stars (if we can!).
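
A minimal sketch of the non-zero-parameters figure, assuming a trained theta array (pixels by terms) is available for each regularization strength; the thetas dictionary below is a placeholder for real training output:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder: theta arrays of shape (N_pixels, N_terms), one per Lambda value.
# (8575 pixels and 171 terms correspond to APOGEE spectra and a 17-label
# quadratic model; replace with real trained coefficients.)
Lambdas = np.logspace(0, 6, 13)
thetas = {L: np.random.randn(8575, 171) for L in Lambdas}

n_nonzero = [np.sum(np.abs(thetas[L]) > 1e-12) for L in Lambdas]

plt.semilogx(Lambdas, n_nonzero, "o-")
plt.xlabel("Lambda")
plt.ylabel("number of non-zero coefficients")
plt.savefig("nonzero_vs_lambda.png")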

Program hangs at fitting stage in parallel

If a CannonModel (or sub-class) is instantiated with a multiprocessing pool, there are situations where the code will hang when trying to fit spectra. This situation is irrecoverable; the program will never finish. However, it will work perfectly fine in serial.

This situation arises because of the Accelerate framework on OS X, and affects both Python 2 and Python 3. Specifically, np.dot does not work in parallel under certain conditions (e.g., here).

In Python 2 the workaround is to compile against a different BLAS (see numpy/numpy#4776).

In Python 3 you can also resolve this situation by just using multiprocessing in 'spawn' mode. The way to do this (before importing The Cannon) is to use:

import multiprocessing as mp
mp.set_start_method('spawn')
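
One caveat: set_start_method should be called once, early, and (in a script) under an import guard, otherwise spawned workers re-importing the main module can hit a "context has already been set" RuntimeError. A minimal sketch:

import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")

    import AnniesLasso as tc
    # ... build the CannonModel and fit spectra as usual from here.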

Since this problem is difficult to diagnose, I've opened this issue in case someone comes searching for these symptoms. However, this issue can't be fixed by the numpy group or by us, so I'm immediately closing it. (It also encourages me to drop Python 2.7 support and enforce the 'spawn' method for Python 3.)

Should we be adding higher powers of the labels?

For example, can we craft the right question to tell us whether it is worthwhile including Teff^4, or say log(Teff), etc.? Regularization can help, but these questions need to be thought out properly so we don't get misled by cross-terms.

how to choose one, uniform regularization Lambda

DUDE. If we are going to pick one overall Lambda, appropriate for all pixels, and use the same Lambda everywhere, we should use the Lambda that minimizes the SUM of the one-d plots we are making for all the pixels (that we can afford to test).
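
A minimal sketch of that selection, assuming the one-d curves are collected into an array of per-pixel validation metrics evaluated on a common grid of Lambda values (the metric array below is a placeholder):

import numpy as np

# Placeholder: metric[i, j] is the validation metric for pixel i at Lambdas[j];
# replace with the real one-d curves for the pixels we can afford to test.
Lambdas = np.logspace(0, 6, 25)
metric = np.random.rand(1000, Lambdas.size)

total = np.sum(metric, axis=0)           # sum over all tested pixels
best_Lambda = Lambdas[np.argmin(total)]  # the single Lambda used everywhere
print(best_Lambda)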

Bounding constraints on `theta` when solving.

If the vectorizer is scaled, do we have good reason to specify boundary conditions on theta parameters when simultaneously optimizing them with the scatter term? (None of this will matter if analytic derivatives are included, but it is worth pondering)
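
For reference, a minimal sketch of what box constraints would look like, using scipy's L-BFGS-B on a toy per-pixel objective; the objective, data, and bound values below are placeholders, not the code base's actual training step:

import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)

# Placeholder data: a tiny design matrix, fluxes, and inverse variances.
n_stars, n_terms = 50, 4
A = rng.normal(size=(n_stars, n_terms))
flux = A @ rng.normal(size=n_terms) + rng.normal(scale=0.05, size=n_stars)
ivar = np.full(n_stars, 1.0 / 0.05**2)

def objective(parameters, A, flux, ivar):
    # Sketch of a per-pixel objective: theta coefficients plus a log-scatter term.
    theta, ln_s = parameters[:-1], parameters[-1]
    adjusted_ivar = ivar / (1.0 + ivar * np.exp(2 * ln_s))
    residual = flux - A @ theta
    return np.sum(adjusted_ivar * residual**2 - np.log(adjusted_ivar))

# Box constraints on the (scaled) theta values and on ln(s); the ranges are guesses.
x0 = np.zeros(n_terms + 1)
bounds = [(-10, 10)] * n_terms + [(-10, 1)]
result = optimize.minimize(objective, x0, args=(A, flux, ivar),
                           method="L-BFGS-B", bounds=bounds)
print(result.x)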

Continuum

Re-normalize continuum for all stars (training and test), according to the Ness-ish method.

Jackknife the training set

This is trivial to do and would tell us about the (scale of the) uncertainty in the labels that results from having a finite sample size.
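
A minimal sketch of the procedure, assuming a helper that trains on a subset of the training set and returns test-step labels for the held-out star; train_and_test below is a placeholder, not an existing function in the code base, and training_set is the labels table from the Getting Started example:

import numpy as np

def train_and_test(train_indices, test_indices):
    # Placeholder: train a CannonModel on the train_indices stars and return
    # the inferred labels, shape (len(test_indices), n_labels), for test_indices.
    raise NotImplementedError

label_names = ("TEFF", "LOGG", "FEH")
known_labels = np.vstack([training_set[name] for name in label_names]).T

n_stars = len(training_set)
loo_labels = np.zeros_like(known_labels, dtype=float)
for i in range(n_stars):
    train_indices = np.delete(np.arange(n_stars), i)
    loo_labels[i] = train_and_test(train_indices, [i])[0]

# Scatter between leave-one-out inferred labels and the training labels.
print(np.std(loo_labels - known_labels, axis=0))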

This could be its own project if one wanted to look into the uncertainties of the training set labels, etc.

Should we support Python 2.x?

There are two reasons I've encountered so far that make me think we should not:

  1. The CannonModel class uses pickle to save and load objects. Python 3 has a newer pickling protocol for saving Python objects, which is incompatible with previous versions. For that reason, any model saved in Python 3 using the highest available protocol (which is the current default in our code base, for many reasons) will not be loadable in Python 2.x. Someone would have to load it in Python 3, then save it again using a lower protocol. The newer protocol has faster read/write times and much smaller file sizes.
  2. See #40, although that is not strictly a 'Python 2' problem but more of a 'BLAS' problem which cannot be fixed by pure-Python code in Python 2.x.
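
For reference, a minimal sketch of the re-save workaround described in point 1, assuming the saved model is a plain pickle file (the filenames are placeholders):

# Run this from Python 3 to make an existing model file loadable from Python 2.
import pickle

with open("model_py3.pickle", "rb") as fp:
    contents = pickle.load(fp)

with open("model_py2.pickle", "wb") as fp:
    pickle.dump(contents, fp, protocol=2)  # highest protocol Python 2 can read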

@davidwhogg, @mkness: comments welcome!

coefficients/theta and pixel variance

this is a nomenclature question for @davidwhogg:

when trained, a CannonModel currently has attributes model.coefficients and model.scatter.

I'm refactoring the code to calculate the inverse variances and use them in the right way, and in doing so I actually solve for s^2. Should model.coefficients and model.scatter be renamed, and if so, to what?
model.theta and model.pixel_variance? something else?

make issues from things arising at APOGEE meeting

  • using The Cannon to measure radial velocities for RRL and other stars
  • using The Cannon to identify spectroscopic binaries automatically
  • comparing The Cannon single-pixel internals to ASPCAP for the same
  • using the LSF (or LSF kernels) at test time
  • determining rotation, microturbulence, macroturbulence, and radial-velocity shifts at test time
  • outputting The Cannon spectral prediction for every APOGEE spectrum along with labels
  • producing DR13-trained output for internal use by the APOGEE team

almost every pixel is masked for many training-set stars!

Here is an example: 2M06123730+4036001

...but there are many more. What is going on? How can ASPCAP give us "good" labels for a star in which almost no pixel is good in any sub-visit? Literally not one good measurement at any wavelength.
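
A quick way to quantify how widespread this is, assuming masked pixels carry zero inverse variance and using the normalized_ivar array from the Getting Started example:

import numpy as np

# Fraction of masked (zero inverse variance) pixels for each training-set star.
masked_fraction = np.mean(normalized_ivar == 0, axis=1)

# Stars for which essentially every pixel is masked.
bad = np.where(masked_fraction > 0.99)[0]
print(len(bad), "stars have more than 99% of their pixels masked")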

re-starts at test time

Right now we do a single optimization at test time. This can't be right, since the optimization is non-convex. Also, Anna Ho finds that for LAMOST there are wrong answers caused by local minima. We should define a set of K initializations and start from all K, and choose the best answer (best in a likelihood sense). This, unfortunately, is high priority...
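
A minimal sketch of the K-restart scheme; neg_log_likelihood and the initializations below are placeholders standing in for the real test-step objective and for a sensible spread over label space:

import numpy as np
from scipy import optimize

def neg_log_likelihood(labels):
    # Placeholder for the per-star test-step objective (chi^2 given the model).
    scales = np.array([100.0, 0.1, 0.01])
    return np.sum(((labels - np.array([4500.0, 2.5, -0.3])) / scales)**2)

# K initializations spread over the label space (values are illustrative).
initializations = [
    np.array([4000.0, 1.0, -1.0]),
    np.array([5000.0, 3.0, 0.0]),
    np.array([6000.0, 4.5, 0.3]),
]

results = [optimize.minimize(neg_log_likelihood, x0, method="Nelder-Mead")
           for x0 in initializations]

# Keep the best answer in a likelihood sense (lowest negative log-likelihood).
best = min(results, key=lambda r: r.fun)
print(best.x)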

explain to Hogg the fork switch

@andycasey how did you switch things so that it went from your repo being a fork of mine to mine being a fork of yours? Was that a "setting" change in both repos? Or what? It seems like it must be a GitHub change (not a git change). Anyway, both I and @dfm want to know.

no polynomials!

We will fit continuum with sums of sines and cosines, not polynomials, for very important reasons!
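
A minimal sketch of fitting such a basis by weighted linear least squares; the number of harmonics and the period are placeholders to be tuned:

import numpy as np

def sine_cosine_design_matrix(wavelengths, n_harmonics, period):
    # Columns: 1, sin(2 pi k x / P), cos(2 pi k x / P) for k = 1..n_harmonics.
    columns = [np.ones_like(wavelengths)]
    for k in range(1, n_harmonics + 1):
        phase = 2 * np.pi * k * wavelengths / period
        columns.extend([np.sin(phase), np.cos(phase)])
    return np.vstack(columns).T

def fit_continuum(wavelengths, flux, ivar, n_harmonics=3, period=1400.0):
    A = sine_cosine_design_matrix(wavelengths, n_harmonics, period)
    w = np.sqrt(ivar)  # weighted least squares: scale rows by sqrt(ivar)
    coefficients, *_ = np.linalg.lstsq(A * w[:, None], flux * w, rcond=None)
    return A @ coefficients

# continuum = fit_continuum(dispersion, flux, ivar)  # per-star flux and ivar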

synchronize github with SDSS svn

we need to have the code duplicated within SDSS to become an official part of DR14. Timescale for completion: mid-2016.

Also, it would be cool to make this very easy to maintain and update.

Optimization issue when training regularized Cannon models

Training runs suggest that the optimization is not converging during the training phase for regularized models with high Lambda values.

The problem appears to be unrelated to ftol or xtol: the behaviour occurs with the default ftol and xtol, and persists for a single line even when the tolerance is decreased (to 3e-5).

I think this bug is fixed now: the same line in a 17-label model now behaves well with the default optimization tolerances, as do 28 pixels in a 3-label model.

I also updated the code to allow for the initial theta to be provided, and that initial theta is passed on between successive steps of Lambda. And I've done a bunch of tests (in a 3-label model and a 17-label model) since fixing the bug, and BFGS works well until we hit high Lambda, at which point BFGS returns a warning. I've implemented the best-of-both: BFGS will run first, and if BFGS exits with a warning then Powell's method is run instead.
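
For reference, a minimal sketch of that best-of-both fallback using scipy; the wrapper below sketches the logic only, not the actual training code:

from scipy import optimize

def optimize_with_fallback(objective, x0, args=()):
    # Try BFGS first; if it exits unsuccessfully (e.g., the precision-loss
    # warning seen at high Lambda), re-run with Powell's method instead.
    result = optimize.minimize(objective, x0, args=args, method="BFGS")
    if not result.success:
        result = optimize.minimize(objective, x0, args=args, method="Powell")
    return result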

Implement wavelength censoring

Should this be on vectorizer terms or label names? If it were on label names (e.g., [Al/H]), then any cross-terms with [Al/H] would also be zeroed. Which conceptual filter were you talking about, @davidwhogg: label names, terms, or both?
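
A minimal sketch of censoring by label name, representing each vectorizer term as a tuple of the label names it involves; this term representation and the mask layout are illustrative assumptions, not the library's API:

import numpy as np

# Illustration only: () is the constant term, ("TEFF",) is linear in TEFF,
# ("TEFF", "AL_H") is a cross-term, ("AL_H", "AL_H") is quadratic, and so on.
terms = [(), ("TEFF",), ("LOGG",), ("AL_H",), ("TEFF", "AL_H"), ("AL_H", "AL_H")]

def censor_by_label(terms, label, censored_pixels, n_pixels):
    # Return a (n_pixels, n_terms) boolean mask that is False wherever theta
    # should be forced to zero: every term involving `label`, including
    # cross-terms, at the censored pixels.
    mask = np.ones((n_pixels, len(terms)), dtype=bool)
    for j, term in enumerate(terms):
        if label in term:
            mask[censored_pixels, j] = False
    return mask

mask = censor_by_label(terms, "AL_H", np.arange(100, 200), n_pixels=8575)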

make a basic The Cannon landing page

This should have

  • license rider that says to cite Ness et al. and Casey et al. (in prep)
  • list of projects that we want to do, with a few sentences about each
  • list of example projects that have used TheCannon

@mkness and @davidwhogg can help write the second two items.
